NVFP4: Stable 4-Bit Training at 10 Trillion Tokens

Read on the following paper:

Pretraining Large Language Models with NVFP4.

The rapid scaling of LLMs has increased the demand for computational resources. Training a frontier model today requires a massive investment of time, compute, and energy, on the order of tens to hundreds of yottaflops. Improving pretraining efficiency is essential to enable the next generation of highly capable LLMs.

While 8-bit floating point (FP8) training is now widely adopted, transitioning to even narrower precision, such as 4-bit floating point (FP4), is the next logical step. Moving to FP4 could deliver a two- to three-fold boost in arithmetic performance and reduce memory usage by half compared to FP8. However, quantization at this narrow precision poses challenges to training stability, convergence, and implementation, notably for large-scale models trained on long token horizons.

In a major step forward for narrow-precision LLM training, NV researchers introduced a novel approach for stable and accurate training of LLMs using the NVFP4 format. They validated this approach by pretraining a 12-billion-parameter model on 10 trillion tokens—documented as the longest publicly documented training run in 4-bit precision to date.

The NVFP4 Advantage

NVFP4 (Alvarez et al., 2025) is an enhanced 4-bit data format that extends the “microscaling” approach. This format provides improved numerical properties over previous 4-bit microscaling formats like MXFP4. NVFP4 achieves this improvement through several key design choices:

Smaller Block Size: NVFP4 utilizes a smaller micro-block structure, reducing the size from 32 elements (used in MXFP4) to 16 elements. This change more effectively captures the local dynamic range in the data.
Precise Scaling: It uses an FP8 scale factor format (E4M3), which incorporates fractional precision, providing more precise scaling compared to the power-of-two rounding used by MXFP4’s UE8M0 scale factors.
Two-Level Scaling: NVFP4 employs a two-level scaling strategy, combining the fine-grained FP8 scale factor with an FP32 scale applied at the tensor level. This two-level approach ensures that the NVFP4 format encodes at least 6.25% of the values in a block (the absolute maximum, or amax, values) at near-FP8 precision, while the remaining values are stored in FP4.

These improvements increase the accuracy of outliers while minimizing the amount of small values quantized to zero, resulting in consistently better training behavior compared to MXFP4. In comparison studies, MXFP4 pretraining required training on 36% more tokens (1.36T tokens) to match the loss achieved by NVFP4 (1T tokens), highlighting NVFP4’s efficiency benefits.

The Four Pillars of Stable 4-Bit Training

Leveraging the NVFP4 format, the researchers introduced a 4-bit training methodology that achieves accuracies comparable to FP8 on very strong language models. The success of training the 12B model over the 10T-token horizon confirmed that each component of this methodology is essential for ensuring convergence and stability:

Component	Function/Mechanism	Source Rationale/Details
1. Selective High-Precision Layers	Retention of specific numerically sensitive linear layers in higher precision (BF16 or MXFP8), often fewer than 15% of the network.	The final few linear layers require more dynamic range and mantissa than FP4 provides, and quantizing them to FP4 can cause training to diverge.
2. Random Hadamard Transforms (RHT)	Applied to the inputs of weight gradient (Wgrad) GEMMs. These transforms disperse large-magnitude outliers.	RHTs redistribute outliers into an approximately Gaussian distribution, making them easier to represent in narrower formats. They restricted RHTs to Wgrad tensors, finding them sufficient for training the models. They used a 16x16 matrix size, which they found had better convergence than 4x4 and similar results as 128x128.
3. Two-Dimensional (2D) Block Scaling	Weights are quantized using 16×16 blocks to maintain the same quantized representations across both the forward and backward passes. Activations and gradients use standard 1x16 blocks.	This consistency mitigates “chain rule violations” caused by different scaling in the forward versus backward passes, which the researchers hypothesized contributes to reduced model accuracy. Maintaining consistent quantized weights led to improved training loss for the 12B model.
4. Stochastic Rounding (SR)	Used exclusively for gradient tensors (Wgrad and Dgrad) during quantization from high precision to FP4. Round-to-nearest-even is used for weights and activations.	SR prevents deterministic rounding bias, which is typically more pronounced in gradient tensors and can impact training convergence. Applying SR to gradient tensors was essential for convergence in the 12B model, while applying it to forward pass tensors was detrimental.

Ablation studies confirmed that removing any of these components led to worse convergence for the 12B model trained over the 10T-token horizon.

Validation of Scale and Accuracy

To validate their methodology, the researchers pretrained a 12-billion-parameter hybrid Mamba-Transformer model (from the Nemotron-H family) using NVFP4 on 10 trillion tokens and compared its performance against an FP8 baseline.

The results indicated that NVFP4 training was stable and accurate:

Training Loss: The validation loss of the NVFP4 model closely tracked its FP8 counterpart throughout the 10T token training run. During the stable phase, the relative loss error of NVFP4 remained consistently below 1%.
Downstream Accuracy: The small loss gap resulted in downstream task accuracies that were largely unaffected. For example, the NVFP4 model attained an MMLU-pro accuracy of 62.58%, nearly matching the 62.62% accuracy achieved through FP8 pretraining. This trend held across various domains, including knowledge-intensive reasoning, mathematics, coding (with a slight dip), and commonsense reasoning tasks.

The findings established that this NVFP4-based pretraining technique offers a practical pathway for scalable 4-bit training. Furthermore, the paper suggested that the loss gap could be reduced by switching to higher precision (like BF16 or MXFP8) shortly before the learning rate decay phase.

Conclusion and Outlook

The paper concluded that large-scale pretraining with NVFP4 is both stable and accurate when paired with a targeted methodology designed for stability and convergence, establishing the first public evidence of sustained 4-bit pretraining at a multi-trillion-token scale. NVFP4 also reached comparable loss with fewer tokens than MXFP4, indicating inherent efficiency gains.

This success lays the foundation for faster and more efficient training of future frontier models. NVFP4 training is now fully supported on NVIDIA Blackwell GPUs via a recent update to the Transformer Engine. Future work, they suggested, will focus on refining the methodology to quantize all linear layers, reducing remaining high-precision layers, and extending NVFP4 to attention and communication paths, as well as evaluating it on larger models and different architectures such as mixture-of-experts.

This work in narrow-precision training is analogous to discovering a more efficient fuel source for a massive engine. The previous fuel (FP8) was reliable, but the new fuel (NVFP4) allows the engine to operate with half the memory and much higher arithmetic performance, while the specialized engineering techniques ensure the engine runs reliably over massive distances (10 trillion tokens) without compromising its output quality.