Reading the following paper:

Standard NVFP4 (NVIDIA’s block-scaled FP4 format) often leads to training divergence and inference degradation due to high quantization error on “near-maximal” values.

Four Over Six (4/6) is an adaptive quantization technique. Instead of always mapping the largest value in a block to the maximum representable FP4 value (6), the algorithm dynamically chooses whether to scale the block to 4 or to 6, based on which yields lower error. This prevents training divergence and significantly improves post-training quantization (PTQ) accuracy with minimal computational overhead.


Technical Motivation: The “Dead Zone” in FP4

To understand the innovation, one must first appreciate the limitations of the FP4 datatype and of block scaling:

  • The FP4 Datatype: It has only 16 representable values: $\pm\{0, 0.5, 1, 1.5, 2, 3, 4, 6\}$.
  • Non-Uniformity: Unlike integer formats (INT4), FP4 step sizes are non-uniform and grow with magnitude: 0.5 between 0 and 2, 1.0 between 2 and 4, and a jump to 2.0 between 4 and 6.
  • The Problem: Standard NVFP4 quantization calculates a scale factor such that the largest value in a block of 16 maps to 6 (the maximum FP4 value).
    • Consequently, any value in that block falling between two-thirds and 100% of the maximum (i.e., mapping into the range 4 to 6) suffers high quantization error, because there are no representable values between 4 and 6.
    • For example, if a value scales to $4.62$, it must be rounded down to 4, a 13.4% relative error (see the numeric sketch below the list).
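The arithmetic is easy to reproduce. Here is a minimal Python sketch (not from the paper) that builds the positive FP4 (E2M1) grid and reproduces the 13.4% figure for a hypothetical near-maximal value:

```python
import numpy as np

# Positive magnitudes representable in FP4 (E2M1); the format is symmetric in sign.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def round_to_fp4(x: float) -> float:
    """Round a scaled value to the nearest representable FP4 magnitude, keeping the sign."""
    idx = int(np.abs(FP4_GRID - abs(x)).argmin())
    return float(np.copysign(FP4_GRID[idx], x))

x = 4.62                       # lands in the 4-to-6 gap after scaling
q = round_to_fp4(x)            # -> 4.0 (nearest grid point)
rel_err = abs(x - q) / abs(x)  # -> ~0.134, the 13.4% error quoted above
print(q, f"{rel_err:.1%}")
```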

Key Insight: The authors discovered that error on these near-maximal values is the primary driver of performance degradation in NVFP4 models.


The Solution: Adaptive Block Scaling (4/6)

The authors propose evaluating two potential scaling strategies for every block of values (weights, activations, or gradients) and selecting the best one.

The Algorithm:

  1. Candidate A (Scale to 6): Calculate the scale factor assuming the max value maps to 6. Quantize and dequantize to measure error.
  2. Candidate B (Scale to 4): Calculate the scale factor assuming the max value maps to 4. Quantize and dequantize to measure error.
    • Benefit: Scaling to 4 spreads the block over the range $\pm 4$, where the grid is denser. It gives up the codepoint 6 (nothing maps above 4), but a value at 75% of the block’s maximum now lands exactly on 3; under standard scale-to-6 the same value would land at 4.5, inside the unrepresentable gap between 4 and 6.
  3. Selection: Compare the two candidates’ Mean Squared Error and store the superior FP8 scale factor and FP4 values (see the sketch after this list).
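As a concrete illustration, here is a minimal NumPy sketch of the selection logic. It is not the authors’ Blackwell kernel: it uses plain round-to-nearest and keeps the scale in full precision instead of FP8 E4M3, but the two-candidate structure is the same.

```python
import numpy as np

FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_block(block: np.ndarray, target_max: float):
    """Quantize one 16-value block so that its absolute maximum maps to `target_max`."""
    scale = np.abs(block).max() / target_max            # per-block scale factor
    scaled = block / scale
    # Round each scaled magnitude to the nearest FP4 grid point, keeping signs.
    idx = np.abs(np.abs(scaled)[:, None] - FP4_GRID[None, :]).argmin(axis=1)
    q = np.copysign(FP4_GRID[idx], scaled)
    mse = np.mean((block - q * scale) ** 2)              # reconstruction error
    return q, scale, mse

def four_over_six(block: np.ndarray):
    """Evaluate both candidates and keep whichever reconstructs the block better."""
    q6, s6, e6 = quantize_block(block, 6.0)              # Candidate A: max -> 6
    q4, s4, e4 = quantize_block(block, 4.0)              # Candidate B: max -> 4
    return (q4, s4) if e4 < e6 else (q6, s6)

# Example: a block whose runner-up values sit near 75% of the maximum,
# exactly the pattern that favours scaling to 4.
block = np.array([1.0, -0.2, 0.75, -0.74, 0.1, 0.05, -0.76, 0.3,
                  0.0, 0.73, -0.1, 0.2, -0.05, 0.74, 0.15, -0.75])
print(four_over_six(block))
```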

Why Only NVFP4? This technique works for NVFP4 but not for other formats such as MXFP4. Switching from scale-to-6 to scale-to-4 multiplies the per-block scale factor by $6/4 = 1.5$. NVFP4 stores scale factors in FP8 E4M3, whose mantissa bits can encode that 1.5× step; MXFP4 stores scales in E8M0 (powers of two only), which lacks the necessary granularity.
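A quick numeric illustration of that constraint (the block maximum of 4.62 is just a made-up example):

```python
import math

amax = 4.62                          # hypothetical block maximum
scale_to_6 = amax / 6.0              # ~0.770
scale_to_4 = amax / 4.0              # ~1.155, exactly 1.5x scale_to_6

# E4M3 has mantissa bits, so a 1.5x adjustment (binary 1.1) is representable.
# E8M0 holds only a power-of-two exponent, so it would snap the scale to the
# nearest 2**k and lose the distinction between the two candidates.
e8m0_scale = 2.0 ** round(math.log2(scale_to_4))   # -> 1.0, not ~1.155
print(scale_to_4 / scale_to_6, e8m0_scale)
```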


Implementation & Efficiency

A major concern with adaptive scaling is the computational cost of “trying” two quantization paths.

  • Hardware Optimization: The method is implemented using PTX instructions on NVIDIA Blackwell GPUs. It uses the cvt family of instructions for packing/unpacking FP4.
  • Online Calculation: The decision logic runs in the GPU register file.
  • Overhead:
    • Inference: < 2% overhead (sequence lengths $\le$ 16,384).
    • Training: < 15% overhead (sequence lengths $\le$ 131,072).
  • Training Recipe: The authors integrate 4/6 into a pipeline that includes Stochastic Rounding (SR) for gradients and Random Hadamard Transforms (RHT) to mitigate outliers.
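These recipe components are referenced rather than defined in these notes, so here is a generic NumPy illustration of the stochastic-rounding piece only: an unbiased round onto the FP4 grid. This is a sketch under assumptions, not the authors’ PTX implementation, and it presumes values have already been block-scaled into $[-6, 6]$.

```python
import numpy as np

# Positive FP4 (E2M1) magnitudes, as above.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def stochastic_round_fp4(x: np.ndarray, rng=None) -> np.ndarray:
    """Unbiased stochastic rounding of already-scaled values onto the FP4 grid.

    Each magnitude is rounded to one of its two neighbouring grid points with
    probability proportional to proximity, so the result equals x in expectation;
    unbiasedness is the property that matters when quantizing gradients.
    """
    rng = rng or np.random.default_rng()
    mag = np.clip(np.abs(x), 0.0, 6.0)
    hi_idx = np.clip(np.searchsorted(FP4_GRID, mag, side="left"), 1, len(FP4_GRID) - 1)
    lo, hi = FP4_GRID[hi_idx - 1], FP4_GRID[hi_idx]
    p_up = (mag - lo) / (hi - lo)                      # chance of rounding up
    rounded = np.where(rng.random(mag.shape) < p_up, hi, lo)
    return np.sign(x) * rounded
```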

Experimental Results

A. Pre-Training Stability

Standard NVFP4 training recipes often fail to converge, or diverge outright.

  • Divergence: In experiments with Transformers (340M) and Hybrid architectures (1.3B), standard NVFP4 training diverged.
  • Recovery: Applying 4/6 prevented divergence in all cases, producing loss curves significantly closer to BF16 baselines.
  • 2D Block Scaling: Training requires “2D block scaling” (sharing scales between forward and backward passes). Standard NVFP4 fails catastrophically with 2D scaling (perplexity explodes), but 4/6 recovers usable performance.

B. Post-Training Quantization (PTQ)

4/6 acts as an enhancement layer on top of existing quantization methods (AWQ, SmoothQuant, GPTQ).

  • Perplexity: When combined with AWQ and SmoothQuant, 4/6 improved WikiText-2 and C4 perplexity across all Llama 3 and Qwen 3 models tested.
  • Downstream Tasks: It improved average zero-shot accuracy on BoolQ, ARC, and HellaSwag. For example, Llama-3.1-8B with AWQ+4/6 achieves 73.1% average accuracy versus 72.2% for standard AWQ.

The paper highlights a subtle but critical reality of low-precision computing: dynamic range isn’t everything; distribution matters.

By default, quantization aims to maximize dynamic range (fitting the largest number into the largest bucket). For distributions common in LLMs (activations and gradients), it is often better to clip the range (scaling to 4) to gain precision (hitting the values 1, 2, and 3 accurately) rather than stretching the range to 6 and landing in the “dead zones” of the FP4 format.