Quantization-Aware Distillation (QAD) for NVFP4
Notes from reading a paper that proposes Quantization-Aware Distillation (QAD) as a superior alternative to standard Quantization-Aware Training (QAT) for recovering the accuracy of Large Language Models (LLMs) and Vision-Language Models (VLMs) quantized to the NVFP4 format.
While Post-Training Quantization (PTQ) is sufficient for massive models (e.g., DeepSeek R1 671B), it causes non-negligible accuracy drops in smaller models (e.g., <15B parameters). QAD addresses this by using the original full-precision model as a teacher to guide the quantized student via KL divergence, bypassing the complexity and instability of replicating original training pipelines.
1. Technical Context: The NVFP4 Format
To understand the difficulty of the task, it is essential to note the aggressive nature of the target format. NVFP4 is a 4-bit floating-point format designed for the Blackwell architecture.
- Structure: It extends MXFP4 by reducing the block size from 32 to 16 and using two-level scaling: per-block E4M3 scales plus a per-tensor FP32 scale (see the fake-quantization sketch after this list).
- Performance: It offers $2\text{-}3\times$ the arithmetic throughput of FP8 and an $\approx 1.8\times$ reduction in memory footprint.
- The Challenge: While NVFP4 PTQ is effective for large models, in smaller models the small block size neutralizes the traditional outlier-mitigation techniques that PTQ relies on, necessitating training-based recovery.
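To make the two-level scaling concrete, below is a minimal fake-quantization sketch in PyTorch. It is a reconstruction from the description above rather than the hardware implementation: the E2M1 value grid, the block size of 16, the E4M3 per-block scale (simulated via `torch.float8_e4m3fn`, available in PyTorch 2.1+), and the FP32 per-tensor scale come from the description, while the exact scale-selection recipe is an assumption.

```python
import torch

# Representable magnitudes of FP4 E2M1 (sign handled separately).
FP4_VALUES = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
FP4_MAX = 6.0      # largest E2M1 magnitude
E4M3_MAX = 448.0   # largest finite E4M3 value

def fake_quant_nvfp4(x: torch.Tensor, block_size: int = 16) -> torch.Tensor:
    """Simulate NVFP4 quantize -> dequantize (numel must be divisible by block_size)."""
    orig_shape, orig_dtype = x.shape, x.dtype
    xb = x.float().reshape(-1, block_size)

    # Per-tensor FP32 scale, chosen so the largest per-block scale fits in E4M3 range.
    tensor_scale = (xb.abs().max() / (FP4_MAX * E4M3_MAX)).clamp(min=1e-12)

    # Per-block scale, stored in E4M3 (simulated by a round-trip through float8_e4m3fn).
    block_amax = xb.abs().amax(dim=1, keepdim=True)
    block_scale = (block_amax / FP4_MAX) / tensor_scale
    block_scale = (block_scale.to(torch.float8_e4m3fn).float() * tensor_scale).clamp(min=1e-12)

    # Round each scaled element to the nearest representable E2M1 value.
    scaled = xb / block_scale
    idx = (scaled.abs().unsqueeze(-1) - FP4_VALUES).abs().argmin(dim=-1)
    deq = scaled.sign() * FP4_VALUES[idx] * block_scale
    return deq.reshape(orig_shape).to(orig_dtype)

# Example: mean absolute quantization error on a random weight matrix.
w = torch.randn(128, 256)
print((w - fake_quant_nvfp4(w)).abs().mean())
```

This quantize-dequantize round-trip is what a QAT/QAD forward pass simulates in software before the model is exported to real NVFP4 kernels.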
2. The Failure Mode of Standard QAT
The paper identifies a critical engineering bottleneck in applying standard Quantization-Aware Training (QAT) to modern LLMs.
- Pipeline Complexity: Modern models undergo multi-stage post-training (SFT $\to$ RL $\to$ Model Merging). Replicating this exact pipeline for QAT is often impractical or impossible if data is unavailable.
- Distribution Drift (The “RL Breakage”): A key insight is that QAT, which uses Cross-Entropy (Next-Token Prediction) loss, effectively acts as an additional training stage.
  - In experiments with RL-heavy models (e.g., Nemotron 3 Nano, AceReason), QAT significantly degraded performance (e.g., AceReason's AIME25 score dropped from 63.5 to 46.1).
  - Even if QAT achieves a validation loss (Cross-Entropy) similar to the BF16 baseline, the KL divergence remains high, indicating that the model's output distribution has drifted from the specific capabilities learned during RL.
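This cross-entropy-versus-KL gap is straightforward to measure. Below is a small diagnostic sketch, assuming Hugging Face-style causal LMs whose forward pass returns `.logits`; model and data handling are placeholders, not the paper's evaluation code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ce_and_kl(student, teacher, input_ids):
    """Compare next-token cross-entropy (vs. the data) with KL (vs. the BF16 teacher)."""
    s_out = student(input_ids).logits
    t_out = teacher(input_ids).logits
    vocab = s_out.size(-1)

    s_logits = s_out[:, :-1].reshape(-1, vocab).float()   # predictions for tokens 1..T
    t_logits = t_out[:, :-1].reshape(-1, vocab).float()
    labels = input_ids[:, 1:].reshape(-1)

    # QAT-style validation metric: cross-entropy against the next tokens.
    ce = F.cross_entropy(s_logits, labels)

    # Distribution drift: per-token KL(P_teacher || P_student).
    kl = F.kl_div(
        F.log_softmax(s_logits, dim=-1),
        F.log_softmax(t_logits, dim=-1),
        log_target=True,
        reduction="batchmean",
    )
    return ce.item(), kl.item()
```

A QAT run can bring `ce` back to the BF16 level while `kl` stays large; QAD optimizes `kl` directly.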
3. The Solution: Quantization-Aware Distillation (QAD)
QAD treats accuracy recovery as a distribution matching problem rather than a re-training problem.
- Objective Function: It minimizes the KL divergence between the full-precision teacher ($P_{\text{teacher}}$) and the quantized student ($P_{\text{student}}$): $\mathcal{L}_{\text{QAD}} = D_{\text{KL}}(P_{\text{teacher}} \| P_{\text{student}})$ (see the training-step sketch after this list).
- Outcome: QAD achieves nearly zero KL divergence relative to the teacher, effectively “freezing” the capabilities learned during complex RL or merging stages without needing to replicate those stages.
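A single QAD optimization step is sketched below under stated assumptions: Hugging Face-style models returning `.logits`, the `fake_quant_nvfp4` helper from the Section 1 sketch, and a standard straight-through estimator to carry gradients through the fake quantization. The wiring is illustrative, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F
# fake_quant_nvfp4: see the sketch in Section 1.

class NVFP4Linear(torch.nn.Linear):
    """Linear layer whose forward pass uses NVFP4 fake-quantized weights.

    The straight-through estimator keeps gradients flowing to the underlying
    full-precision weights: the forward pass sees quantized values, the
    backward pass treats quantization as the identity.
    """
    def forward(self, x):
        w_q = self.weight + (fake_quant_nvfp4(self.weight) - self.weight).detach()
        return F.linear(x, w_q, self.bias)

def qad_step(student, teacher, input_ids, optimizer):
    """One QAD step: minimize KL(P_teacher || P_student) on a batch of tokens."""
    with torch.no_grad():
        t_logits = teacher(input_ids).logits        # frozen BF16 teacher

    s_logits = student(input_ids).logits            # student built with NVFP4Linear layers
    vocab = s_logits.size(-1)

    loss = F.kl_div(
        F.log_softmax(s_logits.reshape(-1, vocab).float(), dim=-1),
        F.log_softmax(t_logits.reshape(-1, vocab).float(), dim=-1),
        log_target=True,
        reduction="batchmean",
    )   # = L_QAD, averaged per token

    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    return loss.item()
```

Note that no ground-truth labels appear anywhere in the step; the teacher's logits are the only target, which is what enables the data-robustness results in Section 4B.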
4. Key Empirical Insights
A. Superiority in Reasoning and RL
QAD consistently outperforms QAT on reasoning benchmarks (MATH, GPQA-D, AIME).
- SFT Models: On Llama Nemotron Super V1, QAD recovered AIME25 performance to 45.6 (near the BF16 baseline of 46.0), whereas PTQ dropped to 32.3.
- RL Models: This is the strongest use case. QAD recovers near-BF16 performance on RL-tuned models where QAT causes regression. For example, on Nemotron 3 Nano, QAD achieved 34.3 on AA-LCR (vs. BF16 35.9), while QAT collapsed to 24.8.
B. Extreme Robustness to Data Quality
The most surprising finding is QAD’s insensitivity to the training data source. Because the student only needs to match the teacher’s distribution (not learn ground truth from data), the data merely needs to activate the network.
- Partial Domain Coverage: Training AceReason (Math + Code) using only code data successfully recovered Math accuracy.
- Synthetic & Random Data: QAD works effectively with:
  - Original SFT data.
  - Data generated by the model itself (even incorrect generations).
  - Random tokens (performs comparably to the PTQ baseline without breaking the model).
5. Implementation Best Practices
Technical guidance for practitioners implementing NVFP4 QAD (a sample configuration sketch follows the list):
- Learning Rate Sensitivity:
  - SFT-Trained Models: Use a low LR (e.g., $1\text{e-}6$), typically matching or below the original annealing point. These models are already converged; a high LR causes divergence.
  - RL-Trained Models: Use a higher LR (e.g., $1\text{e-}5$). The RL stage shifts weights significantly from the SFT initialization, requiring larger steps for the quantized student to realign.
- Teacher Selection: Use the exact same model in BF16 as the teacher. Using a larger teacher (e.g., distilling 12B into 9B) performed worse, likely due to the need for more data to bridge the distribution gap.
- Loss Metric: KL Divergence consistently outperforms MSE on logits.
- Data Volume: Convergence requires significantly less data than pre-training (e.g., $\sim 0.3$B tokens for a 49B model).
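Pulled together, a hypothetical starting configuration (field names are illustrative and not from the paper; the learning rates and token budget echo the guidance above):

```python
# Hypothetical QAD run configuration; values echo the paper's guidance, names are made up.
qad_config = {
    "teacher": "same-checkpoint-in-bf16",   # the exact same model, not a larger one
    "student_format": "nvfp4",              # block_size=16, E4M3 block scales, FP32 tensor scale
    "loss": "kl_divergence",                # preferred over MSE on logits
    "learning_rate": 1e-6,                  # SFT-trained model: at/below the original annealing LR
    # "learning_rate": 1e-5,                # RL-trained model: larger steps to realign
    "train_tokens": int(0.3e9),             # ~0.3B tokens sufficed for a 49B model
    "data": "any-activating-corpus",        # SFT data, self-generations, or even partial-domain data
}
```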
6. Insights
QAD marks a shift in how we handle quantization for "smart" models (reasoning/RL-tuned). The traditional view, that QAT is simply "fine-tuning with quantized weights", is dangerous for RL-tuned models because the original training objective (Cross-Entropy) does not capture the subtle policy optimizations achieved via RL.
QAD decouples the quantization process from the training pipeline. By relying on the teacher’s logits rather than ground-truth labels, QAD allows engineers to quantize complex, merged, or RL-tuned models without access to the proprietary or complex pipelines that created them. This makes it a critical technique for deploying advanced reasoning models on next-generation hardware like Blackwell.