Notes from reading the following paper and blog posts:

Blackwell GPUs introduce native hardware support for Microscaling (MX) formats. While MXFP8 theoretically offers 2x throughput over BF16, realizing these gains requires a fundamental departure from standard FP8 recipes. Successful implementation relies on a two-pronged strategy: a numerical recipe that departs from the OCP v1.0 specification to ensure convergence, and a systems recipe that completely redesigns memory-hierarchy usage so that quantization overhead does not erase the compute gains.


The Numerical Recipe: Stability via Precision

Standard FP8 training (e.g., on H100s) typically relies on “per-tensor” scaling, necessitating the E5M2 format for gradients to handle high dynamic range. MXFP8 shifts this paradigm by using fine-grained block scaling (groups of 32 elements), which locally absorbs dynamic range requirements.
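
To see why, here is a small PyTorch sketch (the shapes and the outlier magnitude are my own illustrative choices, not figures from the paper):

```python
# Sketch: per-tensor scaling must map the whole tensor's range into E4M3,
# while MX block scaling only has to cover each 32-element block.
import torch

E4M3_MAX = 448.0
x = torch.randn(64, 32)
x[0] *= 1e4                                    # hypothetical outlier block

tensor_scale = x.abs().max() / E4M3_MAX                      # one scale for everything
block_scale = x.abs().amax(dim=-1, keepdim=True) / E4M3_MAX  # one scale per 32 elements

# Fraction of elements pushed below E4M3's smallest subnormal (2^-9) by each scheme:
tiny = 2.0 ** -9
print(((x.abs() / tensor_scale) < tiny).float().mean())  # a noticeable fraction flushes to zero
print(((x.abs() / block_scale) < tiny).float().mean())   # essentially none
```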

  • Universal E4M3 is Mandatory: Contrary to conventional wisdom, E4M3 should be used for all tensors: weights, activations, and notably, activation gradients.
    • Insight: Because the shared block scale ($E8M0$) handles the magnitude, the element type ($E4M3$) can dedicate its bits to precision (3 mantissa bits) rather than range. Ablation studies show that using E5M2 for gradients in MXFP8 degrades perplexity, while E4M3 reaches parity with BF16.
  • The “Ceiling” Rounding Protocol: The OCP v1.0 specification suggests a rounding method that effectively floors the scale factor. This is numerically dangerous for MXFP8.
    • The Flaw: If the scale is rounded down, $V/Scale$ may overflow the maximum representable value of the 8-bit format.
    • The Fix: Use a ceiling/round-up rule for the scale calculation. This ensures the scale is always large enough to map the input values into the valid quantization range, eliminating the training instability (see the sketch after this list).
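
A minimal sketch of the resulting quantizer (E4M3 elements, power-of-two E8M0 scales, ceil-rounded scale exponent), assuming PyTorch ≥ 2.1 for torch.float8_e4m3fn; the function names are illustrative, not the paper's kernels:

```python
import torch

E4M3_MAX = torch.finfo(torch.float8_e4m3fn).max   # 448.0
BLOCK = 32                                         # MX block size

def quantize_mxfp8(x: torch.Tensor):
    """Quantize the last dim of x in blocks of 32 to E4M3 elements + per-block E8M0 scales."""
    xb = x.reshape(-1, BLOCK)
    amax = xb.abs().amax(dim=-1, keepdim=True).clamp_min(2.0 ** -127)
    # Ceil the scale exponent so that amax / scale <= 448: no element in the block
    # can overflow E4M3. Flooring here (per OCP v1.0) is what causes the instability.
    exp = torch.ceil(torch.log2(amax / E4M3_MAX))
    scale = torch.exp2(exp)                        # decoded power-of-two (E8M0) scale
    q = (xb / scale).clamp(-E4M3_MAX, E4M3_MAX).to(torch.float8_e4m3fn)
    return q.reshape(x.shape), scale

def dequantize_mxfp8(q: torch.Tensor, scale: torch.Tensor):
    return (q.float().reshape(-1, BLOCK) * scale).reshape(q.shape)

x = torch.randn(4, 256) * 30
q, s = quantize_mxfp8(x)
print((dequantize_mxfp8(q, s) - x).abs().max())    # small reconstruction error
```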

The Systems Recipe: The “Tensor Memory” Constraint

Achieving the theoretical hardware speedups requires navigating the Blackwell architecture’s new memory hierarchy, specifically Tensor Memory (TMEM). A direct port of Hopper kernels will likely result in performance worse than BF16.

  • The Trap of Dequantization: Blackwell Tensor Cores accumulate results in TMEM, not registers. Attempting to dequantize results using CUDA cores requires a round-trip (TMEM $\to$ Reg $\to$ CUDA Core $\to$ TMEM) that creates massive pipeline bubbles, potentially taking 1.76x longer than the matrix math itself.
    • Solution: Use the tcgen05.mma instruction. This hardware intrinsic handles block scaling and accumulation entirely within the Tensor Core/TMEM pipeline, avoiding the CUDA core bottleneck.
  • Memory Bandwidth & Quantization Overhead: Quantization is memory-bound. Naive kernels (e.g., standard library implementations) often cap out around ~4.5 TB/s, at which point quantization can consume up to 40% of the total step time (see the back-of-the-envelope sketch after this list).
    • Optimization: Custom kernels are required to push bandwidth to 6.2+ TB/s. Crucially, these kernels must output scale factors directly in the specific swizzled layout expected by tcgen05, avoiding runtime reshapes that kill performance.
  • The “Persistent Grid” Data Flow: The optimal data path for scale factors avoids the register file entirely. The pipeline must move scales from HBM $\to$ Shared Memory (SMEM) $\to$ TMEM using asynchronous copy instructions (cp.async.bulk and tcgen05.cp). Keeping scales on this asynchronous path keeps register pressure low and maximizes occupancy.
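
A back-of-the-envelope sketch of the bandwidth argument (the tensor size is hypothetical; the bandwidth figures are the ones quoted above):

```python
# Per BF16 element, a quantization pass reads 2 B and writes 1 B of E4M3 plus
# 1/32 B of E8M0 scale -- about 3.03 B of HBM traffic and almost no math,
# so the kernel's runtime is set almost entirely by achieved bandwidth.
n_elems = 8 * 4096 * 16384                   # hypothetical activation tensor
bytes_moved = n_elems * (2 + 1 + 1 / 32)

for bw_tbs in (4.5, 6.2):                    # naive kernel vs. tuned custom kernel
    t_us = bytes_moved / (bw_tbs * 1e12) * 1e6
    print(f"{bw_tbs} TB/s -> {t_us:.0f} us per quantization pass")
```

Multiplied across every GEMM input in every layer, that gap is what separates quantization overhead eating a large share of the step from a net speedup, which is also why the custom kernels should emit scales already in the tcgen05 layout rather than spending a second memory-bound pass on reshaping.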

Domain Specifics: Mixture-of-Experts (MoE)

MoE models introduce unique challenges due to irregular memory access patterns.
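
To make the irregularity concrete, here is a minimal sketch of the ragged per-expert shapes involved (token counts and layer sizes are hypothetical):

```python
import torch

hidden, ffn = 1024, 4096
tokens_per_expert = [3072, 517, 0, 1988]           # hypothetical routing result: ragged, possibly empty
w = [torch.randn(hidden, ffn) for _ in tokens_per_expert]
x = [torch.randn(n, hidden) for n in tokens_per_expert]

# A grouped kernel performs all of these variable-shape GEMMs in a single launch;
# looping over per-expert matmuls (shown here only for clarity) or padding every
# expert to the largest is why standard dense kernels fall short.
y = [xi @ wi for xi, wi in zip(x, w)]
print([t.shape for t in y])
```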

  • Grouped GEMMs: Standard dense kernels fail on these ragged per-expert shapes; grouped Wgrad/Dgrad kernels are needed.
  • Supergrouping: To prevent cache thrashing, “supergrouping” heuristics must be applied per expert. These organize thread blocks so that the output-matrix region computed concurrently by the SMs is as square as possible, maximizing L2 cache reuse (see the sketch after this list).
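
A small arithmetic sketch of why the squarer grouping wins (K, tile size, and group shapes are illustrative):

```python
# For a group of SMs jointly computing an (Mg x Ng) patch of output tiles, the unique
# A and B data pulled through L2 scales as Mg*K + K*Ng. With the number of tiles fixed
# (Mg * Ng constant), that sum is minimized when the patch is square.
K, TILE = 4096, 128
for mg, ng in [(1, 64), (4, 16), (8, 8)]:          # same 64 output tiles, different shapes
    elems = mg * TILE * K + K * ng * TILE
    print(f"{mg}x{ng}: {elems / 1e6:.1f}M elements of A+B through L2")
# 1x64 -> 34.1M, 4x16 -> 10.5M, 8x8 -> 8.4M
```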

Technical Synthesis & “Gotchas”

| Feature | Standard FP8 (Hopper) | MXFP8 Recipe (Blackwell) | Reason |
|---|---|---|---|
| Gradient Format | E5M2 | E4M3 | Block scaling handles range; gradients need precision. |
| Scale Rounding | Floor (standard) | Ceil / round up | Prevents overflow in quantized blocks. |
| Accumulation | Registers | TMEM | Hardware architecture change; CUDA cores are too slow for dequantization. |
| Kernel Mode | Warp-synchronous | 2-CTA / async | tcgen05 allows 2 SMs to share the B matrix, reducing HBM traffic. |