Reading Note on SonicMoE
Reading the following paper:
SonicMoE is a hardware-software co-design framework tailored for NVIDIA Hopper and Blackwell GPUs. It mitigates the IO bottlenecks inherent in fine-grained MoEs through three primary contributions:
- Gradient Flow Redesign: A mathematical reformulation of the backward pass that eliminates the need to cache the large down-projection output $Y$, decoupling activation memory from expert granularity.
- IO-Aware Kernels: Custom CuTe-DSL kernels that utilize “Ping-Pong” scheduling and fused gather operations to overlap high-volume memory transactions with computation.
- Token Rounding (TR): A routing algorithm that aligns token assignment with hardware tile sizes (e.g., 128), eliminating padding waste in sparse regimes.
Headline results: a 1.86x throughput improvement and a 45% activation memory reduction on a fine-grained 7B MoE compared to ScatterMoE. On 64 H100s, SonicMoE matches the training throughput of ScatterMoE running on 96 H100s.
Motivation: The Granularity-Efficiency Paradox
Scaling laws indicate that increasing expert granularity ($G = d/n$, where $d$ is embedding dim and $n$ is expert intermediate dim) improves model quality per FLOP. However, this shift creates severe hardware inefficiencies.
A. Declining Arithmetic Intensity
As experts become smaller (higher $G$) and sparser (lower activation ratio $\rho = K/E$), the ratio of FLOPs to IO degrades. The Arithmetic Intensity (AI) for a forward pass is derived as:
\(\text{AI} = \frac{3}{2 + 2G + 3/\rho}\)
Insight: as granularity $G$ increases (e.g., from 4 in Mixtral to 16 in DeepSeek-V3), the denominator grows, causing AI to drop monotonically. This pushes training into a strictly memory-bound regime, where IO costs scale linearly with granularity.
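As a quick sanity check of this trend, the sketch below plugs the granularity values quoted above into the AI expression. The $\rho = K/E$ values are my own assumption based on commonly reported routing configurations (Mixtral: top-2 of 8 experts; DeepSeek-V3: top-8 of 256 routed experts), not figures from the paper.

```python
def arithmetic_intensity(G: float, rho: float) -> float:
    """Forward-pass AI from the expression above (relative units)."""
    return 3.0 / (2.0 + 2.0 * G + 3.0 / rho)

# Granularities as quoted in the text; rho = K/E from commonly reported
# routing configs (illustrative assumption, not taken from the paper).
for name, G, K, E in [("Mixtral", 4, 2, 8), ("DeepSeek-V3", 16, 8, 256)]:
    rho = K / E
    print(f"{name:12s} G={G:3d}  rho={rho:.4f}  AI={arithmetic_intensity(G, rho):.3f}")
```

Running this gives an AI of roughly 0.14 for the coarse configuration and roughly 0.02 for the fine-grained one, i.e., about a 6x drop in compute per byte moved.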
B. Activation Memory Bloat
Standard implementations (ScatterMoE, MoMoE) cache the down-projection output $Y$ (size $T \times K \times d$) to compute gradients. In fine-grained models where $d \gg n$, this results in massive memory footprints that scale linearly with the number of active experts.
C. Tile Quantization Waste
GPU GEMMs operate on fixed tile sizes (e.g., $M_{tile}=128$). In highly sparse models (large $E$), the number of tokens routed to any single expert often fails to fill a tile, leading to significant compute waste due to padding.
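A rough illustration of how much compute this can waste (the configuration numbers here are my own, chosen only to mimic a high-sparsity regime): route tokens uniformly at random with Top-K selection and measure how many GEMM rows become padding once each expert's batch is rounded up to a whole tile.

```python
import numpy as np

# Illustrative high-sparsity config (my numbers, not the paper's).
T, K, E, M_tile = 4096, 8, 256, 128
rng = np.random.default_rng(0)

# Token counts per expert under uniform random Top-K routing.
assignments = np.stack([rng.choice(E, size=K, replace=False) for _ in range(T)])
loads = np.bincount(assignments.ravel(), minlength=E)

padded = np.ceil(loads / M_tile) * M_tile
waste = 1.0 - loads.sum() / padded.sum()
print(f"fraction of GEMM rows that are padding: {waste:.1%}")
```

Even with a balanced router (128 tokens per expert on average in this setup), natural load variation pushes many experts just past a tile boundary, so a large share of the padded rows carry no useful work.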
Technical Innovations
A. Memory-Efficient Backward Pass (The “No-Cache” Derivation)
To break the memory dependency on $d$, SonicMoE re-derives the gradient computation for the router scores ($dS$) to avoid caching $Y$.
- Standard Formulation (High Memory): \(dS_{t,e} = \langle dO_t, Y_{t,e} \rangle\), which requires caching $Y$ ($TKd$ bytes).
- SonicMoE Formulation (Low Memory): exploits the linearity of the down-projection $Y_e = A_e W_{2,e}$.
- Compute an intermediate gradient with respect to the up-projection output $A_e$: \(dA'_e = dO_e W_{2,e}^T \quad (\text{Size: } T_e \times n)\)
- Compute $dS$ via inner product with up-projection output $A$: \(dS_{t,e} = \langle dA'_{t,e}, A_{t,e} \rangle\)
- Impact: The memory requirement now scales with $n$ rather than $d$. Since fine-grained models have $n \ll d$, this drastically reduces the activation footprint (a numerical check of the identity is sketched below).
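A minimal NumPy sketch checking that the two formulations agree, for a single expert's routed tokens. The shapes and the use of NumPy are illustrative assumptions; the paper's kernels perform this computation inside the GEMM epilogue.

```python
import numpy as np

# Illustrative sizes (not the paper's): T_e routed tokens, embedding dim d,
# expert intermediate dim n, with n << d as in fine-grained MoEs.
T_e, d, n = 64, 1024, 128
rng = np.random.default_rng(0)

A  = rng.standard_normal((T_e, n))   # up-projection / SwiGLU output (kept anyway for dW2)
W2 = rng.standard_normal((n, d))     # down-projection weight W_{2,e}
dO = rng.standard_normal((T_e, d))   # upstream gradient for this expert's tokens

# Standard formulation: needs Y = A @ W2 (T_e x d) cached.
Y = A @ W2
dS_standard = np.einsum("td,td->t", dO, Y)

# SonicMoE formulation: reuse dA' = dO @ W2^T (T_e x n), which the backward
# pass produces anyway, and take its inner product with the cached A.
dA_prime = dO @ W2.T
dS_no_cache = np.einsum("tn,tn->t", dA_prime, A)

assert np.allclose(dS_standard, dS_no_cache)
print(f"cached floats per token: standard={d}, no-cache={n}")  # d vs n, here 8x smaller
```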
B. IO-Aware Kernel Design (Hopper/Blackwell)
SonicMoE implements kernels that maximize asynchronous overlap.
- Gather Fusion: Fuses the token gather operation ($2TKd$ bytes) directly into the GEMM prologue. On Hopper, this uses asynchronous prefetching; on Blackwell, it employs a relay warp mechanism to handle 2-CTA cluster synchronization limitations. (A rough IO accounting for this fusion follows the list below.)
- Epilogue Fusion: Fuses the dSwiGLU, $dS$, and $dH$ computations into the GEMM epilogue, avoiding separate kernel launches that would require extra reads and writes to HBM.
- Ping-Pong Scheduling: Utilizing Hopper's asynchronous execution model, the kernel overlaps the Matrix Multiply-Accumulate (MMA) of one warpgroup with the heavy epilogue IO (TMA stores) of the other. This effectively hides the latency of the high-volume memory transactions required by the heavy epilogue in fine-grained MoEs.
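To make the IO saving from gather fusion concrete, here is a small accounting sketch. The $2TKd$ figure (bf16 gathered tokens) comes from the text; the cost attributed to an unfused gather, i.e. writing the gathered copy to HBM and having the GEMM read it back, is my assumption about how a separate gather kernel behaves, not a number from the paper.

```python
def gemm_input_traffic_gib(T: int, K: int, d: int, fused: bool, dtype_bytes: int = 2) -> float:
    """Approximate HBM bytes spent feeding routed tokens into the expert GEMM."""
    gathered = T * K * d * dtype_bytes  # the 2TKd bytes cited in the text (bf16)
    if fused:
        # Gather fused into the GEMM prologue: token rows are read once, via their indices.
        traffic = gathered
    else:
        # Separate gather kernel (assumed): write the (T*K, d) copy, then the GEMM reads it back.
        traffic = gathered + 2 * gathered
    return traffic / 2**30

# Illustrative fine-grained config (my numbers): 16K tokens, K = 8, d = 4096, bf16.
for fused in (False, True):
    print(f"fused={fused}: {gemm_input_traffic_gib(16384, 8, 4096, fused):.2f} GiB")
```

Under these assumptions, fusion removes roughly two thirds of the HBM traffic spent staging routed tokens for the expert GEMM.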
C. Token Rounding (TR) Routing
To solve tile quantization, TR modifies the routing logic to ensure expert load is always a multiple of $M_{tile}$.
- Algorithm:
- Calculate Top-K scores.
- Sort tokens by score within experts.
- Round: Adjust the token count for each expert to the nearest multiple of $M_{tile}$.
- Sparsify/Pad: If rounding down, drop the lowest-scoring tokens; if rounding up, pad with high-scoring tokens that would have been selected by an Expert Choice router.
- Constraint: The deviation from the original Top-K assignment is at most one tile per expert. A simplified sketch of the procedure follows.
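A simplified Python sketch of this routing step, reconstructed from the description above (the function name, score shapes, and handling of ties and capacity are my own choices; the real router is implemented alongside the kernels and handles details not shown):

```python
import numpy as np

def token_rounding(scores: np.ndarray, K: int, M_tile: int = 128) -> dict[int, np.ndarray]:
    """Nearest-rounding Token Rounding sketch.
    scores: (T, E) router scores. Returns, per expert, the token indices it
    processes, with every count rounded to a multiple of M_tile."""
    T, E = scores.shape

    # 1. Standard Top-K selection per token.
    topk = np.argsort(-scores, axis=1)[:, :K]                  # (T, K)
    selected = np.zeros((T, E), dtype=bool)
    np.put_along_axis(selected, topk, True, axis=1)

    assignments = {}
    for e in range(E):
        # 2. Tokens routed to expert e, sorted by score (highest first).
        chosen = np.where(selected[:, e])[0]
        chosen = chosen[np.argsort(-scores[chosen, e])]
        # 3. Round the count to the nearest multiple of M_tile.
        target = int(round(len(chosen) / M_tile)) * M_tile
        if target <= len(chosen):
            # 4a. Round down: drop the lowest-scoring tokens for this expert.
            assignments[e] = chosen[:target]
        else:
            # 4b. Round up: backfill with the highest-scoring tokens not already
            #     routed here (Expert-Choice-style selection).
            extra = np.where(~selected[:, e])[0]
            extra = extra[np.argsort(-scores[extra, e])][: target - len(chosen)]
            assignments[e] = np.concatenate([chosen, extra])
    return assignments
```

With this nearest-rounding variant, every expert batch fills its GEMM tiles exactly, which is what removes the padding waste quantified in the benchmarks below.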
Quantitative Analysis & Benchmarks
A. Throughput (H100)
- Forward Pass: For a 7B model ($n=256$), SonicMoE achieves 43% higher throughput than DeepGEMM and 83% higher than ScatterMoE.
- Backward Pass: Achieves 115% speedup over MoMoE and 83% over ScatterMoE. The baseline methods suffer from launching separate kernels for gradients, whereas SonicMoE fuses $dSwiGLU$, $dS$, and $dH$ into the GEMM epilogue.
- Sparse Regimes: For a Qwen3-Next configuration ($K/E = 10/512$), SonicMoE with Token Rounding provides a 19.6% forward pass speedup over standard Top-K routing.
B. Memory Efficiency
- Granularity Robustness: As granularity increases (experts get smaller, e.g., $d/n$ goes from 0.5 to 4.0), SonicMoE’s memory footprint remains constant. In contrast, ScatterMoE and MoMoE show linear memory growth.
- Absolute Savings: At 120B scale, SonicMoE saves >3 GiB per layer compared to MoMoE.
C. Token Rounding Ablation
- Waste Elimination: In high sparsity settings ($E=256, K=8$), standard routing results in significant padding waste. TR recovers this, yielding a 15.9% end-to-end TFLOPS improvement.
- Model Quality: Experiments on OLMoE (1.4B) confirm that TR is a drop-in replacement. Training with “Nearest Rounding” maintains perplexity (difference < 0.02) and downstream accuracy while boosting throughput.