Reading the following paper:

SonicMoE is a hardware-software co-design framework tailored for NVIDIA Hopper and Blackwell GPUs. It mitigates the IO bottlenecks inherent in fine-grained MoEs through three primary contributions:

  1. Gradient Flow Redesign: A mathematical reformulation of the backward pass that eliminates the need to cache the large down-projection output $Y$, decoupling activation memory from expert granularity.
  2. IO-Aware Kernels: Custom CuTe-DSL kernels that utilize “Ping-Pong” scheduling and fused gather operations to overlap high-volume memory transactions with computation.
  3. Token Rounding (TR): A routing algorithm that aligns token assignment with hardware tile sizes (e.g., 128), eliminating padding waste in sparse regimes.

The headline results: a 1.86x throughput improvement and a 45% activation memory reduction on a fine-grained 7B MoE compared to ScatterMoE. On 64 H100s, SonicMoE matches the training throughput of ScatterMoE running on 96 H100s.


Motivation: The Granularity-Efficiency Paradox

Scaling laws indicate that increasing expert granularity ($G = d/n$, where $d$ is embedding dim and $n$ is expert intermediate dim) improves model quality per FLOP. However, this shift creates severe hardware inefficiencies.

A. Declining Arithmetic Intensity

As experts become smaller (higher $G$) and sparser (lower activation ratio $\rho = K/E$), the ratio of FLOPs to IO degrades. The Arithmetic Intensity (AI) for a forward pass is derived as:

\[\text{AI} = \frac{3}{2 + 2G + 3/\rho}\]

Insight: As granularity $G$ increases (e.g., from 4 in Mixtral to 16 in DeepSeek-V3), the denominator grows, causing AI to drop monotonically. This pushes training into a strictly memory-bound regime, where IO costs scale linearly with granularity.
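As a quick plug-in (my own illustration, not a figure from the paper; it reuses the $G$ values quoted above together with the public routing configurations of Mixtral, $K/E = 2/8$, and DeepSeek-V3, $K/E = 8/256$):

\[\text{Mixtral-like } (G = 4,\ \rho = 1/4): \quad \text{AI} = \frac{3}{2 + 8 + 12} \approx 0.14\]

\[\text{DeepSeek-V3-like } (G = 16,\ \rho = 1/32): \quad \text{AI} = \frac{3}{2 + 32 + 96} \approx 0.02\]

Moving to the finer-grained, sparser configuration cuts the normalized arithmetic intensity by roughly 6x.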

B. Activation Memory Bloat

Standard implementations (ScatterMoE, MoMoE) cache the down-projection output $Y$ (size $T \times K \times d$) to compute gradients. In fine-grained models where $d \gg n$, this results in massive memory footprints that scale linearly with the number of active experts.
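For a sense of scale (assumed shapes, not numbers reported in the paper): with $T = 16{,}384$ tokens per batch, $K = 8$ active experts, and $d = 4096$, the cached tensor holds

\[T \cdot K \cdot d = 16{,}384 \times 8 \times 4096 \approx 5.4 \times 10^{8} \text{ elements} \approx 1\ \text{GiB of BF16 per MoE layer},\]

and this footprint grows linearly with $K$ as models activate more (smaller) experts.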

C. Tile Quantization Waste

GPU GEMMs operate on fixed tile sizes (e.g., $M_{tile}=128$). In highly sparse models (large $E$), the number of tokens routed to any single expert often fails to fill a tile, leading to significant compute waste due to padding.
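For illustration (assumed numbers, not from the paper): with $T = 4096$ tokens, $K = 8$, and $E = 256$, the mean load is $TK/E = 128$ tokens per expert, but under realistic imbalance an expert that receives only 40 tokens still occupies a full $M_{tile} = 128$ tile:

\[\text{wasted compute} = \frac{128 - 40}{128} \approx 69\% \text{ of that expert's GEMM tile.}\]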


Technical Innovations

A. Memory-Efficient Backward Pass (The “No-Cache” Derivation)

To break the memory dependency on $d$, SonicMoE re-derives the gradient computation for the router scores ($dS$) to avoid caching $Y$.

  • Standard Formulation (High Memory): \(dS_{t,e} = \langle dO_t, Y_{t,e} \rangle\), which requires caching $Y$ ($T \times K \times d$ elements).
  • SonicMoE Formulation (Low Memory): SonicMoE exploits the linearity of the down-projection $Y_e = A_e W_{2,e}$.
    1. Compute the gradient w.r.t. the down-projection input (the expert's intermediate activation): \(dA'_e = dO_e W_{2,e}^T \quad (\text{Size: } T_e \times n)\)
    2. Compute $dS$ via inner product with up-projection output $A$: \(dS_{t,e} = \langle dA'_{t,e}, A_{t,e} \rangle\)
  • Impact: The memory requirement now scales with $n$ rather than $d$. Since fine-grained models use $n \ll d$, this drastically reduces the footprint (a minimal numerical sketch of the identity follows this list).
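Below is a minimal PyTorch sketch of the identity behind this reformulation for a single expert. Shapes and variable names (A, W2, dO) are illustrative assumptions; it verifies the math, it is not SonicMoE's kernel.

```python
import torch

# Assumed shapes for one expert e (illustrative, not from the paper):
# T_e tokens routed to the expert, n = expert intermediate dim, d = embedding dim.
T_e, n, d = 64, 256, 4096
A  = torch.randn(T_e, n)   # up-projection/SwiGLU output, cached by SonicMoE (width n)
W2 = torch.randn(n, d)     # down-projection weight W_{2,e}
dO = torch.randn(T_e, d)   # gradient of the loss w.r.t. the expert's output rows

# Standard formulation: needs Y = A @ W2 (T_e x d) cached from the forward pass.
Y = A @ W2
dS_standard = (dO * Y).sum(dim=-1)      # dS_{t,e} = <dO_t, Y_{t,e}>

# SonicMoE formulation: reuse dA' = dO @ W2^T (needed by the backward pass anyway)
# together with the cached A of width n; nothing of width d is stored.
dA_prime = dO @ W2.T                    # size T_e x n
dS_sonic = (dA_prime * A).sum(dim=-1)   # dS_{t,e} = <dA'_{t,e}, A_{t,e}>

# The two agree up to floating-point error.
assert torch.allclose(dS_standard, dS_sonic, rtol=1e-3, atol=1e-3)
```

The identity holds because \(\langle dO_t, A_{t,e} W_{2,e} \rangle = \langle dO_t W_{2,e}^T, A_{t,e} \rangle\), so only tensors of width $n$ need to be kept.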

B. IO-Aware Kernel Design (Hopper/Blackwell)

SonicMoE implements kernels that maximize asynchronous overlap.

  • Gather Fusion: Fuses the token gather operation ($2TKd$ bytes) directly into the GEMM prologue. On Hopper, this uses asynchronous prefetching; on Blackwell, it employs a relay warp mechanism to handle 2-CTA cluster synchronization limitations (a reference sketch of the unfused baseline follows this list).
  • Epilogue Fusion: Fuses dSwiGLU, $dS$, and $dH$ computation into the GEMM epilogue. This avoids launching separate kernels that would require reading/writing to HBM.
  • Ping-Pong Scheduling: Exploiting Hopper’s asynchronous execution model, the kernel overlaps the Matrix Multiply-Accumulate (MMA) of one warpgroup with the heavy epilogue IO (TMA store) of the other, effectively hiding the latency of the high-volume memory transactions that the heavy epilogue incurs in fine-grained MoEs.
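For context on what gather fusion removes, here is a rough sketch of the unfused baseline: an explicit gather materializes the routed tokens in HBM ($2TKd$ bytes in BF16) before a grouped GEMM reads them back. All names and shapes below are illustrative assumptions, not SonicMoE code.

```python
import torch

# Illustrative shapes (not from the paper).
T, d, n, E, K = 1024, 512, 128, 32, 4
X = torch.randn(T, d)                         # token activations (BF16 in real training)
W1 = torch.randn(E, d, n)                     # per-expert up-projection weights
expert_idx = torch.randint(0, E, (T, K))      # stand-in for the router's Top-K expert ids

# Unfused baseline: an explicit gather writes a (T*K, d) copy of the tokens to HBM
# (2*T*K*d bytes in BF16) ...
flat_idx = expert_idx.reshape(-1)                 # (T*K,) destination expert per slot
token_idx = torch.arange(T).repeat_interleave(K)  # source token id per slot
order = torch.argsort(flat_idx)                   # group slots by destination expert
X_gathered = X[token_idx[order]]                  # the extra HBM round trip

# ... and a separate grouped GEMM then reads it back, one expert segment at a time.
counts = torch.bincount(flat_idx, minlength=E)
offsets = torch.cumsum(counts, dim=0) - counts
H = torch.empty(T * K, n)
for e in range(E):
    s, c = offsets[e].item(), counts[e].item()
    H[s:s + c] = X_gathered[s:s + c] @ W1[e]

# SonicMoE instead loads the needed rows of X directly in the GEMM prologue
# (async prefetch on Hopper, a relay warp on Blackwell), so X_gathered never exists.
```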

C. Token Rounding (TR) Routing

To eliminate tile quantization waste, TR modifies the routing logic so that each expert's load is always a multiple of $M_{tile}$.

  • Algorithm:
    1. Calculate Top-K scores.
    2. Sort tokens by score within experts.
    3. Round: Adjust the token count for each expert to the nearest multiple of $M_{tile}$.
    4. Sparsify/Pad: If rounding down, drop the lowest-scoring tokens; if rounding up, pad with high-scoring tokens that an Expert Choice router would have selected.
  • Constraint: The deviation from the original Top-K assignment is bounded by at most one tile's worth of tokens per expert (a minimal sketch of the algorithm follows below).
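A minimal sketch of my reading of these four steps with nearest rounding; the function name, the dense score matrix, and the per-expert Python loop are illustrative assumptions, not SonicMoE's implementation.

```python
import torch

def token_rounding(scores: torch.Tensor, K: int, m_tile: int = 128):
    """Per-expert token lists after nearest rounding to a multiple of m_tile.

    scores: (T, E) router scores. Illustrative sketch of the four steps above.
    """
    T, E = scores.shape
    topk_idx = scores.topk(K, dim=-1).indices        # step 1: Top-K experts per token

    # assigned[t, e] = True if expert e is in token t's Top-K.
    assigned = torch.zeros(T, E, dtype=torch.bool)
    assigned.scatter_(1, topk_idx, True)

    expert_tokens = []
    for e in range(E):
        chosen = assigned[:, e].nonzero(as_tuple=True)[0]            # tokens routed to expert e
        chosen = chosen[scores[chosen, e].argsort(descending=True)]  # step 2: best first

        target = int(round(len(chosen) / m_tile)) * m_tile           # step 3: nearest multiple

        if target <= len(chosen):
            # Step 4a: round down -> drop the lowest-scoring tokens.
            chosen = chosen[:target]
        else:
            # Step 4b: round up -> pad with the highest-scoring tokens not already
            # assigned to this expert (Expert-Choice-style selection).
            spare = scores[:, e].masked_fill(assigned[:, e], float("-inf"))
            pad = spare.topk(target - len(chosen)).indices
            chosen = torch.cat([chosen, pad])
        expert_tokens.append(chosen)                                 # len(chosen) % m_tile == 0
    return expert_tokens
```

Because every expert's token count is now a multiple of m_tile, every GEMM tile launched for the MoE layer is fully occupied.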

Quantitative Analysis & Benchmarks

A. Throughput (H100)

  • Forward Pass: For a 7B model ($n=256$), SonicMoE achieves 43% higher throughput than DeepGEMM and 83% higher than ScatterMoE.
  • Backward Pass: Achieves a 115% speedup over MoMoE and 83% over ScatterMoE. The baselines launch separate kernels for the gradient computations, whereas SonicMoE fuses dSwiGLU, $dS$, and $dH$ into the GEMM epilogue.
  • Sparse Regimes: For a Qwen3-Next configuration ($K/E = 10/512$), SonicMoE with Token Rounding provides a 19.6% forward pass speedup over standard Top-K routing.

B. Memory Efficiency

  • Granularity Robustness: As granularity increases (experts get smaller, e.g., $d/n$ goes from 0.5 to 4.0), SonicMoE’s memory footprint remains constant. In contrast, ScatterMoE and MoMoE show linear memory growth.
  • Absolute Savings: At 120B scale, SonicMoE saves >3 GiB per layer compared to MoMoE.

C. Token Rounding Ablation

  • Waste Elimination: In high-sparsity settings ($E=256, K=8$), standard routing leaves significant padding waste; TR eliminates it, yielding a 15.9% end-to-end TFLOPS improvement.
  • Model Quality: Experiments on OLMoE (1.4B) confirm that TR is a drop-in replacement. Training with “Nearest Rounding” maintains perplexity (difference < 0.02) and downstream accuracy while boosting throughput.