Scalable Training of Mixture-of-Experts Models with Megatron Core

Reading the following paper:

Scalable Training of Mixture-of-Experts Models with Megatron Core

This paper presents Megatron-Core MoE, the MoE training stack within NVIDIA’s Megatron-Core framework. It addresses the fundamental systems challenges of training trillion-parameter-class MoE models at high throughput on NVIDIA GPU clusters. The headline results: 1,233 TFLOPS/GPU on GB300 and 1,048 TFLOPS/GPU on GB200 for DeepSeek-V3-685B.

1. MoE Fundamentals

Architecture

Given an input token representation x, the router computes:

\[\mathbf{p}(\mathbf{x}) = \text{Softmax}(\mathbf{W}_r \mathbf{x})\]

The MoE layer output is:

\[\text{MoE}(\mathbf{x}) = \sum_{i \in \text{TopK}(\mathbf{p}(\mathbf{x}))} p_i(\mathbf{x}) \cdot E_i(\mathbf{x})\]

where $E_i$ is the $i$-th expert network. Three key advantages:

Scalable capacity: model size grows independently of per-token compute
Computational efficiency: only $K$ of $E$ experts activate per token
Specialization: different experts learn different input patterns

The Parameter-Compute Mismatch

This is the central insight of the paper. In a dense transformer with $N_\text{total}$ parameters, FLOPs per token is approximately $6N_\text{total}$, so parameters and computation scale in lockstep.

For MoE, per-token computation is approximately $6N_\text{active}$ where $N_\text{active} \propto K$ while $N_\text{total} \propto E$, and $K \ll E$.

Concrete example: DeepSeek-V3 has 685B total parameters but only 37B active per token, an 18x gap.

This creates the Three Walls:

Wall	Root Cause	Manifestation
Memory	All $E$ experts’ params/grads/optimizer states in memory, but only $K$ activate	199.5 GB per GPU for DeepSeek-V3
Communication	EP requires all-to-all collectives to route tokens across GPUs	20-60% of training time
Compute	Small per-expert GEMMs underutilize Tensor Cores; many kernel launches	Host-boundedness

2. MoE Layer Architecture: Four-Stage Forward Pass

Stage 1: Route. A learned linear projection maps each token’s hidden state to $E$ logits: $\mathbf{l} = \mathbf{W}_r^\top \mathbf{x} \in \mathbb{R}^E$. A score function, either softmax or sigmoid as used in DeepSeek-V3, converts these into probabilities. Top-$k$ selection chooses the active experts.

Stage 2: Dispatch. Tokens are permuted so those destined for the same expert are contiguous. Three backends are described: AllGather, All-to-All, and Flex via DeepEP/HybridEP.

Stage 3: Expert Computation. All local experts run in a single Grouped GEMM call. Each expert is a two-layer MLP with optional gating: $E_i(\mathbf{x}) = \mathbf{W}_2^{(i)} \phi(\mathbf{W}_1^{(i)} \mathbf{x})$

Stage 4: Combine. Inverse communication returns tokens, unpermutation restores order, and shared expert output is added.

3. Parallel Folding and Multi-Dimensional Parallelism

The Dense-Sparse Mismatch

A single Transformer block contains two fundamentally different computation patterns:

Aspect	Attention (Dense)	MoE (Sparse)
TP	Large QKV matrices benefit from high TP	Small per-expert dims make high TP counterproductive
CP	Long sequences benefit from high CP	No sequence dependency; CP is irrelevant
EP	Not applicable	Essential for distributing experts

Prior frameworks forced World Size = TP x CP x PP x DP, where EP <= DP. This creates three problems:

Multiplicative GPU requirements: EP=8 forces DP>=8, and with CP=8 the minimum becomes 64 GPUs.
Forced suboptimal parallelism: high TP fragments small experts, while low TP underparallelizes attention.
Cross-node communication: EP constrained within DP forces all-to-all across slow interconnects.

Parallel Folding Solution

Core idea: decouple attention and MoE parallelism mappings.

Attention layers form groups over TP x CP x DP x PP
MoE layers form groups over ETP x EP x EDP x PP
Only constraint: PP must remain consistent

Key benefits:

Breaks EP <= DP: EP can fold across TP x CP groups. Example: attention TP=4, CP=2, DP=8, PP=4 on 256 GPUs. Traditionally EP<=8; with folding EP=64 becomes possible.
Reduces minimum GPUs: CP=8 and EP=8 traditionally requires 64 GPUs; with folding, only 8.
Independent optimization: attention uses high TP, while MoE uses ETP=1 for full expert width.
NVLink locality: both CP and EP all-to-all stay within the NVLink domain.

Gradient Handling

Expert gradients are scaled by edp_size / dp_size to account for the different effective batch sizes seen by experts versus dense layers.

4. Breaking the Memory Wall

Memory Anatomy (DeepSeek-V3, PP4 x VPP4 x EP64, 256 GPUs)

Component	Memory/GPU	Optimization
Weights & Gradients	36.4 GB	PP, EP, TP sharding
Main Weights & Optimizer States	32.1 GB	Distributed optimizer, BF16 moments
Activations	131.0 GB	Low precision, recomputation, offloading
Total	199.5 GB

Key insight: activations dominate, exceeding weights and optimizer states combined.

Memory-Efficient Permutation (Zero Overhead)

Standard formulation applies routing weights after expert computation: $y = \sum_{i \in \mathcal{T}(\mathbf{x})} p_i \cdot \mathbf{W}_2^{(i)} \phi(\mathbf{W}_1^{(i)} \mathbf{x})$

Memory-efficient version absorbs $p_i$ before the second linear layer: $y = \sum_{i \in \mathcal{T}(\mathbf{x})} \mathbf{W}_2^{(i)} \left(p_i \cdot \phi(\mathbf{W}_1^{(i)} \mathbf{x})\right)$

Since $\mathbf{W}_2^{(i)}$ is a pure linear map with no bias, scalar multiplication commutes: $p_i \cdot \mathbf{W}_2^{(i)} \mathbf{h} = \mathbf{W}_2^{(i)} (p_i \cdot \mathbf{h})$

Why this saves memory: in the standard version, computing $\partial\mathcal{L}/\partial p_i$ requires retaining each expert output $E_i(\mathbf{x})$. In the efficient version, $p_i$ multiplies $\phi(\mathbf{z}_i)$ directly, so $\partial\mathcal{L}/\partial p_i$ only depends on $\phi(\mathbf{z}_i)$, which can be recomputed from $\mathbf{z}_i = \mathbf{W}_1^{(i)} \mathbf{x}$ already saved for SwiGLU backward. This saves roughly 26.3 GB per GPU for DeepSeek-V3 with essentially zero extra compute.

FP8/FP4 Activations

Linear layer inputs stored in FP8 instead of BF16 reduce memory by 50% per tensor. For DeepSeek-V3, that is roughly 16 GB saved. FP4 pushes this further to a 75% reduction.

Fine-Grained Recomputation

Two composable techniques:

Granular recomputation: selectively recompute only specific operations such as activation functions, LayerNorm, and MLA up-projection, typically with under 5% compute overhead.
Output-discarding recomputation: release checkpointed module outputs immediately after downstream consumption and restore them via recomputation during backward.

Recomputation Target	Memory Saved/GPU
MLA Up-Projection	30.4 GB
SwiGLU Activation	3.8 GB
LayerNorm	8.2 GB
Total	42.4 GB

Critical insight: full-layer recomputation of MoE is especially expensive because it re-triggers EP all-to-all communication. Fine-grained recomputation avoids that penalty.

Fine-Grained Activation Offloading

Forward: input activations are offloaded to CPU through a dedicated D2H stream, overlapping with the next module’s computation.

Backward: Layer-Staggered Reload reloads the same module type from the next layer while gradients are computed for the current layer. Only one activation per module type resides on GPU at a time.

Peak memory advantage over full recomputation:

Full recomputation: $L \times \text{layer_input} + 1 \times \text{layer_intermediate}$
Offloading: $1 \times \text{layer_input} + 1 \times \text{layer_intermediate}$

Results: 10-18% memory reduction with only 1.6-2% throughput overhead. For Qwen3-235B, offloading enabled a lower TP degree and about 15% throughput improvement.

Precision-Aware Optimizer

Adam stores first and second moments. The optimization here is to store moments in BF16 or FP8, then cast to FP32 inside TransformerEngine’s FusedAdam kernel for the actual update.

Memory per parameter per DP rank decreases from $6 + 12/d$ bytes to $6 + 8/d$ bytes, saving on the order of 10-12 GB from the 32.1 GB optimizer-state budget.

FSDP for MoE

Dual DeviceMesh design: dense layers shard across the full DP group, while expert layers shard across the EDP group.

Two key optimizations:

Non-uniform sharding: flatten and concatenate module parameters, then shard non-uniformly so shard boundaries align with communication buffers for zero-copy collectives.
Persistent double buffers with NCCL User Buffer Registration: pre-allocate two persistent buffers and cycle between them. This reduces SM footprint from 8-32 SMs to 1-4 SMs.

5. Breaking the Communication Wall

Communication Anatomy

For DeepSeek-V3: 58 MoE layers times 2 operations per layer gives 116 dispatch/combine operations per forward pass. Backward doubles this. At 50 GB/s inter-node bandwidth, a single 200 MB dispatch already costs milliseconds, which compounds rapidly over an iteration.

HybridEP

Developed by NVIDIA for NVLink-rich topologies such as NVL72.

Dispatch: reads data from global memory into shared memory based on routing info, then writes to destinations via FIFO queues. For inter-node traffic, GPUs with the same local index across nodes exchange first, then forward within the node.

Combine: fuses reduction into the communication kernel itself. Cross-node data is reduced first, then the remaining intra-node reduction completes locally.

Performance on GB200 with EP64:

HybridEP dispatch: 675 us vs all-to-all 930 us
HybridEP combine: 744 us vs all-to-all 827 us

EP Communication Overlapping

1F1B forward-backward overlap merges the forward pass of one microbatch with the backward pass of another.

Two key optimizations:

Stream Separation: compute stream and communication stream run in parallel.
W/D Split: backward MLP is split into weight gradient (W/mlp) and data gradient (D/mlp). Only D/mlp depends on backward dispatch, which opens more room to hide communication.

Result: EP communication overhead drops from 30-40% to under 5% of iteration time for DeepSeek-V3 on H100.

6. Breaking the Compute Efficiency Wall

Grouped GEMM

DeepSeek-V3’s 256 small experts produce GEMMs with M dimensions around 128 tokens per expert, far below the regime needed for peak Tensor Core efficiency.

Four implementations are discussed:

Multi-stream cuBLASLt GEMMs: individual GEMMs launched into multiple CUDA streams
CUTLASS Grouped GEMM: a fused single-kernel path
cuBLASLt Grouped GEMM (device-initiated): reads shapes from device memory, making it CUDA-Graph-compatible
cuteDSL Grouped GEMM: fuses SwiGLU activation and FP8 quantization into the GEMM epilogue

Permutation Fusion

Three-stage pipeline:

Preprocessing: generate the Row ID map once
Permute: move tokens according to offset maps
Unpermute: inverse permutation plus FP32 accumulation

Router and Aux-Loss Fusion

Three fused kernels are described:

score computation with top-$k$ and softmax/sigmoid
score computation for auxiliary loss
auxiliary loss computation

CUDA Graphs

Full vs Partial: Full CUDA Graphs capture the entire forward-backward pass, but only for drop-and-pad MoE. Partial CUDA Graphs capture only the static components:

Graphable: attention, router, EP preprocessing, shared experts, dense MLP
Not graphable: token dispatch, expert GEMM with dynamic M dimension, token combine

Memory optimizations:

Graph count reduction: without PP, microbatches share graphs (L x 2); with PP, each microbatch needs its own (L x M x 2). An is_first_microbatch GPU flag controls microbatch-specific behavior.
Pool sharing: all graphs share one pool when captured in execution order.
Buffer reuse: static I/O buffers are reused across graphs per PP execution order.

Result: about 10% end-to-end speedup with roughly 7 GB extra memory on DeepSeek-V3 GB200.

Full CUDA Graphs for Dropless MoE: Three Complementary Techniques

Challenge 1: kernel launch without knowing problem size. Solution: device-initiated Grouped GEMM, where cuBLASLt reads shapes directly from device memory. The cuteDSL path can also fuse SwiGLU and quantization into the epilogue.

Challenge 2: memory allocation without knowing actual size. Without mitigation, worst-case buffers waste $O(\text{EP_size})$ more memory.

ECHO (Elastic Cloning for Hot Experts)

Popular experts can receive far more tokens than others. ECHO dynamically clones hot experts onto spare slots on underutilized ranks.

Forward: the ECHO planner identifies hot experts, generates the hot expert map and updated routing map, copies weights to spare slots through HybridEP, and routes tokens to both the home and cloned experts.

Backward: expert-gradient dispatch collects gradients from cloned experts back to the home experts.

Paged Stashing

Decouples worst-case computation buffers from activation storage:

Single tmp buffer sized for worst case and shared across all layers
Paged stashing buffer stores only the actual tokens used per layer

This reduces memory from $O(\text{layers} \times \text{worst_case})$ to $O(\text{worst_case} + \text{actual_total})$.

Implementation detail: PagedStashBuffer uses 64 tokens per page with a circular-buffer free list. Stash and reload kernels are device-initiated and overlap with computation via dedicated Pack and Unpack CUDA streams.

7. Reduced-Precision Training (FP8/FP4)

Strategy: Selective Precision

Three principles:

Protect routing: keep the router in FP32
Preserve key components: embeddings, output layers, main gradients, master weights, and optimizer states stay high precision
Quantize bulk computation: expert GEMMs and activations go low precision

FP8 Recipes

Per-Tensor FP8 (Hopper and Blackwell): one scale per tensor. Two variants are discussed:

Delayed scaling: uses historical amax, but is not the recommended path
Current/live scaling: computes scale just in time

The hybrid format uses E4M3 for inputs and weights and E5M2 for gradients.

Blockwise FP8 (recommended on Hopper): uses E4M3 for all tensors, with activations and gradients quantized in 1 x 128 tiles and weights in 128 x 128 blocks. The paper notes this recipe has already been proven at scale on models such as DeepSeek-V3, Minimax-M2, and Ant Ling-2.0.

MXFP8 (recommended on Blackwell): uses 1 x 32 element granularity with native Blackwell Tensor Core support and hardware-accelerated scaling. The important caveat is that parameter AllGather still communicates in BF16 because MXFP8 uses different quantization directions for forward and backward.

NVFP4

Uses E2M1 with two-level microscaling:

Per-tensor FP32 scale: remaps the overall distribution into a range compatible with block scaling
Per-block E4M3 scale: blocks of 16 elements then map into FP4 range

Three critical algorithmic additions are required for stable training:

Random Hadamard Transforms (RHT): applied to weight-gradient computation to reduce outlier impact
2D scaling: 16 x 16 weight-block scaling keeps forward and backward consistent
Stochastic rounding: applied on gradients to reduce rounding bias during FP4 conversion

FP8/FP4 Primary Weights

The framework eliminates a redundant BF16 copy by casting directly from FP32 to FP8 or FP4. The distributed-optimizer quantization flow is:

Get local abs-max from the master weights
AllReduce to get global abs-max
Use global abs-max plus master weights for the partial cast

For the blockwise recipe, specialized kernels aware of the 2D weight layout compute abs-max over 128 x 128 blocks.

MoE-Specific FP8/FP4 Challenges

Padding alignment: FP8 GEMMs require alignment to 16 for per-tensor and blockwise FP8, or 32 for MXFP8 and NVFP4. Because token dimension varies dynamically, the paper describes two solutions: routing-map padding and fusing padding directly into permutation.

Grouped quantization: multiple expert input tensors are quantized in a single fused kernel, and the implementation is CUDA-Graphable.

NVFP4 quantization fusion: RHT is fused with quantization to avoid extra BF16 traffic. Hadamard is still computed twice, once for amax and once inside the fused quantization path, but this remains faster than materializing a full BF16 buffer. Stochastic rounding uses cuRANDDx for in-kernel random-number generation.

8. Long-Context MoE Training

The Computational Shift

At 64K tokens, SDPA consumes 69% of FLOPs, versus roughly 10-15% at short sequence lengths. SDPA scales as $O(s^2)$ while MoE scales as $O(s)$, so attention becomes the dominant cost.

Key recommendation: do not recompute core attention at long sequence lengths. At 64K, SDPA recomputation adds about 18% compute overhead but saves only 9 GB of memory. Recomputing non-SDPA components instead saves 89.8 GB with lower performance impact.

CP vs TP Trade-offs

P2P CP: preferred across nodes because ring-style KV exchange overlaps naturally with SDPA
All-to-all CP: converts sequence-sharded layouts to head-sharded layouts before SDPA
TP: preferred within nodes for sharding linear weights

Practical guideline: all-to-all CP + TP inside nodes; P2P CP across nodes.

Dynamic Context Parallelism

For variable-length sequences, the system selects CP degree per microbatch based on actual sequence lengths. Multiple CP groups are pre-constructed per rank during initialization. The per-token loss is:

\[\mathcal{L} = \frac{\sum_{t \in \mathcal{V}} \ell_t}{|\mathcal{V}|}\]

where $\mathcal{V}$ is the set of valid non-padding tokens.

9. Production Features

Load Balancing Strategies

Three approaches are discussed:

Auxiliary loss: gradient-based, differentiable, soft balance
Sinkhorn: assignment-based, non-differentiable, hard balance; iterates row and column normalization to convergence
Aux-loss-free / Expert Bias: feedback-based, non-differentiable, adaptive; updates expert bias based on token-count feedback

Latent MoE

Latent MoE inserts a shared down-projection before expert dispatch and an up-projection after combine:

\[\text{output}(\mathbf{x}) = \mathbf{W}_\uparrow \cdot \left(\sum_{i \in \mathcal{T}_{K,E}} p_i E_i(\mathbf{W}_\downarrow \cdot \mathbf{x}; \ell)\right) + \sum_j E_j^\text{shared}(\mathbf{x}; d)\]

The compression ratio $\alpha = d / \ell$ reduces both all-to-all volume and per-expert weight size by a factor of $\alpha$.

Flexible Asymmetric VPP

Allows different numbers and types of layers per virtual pipeline stage. For DeepSeek-V3 with 61 decoder layers plus 1 MTP layer, PP=16, and VPP=2, the first stage holds the embedding plus 3 dense decoder layers to match the cost of 2 MoE layers, while the MTP layer sits in its own standalone stage and the loss is separated.

Upcycling

Converts a dense checkpoint into MoE via virtual-group initialization: shard MLP weights in the intermediate dimension (4h -> 2h), duplicate the shards, then initialize half the router weights and duplicate them. This guarantees Top-2 routing initially selects one expert from each shard pair, so the MoE output exactly matches the dense model at the start of training.

Multi-Token Prediction (MTP)

MTP optimizes the model to predict multiple consecutive future tokens at each position, densifying the supervision signal. Unlike parallel independent predictions, it preserves causal dependencies between predictions through hidden-state transitions, which improves convergence and generation quality. During inference, the model falls back to ordinary single-token prediction for compatibility.

Flexible pipeline parallelism allows MTP layers to be placed strategically within the VPP layout. In the DeepSeek-V3 example with PP=16 and VPP=2, the MTP layer is placed in a standalone pipeline stage on PP rank 14 to balance the workload.

Muon Optimizer

Unlike AdamW, which performs element-wise updates, Muon applies a matrix-aware optimization by orthogonalizing entire weight matrices. The production integration provides:

Split QKV support: efficient orthogonalization even when attention projection matrices are stored as separate Q, K, and V tensors
Distributed optimizer integration: optimizer states are sharded across data-parallel ranks while preserving correct orthogonalization semantics
CPU offloading: Muon’s orthogonalization buffers can be offloaded when GPU memory is tight

MuonClip addresses a separate stability problem in trillion-parameter training, where query-key dot products can grow without bound and trigger attention explosions. The paper notes hardware-accelerated implementations in cuDNN, cudnn-frontend, and Transformer Engine.

10. Performance Results

Model	System	#GPUs	Dtype	TFLOPS/GPU	Tokens/s/GPU
DeepSeek-V3	GB300	256	MXFP8	1,233	4,730
DeepSeek-V3	GB200	256	MXFP8	1,048	4,020
DeepSeek-V3	GB200	256	BF16	857	3,298
DeepSeek-V3	H100	1,024	FP8-BLK	368	1,412
Qwen3-235B	GB300	256	MXFP8	974	6,583
Qwen3-235B	GB200	256	MXFP8	919	6,212
Qwen3-235B	GB300	128	MXFP8 (131K seq)	1,150	1,556

GB200 and GB300 deliver roughly 3x higher token throughput than H100. Long-context DeepSeek-V3 at 256K tokens still reaches 88% of short-context MFU.

11. Case Study: DeepSeek-V3 GB200 vs H100

Config	GB200 (256 GPUs)	H100 (1,024 GPUs)
TP/PP/EP	1/4/64	2/8/64
Precision	MXFP8	FP8-Blockwise
Dispatcher	HybridEP	DeepEP
Recompute	mlp only	mlp, mla_up_proj, moe_act, layernorm
CUDA Graphs	Enabled	-
EP Overlap	-	Enabled
Performance	1,048 TFLOPS/GPU	368 TFLOPS/GPU

Key insight: the same model requires fundamentally different strategies on different hardware. GB200’s 192 GB memory and NVL72 topology largely eliminate the communication wall, shifting the bottleneck toward CPU overhead and making CUDA Graphs essential. H100’s 80 GB memory and NVL8 topology instead make cross-node communication the main bottleneck, so EP overlap becomes essential, and FP8 is what frees enough memory to hold the extra overlap buffers.

12. Systematic Optimization Workflow

A three-phase workflow emerges from tuning Mixtral, DeepSeek-V3, and Qwen3 across GB200 and H100. The process is inherently iterative: solving one bottleneck often exposes the next.

Phase 1: Establish Memory-Feasible Parallelism

Memory feasibility is the first hard constraint. The paper explicitly frames the impact of each parallelism strategy on per-GPU memory:

Strategy	Peak Activation	Weight Memory	Optimizer States	Comm (Per-Layer)
TP	$1/d$ (with SP)	$1/d$	$1/d$	High
EP	~1 (load-dependent)	$1/d$ (MoE only)	$1/d$	Medium
PP	1 (>1 with VPP)	$1/d$	$1/d$	Medium
CP	$1/d$	1	$1/d$*	Medium
DP	1	1	$1/d$*	Low

* Requires distributed optimizer.

Practical tip: use --fake-init-process-group to emulate distributed training on a single GPU for rapid iteration on parallelism configurations without allocating a full cluster.

Phase 2: Select Optimal Parallelism Strategy

Five guidelines:

Minimize model parallelism, maximize data parallelism: keep TP, EP, PP, and CP as small as possible while still avoiding OOM. Use the distributed optimizer (--use-distributed-optimizer) to shard optimizer states across DP ranks.
Keep EP and TP within the NVLink domain: make sure EP x TP fits inside the local NVLink island, typically 8 GPUs per node or 72 GPUs for NVL72. When scaling beyond that, prefer PP over stretching TP or EP across nodes.
Use pipeline parallelism for multi-node scaling: PP distributes layers across nodes. Enable VPP to reduce pipeline bubbles when PP >= 2, and balance work across VPP ranks.
Prefer EP over TP for expert layers: EP yields larger local GEMMs, lower communication overhead, simpler computation graphs, and eliminates local token permutation when EP = num_experts. The paper gives a concrete example: Mixtral-8x7B with EP8 x TP1 outperforms EP4 x TP2.
Enable context parallelism for long sequences: use CP when sequence length is at least about 8K tokens. For sequences shorter than about 4K, the CP overhead can exceed the benefit.

Phase 3: Profile and Optimize Bottlenecks

Diagnose which wall dominates, then apply targeted fixes.

Memory bottleneck: symptom is forced full recomputation or overly aggressive parallelism just to avoid OOM.

Optimization	Overhead	Config Flag
FP8 Training	Low	`--fp8-format --fp8-recipe`
Selective Recomputation	Low	`--recompute-granularity --recompute-modules`
Precision-Aware Optimizer	Low	`--use-precision-aware-optimizer`
Activation Offloading	Medium	`--fine-grained-activation-offloading --offload-modules`
Optimizer Offloading	Medium	`--offload-optimizer-states`

Communication bottleneck: symptom is a profile dominated by collectives.

Communication Type	Config Flag
DP gradient reduce/param gather	`--overlap-grad-reduce --overlap-param-gather`
TP communication	`--tp-comm-overlap`
EP dispatcher	`--moe-token-dispatcher-type`
EP all-to-all hiding	`--overlap-moe-expert-parallel-comm`
PP send/recv	`--pipeline-model-parallel-layout`

CPU overhead bottleneck: symptom is Nsight Systems showing gaps between GPU kernels.

Optimization	Config Flag
Disable Python GC	`--manual-gc --manual-gc-interval 10`
Reduce kernel launches	Decrease TP or increase MBS
Enable CUDA Graphs	`--cuda-graph-impl transformer_engine`

Computation bottleneck: symptom is low SM utilization even after communication and CPU gaps are under control.

Optimization	Config Flag
Grouped GEMM	`--moe-grouped-gemm`
Kernel fusions	`--moe-router-fusion --moe-permute-fusion`
FP8 precision	`--fp8-format --fp8-recipe`

Key insight: the same model on different hardware needs different optimization priorities. On NVL8, where EP crosses nodes, the Communication Wall dominates and can consume 30-50% of step time in all-to-all. On NVL72, where EP stays inside NVLink, enabling FP8 often shifts the bottleneck to CPU overhead instead.

Iterative Nature

The ordering matters: memory in Phase 1, then parallelism in Phase 2, then profiling in Phase 3. But the process is cyclical. Memory optimizations may enable smaller parallelism degrees, which pushes you back to Phase 1. Phase 3 optimizations such as EP communication overlap and CUDA Graphs also consume memory, which can force you to revisit earlier choices.

13. RL Post-Training Insights

Router Replay: logs expert assignments during inference and replays them during training for more stable optimization
Packing-aware dynamic batch size: keeps total effective tokens per batch more consistent
Attention cost metric: sorts microbatches by $\sum (\text{seq_len})^2$ in serpentine order to reduce synchronization bubbles
Dynamic CP: selects CP degree per microbatch rather than provisioning for the worst-case sequence mix

14. Conclusion

MoE sparsity introduces two fundamental challenges:

Parameter-compute mismatch, which creates the Three Walls of memory, communication, and compute efficiency
Dense-sparse mismatch, which requires decoupled parallelism between attention and expert layers

Key contributions summarized:

Contribution	Impact
Parallel Folding	Breaks `EP <= DP`; enables flexible parallelism mapping
Memory Optimization	199.5 GB to under 80 GB per GPU for DeepSeek-V3
Communication Optimization	All-to-all shifts from foreground bottleneck to mostly background work
Compute Efficiency	Grouped GEMM plus CUDA Graphs plus sync-free execution
FP8/FP4 Training	Cross-cutting improvements across all three walls
Long-Context Training	Keeps sub-sequence lengths manageable through CP and TP scaling
RL Support	Packed sequences, Dynamic-CP, and router replay

Final performance: DeepSeek-V3 reaches 1,233 / 1,048 TFLOPS/GPU on GB300 and GB200 with 256 GPUs, versus 368 TFLOPS/GPU on H100 with 1,024 GPUs. GB200 and GB300 deliver about 3x higher token throughput than H100.

The broader takeaway is that Megatron-Core MoE is a full-stack systems response to these two mismatches. Memory savings from FP8, recomputation, and offloading are not isolated wins; they enable communication overlap, better parallelism choices, and eventually CUDA Graph coverage. Large-scale MoE training becomes feasible only when routing, parallelism, kernels, optimizer states, and long-context execution are all tuned together.