MoE Parallel Folding
Notes from reading the following paper:
The paper targets scalability bottlenecks in training large-scale Mixture of Experts (MoE) models on massive GPU clusters (up to 1,024 H100 GPUs): the traditional coupling of parallelism strategies between Attention layers and MoE layers is suboptimal because the two have distinct computational and communication patterns.
MoE Parallel Folding: a technique that decouples the parallelism mappings of Attention and MoE components. This allows for heterogeneous configurations—specifically “folding” communication-intensive expert parallelism into high-bandwidth intra-node networks (NVLink)—without constraining the parallelism of the Attention layers. The method is implemented in Megatron-Core and achieves state-of-the-art Model FLOPs Utilization (MFU), specifically 49.3% for Mixtral 8x22B.
Problem: Coupled Parallelism
Existing hybrid parallelism strategies (like 3D parallelism) typically nest Expert Parallelism (EP) within Data Parallelism (DP) or couple it rigidly with Tensor Parallelism (TP). This creates two main issues:
- Scalability Cap: The maximum degree of EP is bounded by the DP degree. As batch sizes shrink per GPU at scale, the available parallelism for experts diminishes.
- Resource Mismatch: Attention layers are dense and sequence-level, favoring Tensor Parallelism (TP) and Context Parallelism (CP). MoE layers are sparse and token-level, favoring EP. Forcing them to share dimensions results in suboptimal communication overhead, often forcing heavy All-to-All operations over slow inter-node interconnects.
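A back-of-the-envelope illustration of this cap (hypothetical numbers, not from the paper): once the Attention layers fix TP, CP, and PP, the coupled scheme leaves only the remaining DP dimension for experts.

```python
# Hypothetical sizing: in a coupled scheme, EP is carved out of DP,
# so EP can never exceed DP = world_size / (TP * CP * PP).
world_size = 1024                    # GPUs
tp, cp, pp = 8, 2, 8                 # chosen for the dense Attention layers
dp = world_size // (tp * cp * pp)    # = 8
max_ep = dp                          # EP nested in DP -> EP <= 8
print(f"DP = {dp}, so coupled EP is capped at {max_ep}")
# A model with 64 experts would then have to pack 8 experts per EP rank,
# even if spreading them wider would balance memory and compute better.
```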
Solution: MoE Parallel Folding
1 The Concept
The framework splits the model's parallel mapping into two distinct 4D groups that can be configured independently, provided the Pipeline Parallelism (PP) degree stays the same in both.
- Attention Group: $TP \times CP \times DP \times PP$
- MoE Group: $TP_{expert} \times EP \times DP_{expert} \times PP$ (referred to as ETP, EP, EDP, and PP)
“Folding” Explained: The key insight is to “fold” the communication-heavy dimensions of the MoE layer (specifically EP) into the high-bandwidth intra-node domain (NVLink). For example, one can utilize a high degree of TP for Attention layers (inter-node) while simultaneously using pure EP for MoE layers folded entirely within a node, decoupled from the Attention mapping.
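A minimal sketch (my own illustration, not Megatron-Core code) of the consistency rule the two 4D mappings must satisfy: both factorizations cover the same world size and share the PP degree, while every other dimension is chosen independently.

```python
from dataclasses import dataclass

@dataclass
class ParallelConfig:
    # Attention mapping: TP x CP x DP x PP
    tp: int
    cp: int
    dp: int
    # MoE mapping: ETP x EP x EDP x PP
    etp: int
    ep: int
    edp: int
    # PP is shared by both mappings
    pp: int

    def validate(self, world_size: int) -> None:
        attn = self.tp * self.cp * self.dp * self.pp
        moe = self.etp * self.ep * self.edp * self.pp
        assert attn == world_size, f"Attention mapping covers {attn} != {world_size} GPUs"
        assert moe == world_size, f"MoE mapping covers {moe} != {world_size} GPUs"

# Example: TP8 x CP1 x DP2 x PP4 for Attention, while the MoE layers use
# pure EP8 folded inside a node (ETP1 x EP8 x EDP2 x PP4).
cfg = ParallelConfig(tp=8, cp=1, dp=2, etp=1, ep=8, edp=2, pp=4)
cfg.validate(world_size=64)
```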
2 Understanding Expert Duplication (ETP/EP/EDP)
A unique aspect of this framework (visualized in the paper’s Figure 1) is the potential duplication of experts across groups.
- The Configuration: In a setup like ETP2-EP1 on 4 GPUs, the system utilizes Expert Data Parallelism (EDP).
- Why Duplication Occurs: If the product of ETP and EP is smaller than the total number of GPUs, the remaining degrees of parallelism are assigned to EDP. This effectively replicates the experts (similar to standard Data Parallelism) to process different batches of tokens.
- Efficiency Gains: This duplication allows the system to confine All-to-All token dispatching to smaller, local groups (e.g., inside a node) rather than broadcasting globally. The duplicated groups (EDP) do not exchange tokens; they only synchronize gradients, which significantly reduces token routing traffic over slower networks.
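To make the duplication concrete, here is a small illustration (the rank ordering is an assumption of mine, not necessarily Megatron-Core's) of how 4 GPUs decompose under ETP2-EP1: the leftover factor becomes EDP, i.e. two replicas of the same expert shards that synchronize gradients but never exchange tokens.

```python
def expert_rank_coords(world_size: int, etp: int, ep: int):
    """Map each GPU rank to (etp_rank, ep_rank, edp_rank), assuming an
    innermost-to-outermost ordering of ETP -> EP -> EDP (illustrative only)."""
    edp = world_size // (etp * ep)    # leftover parallelism becomes replication
    coords = {}
    for rank in range(world_size):
        coords[rank] = (
            rank % etp,               # ETP: shard of one expert's weights
            (rank // etp) % ep,       # EP: which expert(s) this rank hosts
            rank // (etp * ep),       # EDP: replica index (no token exchange)
        )
    return edp, coords

edp, coords = expert_rank_coords(world_size=4, etp=2, ep=1)
print(f"EDP degree = {edp}")          # -> 2 duplicated expert groups
for rank, (t, e, d) in coords.items():
    print(f"rank {rank}: etp={t} ep={e} edp={d}")
```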
3 Visualizing the Transition (Figure 9 Analysis)
The transition between Attention and MoE layers is zero-overhead and involves no network communication, only a local reshape.
- Attention Layer (Coupled): Processed as sequences. Heavy use of TP and CP requires “Ring Exchange” and “AllGather/ReduceScatter” to handle sequence context and large weight matrices.
- The Reshape: The system flattens the structured sequences (split by CP/TP) into a “batch of tokens.” The MoE layer ignores sequence order.
- MoE Layer (Decoupled): The parallelism shifts to pure EP (e.g., EP8). Each GPU hosts a distinct expert. Tokens are permuted and routed globally (All-to-All) across all EP ranks to their destination experts, processed, and then returned.
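The reshape itself fits in a few lines of PyTorch: it is a local view change from a per-rank sequence chunk to a flat bag of tokens, so no collective is involved (shapes below are illustrative).

```python
import torch

# Local activation shard after the Attention layer: a CP/TP-split sequence chunk.
seq_chunk, micro_batch, hidden = 512, 2, 4096   # illustrative sizes
attn_out = torch.randn(seq_chunk, micro_batch, hidden)

# The MoE layer ignores sequence order, so the shard is flattened into a
# "bag of tokens" with no network communication -- a purely local reshape.
tokens = attn_out.reshape(-1, hidden)           # [seq_chunk * micro_batch, hidden]
assert tokens.shape == (seq_chunk * micro_batch, hidden)

# After the MoE layer, the inverse reshape restores the sequence layout.
restored = tokens.reshape(seq_chunk, micro_batch, hidden)
assert torch.equal(restored, attn_out)
```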
The Enabler: Flexible Token-Level Dispatcher
To support this heterogeneity, a flexible token-level dispatcher handles the transition between the distinct parallel mappings of the Attention and MoE layers.
1 Dispatcher Workflow
The dispatcher treats inputs as a bag of tokens, executing the following flow:
- Router: Determines token mapping and permutes tokens to contiguous memory.
- All-to-All-V (EP Group): Exchanges tokens to the device hosting the specific expert.
- AllGather-V (ETP Group): Only if ETP > 1. Ensures all members of an Expert-TP group share activations.
- Computation: Local GPU computes the expert Feed-Forward Network (FFN).
- ReduceScatter-V (ETP Group): Only if ETP > 1. Aggregates output hidden states.
- All-to-All-V: Returns tokens to origin devices.
- Un-permute: Restores sequence order for the subsequent Attention layer.
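Below is a minimal single-process sketch of the permute / expert-compute / un-permute logic around which these collectives are wrapped; the points where the real dispatcher would issue All-to-All-V (and, for ETP > 1, AllGather-V / ReduceScatter-V) are marked in comments, and all names are my own.

```python
import torch

def dispatch_and_combine(tokens, top1_expert, num_experts, expert_ffns):
    """Single-process sketch: permute tokens by destination expert, run the
    expert FFNs, then un-permute back into the original order. In the
    distributed dispatcher, the permuted chunks and their per-expert counts
    are exchanged with All-to-All-V over the EP group (plus AllGather-V /
    ReduceScatter-V over the ETP group when ETP > 1) between these steps."""
    # Router output: one expert id per token (Top-1 for brevity).
    perm = torch.argsort(top1_expert)                 # group tokens by expert
    permuted = tokens[perm]                           # contiguous per-expert chunks
    counts = torch.bincount(top1_expert, minlength=num_experts)  # A2A-V split sizes

    # <-- All-to-All-V (EP group): send each chunk to the rank hosting that expert.
    outputs, start = [], 0
    for e, n in enumerate(counts.tolist()):
        chunk = permuted[start:start + n]
        outputs.append(expert_ffns[e](chunk))         # local expert FFN compute
        start += n
    out = torch.cat(outputs, dim=0)
    # <-- All-to-All-V (EP group): return processed tokens to their origin ranks.

    unpermuted = torch.empty_like(out)
    unpermuted[perm] = out                            # restore original token order
    return unpermuted

# Toy usage: 8 tokens, hidden size 16, 4 experts.
hidden, num_experts = 16, 4
tokens = torch.randn(8, hidden)
top1 = torch.randint(0, num_experts, (8,))
ffns = [torch.nn.Linear(hidden, hidden) for _ in range(num_experts)]
out = dispatch_and_combine(tokens, top1, num_experts, ffns)
assert out.shape == tokens.shape
```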
2 Support for Token-Dropless Training
The dispatcher supports token-dropless training (processing all tokens regardless of load balance) through dynamic tensor shapes.
- Mechanism: It eliminates sequence dependencies and does not enforce a capacity factor. The All-to-All-V operations dynamically expand or contract based on the router's Top-K selection.
- Validation: Numerical correctness was validated against baselines on Mixtral 8x7B, proving the dispatcher correctly handles variable loads without dropping data.
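A sketch of how a dropless dispatcher can derive its variable All-to-All-V split sizes each step directly from the router's Top-K selection, with no capacity factor clamping the counts (names, shapes, and the contiguous expert-to-rank layout are assumptions for illustration).

```python
import torch

def dropless_split_sizes(router_logits, top_k, ep_size, num_local_experts):
    """Compute per-destination-rank token counts for All-to-All-V.
    Every routed (token, expert) pair is kept, so the splits change from
    step to step instead of being fixed by a capacity factor."""
    # Top-K expert ids per token; each token contributes top_k routed copies.
    _, expert_ids = torch.topk(router_logits, k=top_k, dim=-1)   # [tokens, top_k]
    counts_per_expert = torch.bincount(
        expert_ids.flatten(), minlength=ep_size * num_local_experts)
    # Assume experts are laid out contiguously per EP rank, so summing
    # consecutive groups of `num_local_experts` gives the tokens per rank.
    return counts_per_expert.view(ep_size, num_local_experts).sum(dim=-1)

# Toy usage: 1024 tokens, 8 experts over EP4 (2 local experts per rank), Top-2.
logits = torch.randn(1024, 8)
splits = dropless_split_sizes(logits, top_k=2, ep_size=4, num_local_experts=2)
print(splits, splits.sum().item())   # sums to 1024 * 2: nothing is dropped
```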
3 Dropping Strategy
When token dropping is used, the system defaults to sub-sequence dropping. It makes dropping decisions based only on the local sub-sequence logits, avoiding the communication overhead of gathering logits across ranks (full-sequence dropping) while maintaining convergence.
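A compact illustration of the idea, under assumed semantics: each rank ranks only its local sub-sequence's tokens by router probability and truncates to a per-expert capacity, with no cross-rank gathering of logits.

```python
import torch

def subsequence_drop(local_probs, local_expert_ids, capacity_per_expert):
    """Keep at most `capacity_per_expert` tokens per expert, chosen by the
    router probabilities of the *local* sub-sequence only (no AllGather of
    logits across ranks). Returns a boolean keep-mask over local tokens."""
    keep = torch.zeros_like(local_expert_ids, dtype=torch.bool)
    for e in local_expert_ids.unique():
        idx = (local_expert_ids == e).nonzero(as_tuple=True)[0]
        # Rank this expert's local tokens by probability and keep the top ones.
        top = idx[local_probs[idx].argsort(descending=True)[:capacity_per_expert]]
        keep[top] = True
    return keep

# Toy usage: 16 local tokens routed (Top-1) among 4 experts, capacity 3.
probs = torch.rand(16)
expert_ids = torch.randint(0, 4, (16,))
mask = subsequence_drop(probs, expert_ids, capacity_per_expert=3)
print(mask.sum().item(), "tokens kept locally")
```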
Insights
1. Fine-Grained vs. Coarse-Grained MoE: Fine-grained MoEs (e.g., Qwen2-57B or re-parameterized Mixtral G8T8) suffer a performance penalty compared to coarse-grained models.
- Why: They activate more experts per token and have smaller hidden sizes per expert, reducing GEMM efficiency and increasing communication volume.
- Result: Mixtral 8x22B (Coarse) reached 49.3% MFU, while Mixtral-G8T8 (Fine) only reached 28.8% MFU.
2. ETP vs. EP: Tensor Parallelism within the Expert layer (ETP) introduces significantly higher communication overhead than standard Expert Parallelism (EP). The optimal strategy is generally to minimize ETP and maximize EP, keeping EP groups compact (intra-node) whenever possible.
3. Context Scaling: The framework scales to 128K context length. Without parallel folding, EP groups often span across context parallelism groups, forcing All-to-All communications over lower-bandwidth inter-node fabrics. Folding allows these to remain within high-bandwidth domains.
Performance Benchmarks
Experiments were conducted on NVIDIA H100 clusters (Eos).
- Comparison: The method outperforms FSDP, FSDP+EP, and standard Megatron-Core (MCore).
- Mixtral 8x22B: 49.3% MFU (vs. 46.3% MCore baseline).
- Qwen2-57B-A14B: 39.0% MFU (vs. 35.3% MCore baseline).
- Scaling: Strong scaling is demonstrated up to 1,024 GPUs. For Llama3-8x70B, MFU dropped only from 43.7% to 41.5% when scaling from 128 to 1024 GPUs.
- FP8 Acceleration: Utilizing FP8 precision with delayed scaling provided a further 1.30x speedup over BF16, achieving 631.7 TFLOPS.