Mixed Routing Strategy
- Hunyuan-Large employs a mixed routing strategy within its MoE architecture.
- Combines a “shared expert” for processing all input tokens with multiple “specialized experts” for specific token types.
- The shared expert captures general knowledge applicable to all tokens.
- Specialized experts learn domain-specific knowledge; a top-k routing mechanism sends each token to its most relevant specialized expert(s).
- This strategy leverages both broad and specialized knowledge, enhancing overall capabilities, as sketched below.
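A minimal PyTorch sketch of the mixed routing idea, assuming a simple feed-forward expert design and a softmax router with top-k selection; the module layout, sizes, and `top_k` value here are illustrative assumptions, not Hunyuan-Large's actual configuration:

```python
# Mixed routing sketch: every token passes through a shared expert, and a
# router additionally sends it to its top-k specialized experts.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedRoutingMoE(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, num_experts=16, top_k=1):
        super().__init__()
        make_ffn = lambda: nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.shared_expert = make_ffn()                  # sees every token
        self.experts = nn.ModuleList(make_ffn() for _ in range(num_experts))
        self.router = nn.Linear(d_model, num_experts)
        self.top_k = top_k

    def forward(self, x):                                # x: (num_tokens, d_model)
        out = self.shared_expert(x)                      # general-knowledge path
        scores = F.softmax(self.router(x), dim=-1)       # routing probabilities
        top_scores, top_idx = scores.topk(self.top_k, dim=-1)
        routed = torch.zeros_like(out)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, k] == e                # tokens routed to expert e
                if mask.any():
                    routed[mask] += top_scores[mask, k].unsqueeze(-1) * expert(x[mask])
        return out + routed                              # shared + specialized paths

x = torch.randn(8, 512)
print(MixedRoutingMoE()(x).shape)                        # torch.Size([8, 512])
```

Every token takes the shared-expert path unconditionally; the routed contribution is added on top, weighted by the router score.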
Expert Choice vs. Recycle Routing
- Hunyuan-Large uses a novel “recycle routing” strategy to address token dropping in traditional top-k routing.
- In vanilla top-k routing, each token is assigned to its top-k scoring experts, each of which has a fixed capacity limit; tokens that overflow an expert's capacity are dropped, losing information.
- Recycle routing instead randomly re-assigns dropped tokens to other experts that still have available capacity (sketched below).
- This retains and utilizes information that would otherwise be discarded, improving training efficiency and effectiveness.
- It plays a role similar to expert-choice routing's load balancing, helping ensure that tokens are handled and load is spread more evenly across experts.
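A toy sketch of the recycle idea under top-1 routing: tokens that would be dropped by an expert's capacity limit are randomly re-assigned to experts with spare room. The capacity value and scores below are made up for illustration; this is not the paper's implementation:

```python
import random

def recycle_route(token_scores, capacity):
    """token_scores: list of per-token score lists (one score per expert)."""
    num_experts = len(token_scores[0])
    load = [0] * num_experts
    assignment, overflow = {}, []

    # First pass: standard top-1 assignment with a capacity limit.
    for t, scores in enumerate(token_scores):
        best = max(range(num_experts), key=lambda e: scores[e])
        if load[best] < capacity:
            assignment[t] = best
            load[best] += 1
        else:
            overflow.append(t)           # would be dropped in vanilla top-k routing

    # Second pass: recycle overflowed tokens to random experts with spare room.
    for t in overflow:
        free = [e for e in range(num_experts) if load[e] < capacity]
        if free:                         # if every expert is full, the token is dropped
            e = random.choice(free)
            assignment[t] = e
            load[e] += 1
    return assignment

scores = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.6, 0.4]]
print(recycle_route(scores, capacity=2))   # {0: 0, 1: 0, 2: 1, 3: 1}
```

All four tokens prefer expert 0, but its capacity is 2, so the last two are recycled to expert 1 instead of being dropped.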
Training
Dropout Adjustment
- Dropout is used during the supervised fine-tuning (SFT) stage.
- Attention dropout of 0.1 and hidden dropout of 0.2 are applied.
- Helps prevent overfitting by randomly dropping out neurons during training.
- MoE architectures benefit more from well-chosen dropout rates than dense models do.
- Careful tuning of dropout parameters is therefore important for optimizing MoE model performance (see the example below).
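A minimal sketch of where the reported rates (attention dropout 0.1, hidden dropout 0.2) could be applied in a pre-norm transformer block during SFT; the block layout itself is an assumption, not Hunyuan-Large's code:

```python
import torch
import torch.nn as nn

class SFTBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8,
                 attn_dropout=0.1, hidden_dropout=0.2):
        super().__init__()
        # 0.1 dropout on attention weights, 0.2 on the MLP / hidden path
        self.attn = nn.MultiheadAttention(d_model, n_heads,
                                          dropout=attn_dropout, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model),
                                 nn.Dropout(hidden_dropout))
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):
        h = self.ln1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.ln2(x))

x = torch.randn(2, 16, 512)
print(SFTBlock()(x).shape)   # torch.Size([2, 16, 512])
```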
Exponential Moving Average (EMA)
- EMA is used during the reinforcement learning from human feedback (RLHF) phase.
- Maintains a moving average of model weights, updated at each training step (see the sketch below).
- Provides a stable model less prone to oscillations.
- Mitigates “reward hacking” and reduces “alignment tax.”
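A minimal sketch of an EMA of model weights maintained alongside RLHF-style training; the decay value (0.999 here) is an illustrative assumption, not the paper's setting:

```python
import copy
import torch

class EMA:
    def __init__(self, model, decay=0.999):
        self.decay = decay
        self.shadow = copy.deepcopy(model).eval()      # the averaged, "stable" model
        for p in self.shadow.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model):
        # shadow <- decay * shadow + (1 - decay) * current weights
        for s, p in zip(self.shadow.parameters(), model.parameters()):
            s.mul_(self.decay).add_(p, alpha=1.0 - self.decay)

# Usage: one EMA update per optimizer step; evaluate or deploy ema.shadow,
# which changes more smoothly than the directly optimized policy.
model = torch.nn.Linear(4, 4)
ema = EMA(model)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
for _ in range(10):
    loss = model(torch.randn(8, 4)).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    ema.update(model)
```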
Learning Rate
- Three-Phase Learning Rate Schedule:
- Warm-up Phase: Gradually increases the learning rate to a peak value.
- Gradual Decay Phase: Slowly decreases the learning rate over time.
- Annealing Phase: Significantly reduces the learning rate for the final 5% of pre-training tokens.
- Expert-Specific Learning Rate Scaling:
- Different learning rates for different experts (shared vs. routed experts) within the MoE architecture.
- Adjusts for varying numbers of tokens processed by each expert.
- Enhances training efficiency (a schedule sketch follows this list).
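A sketch of a three-phase schedule with linear warm-up, gradual decay, and a final annealing phase over the last 5% of training, plus an expert-specific scaling helper. The curve shapes, peak learning rate, floor ratio, and the square-root scaling rule are illustrative assumptions, not the paper's exact settings:

```python
def lr_at(step, total_steps, peak_lr=3e-4, warmup_steps=2000,
          anneal_frac=0.05, floor_ratio=0.1):
    anneal_start = int(total_steps * (1.0 - anneal_frac))
    if step < warmup_steps:                      # phase 1: linear warm-up to the peak
        return peak_lr * step / warmup_steps
    if step < anneal_start:                      # phase 2: gradual (linear) decay
        frac = (step - warmup_steps) / (anneal_start - warmup_steps)
        return peak_lr * (1.0 - (1.0 - floor_ratio) * frac)
    frac = (step - anneal_start) / (total_steps - anneal_start)
    return peak_lr * floor_ratio * (1.0 - frac)  # phase 3: anneal over the last 5%

# Expert-specific scaling: a routed expert processes only a fraction of the tokens
# the shared expert processes, so its LR is scaled down. The square-root rule here
# is an assumed illustration, not the paper's derived rule.
def expert_lr(base_lr, tokens_seen_by_expert, tokens_seen_by_shared):
    return base_lr * (tokens_seen_by_expert / tokens_seen_by_shared) ** 0.5

for s in (0, 1_000, 50_000, 96_000, 100_000):
    print(s, lr_at(s, total_steps=100_000))
```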
Two-Stage Long-Context Pre-Training
- Stage 1: Gradually extends the context length from 32K to 256K tokens.
- Stage 2: Fine-tunes on a corpus of 10 billion tokens for long-context understanding.
- Scaling of RoPE:
- Utilizes Rotary Position Embedding (RoPE) during long-context pre-training.
- RoPE base frequency scaled to 1 billion to handle extended sequence lengths effectively (see the sketch after this list).
- Mixed Data for Long-Context Pre-Training:
- Corpus comprises 25% natural long-context data and 75% normal-length pre-training data.
- Ensures the model develops specialized long-context skills while retaining general language understanding.
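A minimal RoPE sketch (split-half variant) showing the role of the base frequency: a larger base, here 1e9 as described above, slows the per-dimension rotation so positions remain distinguishable at much longer context lengths. Dimensions and the demo sequence length are illustrative:

```python
import torch

def rope_cos_sin(seq_len, head_dim, base=1e9):
    # Lower base -> faster rotation; raising the base stretches the usable range.
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    pos = torch.arange(seq_len).float()
    angles = torch.outer(pos, inv_freq)          # (seq_len, head_dim // 2)
    return angles.cos(), angles.sin()

def apply_rope(x, cos, sin):                     # x: (seq_len, head_dim)
    x1, x2 = x[..., : x.shape[-1] // 2], x[..., x.shape[-1] // 2 :]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q = torch.randn(256, 128)                        # short slice just for the demo
cos, sin = rope_cos_sin(seq_len=256, head_dim=128, base=1e9)
print(apply_rope(q, cos, sin).shape)             # torch.Size([256, 128])
```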
Explorations on MoE Scaling Laws
- Research on scaling laws of MoE models with varying parameters and data sizes.
- Derived formulas for estimating the compute budget and the compute-optimal model and data sizes (generic reference forms are shown below).
- These insights guided the design of Hunyuan-Large for optimal performance.
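For reference only, two generic forms commonly used in such analyses; these are standard dense-model and Chinchilla-style expressions, not the MoE-specific formulas derived in the paper:

```latex
% Illustrative, generic forms only; not the paper's fitted MoE formulas.
% Dense training-compute approximation (N = parameters, D = training tokens);
% for MoE models, N is usually taken as the *activated* parameters per token.
\[ C \approx 6\,N\,D \]
% Chinchilla-style loss fit with fitted constants E, A, B, \alpha, \beta:
\[ L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}} \]
```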
Inference
Cross-Layer Attention (CLA)
- Allows adjacent layers to share the same KV cache.
- Reuses the key-value pairs generated by the preceding layer, reducing the KV cache memory footprint (see the sketch below).
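A toy sketch of the KV-sharing pattern: a "KV-producing" layer projects and caches K/V, and the adjacent layer reuses that cache, projecting only queries. The module layout and pairing of every two layers are illustrative assumptions, not Hunyuan-Large's implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CLAAttention(nn.Module):
    def __init__(self, d_model=512, n_heads=8, computes_kv=True):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.o_proj = nn.Linear(d_model, d_model)
        self.computes_kv = computes_kv
        if computes_kv:                     # only "KV-producing" layers project K/V
            self.k_proj = nn.Linear(d_model, d_model)
            self.v_proj = nn.Linear(d_model, d_model)

    def forward(self, x, shared_kv=None):
        B, T, _ = x.shape
        split = lambda t: t.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        q = split(self.q_proj(x))
        if self.computes_kv:
            kv = (split(self.k_proj(x)), split(self.v_proj(x)))
        else:
            kv = shared_kv                  # reuse the adjacent layer's KV cache
        out = F.scaled_dot_product_attention(q, kv[0], kv[1])
        return self.o_proj(out.transpose(1, 2).reshape(B, T, -1)), kv

x = torch.randn(1, 16, 512)
layer0 = CLAAttention(computes_kv=True)
layer1 = CLAAttention(computes_kv=False)    # shares layer0's K/V
y, kv = layer0(x)
z, _ = layer1(y, shared_kv=kv)
print(z.shape)                              # torch.Size([1, 16, 512])
```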
KV Cache Compression
- Integrates Grouped-Query Attention (GQA) and Cross-Layer Attention (CLA).
- Saves almost 95% of KV cache memory relative to standard multi-head attention (MHA), increasing inference efficiency, as illustrated below.
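A back-of-the-envelope estimate of how the GQA and CLA savings compose; the head counts, layer count, sequence length, and grouping factor below are hypothetical, chosen only to land in the same ballpark as the reported ~95%:

```python
# KV cache size for one sequence: K and V tensors, cached once per group of
# `cla_group` adjacent layers (CLA), with `kv_heads` key/value heads (GQA).
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, cla_group=1):
    return 2 * (layers // cla_group) * kv_heads * head_dim * seq_len * bytes_per_elem

mha = kv_cache_bytes(layers=64, kv_heads=64, head_dim=128, seq_len=8192)
gqa_cla = kv_cache_bytes(layers=64, kv_heads=8, head_dim=128, seq_len=8192,
                         cla_group=2)
print(f"saving: {1 - gqa_cla / mha:.1%}")   # ~93.8% with these toy numbers
```

The two reductions multiply: fewer KV heads per layer (GQA) times fewer cached layers (CLA), which is how the combined saving approaches the reported ~95%.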