Notes on Reading Hunyuan Model
Mixed Routing Strategy
- Hunyuan-Large employs a mixed routing strategy within its MoE architecture.
- Combines a “shared expert” that processes all input tokens with multiple “specialized experts” for specific token types.
- The shared expert captures general knowledge applicable to all tokens.
- Specialized experts learn domain-specific knowledge; a top-k routing mechanism sends each token to its most relevant expert(s) (see the sketch after this list).
- This strategy leverages both broad and specialized knowledge, enhancing overall capabilities.
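
A minimal PyTorch sketch of the mixed routing idea, assuming one always-active shared expert plus a pool of specialized experts selected by top-k gating; the class name, expert sizes, and the expert/top-k counts are placeholders rather than the released model's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedRoutingMoE(nn.Module):
    """Sketch: a shared expert applied to every token plus top-k routed specialized experts."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int = 16, top_k: int = 1):
        super().__init__()
        self.shared_expert = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
             for _ in range(num_experts)]
        )
        self.router = nn.Linear(d_model, num_experts)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        shared_out = self.shared_expert(x)          # every token goes through the shared expert
        scores = F.softmax(self.router(x), dim=-1)  # per-token routing probabilities
        top_scores, top_idx = scores.topk(self.top_k, dim=-1)
        routed_out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, k] == e           # tokens whose k-th choice is expert e
                if mask.any():
                    routed_out[mask] += top_scores[mask, k:k+1] * expert(x[mask])
        return shared_out + routed_out              # general + specialized knowledge
```

The shared path gives every token a general-purpose transformation, while the routed path adds a specialized one weighted by its router score.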
 
Expert Choice vs. Recycle Routing
- Hunyuan-Large uses a novel “recycle routing” strategy to address token dropping in traditional top-k routing.
- In top-k routing, tokens are assigned to their top-k scoring experts under a capacity limit; tokens that exceed an expert’s capacity are discarded, causing information loss.
- Recycle routing instead randomly re-assigns the dropped tokens to other experts that still have available capacity (see the sketch after this list).
- This enhances the model’s ability to retain and utilize crucial information, improving training efficiency and effectiveness.
- The effect is similar to expert-choice routing’s load balancing: tokens end up spread across experts in a more balanced way.
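
A small sketch of how recycle routing could work on top of capacity-limited top-1 assignments; the function name and the exact random re-assignment policy are assumptions for illustration, not the paper's implementation.

```python
import torch

def recycle_route(top1_expert: torch.Tensor, num_experts: int, capacity: int) -> torch.Tensor:
    """Assign each token to its top-1 expert; tokens exceeding an expert's capacity are
    recycled to a randomly chosen expert with spare room instead of being dropped.

    top1_expert: (num_tokens,) index of the highest-scoring expert for each token.
    Returns the final expert assignment per token (-1 only if every expert is full).
    """
    assignment = torch.full_like(top1_expert, -1)
    load = torch.zeros(num_experts, dtype=torch.long)
    dropped = []

    for t, e in enumerate(top1_expert.tolist()):
        if load[e] < capacity:
            assignment[t] = e
            load[e] += 1
        else:
            dropped.append(t)  # would simply be discarded under vanilla top-k routing

    for t in dropped:  # recycle: randomly re-route to experts that still have capacity
        free = (load < capacity).nonzero(as_tuple=True)[0]
        if len(free) == 0:
            break  # no capacity left anywhere; the token really is dropped
        e = free[torch.randint(len(free), (1,))].item()
        assignment[t] = e
        load[e] += 1
    return assignment
```

For example, with two experts, a per-expert capacity of 2, and four tokens all preferring expert 0, two of those tokens would be recycled to expert 1 instead of being thrown away.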
 
Training
Dropout Adjustment
- Dropout is used during the supervised fine-tuning (SFT) stage.
- Attention dropout of 0.1 and hidden dropout of 0.2 are applied (see the sketch after this list).
- Helps prevent overfitting by randomly dropping out neurons during training.
- The MoE architecture benefits more from suitable dropout rates than dense models do.
- Careful tuning of the dropout parameters is important for optimizing MoE model performance.
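
A rough sketch of where those two dropout rates could sit inside a transformer block during SFT, using standard PyTorch modules; the block layout and dimensions are illustrative, not Hunyuan-Large's actual code.

```python
import torch
import torch.nn as nn

class BlockWithSFTDropout(nn.Module):
    """Pre-norm transformer block with attention dropout 0.1 and hidden dropout 0.2."""

    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        # nn.MultiheadAttention applies its dropout to the attention probabilities.
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=0.1, batch_first=True)
        self.hidden_dropout = nn.Dropout(0.2)  # applied to hidden states after each sub-layer
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)
        x = x + self.hidden_dropout(attn_out)
        x = x + self.hidden_dropout(self.ffn(self.norm2(x)))
        return x
```

Both dropouts are active only in `train()` mode, so they regularize SFT without affecting inference.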
 
Exponential Moving Average (EMA)
- EMA is used during the reinforcement learning from human feedback (RLHF) phase.
- Maintains a moving average of the model weights, updated at each training step (see the sketch after this list).
- Provides a more stable model that is less prone to oscillations.
- Mitigates “reward hacking” and reduces the “alignment tax.”
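
A minimal sketch of a weight-EMA helper of the kind described, assuming a fixed decay applied after every optimizer step; the class name and decay value are placeholders.

```python
import torch

class EMAWeights:
    """Keep an exponential moving average of a model's weights during RLHF-style training."""

    def __init__(self, model: torch.nn.Module, decay: float = 0.999):
        self.decay = decay
        self.shadow = {k: v.detach().clone() for k, v in model.state_dict().items()}

    @torch.no_grad()
    def update(self, model: torch.nn.Module) -> None:
        # Called once per training step: shadow <- decay * shadow + (1 - decay) * current
        for k, v in model.state_dict().items():
            if v.dtype.is_floating_point:
                self.shadow[k].mul_(self.decay).add_(v, alpha=1.0 - self.decay)
            else:
                self.shadow[k].copy_(v)  # integer buffers are copied as-is

    def copy_to(self, model: torch.nn.Module) -> None:
        # Load the smoothed weights, e.g. for evaluation or as the released checkpoint.
        model.load_state_dict(self.shadow)
```

Because the shadow weights change slowly, they provide a more stable model, which is the property the notes credit for mitigating reward hacking and the alignment tax.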
 
Learning Rate
- Three-Phase Learning Rate Schedule (see the schedule sketch after this list):
    - Warm-up phase: gradually increases the learning rate to a peak value.
    - Gradual decay phase: slowly decreases the learning rate over time.
    - Annealing phase: significantly reduces the learning rate for the final 5% of pre-training tokens.
- Expert-Specific Learning Rate Scaling:
    - Applies different learning rates to the shared and routed experts within the MoE architecture.
    - Adjusts for the varying numbers of tokens processed by each expert.
    - Enhances training efficiency.
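
A hedged sketch of a three-phase schedule matching the description above (linear warm-up, gradual decay, then a sharp drop over the final 5% of training); only the 5% annealing window comes from the notes, so the warm-up fraction, decay shape, and floor ratio are assumptions.

```python
import math

def three_phase_lr(step: int, total_steps: int, peak_lr: float,
                   warmup_frac: float = 0.01, anneal_frac: float = 0.05,
                   min_lr_ratio: float = 0.1) -> float:
    """Warm-up -> gradual decay -> annealing, returned as the LR for a given step."""
    warmup_steps = int(total_steps * warmup_frac)
    anneal_start = int(total_steps * (1.0 - anneal_frac))  # final 5% of training

    if step < warmup_steps:
        # Phase 1: linear warm-up to the peak learning rate.
        return peak_lr * (step + 1) / warmup_steps
    if step < anneal_start:
        # Phase 2: gradual (cosine) decay from the peak toward min_lr_ratio * peak_lr.
        progress = (step - warmup_steps) / max(1, anneal_start - warmup_steps)
        return peak_lr * (min_lr_ratio + (1 - min_lr_ratio) * 0.5 * (1 + math.cos(math.pi * progress)))
    # Phase 3: annealing, a much faster linear drop over the remaining steps.
    progress = (step - anneal_start) / max(1, total_steps - anneal_start)
    return peak_lr * min_lr_ratio * (1.0 - progress)
```

Expert-specific scaling would then multiply this base rate by a per-expert factor reflecting how many tokens each expert actually processes.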
 
 
Two-Stage Long-Context Pre-Training
- Stage 1: gradually extends the context length from 32K to 256K tokens.
- Stage 2: fine-tunes on a corpus of 10 billion tokens for long-context understanding.
- Scaling of RoPE (see the sketch after this list):
    - Uses Rotary Position Embedding (RoPE) during long-context pre-training.
    - The RoPE base frequency is scaled to 1 billion to handle extended sequence lengths effectively.
- Mixed Data for Long-Context Pre-Training:
    - The corpus comprises 25% natural long-context data and 75% normal-length pre-training data.
    - Ensures the model develops specialized long-context skills alongside general language understanding.
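
A small sketch of the RoPE frequency computation with the base scaled to 1e9, assuming the standard RoPE formulation; the function names are mine.

```python
import torch

def rope_inverse_frequencies(head_dim: int, base: float = 1e9) -> torch.Tensor:
    """Inverse frequencies for RoPE; the usual short-context base is 10,000,
    while scaling it to 1e9 stretches the wavelengths for very long sequences."""
    return 1.0 / (base ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim))

def rope_angles(positions: torch.Tensor, head_dim: int, base: float = 1e9) -> torch.Tensor:
    """Rotation angle per (position, frequency) pair, shape (seq_len, head_dim // 2)."""
    return torch.outer(positions.float(), rope_inverse_frequencies(head_dim, base))
```

With the larger base, positions far beyond the original training length still map to distinct, slowly rotating angles, which is why the base is raised for 256K-token contexts.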
 
 
Explorations on MoE Scaling Laws
- Research on scaling laws of MoE models with varying parameters and data sizes.
- Derived formulas for estimating the compute budget and the corresponding compute-optimal model and data sizes.
- These insights were crucial for designing Hunyuan-Large for optimal performance.
 
Inference
Cross-Layer Attention (CLA)
- Allows adjacent layers to share the same KV cache (see the sketch after this list).
- Reuses the key-value pairs generated by the previous layer, reducing the memory footprint.
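
A simplified sketch of the CLA idea, assuming pairs of adjacent layers in which only the first computes K/V projections and the second reuses them (and therefore shares their KV cache); plain multi-head attention is used here for brevity rather than GQA.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CLAAttention(nn.Module):
    """Attention layer that either computes its own K/V or reuses the previous layer's."""

    def __init__(self, d_model: int, n_heads: int, computes_kv: bool):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.o_proj = nn.Linear(d_model, d_model)
        self.computes_kv = computes_kv
        if computes_kv:  # only every other layer owns K/V projections (and a KV cache)
            self.k_proj = nn.Linear(d_model, d_model)
            self.v_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor, shared_kv=None):
        b, s, _ = x.shape
        split = lambda t: t.view(b, s, self.n_heads, self.head_dim).transpose(1, 2)
        q = split(self.q_proj(x))
        if self.computes_kv:
            k, v = split(self.k_proj(x)), split(self.v_proj(x))
        else:
            k, v = shared_kv  # reuse the adjacent layer's keys/values and KV cache
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        out = out.transpose(1, 2).reshape(b, s, -1)
        return self.o_proj(out), (k, v)  # pass (k, v) on so the next layer can reuse it
```

Stacking layers as (computes_kv=True, computes_kv=False) pairs means only half the layers need to store keys and values during decoding.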
 
KV Cache Compression
- Integrates Grouped-Query Attention (GQA) and Cross-Layer Attention (CLA).
- Together, these save almost 95% of KV cache memory, increasing inference efficiency (see the arithmetic sketch below).
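
A back-of-the-envelope check of how GQA plus CLA could reach roughly 95% savings; the head counts below are assumptions for illustration, not figures stated in the notes.

```python
# Plain multi-head attention caches K/V for every query head in every layer.
mha_kv_heads = 80        # assumed number of query heads (and thus MHA KV heads)
gqa_kv_heads = 8         # assumed number of KV heads under grouped-query attention
cla_sharing_factor = 2   # CLA: each pair of adjacent layers shares one KV cache

remaining = (gqa_kv_heads / mha_kv_heads) / cla_sharing_factor
print(f"KV cache kept: {remaining:.0%}, saved: {1 - remaining:.0%}")  # kept 5%, saved 95%
```

Under these assumptions the combined effect leaves only 5% of the original KV cache, matching the "almost 95%" figure.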