Mixed Routing Strategy
- Hunyuan-Large employs a mixed routing strategy within its MoE architecture.
- Combines a “shared expert” for processing all input tokens with multiple “specialized experts” for specific token types.
- The shared expert captures general knowledge applicable to all tokens.
- Specialized experts learn domain-specific knowledge; a top-k routing mechanism sends each token to its most relevant specialized expert(s).
- This strategy leverages both broad and specialized knowledge, enhancing overall capabilities, as sketched below.
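A minimal PyTorch sketch of the mixed routing idea, assuming a simple feed-forward expert design and a softmax router with top-k selection; the module layout, sizes, and `top_k` value here are illustrative assumptions, not Hunyuan-Large's actual configuration:

```python
# Mixed routing sketch: every token passes through a shared expert, and a
# router additionally sends it to its top-k specialized experts.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedRoutingMoE(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, num_experts=16, top_k=1):
        super().__init__()
        make_ffn = lambda: nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.shared_expert = make_ffn()                  # sees every token
        self.experts = nn.ModuleList(make_ffn() for _ in range(num_experts))
        self.router = nn.Linear(d_model, num_experts)
        self.top_k = top_k

    def forward(self, x):                                # x: (num_tokens, d_model)
        out = self.shared_expert(x)                      # general-knowledge path
        scores = F.softmax(self.router(x), dim=-1)       # routing probabilities
        top_scores, top_idx = scores.topk(self.top_k, dim=-1)
        routed = torch.zeros_like(out)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, k] == e                # tokens routed to expert e
                if mask.any():
                    routed[mask] += top_scores[mask, k].unsqueeze(-1) * expert(x[mask])
        return out + routed                              # shared + specialized paths

x = torch.randn(8, 512)
print(MixedRoutingMoE()(x).shape)                        # torch.Size([8, 512])
```

Every token takes the shared-expert path unconditionally; the routed contribution is added on top, weighted by the router score.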
Expert Choice vs. Recycle Routing
- Hunyuan-Large uses a novel “recycle routing” strategy to address token dropping in traditional top-k routing.
- In vanilla top-k routing, each token is assigned to its top-k scoring experts, each of which has a fixed capacity limit; tokens that overflow an expert's capacity are dropped, losing information.
- Recycle routing instead randomly re-assigns dropped tokens to other experts that still have available capacity (sketched below).
- This retains and utilizes information that would otherwise be discarded, improving training efficiency and effectiveness.
- It plays a role similar to expert-choice routing's load balancing, helping ensure that tokens are handled and load is spread more evenly across experts.
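A toy sketch of the recycle idea under top-1 routing: tokens that would be dropped by an expert's capacity limit are randomly re-assigned to experts with spare room. The capacity value and scores below are made up for illustration; this is not the paper's implementation:

```python
import random

def recycle_route(token_scores, capacity):
    """token_scores: list of per-token score lists (one score per expert)."""
    num_experts = len(token_scores[0])
    load = [0] * num_experts
    assignment, overflow = {}, []

    # First pass: standard top-1 assignment with a capacity limit.
    for t, scores in enumerate(token_scores):
        best = max(range(num_experts), key=lambda e: scores[e])
        if load[best] < capacity:
            assignment[t] = best
            load[best] += 1
        else:
            overflow.append(t)           # would be dropped in vanilla top-k routing

    # Second pass: recycle overflowed tokens to random experts with spare room.
    for t in overflow:
        free = [e for e in range(num_experts) if load[e] < capacity]
        if free:                         # if every expert is full, the token is dropped
            e = random.choice(free)
            assignment[t] = e
            load[e] += 1
    return assignment

scores = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.6, 0.4]]
print(recycle_route(scores, capacity=2))   # {0: 0, 1: 0, 2: 1, 3: 1}
```

All four tokens prefer expert 0, but its capacity is 2, so the last two are recycled to expert 1 instead of being dropped.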
Training
Dropout Adjustment
- Dropout is used during the supervised fine-tuning (SFT) stage.
- Attention dropout of 0.1 and hidden dropout of 0.2 are applied.
- Helps prevent overfitting by randomly dropping out neurons during training.
- MoE architectures benefit more from well-chosen dropout rates than dense models do.
- Careful tuning of dropout parameters is therefore important for optimizing MoE model performance (see the example below).
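A minimal sketch of where the reported rates (attention dropout 0.1, hidden dropout 0.2) could be applied in a pre-norm transformer block during SFT; the block layout itself is an assumption, not Hunyuan-Large's code:

```python
import torch
import torch.nn as nn

class SFTBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8,
                 attn_dropout=0.1, hidden_dropout=0.2):
        super().__init__()
        # 0.1 dropout on attention weights, 0.2 on the MLP / hidden path
        self.attn = nn.MultiheadAttention(d_model, n_heads,
                                          dropout=attn_dropout, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model),
                                 nn.Dropout(hidden_dropout))
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):
        h = self.ln1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.ln2(x))

x = torch.randn(2, 16, 512)
print(SFTBlock()(x).shape)   # torch.Size([2, 16, 512])
```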
Exponential Moving Average (EMA)
- EMA is used during the reinforcement learning from human feedback (RLHF) phase.
- Maintains a moving average of model weights, updated at each training step (see the sketch below).
- Provides a stable model less prone to oscillations.
- Mitigates “reward hacking” and reduces “alignment tax.”
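A minimal sketch of an EMA of model weights maintained alongside RLHF-style training; the decay value (0.999 here) is an illustrative assumption, not the paper's setting:

```python
import copy
import torch

class EMA:
    def __init__(self, model, decay=0.999):
        self.decay = decay
        self.shadow = copy.deepcopy(model).eval()      # the averaged, "stable" model
        for p in self.shadow.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model):
        # shadow <- decay * shadow + (1 - decay) * current weights
        for s, p in zip(self.shadow.parameters(), model.parameters()):
            s.mul_(self.decay).add_(p, alpha=1.0 - self.decay)

# Usage: one EMA update per optimizer step; evaluate or deploy ema.shadow,
# which changes more smoothly than the directly optimized policy.
model = torch.nn.Linear(4, 4)
ema = EMA(model)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
for _ in range(10):
    loss = model(torch.randn(8, 4)).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    ema.update(model)
```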
Learning Rate
- Three-Phase Learning Rate Schedule:
- Warm-up Phase: Gradually increases the learning rate to a peak value.
- Gradual Decay Phase: Slowly decreases the learning rate over time.
- Annealing Phase: Significantly reduces the learning rate for the final 5% of pre-training tokens.
- Expert-Specific Learning Rate Scaling:
- Different learning rates for different experts (shared vs. routed experts) within the MoE architecture.
- Adjusts for varying numbers of tokens processed by each expert.
- Enhances training efficiency (a schedule sketch follows this list).
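A sketch of a three-phase schedule with linear warm-up, gradual decay, and a final annealing phase over the last 5% of training, plus an expert-specific scaling helper. The curve shapes, peak learning rate, floor ratio, and the square-root scaling rule are illustrative assumptions, not the paper's exact settings:

```python
def lr_at(step, total_steps, peak_lr=3e-4, warmup_steps=2000,
          anneal_frac=0.05, floor_ratio=0.1):
    anneal_start = int(total_steps * (1.0 - anneal_frac))
    if step < warmup_steps:                      # phase 1: linear warm-up to the peak
        return peak_lr * step / warmup_steps
    if step < anneal_start:                      # phase 2: gradual (linear) decay
        frac = (step - warmup_steps) / (anneal_start - warmup_steps)
        return peak_lr * (1.0 - (1.0 - floor_ratio) * frac)
    frac = (step - anneal_start) / (total_steps - anneal_start)
    return peak_lr * floor_ratio * (1.0 - frac)  # phase 3: anneal over the last 5%

# Expert-specific scaling: a routed expert processes only a fraction of the tokens
# the shared expert processes, so its LR is scaled down. The square-root rule here
# is an assumed illustration, not the paper's derived rule.
def expert_lr(base_lr, tokens_seen_by_expert, tokens_seen_by_shared):
    return base_lr * (tokens_seen_by_expert / tokens_seen_by_shared) ** 0.5

for s in (0, 1_000, 50_000, 96_000, 100_000):
    print(s, lr_at(s, total_steps=100_000))
```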
Two-Stage Long-Context Pre-Training
- Stage 1: Gradually extends the context length from 32K to 256K tokens.
- Stage 2: Fine-tunes on a corpus of 10 billion tokens for long-context understanding.
- Scaling of RoPE:
- Utilizes Rotary Position Embedding (RoPE) during long-context pre-training.
- RoPE base frequency scaled to 1 billion to handle extended sequence lengths effectively (see the sketch after this list).
- Mixed Data for Long-Context Pre-Training:
- Corpus comprises 25% natural long-context data and 75% normal-length pre-training data.
- Ensures the model develops specialized long-context skills while retaining general language understanding.
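A minimal RoPE sketch (split-half variant) showing the role of the base frequency: a larger base, here 1e9 as described above, slows the per-dimension rotation so positions remain distinguishable at much longer context lengths. Dimensions and the demo sequence length are illustrative:

```python
import torch

def rope_cos_sin(seq_len, head_dim, base=1e9):
    # Lower base -> faster rotation; raising the base stretches the usable range.
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    pos = torch.arange(seq_len).float()
    angles = torch.outer(pos, inv_freq)          # (seq_len, head_dim // 2)
    return angles.cos(), angles.sin()

def apply_rope(x, cos, sin):                     # x: (seq_len, head_dim)
    x1, x2 = x[..., : x.shape[-1] // 2], x[..., x.shape[-1] // 2 :]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q = torch.randn(256, 128)                        # short slice just for the demo
cos, sin = rope_cos_sin(seq_len=256, head_dim=128, base=1e9)
print(apply_rope(q, cos, sin).shape)             # torch.Size([256, 128])
```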
Explorations on MoE Scaling Laws
- Research on scaling laws of MoE models with varying parameters and data sizes.
- Derived formulas for estimating the compute budget and the compute-optimal model and data sizes (generic reference forms are shown below).
- These insights guided the design of Hunyuan-Large for optimal performance.
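For reference only, two generic forms commonly used in such analyses; these are standard dense-model and Chinchilla-style expressions, not the MoE-specific formulas derived in the paper:

```latex
% Illustrative, generic forms only; not the paper's fitted MoE formulas.
% Dense training-compute approximation (N = parameters, D = training tokens);
% for MoE models, N is usually taken as the *activated* parameters per token.
\[ C \approx 6\,N\,D \]
% Chinchilla-style loss fit with fitted constants E, A, B, \alpha, \beta:
\[ L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}} \]
```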
Inference
Cross-Layer Attention (CLA)
- Allows adjacent layers to share the same KV cache.
- Reuses the key-value pairs generated by the preceding layer, reducing the KV cache memory footprint (see the sketch below).
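A toy sketch of the KV-sharing pattern: a "KV-producing" layer projects and caches K/V, and the adjacent layer reuses that cache, projecting only queries. The module layout and pairing of every two layers are illustrative assumptions, not Hunyuan-Large's implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CLAAttention(nn.Module):
    def __init__(self, d_model=512, n_heads=8, computes_kv=True):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.o_proj = nn.Linear(d_model, d_model)
        self.computes_kv = computes_kv
        if computes_kv:                     # only "KV-producing" layers project K/V
            self.k_proj = nn.Linear(d_model, d_model)
            self.v_proj = nn.Linear(d_model, d_model)

    def forward(self, x, shared_kv=None):
        B, T, _ = x.shape
        split = lambda t: t.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        q = split(self.q_proj(x))
        if self.computes_kv:
            kv = (split(self.k_proj(x)), split(self.v_proj(x)))
        else:
            kv = shared_kv                  # reuse the adjacent layer's KV cache
        out = F.scaled_dot_product_attention(q, kv[0], kv[1])
        return self.o_proj(out.transpose(1, 2).reshape(B, T, -1)), kv

x = torch.randn(1, 16, 512)
layer0 = CLAAttention(computes_kv=True)
layer1 = CLAAttention(computes_kv=False)    # shares layer0's K/V
y, kv = layer0(x)
z, _ = layer1(y, shared_kv=kv)
print(z.shape)                              # torch.Size([1, 16, 512])
```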
KV Cache Compression
- Integrates Grouped-Query Attention (GQA) and Cross-Layer Attention (CLA).
- Saves almost 95% of KV cache memory relative to standard multi-head attention (MHA), increasing inference efficiency, as illustrated below.
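A back-of-the-envelope estimate of how the GQA and CLA savings compose; the head counts, layer count, sequence length, and grouping factor below are hypothetical, chosen only to land in the same ballpark as the reported ~95%:

```python
# KV cache size for one sequence: K and V tensors, cached once per group of
# `cla_group` adjacent layers (CLA), with `kv_heads` key/value heads (GQA).
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, cla_group=1):
    return 2 * (layers // cla_group) * kv_heads * head_dim * seq_len * bytes_per_elem

mha = kv_cache_bytes(layers=64, kv_heads=64, head_dim=128, seq_len=8192)
gqa_cla = kv_cache_bytes(layers=64, kv_heads=8, head_dim=128, seq_len=8192,
                         cla_group=2)
print(f"saving: {1 - gqa_cla / mha:.1%}")   # ~93.8% with these toy numbers
```

The two reductions multiply: fewer KV heads per layer (GQA) times fewer cached layers (CLA), which is how the combined saving approaches the reported ~95%.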