Hunyuan Model

Mixed Routing Strategy

  • Hunyuan-Large employs a mixed routing strategy within its MoE architecture.
  • Combines a “shared expert” that processes all input tokens with multiple “specialized experts” that handle specific token types.
  • The shared expert captures general knowledge applicable to all tokens.
  • Specialized experts learn domain-specific knowledge; a top-k routing mechanism sends each token to its most relevant specialized expert(s).
  • This strategy leverages both broad and specialized knowledge, enhancing overall capability (see the sketch below).
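
Below is a minimal sketch of this mixed routing, assuming a PyTorch-style layer with illustrative dimensions, expert count, and k; it is not Hunyuan-Large's actual implementation, only the routing pattern: every token passes through the shared expert, and a learned router adds the output of each token's top-k specialized experts.

```python
# Hedged sketch of mixed routing: one shared expert for all tokens plus
# top-k routed specialized experts. All sizes here are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedRoutingMoE(nn.Module):
    def __init__(self, d_model=64, d_ff=128, n_experts=4, k=1):
        super().__init__()
        ffn = lambda: nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                    nn.Linear(d_ff, d_model))
        self.shared = ffn()                          # processes every token
        self.experts = nn.ModuleList(ffn() for _ in range(n_experts))
        self.router = nn.Linear(d_model, n_experts)  # learned token router
        self.k = k

    def forward(self, x):                            # x: (tokens, d_model)
        out = self.shared(x)                         # shared-expert path
        probs = F.softmax(self.router(x), dim=-1)    # routing probabilities
        top_p, top_i = probs.topk(self.k, dim=-1)    # top-k specialized experts
        for e, expert in enumerate(self.experts):
            for slot in range(self.k):
                mask = top_i[:, slot] == e           # tokens routed to expert e
                if mask.any():
                    out[mask] = out[mask] + top_p[mask, slot, None] * expert(x[mask])
        return out

print(MixedRoutingMoE()(torch.randn(10, 64)).shape)  # torch.Size([10, 64])
```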

Expert Choice vs. Recycle Routing

  • Hunyuan-Large uses a novel “recycle routing” strategy to address token dropping in traditional top-k routing.
  • In traditional top-k routing, each token is assigned to its top-k scoring experts; each expert has a capacity limit, so overflow tokens are discarded and their information is lost.
  • Recycle routing randomly re-assigns dropped tokens to other experts with available capacity.
  • Enhances the model’s ability to retain and utilize crucial information, improving training efficiency and effectiveness.
  • Achieves a load-balancing effect similar to expert-choice routing, ensuring tokens are spread across experts rather than dropped (sketched below).
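
A minimal sketch of the recycling step, assuming top-1 routing and a fixed per-expert capacity (both values here are illustrative): overflow tokens are reassigned uniformly at random among experts that still have room, instead of being dropped.

```python
# Hedged sketch of recycle routing: tokens that exceed an expert's capacity
# are randomly reassigned to experts with spare capacity, not discarded.
import torch

def recycle_route(scores, capacity):
    """scores: (tokens, experts) router scores -> expert id per token."""
    n_tokens, n_experts = scores.shape
    assign = scores.argmax(dim=-1)                     # top-1 expert per token
    load = torch.bincount(assign, minlength=n_experts)
    for e in range(n_experts):
        overflow = (assign == e).nonzero().flatten()[capacity:]
        for t in overflow:                             # tokens beyond capacity
            free = (load < capacity).nonzero().flatten()
            if len(free) == 0:                         # nowhere left to recycle
                break
            new_e = free[torch.randint(len(free), (1,))].item()
            assign[t] = new_e                          # recycle instead of drop
            load[e] -= 1
            load[new_e] += 1
    return assign

print(recycle_route(torch.randn(16, 4), capacity=5))
```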

Training

Dropout Adjustment

  • Dropout is used during the supervised fine-tuning (SFT) stage.
  • Attention dropout of 0.1 and hidden dropout of 0.2 are applied.
  • Helps prevent overfitting by randomly zeroing activations during training.
  • MoE architectures benefit more from well-chosen dropout rates than dense models do.
  • Careful tuning of dropout is therefore important for MoE performance (settings illustrated below).
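
A minimal sketch of those settings, assuming a generic PyTorch transformer block; the block structure and sizes are illustrative, only the two dropout rates come from the report.

```python
# Hedged sketch: attention dropout 0.1 and hidden (FFN) dropout 0.2,
# as reported for the SFT stage; the layer itself is illustrative.
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads,
                                          dropout=0.1,  # attention dropout
                                          batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Dropout(0.2),       # hidden dropout
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, x):
        x = x + self.attn(x, x, x, need_weights=False)[0]
        return x + self.ffn(x)

print(Block()(torch.randn(2, 8, 64)).shape)  # torch.Size([2, 8, 64])
```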

Exponential Moving Average (EMA)

  • EMA is used during the reinforcement learning from human feedback (RLHF) phase.
  • Maintains a moving average of model weights, updated with each training step.
  • Provides a stable model less prone to oscillations.
  • Mitigates “reward hacking” and reduces the “alignment tax” (update rule sketched below).
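
The update itself is simple; a minimal sketch, with an assumed decay value (the report's actual decay is not given in these notes):

```python
# Hedged sketch of an EMA over model weights for RLHF stabilization:
# ema_w <- decay * ema_w + (1 - decay) * w, applied after each step.
import copy
import torch

@torch.no_grad()
def ema_update(ema_model, model, decay=0.999):  # decay is an assumed value
    for ema_p, p in zip(ema_model.parameters(), model.parameters()):
        ema_p.mul_(decay).add_(p, alpha=1 - decay)

model = torch.nn.Linear(8, 8)
ema_model = copy.deepcopy(model)  # frozen shadow copy tracked by the EMA
# ... in the training loop, after optimizer.step():
ema_update(ema_model, model)
```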

Learning Rate

  • Three-Phase Learning Rate Schedule:
    • Warm-up Phase: Gradually increases the learning rate to a peak value.
    • Gradual Decay Phase: Slowly decreases the learning rate over time.
    • Annealing Phase: Significantly reduces the learning rate for the final 5% of pre-training tokens.
  • Expert-Specific Learning Rate Scaling:
    • Different learning rates for different experts (shared vs. routed experts) within the MoE architecture.
    • Adjusts for varying numbers of tokens processed by each expert.
    • Enhances training efficiency (base schedule sketched below).
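
A minimal sketch of the base schedule, assuming illustrative peak/floor values and phase shapes (only the “final 5%” annealing boundary comes from the report); expert-specific scaling would then multiply this base rate per parameter group.

```python
# Hedged sketch of the three-phase schedule: linear warm-up, gradual decay,
# then sharp annealing over the final 5% of training. Values are assumptions.
import math

def lr_at(step, total_steps, warmup_steps=2000, peak=3e-4,
          decay_floor=3e-5, anneal_floor=3e-6):
    anneal_start = int(0.95 * total_steps)      # final 5% of training tokens
    if step < warmup_steps:                     # phase 1: warm-up to peak
        return peak * step / warmup_steps
    if step < anneal_start:                     # phase 2: gradual (cosine) decay
        frac = (step - warmup_steps) / (anneal_start - warmup_steps)
        return decay_floor + 0.5 * (peak - decay_floor) * (1 + math.cos(math.pi * frac))
    frac = (step - anneal_start) / (total_steps - anneal_start)
    return anneal_floor + (decay_floor - anneal_floor) * (1 - frac)  # phase 3

for s in (0, 2_000, 500_000, 960_000, 1_000_000):
    print(s, f"{lr_at(s, 1_000_000):.2e}")
```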

Two-Stage Long-Context Pre-Training

  • Stage 1: Gradually increases the sequence length from 32,000 to 256,000 tokens.
  • Stage 2: Fine-tunes on a corpus of 10 billion tokens for long-context understanding.
  • Scaling of RoPE:
    • Utilizes Rotary Position Embedding (RoPE) during long-context pre-training.
    • RoPE base frequency scaled to 1 billion to handle extended sequence lengths effectively.
  • Mixed Data for Long-Context Pre-Training:
    • Corpus comprises 25% natural long-context data and 75% normal-length pre-training data.
    • Ensures development of specialized long-context skills alongside general language understanding (RoPE frequencies sketched below).
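
A minimal sketch of the RoPE frequency computation with the enlarged base, assuming an illustrative head dimension (the base of 1e9 is the reported value): a larger base stretches the rotary wavelengths so positions remain distinguishable at 256K-token contexts.

```python
# Hedged sketch of RoPE tables with a base of 1e9, per the report;
# the head dimension is an assumption for illustration.
import torch

def rope_tables(head_dim, max_pos, base=1e9):
    inv_freq = 1.0 / base ** (torch.arange(0, head_dim, 2) / head_dim)
    angles = torch.outer(torch.arange(max_pos), inv_freq)
    return torch.cos(angles), torch.sin(angles)  # rotate query/key pairs

cos, sin = rope_tables(head_dim=128, max_pos=256_000)
print(cos.shape)  # torch.Size([256000, 64])
```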

Explorations on MoE Scaling Laws

  • Research on scaling laws of MoE models with varying parameters and data sizes.
  • Derived formulas for estimating compute budget and optimal parameters.
  • These insights were crucial for designing Hunyuan-Large for optimal performance (a reference relation is noted below).
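
These notes do not reproduce the report's fitted formulas. As a hedged point of reference only, the widely used dense-model approximation below shows the kind of relation such studies start from; MoE scaling work re-fits the coefficients around the activated parameter count and the attention cost at long sequence lengths.

```latex
% Standard dense-model training-compute approximation (Kaplan et al.),
% NOT the paper's fitted MoE formula:
C \approx 6 \, N \, D
% C: training compute (FLOPs), N: parameters (activated parameters for
% an MoE), D: number of training tokens.
```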

Inference

Cross-Layer Attention (CLA)

  • Allows adjacent layers to share the same KV cache.
  • Reuses the key-value pairs produced by the preceding layer, reducing the KV-cache memory footprint (sketched below).
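
A minimal sketch of the idea, assuming single-head attention and a stack in which even layers compute (and would cache) K/V while odd layers reuse them; the sizes and layer pairing are illustrative.

```python
# Hedged sketch of cross-layer attention (CLA): K/V projections exist only
# on even layers; each odd layer reuses its predecessor's K/V, so only half
# the layers need a KV cache at inference time. All sizes are illustrative.
import torch
import torch.nn as nn

class CLAStack(nn.Module):
    def __init__(self, n_layers=4, d=64):
        super().__init__()
        assert n_layers % 2 == 0
        self.kv_proj = nn.ModuleList(nn.Linear(d, 2 * d)
                                     for _ in range(n_layers // 2))
        self.q_proj = nn.ModuleList(nn.Linear(d, d) for _ in range(n_layers))

    def forward(self, x):                        # x: (batch, seq, d)
        k = v = None
        for i, q_proj in enumerate(self.q_proj):
            if i % 2 == 0:                       # even layer: fresh K/V (cached)
                k, v = self.kv_proj[i // 2](x).chunk(2, dim=-1)
            q = q_proj(x)                        # odd layer: reuses previous K/V
            attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5,
                                 dim=-1)
            x = x + attn @ v
        return x

print(CLAStack()(torch.randn(2, 16, 64)).shape)  # torch.Size([2, 16, 64])
```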

KV Cache Compression

  • Integrates Grouped-Query Attention (GQA) and Cross-Layer Attention (CLA).
  • Saves almost 95% of KV cache memory, substantially increasing inference efficiency (see the worked estimate below).
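
A back-of-the-envelope check of that figure, assuming for illustration 80 query heads grouped into 8 KV heads for GQA and adjacent-layer sharing for CLA; the head counts are assumptions, only the ~95% figure comes from the report.

```python
# Hedged estimate of KV-cache savings from combining GQA and CLA.
q_heads, kv_heads = 80, 8        # GQA: assumed head counts, 10x fewer K/V heads
cla_share = 2                    # CLA: two adjacent layers share one KV cache
kept = (kv_heads / q_heads) / cla_share   # fraction of baseline MHA cache
print(f"KV cache kept: {kept:.1%}, saved: {1 - kept:.1%}")
# KV cache kept: 5.0%, saved: 95.0%
```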