<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://jianyuh.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://jianyuh.github.io/" rel="alternate" type="text/html" /><updated>2026-04-12T21:05:14+00:00</updated><id>https://jianyuh.github.io/feed.xml</id><title type="html">Jianyu Huang’s Blog</title><subtitle>Record the technical thoughts.
</subtitle><author><name>Jianyu Huang</name><email>jianyu0huang@gmail.com</email></author><entry><title type="html">NVIDIA Blackwell SM100: TMEM, TMA, and the New Tensor Core Roofline</title><link href="https://jianyuh.github.io/cuda/2026/04/12/blackwell-sm100.html" rel="alternate" type="text/html" title="NVIDIA Blackwell SM100: TMEM, TMA, and the New Tensor Core Roofline" /><published>2026-04-12T00:00:00+00:00</published><updated>2026-04-12T00:00:00+00:00</updated><id>https://jianyuh.github.io/cuda/2026/04/12/blackwell-sm100</id><content type="html" xml:base="https://jianyuh.github.io/cuda/2026/04/12/blackwell-sm100.html"><![CDATA[<p>Reading notes based primarily on:</p>
<ul>
  <li><a href="https://newsletter.semianalysis.com/p/dissecting-nvidia-blackwell-tensor">Dissecting NVIDIA Blackwell Tensor Cores</a></li>
</ul>

<p>Blackwell is not just Hopper with bigger Tensor Cores. The software contract changed. On SM100, kernel performance now depends on explicitly managing <strong>Tensor Memory (TMEM)</strong>, understanding CTA-scoped tensor instructions, and recognizing that the limiting resource is often no longer raw tensor math, but the bandwidth needed to feed it.</p>

<p>This also helps explain why several recent Blackwell-era kernels, including <a href="/flashattention/2026/03/06/FA4.html">FlashAttention-4</a> and <a href="/mxfp8/2025/12/07/MXFP8-Train.html">MXFP8 Training</a>, had to substantially rethink their pipelines instead of simply porting Hopper code.</p>

<hr />

<h2 id="1-blackwell-changes-who-owns-mma-execution">1. Blackwell changes who “owns” MMA execution</h2>

<p>Three architectural shifts matter most.</p>

<p>First, Blackwell introduces <strong>Tensor Memory (TMEM)</strong>, a new software-managed level in the memory hierarchy used to hold MMA accumulators. On Hopper, MMA results were tightly coupled to the issuing warpgroup and register file. On Blackwell, accumulators live in TMEM, which decouples result ownership from any particular thread. That sounds subtle, but it fundamentally changes kernel structure: epilogues, accumulator lifetimes, and overlap strategies all need to be redesigned around TMEM.</p>

<p>Second, <code class="language-plaintext highlighter-rouge">tcgen05</code> instructions are <strong>CTA-scoped</strong>. A single thread issues the instruction on behalf of the entire CTA. This is a major departure from Hopper’s warpgroup-scoped <code class="language-plaintext highlighter-rouge">wgmma</code> model. The practical consequence is that threads are no longer symmetric participants in tensor-core issue. Some threads orchestrate, others move data, and the whole CTA acts as the execution unit.</p>

<p>Third, Blackwell adds <strong>TPC-scoped TMA and MMA instructions</strong> through <code class="language-plaintext highlighter-rouge">cta_group::2</code> in PTX, or <code class="language-plaintext highlighter-rouge">2CTA</code> in SASS. Two CTAs, spanning two SMs, can collaboratively execute the same <code class="language-plaintext highlighter-rouge">tcgen05.mma</code>. Combined with native support for sub-byte and microscaled datatypes, this gives Blackwell a much more flexible tensor-core pipeline, but only if the kernel is explicitly written to use it.</p>

<p>The broader pattern is clear: Blackwell rewards kernels that think in terms of <strong>asynchronous clusters and shared on-chip resources</strong>, not warp-synchronous loops.</p>

<h2 id="2-physical-topology-now-leaks-into-software-decisions">2. Physical topology now leaks into software decisions</h2>

<p>The execution hierarchy above the SM matters more than it used to.</p>

<p>CTAs grouped into clusters are guaranteed to co-schedule on the same <strong>Graphics Processing Cluster (GPC)</strong>. That is essential for 2CTA execution and for efficient use of distributed shared memory. But there is a catch: if a persistent kernel launches one CTA per SM and the chosen cluster size does not evenly divide the number of SMs in each GPC, the leftover CTAs can serialize. The result is that “one CTA per SM” is no longer automatically the right launch policy.</p>
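
<p>As a back-of-the-envelope illustration (with a hypothetical SM-per-GPC count, since the real figure depends on the SKU and floorsweeping), the leftover-SM effect can be checked in a few lines of Python:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Toy check of cluster size vs. GPC packing for a persistent "one CTA per SM" launch.
# The SM-per-GPC count below is illustrative, not a measured Blackwell value.
def leftover_sms_per_gpc(sms_per_gpc, cluster_size):
    """SMs per GPC that cannot host a complete cluster and may idle or serialize."""
    return sms_per_gpc % cluster_size

for cluster in (2, 4, 8):
    waste = leftover_sms_per_gpc(sms_per_gpc=18, cluster_size=cluster)
    print(f"cluster_size={cluster}: {waste} SM(s) per GPC left without a full cluster")
</code></pre></div></div>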

<p>On B200, the topology is even more visible because the package spans two dies. Pointer-chasing measurements that intentionally fill L2 expose an average <strong>die-to-die latency penalty of roughly 300 cycles</strong>. For kernels relying on cluster-local reuse, this means that topology-blind scheduling can turn a theoretically local path into a very real latency tax.</p>

<p>In other words, Blackwell kernel tuning now includes <strong>placement</strong> as well as instruction selection.</p>

<h2 id="3-ldgsts-and-tma-serve-different-regimes">3. LDGSTS and TMA serve different regimes</h2>

<p>Blackwell offers two strong but very different ways to move data into shared memory.</p>

<h3 id="ldgsts-fast-to-ramp-fragile-at-scale">LDGSTS: fast to ramp, fragile at scale</h3>

<p><code class="language-plaintext highlighter-rouge">cp.async</code> / <strong>LDGSTS</strong> writes directly into shared memory without staging through registers, which reduces register pressure and makes it attractive for irregular data movement. Its DRAM throughput saturates at roughly <strong>6.6 TB/s</strong> with about <strong>32 KiB in-flight per SM</strong>.</p>

<p>The problem is latency scaling. The baseline latency is about <strong>600 ns</strong>, but once the in-flight footprint grows, the MIO subsystem becomes the bottleneck:</p>

<ul>
  <li>At <strong>8 KiB in-flight</strong>, latency rises to about <strong>1229 cycles</strong></li>
  <li>At <strong>12 KiB in-flight</strong>, latency can spike to about <strong>2177 cycles</strong></li>
</ul>

<p>So LDGSTS is excellent when the kernel needs responsiveness and flexibility, but it becomes increasingly fragile once too many warps are competing for the copy path.</p>

<h3 id="tma-higher-ceiling-slower-to-fill">TMA: higher ceiling, slower to fill</h3>

<p><strong>Tensor Memory Accelerator (TMA)</strong> is issued by a single thread, handles address generation in hardware, and can perform the swizzling needed by tensor-core layouts asynchronously. Its peak throughput is higher, approaching <strong>7.2 TB/s</strong>, but it needs much more data in flight, typically <strong>more than 64 KiB</strong>, before it reaches that ceiling.</p>

<p>That makes TMA the better fit for large, regular, deeply pipelined tiles, while LDGSTS remains attractive for sparse or irregular patterns.</p>

<p>This tradeoff shows up in real kernels. A reasonable rule of thumb is:</p>

<ul>
  <li>Use <strong>TMA</strong> for large, predictable tiles with enough buffering to hide setup cost</li>
  <li>Use <strong>LDGSTS</strong> for irregular or dynamic page fetches where responsiveness matters more than peak bandwidth</li>
</ul>

<p>Even within LDGSTS-heavy kernels, adding more stages and more copy-participating threads continues to help until register allocation becomes the limiting factor.</p>

<h2 id="4-multicast-and-dsmem-are-powerful-but-not-free">4. Multicast and DSMEM are powerful, but not free</h2>

<p>Blackwell’s cluster features are only as good as the access pattern driving them.</p>

<h3 id="tma-multicast-and-the-l2-request-coalescer">TMA multicast and the L2 Request Coalescer</h3>

<p>With <strong>TMA multicast</strong>, a single load can populate shared memory on multiple SMs in a cluster. This is serviced through the <strong>L2 Request Coalescer (LRC)</strong>. There is also an “implicit” form of multicast where multiple CTAs simply request the same data and rely on the hardware to merge requests.</p>

<p>Implicit multicast can reach roughly the same effective shared-memory fill throughput as explicit multicast, but the LRC stops saving much L2 traffic once the implicit requests exceed about <strong>64 bytes in-flight</strong>. So if the objective is not just SMEM fill rate but also lower L2 pressure, explicit multicast remains the cleaner tool.</p>

<h3 id="remote-shared-memory-is-not-local-shared-memory">Remote shared memory is not local shared memory</h3>

<p>The gap between local and remote shared-memory access is severe:</p>

<ul>
  <li>Local <code class="language-plaintext highlighter-rouge">ld.shared</code>: about <strong>128 B/clk</strong></li>
  <li>Naive remote <code class="language-plaintext highlighter-rouge">ld.shared::cluster</code>: about <strong>21 B/clk</strong></li>
</ul>

<p>The reason is painful but simple: the compiler often lowers remote loads to generic <code class="language-plaintext highlighter-rouge">LD</code> instructions rather than optimized <code class="language-plaintext highlighter-rouge">LDS</code> instructions. For high-throughput inter-CTA exchange, developers should rely on <strong><code class="language-plaintext highlighter-rouge">cp.async.bulk</code></strong> (<code class="language-plaintext highlighter-rouge">UBLKCP</code> in SASS), which pushes distributed shared-memory throughput up to about <strong>32 B/clk</strong>.</p>

<p>The lesson is that Blackwell’s cluster features are not self-optimizing. The fast path usually has to be spelled out explicitly.</p>

<h2 id="5-the-real-roofline-is-often-shared-memory-bandwidth">5. The real roofline is often shared-memory bandwidth</h2>

<p>One of the most important Blackwell insights is that many MMA instructions are no longer math-bound.</p>

<p>For 1SM MMA, under-sized shapes are heavily penalized:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">M=64</code> uses only about half the datapath</li>
  <li><code class="language-plaintext highlighter-rouge">M=128</code> reaches near-full utilization</li>
</ul>

<p>For 2SM MMA, <code class="language-plaintext highlighter-rouge">M=256</code> is the sweet spot because it maps to <strong>128 rows per SM</strong>, which keeps both SMs well utilized.</p>

<p>The deeper issue is operand movement. Blackwell supports:</p>

<ul>
  <li><strong>SS mode</strong>: both A and B come from shared memory</li>
  <li><strong>TS mode</strong>: A comes from TMEM, B comes from shared memory</li>
</ul>

<p>In <strong>SS mode</strong>, the instruction is entirely bound by shared-memory bandwidth for <code class="language-plaintext highlighter-rouge">N &lt; 128</code>.</p>

<p>Consider an FP16 1SM MMA with shape <code class="language-plaintext highlighter-rouge">M=128, N=64, K=16</code>:</p>

\[\text{A bytes} = 2 \times M \times K = 4096\]

\[\text{B bytes} = 2 \times N \times K = 2048\]

\[\text{Total FLOPs} = 2 \times M \times N \times K = 262{,}144\]

<p>Assuming Blackwell shared memory sustains <strong>128 B/clk</strong>, the shared-memory service time is:</p>

\[\text{SMEM cycles} = \frac{4096 + 2048}{128} = 48\]

<p>If the effective Tensor Core throughput for this instruction regime is <strong>8,192 FLOPs/clk</strong>, the math time is:</p>

\[\text{Math cycles} = \frac{262{,}144}{8{,}192} = 32\]

<p>So shared memory still dominates:</p>

\[48 \text{ SMEM cycles} &gt; 32 \text{ Math cycles}\]

<p>That is the core result. For <code class="language-plaintext highlighter-rouge">N=64</code>, the instruction is physically <strong>SMEM-bound</strong>, not Tensor-Core-bound. Only when <code class="language-plaintext highlighter-rouge">N=128</code> do the two sides align at roughly <strong>64 cycles each</strong>, marking the transition into a math-limited regime.</p>

<p>This produces a distinctly sloped roofline at exactly <strong>128 B/clk</strong>. On Blackwell, feeding the Tensor Cores is often harder than using them.</p>
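
<p>The same arithmetic can be scripted to see where the crossover sits. A minimal sketch, reusing the 128 B/clk shared-memory rate and 8,192 FLOPs/clk tensor throughput quoted above (both are the assumed figures of this worked example, not universal constants):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>BYTES_PER_ELEM = 2          # FP16 operands
SMEM_BW = 128               # bytes per clock (assumed sustained shared-memory rate)
TC_FLOPS = 8192             # FLOPs per clock (assumed for this instruction regime)

def mma_cycles(M, N, K):
    smem = BYTES_PER_ELEM * (M * K + N * K) / SMEM_BW   # operand service time
    math = 2 * M * N * K / TC_FLOPS                     # tensor-core time
    return smem, math

for n in (64, 128, 256):
    smem, math = mma_cycles(M=128, N=n, K=16)
    regime = "SMEM-bound" if smem &gt; math else "math-bound"
    print(f"N={n:3d}: SMEM={smem:5.1f} cyc, math={math:5.1f} cyc ({regime})")
</code></pre></div></div>

<p>For <code class="language-plaintext highlighter-rouge">N=64</code> the loop reports 48 versus 32 cycles; at <code class="language-plaintext highlighter-rouge">N=128</code> the two sides meet at 64 cycles, matching the transition described above.</p>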

<h2 id="6-why-2sm-mma-can-scale-by-more-than-2x">6. Why 2SM MMA can scale by more than 2x</h2>

<p>This shared-memory bottleneck explains a counterintuitive result: in <strong>SS mode</strong> and for small shapes, <strong>2SM MMA can achieve greater than 2x strong scaling over 1SM MMA</strong>.</p>

<p>That is not magic. It is bottleneck removal.</p>

<p>When the work is split across two SMs, each SM contributes its own shared-memory bandwidth. The kernel is no longer constrained by the single-SM SMEM ceiling that held back the 1SM path. In effect, the architecture doubles both the compute resources and the on-chip bandwidth feeding them, so the observed speedup can exceed the naive 2.0x expectation.</p>
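
<p>One way to see how the speedup can exceed 2x is a toy model (my own simplification, not taken from the article) in which the B tile is fetched once and served to both CTAs of the pair, while each SM contributes its own shared-memory bandwidth and tensor throughput:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def issue_cycles(M, N, K, num_sm):
    smem_bw, tc_flops = 128 * num_sm, 8192 * num_sm      # each SM adds its own resources
    operand_bytes = 2 * (M * K + N * K)                  # FP16; B assumed fetched only once
    return max(operand_bytes / smem_bw, 2 * M * N * K / tc_flops)

t_1sm = issue_cycles(M=128, N=64, K=16, num_sm=1)        # 48 cycles, SMEM-bound
t_2sm = issue_cycles(M=256, N=64, K=16, num_sm=2)        # 40 cycles for twice the work
print(f"strong scaling: {2 * t_1sm / t_2sm:.2f}x")       # 2.40x under these assumptions
</code></pre></div></div>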

<p>In <strong>TS mode</strong>, where operand A comes from TMEM rather than shared memory, scaling behaves much more cleanly and sits near the expected <strong>2.0x</strong>.</p>

<h2 id="7-latency-and-data-format-still-matter">7. Latency and data format still matter</h2>

<p>Single-instruction latency reveals more of the underlying machine behavior.</p>

<p>Latency grows roughly linearly from <code class="language-plaintext highlighter-rouge">N=64</code> to <code class="language-plaintext highlighter-rouge">N=128</code>, then spikes at <code class="language-plaintext highlighter-rouge">N=256</code>. Data format also changes the ordering:</p>

\[\text{S8} &lt; \text{BF16} = \text{E4M3} = \text{F4} &lt; \text{MXF8} = \text{MXF4}\]

<p>The intuition is straightforward:</p>

<ul>
  <li><strong>S8</strong> is fastest because integer tensor operations are power-efficient and simple</li>
  <li><strong>Microscaled formats</strong> pay a small extra cost to derive and apply scale factors</li>
</ul>

<p>Even if an instruction is well chosen, issue efficiency remains a separate problem. To truly approach speed-of-light throughput, a kernel likely needs on the order of <strong>256 to 1024 in-flight MMA instructions</strong> so that issue overhead and commit waits are fully amortized.</p>

<p>Most real kernels are nowhere near that. They often carry only <strong>1 to 4 in-flight MMAs</strong>, which artificially caps throughput around <strong>78% to 80% of speed-of-light</strong>. That is why maximizing MMA instruction size per shared-memory tile is not optional on Blackwell; it is one of the few levers strong enough to move the roofline.</p>

<h2 id="8-practical-rules-for-kernel-writers">8. Practical rules for kernel writers</h2>

<p>Blackwell tuning can be summarized in a few rules:</p>

<ol>
  <li><strong>Design around TMEM explicitly.</strong> Accumulators no longer belong to registers or warps, so the pipeline has to be structured around TMEM residency and transfer boundaries.</li>
  <li><strong>Treat CTA clusters as first-class hardware.</strong> Launch geometry, cluster size, and GPC packing all affect whether the kernel actually runs in parallel.</li>
  <li><strong>Choose the copy path by access pattern, not ideology.</strong> TMA wins on large regular tiles; LDGSTS wins on responsiveness and irregularity.</li>
  <li><strong>Do not treat remote DSMEM like local SMEM.</strong> Use <code class="language-plaintext highlighter-rouge">cp.async.bulk</code> for real inter-CTA throughput.</li>
  <li><strong>Expect shared memory to be the bottleneck before tensor math.</strong> For SS-mode kernels, shape selection and operand staging dominate achievable performance.</li>
  <li><strong>Use larger MMA shapes and deeper in-flight pipelines whenever possible.</strong> Blackwell leaves a lot of performance stranded when kernels are too shallow.</li>
</ol>

<p>The headline message is simple: <strong>Blackwell’s Tensor Cores got faster, but the software problem got harder</strong>. The best kernels are no longer the ones that merely maximize FLOPs. They are the ones that understand where the new bottlenecks moved, and then reorganize the entire pipeline around those new limits.</p>]]></content><author><name>Jianyu Huang</name><email>jianyu0huang@gmail.com</email></author><category term="CUDA" /><category term="CUDA" /><category term="NVIDIA" /><category term="Blackwell" /><category term="GEMM" /><category term="TMEM" /><summary type="html"><![CDATA[Reading notes based primarily on: Dissecting NVIDIA Blackwell Tensor Cores]]></summary></entry><entry><title type="html">Gemma 4: Architecture and Multimodal Innovations</title><link href="https://jianyuh.github.io/architecture/2026/04/05/Gemma4.html" rel="alternate" type="text/html" title="Gemma 4: Architecture and Multimodal Innovations" /><published>2026-04-05T00:00:00+00:00</published><updated>2026-04-05T00:00:00+00:00</updated><id>https://jianyuh.github.io/architecture/2026/04/05/Gemma4</id><content type="html" xml:base="https://jianyuh.github.io/architecture/2026/04/05/Gemma4.html"><![CDATA[<p>Reading notes on <a href="https://ai.google.dev/gemma/docs/core/model_card_4">Gemma 4 Model Card</a>.</p>

<p>Gemma 4 represents a comprehensive effort in optimizing parameter efficiency, memory bandwidth, and long-context multimodal processing across edge and server deployments.</p>

<hr />

<h2 id="1-model-lineup">1. Model Lineup</h2>

<p>Gemma 4 introduces four variants, categorized by structural paradigm:</p>

<table>
  <thead>
    <tr>
      <th style="text-align: left">Model</th>
      <th style="text-align: left">Total / Active Params</th>
      <th style="text-align: left">Layers</th>
      <th style="text-align: left">Context</th>
      <th style="text-align: left">Modalities</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: left"><strong>E2B</strong></td>
      <td style="text-align: left">5.1B / 2.3B effective</td>
      <td style="text-align: left">35</td>
      <td style="text-align: left">128K</td>
      <td style="text-align: left">Text, Image, Audio</td>
    </tr>
    <tr>
      <td style="text-align: left"><strong>E4B</strong></td>
      <td style="text-align: left">8B / 4.5B effective</td>
      <td style="text-align: left">42</td>
      <td style="text-align: left">128K</td>
      <td style="text-align: left">Text, Image, Audio</td>
    </tr>
    <tr>
      <td style="text-align: left"><strong>26B A4B</strong></td>
      <td style="text-align: left">25.2B / 3.8–4B active (MoE)</td>
      <td style="text-align: left">—</td>
      <td style="text-align: left">256K</td>
      <td style="text-align: left">Text, Image</td>
    </tr>
    <tr>
      <td style="text-align: left"><strong>31B</strong></td>
      <td style="text-align: left">30.7B dense</td>
      <td style="text-align: left">—</td>
      <td style="text-align: left">256K</td>
      <td style="text-align: left">Text, Image</td>
    </tr>
  </tbody>
</table>

<hr />

<h2 id="2-text-architecture">2. Text Architecture</h2>

<h3 id="interleaved-local-and-global-attention">Interleaved Local and Global Attention</h3>

<p>Instead of standard full attention, Gemma 4 interleaves local (sliding window) with global (full) attention:</p>
<ul>
  <li><strong>Sliding Window:</strong> 512 tokens for E2B/E4B, 1024 for 26B/31B.</li>
  <li><strong>Interleaving Ratio:</strong> 5:1 (5 local layers per 1 global layer), except E2B which uses 4:1.</li>
  <li><strong>Final Layer Constraint:</strong> The final layer is always global, ensuring full-sequence synthesis before output.</li>
</ul>
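
<p>A tiny sketch of how such a schedule might be generated (my reading of the description above; the exact placement of global layers in the released models may differ):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def attention_pattern(num_layers, ratio=5):
    """Return 'L' (sliding-window) or 'G' (global) for each layer, with a
    ratio:1 local:global interleave and the final layer forced to global."""
    pattern = ["G" if (i + 1) % (ratio + 1) == 0 else "L" for i in range(num_layers)]
    pattern[-1] = "G"                      # final layer is always global
    return pattern

print("".join(attention_pattern(12)))      # LLLLLGLLLLLG
</code></pre></div></div>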

<h3 id="shared-kv-cache-and-kv-trick">Shared KV Cache and K=V Trick</h3>

<p>Two major optimizations for global attention memory overhead:</p>
<ul>
  <li><strong>GQA with K=V:</strong> Local layers use standard GQA (2 Query heads per 1 KV head). Global layers scale to 8 Query heads per KV head with doubled Key dimensions. For global attention layers, <strong>Keys are set equal to Values ($K=V$)</strong>, collapsing the KV-cache into a K-cache only and halving the memory footprint for those states.</li>
  <li><strong>Shared KV Cache:</strong> The last $N$ layers reuse $K$ and $V$ tensors from the previous non-shared layer of the same attention type (sliding or full), cutting both redundant compute and memory.</li>
</ul>

<h3 id="proportional-rope-p-rope">Proportional RoPE (p-RoPE)</h3>

<p>Standard RoPE applies rotation across all embedding pairs. With 256K contexts, low-frequency pair rotations accumulate and add noise to semantic tracking.</p>

<p>Gemma 4 uses <strong>p-RoPE</strong> on global attention layers only: with $p = 0.25$, only the first 25% of coordinate pairs receive RoPE positional information, while the remaining 75% receive zero rotation. This isolates positional data to high-frequency dimensions, leaving low-frequency dimensions clean for semantic meaning.</p>
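
<p>A hypothetical sketch of what such a partial rotation looks like (the pairing convention and frequency schedule here are simplified guesses, not the released implementation):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

def p_rope(x, positions, p=0.25, base=10000.0):
    """Rotate only the first p fraction of coordinate pairs (the high-frequency
    ones); leave the remaining pairs unrotated. Pair i = dims (i, i + rot)."""
    num_pairs = x.shape[-1] // 2
    rot = int(p * num_pairs)                               # pairs that receive RoPE
    inv_freq = base ** (-np.arange(rot) / num_pairs)       # highest frequencies first
    angles = positions[:, None] * inv_freq[None, :]        # (seq, rot)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :rot], x[..., rot:2 * rot]
    out = x.copy()
    out[..., :rot] = x1 * cos - x2 * sin
    out[..., rot:2 * rot] = x1 * sin + x2 * cos
    return out                                             # remaining 75% untouched

q = p_rope(np.random.randn(8, 128), positions=np.arange(8))
</code></pre></div></div>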

<hr />

<h2 id="3-per-layer-embeddings-ple">3. Per-Layer Embeddings (PLE)</h2>

<p>The “E” in E2B/E4B stands for “Effective” parameters. Instead of adding depth or width, these models use Per-Layer Embeddings.</p>

<ul>
  <li><strong>Architecture:</strong> Beyond the standard embedding lookup ($V \times d_{model}$), PLE adds a massive lookup table of dimensions $V \times d_{PLE} \times L$ (Vocabulary $\times$ PLE dim $\times$ Number of Layers). For E2B: $262{,}144 \times 256 \times 35$.</li>
  <li><strong>Flow:</strong> At ingestion, the model fetches a per-layer $d_{PLE}$ embedding for every layer. During the forward pass at layer $l$, a gating function weights this layer-specific embedding, projects it back to $d_{model}$ (1,536 for E2B), and combines it with the residual stream via a lightweight block after attention and FFN.</li>
  <li><strong>Hardware Insight:</strong> The PLE lookup table is queried only once per token at inference start, so it can reside in flash memory rather than VRAM. This allows a 5.1B parameter model to run at the speed and VRAM cost of a 2.3B model — hence the “effective” parameter metric.</li>
</ul>
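
<p>A minimal sketch of that flow for a single token and layer, with toy dimensions and a guessed gating form (the model card does not spell out the exact gate or projection):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

# Toy dimensions for the sketch; the real E2B table is 262,144 x 35 x 256.
V, L, D_MODEL, D_PLE = 1000, 35, 1536, 256

ple_table = (np.random.randn(V, L, D_PLE) * 0.02).astype(np.float16)  # can live in flash
proj = np.random.randn(L, D_PLE, D_MODEL).astype(np.float32) * 0.02

def ple_update(hidden, token_id, layer):
    """Fetch the layer-specific embedding for this token, gate it (assumed
    sigmoid form), project back to d_model, and add it to the residual stream."""
    e = ple_table[token_id, layer].astype(np.float32)      # one lookup per token/layer
    gate = 1.0 / (1.0 + np.exp(-(hidden[:D_PLE] @ e)))     # scalar gate (assumption)
    return hidden + gate * (e @ proj[layer])

h = ple_update(np.random.randn(D_MODEL), token_id=42, layer=7)
</code></pre></div></div>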

<hr />

<h2 id="4-mixture-of-experts-moe">4. Mixture of Experts (MoE)</h2>

<p>The 26B A4B model uses sparse activation to run at 4B-model speed:</p>
<ul>
  <li><strong>Routing:</strong> FFN is split into 128 total experts; a router selects 8 experts per token. The token embedding is scaled by the router’s probability for each expert’s contribution.</li>
  <li><strong>Shared Expert:</strong> 1 shared expert is always activated for every token, formulated at 3x the size of a standard expert to capture broad, general knowledge.</li>
</ul>
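
<p>A toy sketch of this routing scheme (expert internals, dimensions, and initialization are placeholders; only the top-8-of-128 routing plus the 3x-sized shared expert follow the description above):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

def make_expert(d, scale=1.0):
    W = np.random.randn(int(scale * d), d) * 0.05
    V = np.random.randn(d, int(scale * d)) * 0.05
    return lambda x: V @ np.tanh(W @ x)

def moe_ffn(x, router_w, experts, shared_expert, top_k=8):
    """Top-k router: pick 8 of 128 experts, weight each by its router
    probability, and always add the shared expert."""
    logits = router_w @ x
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    chosen = np.argsort(probs)[-top_k:]                    # indices of selected experts
    out = shared_expert(x)                                 # shared expert: every token
    for i in chosen:
        out = out + probs[i] * experts[i](x)               # scale by router probability
    return out

d, n_exp = 64, 128
experts = [make_expert(d) for _ in range(n_exp)]
shared = make_expert(d, scale=3.0)                         # 3x the size of a standard expert
y = moe_ffn(np.random.randn(d), np.random.randn(n_exp, d), experts, shared)
</code></pre></div></div>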

<hr />

<h2 id="5-multimodal-encoders">5. Multimodal Encoders</h2>

<h3 id="vision-encoder">Vision Encoder</h3>

<p>Based on a ViT (~150M params for E-models, ~550M for larger models):</p>

<ul>
  <li><strong>Variable Aspect Ratios:</strong> Adaptively resizes input while maintaining aspect ratio, applying padding only where the image doesn’t perfectly divide into 16x16 pixel patches (no warping into squares).</li>
  <li><strong>2D RoPE:</strong> Patch embeddings are split into two halves — one receives RoPE tracking width ($w$), the other tracks height ($h$), baking 2D coordinates into the transformer.</li>
  <li><strong>Soft Token Budget &amp; Spatial Pooling:</strong> Users define a “soft token budget” ($B$, range 70–1120). The image is divided into at most $B \times 9$ patches. Every $3 \times 3$ grid of neighboring patches is averaged (pooled) into a single embedding. A linear projection and RMSNorm then map vision space into text embedding space.</li>
</ul>

<h3 id="audio-encoder-e2b-and-e4b-only">Audio Encoder (E2B and E4B only)</h3>

<p>A ~300M parameter encoder processes raw audio (up to 30 seconds) into LLM-compatible tokens:</p>
<ol>
  <li>Extract Mel-spectrogram features (time vs. frequency).</li>
  <li>Group features into chunks.</li>
  <li>Overlap and downsample chunks via two 2D convolutional layers.</li>
  <li>Process through a <strong>Conformer</strong> (Transformer Encoder with convolutional module).</li>
  <li>Linear projection to align with the Gemma 4 embedding space.</li>
</ol>

<hr />

<h2 id="6-deployment-notes">6. Deployment Notes</h2>

<ul>
  <li><strong>Sampling:</strong> DeepMind recommends <code class="language-plaintext highlighter-rouge">temperature=1.0</code>, <code class="language-plaintext highlighter-rouge">top_p=0.95</code>, <code class="language-plaintext highlighter-rouge">top_k=64</code> across all models.</li>
  <li><strong>Thinking Mode:</strong> Activated by placing a <code class="language-plaintext highlighter-rouge">&lt;|think|&gt;</code> token in the system prompt. The model outputs reasoning in <code class="language-plaintext highlighter-rouge">&lt;|channel&gt;thought\n ... &lt;channel|&gt;</code> tags. In multi-turn conversations, historical thoughts must be stripped — only final answers remain in context history.</li>
  <li><strong>Modality Order:</strong> Image and audio soft tokens should always be placed <em>before</em> the text prompt. For PLE conditioning on multimodal inputs, audio/image positions use the <code class="language-plaintext highlighter-rouge">pad</code> token ID, passing neutral per-layer signals.</li>
</ul>]]></content><author><name>Jianyu Huang</name><email>jianyu0huang@gmail.com</email></author><category term="Architecture" /><category term="Architecture" /><category term="Multimodal" /><category term="MoE" /><category term="Google" /><summary type="html"><![CDATA[Reading notes on Gemma 4 Model Card.]]></summary></entry><entry><title type="html">Residual Matrix Transformers – Scaling the Residual Stream</title><link href="https://jianyuh.github.io/architecture/2026/04/03/RMT.html" rel="alternate" type="text/html" title="Residual Matrix Transformers – Scaling the Residual Stream" /><published>2026-04-03T00:00:00+00:00</published><updated>2026-04-03T00:00:00+00:00</updated><id>https://jianyuh.github.io/architecture/2026/04/03/RMT</id><content type="html" xml:base="https://jianyuh.github.io/architecture/2026/04/03/RMT.html"><![CDATA[<p>Paper: <a href="https://arxiv.org/abs/2505.14550">Residual Matrix Transformers: Scaling the Size of the Residual Stream</a>.</p>

<h2 id="core-motivation">Core Motivation</h2>

<p>The current paradigm of LLM scaling (Kaplan et al., 2020) relies heavily on expanding model size, data, and compute. However, the AI field is rapidly approaching the physical limits of available data and energy. While sparse modifications like Mixture of Experts (MoE) scale parameters without scaling per-example compute, the <strong>Residual Matrix Transformer (RMT)</strong> proposes an entirely new axis for scaling: <strong>the residual stream size</strong>.</p>

<p>In a standard transformer, the residual stream is a vector of dimension $D$ that acts as a memory bus across layers. Scaling $D$ linearly scales the size of all parameter matrices, inflating parameter and FLOP counts quadratically. The RMT solves this by replacing the residual stream vector with an <strong>outer product memory matrix</strong>, decoupling the residual stream’s bandwidth from the model’s compute and parameter footprint.</p>

<h2 id="mathematical-framework-outer-product-memory">Mathematical Framework: Outer Product Memory</h2>

<p>The architecture builds on outer product memory stores (Kohonen, 1972; Anderson, 1972).</p>

<p>Given a set of $N$ key vectors $q^{(p)} \in \mathbb{R}^{D_k}$ and data vectors $x^{(p)} \in \mathbb{R}^{D_v}$ for $p = 1, \ldots, N$, an outer product store $M \in \mathbb{R}^{D_k \times D_v}$ is constructed by summing their outer products:</p>

\[M = \text{Norm}\left(\sum_{p=1}^N q^{(p)} \otimes x^{(p)}\right)\]

<p>where $u \otimes v = uv^T$, and Norm is LayerNorm.</p>

<p>To retrieve a specific data vector $x^{(r)}$ from $M$, we perform a tensor contraction over the first dimension using its associated key vector $q^{(r)}$:</p>

\[x^{(r)} \approx q^{(r)} \cdot_1 M\]
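
<p>A quick numeric sanity check of these two equations (LayerNorm omitted for clarity; retrieval is exact when the key vectors are orthonormal and only approximate otherwise):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

rng = np.random.default_rng(0)
N, Dk, Dv = 8, 64, 32
Q = np.linalg.qr(rng.standard_normal((Dk, N)))[0].T        # N orthonormal keys, shape (N, Dk)
X = rng.standard_normal((N, Dv))                           # data vectors

M = sum(np.outer(Q[p], X[p]) for p in range(N))            # M = sum_p q^(p) (x^(p))^T
x_hat = Q[3] @ M                                           # contract over the first dimension
print(np.allclose(x_hat, X[3]))                            # True for orthonormal keys
</code></pre></div></div>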

<h2 id="rmt-architecture">RMT Architecture</h2>

<p>In the RMT, the batched residual stream for $N$ tokens is represented as a tensor $X \in \mathbb{R}^{D_k \times D_v \times N}$, rather than the standard matrix $X \in \mathbb{R}^{D \times N}$.</p>

<h3 id="attention-layer">Attention Layer</h3>

<p>In a standard transformer, features are retrieved using linear transformations (e.g., $Q^{(h)} = W_Q^{(h)} X$). In the RMT, the expensive $W_Q, W_K, W_V \in \mathbb{R}^{D_h \times D}$ weight matrices are <strong>completely removed and replaced by learned key vectors</strong> $r_Q^{(h)}, r_K^{(h)}, r_V^{(h)} \in \mathbb{R}^{D_k}$.</p>

<p>The attention inputs are retrieved via tensor contraction:</p>

\[Q^{(h)} = r_Q^{(h)} \cdot_1 X\]

\[K^{(h)} = r_K^{(h)} \cdot_1 X\]

\[V^{(h)} = r_V^{(h)} \cdot_1 X\]

<p>These resulting matrices belong to $\mathbb{R}^{D_v \times N}$ (where $D_v$ acts as the attention head dimension $D_h$). Standard Head Attention (SHA) is applied normally, and the output is written back into the residual matrix using an output key vector $w_O^{(h)} \in \mathbb{R}^{D_k}$:</p>

\[MHA(X) = \sum_{h=1}^R w_O^{(h)} \otimes SHA(Q^{(h)}, K^{(h)}, V^{(h)})\]
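
<p>The following sketch shows how these contractions look in code for a single RMT attention layer (shapes follow the notation above; initialization is arbitrary and the causal mask is omitted for brevity):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

Dk, Dv, N, R = 64, 128, 16, 8                              # R = number of heads
X = np.random.randn(Dk, Dv, N)                             # residual tensor
r_Q, r_K, r_V, w_O = (np.random.randn(R, Dk) for _ in range(4))

def sha(Q, K, V):                                          # standard head attention
    A = (Q.T @ K) / np.sqrt(Q.shape[0])                    # (N, N) scores; no causal mask
    A = np.exp(A - A.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)
    return V @ A.T                                         # (Dv, N)

out = np.zeros_like(X)
for h in range(R):
    Qh = np.einsum("k,kvn-&gt;vn", r_Q[h], X)                 # retrieve via tensor contraction
    Kh = np.einsum("k,kvn-&gt;vn", r_K[h], X)
    Vh = np.einsum("k,kvn-&gt;vn", r_V[h], X)
    out += np.einsum("k,vn-&gt;kvn", w_O[h], sha(Qh, Kh, Vh)) # write back via outer product
</code></pre></div></div>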

<h3 id="feed-forward-layer-ffn">Feed-Forward Layer (FFN)</h3>

<p>Unlike attention, the FFN retains its standard linear transformations $W_1$ and $W_2$, as evidence suggests these matrices store critical factual information rather than just routing it. The RMT uses key vector “adapters”: it retrieves $R$ data vectors from the matrix, concatenates them for the standard FFN operation, and then splits (un-vecs) the output to store it back into the residual matrix via outer products with $w_{FF}^{(h)}$.</p>

<h2 id="variance-propagation">Variance Propagation</h2>

<p>For deep networks to initialize and train stably, the mean and variance of activations and gradients must propagate effectively (Glorot &amp; Bengio, 2010). The paper provides a closed-form derivation proving that outer product storage and retrieval maintain healthy variance.</p>

<p>Let the forward storage operation for a single token be $X_{out} = \sum_{h=1}^R w^{(h)} \otimes x_{in}^{(h)}$, where weights are initialized independently with mean 0.</p>

\[E[X_{out, ij}] = \sum_{h=1}^R E[w_i^{(h)}] E[x_{in, j}^{(h)}] = 0\]

<p>The variance propagates as:</p>

\[Var(X_{out, ij}) = \sum_{h=1}^R Var\left(w_i^{(h)} x_{in, j}^{(h)}\right)\]

<p>Assuming independence and $\mu_w = 0$:</p>

\[Var(X_{out, ij}) = R \sigma_w^2 (\sigma_{x_{in}}^2 + \mu_{x_{in}}^2)\]

<p>By choosing standard initialization dimensions, the ratio $\frac{\sigma_{x_{out}}^2}{\sigma_{x_{in}}^2}$ can be kept close to 1, avoiding vanishing or exploding gradients.</p>
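
<p>A quick Monte Carlo check of this expression (a sanity check of the formula, not an experiment from the paper):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

rng = np.random.default_rng(0)
R, Dk, Dv = 8, 64, 64
var_w, var_x, mu_x = 1.0 / R, 1.0, 0.3

w = rng.normal(0.0, np.sqrt(var_w), size=(R, Dk))
x = rng.normal(mu_x, np.sqrt(var_x), size=(R, Dv))
X_out = sum(np.outer(w[h], x[h]) for h in range(R))

# Empirical vs. predicted variance: both should land roughly at 1.09.
print(X_out.var(), R * var_w * (var_x + mu_x**2))
</code></pre></div></div>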

<h2 id="scaling-performance">Scaling Performance</h2>

<p>The replacement of standard weight matrices with vectors transforms the economics of the network:</p>

<ul>
  <li><strong>Cost of Scaling:</strong> Increasing the residual stream size by 100% in a standard transformer yields ~94% more FLOPs and 100% more parameters. In the RMT, increasing residual matrix capacity ($D_k$) by 100% increases both parameters and FLOPs by <strong>&lt; 1%</strong>.</li>
  <li><strong>Efficiency:</strong> To reach the identical target loss, the RMT uses <strong>58% fewer FLOPs, 25% fewer parameters, and 41% fewer training tokens</strong> compared to a Chinchilla-optimal baseline transformer.</li>
  <li><strong>Zero-Shot Dominance:</strong> Evaluated on LAMBADA, PIQA, and ARC, an RMT trained on 28% fewer FLOPs outperformed a standard transformer that was 33% larger.</li>
  <li><strong>“Free” Scaling Axis:</strong> When holding parameter count, dataset size, and compute budget constant, expanding the residual stream size $D_k$ monotonically decreased validation loss.</li>
</ul>

<h2 id="caveats">Caveats</h2>

<ul>
  <li><strong>Memory:</strong> Roughly equivalent during training (larger residual activations for gradient checkpointing offset by fewer model parameters), but strictly more efficient during inference.</li>
  <li><strong>Wall-Clock Time:</strong> In current PyTorch/JAX ecosystems, highly optimized GEMM kernels run significantly faster than unoptimized tensor contractions. Despite needing dramatically fewer FLOPs, the RMT takes ~43% longer per training step. A custom CUDA kernel for contracting over small key vectors could bridge this gap.</li>
</ul>

<h2 id="summary">Summary</h2>

<p>The RMT unlocks a new “free” scaling dimension by replacing the residual stream vector with an outer product memory matrix. Weight matrices in attention are replaced by learned key vectors, making residual stream scaling nearly cost-free in parameters and FLOPs. This promises significant reductions in the energy and data required for frontier LLM training, pending kernel-level optimization to close the wall-clock gap.</p>]]></content><author><name>Jianyu Huang</name><email>jianyu0huang@gmail.com</email></author><category term="Architecture" /><category term="Architecture" /><category term="Residual" /><summary type="html"><![CDATA[Paper: Residual Matrix Transformers: Scaling the Size of the Residual Stream.]]></summary></entry><entry><title type="html">Nvidia Inference: Disaggregated Decode, LPU Integration, and Datacenter Macro-Architectures</title><link href="https://jianyuh.github.io/ai/2026/03/30/nvidia-inference.html" rel="alternate" type="text/html" title="Nvidia Inference: Disaggregated Decode, LPU Integration, and Datacenter Macro-Architectures" /><published>2026-03-30T00:00:00+00:00</published><updated>2026-03-30T00:00:00+00:00</updated><id>https://jianyuh.github.io/ai/2026/03/30/nvidia-inference</id><content type="html" xml:base="https://jianyuh.github.io/ai/2026/03/30/nvidia-inference.html"><![CDATA[<p>Reading notes based on:</p>
<ul>
  <li><a href="https://newsletter.semianalysis.com/p/nvidia-the-inference-kingdom-expands">Nvidia – The Inference Kingdom Expands (SemiAnalysis)</a></li>
</ul>

<p>At GTC 2026, Nvidia aggressively expanded its hardware and inference ecosystem to address the emerging bottlenecks of memory-bound LLM decode phases, massive CPU demands in reinforcement learning, and KV cache storage limits. The company announced three entirely new system architectures: <strong>Groq LPX</strong> (integrating the LP30 chip), <strong>Vera ETL256</strong>, and <strong>STX</strong>, alongside major updates to the <strong>Kyber rack architecture (NVL144, NVL576, and NVL1152)</strong>.</p>

<hr />

<h2 id="1-the-groq-acquisition-and-lp30-architecture">1. The Groq “Acquisition” and LP30 Architecture</h2>

<p>Nvidia functionally acquired Groq via a $20B IP licensing and acqui-hire deal, sidestepping a drawn-out antitrust review. Groq’s hardware, specifically the LPU (Language Processing Unit), is used to build a disaggregated decode system.</p>

<h3 id="lp30-groq-3-lpu-hardware-details">LP30 (Groq 3 LPU) Hardware Details</h3>

<ul>
  <li><strong>Silicon &amp; Manufacturing:</strong> Designed on <strong>Samsung’s SF4X node</strong>, skipping the failed LPU 2 which suffered from 112G SerDes malfunctions. Using Samsung SF4X allows Nvidia to bypass TSMC N3 logic constraints and HBM allocation constraints, enabling incremental revenue scale-up.</li>
  <li><strong>Memory Hierarchy:</strong> Features a single-level memory hierarchy with <strong>500MB of on-chip SRAM</strong> (up from 230MB in Gen 1) providing an ultra-fast <strong>150 TB/s memory bandwidth</strong>. It lacks HBM entirely.</li>
  <li><strong>Compute:</strong> Dedicated to tensor-first compute with <strong>1.2 PFLOPs of FP8</strong>, which is a fraction of standard GPU compute but highly optimized for deterministic execution. The chip pumps instructions vertically and streams data horizontally across functional slices (VXM, MEM, SXM, MXM).</li>
  <li><strong>Form Factor:</strong> Deployed in the <strong>LPX Compute Tray</strong>, featuring a belly-to-belly PCB design (8 LPUs on top, 8 on bottom) to minimize X/Y trace distances, alongside <strong>2 Altera “Fabric Expansion Logic” FPGAs</strong>, 1 Intel Granite Rapids CPU, and a BlueField-4 module.</li>
</ul>

<hr />

<h2 id="2-decoding-acceleration-techniques">2. Decoding Acceleration Techniques</h2>

<p>The integration of the LPU is designed to accelerate the <strong>latency-sensitive, memory-bounded decode phase</strong> of LLM inference, leaving the <strong>compute-intensive prefill phase to GPUs</strong>.</p>

<h3 id="attention-ffn-disaggregation-afd">Attention-FFN Disaggregation (AFD)</h3>

<p><strong>The Problem:</strong> During decode, GPU utilization for the Attention mechanism barely improves as batch sizes scale because it is bounded by loading KV cache. Conversely, Feed Forward Network (FFN) utilization scales effectively with larger batch sizes. In state-of-the-art sparse Mixture-of-Expert (MoE) models, utilization drops further as tokens route to a larger pool of experts.</p>

<p><strong>The Solution:</strong> Attention operations are <strong>stateful</strong> (relying on dynamic KV cache) and are thus mapped to HBM-heavy Rubin GPUs. FFN operations are <strong>stateless</strong> (depending only on token inputs) and are mapped to the SRAM-heavy, deterministic LPUs.</p>

<p><strong>Network Optimization:</strong> To hide the communication latency of dispatching tokens from GPU to LPU experts and combining them back, the system relies on <strong>ping-pong pipeline parallelism</strong>, allowing tokens to continuously bounce between GPUs and LPUs over Spectrum-X Ethernet.</p>

<p>The key insight here is a further decomposition beyond <a href="/llm%20inference/2025/03/30/prefill-decoding-disagg.html">prefill-decode disaggregation</a>. Within the decode phase itself, attention and FFN have fundamentally different computational profiles:</p>

<table>
  <thead>
    <tr>
      <th>Property</th>
      <th>Attention (Decode)</th>
      <th>FFN (Decode)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>State</td>
      <td>Stateful (KV cache)</td>
      <td>Stateless</td>
    </tr>
    <tr>
      <td>Bottleneck</td>
      <td>Memory bandwidth</td>
      <td>Compute</td>
    </tr>
    <tr>
      <td>Batch scaling</td>
      <td>Poor (KV-bound)</td>
      <td>Good</td>
    </tr>
    <tr>
      <td>Best hardware</td>
      <td>HBM-heavy GPU</td>
      <td>SRAM-heavy LPU</td>
    </tr>
  </tbody>
</table>
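
<p>To make the batch-scaling asymmetry concrete, here is a toy decode-step model with made-up numbers (illustrative only, not measurements of any real system): attention time grows linearly with batch size because every sequence streams its own KV cache, while FFN time stays roughly flat until the batch is large enough to become compute-bound.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>HBM_BW = 8e12                  # bytes/s per GPU (illustrative)
FLOPS = 2e15                   # FP8 FLOPs/s (illustrative)
KV_BYTES_PER_SEQ = 2e9         # KV cache streamed per decode step, per sequence
FFN_WEIGHT_BYTES = 20e9        # FFN weights streamed once per decode step
FFN_FLOPS_PER_TOKEN = 40e9

for batch in (1, 8, 64, 256):
    attn_ms = batch * KV_BYTES_PER_SEQ / HBM_BW * 1e3      # each sequence reads its own KV
    ffn_ms = max(FFN_WEIGHT_BYTES / HBM_BW,                # weights amortized over the batch
                 batch * FFN_FLOPS_PER_TOKEN / FLOPS) * 1e3
    print(f"batch={batch:4d}: attention {attn_ms:7.2f} ms, FFN {ffn_ms:5.2f} ms")
</code></pre></div></div>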

<h3 id="speculative-decoding--memory-management">Speculative Decoding &amp; Memory Management</h3>

<ul>
  <li>LPUs can host draft models or Multi-Token Prediction (MTP) layers to predict $k$ new tokens, which the main model verifies in a single “warm prefill” step.</li>
  <li>Unlike stateless FFNs, draft models/MTP layers require tens of gigabytes of dynamic KV cache. To support this, <strong>the Altera FPGAs provide up to 256GB of additional DDR5 memory per FPGA</strong> for the LPUs to access.</li>
</ul>

<p>This is a natural evolution of the <a href="/llm/inference/speculative%20decoding/2024/12/15/speculative-decoding.html">speculative decoding</a> paradigm: instead of running the draft model on the same GPU as the verifier, offloading it to a dedicated LPU eliminates the resource contention entirely.</p>

<hr />

<h2 id="3-networking-topologies-and-bandwidth-math">3. Networking Topologies and Bandwidth Math</h2>

<p>Nvidia’s systems push the physical limits of copper to keep TCO down, orchestrating incredibly dense electrical networks before resorting to optical interconnects.</p>

<h3 id="lpx-rack-network-math">LPX Rack Network Math</h3>

<ul>
  <li><strong>Intra-Tray:</strong> 16 LPUs connect via an all-to-all PCB mesh. Each LPU routes to 15 others via $4 \times 100G$ C2C links.</li>
  <li><strong>Intra-Rack (Inter-Node):</strong> Each LPU routes $2 \times 100G$ to 15 other nodes. With FPGAs connecting at 25G/50G, a node features 1,020 differential pairs. Across 16 nodes, the <strong>copper backplane supports 8,160 differential pairs</strong> ($16 \times 1020 / 2$).</li>
  <li><strong>Total intra-rack scale-up bandwidth:</strong></li>
</ul>

\[\text{BW} = 256 \text{ LPUs} \times 90 \text{ lanes} \times 112\text{ Gbps} / 8 \times 2 \text{ directions} = 645 \text{ TB/s}\]
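
<p>Spelled out as a quick calculation, the quoted figure follows directly from the lane counts above:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Reproducing the intra-rack scale-up bandwidth arithmetic quoted above.
lpus, lanes, gbps, directions = 256, 90, 112, 2
gb_per_s = lpus * lanes * gbps / 8 * directions    # divide by 8 to convert Gbit to GB
print(f"{gb_per_s / 1000:.0f} TB/s")               # ~645 TB/s
</code></pre></div></div>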

<h3 id="rubin-ultra-nvl144-kyber-rack-scale-up-math">Rubin Ultra NVL144 (Kyber Rack) Scale-up Math</h3>

<p>The Kyber rack fits 144 Rubin Ultra GPUs and 72 NVLink 7 switches.</p>

<ul>
  <li><strong>GPU Bandwidth:</strong> Each GPU uses 72 Differential Pairs (DPs). At $200\text{ Gbit/s bi-di}$ per channel, each GPU achieves <strong>14.4 Tbit/s uni-directional</strong> scale-up bandwidth.</li>
  <li><strong>Switch Bandwidth:</strong> Each NVSwitch 7 uses 144 lanes of 200G, totaling <strong>28.8 Tbit/s uni-directional</strong> bandwidth. Connecting these requires midplanes and copper flyover cables.</li>
  <li><strong>NVL288 Constraints:</strong> Scaling to 288 GPUs across two racks via copper would require 20,736 additional DPs, acting as a massive upper bound on cable content, unless higher radix switches are introduced.</li>
</ul>

<hr />

<h2 id="4-co-packaged-optics-cpo-vs-copper-roadmap">4. Co-Packaged Optics (CPO) vs. Copper Roadmap</h2>

<p>A key architectural insight from GTC 2026: <strong>Nvidia uses copper where it can, and optics where it must</strong>.</p>

<table>
  <thead>
    <tr>
      <th>System</th>
      <th>Generation</th>
      <th>Scale-up Interconnect</th>
      <th>CPO?</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>NVL72</td>
      <td>Rubin</td>
      <td>Intra-rack copper</td>
      <td>No</td>
    </tr>
    <tr>
      <td>NVL144</td>
      <td>Rubin Ultra</td>
      <td>Intra-rack copper (Kyber)</td>
      <td>No</td>
    </tr>
    <tr>
      <td>NVL576</td>
      <td>Rubin Ultra</td>
      <td>Intra-rack copper + inter-rack CPO</td>
      <td>Partial</td>
    </tr>
    <tr>
      <td>NVL1152</td>
      <td>Feynman</td>
      <td>Full rack-to-rack CPO</td>
      <td>Yes</td>
    </tr>
  </tbody>
</table>

<ul>
  <li><strong>Rubin / Rubin Ultra:</strong> Scale-up within NVL72 and NVL144 Kyber racks remains strictly copper.</li>
  <li><strong>NVL576 (Rubin Ultra):</strong> This 8-rack system will be the first introduction of <strong>CPO scale-up</strong>, utilizing a two-tier all-to-all network between racks, though intra-rack networking stays copper.</li>
  <li><strong>Feynman NVL1152:</strong> Will fully adopt CPO for rack-to-rack scale-up, overcoming the physical reach/shoreline limits of bumping electrical SerDes from 224G to 448G.</li>
</ul>

<p>The transition is driven by physics: as SerDes rates double, the reach of copper decreases and the shoreline (pin count per die edge) becomes the binding constraint. CPO sidesteps both by converting electrical signals to photons at the package boundary.</p>

<hr />

<h2 id="5-ancillary-infrastructure-vera-etl256-and-stx">5. Ancillary Infrastructure: Vera ETL256 and STX</h2>

<p>To prevent non-GPU components from bottlenecking system performance, Nvidia released auxiliary racks:</p>

<h3 id="vera-etl256">Vera ETL256</h3>

<p>A standalone, liquid-cooled rack packing <strong>256 Vera CPUs</strong> to handle the surging preprocessing and simulation demands of Reinforcement Learning workloads. It utilizes a single-tier Spectrum-6 multiplane topology.</p>

<p>This reflects a growing reality: RL training pipelines (reward model evaluation, environment simulation, data preprocessing) impose enormous CPU demands that steal GPU cycles if co-located. Dedicated CPU racks eliminate this contention.</p>

<h3 id="cmx-and-stx-context-memory-storage">CMX and STX (Context Memory Storage)</h3>

<p>To combat the exponential growth of KV Cache, Nvidia introduced <strong>Tier G3.5 NVMe storage</strong>. The STX reference rack utilizes <strong>BlueField-4 DPUs</strong> (featuring a Vera CPU, 2x CX-9 NICs, and 2x SOCAMM modules) to offload “warm” KV cache from expensive GPU HBM and system DRAM, optimizing inference efficiency.</p>

<p>This creates a multi-tier memory hierarchy for KV cache:</p>

<table>
  <thead>
    <tr>
      <th>Tier</th>
      <th>Medium</th>
      <th>Capacity</th>
      <th>Bandwidth</th>
      <th>Use Case</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Hot</td>
      <td>GPU HBM</td>
      <td>~192 GB/GPU</td>
      <td>~8 TB/s</td>
      <td>Active decoding</td>
    </tr>
    <tr>
      <td>Warm</td>
      <td>System DRAM / FPGA DDR5</td>
      <td>~512 GB–1 TB</td>
      <td>~200-400 GB/s</td>
      <td>Recent context, draft models</td>
    </tr>
    <tr>
      <td>Cold</td>
      <td>NVMe (STX)</td>
      <td>Multi-TB</td>
      <td>~50-100 GB/s</td>
      <td>Long context, session persistence</td>
    </tr>
  </tbody>
</table>

<hr />

<h2 id="6-strategic-synthesis">6. Strategic Synthesis</h2>

<p>Nvidia is fundamentally transitioning from selling standalone AI accelerators to orchestrating entire <strong>datacenter macro-architectures</strong>. Several strategic threads emerge:</p>

<ol>
  <li>
    <p><strong>Supply chain arbitrage:</strong> By disaggregating decode tasks to SRAM-heavy LPUs manufactured on Samsung’s SF4X (not TSMC N3), Nvidia bypasses critical industry supply constraints (TSMC N3 capacity and HBM allocation) while preserving high-margin GPU allocations strictly for compute-heavy prefill.</p>
  </li>
  <li>
    <p><strong>Copper maximalism:</strong> The aggressive densification of copper flyover cables and midplanes in Kyber and LPX architectures proves that TCO optimization remains paramount. Nvidia delays the transition to expensive CPO interconnects until physical electrical bounds absolutely mandate it in the Feynman generation.</p>
  </li>
  <li>
    <p><strong>Full-stack lock-in:</strong> With dedicated GPU racks (Kyber), CPU racks (Vera ETL256), storage racks (STX), and decode accelerator racks (LPX), plus the networking fabric (Spectrum-X, NVLink 7) tying them together, Nvidia is selling complete datacenter blueprints rather than individual chips.</p>
  </li>
  <li>
    <p><strong>Inference economics:</strong> The LPU integration directly addresses the economic pain of decode. During decode, GPUs are massively underutilized on compute but bottlenecked on memory bandwidth. Offloading FFN to cheap, deterministic LPUs improves the cost-per-token by utilizing the right silicon for the right workload.</p>
  </li>
</ol>]]></content><author><name>Jianyu Huang</name><email>jianyu0huang@gmail.com</email></author><category term="AI" /><category term="NVIDIA" /><category term="GTC" /><category term="Inference" /><category term="LPU" /><category term="Groq" /><category term="Rubin" /><category term="NVLink" /><category term="SemiAnalysis" /><summary type="html"><![CDATA[Reading notes based on: Nvidia – The Inference Kingdom Expands (SemiAnalysis)]]></summary></entry><entry><title type="html">Appointment with Spring: Our Seventh Year Among the Blossoms</title><link href="https://jianyuh.github.io/life/non-tech/2026/03/29/cherry-blossom.html" rel="alternate" type="text/html" title="Appointment with Spring: Our Seventh Year Among the Blossoms" /><published>2026-03-29T00:00:00+00:00</published><updated>2026-03-29T00:00:00+00:00</updated><id>https://jianyuh.github.io/life/non-tech/2026/03/29/cherry-blossom</id><content type="html" xml:base="https://jianyuh.github.io/life/non-tech/2026/03/29/cherry-blossom.html"><![CDATA[<blockquote>
  <p>“Year after year, the flowers bloom the same; year after year, the people are not those of old.” (年年岁岁花相似，岁岁年年人不同)</p>
</blockquote>

<p><img src="/assets/images/cherry_blossom.png" alt="Cherry Blossoms" /></p>

<p>My first encounter with these blossoms was in March 2020. My wife had just begun her doctoral studies, and the world was tilting into the surreal uncertainty of a global pandemic. I remember standing together under the pale canopy, our faces hidden behind masks, breathless at the sheer scale of the bloom. Even amidst the collective shadow of that year, the vitality of the trees felt like a promise—a glimmer of hope for a new chapter. Back then, I was still living a life in transit, commuting long distances for work. Those swirling petals felt like a fairy tale, a gentle welcome to the city we had decided to call home.</p>

<p>In the years that followed, the spring bloom became our annual pilgrimage. Each passing season brought new friends into our circle and saw our lives grow more settled and rooted. My phone turned into a digital archive of this evolution, filled with photos of my wife framed by those same pink clouds, year after year—until the masks were finally put away and the world moved on.</p>

<p>Then, life bloomed in a more literal sense. Two years ago, in the heart of spring, our son was born at a nearby hospital. When he was just two weeks old, the blossoms reached their peak. Though my wife was still navigating the fragile days of recovery, we walked together with my parents to capture a family portrait under the trees. Our son was a tiny, drowsy observer, taking in the bustling crowds with the wide, uncomprehending eyes of a newborn.</p>

<p>A year later, the scene shifted. He was a sturdy one-year-old in a blue sweater vest, sporting a look of adorable reluctance as we tried to capture the perfect photo. He had just learned to walk—steadying himself with our hands, stumbling, and picking himself back up. Though he couldn’t speak yet, he pointed at the falling petals with sheer excitement, narrating his joy in a language of babbles.</p>

<p>Now, at two years old, the trip to the grove has become a sacred family ritual—a living yardstick of our growth. This year, his excitement began the moment we left the house; he knew exactly where we were headed. Once we arrived, the roles reversed: instead of carrying him, we were the ones giving chase. He ran through the grass, his eyes darting from the crowds to the trees, then up to the drones and planes roaring in the distance. To him, the world is a gallery of wonders. I hope these moments settle into the foundation of his memory—a subconscious blueprint of beauty he can carry forever.</p>

<p>I do not know how much longer our path will keep us in this city. Life may eventually take us elsewhere, and a blooming season like this might one day become a rare luxury. But for now, I deeply cherish this seventh year.</p>

<p>A flowering tree can live for over a century. To the cherry blossoms, I am merely one of a million passing shadows in their long lives. But to me, they are a permanent part of my soul’s geography—a witness to the years we grew, the years we loved, and the beautiful, fleeting gift of being alive.</p>]]></content><author><name>Jianyu Huang</name><email>jianyu0huang@gmail.com</email></author><category term="Life" /><category term="Non-tech" /><category term="Parenting" /><category term="Reflection" /><category term="Memoir" /><category term="Spring" /><summary type="html"><![CDATA[“Year after year, the flowers bloom the same; year after year, the people are not those of old.” (年年岁岁花相似，岁岁年年人不同)]]></summary></entry><entry><title type="html">Attention Residuals (AttnRes) – Generalizing Depth-wise Information Flow in LLMs</title><link href="https://jianyuh.github.io/residual/2026/03/16/attention-residuals.html" rel="alternate" type="text/html" title="Attention Residuals (AttnRes) – Generalizing Depth-wise Information Flow in LLMs" /><published>2026-03-16T00:00:00+00:00</published><updated>2026-03-16T00:00:00+00:00</updated><id>https://jianyuh.github.io/residual/2026/03/16/attention-residuals</id><content type="html" xml:base="https://jianyuh.github.io/residual/2026/03/16/attention-residuals.html"><![CDATA[<p>Paper: <a href="https://github.com/MoonshotAI/Attention-Residuals/blob/master/Attention_Residuals.pdf">Attention Residuals</a></p>

<p>This paper asks a simple question: if attention replaced recurrence along the sequence dimension, why are we still using a fixed additive recurrence along the depth dimension?</p>

<h2 id="motivation-residuals-behave-like-depth-wise-recurrence">Motivation: Residuals Behave Like Depth-Wise Recurrence</h2>

<p>Standard residual connections can be written as:</p>

\[h^l = h^{l-1} + f_{l-1}(h^{l-1})\]

<p>Unrolling over depth gives:</p>

\[h^l = h^1 + \sum_{i=1}^{l-1} f_i(h^i)\]

<p>That view makes the paper’s core observation clear: every layer sees a uniformly weighted sum of all earlier layer updates. The authors call this a <strong>time-depth duality</strong>. In an RNN, all past tokens are compressed into a single hidden state over time. In a Transformer with standard residuals, all past layer outputs are compressed into a single hidden state over depth.</p>

<p>The paper argues that this becomes especially problematic in PreNorm LLMs. Because the residual stream keeps accumulating unweighted updates, hidden-state magnitudes tend to grow with depth, leading to <strong>PreNorm dilution</strong>. Later layers then have to produce larger and larger updates just to maintain the same influence on the final representation.</p>

<h2 id="full-attention-residuals">Full Attention Residuals</h2>

<p>The proposed fix is to replace fixed accumulation with learned softmax attention over previous layers:</p>

\[h^l = \sum_{i=0}^{l-1} \alpha_{i \to l} \cdot v_i\]

<p>Here, the values $v_i$ are the token embedding ($v_0$) and the outputs of the prior layers, while the attention weights are:</p>

\[\alpha_{i \to l} = \frac{\phi(q^l, k_i)}{\sum_{j=0}^{l-1} \phi(q^l, k_j)}\]

<p>Instead of deriving the query from the current hidden state, each layer uses a learned parameter vector:</p>

\[q^l = w^l \in \mathbb{R}^d\]

<p>This detail matters. The query is static, so depth-wise attention can still be computed in parallel across layers. The keys are RMS-normalized before scoring,</p>

\[\phi(q, k) = \exp(q^\top \mathrm{RMSNorm}(k))\]

<p>which prevents large-magnitude activations from dominating the attention distribution.</p>

<p>Conceptually, Full AttnRes turns the residual path into a learned depth mixer. A layer is no longer forced to treat all earlier layers equally; it can emphasize whichever layers are most useful.</p>
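
<p>A per-token sketch of this depth mixer (shapes and initialization are illustrative; the paper’s exact normalization and parameterization may differ in detail):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

def rms_norm(v, eps=1e-6):
    return v / np.sqrt(np.mean(v * v) + eps)

def attn_residual_input(values, w_l):
    """values: [v_0 (embedding), v_1, ..., v_{l-1}], each a d-dim vector.
    w_l: learned static query for layer l. Returns the mixed hidden state h^l."""
    scores = np.array([w_l @ rms_norm(v) for v in values])  # phi(q, k) = exp(q . RMSNorm(k))
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                                    # softmax over depth
    return sum(a * v for a, v in zip(alpha, values))

d, L = 64, 6
values = [np.random.randn(d) for _ in range(L)]             # embedding + 5 layer outputs
h_next = attn_residual_input(values, w_l=np.random.randn(d))
</code></pre></div></div>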

<h2 id="why-the-full-version-does-not-scale-cleanly">Why the Full Version Does Not Scale Cleanly</h2>

<p>From an arithmetic perspective, Full AttnRes is manageable. The paper quotes $O(L^2 d)$ work, which is acceptable for realistic layer counts. The real issue is systems cost: all previous layer outputs must remain available, so activation storage and cross-stage communication grow as $O(Ld)$. Under activation recomputation or pipeline parallelism, that becomes the bottleneck.</p>

<h2 id="block-attention-residuals">Block Attention Residuals</h2>

<p>To make the idea practical, the paper introduces <strong>Block AttnRes</strong>. The $L$ layers are partitioned into $N$ blocks of size $S$, and each block is summarized by a simple sum:</p>

\[b_n = \sum_{j \in B_n} f_j(h^j)\]

<p>A layer then attends to:</p>

<ul>
  <li>completed block summaries $b_0, b_1, \dots, b_{n-1}$</li>
  <li>the running partial sum of its current block</li>
</ul>

<p>This is the key compression step. Instead of exposing every earlier layer explicitly, the model exposes a smaller set of block summaries, which reduces memory and communication overhead from $O(Ld)$ to $O(Nd)$. The reported scaling results suggest that a small number of blocks, roughly $N \approx 8$, captures most of the gains of Full AttnRes.</p>
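
<p>A minimal sketch of the blockwise variant, reusing the same scoring form (again illustrative rather than a faithful reimplementation):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

def rms_norm(v, eps=1e-6):
    return v / np.sqrt(np.mean(v * v) + eps)

def block_attn_residual(layer_updates, block_size, w_l):
    """layer_updates: per-layer update vectors f_j(h^j) seen so far; the last
    block is the running partial sum of the current block."""
    blocks = [sum(layer_updates[s:s + block_size])
              for s in range(0, len(layer_updates), block_size)]
    scores = np.array([w_l @ rms_norm(b) for b in blocks])
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()
    return sum(a * b for a, b in zip(alpha, blocks))

d = 64
updates = [np.random.randn(d) for _ in range(13)]            # 13 layer updates so far
h = block_attn_residual(updates, block_size=4, w_l=np.random.randn(d))
</code></pre></div></div>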

<h2 id="training-and-inference-optimizations">Training and Inference Optimizations</h2>

<p>The paper also does the systems work needed to make Block AttnRes usable at scale.</p>

<p>For training, the main trick is <strong>cross-stage caching</strong> under pipeline parallelism. Rather than repeatedly sending already-seen block summaries across virtual stages, each rank caches what it has received and only transmits the incremental blocks. That avoids redundant communication and reduces the peak communication cost enough to overlap it with computation.</p>

<p>For inference, the authors use a <strong>two-phase computation</strong>:</p>

<ol>
  <li>compute inter-block attention for all layers in a block together</li>
  <li>compute intra-block attention sequentially and merge it with an online softmax</li>
</ol>

<p>Because the layer queries are static parameters, this organization lowers memory traffic substantially. The paper reports total residual-path memory I/O of about $5.5d$ reads per layer, compared with $3d$ for standard residuals and much higher cost for multi-stream alternatives such as mHC.</p>

<h2 id="residual-connections-as-a-mixing-matrix">Residual Connections as a Mixing Matrix</h2>

<p>One of the nicest parts of the paper is the matrix view. If we define a depth mixing matrix $M \in \mathbb{R}^{L \times L}$, where $M_{i \to l}$ is the weight assigned by layer $l$ to layer $i$’s output, several architectures fall into the same template:</p>

<ul>
  <li><strong>Standard residuals:</strong> an all-ones lower-triangular matrix</li>
  <li><strong>Highway networks:</strong> input-dependent scalar gates, but still low-complexity depth mixing</li>
  <li><strong>Multi-stream methods such as mHC:</strong> depth-wise linear attention with a higher-rank structured state</li>
  <li><strong>Attention residuals:</strong> full depth-wise softmax attention</li>
</ul>

<p>Under this lens, Block AttnRes smoothly interpolates between standard residuals and Full AttnRes by changing how many block summaries are exposed.</p>
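
<p>To make the template concrete, here is a toy sketch of the two extremes of that mixing matrix; the values are illustrative, not from the paper:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch

L = 6
# Standard residuals: layer l sees the unweighted sum of all earlier outputs,
# i.e. M[l, i] = 1 for every i at or below l, and 0 otherwise.
M_standard = torch.tril(torch.ones(L, L))

# Full AttnRes: each row becomes a learned softmax over the visible depth
# (random logits here, just to show the shape of the object).
logits = torch.randn(L, L)
mask = torch.tril(torch.ones(L, L)).bool()
M_attnres = torch.where(mask, logits, torch.tensor(float('-inf'))).softmax(dim=-1)
</code></pre></div></div>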

<h2 id="takeaways">Takeaways</h2>

<p>The empirical story is straightforward:</p>

<ul>
  <li><strong>PreNorm dilution is reduced.</strong> Activation magnitudes stay bounded more cleanly, and gradients are distributed more evenly over depth.</li>
  <li><strong>The model learns nontrivial skip patterns.</strong> Attention maps remain strongly local, but deeper layers sometimes jump back to very early layers or even the embedding.</li>
  <li><strong>The preferred architecture shifts.</strong> In the paper’s iso-compute and iso-parameter sweeps, AttnRes favors deeper, narrower models than standard residual designs do.</li>
</ul>

<p>That last point is especially interesting. If depth is no longer handicapped by uniform residual accumulation, then making a model deeper becomes a better tradeoff than it is in a baseline Transformer.</p>

<p>My main takeaway is that this paper treats residual connections as an architectural choice rather than a fixed law. Standard residuals hard-code a very specific depth mixing rule: every earlier layer contributes equally through simple addition. AttnRes replaces that rule with learned attention, then introduces a blockwise approximation that keeps the idea deployable at scale.</p>

<p>Whether this becomes a standard recipe will depend on implementation complexity and robustness across model families, but the framing is strong: residual paths are not just optimization scaffolding, they are a depth-wise information routing mechanism that can be redesigned.</p>]]></content><author><name>Jianyu Huang</name><email>jianyu0huang@gmail.com</email></author><category term="Residual" /><category term="Residual" /><summary type="html"><![CDATA[Paper: Attention Residuals]]></summary></entry><entry><title type="html">Reunion: A Gentle Reconciliation with Time</title><link href="https://jianyuh.github.io/life/non-tech/2026/03/14/reunion.html" rel="alternate" type="text/html" title="Reunion: A Gentle Reconciliation with Time" /><published>2026-03-14T00:00:00+00:00</published><updated>2026-03-14T00:00:00+00:00</updated><id>https://jianyuh.github.io/life/non-tech/2026/03/14/reunion</id><content type="html" xml:base="https://jianyuh.github.io/life/non-tech/2026/03/14/reunion.html"><![CDATA[<p>It sometimes feels like twenty years of life are merely the brief, beautiful intervals between a few meaningful reunions.</p>

<p>Recently, I had the chance to catch up with an old high school classmate who had flown in from overseas for a conference. We both happened to be in the same city on a business trip. Looking back, our lives have always shared a curious, almost poetic symmetry: our mothers were high school classmates, and we spent our three years of high school together as friends and fellow students. In our small hometown school, we were both dedicated to our studies, often finding ourselves at the top of our class. They were undeniably brilliant, eventually heading off to a prestigious university. I, on the other hand, faced some setbacks during my entrance exams and followed a different path that, through a series of unexpected turns, led me to another university in the same city.</p>

<p>Our rare encounters during those university years now feel like warm memories from a past life—gathering near their campus to welcome a visiting friend, or taking a long walk through a historic park. After that, our lives became like two rays of light, diverging as we pursued our own careers, seemingly never to intersect again.</p>

<p>That changed a couple of years ago when they visited my city for work. We reunited over dinner for the first time in over a decade—a moment filled with excitement and the natural, quiet curiosity that comes with time. We had both started families by then. I remember us trying to explain our respective professional fields to each other, though our industries were so different it was a bit of a challenge to grasp the details. But little did we know, life was unfolding in parallel for us once again: shortly after that meeting, my first child was born, and just a few months later, they welcomed their first child into the world.</p>

<p>Our meeting this time around felt remarkably comfortable, filled with a quiet, steady ease. Sitting together over a warm meal, we realized it had been exactly twenty years since we first started high school together. Time truly flies. They have since pivoted into a new role in the tech space, and it was wonderful to see how sharp, curious, and passionate they remain about their work.</p>

<p>An old joke we’d shared online about introducing our kids someday took on a sense of genuine warmth over our meal. After dinner, we wandered into a toy store and bought two identical little plushies—sweet tokens of our friendship to bring back to our children. Our farewell was a classic modern mishap: their phone was nearly dead, and getting them back to their hotel became a bit of an adventure. I stayed with them until we got enough of a charge to book a ride, and I happily saw them off.</p>

<p>Watching their car disappear into the evening, I couldn’t help but smile. The spirited energy of our youth has finally been gently smoothed over by twenty years of living, learning, and growing. When old friends meet, we inevitably talk of the past, but it’s the present we truly cherish. I don’t know when our paths will cross next, but it reminded me of a simple truth: all we can do is treasure the wonderful people and the beautiful moments right in front of us.</p>

<p><img src="/assets/images/reunion.png" alt="Reunion" /></p>]]></content><author><name>Jianyu Huang</name><email>jianyu0huang@gmail.com</email></author><category term="Life" /><category term="Non-tech" /><category term="Reflection" /><category term="Memoir" /><category term="Reunion" /><summary type="html"><![CDATA[It sometimes feels like twenty years of life are merely the brief, beautiful intervals between a few meaningful reunions.]]></summary></entry><entry><title type="html">Scalable Training of Mixture-of-Experts Models with Megatron Core</title><link href="https://jianyuh.github.io/moe/2026/03/11/MoE-Megatron.html" rel="alternate" type="text/html" title="Scalable Training of Mixture-of-Experts Models with Megatron Core" /><published>2026-03-11T00:00:00+00:00</published><updated>2026-03-11T00:00:00+00:00</updated><id>https://jianyuh.github.io/moe/2026/03/11/MoE-Megatron</id><content type="html" xml:base="https://jianyuh.github.io/moe/2026/03/11/MoE-Megatron.html"><![CDATA[<p>Reading the following paper:</p>
<ul>
  <li><a href="https://arxiv.org/pdf/2603.07685">Scalable Training of Mixture-of-Experts Models with Megatron Core</a></li>
</ul>

<p>This paper presents <strong>Megatron-Core MoE</strong>, the MoE training stack within NVIDIA’s Megatron-Core framework. It addresses the fundamental systems challenges of training trillion-parameter-class MoE models at high throughput on NVIDIA GPU clusters. The headline results: <strong>1,233 TFLOPS/GPU</strong> on GB300 and <strong>1,048 TFLOPS/GPU</strong> on GB200 for DeepSeek-V3-685B.</p>

<hr />

<h2 id="1-moe-fundamentals">1. MoE Fundamentals</h2>

<h3 id="architecture">Architecture</h3>

<p>Given an input token representation <strong>x</strong>, the router computes:</p>

\[\mathbf{p}(\mathbf{x}) = \text{Softmax}(\mathbf{W}_r \mathbf{x})\]

<p>The MoE layer output is:</p>

\[\text{MoE}(\mathbf{x}) = \sum_{i \in \text{TopK}(\mathbf{p}(\mathbf{x}))} p_i(\mathbf{x}) \cdot E_i(\mathbf{x})\]

<p>where $E_i$ is the $i$-th expert network. Three key advantages:</p>
<ul>
  <li><strong>Scalable capacity</strong>: model size grows independently of per-token compute</li>
  <li><strong>Computational efficiency</strong>: only $K$ of $E$ experts activate per token</li>
  <li><strong>Specialization</strong>: different experts learn different input patterns</li>
</ul>
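
<p>A minimal sketch of this routing-and-combine step in PyTorch, with illustrative shapes and names rather than Megatron-Core’s actual implementation:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch
import torch.nn.functional as F

def moe_layer(x, W_r, experts, k):
    # x: token hidden states, shape (T, d); W_r: router weights, shape (E, d)
    # experts: list of E callables, each mapping a (n, d) batch to a (n, d) batch
    p = F.softmax(x @ W_r.t(), dim=-1)        # p(x) = Softmax(W_r x), shape (T, E)
    topk_p, topk_idx = p.topk(k, dim=-1)      # Top-K expert selection per token
    out = torch.zeros_like(x)
    for e, expert in enumerate(experts):
        rows, slots = (topk_idx == e).nonzero(as_tuple=True)
        if rows.numel() == 0:
            continue                          # no tokens routed to this expert
        out[rows] += topk_p[rows, slots].unsqueeze(-1) * expert(x[rows])
    return out
</code></pre></div></div>

<p>Real implementations never loop over experts like this; the dispatch, Grouped GEMM, and combine stages described below exist precisely to turn this loop into a few large, batched operations.</p>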

<h3 id="the-parameter-compute-mismatch">The Parameter-Compute Mismatch</h3>

<p>This is the central insight of the paper. In a dense transformer with $N_\text{total}$ parameters, FLOPs per token is approximately $6N_\text{total}$, so parameters and computation scale in lockstep.</p>

<p>For MoE, per-token computation is approximately $6N_\text{active}$ where $N_\text{active} \propto K$ while $N_\text{total} \propto E$, and $K \ll E$.</p>

<p><strong>Concrete example</strong>: DeepSeek-V3 has 685B total parameters but only 37B active per token, an <strong>18x gap</strong>.</p>

<p>This creates the <strong>Three Walls</strong>:</p>

<table>
  <thead>
    <tr>
      <th>Wall</th>
      <th>Root Cause</th>
      <th>Manifestation</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Memory</strong></td>
      <td>All $E$ experts’ params/grads/optimizer states in memory, but only $K$ activate</td>
      <td>199.5 GB per GPU for DeepSeek-V3</td>
    </tr>
    <tr>
      <td><strong>Communication</strong></td>
      <td>EP requires all-to-all collectives to route tokens across GPUs</td>
      <td>20-60% of training time</td>
    </tr>
    <tr>
      <td><strong>Compute</strong></td>
      <td>Small per-expert GEMMs underutilize Tensor Cores; many kernel launches</td>
      <td>Host-boundedness</td>
    </tr>
  </tbody>
</table>

<hr />

<h2 id="2-moe-layer-architecture-four-stage-forward-pass">2. MoE Layer Architecture: Four-Stage Forward Pass</h2>

<p><strong>Stage 1: Route.</strong> A learned linear projection maps each token’s hidden state to $E$ logits: $\mathbf{l} = \mathbf{W}_r^\top \mathbf{x} \in \mathbb{R}^E$. A score function, either softmax or sigmoid as used in DeepSeek-V3, converts these into probabilities. Top-$k$ selection chooses the active experts.</p>

<p><strong>Stage 2: Dispatch.</strong> Tokens are permuted so those destined for the same expert are contiguous. Three backends are described: AllGather, All-to-All, and Flex via DeepEP/HybridEP.</p>

<p><strong>Stage 3: Expert Computation.</strong> All local experts run in a single Grouped GEMM call. Each expert is a two-layer MLP with optional gating:
\(E_i(\mathbf{x}) = \mathbf{W}_2^{(i)} \phi(\mathbf{W}_1^{(i)} \mathbf{x})\)</p>

<p><strong>Stage 4: Combine.</strong> Inverse communication returns tokens, unpermutation restores order, and shared expert output is added.</p>

<hr />

<h2 id="3-parallel-folding-and-multi-dimensional-parallelism">3. Parallel Folding and Multi-Dimensional Parallelism</h2>

<h3 id="the-dense-sparse-mismatch">The Dense-Sparse Mismatch</h3>

<p>A single Transformer block contains two fundamentally different computation patterns:</p>

<table>
  <thead>
    <tr>
      <th>Aspect</th>
      <th>Attention (Dense)</th>
      <th>MoE (Sparse)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>TP</strong></td>
      <td>Large QKV matrices benefit from high TP</td>
      <td>Small per-expert dims make high TP counterproductive</td>
    </tr>
    <tr>
      <td><strong>CP</strong></td>
      <td>Long sequences benefit from high CP</td>
      <td>No sequence dependency; CP is irrelevant</td>
    </tr>
    <tr>
      <td><strong>EP</strong></td>
      <td>Not applicable</td>
      <td>Essential for distributing experts</td>
    </tr>
  </tbody>
</table>

<p>Prior frameworks forced <code class="language-plaintext highlighter-rouge">World Size = TP x CP x PP x DP</code>, where <code class="language-plaintext highlighter-rouge">EP &lt;= DP</code>. This creates three problems:</p>
<ol>
  <li><strong>Multiplicative GPU requirements</strong>: EP=8 forces DP&gt;=8, and with CP=8 the minimum becomes 64 GPUs.</li>
  <li><strong>Forced suboptimal parallelism</strong>: high TP fragments small experts, while low TP underparallelizes attention.</li>
  <li><strong>Cross-node communication</strong>: EP constrained within DP forces all-to-all across slow interconnects.</li>
</ol>

<h3 id="parallel-folding-solution">Parallel Folding Solution</h3>

<p><strong>Core idea</strong>: decouple attention and MoE parallelism mappings.</p>

<ul>
  <li><strong>Attention layers</strong> form groups over <code class="language-plaintext highlighter-rouge">TP x CP x DP x PP</code></li>
  <li><strong>MoE layers</strong> form groups over <code class="language-plaintext highlighter-rouge">ETP x EP x EDP x PP</code></li>
  <li><strong>Only constraint</strong>: PP must remain consistent</li>
</ul>

<p>Key benefits:</p>
<ol>
  <li><strong>Breaks EP &lt;= DP</strong>: EP can fold across TP x CP groups. Example: attention TP=4, CP=2, DP=8, PP=4 on 256 GPUs. Traditionally EP&lt;=8; with folding EP=64 becomes possible.</li>
  <li><strong>Reduces minimum GPUs</strong>: CP=8 and EP=8 traditionally requires 64 GPUs; with folding, only 8.</li>
  <li><strong>Independent optimization</strong>: attention uses high TP, while MoE uses ETP=1 for full expert width.</li>
  <li><strong>NVLink locality</strong>: both CP and EP all-to-all stay within the NVLink domain.</li>
</ol>

<h3 id="gradient-handling">Gradient Handling</h3>

<p>Expert gradients are scaled by <code class="language-plaintext highlighter-rouge">edp_size / dp_size</code> to account for the different effective batch sizes seen by experts versus dense layers.</p>

<hr />

<h2 id="4-breaking-the-memory-wall">4. Breaking the Memory Wall</h2>

<h3 id="memory-anatomy-deepseek-v3-pp4-x-vpp4-x-ep64-256-gpus">Memory Anatomy (DeepSeek-V3, PP4 x VPP4 x EP64, 256 GPUs)</h3>

<table>
  <thead>
    <tr>
      <th>Component</th>
      <th>Memory/GPU</th>
      <th>Optimization</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Weights &amp; Gradients</td>
      <td>36.4 GB</td>
      <td>PP, EP, TP sharding</td>
    </tr>
    <tr>
      <td>Main Weights &amp; Optimizer States</td>
      <td>32.1 GB</td>
      <td>Distributed optimizer, BF16 moments</td>
    </tr>
    <tr>
      <td>Activations</td>
      <td><strong>131.0 GB</strong></td>
      <td>Low precision, recomputation, offloading</td>
    </tr>
    <tr>
      <td><strong>Total</strong></td>
      <td><strong>199.5 GB</strong></td>
      <td> </td>
    </tr>
  </tbody>
</table>

<p><strong>Key insight</strong>: activations dominate, exceeding weights and optimizer states combined.</p>

<h3 id="memory-efficient-permutation-zero-overhead">Memory-Efficient Permutation (Zero Overhead)</h3>

<p>Standard formulation applies routing weights <strong>after</strong> expert computation:
\(y = \sum_{i \in \mathcal{T}(\mathbf{x})} p_i \cdot \mathbf{W}_2^{(i)} \phi(\mathbf{W}_1^{(i)} \mathbf{x})\)</p>

<p>Memory-efficient version absorbs $p_i$ <strong>before</strong> the second linear layer:
\(y = \sum_{i \in \mathcal{T}(\mathbf{x})} \mathbf{W}_2^{(i)} \left(p_i \cdot \phi(\mathbf{W}_1^{(i)} \mathbf{x})\right)\)</p>

<p>Since $\mathbf{W}_2^{(i)}$ is a pure linear map with no bias, scalar multiplication commutes:
\(p_i \cdot \mathbf{W}_2^{(i)} \mathbf{h} = \mathbf{W}_2^{(i)} (p_i \cdot \mathbf{h})\)</p>

<p><strong>Why this saves memory</strong>: in the standard version, computing $\partial\mathcal{L}/\partial p_i$ requires retaining each expert output $E_i(\mathbf{x})$. In the efficient version, $p_i$ multiplies $\phi(\mathbf{z}_i)$ directly, so $\partial\mathcal{L}/\partial p_i$ only depends on $\phi(\mathbf{z}_i)$, which can be recomputed from $\mathbf{z}_i = \mathbf{W}_1^{(i)} \mathbf{x}$ already saved for SwiGLU backward. This saves roughly <strong>26.3 GB per GPU</strong> for DeepSeek-V3 with essentially zero extra compute.</p>
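
<p>The commutation is easy to check numerically. A tiny single-expert sketch with illustrative names:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch

torch.manual_seed(0)
d, h = 8, 16
x = torch.randn(d)
W1, W2 = torch.randn(h, d), torch.randn(d, h)
p = torch.rand(())                       # routing probability for this expert
phi = torch.nn.functional.silu           # any elementwise activation

standard  = p * (W2 @ phi(W1 @ x))       # scale after the expert output: must keep E_i(x)
efficient = W2 @ (p * phi(W1 @ x))       # scale before W2: only phi(W1 x) is needed,
                                         # and it can be recomputed from the saved W1 x
assert torch.allclose(standard, efficient, atol=1e-5)
</code></pre></div></div>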

<h3 id="fp8fp4-activations">FP8/FP4 Activations</h3>

<p>Linear layer inputs stored in FP8 instead of BF16 reduce memory by 50% per tensor. For DeepSeek-V3, that is roughly <strong>16 GB saved</strong>. FP4 pushes this further to a 75% reduction.</p>

<h3 id="fine-grained-recomputation">Fine-Grained Recomputation</h3>

<p>Two composable techniques:</p>
<ol>
  <li><strong>Granular recomputation</strong>: selectively recompute only specific operations such as activation functions, LayerNorm, and MLA up-projection, typically with under 5% compute overhead.</li>
  <li><strong>Output-discarding recomputation</strong>: release checkpointed module outputs immediately after downstream consumption and restore them via recomputation during backward.</li>
</ol>

<table>
  <thead>
    <tr>
      <th>Recomputation Target</th>
      <th>Memory Saved/GPU</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>MLA Up-Projection</td>
      <td>30.4 GB</td>
    </tr>
    <tr>
      <td>SwiGLU Activation</td>
      <td>3.8 GB</td>
    </tr>
    <tr>
      <td>LayerNorm</td>
      <td>8.2 GB</td>
    </tr>
    <tr>
      <td><strong>Total</strong></td>
      <td><strong>42.4 GB</strong></td>
    </tr>
  </tbody>
</table>

<p><strong>Critical insight</strong>: full-layer recomputation of MoE is especially expensive because it re-triggers EP all-to-all communication. Fine-grained recomputation avoids that penalty.</p>

<h3 id="fine-grained-activation-offloading">Fine-Grained Activation Offloading</h3>

<p><strong>Forward</strong>: input activations are offloaded to CPU through a dedicated D2H stream, overlapping with the next module’s computation.</p>

<p><strong>Backward</strong>: Layer-Staggered Reload reloads the same module type from the <strong>next</strong> layer while gradients are computed for the current layer. Only one activation per module type resides on GPU at a time.</p>

<p><strong>Peak memory advantage over full recomputation</strong>:</p>
<ul>
  <li>Full recomputation: $L \times \text{layer_input} + 1 \times \text{layer_intermediate}$</li>
  <li>Offloading: $1 \times \text{layer_input} + 1 \times \text{layer_intermediate}$</li>
</ul>

<p>Results: <strong>10-18% memory reduction</strong> with only <strong>1.6-2% throughput overhead</strong>. For Qwen3-235B, offloading enabled a lower TP degree and about <strong>15% throughput improvement</strong>.</p>

<h3 id="precision-aware-optimizer">Precision-Aware Optimizer</h3>

<p>Adam stores first and second moments. The optimization here is to store moments in BF16 or FP8, then cast to FP32 inside TransformerEngine’s FusedAdam kernel for the actual update.</p>

<p>Memory per parameter per DP rank decreases from $6 + 12/d$ bytes to $6 + 8/d$ bytes, saving on the order of <strong>10-12 GB</strong> from the 32.1 GB optimizer-state budget.</p>
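
<p>A rough sketch of the idea, storing Adam moments in BF16 and upcasting only inside the update. This is an illustration of the concept, not TransformerEngine’s FusedAdam, and bias correction is omitted:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch

def adam_step_bf16_moments(p, grad, m_bf16, v_bf16, lr=1e-3, b1=0.9, b2=0.95, eps=1e-8):
    # p, grad: FP32 master weight and gradient; m_bf16, v_bf16: moments stored in BF16
    m = m_bf16.float().mul_(b1).add_(grad, alpha=1 - b1)           # upcast, update in FP32
    v = v_bf16.float().mul_(b2).addcmul_(grad, grad, value=1 - b2)
    p.sub_(lr * m / (v.sqrt() + eps))                              # FP32 parameter update
    m_bf16.copy_(m)                                                # store moments back in BF16
    v_bf16.copy_(v)
</code></pre></div></div>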

<h3 id="fsdp-for-moe">FSDP for MoE</h3>

<p><strong>Dual DeviceMesh design</strong>: dense layers shard across the full DP group, while expert layers shard across the EDP group.</p>

<p>Two key optimizations:</p>
<ol>
  <li><strong>Non-uniform sharding</strong>: flatten and concatenate module parameters, then shard non-uniformly so shard boundaries align with communication buffers for zero-copy collectives.</li>
  <li><strong>Persistent double buffers with NCCL User Buffer Registration</strong>: pre-allocate two persistent buffers and cycle between them. This reduces SM footprint from 8-32 SMs to 1-4 SMs.</li>
</ol>

<hr />

<h2 id="5-breaking-the-communication-wall">5. Breaking the Communication Wall</h2>

<h3 id="communication-anatomy">Communication Anatomy</h3>

<p>For DeepSeek-V3: 58 MoE layers times 2 operations per layer gives <strong>116 dispatch/combine operations per forward pass</strong>. Backward doubles this. At 50 GB/s inter-node bandwidth, a single 200 MB dispatch already costs milliseconds, which compounds rapidly over an iteration.</p>
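
<p>A back-of-the-envelope version of that arithmetic, with purely illustrative numbers:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Crude cost model for fully exposed (non-overlapped) EP all-to-all
inter_node_bw = 50e9                       # bytes/s across the slow inter-node link
dispatch_bytes = 200e6                     # one ~200 MB dispatch
t_one = dispatch_bytes / inter_node_bw     # 4 ms for a single dispatch
t_fwd = 116 * t_one                        # 116 dispatch/combine ops per forward pass
print(f"{t_one * 1e3:.1f} ms each, {t_fwd * 1e3:.0f} ms per forward pass if nothing overlaps")
</code></pre></div></div>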

<h3 id="hybridep">HybridEP</h3>

<p>Developed by NVIDIA for NVLink-rich topologies such as NVL72.</p>

<p><strong>Dispatch</strong>: reads data from global memory into shared memory based on routing info, then writes to destinations via FIFO queues. For inter-node traffic, GPUs with the same local index across nodes exchange first, then forward within the node.</p>

<p><strong>Combine</strong>: fuses reduction into the communication kernel itself. Cross-node data is reduced first, then the remaining intra-node reduction completes locally.</p>

<p>Performance on GB200 with EP64:</p>
<ul>
  <li>HybridEP dispatch: <strong>675 us</strong> vs all-to-all <strong>930 us</strong></li>
  <li>HybridEP combine: <strong>744 us</strong> vs all-to-all <strong>827 us</strong></li>
</ul>

<h3 id="ep-communication-overlapping">EP Communication Overlapping</h3>

<p><strong>1F1B forward-backward overlap</strong> merges the forward pass of one microbatch with the backward pass of another.</p>

<p>Two key optimizations:</p>
<ol>
  <li><strong>Stream Separation</strong>: compute stream and communication stream run in parallel.</li>
  <li><strong>W/D Split</strong>: backward MLP is split into weight gradient (<code class="language-plaintext highlighter-rouge">W/mlp</code>) and data gradient (<code class="language-plaintext highlighter-rouge">D/mlp</code>). Only <code class="language-plaintext highlighter-rouge">D/mlp</code> depends on backward dispatch, which opens more room to hide communication.</li>
</ol>

<p>Result: EP communication overhead drops from <strong>30-40% to under 5%</strong> of iteration time for DeepSeek-V3 on H100.</p>

<hr />

<h2 id="6-breaking-the-compute-efficiency-wall">6. Breaking the Compute Efficiency Wall</h2>

<h3 id="grouped-gemm">Grouped GEMM</h3>

<p>DeepSeek-V3’s 256 small experts produce GEMMs with M dimensions around 128 tokens per expert, far below the regime needed for peak Tensor Core efficiency.</p>

<p>Four implementations are discussed:</p>
<ol>
  <li><strong>Multi-stream cuBLASLt GEMMs</strong>: individual GEMMs launched into multiple CUDA streams</li>
  <li><strong>CUTLASS Grouped GEMM</strong>: a fused single-kernel path</li>
  <li><strong>cuBLASLt Grouped GEMM</strong> (device-initiated): reads shapes from device memory, making it CUDA-Graph-compatible</li>
  <li><strong>cuteDSL Grouped GEMM</strong>: fuses SwiGLU activation and FP8 quantization into the GEMM epilogue</li>
</ol>
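
<p>For reference, the first (multi-stream) variant is easy to approximate in plain PyTorch; the fused Grouped GEMM paths replace this whole loop with a single kernel launch. This is a sketch of the concept only, not the Megatron-Core code, and it assumes a CUDA device is available:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch

def grouped_gemm_multistream(token_groups, expert_weights, num_streams=4):
    # token_groups[i]: (m_i, k) tokens routed to expert i; expert_weights[i]: (k, n)
    streams = [torch.cuda.Stream() for _ in range(num_streams)]
    outputs = [None] * len(token_groups)
    for i, (a, w) in enumerate(zip(token_groups, expert_weights)):
        with torch.cuda.stream(streams[i % num_streams]):
            outputs[i] = a @ w             # one small GEMM per expert
    torch.cuda.synchronize()               # join all streams before using the results
    return outputs
</code></pre></div></div>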

<h3 id="permutation-fusion">Permutation Fusion</h3>

<p>Three-stage pipeline:</p>
<ul>
  <li>Preprocessing: generate the Row ID map once</li>
  <li>Permute: move tokens according to offset maps</li>
  <li>Unpermute: inverse permutation plus FP32 accumulation</li>
</ul>

<h3 id="router-and-aux-loss-fusion">Router and Aux-Loss Fusion</h3>

<p>Three fused kernels are described:</p>
<ul>
  <li>score computation with top-$k$ and softmax/sigmoid</li>
  <li>score computation for auxiliary loss</li>
  <li>auxiliary loss computation</li>
</ul>

<h3 id="cuda-graphs">CUDA Graphs</h3>

<p><strong>Full vs Partial</strong>: Full CUDA Graphs capture the entire forward-backward pass, but only for drop-and-pad MoE. <strong>Partial CUDA Graphs</strong> capture only the static components:</p>
<ul>
  <li><strong>Graphable</strong>: attention, router, EP preprocessing, shared experts, dense MLP</li>
  <li><strong>Not graphable</strong>: token dispatch, expert GEMM with dynamic M dimension, token combine</li>
</ul>

<p>Memory optimizations:</p>
<ul>
  <li><strong>Graph count reduction</strong>: without PP, microbatches share graphs (<code class="language-plaintext highlighter-rouge">L x 2</code>); with PP, each microbatch needs its own (<code class="language-plaintext highlighter-rouge">L x M x 2</code>). An <code class="language-plaintext highlighter-rouge">is_first_microbatch</code> GPU flag controls microbatch-specific behavior.</li>
  <li><strong>Pool sharing</strong>: all graphs share one pool when captured in execution order.</li>
  <li><strong>Buffer reuse</strong>: static I/O buffers are reused across graphs per PP execution order.</li>
</ul>

<p>Result: about <strong>10% end-to-end speedup</strong> with roughly <strong>7 GB</strong> extra memory on DeepSeek-V3 GB200.</p>

<h3 id="full-cuda-graphs-for-dropless-moe-three-complementary-techniques">Full CUDA Graphs for Dropless MoE: Three Complementary Techniques</h3>

<p><strong>Challenge 1</strong>: kernel launch without knowing problem size.
Solution: <strong>device-initiated Grouped GEMM</strong>, where cuBLASLt reads shapes directly from device memory. The cuteDSL path can also fuse SwiGLU and quantization into the epilogue.</p>

<p><strong>Challenge 2</strong>: memory allocation without knowing actual size. Without mitigation, worst-case buffers waste $O(\text{EP_size})$ more memory.</p>

<h4 id="echo-elastic-cloning-for-hot-experts">ECHO (Elastic Cloning for Hot Experts)</h4>

<p>Popular experts can receive far more tokens than others. ECHO dynamically clones hot experts onto spare slots on underutilized ranks.</p>

<p><strong>Forward</strong>: the ECHO planner identifies hot experts, generates the hot expert map and updated routing map, copies weights to spare slots through HybridEP, and routes tokens to both the home and cloned experts.</p>

<p><strong>Backward</strong>: expert-gradient dispatch collects gradients from cloned experts back to the home experts.</p>

<h4 id="paged-stashing">Paged Stashing</h4>

<p>Decouples worst-case computation buffers from activation storage:</p>
<ul>
  <li><strong>Single tmp buffer</strong> sized for worst case and shared across all layers</li>
  <li><strong>Paged stashing buffer</strong> stores only the actual tokens used per layer</li>
</ul>

<p>This reduces memory from $O(\text{layers} \times \text{worst_case})$ to $O(\text{worst_case} + \text{actual_total})$.</p>

<p>Implementation detail: <code class="language-plaintext highlighter-rouge">PagedStashBuffer</code> uses 64 tokens per page with a circular-buffer free list. Stash and reload kernels are device-initiated and overlap with computation via dedicated Pack and Unpack CUDA streams.</p>

<hr />

<h2 id="7-reduced-precision-training-fp8fp4">7. Reduced-Precision Training (FP8/FP4)</h2>

<h3 id="strategy-selective-precision">Strategy: Selective Precision</h3>

<p>Three principles:</p>
<ol>
  <li><strong>Protect routing</strong>: keep the router in FP32</li>
  <li><strong>Preserve key components</strong>: embeddings, output layers, main gradients, master weights, and optimizer states stay high precision</li>
  <li><strong>Quantize bulk computation</strong>: expert GEMMs and activations go low precision</li>
</ol>

<h3 id="fp8-recipes">FP8 Recipes</h3>

<p><strong>Per-Tensor FP8</strong> (Hopper and Blackwell): one scale per tensor. Two variants are discussed:</p>
<ul>
  <li><strong>Delayed scaling</strong>: uses historical <code class="language-plaintext highlighter-rouge">amax</code>, but is not the recommended path</li>
  <li><strong>Current/live scaling</strong>: computes scale just in time</li>
</ul>

<p>The hybrid format uses E4M3 for inputs and weights and E5M2 for gradients.</p>

<p><strong>Blockwise FP8</strong> (recommended on Hopper): uses E4M3 for all tensors, with activations and gradients quantized in <code class="language-plaintext highlighter-rouge">1 x 128</code> tiles and weights in <code class="language-plaintext highlighter-rouge">128 x 128</code> blocks. The paper notes this recipe has already been proven at scale on models such as DeepSeek-V3, Minimax-M2, and Ant Ling-2.0.</p>
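
<p>A rough sketch of the blockwise scaling geometry (scales are simulated in FP32 here; real kernels store the values in E4M3 and use TransformerEngine’s fused quantization paths):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch

FP8_E4M3_MAX = 448.0

def quantize_1x128(x):
    # Activations/gradients: one scale per 1 x 128 tile along the last dimension
    t, d = x.shape                                     # assumes d divisible by 128
    tiles = x.view(t, d // 128, 128)
    scale = tiles.abs().amax(dim=-1, keepdim=True) / FP8_E4M3_MAX
    return tiles / scale, scale                        # values would be cast to E4M3 here

def quantize_128x128(w):
    # Weights: one scale per 128 x 128 block
    o, i = w.shape                                     # assumes both dims divisible by 128
    blocks = w.view(o // 128, 128, i // 128, 128)
    scale = blocks.abs().amax(dim=(1, 3), keepdim=True) / FP8_E4M3_MAX
    return blocks / scale, scale
</code></pre></div></div>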

<p><strong>MXFP8</strong> (recommended on Blackwell): uses <code class="language-plaintext highlighter-rouge">1 x 32</code> element granularity with native Blackwell Tensor Core support and hardware-accelerated scaling. The important caveat is that parameter AllGather still communicates in BF16 because MXFP8 uses different quantization directions for forward and backward.</p>

<h3 id="nvfp4">NVFP4</h3>

<p>Uses E2M1 with <strong>two-level microscaling</strong>:</p>
<ul>
  <li><strong>Per-tensor FP32 scale</strong>: remaps the overall distribution into a range compatible with block scaling</li>
  <li><strong>Per-block E4M3 scale</strong>: blocks of 16 elements then map into FP4 range</li>
</ul>
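
<p>A simplified sketch of the two-level scale computation; the FP4 cast itself and the Blackwell-specific layouts are omitted, and the per-tensor scale formula is my reading of the recipe rather than verified kernel code:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch

FP4_E2M1_MAX = 6.0          # largest representable E2M1 magnitude
FP8_E4M3_MAX = 448.0

def nvfp4_scales(x, block=16):
    # Level 1: one FP32 scale for the whole tensor, chosen so that the
    # per-block scales themselves land inside the E4M3 range.
    tensor_scale = x.abs().amax() / (FP4_E2M1_MAX * FP8_E4M3_MAX)
    # Level 2: one E4M3 scale per block of 16 elements, relative to level 1.
    blocks = (x / tensor_scale).reshape(-1, block)
    block_scale = blocks.abs().amax(dim=-1, keepdim=True) / FP4_E2M1_MAX
    return tensor_scale, block_scale    # elements then map into the FP4 range
</code></pre></div></div>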

<p>Three critical algorithmic additions are required for stable training:</p>
<ol>
  <li><strong>Random Hadamard Transforms (RHT)</strong>: applied to weight-gradient computation to reduce outlier impact</li>
  <li><strong>2D scaling</strong>: <code class="language-plaintext highlighter-rouge">16 x 16</code> weight-block scaling keeps forward and backward consistent</li>
  <li><strong>Stochastic rounding</strong>: applied on gradients to reduce rounding bias during FP4 conversion</li>
</ol>

<h3 id="fp8fp4-primary-weights">FP8/FP4 Primary Weights</h3>

<p>The framework eliminates a redundant BF16 copy by casting directly from FP32 to FP8 or FP4. The distributed-optimizer quantization flow is:</p>
<ol>
  <li>Get local abs-max from the master weights</li>
  <li>AllReduce to get global abs-max</li>
  <li>Use global abs-max plus master weights for the partial cast</li>
</ol>

<p>For the blockwise recipe, specialized kernels aware of the 2D weight layout compute abs-max over <code class="language-plaintext highlighter-rouge">128 x 128</code> blocks.</p>

<h3 id="moe-specific-fp8fp4-challenges">MoE-Specific FP8/FP4 Challenges</h3>

<p><strong>Padding alignment</strong>: FP8 GEMMs require alignment to 16 for per-tensor and blockwise FP8, or 32 for MXFP8 and NVFP4. Because token dimension varies dynamically, the paper describes two solutions: routing-map padding and fusing padding directly into permutation.</p>

<p><strong>Grouped quantization</strong>: multiple expert input tensors are quantized in a single fused kernel, and the implementation is CUDA-Graphable.</p>

<p><strong>NVFP4 quantization fusion</strong>: RHT is fused with quantization to avoid extra BF16 traffic. Hadamard is still computed twice, once for <code class="language-plaintext highlighter-rouge">amax</code> and once inside the fused quantization path, but this remains faster than materializing a full BF16 buffer. Stochastic rounding uses <code class="language-plaintext highlighter-rouge">cuRANDDx</code> for in-kernel random-number generation.</p>

<hr />

<h2 id="8-long-context-moe-training">8. Long-Context MoE Training</h2>

<h3 id="the-computational-shift">The Computational Shift</h3>

<p>At 64K tokens, <strong>SDPA consumes 69% of FLOPs</strong>, versus roughly 10-15% at short sequence lengths. SDPA scales as $O(s^2)$ while MoE scales as $O(s)$, so attention becomes the dominant cost.</p>

<p><strong>Key recommendation</strong>: do not recompute core attention at long sequence lengths. At 64K, SDPA recomputation adds about <strong>18% compute overhead</strong> but saves only <strong>9 GB</strong> of memory. Recomputing non-SDPA components instead saves <strong>89.8 GB</strong> with lower performance impact.</p>

<h3 id="cp-vs-tp-trade-offs">CP vs TP Trade-offs</h3>

<ul>
  <li><strong>P2P CP</strong>: preferred across nodes because ring-style KV exchange overlaps naturally with SDPA</li>
  <li><strong>All-to-all CP</strong>: converts sequence-sharded layouts to head-sharded layouts before SDPA</li>
  <li><strong>TP</strong>: preferred within nodes for sharding linear weights</li>
</ul>

<p>Practical guideline: <strong>all-to-all CP + TP inside nodes; P2P CP across nodes</strong>.</p>

<h3 id="dynamic-context-parallelism">Dynamic Context Parallelism</h3>

<p>For variable-length sequences, the system selects CP degree per microbatch based on actual sequence lengths. Multiple CP groups are pre-constructed per rank during initialization. The per-token loss is:</p>

\[\mathcal{L} = \frac{\sum_{t \in \mathcal{V}} \ell_t}{|\mathcal{V}|}\]

<p>where $\mathcal{V}$ is the set of valid non-padding tokens.</p>
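
<p>In code this is just masking out padding tokens before averaging (a trivial sketch):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch

def per_token_loss(token_losses, valid_mask):
    # token_losses: (T,) per-token losses; valid_mask: (T,) bool, False for padding
    return token_losses[valid_mask].sum() / valid_mask.sum()
</code></pre></div></div>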

<hr />

<h2 id="9-production-features">9. Production Features</h2>

<h3 id="load-balancing-strategies">Load Balancing Strategies</h3>

<p>Three approaches are discussed:</p>
<ul>
  <li><strong>Auxiliary loss</strong>: gradient-based, differentiable, soft balance</li>
  <li><strong>Sinkhorn</strong>: assignment-based, non-differentiable, hard balance; iterates row and column normalization to convergence</li>
  <li><strong>Aux-loss-free / Expert Bias</strong>: feedback-based, non-differentiable, adaptive; updates expert bias based on token-count feedback</li>
</ul>

<h3 id="latent-moe">Latent MoE</h3>

<p>Latent MoE inserts a shared down-projection before expert dispatch and an up-projection after combine:</p>

\[\text{output}(\mathbf{x}) = \mathbf{W}_\uparrow \cdot \left(\sum_{i \in \mathcal{T}_{K,E}} p_i E_i(\mathbf{W}_\downarrow \cdot \mathbf{x}; \ell)\right) + \sum_j E_j^\text{shared}(\mathbf{x}; d)\]

<p>The compression ratio $\alpha = d / \ell$ reduces both all-to-all volume and per-expert weight size by a factor of $\alpha$.</p>
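
<p>A minimal sketch of the latent-MoE wrapper around an existing MoE layer; module names are illustrative, not Megatron-Core’s implementation:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch

class LatentMoE(torch.nn.Module):
    def __init__(self, d, latent, moe_layer, shared_expert):
        super().__init__()
        self.down = torch.nn.Linear(d, latent, bias=False)   # shared down-projection before dispatch
        self.up = torch.nn.Linear(latent, d, bias=False)     # shared up-projection after combine
        self.moe = moe_layer                                  # routed experts run at width `latent`
        self.shared = shared_expert                           # shared experts stay at width d

    def forward(self, x):
        # All-to-all volume and per-expert weight size shrink by alpha = d / latent
        return self.up(self.moe(self.down(x))) + self.shared(x)
</code></pre></div></div>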

<h3 id="flexible-asymmetric-vpp">Flexible Asymmetric VPP</h3>

<p>Allows different numbers and types of layers per virtual pipeline stage. For DeepSeek-V3 with 61 decoder layers plus 1 MTP layer, <code class="language-plaintext highlighter-rouge">PP=16</code>, and <code class="language-plaintext highlighter-rouge">VPP=2</code>, the first stage holds the embedding plus 3 dense decoder layers to match the cost of 2 MoE layers, while the MTP layer sits in its own standalone stage and the loss is separated.</p>

<h3 id="upcycling">Upcycling</h3>

<p>Converts a dense checkpoint into MoE via virtual-group initialization: shard MLP weights in the intermediate dimension (<code class="language-plaintext highlighter-rouge">4h -&gt; 2h</code>), duplicate the shards, then initialize half the router weights and duplicate them. This guarantees Top-2 routing initially selects one expert from each shard pair, so the MoE output exactly matches the dense model at the start of training.</p>
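
<p>The weight-sharding half of this construction is easy to verify numerically: splitting a dense two-layer MLP along its intermediate dimension yields two “experts” whose unweighted sum reproduces the dense output exactly. The sketch below checks that identity under the simplifying assumption of unit routing weights; the router initialization described above is what makes Top-2 routing actually select one expert from each shard pair:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch

torch.manual_seed(0)
h = 32
x = torch.randn(h)
W1 = torch.randn(4 * h, h)            # dense up-projection  (4h x h)
W2 = torch.randn(h, 4 * h)            # dense down-projection (h x 4h)
phi = torch.nn.functional.gelu

dense = W2 @ phi(W1 @ x)

# Shard the intermediate dimension 4h into two halves of 2h and form two experts
W1_a, W1_b = W1[:2 * h], W1[2 * h:]
W2_a, W2_b = W2[:, :2 * h], W2[:, 2 * h:]
expert_sum = W2_a @ phi(W1_a @ x) + W2_b @ phi(W1_b @ x)

assert torch.allclose(dense, expert_sum, atol=1e-5)
</code></pre></div></div>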

<h3 id="multi-token-prediction-mtp">Multi-Token Prediction (MTP)</h3>

<p>MTP optimizes the model to predict multiple consecutive future tokens at each position, densifying the supervision signal. Unlike parallel independent predictions, it preserves <strong>causal dependencies</strong> between predictions through hidden-state transitions, which improves convergence and generation quality. During inference, the model falls back to ordinary single-token prediction for compatibility.</p>

<p>Flexible pipeline parallelism allows MTP layers to be placed strategically within the VPP layout. In the DeepSeek-V3 example with <code class="language-plaintext highlighter-rouge">PP=16</code> and <code class="language-plaintext highlighter-rouge">VPP=2</code>, the MTP layer is placed in a standalone pipeline stage on PP rank 14 to balance the workload.</p>

<h3 id="muon-optimizer">Muon Optimizer</h3>

<p>Unlike AdamW, which performs element-wise updates, Muon applies a <strong>matrix-aware</strong> optimization by orthogonalizing entire weight matrices. The production integration provides:</p>
<ol>
  <li><strong>Split QKV support</strong>: efficient orthogonalization even when attention projection matrices are stored as separate Q, K, and V tensors</li>
  <li><strong>Distributed optimizer integration</strong>: optimizer states are sharded across data-parallel ranks while preserving correct orthogonalization semantics</li>
  <li><strong>CPU offloading</strong>: Muon’s orthogonalization buffers can be offloaded when GPU memory is tight</li>
</ol>

<p><strong>MuonClip</strong> addresses a separate stability problem in trillion-parameter training, where query-key dot products can grow without bound and trigger attention explosions. The paper notes hardware-accelerated implementations in cuDNN, <code class="language-plaintext highlighter-rouge">cudnn-frontend</code>, and Transformer Engine.</p>

<hr />

<h2 id="10-performance-results">10. Performance Results</h2>

<table>
  <thead>
    <tr>
      <th>Model</th>
      <th>System</th>
      <th>#GPUs</th>
      <th>Dtype</th>
      <th>TFLOPS/GPU</th>
      <th>Tokens/s/GPU</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>DeepSeek-V3</td>
      <td>GB300</td>
      <td>256</td>
      <td>MXFP8</td>
      <td><strong>1,233</strong></td>
      <td>4,730</td>
    </tr>
    <tr>
      <td>DeepSeek-V3</td>
      <td>GB200</td>
      <td>256</td>
      <td>MXFP8</td>
      <td><strong>1,048</strong></td>
      <td>4,020</td>
    </tr>
    <tr>
      <td>DeepSeek-V3</td>
      <td>GB200</td>
      <td>256</td>
      <td>BF16</td>
      <td>857</td>
      <td>3,298</td>
    </tr>
    <tr>
      <td>DeepSeek-V3</td>
      <td>H100</td>
      <td>1,024</td>
      <td>FP8-BLK</td>
      <td>368</td>
      <td>1,412</td>
    </tr>
    <tr>
      <td>Qwen3-235B</td>
      <td>GB300</td>
      <td>256</td>
      <td>MXFP8</td>
      <td><strong>974</strong></td>
      <td>6,583</td>
    </tr>
    <tr>
      <td>Qwen3-235B</td>
      <td>GB200</td>
      <td>256</td>
      <td>MXFP8</td>
      <td>919</td>
      <td>6,212</td>
    </tr>
    <tr>
      <td>Qwen3-235B</td>
      <td>GB300</td>
      <td>128</td>
      <td>MXFP8 (131K seq)</td>
      <td>1,150</td>
      <td>1,556</td>
    </tr>
  </tbody>
</table>

<p>GB200 and GB300 deliver roughly <strong>3x higher token throughput</strong> than H100. Long-context DeepSeek-V3 at 256K tokens still reaches <strong>88% of short-context MFU</strong>.</p>

<hr />

<h2 id="11-case-study-deepseek-v3-gb200-vs-h100">11. Case Study: DeepSeek-V3 GB200 vs H100</h2>

<table>
  <thead>
    <tr>
      <th>Config</th>
      <th>GB200 (256 GPUs)</th>
      <th>H100 (1,024 GPUs)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>TP/PP/EP</strong></td>
      <td>1/4/64</td>
      <td>2/8/64</td>
    </tr>
    <tr>
      <td><strong>Precision</strong></td>
      <td>MXFP8</td>
      <td>FP8-Blockwise</td>
    </tr>
    <tr>
      <td><strong>Dispatcher</strong></td>
      <td>HybridEP</td>
      <td>DeepEP</td>
    </tr>
    <tr>
      <td><strong>Recompute</strong></td>
      <td>mlp only</td>
      <td>mlp, mla_up_proj, moe_act, layernorm</td>
    </tr>
    <tr>
      <td><strong>CUDA Graphs</strong></td>
      <td>Enabled</td>
      <td>-</td>
    </tr>
    <tr>
      <td><strong>EP Overlap</strong></td>
      <td>-</td>
      <td>Enabled</td>
    </tr>
    <tr>
      <td><strong>Performance</strong></td>
      <td>1,048 TFLOPS/GPU</td>
      <td>368 TFLOPS/GPU</td>
    </tr>
  </tbody>
</table>

<p><strong>Key insight</strong>: the same model requires fundamentally different strategies on different hardware. GB200’s 192 GB memory and NVL72 topology largely eliminate the communication wall, shifting the bottleneck toward CPU overhead and making CUDA Graphs essential. H100’s 80 GB memory and NVL8 topology instead make cross-node communication the main bottleneck, so EP overlap becomes essential, and FP8 is what frees enough memory to hold the extra overlap buffers.</p>

<hr />

<h2 id="12-systematic-optimization-workflow">12. Systematic Optimization Workflow</h2>

<p>A three-phase workflow emerges from tuning Mixtral, DeepSeek-V3, and Qwen3 across GB200 and H100. The process is inherently <strong>iterative</strong>: solving one bottleneck often exposes the next.</p>

<h3 id="phase-1-establish-memory-feasible-parallelism">Phase 1: Establish Memory-Feasible Parallelism</h3>

<p>Memory feasibility is the first hard constraint. The paper explicitly frames the impact of each parallelism strategy on per-GPU memory:</p>

<table>
  <thead>
    <tr>
      <th>Strategy</th>
      <th>Peak Activation</th>
      <th>Weight Memory</th>
      <th>Optimizer States</th>
      <th>Comm (Per-Layer)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>TP</td>
      <td>$1/d$ (with SP)</td>
      <td>$1/d$</td>
      <td>$1/d$</td>
      <td>High</td>
    </tr>
    <tr>
      <td>EP</td>
      <td>~1 (load-dependent)</td>
      <td>$1/d$ (MoE only)</td>
      <td>$1/d$</td>
      <td>Medium</td>
    </tr>
    <tr>
      <td>PP</td>
      <td>1 (&gt;1 with VPP)</td>
      <td>$1/d$</td>
      <td>$1/d$</td>
      <td>Medium</td>
    </tr>
    <tr>
      <td>CP</td>
      <td>$1/d$</td>
      <td>1</td>
      <td>$1/d$*</td>
      <td>Medium</td>
    </tr>
    <tr>
      <td>DP</td>
      <td>1</td>
      <td>1</td>
      <td>$1/d$*</td>
      <td>Low</td>
    </tr>
  </tbody>
</table>

<p><code class="language-plaintext highlighter-rouge">*</code> Requires distributed optimizer.</p>

<p><strong>Practical tip</strong>: use <code class="language-plaintext highlighter-rouge">--fake-init-process-group</code> to emulate distributed training on a single GPU for rapid iteration on parallelism configurations without allocating a full cluster.</p>

<h3 id="phase-2-select-optimal-parallelism-strategy">Phase 2: Select Optimal Parallelism Strategy</h3>

<p>Five guidelines:</p>
<ol>
  <li><strong>Minimize model parallelism, maximize data parallelism</strong>: keep TP, EP, PP, and CP as small as possible while still avoiding OOM. Use the distributed optimizer (<code class="language-plaintext highlighter-rouge">--use-distributed-optimizer</code>) to shard optimizer states across DP ranks.</li>
  <li><strong>Keep EP and TP within the NVLink domain</strong>: make sure <code class="language-plaintext highlighter-rouge">EP x TP</code> fits inside the local NVLink island, typically 8 GPUs per node or 72 GPUs for NVL72. When scaling beyond that, prefer PP over stretching TP or EP across nodes.</li>
  <li><strong>Use pipeline parallelism for multi-node scaling</strong>: PP distributes layers across nodes. Enable VPP to reduce pipeline bubbles when <code class="language-plaintext highlighter-rouge">PP &gt;= 2</code>, and balance work across VPP ranks.</li>
  <li><strong>Prefer EP over TP for expert layers</strong>: EP yields larger local GEMMs, lower communication overhead, simpler computation graphs, and eliminates local token permutation when <code class="language-plaintext highlighter-rouge">EP = num_experts</code>. The paper gives a concrete example: Mixtral-8x7B with <code class="language-plaintext highlighter-rouge">EP8 x TP1</code> outperforms <code class="language-plaintext highlighter-rouge">EP4 x TP2</code>.</li>
  <li><strong>Enable context parallelism for long sequences</strong>: use CP when sequence length is at least about 8K tokens. For sequences shorter than about 4K, the CP overhead can exceed the benefit.</li>
</ol>

<h3 id="phase-3-profile-and-optimize-bottlenecks">Phase 3: Profile and Optimize Bottlenecks</h3>

<p>Diagnose which wall dominates, then apply targeted fixes.</p>

<p><strong>Memory bottleneck</strong>: symptom is forced full recomputation or overly aggressive parallelism just to avoid OOM.</p>

<table>
  <thead>
    <tr>
      <th>Optimization</th>
      <th>Overhead</th>
      <th>Config Flag</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>FP8 Training</td>
      <td>Low</td>
      <td><code class="language-plaintext highlighter-rouge">--fp8-format --fp8-recipe</code></td>
    </tr>
    <tr>
      <td>Selective Recomputation</td>
      <td>Low</td>
      <td><code class="language-plaintext highlighter-rouge">--recompute-granularity --recompute-modules</code></td>
    </tr>
    <tr>
      <td>Precision-Aware Optimizer</td>
      <td>Low</td>
      <td><code class="language-plaintext highlighter-rouge">--use-precision-aware-optimizer</code></td>
    </tr>
    <tr>
      <td>Activation Offloading</td>
      <td>Medium</td>
      <td><code class="language-plaintext highlighter-rouge">--fine-grained-activation-offloading --offload-modules</code></td>
    </tr>
    <tr>
      <td>Optimizer Offloading</td>
      <td>Medium</td>
      <td><code class="language-plaintext highlighter-rouge">--offload-optimizer-states</code></td>
    </tr>
  </tbody>
</table>

<p><strong>Communication bottleneck</strong>: symptom is a profile dominated by collectives.</p>

<table>
  <thead>
    <tr>
      <th>Communication Type</th>
      <th>Config Flag</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>DP gradient reduce/param gather</td>
      <td><code class="language-plaintext highlighter-rouge">--overlap-grad-reduce --overlap-param-gather</code></td>
    </tr>
    <tr>
      <td>TP communication</td>
      <td><code class="language-plaintext highlighter-rouge">--tp-comm-overlap</code></td>
    </tr>
    <tr>
      <td>EP dispatcher</td>
      <td><code class="language-plaintext highlighter-rouge">--moe-token-dispatcher-type</code></td>
    </tr>
    <tr>
      <td>EP all-to-all hiding</td>
      <td><code class="language-plaintext highlighter-rouge">--overlap-moe-expert-parallel-comm</code></td>
    </tr>
    <tr>
      <td>PP send/recv</td>
      <td><code class="language-plaintext highlighter-rouge">--pipeline-model-parallel-layout</code></td>
    </tr>
  </tbody>
</table>

<p><strong>CPU overhead bottleneck</strong>: symptom is Nsight Systems showing gaps between GPU kernels.</p>

<table>
  <thead>
    <tr>
      <th>Optimization</th>
      <th>Config Flag</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Disable Python GC</td>
      <td><code class="language-plaintext highlighter-rouge">--manual-gc --manual-gc-interval 10</code></td>
    </tr>
    <tr>
      <td>Reduce kernel launches</td>
      <td>Decrease TP or increase MBS</td>
    </tr>
    <tr>
      <td>Enable CUDA Graphs</td>
      <td><code class="language-plaintext highlighter-rouge">--cuda-graph-impl transformer_engine</code></td>
    </tr>
  </tbody>
</table>

<p><strong>Computation bottleneck</strong>: symptom is low SM utilization even after communication and CPU gaps are under control.</p>

<table>
  <thead>
    <tr>
      <th>Optimization</th>
      <th>Config Flag</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Grouped GEMM</td>
      <td><code class="language-plaintext highlighter-rouge">--moe-grouped-gemm</code></td>
    </tr>
    <tr>
      <td>Kernel fusions</td>
      <td><code class="language-plaintext highlighter-rouge">--moe-router-fusion --moe-permute-fusion</code></td>
    </tr>
    <tr>
      <td>FP8 precision</td>
      <td><code class="language-plaintext highlighter-rouge">--fp8-format --fp8-recipe</code></td>
    </tr>
  </tbody>
</table>

<p><strong>Key insight</strong>: the same model on different hardware needs different optimization priorities. On NVL8, where EP crosses nodes, the Communication Wall dominates and can consume 30-50% of step time in all-to-all. On NVL72, where EP stays inside NVLink, enabling FP8 often shifts the bottleneck to CPU overhead instead.</p>

<h3 id="iterative-nature">Iterative Nature</h3>

<p>The ordering matters: memory in Phase 1, then parallelism in Phase 2, then profiling in Phase 3. But the process is cyclical. Memory optimizations may enable smaller parallelism degrees, which pushes you back to Phase 1. Phase 3 optimizations such as EP communication overlap and CUDA Graphs also consume memory, which can force you to revisit earlier choices.</p>

<hr />

<h2 id="13-rl-post-training-insights">13. RL Post-Training Insights</h2>

<ul>
  <li><strong>Router Replay</strong>: logs expert assignments during inference and replays them during training for more stable optimization</li>
  <li><strong>Packing-aware dynamic batch size</strong>: keeps total effective tokens per batch more consistent</li>
  <li><strong>Attention cost metric</strong>: sorts microbatches by $\sum (\text{seq_len})^2$ in serpentine order to reduce synchronization bubbles (a small sketch follows this list)</li>
  <li><strong>Dynamic CP</strong>: selects CP degree per microbatch rather than provisioning for the worst-case sequence mix</li>
</ul>
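
<p>A small sketch of the serpentine assignment mentioned above, reflecting my interpretation of the scheme rather than the paper’s code: sort microbatches by their attention cost, then deal them out to buckets back and forth so every bucket ends up with a similar total.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def serpentine_assign(microbatches, num_buckets):
    # microbatches: list of lists of sequence lengths; cost ~ sum of seq_len^2
    order = sorted(range(len(microbatches)),
                   key=lambda i: sum(s * s for s in microbatches[i]),
                   reverse=True)
    buckets = [[] for _ in range(num_buckets)]
    for rank, idx in enumerate(order):
        pos = rank % (2 * num_buckets)
        # snake back and forth: 0, 1, ..., B-1, B-1, ..., 1, 0, 0, 1, ...
        bucket = pos if pos &lt; num_buckets else 2 * num_buckets - 1 - pos
        buckets[bucket].append(idx)
    return buckets
</code></pre></div></div>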

<hr />

<h2 id="14-conclusion">14. Conclusion</h2>

<p>MoE sparsity introduces two fundamental challenges:</p>
<ol>
  <li><strong>Parameter-compute mismatch</strong>, which creates the Three Walls of memory, communication, and compute efficiency</li>
  <li><strong>Dense-sparse mismatch</strong>, which requires decoupled parallelism between attention and expert layers</li>
</ol>

<p>Key contributions summarized:</p>

<table>
  <thead>
    <tr>
      <th>Contribution</th>
      <th>Impact</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Parallel Folding</td>
      <td>Breaks <code class="language-plaintext highlighter-rouge">EP &lt;= DP</code>; enables flexible parallelism mapping</td>
    </tr>
    <tr>
      <td>Memory Optimization</td>
      <td>199.5 GB to under 80 GB per GPU for DeepSeek-V3</td>
    </tr>
    <tr>
      <td>Communication Optimization</td>
      <td>All-to-all shifts from foreground bottleneck to mostly background work</td>
    </tr>
    <tr>
      <td>Compute Efficiency</td>
      <td>Grouped GEMM plus CUDA Graphs plus sync-free execution</td>
    </tr>
    <tr>
      <td>FP8/FP4 Training</td>
      <td>Cross-cutting improvements across all three walls</td>
    </tr>
    <tr>
      <td>Long-Context Training</td>
      <td>Keeps sub-sequence lengths manageable through CP and TP scaling</td>
    </tr>
    <tr>
      <td>RL Support</td>
      <td>Packed sequences, Dynamic-CP, and router replay</td>
    </tr>
  </tbody>
</table>

<p>Final performance: DeepSeek-V3 reaches <strong>1,233 / 1,048 TFLOPS/GPU</strong> on GB300 and GB200 with 256 GPUs, versus <strong>368 TFLOPS/GPU</strong> on H100 with 1,024 GPUs. GB200 and GB300 deliver about <strong>3x higher token throughput</strong> than H100.</p>

<p>The broader takeaway is that Megatron-Core MoE is a full-stack systems response to these two mismatches. Memory savings from FP8, recomputation, and offloading are not isolated wins; they enable communication overlap, better parallelism choices, and eventually CUDA Graph coverage. Large-scale MoE training becomes feasible only when routing, parallelism, kernels, optimizer states, and long-context execution are all tuned together.</p>]]></content><author><name>Jianyu Huang</name><email>jianyu0huang@gmail.com</email></author><category term="MoE" /><category term="MoE" /><category term="Megatron" /><category term="NVIDIA" /><category term="distributed-training" /><category term="FP8" /><summary type="html"><![CDATA[Reading the following paper: Scalable Training of Mixture-of-Experts Models with Megatron Core]]></summary></entry><entry><title type="html">A Two-Year-Old’s Milestone: A Heartfelt Reflection on Time</title><link href="https://jianyuh.github.io/life/non-tech/2026/03/10/two-year-old.html" rel="alternate" type="text/html" title="A Two-Year-Old’s Milestone: A Heartfelt Reflection on Time" /><published>2026-03-10T00:00:00+00:00</published><updated>2026-03-10T00:00:00+00:00</updated><id>https://jianyuh.github.io/life/non-tech/2026/03/10/two-year-old</id><content type="html" xml:base="https://jianyuh.github.io/life/non-tech/2026/03/10/two-year-old.html"><![CDATA[<h3 id="chapter-i-the-night-life-broke-through-the-shell">Chapter I: The Night Life Broke Through the Shell</h3>

<p>Time moves so quickly — before I could even begin to count the days, you are already two years old.</p>

<p>Looking back to the night before you were born, it feels like only yesterday. That morning, your mother felt unusually intense movement, and contractions began arriving in waves. I hurriedly packed our bags and drove her to the hospital on the west side. The night that followed was long and heart-wrenching: in the ward, your mother lay tethered to tubes and monitors, holding on with the help of anesthesia and an epidural.</p>

<p>It wasn’t until the early hours of the next morning that the doctors began the induction. Your mother summoned every ounce of her strength, and finally — you arrived. I cut the umbilical cord myself, witnessing the very first moment of your independent life. But before the joy could even settle, a complication arose, and your mother was rushed back to the operating room for an emergency procedure. Thankfully, she came through safely. Meanwhile, you announced your arrival with a thunderous wail, and we — guided by patient nurses — clumsily learned to feed you and change diapers every three hours. On the day we left the hospital, the moment you were buckled into your car seat, your constant crying miraculously stopped. It was your first truce with this world.</p>

<h3 id="chapter-ii-the-chaos-of-new-parenthood">Chapter II: The Chaos of New Parenthood</h3>

<p>That first night home was the start of a frantic scramble. You cried so hard it felt like our hearts were being torn apart, and as rookie parents, we didn’t even realize you were simply hungry. For the entire first month, you woke every three hours, and we bid farewell to unbroken sleep.</p>

<p>From there, life shifted rapidly through countless firsts: we took you to see the cherry blossoms in Seattle; we watched your first rollover, your first crawl, your first wobbly steps. We heard you call out “Mama” and “Papa” for the first time in your small, stumbling voice. That year, you accompanied your mother to her PhD graduation ceremony and gazed up at the Aurora Borealis with us. Grandparents took turns flying in from afar to help care for you. You went through weaning off the pacifier, doctor’s check-ups, and gleeful swinging at the amusement park. Life unfolded in small details, but it became vivid because of you.</p>

<p>What remains most unforgettable was the end of 2024. At only nine months old, you followed us on cross-continental flights between China and the U.S., spending a month in our home country. I also remember a deep winter night when the power went out. With no warm milk to drink, you simply sat there in the dark, laughing and content. In that moment, your effortless optimism dissolved all our exhaustion.</p>

<h3 id="chapter-iii-waiting-at-the-window">Chapter III: Waiting at the Window</h3>

<p>On your first birthday, we invited friends and colleagues to celebrate. You sat in a little push-car, spinning happily around the living room.</p>

<p>After turning one, you started daycare. With it came the inevitable cycle of illnesses and trips to the urgent care, but it was also where you learned to socialize and play with children your age. Because of work, I began traveling frequently between California and Washington. Your mother tells me that midweek, you often press yourself against the window, watching and waiting for Papa to come home. Because of this, I hold our weekends sacred — even if it’s just a walk in the park or taking you to your little soccer class on Saturday mornings.</p>

<p>At home these days, you are a perpetual motion machine. Aside from when you’re sleeping, you almost never stop — running laps between the kitchen and the living room. You’ve grown especially attached to your mother lately, occasionally declaring in your baby voice, “Don’t want Papa, want Mama.” It stings a little, I won’t lie. But I’m deeply comforted to see you expressing your feelings so boldly.</p>

<h3 id="chapter-iv-a-second-birthday-and-an-ordinary-day">Chapter IV: A Second Birthday and an Ordinary Day</h3>

<p>Your second birthday was spent, once again, in a rush. I flew back from California just the day before and hurried home from work at 5 PM on your birthday.</p>

<p>Your mother had set up a backdrop with your favorite excavator balloons, and you dragged them around for ages. We gathered to sing “Happy Birthday,” and you asked with a perfectly serious face, “Where did the cake go?” When the cake finally appeared, you blew out the candles with delight alongside Papa, Mama, and Grandma. You ate until your face was covered in cream — a little frosted kitten.</p>

<p>The celebration was simple and grounded. After blowing out the candles, we went to Costco as usual to restock for the week. The day after your birthday, we took you to a nearby park and then out for hot pot. You didn’t wake from your afternoon nap until late. Grandma and Mama spent the afternoon playing with you, and after dinner, I took you for one last walk by the water. As the sun set and the sky darkened, watching you run and pause along the shore, I felt a tranquility that was almost unbearably precious.</p>

<h3 id="chapter-v-the-cycles-of-time">Chapter V: The Cycles of Time</h3>

<p>Time flies. I often think that in perhaps sixteen years, you will be as I once was — packing your bags and leaving this warm little nest to chase your own mountains and seas.</p>

<p>Today, I find myself having lived through another full cycle: seventeen years since I first left my hometown at eighteen. The world is changing fast, especially amid the surging wave of AI, where everything seems to accelerate. But I know that some things never change.</p>

<p>I will cherish this time together. I hope to walk beside you in health and happiness through all the years to come.</p>

<p><strong>Not to squander our time; not to waste the beauty of these years.</strong></p>]]></content><author><name>Jianyu Huang</name><email>jianyu0huang@gmail.com</email></author><category term="Life" /><category term="Non-tech" /><category term="Birthday" /><category term="Parenting" /><category term="Reflection" /><category term="Memoir" /><summary type="html"><![CDATA[Chapter I: The Night Life Broke Through the Shell]]></summary></entry><entry><title type="html">FlashAttention-4 and the Challenge of Asymmetric Hardware Scaling</title><link href="https://jianyuh.github.io/flashattention/2026/03/06/FA4.html" rel="alternate" type="text/html" title="FlashAttention-4 and the Challenge of Asymmetric Hardware Scaling" /><published>2026-03-06T00:00:00+00:00</published><updated>2026-03-06T00:00:00+00:00</updated><id>https://jianyuh.github.io/flashattention/2026/03/06/FA4</id><content type="html" xml:base="https://jianyuh.github.io/flashattention/2026/03/06/FA4.html"><![CDATA[<p>Reading the following paper:</p>
<ul>
  <li><a href="https://arxiv.org/pdf/2603.05451">FlashAttention-4: Algorithm and Kernel Pipelining Co-Design for Asymmetric Hardware Scaling</a></li>
</ul>

<p><strong>Overview</strong></p>

<p>FlashAttention-4 is the latest iteration of the wildly successful hardware-aware attention algorithm, designed specifically to tackle the architectural shifts in NVIDIA’s Blackwell (B200/GB200) GPUs. While FlashAttention-3 was heavily optimized for the Hopper architecture (H100), the transition to Blackwell introduces a phenomenon the authors call “asymmetric hardware scaling.” This paper provides a masterclass in algorithm and kernel co-design, demonstrating how to shift compute paradigms when the hardware bottlenecks unexpectedly change.</p>

<p><strong>The Core Problem: Asymmetric Hardware Scaling</strong></p>

<p>The fundamental challenge addressed in FlashAttention-4 is that not all parts of the GPU got faster at the same rate. On the Blackwell B200, the Matrix Multiply-Accumulate (MMA) tensor core throughput doubled compared to Hopper, reaching a massive 8192 ops/clock/SM (2.25 PFLOPS for FP16/BF16). However, the shared memory (SMEM) bandwidth remained flat at 128 bytes/clock/SM, and the multi-function unit (MUFU), which handles exponential operations for softmax, remained at 16 ops/clock/SM.</p>

<p>Roofline analysis in the paper reveals a surprising reality for Blackwell: matrix multiplication is no longer the primary bottleneck for attention. Instead, <strong>shared memory traffic and exponential operations now dominate execution time</strong>, exceeding MMA compute by 25-60%.</p>
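
<p>To make the asymmetry concrete, here is a rough back-of-envelope in Python using the per-SM rates quoted above. The 128x128 tile and head dimension of 128 are my own illustrative assumptions (not numbers from the paper), and the model ignores operand reuse and epilogue traffic, so it only indicates the order of magnitude.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Back-of-envelope per-SM cycle counts for one 128x128 attention tile.
# Throughput numbers are the per-SM rates quoted above; the tile and head
# sizes are illustrative assumptions.
MMA_FLOPS_PER_CLK  = 8192    # FP16/BF16 tensor-core throughput
MUFU_OPS_PER_CLK   = 16      # hardware exponential throughput
SMEM_BYTES_PER_CLK = 128     # shared-memory bandwidth

TILE_M, TILE_N, HEAD_DIM = 128, 128, 128     # assumed tile shape

qk_cycles  = 2 * TILE_M * TILE_N * HEAD_DIM / MMA_FLOPS_PER_CLK   # S = Q @ K^T
pv_cycles  = 2 * TILE_M * HEAD_DIM * TILE_N / MMA_FLOPS_PER_CLK   # O += P @ V
exp_cycles = TILE_M * TILE_N / MUFU_OPS_PER_CLK                   # one exp per score
# Q, K, V tiles staged through SMEM once in BF16 (2 bytes each), ignoring reuse
smem_cycles = (TILE_M + 2 * TILE_N) * HEAD_DIM * 2 / SMEM_BYTES_PER_CLK

print(f"MMA : {qk_cycles + pv_cycles:.0f} cycles")   # 1024
print(f"exp : {exp_cycles:.0f} cycles")              # 1024
print(f"SMEM: {smem_cycles:.0f} cycles")             # 768
</code></pre></div></div>

<p>Even in this crude model the exponentials alone take as many cycles as both GEMMs combined, which is exactly the regime the roofline analysis describes.</p>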

<p><strong>Key Technical Innovations</strong></p>

<p><strong>1. Software-Emulated Exponential &amp; Conditional Rescaling</strong></p>

<p>Because the hardware MUFU cannot keep up with the doubled tensor core speed, the exponential calculation in the softmax becomes a major chokepoint.</p>

<ul>
  <li>
    <p><strong>Polynomial Approximation:</strong> To bypass the MUFU bottleneck, the kernel emulates $2^x$ in software on the floating-point FMA units via a degree-3 polynomial approximation. By distributing the exponential computations across both the MUFU and FMA units (using emulation for 10-25% of entries), it effectively raises the aggregate exponential throughput. The genius here is recognizing that while a degree-3 polynomial has a higher FP32-level error than the hardware MUFU, once the output is rounded to BF16 (the standard precision for attention), the quantization error dominates, making the software emulation virtually indistinguishable from hardware. A toy numerical sketch of this idea (and of the rescaling skip below) follows this list.</p>
  </li>
  <li>
    <p><strong>Conditional Rescaling:</strong> FlashAttention relies on online softmax, which requires rescaling previous results when a new maximum value is encountered. FlashAttention-4 introduces a threshold-based skip mechanism: if the new maximum is not significantly larger than the old one, it skips the vector multiplication for rescaling and resolves the normalization at the very end.</p>
  </li>
</ul>
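
<p>Neither the polynomial coefficients nor the rescaling threshold are given in this summary, so the sketch below is a toy NumPy reconstruction of the two ideas rather than the kernel’s actual code: a degree-3 fit of $2^x$ on $[0, 1)$ with range reduction, and an online-softmax update (in base 2, matching the $2^x$ emulation) that skips rescaling when the running row maximum barely moves. The coefficients and the <code class="language-plaintext highlighter-rouge">threshold</code> value are my own illustrative choices, and the MUFU/FMA work split is not modeled.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

# --- Degree-3 emulation of 2**x (the FMA-friendly part) ----------------------
# Range-reduce x = n + f with f in [0, 1), evaluate a cubic in f, scale by 2**n.
# Coefficients below come from a plain least-squares fit; the kernel's actual
# coefficients and fit criterion are not specified in this post.
_f = np.linspace(0.0, 1.0, 512)
_coef = np.polyfit(_f, 2.0 ** _f, 3)            # degree-3 fit on [0, 1)

def exp2_emulated(x):
    n = np.floor(x)
    f = x - n                                   # fractional part in [0, 1)
    poly = np.polyval(_coef, f)                 # 3 FMAs in a real kernel
    return poly * 2.0 ** n                      # exact power-of-two scaling

x = np.random.uniform(-20.0, 0.0, 10_000)
rel_err = np.abs(exp2_emulated(x) - 2.0 ** x) / 2.0 ** x
print("cubic emulation max rel. error:", rel_err.max())  # ~1e-4, well below BF16's ~2**-8 step

# --- Online softmax (base 2) with a threshold-based rescaling skip -----------
def online_softmax_rows(scores, threshold=8.0):
    """Consume KV blocks left to right. Skip the O(n) rescale of the running
    accumulator unless the row maximum grows by more than `threshold`
    (in log2 units); the final normalization absorbs the stale base."""
    m = np.full(scores.shape[0], -np.inf)       # running (possibly stale) row max
    acc = np.zeros(scores.shape[0])             # running sum of 2**(s - m)
    for blk in np.split(scores, 4, axis=1):     # pretend there are 4 KV tiles
        new_m = np.maximum(m, blk.max(axis=1))
        rescale = (new_m - m) &gt; threshold       # only rescale when it matters
        acc = acc * np.where(rescale, 2.0 ** (m - new_m), 1.0)
        m = np.where(rescale, new_m, m)         # otherwise keep the stale max
        acc = acc + (2.0 ** (blk - m[:, None])).sum(axis=1)
    return acc, m                               # caller rebases and normalizes

scores = 3.0 * np.random.randn(4, 64)
acc, m = online_softmax_rows(scores)
true_max = scores.max(axis=1)
exact = (2.0 ** (scores - true_max[:, None])).sum(axis=1)
print("mismatch vs. exact row sums:", np.abs(acc * 2.0 ** (m - true_max) - exact).max())
</code></pre></div></div>

<p>The skip is safe because the accumulator always stores sums relative to whatever (possibly stale) maximum is currently held, so a single rescale at the very end rebases everything to the true maximum.</p>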

<p><strong>2. Taming Shared Memory with 2-CTA MMA Mode</strong></p>

<p>In the backward pass, SMEM bandwidth becomes an acute bottleneck because the algorithm requires five different MMA operations, forcing operands to be repeatedly read from shared memory.</p>

<p>To solve this, FlashAttention-4 leverages a new Blackwell feature: <strong>2-CTA tensor core MMA mode</strong>. In this mode, two Cooperative Thread Arrays (CTAs) within the same cluster cooperatively execute a single MMA.</p>

<ul>
  <li>
    <p><strong>Halving SMEM Traffic:</strong> Each CTA stages only half of operand B in its own shared memory, while the hardware consumes the combined B tile during the multiply.</p>
  </li>
  <li>
    <p><strong>Halving Atomic Reductions:</strong> In the gradient accumulation step (dQ), the researchers use distributed shared memory (DSMEM) to exchange half of the gradient of the softmax (dS) between the two CTAs. This repacking allows each CTA to write only half of the dQ tile, cutting the number of expensive global atomic reductions in half; a small back-of-envelope for both savings follows this list.</p>
  </li>
</ul>
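
<p>To put numbers on the two savings: the sketch below assumes a 128x128x128 MMA tile with BF16 operands, which are my illustrative choices rather than figures from the paper.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Illustrative accounting for the two savings described above. The tile shape
# and BF16 operands are assumptions for this sketch, not figures from the paper.
BYTES_BF16 = 2
TILE_M = TILE_N = TILE_K = 128                 # assumed MMA tile shape

# (a) Operand-B staging per MMA: with cta_group::2 each CTA holds half of B.
b_bytes_1cta = TILE_N * TILE_K * BYTES_BF16    # 32 KiB per CTA
b_bytes_2cta = b_bytes_1cta // 2               # 16 KiB per CTA

# (b) dQ accumulation: after swapping half of dS over DSMEM, each CTA writes
# only half of the dQ tile, so the global atomic adds per tile are halved.
dq_atomics_1cta = TILE_M * TILE_K              # one atomic add per dQ element
dq_atomics_2cta = dq_atomics_1cta // 2

print(b_bytes_1cta, b_bytes_2cta)              # 32768 16384
print(dq_atomics_1cta, dq_atomics_2cta)        # 16384 8192
</code></pre></div></div>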

<p><strong>3. Exploiting TMEM and Full Asynchrony</strong></p>

<p>Blackwell introduces Tensor Memory (TMEM), a 256 KB on-chip memory per SM specifically for storing intermediate tensor core results. Unlike Hopper, where MMAs wrote to registers and caused massive register pressure, Blackwell’s MMAs write asynchronously directly to TMEM. FlashAttention-4 redesigns the software pipeline to aggressively overlap the fully asynchronous tensor core operations with softmax and memory operations, utilizing the larger 128x128 MMA tiles.</p>
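
<p>A quick capacity check helps explain why the larger tiles are comfortable here; the 256 KB figure is from the paper, while FP32 accumulation is an assumption on my part.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># TMEM capacity check: 256 KB per SM is quoted above; FP32 accumulation is assumed.
TMEM_BYTES = 256 * 1024
TILE_M = TILE_N = 128                     # the larger Blackwell MMA tile
acc_tile_bytes = TILE_M * TILE_N * 4      # one FP32 accumulator tile = 64 KiB
print(TMEM_BYTES // acc_tile_bytes)       # 4 -- room for several in-flight accumulators
</code></pre></div></div>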

<p><strong>4. A Pythonic Shift: CuTe-DSL</strong></p>

<p>From a developer ecosystem standpoint, one of the most exciting updates is the move away from notoriously slow C++ template metaprogramming. FlashAttention-4 is written entirely in <strong>CuTe-DSL embedded in Python</strong>. This framework lowers to PTX and compiles just-in-time (JIT), reducing compile times by 20-30x (down to ~1.4–2.5 seconds from 45–55 seconds) while retaining full low-level expressivity.</p>

<p><strong>Performance Outcomes</strong></p>

<p>By directly addressing the shifting hardware bottlenecks, FlashAttention-4 achieves impressive performance on B200 GPUs:</p>
<ul>
  <li>Up to <strong>1.3x speedup over cuDNN 9.13</strong> and <strong>2.7x over Triton</strong> for BF16.</li>
  <li>Achieves up to <strong>1613 TFLOPs/s</strong>, utilizing 71% of the theoretical maximum compute.</li>
</ul>

<p><strong>Insights</strong></p>

<ol>
  <li>
    <p><strong>The End of “Compute-Bound” Attention (For Now):</strong> We are entering an era where keeping the tensor cores fed is harder than the matrix multiplications themselves. Kernel developers can no longer treat GPUs as uniformly scaling compute machines; they must track how MMA throughput compares to memory bandwidth and to the throughput of the non-linear function units.</p>
  </li>
  <li>
    <p><strong>Software Emulation is Viable at Lower Precisions:</strong> The decision to emulate the exponential function using FMA units is a brilliant application of precision-aware optimization. It highlights a broader insight for AI system design: if your target data type (like BF16) has a high quantization error, you have “budget” to use cheaper, faster mathematical approximations without harming the end result.</p>
  </li>
  <li>
    <p><strong>Python for Low-Level GPU Kernels is Maturing:</strong> The use of CuTe-DSL proves that writing bare-metal, highly optimized GPU kernels no longer strictly requires the painful compilation cycles of complex C++ libraries like CUTLASS. This lowering of the barrier to entry will likely accelerate community experimentation with new attention variants.</p>
  </li>
</ol>]]></content><author><name>Jianyu Huang</name><email>jianyu0huang@gmail.com</email></author><category term="FlashAttention" /><category term="FlashAttention" /><summary type="html"><![CDATA[Reading the following paper: FlashAttention-4: Algorithm and Kernel Pipelining Co-Design for Asymmetric Hardware Scaling]]></summary></entry></feed>