Read DeepSeek V3 paper.

Summary

DeepSeek-V3: a 671B-parameter Mixture-of-Experts (MoE) language model with 37B parameters activated per token. DeepSeek-V3 builds upon the successes of its predecessors, DeepSeek-V2 and DeepSeek-V2.5, incorporating several key innovations:

  • Architecture:
    • Employs Multi-head Latent Attention (MLA) for efficient inference and reduced Key-Value cache.
    • Utilizes an enhanced DeepSeekMoE architecture with an auxiliary-loss-free strategy for load balancing, minimizing performance degradation while encouraging expert specialization.
    • Introduces a Multi-Token Prediction (MTP) objective, predicting multiple future tokens for improved data efficiency and potential for speculative decoding (a minimal sketch follows this list).
  • Training:
    • Employs a novel DualPipe algorithm for efficient pipeline parallelism with reduced pipeline bubbles and computation-communication overlap.
    • Utilizes efficient cross-node all-to-all communication kernels, maximizing bandwidth utilization.
    • Supports FP8 mixed-precision training for accelerated training and reduced memory usage.
    • Implements memory optimizations, eliminating the need for costly tensor parallelism.
  • Pre-training:
    • Trained on a massive, diverse, and high-quality dataset comprising 14.8 trillion tokens.
    • Demonstrates remarkable stability throughout the training process.
    • Incorporates a Fill-in-the-Middle (FIM) strategy for improved bidirectional context understanding.
    • Features a two-stage context length extension process, enabling handling of inputs up to 128K tokens.
  • Post-training:
    • Includes Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) stages to align with human preferences.
    • Employs a novel knowledge distillation methodology from the DeepSeek-R1 series, enhancing reasoning capabilities while maintaining output style and length.
    • Utilizes DeepSeek-V3 itself as a generative reward model during RL, further improving alignment and performance.
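
To make the MTP objective concrete, here is a minimal single-depth sketch in PyTorch. It follows my reading of the paper (combine the main model's hidden state for token i with the embedding of token i+1, run one extra block, and predict token i+2 with the shared output head), but the class name, the use of LayerNorm in place of RMSNorm, the single off-the-shelf encoder layer, and the head count are illustrative assumptions rather than the paper's exact module:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MTPHead(nn.Module):
    """Single-depth Multi-Token Prediction sketch (illustrative, not the paper's exact module).
    Reuses the main model's embedding table and output head, combines the main model's final
    hidden state for token i with the embedding of token i+1, and predicts token i+2."""
    def __init__(self, d_model, embedding, lm_head):
        super().__init__()
        self.embedding = embedding                 # shared with the main model
        self.lm_head = lm_head                     # shared output head
        self.norm_h = nn.LayerNorm(d_model)        # stand-in for RMSNorm
        self.norm_e = nn.LayerNorm(d_model)
        self.proj = nn.Linear(2 * d_model, d_model)
        self.block = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)

    def forward(self, hidden, input_ids):
        # hidden: (B, T, d) final hidden states of the main model; input_ids: (B, T)
        h = self.norm_h(hidden[:, :-2])                        # states for tokens t_1 .. t_{T-2}
        e = self.norm_e(self.embedding(input_ids[:, 1:-1]))    # embeddings of t_2 .. t_{T-1}
        x = self.block(self.proj(torch.cat([h, e], dim=-1)))   # causal mask omitted for brevity
        logits = self.lm_head(x)                               # predictions for t_3 .. t_T
        targets = input_ids[:, 2:]
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))

# Toy usage with stand-in main-model outputs:
d_model, vocab = 64, 1000
emb, head = nn.Embedding(vocab, d_model), nn.Linear(d_model, vocab, bias=False)
mtp = MTPHead(d_model, emb, head)
ids = torch.randint(0, vocab, (2, 16))
loss = mtp(torch.randn(2, 16, d_model), ids)
```

This auxiliary loss is added to the ordinary next-token loss with a small weighting factor, and the same module is what later enables the speculative decoding described in the Inference Optimization section.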

DeepSeek-V3 excels across various benchmarks, including:

  • Knowledge: Outperforms other open-source models on MMLU, MMLU-Pro, GPQA, SimpleQA, and Chinese SimpleQA.
  • Code, Math, and Reasoning: Achieves state-of-the-art results on coding competition benchmarks (LiveCodeBench) and math benchmarks (MATH-500), and shows robust performance on engineering tasks (SWE-Bench).

DeepSeek-V3 sets a new standard for open-source language models, achieving performance comparable to leading closed-source models while keeping training costs economical (about $5.576M, i.e., roughly 2K H800 GPUs for about two months). The report concludes by highlighting limitations, primarily related to deployment scale and speed, and outlines future research directions, including:

  • Further architectural refinements for efficiency and infinite context length support.
  • Exploring alternative architectures beyond the Transformer.
  • Developing more general and scalable reward methods for RL.
  • Expanding multilingual capabilities and cultural awareness.

MLA vs. MHA vs. MQA

Reference: Jianlin’s blog.

Multi-head latent attention (MLA) is a core component of the DeepSeek-V3 architecture, designed for efficient inference and reduced memory consumption. It achieves this through low-rank joint compression of attention keys and values, minimizing the Key-Value (KV) cache required during text generation. Here’s a breakdown of the key differences between MLA and other attention mechanisms:

Multi-Head Attention (MHA):

  • Standard attention mechanism in Transformer models.
  • Each head has its own key and value projections, so the keys and values of every head must be cached during generation, giving the largest KV cache of the three.

Multi-Query Attention (MQA):

  • An efficient variant of MHA where queries are projected independently, while keys and values are shared across heads.
  • Reduces the key/value projection parameters and shrinks the KV cache to a single head's worth of keys and values, typically at some cost in quality relative to MHA.
  • Still requires caching keys and values for every generated token.

Multi-Head Latent Attention (MLA):

  • Introduces low-rank compression for both keys and values, producing a significantly smaller latent representation.
  • Only the compressed latent vectors and decoupled keys (carrying positional information) need to be cached during generation.
  • Results in substantial KV cache reduction while maintaining performance comparable to MHA.

In essence, MLA in DeepSeek-V3 addresses the memory bottleneck associated with large context lengths by compressing the key and value representations, allowing for efficient inference without sacrificing model performance. This makes it particularly well-suited for large language models like DeepSeek-V3 that are designed to handle long and complex inputs.

This approach forms the $K$ and $V$ matrices through a down-projection followed by an up-projection. Instead of storing $K$ and $V$ in the KV cache, you can store the much smaller latent matrix $C$. With $D$ as the down-projection and $U_k$, $U_v$ as the up-projections, the process is as follows:

  • $C = X \times D$
  • $Q = X \times W_q$
  • $K = C \times U_k$
  • $V = C \times U_v$

During decoding, in traditional attention mechanisms a new row $k$, $v$ is appended to $K$ and $V$ for each new token, and $\text{softmax}\left(\frac{QK^T}{\sqrt{d}}\right)V$ only needs to be evaluated for the new row; earlier rows never change, because subsequent operations such as the MLP and RMSNorm act row-wise, so the per-layer KV cache is all that is required. With MLA, the key up-projection is absorbed into $W_q$ at inference time:

\[QK^T = X \times W_q \times (C \times U_k)^T = X \times W_q \times (X \times D \times U_k)^T = X \times W_q \times U_k^T \times D^T \times X^T = (X \times (W_q \times U_k^T)) \times (D^T \times X^T) = (X \times (W_q \times U_k^T)) \times C^T\]

These two factors, the effective query $X \times (W_q \times U_k^T)$ and the cached latent $C$, can then be passed to FlashAttention!
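
To make the algebra concrete, here is a small numerical sketch of the weight-absorption trick (single head, no $\sqrt{d}$ scaling, no decoupled RoPE keys; all dimensions and weight names are illustrative):

```python
import torch

torch.manual_seed(0)
T, d_model, d_latent, d_head = 5, 64, 16, 64

X   = torch.randn(T, d_model)            # token representations
D   = torch.randn(d_model, d_latent)     # down-projection
W_q = torch.randn(d_model, d_head)       # query projection
U_k = torch.randn(d_latent, d_head)      # key up-projection
U_v = torch.randn(d_latent, d_head)      # value up-projection

# Naive path: materialize K and V from the latent C.
C = X @ D                                # the only tensor that needs to be cached
Q = X @ W_q
K = C @ U_k
V = C @ U_v
scores_naive = Q @ K.T

# Absorbed path: fold U_k into the query projection so K is never materialized.
W_q_absorbed = W_q @ U_k.T               # (d_model x d_latent)
scores_absorbed = (X @ W_q_absorbed) @ C.T

print(torch.allclose(scores_naive, scores_absorbed, rtol=1e-3, atol=1e-3))  # True
# U_v can likewise be folded into the attention output projection, so V need not be stored either.
```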

FP8 Training Recipes

  1. Mixed Precision Framework:
    • DeepSeek-V3 strategically employs FP8 for compute-intensive operations, such as General Matrix Multiplications (GEMMs) in the forward and backward passes for linear layers. This theoretically doubles the computational speed compared to BF16.
    • Operators sensitive to low-precision computations, such as the embedding module, output head, MoE gating, normalization, and attention operators, retain their original precision (BF16 or FP32) for stability.
    • Master weights, gradients, and optimizer states are maintained in higher precision (FP32 or BF16) to ensure numerical stability.
  2. Fine-grained Quantization (see the sketch at the end of this section):
    • Tile-wise quantization for activations (1x128 tiles) and block-wise quantization for weights (128x128 blocks) ensure more accurate representation of values and better handling of outliers, mitigating the limited dynamic range of FP8.
    • Per-group scaling factors along the inner dimension of GEMM operations are introduced to further enhance quantization accuracy.
  3. Increased Accumulation Precision:
    • To address the potential underflow issues in FP8 GEMM, DeepSeek-V3 employs precise FP32 accumulation. This contrasts with some FP8 frameworks that use limited-precision accumulation within Tensor Cores, leading to greater errors, especially with large inner dimensions.
  5. Mantissa over Exponent:
    • DeepSeek-V3 consistently uses the E4M3 (4-bit exponent, 3-bit mantissa) format for all tensors, prioritizing mantissa precision over the exponent range. This is made feasible by the fine-grained quantization strategy.
  5. Online Quantization:
    • DeepSeek-V3 employs online quantization where scaling factors and quantization are performed for each tile or block on-the-fly. This ensures accurate scaling based on the current data distribution, simplifying the framework.
  6. Low-Precision Storage and Communication:
    • Optimizer states use BF16 instead of FP32 for tracking the first and second moments without impacting performance.
    • Activations are cached in FP8, further reducing memory consumption. However, certain activations crucial for precision, such as those used in attention and MoE layers, utilize a customized E5M6 format or remain in BF16.
    • Activations are quantized to FP8 before MoE up-projection and activation gradients are quantized to FP8 before MoE down-projection, minimizing communication overhead.

These combined strategies allow DeepSeek-V3 to leverage the benefits of FP8 training – accelerated computation and reduced memory footprint – while maintaining comparable accuracy to higher precision training. The efficacy of this approach is demonstrated in ablation studies, where FP8 training exhibits a relative loss error consistently below 0.25% compared to the BF16 baseline.
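
As a rough illustration of the fine-grained quantization in item 2, the sketch below simulates tile-wise (1x128) activation quantization with one scale per tile. It only round-trips the values (the real kernels fuse the scaling into the FP8 GEMM and accumulate in FP32), assumes a recent PyTorch that exposes torch.float8_e4m3fn, and applies analogously to 128x128 weight blocks:

```python
import torch

FP8_MAX = 448.0  # largest magnitude representable in E4M3 (torch.float8_e4m3fn)

def quantize_tilewise(x: torch.Tensor, tile: int = 128):
    """Quantize activations to E4M3 with one scaling factor per 1 x `tile` tile."""
    rows, cols = x.shape
    assert cols % tile == 0
    x_tiles = x.reshape(rows, cols // tile, tile)
    scales = x_tiles.abs().amax(dim=-1, keepdim=True).clamp_min(1e-12) / FP8_MAX
    x_fp8 = (x_tiles / scales).to(torch.float8_e4m3fn)   # stored/communicated in FP8
    return x_fp8, scales                                  # scales kept for dequant / FP32 accumulation

def dequantize(x_fp8: torch.Tensor, scales: torch.Tensor) -> torch.Tensor:
    return (x_fp8.to(torch.float32) * scales).reshape(x_fp8.shape[0], -1)

x = 10 * torch.randn(4, 256)
x_fp8, scales = quantize_tilewise(x)
rel_err = (dequantize(x_fp8, scales) - x).abs().max() / x.abs().max()
print(f"max relative error: {rel_err.item():.4f}")  # small, because each tile gets its own scale
```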

MoE Architecture Configurations

DeepSeek-V3 employs a Mixture-of-Experts (MoE) architecture for its Feed-Forward Networks (FFNs), utilizing a total of 257 experts per MoE layer. This includes:

  • 1 shared expert: This expert is applied to every token.
  • 256 routed experts: These experts are selectively activated based on the input tokens. For each token, only 8 routed experts are activated, determined by the token-to-expert affinity scores. Therefore, for each token processed by an MoE layer, a total of 9 experts (1 shared + 8 routed) contribute to the computation.

While DeepSeek-V3 activates 8 routed experts per token, the system’s design allows scaling up to 13 experts per token (4 nodes × 3.2 experts/node) without incurring additional communication overhead. This headroom comes from node-limited routing: each token is dispatched to at most 4 nodes over InfiniBand and then forwarded to the relevant experts via NVLink within each node.
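
The routing itself can be sketched as below, including the auxiliary-loss-free balancing bias mentioned in the Summary: affinities come from a sigmoid over per-expert centroids, the per-expert bias influences only which 8 experts are selected (not the gate values), and the bias is nudged after each step according to expert load. Expert sizes, initialization, the naive per-token loop, and the exact update rule here are simplifications of the paper's scheme:

```python
import torch
import torch.nn as nn

class MoELayer(nn.Module):
    """Sketch of DeepSeek-V3-style routing: 1 shared expert plus top-8 of 256 routed experts,
    with auxiliary-loss-free balancing via a per-expert bias that affects only expert
    selection, never the gate values. Sizes and update details are illustrative."""

    def __init__(self, d_model=64, d_ff=128, n_routed=256, top_k=8, gamma=0.001):
        super().__init__()
        ffn = lambda: nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
        self.shared = ffn()                                        # applied to every token
        self.routed = nn.ModuleList(ffn() for _ in range(n_routed))
        self.centroids = nn.Parameter(0.02 * torch.randn(n_routed, d_model))
        self.register_buffer("bias", torch.zeros(n_routed))        # adjusted outside backprop
        self.top_k, self.gamma = top_k, gamma

    def forward(self, x):                                          # x: (tokens, d_model)
        scores = torch.sigmoid(x @ self.centroids.T)               # token-to-expert affinities
        _, idx = torch.topk(scores + self.bias, self.top_k, dim=-1)  # bias steers selection only
        gates = torch.gather(scores, -1, idx)                      # gate values use raw affinities
        gates = gates / gates.sum(dim=-1, keepdim=True)
        routed_out = []
        for t in range(x.size(0)):                                 # naive loop; real kernels group tokens by expert
            routed_out.append(sum(gates[t, k] * self.routed[idx[t, k].item()](x[t])
                                  for k in range(self.top_k)))
        return x + self.shared(x) + torch.stack(routed_out), idx

    @torch.no_grad()
    def update_bias(self, idx):
        # Auxiliary-loss-free balancing: after each step, lower the bias of overloaded experts
        # and raise it for underloaded ones by a fixed speed gamma.
        load = torch.bincount(idx.flatten(), minlength=self.bias.numel()).float()
        self.bias += self.gamma * torch.sign(load.mean() - load)

layer = MoELayer()
out, idx = layer(torch.randn(16, 64))   # 16 tokens, each computed by 1 shared + 8 routed experts
layer.update_bias(idx)                  # nudge future routing toward balanced expert load
```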

Inference Optimization

  1. Multi-Head Latent Attention (MLA): This architectural innovation significantly reduces the memory required for caching key-value pairs during text generation. By compressing keys and values into smaller latent representations, MLA enables efficient processing of long sequences without sacrificing performance. Only these compressed latent vectors and decoupled keys, containing positional information, need to be cached, resulting in a much smaller memory footprint compared to standard Multi-Head Attention.

  2. Two-Stage Deployment Strategy: DeepSeek-V3 separates the computationally intensive prefilling stage from the latency-sensitive decoding stage. This separation allows for optimized resource allocation and parallelism techniques tailored to each stage, ensuring both high throughput and low latency.

    • Prefilling Stage (TP4+DP8 for non-MoE, EP32 for MoE): Employs a combination of 4-way tensor parallelism, sequence parallelism, and 8-way data parallelism for the attention component. For the MoE component, it uses 32-way expert parallelism, ensuring a sufficiently large batch size per expert to maximize computational efficiency. The prefilling stage also utilizes redundant experts to balance load across GPUs and processes two micro-batches concurrently to improve throughput and hide communication overhead.

    • Decoding Stage (TP4+DP80 for non-MoE, EP320 for MoE): Configured for low-latency text generation, employing 4-way tensor parallelism with sequence parallelism for attention and 80-way data parallelism. The MoE component utilizes 320-way expert parallelism, with each GPU hosting only one expert. Direct point-to-point IB transfers are used for all-to-all communication to minimize latency, further enhanced by leveraging IBGDA technology. The decoding stage also benefits from redundant expert deployment and explores concurrent processing of two micro-batches with overlapping operations to optimize throughput.

  3. Multi-Token Prediction (MTP) for Speculative Decoding: Although primarily a training technique, MTP modules can be optionally utilized during inference for speculative decoding. By predicting multiple future tokens in advance, this technique can potentially accelerate text generation. DeepSeek-V3 demonstrates a high acceptance rate (85%-90%) for the second token predicted by MTP, leading to a significant decoding speedup (roughly a 1.8x increase in tokens per second). A generic verify-step sketch follows this list.

  4. Treatment of the Shared Expert During Decoding: DeepSeek-V3 handles the shared expert in the MoE layer as a routed expert during decoding. This ensures that every token considers the shared expert, effectively treating it as a high-priority expert that’s always selected.
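
To show where the speedup in item 3 comes from, here is a generic greedy verify step for a single MTP-drafted token (a sketch under my own assumptions about the interface, not DeepSeek-V3's serving code): the main model runs one forward pass over the context plus the draft; if its own prediction for the draft's slot matches the draft, two tokens are emitted from that single pass, otherwise the draft is discarded.

```python
import torch

@torch.no_grad()
def verify_step(main_logits_fn, ids, draft_token):
    """One greedy speculative-decoding step with a single drafted token (generic sketch).
    Assumes main_logits_fn(ids) returns logits of shape (1, seq_len, vocab_size),
    where logits[:, i] predicts the token at position i + 1."""
    ids_with_draft = torch.cat([ids, draft_token], dim=1)
    logits = main_logits_fn(ids_with_draft)             # one full-model forward pass
    wanted = logits[:, -2].argmax(-1, keepdim=True)     # main model's own choice for the draft slot
    if wanted.item() == draft_token.item():
        bonus = logits[:, -1].argmax(-1, keepdim=True)  # draft accepted: keep the next token too
        return torch.cat([ids_with_draft, bonus], dim=1)  # 2 new tokens from 1 forward pass
    return torch.cat([ids, wanted], dim=1)              # draft rejected: fall back to 1 new token
```

With the reported 85%-90% acceptance rate for the drafted second token, most steps take the two-token branch, which is where the roughly 1.8x tokens-per-second improvement comes from.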