Read DeepSeek V3 paper.

Summary

DeepSeek-V3: a 671B parameter Mixture-of-Experts (MoE) language model with 37B parameters activated per token. DeepSeek-V3 builds upon the successes of its predecessors, DeepSeek-V2 and DeepSeek-V2.5, incorporating several key innovations:

  • Architecture:
    • Employs Multi-head Latent Attention (MLA) for efficient inference and reduced Key-Value cache.
    • Utilizes an enhanced DeepSeekMoE architecture with an auxiliary-loss-free strategy for load balancing, minimizing performance degradation while encouraging expert specialization.
    • Introduces a Multi-Token Prediction (MTP) objective, predicting multiple future tokens for improved data efficiency and potential for speculative decoding.
  • Training:
    • Employs a novel DualPipe algorithm for efficient pipeline parallelism with reduced pipeline bubbles and computation-communication overlap.
    • Utilizes efficient cross-node all-to-all communication kernels, maximizing bandwidth utilization.
    • Supports FP8 mixed-precision training for accelerated training and reduced memory usage.
    • Implements memory optimizations, eliminating the need for costly tensor parallelism.
  • Pre-training:
    • Trained on a massive, diverse, and high-quality dataset comprising 14.8 trillion tokens.
    • Demonstrates remarkable stability throughout the training process.
    • Incorporates a Fill-in-the-Middle (FIM) strategy for improved bidirectional context understanding.
    • Features a two-stage context length extension process, enabling handling of inputs up to 128K tokens.
  • Post-training:
    • Includes Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) stages to align with human preferences.
    • Employs a novel knowledge distillation methodology from the DeepSeek-R1 series, enhancing reasoning capabilities while maintaining output style and length.
    • Utilizes DeepSeek-V3 itself as a generative reward model during RL, further improving alignment and performance.

DeepSeek-V3 excels across various benchmarks, including:

  • Knowledge: Outperforms other open-source models on MMLU, MMLU-Pro, GPQA, SimpleQA, and Chinese SimpleQA.
  • Code, Math, and Reasoning: Achieves state-of-the-art results on coding competition benchmarks (LiveCodeBench), math benchmarks (MATH-500), and exhibits robust performance on engineering tasks (SWE-Bench).

DeepSeek-V3 sets a new standard for open-source language models, achieving comparable performance to leading closed-source models while maintaining economical training costs ($5.576M, i.e. 2.788M H800 GPU-hours, roughly two months on a cluster of 2,048 H800 GPUs). The report concludes by highlighting limitations, primarily related to deployment scale and speed, and outlines future research directions, including:

  • Further architectural refinements for efficiency and infinite context length support.
  • Exploring alternative architectures beyond the Transformer.
  • Developing more general and scalable reward methods for RL.
  • Expanding multilingual capabilities and cultural awareness.

MLA vs. MHA vs. MQA

Reference: Jianlin’s blog.

Multi-head latent attention (MLA) is a core component of the DeepSeek-V3 architecture, designed for efficient inference and reduced memory consumption. It achieves this through low-rank joint compression of attention keys and values, minimizing the Key-Value (KV) cache required during text generation. Here’s a breakdown of the key differences between MLA and other attention mechanisms:

Multi-Head Attention (MHA):

  • Standard attention mechanism in Transformer models.
  • Computes attention for each head with its own key and value projections, so a full set of per-head keys and values must be cached during generation, resulting in a large KV cache.

Multi-Query Attention (MQA):

  • An efficient variant of MHA in which each head keeps its own query projection while a single key and value projection is shared across all heads.
  • Reduces the number of parameters and computations compared to MHA.
  • Still requires caching keys and values for generation, though only a single head’s worth, so the cache is far smaller than MHA’s.

Multi-Head Latent Attention (MLA):

  • Introduces low-rank compression for both keys and values, producing a significantly smaller latent representation.
  • Only the compressed latent vectors and decoupled keys (carrying positional information) need to be cached during generation.
  • Results in substantial KV cache reduction while maintaining performance comparable to MHA. In essence, MLA in DeepSeek-V3 addresses the memory bottleneck associated with large context lengths by compressing the key and value representations, allowing for efficient inference without sacrificing model performance. This makes it particularly well-suited for large language models like DeepSeek-V3 that are designed to handle long and complex inputs.

This approach forms the $K$ and $V$ matrices through a down-projection followed by an up-projection. Instead of storing $K$ and $V$ in the KV cache, you can store the much smaller latent matrix $C$. The process is as follows:

  • $C = X \times D$
  • $Q = X \times W_q$
  • $K = C \times U_k$
  • $V = C \times U_v$

During decoding, a traditional attention mechanism appends a new row of $k$ and $v$ to $K$ and $V$ for each generated token, and attention only needs to be evaluated for the newest query row; the earlier rows are unchanged because the operations that follow (MLP, RMSNorm) are row-wise, so the cached $K$ and $V$ (or, in MLA, the cached latent $C$) suffice. During inference, the up-projection $U_k$ is absorbed into $W_q$:

\[QK^T = X \times W_q \times (C \times U_k)^T = X \times W_q \times (X \times D \times U_k)^T = X \times W_q \times U_k^T \times D^T \times X^T = (X \times (W_q \times U_k^T)) \times (D^T \times X^T)\]

These two factors, the absorbed query $X (W_q U_k^T)$ and the transposed cached latent $D^T \times X^T = C^T$, can then be passed to Flash Attention!
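As a sanity check on this absorption, here is a minimal numpy sketch (toy dimensions and variable names chosen for illustration, not the model's real sizes) verifying that folding $U_k$ into $W_q$ and attending against the cached latent $C$ reproduces the same logits as materializing $K$. RoPE is ignored here; handling it is exactly why MLA keeps a separate decoupled RoPE key.

```python
import numpy as np

# Toy dimensions (illustrative only).
L, d, d_c, d_head = 6, 32, 8, 16

rng = np.random.default_rng(0)
X   = rng.standard_normal((L, d))         # token hidden states
W_q = rng.standard_normal((d, d_head))    # query projection
D   = rng.standard_normal((d, d_c))       # down-projection -> latent C
U_k = rng.standard_normal((d_c, d_head))  # up-projection   -> keys

# Naive path: materialize K from the latent, then compute Q K^T.
C = X @ D                 # (L, d_c) -- this is what MLA actually caches
Q = X @ W_q               # (L, d_head)
K = C @ U_k               # (L, d_head)
logits_naive = Q @ K.T    # (L, L)

# Absorbed path: fold U_k into the query projection; only C is needed.
W_q_absorbed = W_q @ U_k.T                  # (d, d_c)
logits_absorbed = (X @ W_q_absorbed) @ C.T  # (L, L)

assert np.allclose(logits_naive, logits_absorbed)
```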

FP8 Training Recipes

  1. Mixed Precision Framework:
    • DeepSeek-V3 strategically employs FP8 for compute-intensive operations, such as General Matrix Multiplications (GEMMs) in the forward and backward passes for linear layers. This theoretically doubles the computational speed compared to BF16.
    • Operators sensitive to low-precision computations, such as the embedding module, output head, MoE gating, normalization, and attention operators, retain their original precision (BF16 or FP32) for stability.
    • Master weights, gradients, and optimizer states are maintained in higher precision (FP32 or BF16) to ensure numerical stability.
  2. Fine-grained Quantization:
    • Tile-wise quantization for activations (1x128 tiles) and block-wise quantization for weights (128x128 blocks) ensure more accurate representation of values and better handling of outliers, mitigating the limited dynamic range of FP8.
    • Per-group scaling factors along the inner dimension of GEMM operations are introduced to further enhance quantization accuracy.
  3. Increased Accumulation Precision:
    • To address the limited accumulation precision of FP8 GEMM, DeepSeek-V3 periodically promotes partial results from the Tensor Cores to CUDA cores for full FP32 accumulation (at a fixed interval of 128 elements along the inner dimension). This contrasts with FP8 frameworks that rely solely on the Tensor Cores' limited-precision accumulation, leading to greater errors, especially with large inner dimensions.
  4. Mantissa over Exponent:
    • DeepSeek-V3 consistently uses the E4M3 (4-bit exponent, 3-bit mantissa) format for all tensors, prioritizing mantissa precision over the exponent range. This is made feasible by the fine-grained quantization strategy.
  5. Online Quantization:
    • DeepSeek-V3 employs online quantization where scaling factors and quantization are performed for each tile or block on-the-fly. This ensures accurate scaling based on the current data distribution, simplifying the framework.
  6. Low-Precision Storage and Communication:
    • Optimizer states use BF16 instead of FP32 for tracking the first and second moments without impacting performance.
    • Activations are cached in FP8, further reducing memory consumption. However, certain activations crucial for precision, such as those used in attention and MoE layers, utilize a customized E5M6 format or remain in BF16.
    • Activations are quantized to FP8 before MoE up-projection and activation gradients are quantized to FP8 before MoE down-projection, minimizing communication overhead.

These combined strategies allow DeepSeek-V3 to leverage the benefits of FP8 training – accelerated computation and reduced memory footprint – while maintaining comparable accuracy to higher precision training. The efficacy of this approach is demonstrated in ablation studies, where FP8 training exhibits a relative loss error consistently below 0.25% compared to the BF16 baseline.
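To make the fine-grained quantization concrete, below is a minimal numpy sketch of the 1x128 tile-wise quantize/dequantize idea for activations (weights would use 128x128 blocks analogously). The E4M3 rounding is simulated crudely by rounding the mantissa to 3 bits; this is not DeepSeek's kernel, just an illustration of why per-tile scales keep a single outlier from washing out the rest of a row.

```python
import numpy as np

E4M3_MAX = 448.0  # largest normal magnitude representable in FP8 E4M3

def round_to_e4m3(x):
    """Crude E4M3 simulator: round the mantissa to 3 bits.
    Ignores subnormals/saturation details -- illustration only."""
    m, e = np.frexp(x)                     # x = m * 2**e, |m| in [0.5, 1)
    return np.ldexp(np.round(m * 16) / 16, e)

def quant_dequant_tiles(act, tile=128):
    """Per-(1 x tile) scaling: each row tile gets its own scale."""
    out = np.empty_like(act)
    for i in range(act.shape[0]):
        for j in range(0, act.shape[1], tile):
            blk = act[i, j:j + tile]
            scale = np.abs(blk).max() / E4M3_MAX + 1e-12
            out[i, j:j + tile] = round_to_e4m3(blk / scale) * scale
    return out

rng = np.random.default_rng(0)
act = rng.standard_normal((4, 512)).astype(np.float32)
act[0, 3] = 80.0  # inject an outlier; only its own 1x128 tile pays for it
print("mean abs error:", np.abs(quant_dequant_tiles(act) - act).mean())
```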

MoE Architecture Configurations

DeepSeek-V3 employs a Mixture-of-Experts (MoE) architecture for its Feed-Forward Networks (FFNs), utilizing a total of 257 experts per MoE layer. This includes:

  • 1 shared expert: This expert is accessible to all tokens.
  • 256 routed experts: These experts are selectively activated based on the input tokens. For each token, only 8 routed experts are activated, determined by the token-to-expert affinity scores. Therefore, for each token processed by an MoE layer, a total of 9 experts (1 shared + 8 routed) contribute to the computation.

While DeepSeek-V3 activates 8 routed experts per token, the node-limited routing design allows it to scale up to 13 routed experts per token (4 nodes × 3.25 experts/node) without incurring additional communication overhead. This scalability is achieved through efficient cross-node communication that dispatches each token over InfiniBand to at most 4 nodes and then forwards it over NVLink within each node.
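For illustration, here is a minimal numpy sketch of the per-token routing described above: one always-on shared expert plus the top-k of the routed experts, with sigmoid token-to-expert affinities and a per-expert bias that is used only for ranking (the auxiliary-loss-free load-balancing idea). Sizes, helper names, and the bias-update rule are simplified placeholders, not the paper's exact formulation.

```python
import numpy as np

def moe_forward(h, W_gate, experts, shared_expert, bias, k=8):
    """One token through a DeepSeek-style MoE layer:
    shared expert always applied + top-k routed experts."""
    affinity = 1.0 / (1.0 + np.exp(-(W_gate @ h)))  # sigmoid affinity scores
    topk = np.argsort(affinity + bias)[-k:]         # biased scores pick the experts...
    gates = affinity[topk] / affinity[topk].sum()   # ...raw scores set the gate weights
    out = shared_expert(h)                          # shared expert sees every token
    for g, idx in zip(gates, topk):
        out = out + g * experts[idx](h)
    return out

# Tiny configuration for the demo (real model: 256 routed + 1 shared, k = 8).
rng = np.random.default_rng(0)
d, n_routed, k = 16, 32, 4
make_expert = lambda: (lambda W: (lambda x: W @ x))(rng.standard_normal((d, d)))
experts = [make_expert() for _ in range(n_routed)]
h = rng.standard_normal(d)
W_gate = rng.standard_normal((n_routed, d))
bias = np.zeros(n_routed)  # nudged up/down online to balance expert load
print(moe_forward(h, W_gate, experts, make_expert(), bias, k).shape)  # (16,)
```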

Inference Optimization

  1. Multi-Head Latent Attention (MLA): This architectural innovation significantly reduces the memory required for caching key-value pairs during text generation. By compressing keys and values into smaller latent representations, MLA enables efficient processing of long sequences without sacrificing performance. Only these compressed latent vectors and decoupled keys, containing positional information, need to be cached, resulting in a much smaller memory footprint compared to standard Multi-Head Attention.

  2. Separated Deployment Strategy: DeepSeek-V3 deploys the computationally intensive prefilling stage and the latency-sensitive decoding stage separately. This separation allows resource allocation and parallelism techniques to be tailored to each stage, ensuring both high throughput and low latency.

    • Prefilling Stage (TP4+DP8 for non-MoE, EP32 for MoE): Employs a combination of 4-way tensor parallelism, sequence parallelism, and 8-way data parallelism for the attention component. For the MoE component, it uses 32-way expert parallelism, ensuring a sufficiently large batch size per expert to maximize computational efficiency. The prefilling stage also utilizes redundant experts to balance load across GPUs and processes two micro-batches concurrently to improve throughput and hide communication overhead.

    • Decoding Stage (TP4+DP80 for non-MoE, EP320 for MoE): Configured for low-latency text generation, employing 4-way tensor parallelism with sequence parallelism for attention and 80-way data parallelism. The MoE component utilizes 320-way expert parallelism, with each GPU hosting only one expert. Direct point-to-point IB transfers are used for all-to-all communication to minimize latency, further enhanced by leveraging IBGDA technology. The decoding stage also benefits from redundant expert deployment and explores concurrent processing of two micro-batches with overlapping operations to optimize throughput.

  3. Multi-Token Prediction (MTP) for Speculative Decoding: Although primarily a training technique, MTP modules can be optionally utilized during inference for speculative decoding. By predicting multiple future tokens in advance, this technique can potentially accelerate text generation. DeepSeek-V3 demonstrates a high acceptance rate (85%-90%) for the second token predicted by MTP, leading to a significant speedup in decoding (1.8x increase in Tokens Per Second); see the back-of-envelope check after this list.

  4. Treatment of the Shared Expert During Decoding: DeepSeek-V3 handles the shared expert in the MoE layer as a routed expert during decoding. This ensures that every token considers the shared expert, effectively treating it as a high-priority expert that’s always selected.
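A quick back-of-envelope check of the MTP speedup figure above, assuming a single draft token per decoding step (i.e., one MTP module used at inference time):

```python
# With one draft token that is either accepted or rejected per step, the
# expected number of tokens emitted per decoding step is 1 + p,
# where p is the draft acceptance rate.
for p in (0.85, 0.90):
    print(f"acceptance {p:.0%}: ~{1 + p:.2f} tokens per step")
# ~1.85-1.90 tokens/step, broadly consistent with the reported ~1.8x TPS
# once the extra cost of running the MTP head is taken into account.
```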

Derivation of MLA

0. Notation and Dimensions

Let us define the tensor dimensions used throughout the formulas:

  • $B$: Batch size.
  • $L$: Sequence length.
  • $d$: Model hidden dimension ($5120$ in DeepSeek-V2; DeepSeek-V3 uses $7168$, with the remaining MLA dimensions unchanged).
  • $n_h$: Number of attention heads ($128$).
  • $d_h$: Dimension per attention head ($128$).
  • $d_c$: KV compression dimension (Latent KV size, $512$).
  • $d'_c$: Query compression dimension (Latent Query size, $1536$).
  • $d_R$: RoPE dimension per head ($64$).
  • $h_t$: Input hidden state at step $t$. Shape: $(B, L, d)$.

1. Query Generation (Compressed & Decoupled)

MLA compresses the query into a latent vector ($c_Q$) and then splits it into two parts: a “content” part ($q^C$) and a “RoPE” part ($q^R$).

Step 1.1: Query Compression The input $h_t$ is projected down to a compressed latent vector $c_{Q,t}$. \(c_{Q,t} = W_{DQ} h_t\)

  • Tensor Shape: $(B, L, d'_c)$
  • Weight: $W_{DQ} \in \mathbb{R}^{d'_c \times d}$

Step 1.2: Content Query Expansion The latent vector is up-projected to generate the “content” part of the queries for all heads. \(q^C_t = W_{UQ} c_{Q,t}\)

  • Tensor Shape: $(B, L, n_h, d_h)$
  • Weight: $W_{UQ} \in \mathbb{R}^{n_h d_h \times d'_c}$

Step 1.3: Decoupled RoPE Query Generation A separate projection generates the component used for Rotary Position Embeddings (RoPE). Note that RoPE is applied here. \(q^R_t = \text{RoPE}(W_{QR} c_{Q,t})\)

  • Tensor Shape: $(B, L, n_h, d_R)$
  • Weight: $W_{QR} \in \mathbb{R}^{n_h d_R \times d'_c}$

Step 1.4: Final Query Concatenation The final query for head $i$ is the concatenation of the content part and the RoPE part. \(q_{t,i} = [q^C_{t,i}; q^R_{t,i}]\)

  • Tensor Shape (per head): $(B, L, 1, d_h + d_R)$
  • Total Query Tensor: $(B, L, n_h, d_h + d_R)$
  • Note: The effective head dimension becomes $128 + 64 = 192$.

2. Key-Value Generation (Compressed & Decoupled)

This section contains the critical logic for memory savings. The “Content” Key and Value are derived from a single compressed latent vector ($c_{KV}$), while the “RoPE” Key is a separate, shared vector.

Step 2.1: KV Compression (Latent Vector) The input is projected down to the latent KV vector. This is the primary vector cached during inference. \(c_{KV,t} = W_{DKV} h_t\)

  • Tensor Shape: $(B, L, d_c)$
  • Weight: $W_{DKV} \in \mathbb{R}^{d_c \times d}$

Step 2.2: Content Key Expansion The latent vector is up-projected to generate the content keys. \(k^C_t = W_{UK} c_{KV,t}\)

  • Tensor Shape: $(B, L, n_h, d_h)$
  • Weight: $W_{UK} \in \mathbb{R}^{n_h d_h \times d_c}$

Step 2.3: Decoupled Shared RoPE Key The RoPE key is generated directly from the input $h_t$, not from the latent vector $c_{KV}$. Crucially, this key is shared (broadcast) across all attention heads. \(k^R_t = \text{RoPE}(W_{KR} h_t)\)

  • Tensor Shape: $(B, L, 1, d_R)$ (Broadcasts to $n_h$)
  • Weight: $W_{KR} \in \mathbb{R}^{d_R \times d}$
  • Cache Note: This vector is also cached during inference.

Step 2.4: Final Key Concatenation The final key for head $i$ combines the head-specific content key with the shared RoPE key. \(k_{t,i} = [k^C_{t,i}; k^R_t]\)

  • Total Key Tensor: $(B, L, n_h, d_h + d_R)$

Step 2.5: Value Expansion The values are also up-projected from the same latent vector $c_{KV}$. \(v^C_t = W_{UV} c_{KV,t}\)

  • Tensor Shape: $(B, L, n_h, d_h)$
  • Weight: $W_{UV} \in \mathbb{R}^{n_h d_h \times d_c}$

3. Attention Calculation & Output

Step 3.1: Scaled Dot-Product Attention The attention scores are calculated using the concatenated Query and Key vectors (dimension $d_h + d_R$). The values used are the expanded $v^C$.

\[o_{t,i} = \sum_{j=1}^{t} \text{Softmax}_j \left( \frac{q_{t,i}^T k_{j,i}}{\sqrt{d_h + d_R}} \right) v^C_{j,i}\]
  • Score Shape: $(B, n_h, L, L)$
  • Output Head Shape: $(B, L, n_h, d_h)$

Step 3.2: Final Output Projection The output heads are concatenated (flattened) and projected back to the model dimension. \(u_t = W_O [o_{t,1}; o_{t,2}; ...; o_{t,n_h}]\)

  • Concatenated Shape: $(B, L, n_h \cdot d_h)$
  • Weight: $W_O \in \mathbb{R}^{d \times n_h d_h}$
  • Final Output Shape: $(B, L, d)$
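To make the shapes in Steps 1.1-3.2 concrete, here is a minimal end-to-end numpy sketch. All dimensions are scaled down from the values in Section 0, the causal mask and the inference-time weight absorption are omitted, and the `rope` helper is a generic rotate-half implementation rather than DeepSeek's exact variant.

```python
import numpy as np

B, L = 2, 5
d, n_h, d_h, d_c, d_cq, d_r = 64, 4, 16, 8, 24, 8  # scaled-down dimensions

def rope(x, base=10000.0):
    """Rotate-half rotary embedding over the last dim of (B, L, H, d_r)."""
    Bx, Lx, Hx, dr = x.shape
    half = dr // 2
    ang = np.arange(Lx)[:, None] * base ** (-np.arange(half) / half)  # (L, half)
    cos, sin = np.cos(ang)[None, :, None, :], np.sin(ang)[None, :, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

rng = np.random.default_rng(0)
h     = rng.standard_normal((B, L, d))
W_DQ  = rng.standard_normal((d_cq, d))          # Step 1.1
W_UQ  = rng.standard_normal((n_h * d_h, d_cq))  # Step 1.2
W_QR  = rng.standard_normal((n_h * d_r, d_cq))  # Step 1.3
W_DKV = rng.standard_normal((d_c, d))           # Step 2.1 (output is cached)
W_UK  = rng.standard_normal((n_h * d_h, d_c))   # Step 2.2
W_KR  = rng.standard_normal((d_r, d))           # Step 2.3 (output is cached)
W_UV  = rng.standard_normal((n_h * d_h, d_c))   # Step 2.5
W_O   = rng.standard_normal((d, n_h * d_h))     # Step 3.2

# Queries: compressed latent -> content part + decoupled RoPE part.
c_q = h @ W_DQ.T                                         # (B, L, d_cq)
q_c = (c_q @ W_UQ.T).reshape(B, L, n_h, d_h)
q_r = rope((c_q @ W_QR.T).reshape(B, L, n_h, d_r))
q   = np.concatenate([q_c, q_r], axis=-1)                # (B, L, n_h, d_h + d_r)

# Keys/values: one cached latent + one cached shared RoPE key.
c_kv = h @ W_DKV.T                                       # (B, L, d_c)
k_c  = (c_kv @ W_UK.T).reshape(B, L, n_h, d_h)
v_c  = (c_kv @ W_UV.T).reshape(B, L, n_h, d_h)
k_r  = rope((h @ W_KR.T).reshape(B, L, 1, d_r))          # (B, L, 1, d_r)
k    = np.concatenate([k_c, np.broadcast_to(k_r, (B, L, n_h, d_r))], axis=-1)

# Attention over the concatenated head dim (192 in the real model), then output projection.
scores = np.einsum("blhd,bmhd->bhlm", q, k) / np.sqrt(d_h + d_r)
probs  = np.exp(scores - scores.max(-1, keepdims=True))
probs /= probs.sum(-1, keepdims=True)
o = np.einsum("bhlm,bmhd->blhd", probs, v_c).reshape(B, L, n_h * d_h)
u = o @ W_O.T                                            # (B, L, d)
print(u.shape)
```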

Inference Optimization Note (The “Blue Box”)

Explicit computation of $k^C_t$ and $v^C_t$ (Step 2.2 and 2.5) is not required during inference.

  1. Cache: We only cache $c_{KV,t}$ (size $d_c=512$) and $k^R_t$ (size $d_R=64$) per token.
  2. Absorption: Because matrix multiplication is associative, the up-projection matrices $W_{UK}$ and $W_{UV}$ can be absorbed into the Query projection ($W_{UQ}$) and Output projection ($W_O$), respectively.
    • Instead of expanding $c_{KV}$ to full keys ($n_h \times d_h = 16,384$), the model computes the attention score interaction directly in the latent space or via an absorbed projection matrix.

This reduces the KV cache from $2 \times n_h \times d_h \times L$ elements (standard MHA; $32{,}768$ per token per layer with these dimensions) to $(d_c + d_R) \times L$ (MLA; $576$ per token per layer), the compression behind the roughly 93.3% KV-cache reduction DeepSeek-V2 reports relative to DeepSeek 67B.

GQA with 2.25 groups

The “2.25” GQA equivalence is a metric derived from comparing the number of elements stored in the Key-Value (KV) cache of the Multi-Head Latent Attention (MLA) mechanism against that of Grouped-Query Attention (GQA).

The MLA architecture compresses the KV cache so aggressively that its per-token memory usage equals that of a GQA model with only 2.25 groups.

To find the equivalence, we equate the memory cost (number of elements cached per token) of MLA to GQA.

  • GQA Cache Size: In GQA, we cache a Key and a Value vector for each group ($n_g$). \(\text{Cache}_{\text{GQA}} = 2 \times n_g \times d_h\) Where $n_g$ is the number of groups and $d_h$ is the dimension per head.

  • MLA Cache Size: In MLA, we cache the compressed latent vector ($c_{KV}$) and the shared decoupled RoPE key ($k^R$). \(\text{Cache}_{\text{MLA}} = d_c + d_R\) Where $d_c$ is the KV compression dimension and $d_R$ is the decoupled RoPE dimension.

DeepSeek uses the following dimensions:

  • $d_h$ (Head Dimension): 128
  • $d_c$ (KV Compression Dimension): 512
  • $d_R$ (RoPE Dimension): 64

Alternatively, DeepSeek V2 expresses these as ratios in the Table 1 caption:

  • $d_c = 4 d_h$
  • $d_R = \frac{1}{2} d_h$

We solve for $n_g$ (number of groups) by setting the GQA cache size equal to the MLA cache size:

\[2 \cdot n_g \cdot d_h = d_c + d_R\]

Using the ratios:

\[2 \cdot n_g \cdot d_h = 4 d_h + 0.5 d_h\] \[2 \cdot n_g \cdot d_h = 4.5 d_h\]

Divide both sides by $d_h$:

\[2 \cdot n_g = 4.5\] \[n_g = 2.25\]

Using the raw numbers:

\[2 \cdot n_g \cdot 128 = 512 + 64\] \[256 \cdot n_g = 576\] \[n_g = \frac{576}{256}\] \[n_g = 2.25\]