LLM Architecture Evolution
LLM Architecture Table
| Model | Parameters (Total / Active) | Arch. Type | Attention Mechanism | Key Technical Features & Innovations |
|---|---|---|---|---|
| Kimi K2 / Thinking | ~1 Trillion / 32B | MoE | MLA (Latent) | Largest open-weight model in this comparison; Uses Muon optimizer (instead of AdamW) for smoother loss decay; “Thinking” variant extends context to 256k. |
| DeepSeek V3 / R1 | 671B / 37B | MoE | MLA (Latent) | Fine-grained experts (256 routed, 8 active); 1 Shared Expert; MLA compresses KV cache; Aux-loss-free load balancing. |
| Llama 4 | 400B / 17B (Maverick) | MoE | GQA | Coarse-grained experts (2 active large experts, 8192 hidden size); Alternates MoE and Dense layers; Standard GQA. |
| Grok 2.5 | 270B / 115B | MoE | GQA | 8 Large Experts; Uses a Shared Expert implementation via an additional SwiGLU module. |
| GLM-4.5 | 355B / 32B (Air: 106B / 12B) | MoE | GQA | 3 Dense Layers at the start (stabilizes early feature extraction before routing); Shared Expert; Retains Attention Bias units. |
| gpt-oss | 120B / 5.1B (20B variant: 3.6B active) | MoE | SWA + GQA | Wide & Shallow (gpt-oss-20b: 24 layers, 2880 dim); 32 experts / 4 active in the 20B (128 / 4 in the 120B); Attention Sinks implemented as learned bias logits, not tokens (see the sketch after the table); Bias units in attention. |
| Qwen3 (MoE) | 235B / 22B | MoE | GQA | No Shared Expert (unlike Qwen2.5-MoE or Qwen3-Next); 128 routed experts, 8 active per token; Design favors inference efficiency at serving scale. |
| Qwen3-Next | 80B / ~3B | MoE + Hybrid | Hybrid (3:1) | DeltaNet (Linear) + Gated Attn; Multi-Token Prediction (MTP); Returns to using a Shared Expert; Native 262k context. |
| MiniMax-M2 | ~230B / 10B | MoE | GQA | Hyper-sparse (4.37% active); Per-Layer QK-Norm (unique norm per head); Partial RoPE (rotates only 50% of dims); Reverted from Linear Attention to Full Attention. |
| Kimi Linear | 48B / 3B | MoE + Hybrid | Hybrid (3:1) | DeltaNet + MLA; Channel-wise gating (vs. scalar in Qwen); NoPE (No Positional Embeddings) in the MLA layers; Matches performance of full MLA. |
| DeepSeek V3.2 | 671B (+14B MTP) / 37B | MoE | Sparse Attn | Introduces DeepSeek Sparse Attention (DSA) to improve long-context efficiency over V3; Reported benchmarks are comparable to GPT-5.1 / Gemini 3.0 Pro. |
| Gemma 3 | 1B – 27B | Dense | SWA + GQA | 5:1 Sliding Window ratio (1024 token window); Hybrid Norm (RMSNorm applied Pre and Post block). |
| Gemma 3n | 4B | Dense | SWA + GQA | Mobile optimized; Per-Layer Embedding (PLE) (streams embeddings from CPU/SSD); MatFormer (sliceable architecture). |
| Mistral Small 3.1 | 24B | Dense | GQA | Abandoned Sliding Window Attention (reverted to standard GQA); Faster inference than Gemma 3 thanks to a smaller KV cache and fewer layers. |
| Olmo 3 | 7B and 32B (dense) | Dense | MHA / GQA | Post-Norm (RMSNorm after blocks); SWA (only in 7B); YaRN for context extension (up to 64k); Fully transparent/open-source. |
| SmolLM3 | 3B | Dense | GQA | NoPE (No Positional Embeddings) in a subset of layers (every 4th); in those layers the causal mask alone conveys token order; Excellent length generalization. |
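
One implementation detail from the table worth making concrete is the gpt-oss style of attention sink: instead of prepending a special sink token, each head learns one extra bias logit that competes in the softmax, giving queries a place to dump attention mass. The sketch below is a minimal, illustrative version (the function name, shapes, and single-logit-per-head setup are assumptions, not the gpt-oss implementation):

```python
import torch
import torch.nn.functional as F

def attention_with_sink_logit(q, k, v, sink_logit):
    """Causal scaled dot-product attention with a learned per-head sink logit.

    q, k, v     : (batch, heads, seq, head_dim)
    sink_logit  : (heads,) learned bias acting like an extra key every query
                  can attend to, but which contributes no value.
    """
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d**0.5                 # (b, h, s, s)
    s = scores.size(-1)
    causal = torch.triu(torch.ones(s, s, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(causal, float("-inf"))
    # Append the sink logit as one extra "virtual" column per head.
    sink = sink_logit.view(1, -1, 1, 1).expand(scores.size(0), -1, s, 1)
    scores = torch.cat([scores, sink], dim=-1)                # (b, h, s, s+1)
    probs = F.softmax(scores, dim=-1)
    # Drop the sink column: its only effect is to shrink the other weights.
    return probs[..., :-1] @ v

# Toy usage
b, h, s, d = 1, 4, 8, 16
q, k, v = (torch.randn(b, h, s, d) for _ in range(3))
sink_logit = torch.zeros(h, requires_grad=True)               # learned parameter
print(attention_with_sink_logit(q, k, v, sink_logit).shape)   # torch.Size([1, 4, 8, 16])
```

Because the sink column is discarded after the softmax, its only effect is to reduce the remaining attention weights, which relieves heads that would otherwise be forced to attend strongly somewhere even when no token is relevant.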
- The “Sparsity” Camp (MoE Granularity):
- Fine-Grained: DeepSeek V3 and Kimi K2 favor hundreds of small experts (256 and 384 routed experts, respectively), activating 8 per token plus a shared expert, so individual experts can specialize deeply in niche concepts.
- Coarse-Grained: Llama 4, Grok 2.5, and gpt-oss stick to the “few and large” expert philosophy (e.g., Grok 2.5 uses just 8 large experts, and Llama 4 activates only 2 large experts per token), prioritizing simpler routing over extreme specialization; a routing sketch follows this list.
- The “Memory” Camp (Attention Compression):
- Latent Compression (MLA): Used by DeepSeek and Kimi, this compresses the Keys and Values into a low-rank latent vector that is cached in place of full per-head keys and values. It is more complex to implement but offers the best performance-to-memory trade-off of the three approaches.
- Locality (SWA): Used by Gemma 3 and gpt-oss. By restricting most layers (5 of every 6 in Gemma 3) to the last 1,024 tokens, they cap the KV cache for those layers regardless of total context length; see the cache-size sketch after this list.
- The “Linear” Camp ($O(n)$ Complexity):
- Qwen3-Next and Kimi Linear have successfully deployed DeltaNet-style layers (a recurrent, linear-attention mechanism). By interleaving them at a 3:1 ratio (three linear layers for every full-attention layer), they retain most of the precise long-range retrieval of full attention while gaining RNN-like speed and constant memory per token; a delta-rule sketch follows.
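
To make the granularity trade-off concrete, here is a minimal top-k MoE layer with an optional always-on shared expert. It is an illustrative sketch, not any model's actual code: the class name, the plain SiLU expert MLPs, and the scaled-down expert counts in the usage example are all assumptions.

```python
import torch
import torch.nn as nn

class MoELayer(nn.Module):
    """Generic top-k MoE feed-forward layer with an optional shared expert."""

    def __init__(self, d_model, d_expert, n_experts, top_k, shared_expert=True):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        make_ffn = lambda: nn.Sequential(
            nn.Linear(d_model, d_expert), nn.SiLU(), nn.Linear(d_expert, d_model)
        )
        self.experts = nn.ModuleList([make_ffn() for _ in range(n_experts)])
        self.shared = make_ffn() if shared_expert else None

    def forward(self, x):                        # x: (tokens, d_model)
        logits = self.router(x)                  # (tokens, n_experts)
        # DeepSeek-style aux-loss-free balancing would add an adjustable
        # per-expert bias to `logits` before selection (omitted here).
        weights, idx = logits.softmax(-1).topk(self.top_k, dim=-1)
        weights = weights / weights.sum(-1, keepdim=True)     # renormalize over chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):           # loop form for clarity, not speed
            for e in idx[:, slot].unique().tolist():
                mask = idx[:, slot] == e
                w = weights[mask, slot].unsqueeze(-1)
                out[mask] += w * self.experts[e](x[mask])
        if self.shared is not None:
            out = out + self.shared(x)           # shared expert sees every token
        return out

# Fine-grained (DeepSeek/Kimi-style) vs coarse-grained (Llama 4 / Grok-style), scaled-down toys
tokens = torch.randn(16, 512)
fine   = MoELayer(512, 256,  n_experts=64, top_k=8)   # many small experts, more active
coarse = MoELayer(512, 2048, n_experts=8,  top_k=2)   # few large experts, fewer active
print(fine(tokens).shape, coarse(tokens).shape)
```

Fine- and coarse-grained designs differ only in the numbers passed in: many small experts with a higher top-k versus a few large experts with a low top-k; the router and the rest of the block are unchanged.

The memory argument is easiest to see as arithmetic. The sketch below estimates per-sequence KV-cache size under plain GQA caching, MLA-style latent caching (one compressed vector per token instead of per-head keys and values), and a sliding-window cache capped at the window size. All dimensions are illustrative, not taken from any model in the table:

```python
def kv_cache_bytes(seq_len, n_layers, bytes_per_val=2, *,
                   n_kv_heads=None, head_dim=None,   # MHA/GQA path
                   latent_dim=None,                  # MLA path: one latent per token
                   window=None):                     # SWA path: cache capped at the window
    """Rough per-sequence KV-cache size under three caching schemes."""
    cached_tokens = min(seq_len, window) if window else seq_len
    if latent_dim is not None:                       # MLA: cache one compressed latent
        per_token = latent_dim
    else:                                            # MHA/GQA: cache K and V per kv-head
        per_token = 2 * n_kv_heads * head_dim
    return cached_tokens * n_layers * per_token * bytes_per_val

GiB = 1024**3
ctx, layers = 128_000, 48
print("GQA          : %.1f GiB" % (kv_cache_bytes(ctx, layers, n_kv_heads=8, head_dim=128) / GiB))
print("MLA (latent) : %.1f GiB" % (kv_cache_bytes(ctx, layers, latent_dim=512) / GiB))
print("SWA (w=1024) : %.2f GiB" % (kv_cache_bytes(ctx, layers, n_kv_heads=8, head_dim=128, window=1024) / GiB))
```

In real hybrids such as Gemma 3, only the sliding-window layers get the capped cache; the interleaved global-attention layers still pay the full-context cost.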
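
Finally, the delta rule that DeltaNet-style layers build on can be written in a few lines. This is a rough, single-head, sequential sketch under simplifying assumptions (no key normalization, no decay gate, no chunked parallel scan, all of which real implementations such as Gated DeltaNet add); it only shows why the state, and therefore the per-token cost, stays fixed as the sequence grows:

```python
import torch

def delta_rule_attention(q, k, v, beta):
    """Sequential delta-rule linear attention for one head (illustrative).

    q, k, v : (seq, dim)   queries, keys, values
    beta    : (seq,)       per-token write strength in [0, 1]

    State S is a (dim x dim) fast-weight matrix mapping keys to values.
    Each step first looks up the value currently stored under k_t, then
    writes the difference, so memory is overwritten rather than only
    accumulated (the "delta" that distinguishes this from plain linear attention).
    """
    seq, dim = q.shape
    S = torch.zeros(dim, dim)                 # fixed-size state, independent of seq length
    out = torch.empty_like(v)
    for t in range(seq):
        k_t, v_t, q_t = k[t], v[t], q[t]
        pred = S @ k_t                                    # value currently associated with k_t
        S = S + beta[t] * torch.outer(v_t - pred, k_t)    # delta-rule update
        out[t] = S @ q_t                                  # read with the query
    return out

# Toy usage: per-token cost is O(dim^2), with no growing KV cache.
seq, dim = 32, 64
q, k, v = (torch.randn(seq, dim) for _ in range(3))
beta = torch.sigmoid(torch.randn(seq))        # a learned gate in real models
print(delta_rule_attention(q, k, v, beta).shape)   # torch.Size([32, 64])
```

The matrix state S plays the role of the KV cache but never grows, which is where the $O(n)$ total / constant per-token cost of the linear layers comes from.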