LLM Architecture Evolution
LLM Architecture Table
| Model | Parameters (Total / Active) | Arch. Type | Attention Mechanism | Key Technical Features & Innovations |
|---|---|---|---|---|
| Kimi K2 / Thinking | ~1 Trillion / 32B | MoE | MLA (Latent) | Largest open-weight model in this comparison; Uses Muon optimizer (instead of AdamW) for smoother loss decay; “Thinking” variant extends context to 256k. |
| DeepSeek V3 / R1 | 671B / 37B | MoE | MLA (Latent) | Fine-grained experts (256 routed, 8 active); 1 Shared Expert; MLA compresses KV cache; Aux-loss-free load balancing. |
| Llama 4 | 400B / 17B (Maverick) | MoE | GQA | Coarse-grained experts (2 active large experts, 8192 hidden size); Alternates MoE and Dense layers; Standard GQA. |
| Grok 2.5 | 270B / 115B | MoE | GQA | 8 Large Experts; Uses a Shared Expert implementation via an additional SwiGLU module. |
| GLM-4.5 | 355B / 32B (Air: 106B / 12B) | MoE | GQA | 3 Dense Layers at the start (stabilizes early feature extraction before routing); Shared Expert; Retains Attention Bias units. |
| gpt-oss | 120B / 5.1B (20B variant: 3.6B active) | MoE | SWA + GQA | Wide & Shallow (gpt-oss-20b: 24 layers, 2880 dim); 32 experts / 4 active in the 20B (128 / 4 in the 120B); Attention Sinks implemented as learned bias logits, not tokens (see the sketch after the table); Bias units in attention. |
| Qwen3 (MoE) | 235B / 22B | MoE | GQA | No Shared Expert (unlike Qwen2.5-MoE or Qwen3-Next); 128 routed experts, 8 active per token; Design favors inference efficiency at serving scale. |
| Qwen3-Next | 80B / ~3B | MoE + Hybrid | Hybrid (3:1) | DeltaNet (Linear) + Gated Attn; Multi-Token Prediction (MTP); Returns to using a Shared Expert; Native 262k context. |
| MiniMax-M2 | ~230B / 10B | MoE | GQA | Hyper-sparse (4.37% active); Per-Layer QK-Norm (unique norm per head); Partial RoPE (rotates only 50% of dims); Reverted from Linear Attention to Full Attention. |
| Kimi Linear | 48B / 3B | MoE + Hybrid | Hybrid (3:1) | DeltaNet + MLA; Channel-wise gating (vs. scalar in Qwen); NoPE (No Positional Embeddings) in the MLA layers; Matches performance of full MLA. |
| DeepSeek V3.2 | 671B (+14B MTP) / 37B | MoE | Sparse Attn | Introduces DeepSeek Sparse Attention (DSA) to improve long-context efficiency over V3; Reported benchmarks are comparable to GPT-5.1 / Gemini 3.0 Pro. |
| Gemma 3 | 1B – 27B | Dense | SWA + GQA | 5:1 Sliding Window ratio (1024 token window); Hybrid Norm (RMSNorm applied Pre and Post block). |
| Gemma 3n | 4B | Dense | SWA + GQA | Mobile optimized; Per-Layer Embedding (PLE) (streams embeddings from CPU/SSD); MatFormer (sliceable architecture). |
| Mistral Small 3.1 | 24B | Dense | GQA | Abandoned Sliding Window Attention (reverted to standard GQA); Faster inference than Gemma 3 thanks to a smaller KV cache and fewer layers. |
| Olmo 3 | 7B and 32B (dense) | Dense | MHA / GQA | Post-Norm (RMSNorm after blocks); SWA (only in 7B); YaRN for context extension (up to 64k); Fully transparent/open-source. |
| SmolLM3 | 3B | Dense | GQA | NoPE (No Positional Embeddings) in a subset of layers (every 4th); in those layers the causal mask alone conveys token order; Excellent length generalization. |
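
One implementation detail from the table worth making concrete is the gpt-oss style of attention sink: instead of prepending a special sink token, each head learns one extra bias logit that competes in the softmax, giving queries a place to dump attention mass. The sketch below is a minimal, illustrative version (the function name, shapes, and single-logit-per-head setup are assumptions, not the gpt-oss implementation):

```python
import torch
import torch.nn.functional as F

def attention_with_sink_logit(q, k, v, sink_logit):
    """Causal scaled dot-product attention with a learned per-head sink logit.

    q, k, v     : (batch, heads, seq, head_dim)
    sink_logit  : (heads,) learned bias acting like an extra key every query
                  can attend to, but which contributes no value.
    """
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d**0.5                 # (b, h, s, s)
    s = scores.size(-1)
    causal = torch.triu(torch.ones(s, s, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(causal, float("-inf"))
    # Append the sink logit as one extra "virtual" column per head.
    sink = sink_logit.view(1, -1, 1, 1).expand(scores.size(0), -1, s, 1)
    scores = torch.cat([scores, sink], dim=-1)                # (b, h, s, s+1)
    probs = F.softmax(scores, dim=-1)
    # Drop the sink column: its only effect is to shrink the other weights.
    return probs[..., :-1] @ v

# Toy usage
b, h, s, d = 1, 4, 8, 16
q, k, v = (torch.randn(b, h, s, d) for _ in range(3))
sink_logit = torch.zeros(h, requires_grad=True)               # learned parameter
print(attention_with_sink_logit(q, k, v, sink_logit).shape)   # torch.Size([1, 4, 8, 16])
```

Because the sink column is discarded after the softmax, its only effect is to reduce the remaining attention weights, which relieves heads that would otherwise be forced to attend strongly somewhere even when no token is relevant.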
- The “Sparsity” Camp (MoE Granularity):
- Fine-Grained: DeepSeek V3 and Kimi K2 favor hundreds of small experts (256 and 384 routed experts, respectively), activating 8 per token plus a shared expert, so individual experts can specialize deeply in niche concepts.
- Coarse-Grained: Llama 4, Grok 2.5, and gpt-oss stick to the “few and large” expert philosophy (e.g., Grok 2.5 uses just 8 large experts, and Llama 4 activates only 2 large experts per token), prioritizing simpler routing over extreme specialization; a routing sketch follows this list.
- The “Memory” Camp (Attention Compression):
- Latent Compression (MLA): Used by DeepSeek and Kimi, this compresses the Keys and Values into a low-rank latent vector that is cached in place of full per-head keys and values. It is more complex to implement but offers the best performance-to-memory trade-off of the three approaches.
- Locality (SWA): Used by Gemma 3 and gpt-oss. By restricting most layers (5 of every 6 in Gemma 3) to the last 1,024 tokens, they cap the KV cache for those layers regardless of total context length; see the cache-size sketch after this list.
- The “Linear” Camp ($O(n)$ Complexity):
- Qwen3-Next and Kimi Linear have successfully deployed DeltaNet-style layers (a recurrent, linear-attention mechanism). By interleaving them at a 3:1 ratio (three linear layers for every full-attention layer), they retain most of the precise long-range retrieval of full attention while gaining RNN-like speed and constant memory per token; a delta-rule sketch follows.
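
To make the granularity trade-off concrete, here is a minimal top-k MoE layer with an optional always-on shared expert. It is an illustrative sketch, not any model's actual code: the class name, the plain SiLU expert MLPs, and the scaled-down expert counts in the usage example are all assumptions.

```python
import torch
import torch.nn as nn

class MoELayer(nn.Module):
    """Generic top-k MoE feed-forward layer with an optional shared expert."""

    def __init__(self, d_model, d_expert, n_experts, top_k, shared_expert=True):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        make_ffn = lambda: nn.Sequential(
            nn.Linear(d_model, d_expert), nn.SiLU(), nn.Linear(d_expert, d_model)
        )
        self.experts = nn.ModuleList([make_ffn() for _ in range(n_experts)])
        self.shared = make_ffn() if shared_expert else None

    def forward(self, x):                        # x: (tokens, d_model)
        logits = self.router(x)                  # (tokens, n_experts)
        # DeepSeek-style aux-loss-free balancing would add an adjustable
        # per-expert bias to `logits` before selection (omitted here).
        weights, idx = logits.softmax(-1).topk(self.top_k, dim=-1)
        weights = weights / weights.sum(-1, keepdim=True)     # renormalize over chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):           # loop form for clarity, not speed
            for e in idx[:, slot].unique().tolist():
                mask = idx[:, slot] == e
                w = weights[mask, slot].unsqueeze(-1)
                out[mask] += w * self.experts[e](x[mask])
        if self.shared is not None:
            out = out + self.shared(x)           # shared expert sees every token
        return out

# Fine-grained (DeepSeek/Kimi-style) vs coarse-grained (Llama 4 / Grok-style), scaled-down toys
tokens = torch.randn(16, 512)
fine   = MoELayer(512, 256,  n_experts=64, top_k=8)   # many small experts, more active
coarse = MoELayer(512, 2048, n_experts=8,  top_k=2)   # few large experts, fewer active
print(fine(tokens).shape, coarse(tokens).shape)
```

Fine- and coarse-grained designs differ only in the numbers passed in: many small experts with a higher top-k versus a few large experts with a low top-k; the router and the rest of the block are unchanged.

The memory argument is easiest to see as arithmetic. The sketch below estimates per-sequence KV-cache size under plain GQA caching, MLA-style latent caching (one compressed vector per token instead of per-head keys and values), and a sliding-window cache capped at the window size. All dimensions are illustrative, not taken from any model in the table:

```python
def kv_cache_bytes(seq_len, n_layers, bytes_per_val=2, *,
                   n_kv_heads=None, head_dim=None,   # MHA/GQA path
                   latent_dim=None,                  # MLA path: one latent per token
                   window=None):                     # SWA path: cache capped at the window
    """Rough per-sequence KV-cache size under three caching schemes."""
    cached_tokens = min(seq_len, window) if window else seq_len
    if latent_dim is not None:                       # MLA: cache one compressed latent
        per_token = latent_dim
    else:                                            # MHA/GQA: cache K and V per kv-head
        per_token = 2 * n_kv_heads * head_dim
    return cached_tokens * n_layers * per_token * bytes_per_val

GiB = 1024**3
ctx, layers = 128_000, 48
print("GQA          : %.1f GiB" % (kv_cache_bytes(ctx, layers, n_kv_heads=8, head_dim=128) / GiB))
print("MLA (latent) : %.1f GiB" % (kv_cache_bytes(ctx, layers, latent_dim=512) / GiB))
print("SWA (w=1024) : %.2f GiB" % (kv_cache_bytes(ctx, layers, n_kv_heads=8, head_dim=128, window=1024) / GiB))
```

In real hybrids such as Gemma 3, only the sliding-window layers get the capped cache; the interleaved global-attention layers still pay the full-context cost.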
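
Finally, the delta rule that DeltaNet-style layers build on can be written in a few lines. This is a rough, single-head, sequential sketch under simplifying assumptions (no key normalization, no decay gate, no chunked parallel scan, all of which real implementations such as Gated DeltaNet add); it only shows why the state, and therefore the per-token cost, stays fixed as the sequence grows:

```python
import torch

def delta_rule_attention(q, k, v, beta):
    """Sequential delta-rule linear attention for one head (illustrative).

    q, k, v : (seq, dim)   queries, keys, values
    beta    : (seq,)       per-token write strength in [0, 1]

    State S is a (dim x dim) fast-weight matrix mapping keys to values.
    Each step first looks up the value currently stored under k_t, then
    writes the difference, so memory is overwritten rather than only
    accumulated (the "delta" that distinguishes this from plain linear attention).
    """
    seq, dim = q.shape
    S = torch.zeros(dim, dim)                 # fixed-size state, independent of seq length
    out = torch.empty_like(v)
    for t in range(seq):
        k_t, v_t, q_t = k[t], v[t], q[t]
        pred = S @ k_t                                    # value currently associated with k_t
        S = S + beta[t] * torch.outer(v_t - pred, k_t)    # delta-rule update
        out[t] = S @ q_t                                  # read with the query
    return out

# Toy usage: per-token cost is O(dim^2), with no growing KV cache.
seq, dim = 32, 64
q, k, v = (torch.randn(seq, dim) for _ in range(3))
beta = torch.sigmoid(torch.randn(seq))        # a learned gate in real models
print(delta_rule_attention(q, k, v, beta).shape)   # torch.Size([32, 64])
```

The matrix state S plays the role of the KV cache but never grows, which is where the $O(n)$ total / constant per-token cost of the linear layers comes from.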