A read on LLM architecture evolution:

LLM Architecture Table

| Model | Parameters (Total / Active) | Arch. Type | Attention Mechanism | Key Technical Features & Innovations |
| --- | --- | --- | --- | --- |
| Kimi K2 / Thinking | ~1 Trillion / 32B | MoE | MLA (Latent) | Largest model in current generation; Uses Muon optimizer (instead of AdamW) for smoother loss decay; “Thinking” variant extends context to 256k. |
| DeepSeek V3 / R1 | 671B / 37B | MoE | MLA (Latent) | Fine-grained experts (256 routed, 8 active); 1 Shared Expert; MLA compresses KV cache; Aux-loss-free load balancing. |
| Llama 4 | 400B / 17B | MoE | GQA | Coarse-grained experts (2 active large experts, 8192 hidden size); Alternates MoE and Dense layers; Standard GQA. |
| Grok 2.5 | 270B / 115B | MoE | GQA | 8 Large Experts; Uses a Shared Expert implementation via an additional SwiGLU module. |
| GLM-4.5 | 355B (Base) / 106B (Air) | MoE | GQA | 3 Dense Layers at the start (stabilizes early feature extraction before routing); Shared Expert; Retains Attention Bias units. |
| gpt-oss | 120B / 3.6B | MoE | SWA + GQA | Wide & Shallow (24 layers, 2880 dim); 32 experts / 4 active; Attention Sinks implemented as learned bias logits (not tokens); Bias units in attention. |
| Qwen3 (MoE) | 235B / 22B | MoE | GQA | No Shared Expert (unlike Qwen2.5 or Qwen3-Next); 8 routed experts; Focused on serving scale. |
| Qwen3-Next | 80B / ~3B | MoE + Hybrid | Hybrid (3:1) | DeltaNet (Linear) + Gated Attn; Multi-Token Prediction (MTP); Returns to using a Shared Expert; Native 262k context. |
| MiniMax-M2 | ~235B / 10B | MoE | GQA | Hyper-sparse (4.37% active); Per-Layer QK-Norm (unique norm per head); Partial RoPE (rotates only 50% of dims); Reverted from Linear Attention to Full Attention. |
| Kimi Linear | 48B | Hybrid | Hybrid (3:1) | DeltaNet + MLA; Channel-wise gating (vs. scalar in Qwen); NoPE (No Positional Embeddings) in MLA layers; Matches performance of full MLA. |
| DeepSeek V3.2 | 671B (+14B MTP) / 37B | MoE | Sparse Attn | Introduces Sparse Attention to improve efficiency over V3; Benchmarks comparable to GPT-5.1/Gemini 3.0 Pro. |
| Gemma 3 | 1B – 27B | Dense | SWA + GQA | 5:1 Sliding Window ratio (1024-token window); Hybrid Norm (RMSNorm applied Pre and Post block). |
| Gemma 3n | 4B | Dense | SWA + GQA | Mobile-optimized; Per-Layer Embedding (PLE) streams embeddings from CPU/SSD; MatFormer (sliceable architecture). |
| Mistral Small 3.1 | 24B | Dense | GQA | Abandoned Sliding Window Attention (reverted to standard GQA); Faster inference than Gemma 3 via cache/layer reduction. |
| Olmo 3 | 7B / 32B | Dense | MHA / GQA | Post-Norm (RMSNorm after blocks); SWA (only in 7B); YaRN for context extension (up to 64k); Fully transparent/open-source. |
| SmolLM3 | 3B | Dense | GQA | NoPE (No Positional Embeddings): relies entirely on the Causal Mask for directionality; Excellent length generalization. |
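Since GQA is the most common entry in the attention column above, here is a minimal sketch of it in PyTorch: several query heads share one key/value head, which shrinks the KV cache by the ratio of query heads to KV heads. The head counts and dimensions are illustrative placeholders, not the configuration of any model in the table, and RoPE, QK-Norm, and bias/sink logits are omitted.

```python
# Minimal grouped-query attention (GQA) sketch.
# Sizes are illustrative; real models add RoPE, norms, sinks, etc.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupedQueryAttention(nn.Module):
    def __init__(self, d_model=1024, n_q_heads=16, n_kv_heads=4):
        super().__init__()
        assert n_q_heads % n_kv_heads == 0
        self.n_q, self.n_kv = n_q_heads, n_kv_heads
        self.d_head = d_model // n_q_heads
        self.wq = nn.Linear(d_model, n_q_heads * self.d_head, bias=False)
        self.wk = nn.Linear(d_model, n_kv_heads * self.d_head, bias=False)  # fewer KV heads
        self.wv = nn.Linear(d_model, n_kv_heads * self.d_head, bias=False)  # => smaller KV cache
        self.wo = nn.Linear(n_q_heads * self.d_head, d_model, bias=False)

    def forward(self, x):
        b, t, _ = x.shape
        q = self.wq(x).view(b, t, self.n_q, self.d_head).transpose(1, 2)
        k = self.wk(x).view(b, t, self.n_kv, self.d_head).transpose(1, 2)
        v = self.wv(x).view(b, t, self.n_kv, self.d_head).transpose(1, 2)
        # Each KV head is shared by n_q / n_kv query heads.
        k = k.repeat_interleave(self.n_q // self.n_kv, dim=1)
        v = v.repeat_interleave(self.n_q // self.n_kv, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.wo(out.transpose(1, 2).reshape(b, t, -1))
```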
  1. The “Sparsity” Camp (MoE Granularity):
    • Fine-Grained: DeepSeek V3 and Kimi K2 favor hundreds of small experts (256+), routing each token to 8 of them so that individual experts can specialize deeply in niche concepts.
    • Coarse-Grained: Llama 4, Grok 2.5, and gpt-oss stick with the “few and large” expert philosophy (e.g., Grok 2.5 uses only 8 large experts, and Llama 4 activates just 2 large experts per token), prioritizing simpler routing over extreme specialization (see the routing sketch after this list).
  2. The “Memory” Camp (Attention Compression):
    • Latent Compression (MLA): Used by DeepSeek and Kimi, this compresses the Key/Value heads into a shared low-dimensional latent vector before caching. It is more involved to implement but offers the best performance-to-memory ratio of the three approaches.
    • Locality (SWA): Used by Gemma 3 and gpt-oss. By restricting most layers to a short local window (in Gemma 3, 5 out of every 6 layers see only the last 1,024 tokens), the KV cache for those layers is capped at the window size, regardless of total context length (see the cache-size comparison after this list).
  3. The “Linear” Camp ($O(n)$ Complexity):
    • Qwen3-Next and Kimi Linear have successfully deployed DeltaNet (a recurrent, linear-complexity mechanism). By interleaving at a 3:1 ratio (3 linear layers for every full-attention layer), they keep the precise long-range retrieval of full attention in the periodic global layers while gaining RNN-like speed and constant memory everywhere else (see the hybrid-stack sketch after this list).
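To make the routing contrast in point 1 concrete, below is a minimal top-k MoE layer sketch in PyTorch. The expert count, hidden size, and the optional always-active shared expert are illustrative placeholders, not any model's actual config: plug in something like 256 routed / 8 active for the fine-grained style or 8 routed / 2 active for the coarse-grained style. Real implementations add load balancing (auxiliary losses or aux-loss-free bias terms) and fused scatter/gather kernels instead of the naive loop used here for clarity.

```python
# Top-k MoE routing sketch; expert counts are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, d_model=1024, d_ff=2048, n_experts=8, top_k=2, shared_expert=True):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])
        # Optional always-active shared expert (used by e.g. DeepSeek V3, Grok 2.5, Qwen3-Next).
        self.shared = (
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            if shared_expert else None
        )

    def forward(self, x):                        # x: (batch, seq, d_model)
        scores = self.router(x)                  # (batch, seq, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)     # normalize over the selected experts
        out = torch.zeros_like(x)
        # Naive loop for clarity; real kernels dispatch tokens to experts in parallel.
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = (idx[..., slot] == e)
                if mask.any():
                    out[mask] += weights[..., slot][mask].unsqueeze(-1) * expert(x[mask])
        if self.shared is not None:
            out = out + self.shared(x)           # shared expert sees every token
        return out
```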
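For point 2, a back-of-the-envelope comparison of per-sequence KV-cache footprints: full-attention GQA that caches everything, the 5:1 sliding-window scheme, and an MLA-style compressed latent cache. Every number here (layer count, KV head count, head and latent widths, fp16 storage) is an assumed illustration, not the config of any model in the table.

```python
# Rough KV-cache size comparison (illustrative dimensions only).

def kv_cache_bytes_gqa(seq_len, n_layers, n_kv_heads, d_head, bytes_per_val=2):
    # Full attention: every layer caches K and V for every past token.
    return seq_len * n_layers * n_kv_heads * d_head * 2 * bytes_per_val

def kv_cache_bytes_swa(seq_len, n_layers, n_kv_heads, d_head, window=1024,
                       global_every=6, bytes_per_val=2):
    # Sliding-window layers cache at most `window` tokens; 1 in every
    # `global_every` layers stays global (the 5:1 pattern described above).
    n_global = n_layers // global_every
    n_local = n_layers - n_global
    cached_tokens = n_global * seq_len + n_local * min(seq_len, window)
    return cached_tokens * n_kv_heads * d_head * 2 * bytes_per_val

def kv_cache_bytes_mla(seq_len, n_layers, d_latent, bytes_per_val=2):
    # MLA caches one compressed latent vector per token per layer
    # instead of full per-head K and V tensors.
    return seq_len * n_layers * d_latent * bytes_per_val

if __name__ == "__main__":
    cfg = dict(seq_len=128_000, n_layers=48, n_kv_heads=8, d_head=128)
    print(f"GQA : {kv_cache_bytes_gqa(**cfg) / 1e9:.1f} GB")
    print(f"SWA : {kv_cache_bytes_swa(**cfg) / 1e9:.1f} GB")
    print(f"MLA : {kv_cache_bytes_mla(cfg['seq_len'], cfg['n_layers'], d_latent=576) / 1e9:.1f} GB")
```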
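For point 3, a minimal sketch of the 3:1 hybrid pattern: three linear-time delta-rule blocks for every full-attention block. The recurrence is written as a plain sequential loop for clarity; production DeltaNet/Kimi Linear implementations use chunked parallel kernels, richer gating and normalization, and (in Kimi Linear) MLA rather than vanilla attention in the global layers. All names and sizes below are illustrative assumptions.

```python
# Hybrid stack sketch: 3 linear (delta-rule) blocks per 1 full-attention block.
import torch
import torch.nn as nn
import torch.nn.functional as F

def delta_rule_attention(q, k, v, beta):
    """Sequential delta-rule memory (DeltaNet-style), O(n) in sequence length.
    q, k, v: (batch, seq, d); beta: (batch, seq, 1), a write-strength gate in (0, 1)."""
    b, t, d = q.shape
    state = q.new_zeros(b, d, d)          # fixed-size memory, independent of context length
    outs = []
    for i in range(t):
        k_i, v_i, b_i = k[:, i], v[:, i], beta[:, i]
        # Replace the old value stored under key k_i with v_i (the "delta" update):
        # S <- S + beta * (v - S k) k^T
        v_old = torch.einsum('bde,be->bd', state, k_i)
        state = state + torch.einsum('bd,be->bde', b_i * (v_i - v_old), k_i)
        outs.append(torch.einsum('bde,be->bd', state, q[:, i]))
    return torch.stack(outs, dim=1)

class HybridBlockStack(nn.Module):
    """Illustrative 3:1 layer pattern (the ratio used by Qwen3-Next and Kimi Linear)."""
    def __init__(self, d_model=512, n_blocks=8, full_attn_every=4):
        super().__init__()
        self.full_attn_every = full_attn_every
        self.qkvb = nn.ModuleList([nn.Linear(d_model, 3 * d_model + 1) for _ in range(n_blocks)])

    def forward(self, x):
        for i, proj in enumerate(self.qkvb):
            q, k, v, beta = proj(x).split([x.shape[-1]] * 3 + [1], dim=-1)
            if (i + 1) % self.full_attn_every == 0:
                # Every 4th block: exact softmax attention for precise long-range retrieval.
                h = F.scaled_dot_product_attention(q.unsqueeze(1), k.unsqueeze(1),
                                                   v.unsqueeze(1), is_causal=True).squeeze(1)
            else:
                # Other blocks: linear-time delta-rule memory.
                h = delta_rule_attention(F.normalize(q, dim=-1), F.normalize(k, dim=-1),
                                         v, torch.sigmoid(beta))
            x = x + h                      # residual; norms and MLPs omitted for brevity
        return x
```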