Gemma 4: Architecture and Multimodal Innovations
Reading notes on Gemma 4 Model Card.
Gemma 4 is a broad effort to optimize parameter efficiency, memory bandwidth, and long-context multimodal processing across edge and server deployments.
1. Model Lineup
Gemma 4 introduces four variants, categorized by structural paradigm:
| Model | Total / Active Params | Layers | Context | Modalities |
|---|---|---|---|---|
| E2B | 5.1B / 2.3B effective | 35 | 128K | Text, Image, Audio |
| E4B | 8B / 4.5B effective | 42 | 128K | Text, Image, Audio |
| 26B A4B | 25.2B / 3.8–4B active (MoE) | — | 256K | Text, Image |
| 31B | 30.7B dense | — | 256K | Text, Image |
2. Text Architecture
Interleaved Local and Global Attention
Instead of standard full attention, Gemma 4 interleaves local (sliding window) with global (full) attention:
- Sliding Window: 512 tokens for E2B/E4B, 1024 for 26B/31B.
- Interleaving Ratio: 5:1 (5 local layers per 1 global layer), except E2B which uses 4:1.
- Final Layer Constraint: The final layer is always global, ensuring full-sequence synthesis before output.
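As a sketch, the interleaving schedule above can be generated like this. The function name and the exact position of the global layer within each block are assumptions; only the local:global ratio and the final-layer constraint come from the card:

```python
def attention_pattern(num_layers: int, local_per_global: int = 5) -> list[str]:
    """Build an interleaved attention schedule: `local_per_global` sliding-window
    layers followed by one global layer, repeating. The final layer is forced
    to be global so the last block always sees the full sequence."""
    pattern = []
    for i in range(num_layers):
        # Every (local_per_global + 1)-th layer is global (placement assumed).
        pattern.append("global" if (i + 1) % (local_per_global + 1) == 0 else "local")
    pattern[-1] = "global"  # final-layer constraint from the model card
    return pattern
```

For E4B (42 layers, 5:1) this yields 7 global layers; for E2B (35 layers, 4:1) it also yields 7, with the last layer global in both cases.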
Shared KV Cache and K=V Trick
Two major optimizations for global attention memory overhead:
- GQA with K=V: Local layers use standard GQA (2 Query heads per 1 KV head). Global layers scale to 8 Query heads per KV head with doubled Key dimensions. For global attention layers, Keys are set equal to Values ($K=V$), collapsing the KV-cache into a K-cache only and halving the memory footprint for those states.
- Shared KV Cache: The last $N$ layers reuse $K$ and $V$ tensors from the previous non-shared layer of the same attention type (sliding or full), cutting both redundant compute and memory.
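A back-of-envelope helper shows how the K=V trick halves the per-layer cache for global layers. All sizes here are illustrative, not official Gemma 4 dimensions:

```python
def kv_cache_bytes(seq_len: int, num_kv_heads: int, head_dim: int,
                   dtype_bytes: int = 2, k_equals_v: bool = False) -> int:
    """Per-layer cache size for K and V tensors of shape
    [seq_len, num_kv_heads, head_dim]. With the K=V trick only one
    tensor is stored, halving the footprint for that layer."""
    tensors = 1 if k_equals_v else 2  # normally both K and V are cached
    return tensors * seq_len * num_kv_heads * head_dim * dtype_bytes
```

Shared layers that reuse K/V from an earlier layer of the same attention type contribute zero additional cache on top of this.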
Proportional RoPE (p-RoPE)
Standard RoPE rotates every coordinate pair of the embedding. At 256K contexts, rotations on the low-frequency pairs accumulate and inject positional noise into dimensions that would otherwise carry semantic information.
Gemma 4 uses p-RoPE on global attention layers only: with $p = 0.25$, only the first 25% of coordinate pairs receive RoPE positional information, while the remaining 75% receive zero rotation. This isolates positional data to high-frequency dimensions, leaving low-frequency dimensions clean for semantic meaning.
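A minimal NumPy sketch of p-RoPE on a single vector, assuming the standard pairwise rotation; the base of 10000 is an assumption, not a value from the card:

```python
import numpy as np

def p_rope(x: np.ndarray, position: int, p: float = 0.25,
           base: float = 10000.0) -> np.ndarray:
    """Apply RoPE to only the first `p` fraction of coordinate pairs.
    Pair i = (x[2i], x[2i+1]) is rotated by angle position / base**(2i/d);
    pairs beyond p * d/2 receive zero rotation and pass through unchanged."""
    d = x.shape[-1]
    rotated_pairs = int(p * (d // 2))
    out = x.astype(float).copy()
    for i in range(rotated_pairs):
        theta = position / (base ** (2 * i / d))
        c, s = np.cos(theta), np.sin(theta)
        x0, x1 = x[2 * i], x[2 * i + 1]
        out[2 * i] = x0 * c - x1 * s
        out[2 * i + 1] = x0 * s + x1 * c
    return out
```

With `p = 0.25`, a dimension-8 vector has only its first pair rotated; the remaining six coordinates stay untouched and are free to carry semantics.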
3. Per-Layer Embeddings (PLE)
The “E” in E2B/E4B stands for “Effective” parameters. Instead of adding depth or width, these models use Per-Layer Embeddings.
- Architecture: Beyond the standard embedding lookup ($V \times d_{model}$), PLE adds a massive lookup table of dimensions $V \times d_{PLE} \times L$ (Vocabulary $\times$ PLE dim $\times$ Number of Layers). For E2B: $262{,}144 \times 256 \times 35$.
- Flow: At ingestion, the model fetches a per-layer $d_{PLE}$ embedding for every layer. During the forward pass at a given layer, a gating function weights that layer's embedding, projects it back to $d_{model}$ (1,536 for E2B), and combines it with the residual stream via a lightweight block after attention and FFN.
- Hardware Insight: The PLE lookup table is queried only once per token at inference start, so it can reside in flash memory rather than VRAM. This allows a 5.1B parameter model to run at the speed and VRAM cost of a 2.3B model — hence the “effective” parameter metric.
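The flow above can be sketched as follows. The sigmoid gate and the projection shapes are assumptions; the card only says "a gating function weights this layer-specific embedding." The vocabulary is shrunk to a toy size here, but `d_model`, `d_ple`, and the layer count match the E2B numbers in the notes:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d_model, d_ple, L = 1000, 1536, 256, 35  # toy vocab; other dims from the notes

ple_table = rng.standard_normal((V, L, d_ple)).astype(np.float32) * 0.02
w_gate = rng.standard_normal((L, d_model, d_ple)).astype(np.float32) * 0.02
w_proj = rng.standard_normal((L, d_ple, d_model)).astype(np.float32) * 0.02

def ple_update(hidden: np.ndarray, token_id: int, layer: int) -> np.ndarray:
    """Combine the per-layer embedding with the residual stream.
    The sigmoid gate over hidden @ w_gate is an assumed form of the
    gating function; only the lookup-gate-project-add flow is from the card."""
    e = ple_table[token_id, layer]                           # [d_ple], one lookup per token
    gate = 1.0 / (1.0 + np.exp(-(hidden @ w_gate[layer])))   # [d_ple]
    return hidden + (gate * e) @ w_proj[layer]               # back to [d_model]
```

At full size the table holds $262{,}144 \times 256 \times 35 \approx 2.35$B entries, which is why keeping it in flash rather than VRAM matters.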
4. Mixture of Experts (MoE)
The 26B A4B model uses sparse activation to run at 4B-model speed:
- Routing: FFN is split into 128 total experts; a router selects 8 experts per token. The token embedding is scaled by the router’s probability for each expert’s contribution.
- Shared Expert: One shared expert is always activated for every token; it is sized at 3x a standard expert to capture broad, general knowledge.
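A minimal top-k router plus always-on shared expert; renormalizing the router probabilities over the selected top-k is an assumed detail:

```python
import numpy as np

def moe_forward(x, expert_fns, shared_fn, router_w, k=8):
    """Sparse MoE step: softmax router, top-k expert selection, each selected
    expert's output scaled by its (renormalized) router probability, plus an
    always-activated shared expert."""
    logits = x @ router_w                        # [num_experts]
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    topk = np.argsort(probs)[-k:]                # indices of the k best experts
    weights = probs[topk] / probs[topk].sum()    # renormalize over selected experts
    out = shared_fn(x)                           # shared expert always contributes
    for w, idx in zip(weights, topk):
        out = out + w * expert_fns[idx](x)
    return out
```

If every expert were the identity function, the output would be exactly `shared(x) + x`, since the top-k weights sum to 1.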
5. Multimodal Encoders
Vision Encoder
Based on a ViT (~150M params for E-models, ~550M for larger models):
- Variable Aspect Ratios: Adaptively resizes input while maintaining aspect ratio, applying padding only where the image doesn’t perfectly divide into 16x16 pixel patches (no warping into squares).
- 2D RoPE: Patch embeddings are split into two halves — one receives RoPE tracking width ($w$), the other tracks height ($h$), baking 2D coordinates into the transformer.
- Soft Token Budget & Spatial Pooling: Users define a “soft token budget” ($B$, range 70–1120). The image is divided into at most $B \times 9$ patches. Every $3 \times 3$ grid of neighboring patches is averaged (pooled) into a single embedding. A linear projection and RMSNorm then map vision space into text embedding space.
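The $3 \times 3$ spatial pooling step can be sketched directly; the helper assumes the patch grid has already been padded so both sides divide by 3:

```python
import numpy as np

def pool_patches(patch_grid: np.ndarray) -> np.ndarray:
    """Average every 3x3 block of neighboring patch embeddings into one.
    patch_grid: [H, W, d] with H and W divisible by 3 (padding assumed done)."""
    H, W, d = patch_grid.shape
    blocks = patch_grid.reshape(H // 3, 3, W // 3, 3, d)
    return blocks.mean(axis=(1, 3))              # [H/3, W/3, d]
```

Since each pooled embedding covers 9 patches, a budget of at most $B \times 9$ patches yields at most $B$ soft tokens after pooling.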
Audio Encoder (E2B and E4B only)
A ~300M parameter encoder processes raw audio (up to 30 seconds) into LLM-compatible tokens:
- Extract Mel-spectrogram features (time vs. frequency).
- Group features into chunks.
- Overlap and downsample chunks via two 2D convolutional layers.
- Process through a Conformer (Transformer Encoder with convolutional module).
- Linear projection to align with the Gemma 4 embedding space.
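The pipeline's tensor shapes can be traced with an illustrative sketch; the sample rate, hop length, mel-bin count, and the stride-2 slicing that stands in for the two real convolutions are all assumptions:

```python
import numpy as np

# Illustrative shape walk through the audio front-end; all concrete sizes
# (sample rate, hop, mel bins, strides) are assumptions, not from the card.
sample_rate, seconds = 16_000, 30
n_mels, hop = 128, 160                      # 10 ms hop -> 100 frames/s
frames = sample_rate * seconds // hop       # 3000 mel frames for 30 s
mel = np.zeros((frames, n_mels))            # [time, frequency] features

# Two stride-2 stages downsample time by 4x overall; slicing stands in
# for the actual 2D convolutions here.
after_conv = mel[::2][::2]                  # (750, 128) -> Conformer input
```

After the Conformer stack, the linear projection maps each remaining frame into the Gemma 4 embedding space.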
6. Deployment Notes
- Sampling: DeepMind recommends `temperature=1.0`, `top_p=0.95`, `top_k=64` across all models.
- Thinking Mode: Activated by placing a `<|think|>` token in the system prompt. The model outputs reasoning in `<|channel>thought\n ... <channel|>` tags. In multi-turn conversations, historical thoughts must be stripped; only final answers remain in the context history.
- Modality Order: Image and audio soft tokens should always be placed before the text prompt. For PLE conditioning on multimodal inputs, audio/image positions use the `pad` token ID, passing neutral per-layer signals.
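A generic implementation of the recommended sampling settings (temperature, then top-k, then nucleus filtering); this is a common decoding recipe, not Gemma-specific code:

```python
import numpy as np

def sample_token(logits, temperature=1.0, top_p=0.95, top_k=64, rng=None):
    """Sample one token id: apply temperature, keep the top_k most likely
    tokens, then keep the smallest prefix whose cumulative probability
    reaches top_p, and sample from the renormalized remainder."""
    rng = rng or np.random.default_rng()
    z = logits / temperature
    probs = np.exp(z - z.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]
    keep = order[:top_k]                                 # top-k cut
    cdf = np.cumsum(probs[keep] / probs[keep].sum())
    nucleus = keep[: np.searchsorted(cdf, top_p) + 1]    # top-p (nucleus) cut
    p = probs[nucleus] / probs[nucleus].sum()
    return int(rng.choice(nucleus, p=p))
```

With a strongly peaked distribution, both cuts collapse to the argmax regardless of the random seed.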