Gemma 4: Architecture and Multimodal Innovations
Reading notes on Gemma 4 Model Card.
Gemma 4 is a broad effort to optimize parameter efficiency, memory bandwidth, and long-context multimodal processing across edge and server deployments.
1. Model Lineup
Gemma 4 introduces four variants, categorized by structural paradigm:
| Model | Total / Active Params | Layers | Context | Modalities |
|---|---|---|---|---|
| E2B | 5.1B / 2.3B effective | 35 | 128K | Text, Image, Audio |
| E4B | 8B / 4.5B effective | 42 | 128K | Text, Image, Audio |
| 26B A4B | 25.2B / 3.8–4B active (MoE) | — | 256K | Text, Image |
| 31B | 30.7B dense | — | 256K | Text, Image |
2. Text Architecture
Interleaved Local and Global Attention
Instead of standard full attention, Gemma 4 interleaves local (sliding window) with global (full) attention:
- Sliding Window: 512 tokens for E2B/E4B, 1024 for 26B/31B.
- Interleaving Ratio: 5:1 (5 local layers per 1 global layer), except E2B which uses 4:1.
- Final Layer Constraint: The final layer is always global, ensuring full-sequence synthesis before output.
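As a sketch, the interleaving schedule above can be generated like this. The function name and the exact position of the global layer within each block are assumptions; only the local:global ratio and the final-layer constraint come from the card:

```python
def attention_pattern(num_layers: int, local_per_global: int = 5) -> list[str]:
    """Build an interleaved attention schedule: `local_per_global` sliding-window
    layers followed by one global layer, repeating. The final layer is forced
    to be global so the last block always sees the full sequence."""
    pattern = []
    for i in range(num_layers):
        # Every (local_per_global + 1)-th layer is global (placement assumed).
        pattern.append("global" if (i + 1) % (local_per_global + 1) == 0 else "local")
    pattern[-1] = "global"  # final-layer constraint from the model card
    return pattern
```

For E4B (42 layers, 5:1) this yields 7 global layers; for E2B (35 layers, 4:1) it also yields 7, with the last layer global in both cases.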
Shared KV Cache and K=V Trick
Two major optimizations for global attention memory overhead:
- GQA with K=V: Local layers use standard GQA (2 Query heads per 1 KV head). Global layers scale to 8 Query heads per KV head with doubled Key dimensions. For global attention layers, Keys are set equal to Values ($K=V$), collapsing the KV-cache into a K-cache only and halving the memory footprint for those states.
- Shared KV Cache: The last $N$ layers reuse $K$ and $V$ tensors from the previous non-shared layer of the same attention type (sliding or full), cutting both redundant compute and memory.
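A back-of-envelope helper shows how the K=V trick halves the per-layer cache for global layers. All sizes here are illustrative, not official Gemma 4 dimensions:

```python
def kv_cache_bytes(seq_len: int, num_kv_heads: int, head_dim: int,
                   dtype_bytes: int = 2, k_equals_v: bool = False) -> int:
    """Per-layer cache size for K and V tensors of shape
    [seq_len, num_kv_heads, head_dim]. With the K=V trick only one
    tensor is stored, halving the footprint for that layer."""
    tensors = 1 if k_equals_v else 2  # normally both K and V are cached
    return tensors * seq_len * num_kv_heads * head_dim * dtype_bytes
```

Shared layers that reuse K/V from an earlier layer of the same attention type contribute zero additional cache on top of this.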
Proportional RoPE (p-RoPE)
Standard RoPE rotates every coordinate pair of the embedding. At 256K contexts, rotations on the low-frequency pairs accumulate and inject positional noise into dimensions that would otherwise carry semantic information.
Gemma 4 uses p-RoPE on global attention layers only: with $p = 0.25$, only the first 25% of coordinate pairs receive RoPE positional information, while the remaining 75% receive zero rotation. This isolates positional data to high-frequency dimensions, leaving low-frequency dimensions clean for semantic meaning.
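A minimal NumPy sketch of p-RoPE on a single vector, assuming the standard pairwise rotation; the base of 10000 is an assumption, not a value from the card:

```python
import numpy as np

def p_rope(x: np.ndarray, position: int, p: float = 0.25,
           base: float = 10000.0) -> np.ndarray:
    """Apply RoPE to only the first `p` fraction of coordinate pairs.
    Pair i = (x[2i], x[2i+1]) is rotated by angle position / base**(2i/d);
    pairs beyond p * d/2 receive zero rotation and pass through unchanged."""
    d = x.shape[-1]
    rotated_pairs = int(p * (d // 2))
    out = x.astype(float).copy()
    for i in range(rotated_pairs):
        theta = position / (base ** (2 * i / d))
        c, s = np.cos(theta), np.sin(theta)
        x0, x1 = x[2 * i], x[2 * i + 1]
        out[2 * i] = x0 * c - x1 * s
        out[2 * i + 1] = x0 * s + x1 * c
    return out
```

With `p = 0.25`, a dimension-8 vector has only its first pair rotated; the remaining six coordinates stay untouched and are free to carry semantics.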
3. Per-Layer Embeddings (PLE)
The “E” in E2B/E4B stands for “Effective” parameters. Instead of adding depth or width, these models use Per-Layer Embeddings.
- Architecture: Beyond the standard embedding lookup ($V \times d_{model}$), PLE adds a massive lookup table of dimensions $V \times d_{PLE} \times L$ (Vocabulary $\times$ PLE dim $\times$ Number of Layers). For E2B: $262{,}144 \times 256 \times 35$.
- Flow: At ingestion, the model fetches a per-layer $d_{PLE}$ embedding for every layer. During the forward pass at a given layer, a gating function weights that layer's embedding, projects it back to $d_{model}$ (1,536 for E2B), and combines it with the residual stream via a lightweight block after attention and FFN.
- Hardware Insight: The PLE lookup table is queried only once per token at inference start, so it can reside in flash memory rather than VRAM. This allows a 5.1B parameter model to run at the speed and VRAM cost of a 2.3B model — hence the “effective” parameter metric.
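The flow above can be sketched as follows. The sigmoid gate and the projection shapes are assumptions; the card only says "a gating function weights this layer-specific embedding." The vocabulary is shrunk to a toy size here, but `d_model`, `d_ple`, and the layer count match the E2B numbers in the notes:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d_model, d_ple, L = 1000, 1536, 256, 35  # toy vocab; other dims from the notes

ple_table = rng.standard_normal((V, L, d_ple)).astype(np.float32) * 0.02
w_gate = rng.standard_normal((L, d_model, d_ple)).astype(np.float32) * 0.02
w_proj = rng.standard_normal((L, d_ple, d_model)).astype(np.float32) * 0.02

def ple_update(hidden: np.ndarray, token_id: int, layer: int) -> np.ndarray:
    """Combine the per-layer embedding with the residual stream.
    The sigmoid gate over hidden @ w_gate is an assumed form of the
    gating function; only the lookup-gate-project-add flow is from the card."""
    e = ple_table[token_id, layer]                           # [d_ple], one lookup per token
    gate = 1.0 / (1.0 + np.exp(-(hidden @ w_gate[layer])))   # [d_ple]
    return hidden + (gate * e) @ w_proj[layer]               # back to [d_model]
```

At full size the table holds $262{,}144 \times 256 \times 35 \approx 2.35$B entries, which is why keeping it in flash rather than VRAM matters.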
4. Mixture of Experts (MoE)
The 26B A4B model uses sparse activation to run at 4B-model speed:
- Routing: FFN is split into 128 total experts; a router selects 8 experts per token. The token embedding is scaled by the router’s probability for each expert’s contribution.
- Shared Expert: One shared expert is always activated for every token; it is sized at 3x a standard expert to capture broad, general knowledge.
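A minimal top-k router plus always-on shared expert; renormalizing the router probabilities over the selected top-k is an assumed detail:

```python
import numpy as np

def moe_forward(x, expert_fns, shared_fn, router_w, k=8):
    """Sparse MoE step: softmax router, top-k expert selection, each selected
    expert's output scaled by its (renormalized) router probability, plus an
    always-activated shared expert."""
    logits = x @ router_w                        # [num_experts]
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    topk = np.argsort(probs)[-k:]                # indices of the k best experts
    weights = probs[topk] / probs[topk].sum()    # renormalize over selected experts
    out = shared_fn(x)                           # shared expert always contributes
    for w, idx in zip(weights, topk):
        out = out + w * expert_fns[idx](x)
    return out
```

If every expert were the identity function, the output would be exactly `shared(x) + x`, since the top-k weights sum to 1.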
5. Multimodal Encoders
Vision Encoder
Based on a ViT (~150M params for E-models, ~550M for larger models):
- Variable Aspect Ratios: Adaptively resizes input while maintaining aspect ratio, applying padding only where the image doesn’t perfectly divide into 16x16 pixel patches (no warping into squares).
- 2D RoPE: Patch embeddings are split into two halves — one receives RoPE tracking width ($w$), the other tracks height ($h$), baking 2D coordinates into the transformer.
- Soft Token Budget & Spatial Pooling: Users define a “soft token budget” ($B$, range 70–1120). The image is divided into at most $B \times 9$ patches. Every $3 \times 3$ grid of neighboring patches is averaged (pooled) into a single embedding. A linear projection and RMSNorm then map vision space into text embedding space.
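The $3 \times 3$ spatial pooling step can be sketched directly; the helper assumes the patch grid has already been padded so both sides divide by 3:

```python
import numpy as np

def pool_patches(patch_grid: np.ndarray) -> np.ndarray:
    """Average every 3x3 block of neighboring patch embeddings into one.
    patch_grid: [H, W, d] with H and W divisible by 3 (padding assumed done)."""
    H, W, d = patch_grid.shape
    blocks = patch_grid.reshape(H // 3, 3, W // 3, 3, d)
    return blocks.mean(axis=(1, 3))              # [H/3, W/3, d]
```

Since each pooled embedding covers 9 patches, a budget of at most $B \times 9$ patches yields at most $B$ soft tokens after pooling.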
Audio Encoder (E2B and E4B only)
A ~300M parameter encoder processes raw audio (up to 30 seconds) into LLM-compatible tokens:
- Extract Mel-spectrogram features (time vs. frequency).
- Group features into chunks.
- Overlap and downsample chunks via two 2D convolutional layers.
- Process through a Conformer (Transformer Encoder with convolutional module).
- Linear projection to align with the Gemma 4 embedding space.
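The pipeline's tensor shapes can be traced with an illustrative sketch; the sample rate, hop length, mel-bin count, and the stride-2 slicing that stands in for the two real convolutions are all assumptions:

```python
import numpy as np

# Illustrative shape walk through the audio front-end; all concrete sizes
# (sample rate, hop, mel bins, strides) are assumptions, not from the card.
sample_rate, seconds = 16_000, 30
n_mels, hop = 128, 160                      # 10 ms hop -> 100 frames/s
frames = sample_rate * seconds // hop       # 3000 mel frames for 30 s
mel = np.zeros((frames, n_mels))            # [time, frequency] features

# Two stride-2 stages downsample time by 4x overall; slicing stands in
# for the actual 2D convolutions here.
after_conv = mel[::2][::2]                  # (750, 128) -> Conformer input
```

After the Conformer stack, the linear projection maps each remaining frame into the Gemma 4 embedding space.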
6. Deployment Notes
- Sampling: DeepMind recommends `temperature=1.0`, `top_p=0.95`, `top_k=64` across all models.
- Thinking Mode: Activated by placing a `<|think|>` token in the system prompt. The model outputs reasoning in `<|channel>thought\n ... <channel|>` tags. In multi-turn conversations, historical thoughts must be stripped; only final answers remain in the context history.
- Modality Order: Image and audio soft tokens should always be placed before the text prompt. For PLE conditioning on multimodal inputs, audio/image positions use the `pad` token ID, passing neutral per-layer signals.
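A generic implementation of the recommended sampling settings (temperature, then top-k, then nucleus filtering); this is a common decoding recipe, not Gemma-specific code:

```python
import numpy as np

def sample_token(logits, temperature=1.0, top_p=0.95, top_k=64, rng=None):
    """Sample one token id: apply temperature, keep the top_k most likely
    tokens, then keep the smallest prefix whose cumulative probability
    reaches top_p, and sample from the renormalized remainder."""
    rng = rng or np.random.default_rng()
    z = logits / temperature
    probs = np.exp(z - z.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]
    keep = order[:top_k]                                 # top-k cut
    cdf = np.cumsum(probs[keep] / probs[keep].sum())
    nucleus = keep[: np.searchsorted(cdf, top_p) + 1]    # top-p (nucleus) cut
    p = probs[nucleus] / probs[nucleus].sum()
    return int(rng.choice(nucleus, p=p))
```

With a strongly peaked distribution, both cuts collapse to the argmax regardless of the random seed.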