The Economics of a Token: A Roofline Tour of Frontier Inference

Reading notes on Reiner Pope’s (MatX CEO, ex-Google TPU) blackboard lecture with Dwarkesh Patel on frontier AI training and inference economics (video, episode page, transcript, flashcards). The talk is a model of how to reason from first principles: with just two equations — one for compute time, one for memory time — you can reverse-engineer batch sizes, rack layouts, “fast modes,” over-training ratios, and even why APIs charge 50% more above 200k context.

1. The Two Equations That Run Inference

A single decode step on a transformer must move both weights and KV cache through the chip and perform a matmul against the active parameters. The wall-clock time is the max of the two:

\[t_{\text{inference}} \ge \max\Big(t_{\text{compute}},\; t_{\text{mem}}\Big)\]

Compute time — multiply active parameters against the batch. Attention matmuls are small in comparison and we ignore them:

\[t_{\text{compute}} = \frac{B \cdot N_{\text{active}}}{\text{FLOPs}}\]

Memory time — fetch all weights once (active or not, the model still has to live in HBM), plus the KV cache for every sequence in the batch:

\[t_{\text{mem}} = \underbrace{\frac{N_{\text{total}}}{\text{Mem BW}}}_{\text{weight fetch (constant in }B\text{)}} + \underbrace{\frac{B \cdot L_{\text{ctx}} \cdot \text{bytes/token}}{\text{Mem BW}}}_{\text{KV fetch (linear in }B\text{)}}\]

bytes/token is a model architecture parameter (number of layers × 2 × d_head × number of KV heads, divided by any cross-layer sharing). For DeepSeek V3 the numbers are concrete: ~37 B active, ~700 B total.

Everything else in the lecture is a consequence of these two lines.

Latency vs. batch — why there’s a lower floor

Plot batch size B on the x-axis and time on the y-axis. The three components:

t_compute is linear in B (no offset).
weight fetch is constant in B.
KV fetch is linear in B.

So t_mem = (const) + (linear). Taking the max with t_compute, you get a curve that is flat at small batches (you can’t beat reading all the weights once) and rises with B once compute (or KV) catches up. The flat part is the lower bound on decode latency: it’s just N_total / Mem BW — you cannot decode faster than the time it takes to drag the model through HBM.

Why batch size dominates the cost curve

Cost is proportional to GPU rental time. Cost per token is t / B. Divide each term by B:

Compute → constant (FLOPs amortize linearly with batch).
KV fetch → constant (one KV per sequence anyway).
Weight fetch → 1/B hyperbola (the big bill at batch 1).

At B=1 the weight-fetch term is astronomical — the entire model is pulled through HBM to produce one token. As batch grows, that hyperbola collapses and you asymptote to a compute-bound floor.

Fast Mode and Slow Mode

This curve explains the entire latency/price menu API providers offer:

Fast Mode (≈6× price for ≈2.5× speed) lives on the steep part of the hyperbola: smaller batches, lower latency per token, but weight-fetch is only partially amortized so cost per token goes up.
Normal Mode runs at the optimal batch (~300 × sparsity, derived below), sitting on the compute floor where the hyperbola has been crushed flat.
“Slow Mode” at 10× cheaper is impossible. Normal pricing is already on the floor — the weight-fetch hyperbola is flat. The only terms still on the curve are KV-fetch and compute, both of which are unique per sequence. You cannot amortize them across a larger batch no matter how long you wait. Slow Mode just lives on the floor line; it can’t go below it.

The optimal batch size: a 300 × sparsity rule

Equate weight-fetch time with compute time (drop the KV term — adding it just pushes the optimum higher):

\[\frac{N_{\text{total}}}{\text{Mem BW}} = \frac{B \cdot N_{\text{active}}}{\text{FLOPs}}\]

Rearrange so all hardware terms sit on one side:

\[\underbrace{\frac{\text{FLOPs}}{\text{Mem BW}}}_{\text{hardware ratio}} = \frac{B \cdot N_{\text{active}}}{N_{\text{total}}} \;\;\Rightarrow\;\; B \ge \underbrace{\frac{\text{FLOPs}}{\text{Mem BW}}}_{\approx\, 300} \cdot \underbrace{\frac{N_{\text{total}}}{N_{\text{active}}}}_{\text{sparsity}}\]

The dimensionless hardware ratio: FLOPs is multiplies/sec (FP4); Mem BW is bytes/sec. Each FP4 multiply consumes ½ byte, so the ratio is genuinely unitless and lands near 300 on essentially every datacenter GPU, A100 → H100 → B100 → Rubin. FLOPs and bandwidth grew in lockstep across generations.

So the optimal batch is roughly 300 × sparsity. For DeepSeek (32-of-256 experts, sparsity ≈ 8) that’s ~2,400 sequences. In practice people run 2–3× above the balance point because real-world MFU is below the roofline.

The “train schedule” intuition

The natural inference cadence is Capacity / Bandwidth of HBM:

\[t_{\text{HBM drain}} = \frac{\text{HBM capacity}}{\text{Mem BW}}\]

For Rubin: 288 GB / 20 TB·s⁻¹ ≈ 15 ms. For Hopper and Blackwell this number sits in the same 15–20 ms range. Why? If you picked a latency much bigger than this, you’d have time to read all of HBM twice in one decode step — but the weights are read-only and the KVs change at most once per step. There’s no reason to spend bandwidth reading either of them twice.

So:

A “train” of ~2,000–3,000 tokens departs every ~20 ms. A request that arrives just after one train leaves waits up to 20 ms for the next train, plus 20 ms for it to finish. Worst-case queue latency ≈ 40 ms.

Throughput per rack is just B × (1 / 20 ms) ≈ 64 × B ≈ 128k tokens/s. Gemini’s reported hundreds-of-millions tokens/s globally is ~1000× that — i.e., to be at frontier scale you need ~1,000 racks.

The diminishing returns of sparsity

You’d think pushing sparsity is a free lunch: less compute per token, and the formula above just calls for a bigger batch. Quality-wise, it’s not free. The Unified Scaling Laws for Routed Language Models paper (and successors) shows roughly:

Going from a dense 1.3 B model to a sparse 64-expert / 370 M-active model is roughly equivalent in quality — 64× the total parameters bought ~4× active-param efficiency.

So sparsity is real but the returns are sublinear. Plus those extra total parameters consume HBM capacity, which competes with KV cache. The bound on frontier models is roughly:

Active parameters limited by compute (B × N_active / FLOPs cost).
Total parameters limited by scale-up size (everything must fit and be addressable across one all-to-all domain).

2. MoE Layouts and the Tyranny of the Rack

Sparsity slashes compute time but multiplies total parameters — driving memory capacity and the optimal batch up. The standard layout is expert parallelism (EP): different experts on different GPUs, with the router replicated on every GPU.

The communication pattern is all-to-all: any GPU might route a token to any expert, in both up- and down-projection halves (factor of 2). NVLink turns this into a 2-hop scale-up network inside a rack — GPU → NV switch → GPU. Crucially, leaving the rack hits the scale-out network, which is ~8× slower bandwidth. So a single MoE layer is effectively bounded to a single scale-up domain.

That’s the structural reason rack sizes keep growing:

Gen	GPUs per scale-up domain
Hopper	8 (tray)
Blackwell	72 (rack)
Rubin	~500+ (denser rack)

Hopper → Blackwell was mostly a form-factor change (tray → rack) — no fundamental physics breakthrough required. Blackwell → Rubin needs genuinely new cable density: bend radius of the cables, density of backplane connectors, rack weight (rack frames have to be heavy enough not to sag — but heavier metal compounds the weight problem), power delivery, and cooling are all simultaneously at physical limits.

3. Pipeline Parallelism: Helps Weights, Doesn’t Help KV

If a model is too big for one rack, the natural fallback is pipeline parallelism (PP) — early layers on Rack A, later layers on Rack B. The cross-rack link only ships one activation tensor per microbatch, far less than the all-to-all explosion, so it tolerates the 8× scale-out bandwidth penalty.

The microbatch math

Let E be expert-parallelism width (GPUs per rack), P be number of pipeline stages (racks), and B_µ the per-microbatch (local) batch size. To keep all P stages busy without bubbles, you need exactly P microbatches in flight:

\[B_{\text{global}} = P \cdot B_{\mu}, \quad B_{\mu} \approx 300 \cdot \text{sparsity}\]

Memory required per GPU:

\[c_{\text{mem}} = \frac{N_{\text{total}} + B_{\text{global}} \cdot L_{\text{ctx}} \cdot \text{bytes/token}}{E \cdot P}\]

Substitute B_global = P · B_µ:

\[c_{\text{mem}} = \underbrace{\frac{N_{\text{total}}}{E \cdot P}}_{\text{weights: shrinks with }P} + \underbrace{\frac{B_{\mu} \cdot L_{\text{ctx}} \cdot \text{bytes/token}}{E}}_{\text{KV: }P\text{ cancels}}\]

The P in the KV term cancels. Per-GPU, pipelining shrinks the weight memory footprint but not the KV cache footprint.

Intuition for why the cancellation happens: pipelining lets each GPU store fewer layers of KV, but you also have to keep P microbatches in flight to fill the pipeline. The two effects exactly cancel.

Why scale-up size still matters even when pipelining works

Even setting capacity aside, scale-up size accelerates the weight-fetch term itself:

\[t_{\text{mem,weights}} = \frac{N_{\text{total}}}{\text{Mem BW per GPU} \cdot \text{scale-up size}}\]

You can use every GPU in the scale-up domain in parallel to fetch weights for a single forward pass (you can’t do that across pipeline stages, since they run at different times). Per-GPU bandwidth grew only ~1.5–2× per generation, but scale-up size grew 8× from Hopper to Blackwell. That’s the bigger latency unlock.

Practical conclusion

Frontier labs fill a scale-up rack with expert parallelism and use little or no pipeline parallelism for inference. DeepSeek’s published infra report confirms this. Pipelining is mostly worth the complexity for very large models that won’t fit one rack — and even there only P = 2 is common.

Pipelining also has a per-hop latency cost (network card → top-of-rack switch → possibly data-center switch → reverse path on the other rack, all in series for decode), on the order of a few milliseconds per hop. With 4 pipeline stages and ~2 ms per hop, you’ve added ~10 ms to a ~20 ms decode step — a noticeable hit.

And there’s a non-performance cost that’s arguably bigger. Pipelining cuts the model at layer boundaries, which locks in the assumption that layer n only talks to layer n−1. Architectural innovations that break this — Kimi-style attention attending to residuals from several layers back, interleaved attention patterns, layer-skipping — become hard or impossible to express cleanly. This slows research iteration. That’s what Horace He’s critique (and the spirit of Ilya’s “as we now know, pipelining is not wise” line) was about: not just bubbles or throughput, but giving up architectural degrees of freedom.

4. Over-Training: Why Chinchilla Is Wrong for Deployed Models

Chinchilla minimizes training compute. But a deployed model spends most of its lifetime in inference and (now) RL generation. The right objective is total compute. A useful heuristic:

For a sum of cost terms that follow power laws, the minimum tends to land where the terms are equalized.

So set pre-training, RL, and inference compute roughly equal.

The three costs

\[\text{Pre-train} = 6 \cdot N_{\text{active}} \cdot D_{\text{PT}}\] \[\text{RL} \approx (2{-}6) \cdot N_{\text{active}} \cdot D_{\text{RL}} \cdot \tfrac{1}{\text{MFU}_{\text{RL}}}\] \[\text{Inference} = 2 \cdot N_{\text{active}} \cdot D_{\text{Inf}}\]

Where do the 2 and the 6 come from? A forward pass costs 2 FLOPs per parameter per token (one multiply + one add per weight). The backward pass costs 4 FLOPs per parameter per token — half for the gradient with respect to the weights, half for the gradient with respect to the inputs. So forward + backward = 6, which is the standard 6 · N · D formula. Inference is forward-only (2). RL is somewhere in between (you generate every rollout — forward only — but only train on some). RL also runs at lower MFU than pre-training (~30%) because so much of it is decode.

Equating data

The N_active cancels from all three. After absorbing the FLOPs and MFU factors, you get something like:

\[D_{\text{PT}} \;\approx\; \alpha \cdot D_{\text{RL}} \;\approx\; \beta \cdot D_{\text{Inf}}\]

with α ≈ 1/10 (RL is ~10× less wall-clock-efficient than PT per token) and β ≈ 1/10 (similar reasoning vs. inference). Plugging in plausible MFU numbers gives the rule-of-thumb form:

\[D_{\text{PT}} \;\approx\; 1.5 \cdot D_{\text{RL}} \;\approx\; D_{\text{Inf}}\]

Which says: inference token count ≈ pre-training token count, and RL ≈ ⅔ of either.

Why 1.5× for RL but ~1× for inference?

Both RL and inference are more decode-heavy than PT, so both pay an MFU penalty. The asymmetry comes from FLOPs/token: inference is purely forward (k = 2), while RL still does backward passes on some rollouts (k ≈ 3).

The per-token cost ratio vs. PT:

\[\frac{\text{cost}_{X}}{\text{cost}_{PT}} \;=\; \frac{k_X / 6}{\text{MFU}_X / \text{MFU}_{PT}}\]

Inference: k = 2, MFU averaged across prefill (high) + decode (low) → ratio works out to ≈ 1× PT cost per token.
RL: k ≈ 3, MFU dominated by decode rollouts with no prefill to amortize against (~15%) → ratio ≈ 1.5× PT cost per token.

So RL pays for both the MFU hit and extra backward FLOPs; inference only pays the MFU hit but saves on FLOPs. They roughly balance for inference and don’t for RL.

Net token counts:

Phase	Cost / token	Tokens (for equal $)
Pre-training	1×	D_PT
Inference	≈1×	≈ D_PT
RL	≈1.5×	≈ ⅔ · D_PT

RL gets fewer tokens than PT not because it’s less important but because each RL token costs more GPU-seconds. PT and inference end up at roughly the same token count — which is the source of the famous “each deployed model should output roughly as many tokens as it consumed in pre-training” line.

Plugging in numbers

A frontier system generating ~50 M tok/s (per specific model, after dividing the family-wide aggregate by ~10) for a 2-month lifecycle produces:

\[D_{\text{Inf}} \approx 5 \times 10^{7} \cdot (60 \cdot 60 \cdot 24 \cdot 60) \;\approx\; 2 \times 10^{14} \text{ tokens} = 200\text{ T}\]

By the heuristic above, D_PT ≈ D_Inf ≈ 100–200 T. Rumored numbers for current frontier pre-training do hover near 150 T. Chinchilla’s recommendation for a 100 B-active model is D ≈ 20 · N ≈ 2 T. So:

\[\frac{D_{\text{PT}}^{\text{actual}}}{D_{\text{PT}}^{\text{Chinchilla}}} \;\approx\; \frac{150\text{ T}}{2\text{ T}} \;\approx\; \mathbf{100\times}\]

Frontier models are over-trained ~100× beyond Chinchilla — a deliberate choice to shrink active parameters and cut serving cost. The startling implication: roughly speaking, each deployed model should produce, in inference output, the sum of all the data it was pre-trained on.

5. Deducing Hardware from API Pricing

Public price sheets, treated as a constraint, leak the chip.

Decode is 5× the price of prefill

Both prefill and decode use the same weights and the same memory + compute equations. The only thing that changes is how many tokens are produced per pass: prefill processes L_pass tokens at once, decode processes 1.

Cost per token = t_inference / L_pass. Divide each term:

Compute per token: stays the same (compute scales with tokens).
Weight fetch per token: divides by L_pass — large in decode (L_pass = 1), tiny in prefill.

Plot cost-per-token vs. L_pass:

cost/tok │ ___ memory-bound (decode)
         │\
         │ \
         │  \____________ compute-bound (prefill)
         └──────────────► L_pass
            1         large

Decode pays the full weight-fetch cost amortized over a single output token; prefill amortizes it across the whole prompt. A 5× price ratio between output and input tokens is direct evidence that decode is severely memory-bandwidth-bottlenecked — t_mem is roughly 5× t_compute at decode. The two lines cross somewhere in between.

The 200k context cliff

Several APIs charge ~50% more above 200k context. That’s where KV-fetch time = weight-fetch time:

\[\frac{B \cdot L_{\text{ctx}} \cdot \text{bytes/token}}{\text{Mem BW}} = \frac{B \cdot N_{\text{active}}}{\text{FLOPs}}\]

B cancels. Apply the 1/300 hardware ratio and rearrange for bytes/token:

\[\text{bytes/token} \;=\; \frac{1}{300} \cdot \frac{N_{\text{active}}}{L_{\text{ctx}}}\]

For N_active = 100 B, L_ctx = 200k: bytes/token ≈ 1,667 B ≈ 2 KB. Sanity check:

\[\text{bytes/token} = (\text{# attn layers}) \cdot 2 \cdot d_{\text{head}} \cdot (\text{# KV heads}) / (\text{cross-layer sharing})\]

With d_head = 128 and 8 KV heads (and modest layer count, or some cross-layer sharing à la Character AI / Gemma), 2 KB falls right in range. Sparse attention can hit the same number via a 1/sparsity factor.

Why a cliff? Plot cost vs. context length. Compute is flat in L_ctx. Memory time starts at the weight-fetch value and grows linearly as KV-fetch accumulates. Their max has a visible inflection at the crossover.

Why 200k specifically? Solve for the crossover. With ~2 KB/token of KV on a 100 B-active model, the inflection lands ~200k. The cliff isn’t placed by the provider — it’s placed by physics.

Why 50% (rather than 100%)? The lecture doesn’t derive 1.5× from first principles. It’s a two-tier piecewise price chosen so the provider stays profitable across the full context range, with competitive pressure (“price close to cost or someone scoops you”) setting the magnitude.

“5 minutes vs. 1 hour” KV caches — the cache-tier oracle

For cached prompts, providers offer storage tiers (e.g. 5-minute write at one price, 1-hour write at another). To analyze: there are two costs associated with reusing a KV cache —

Rematerialization cost (cache miss): rebuild KV from token IDs by running a forward pass. Cost per token = t_compute(per token) × GPU $/s.
Storage cost (cache hit): occupy bytes in some memory tier for the hold time, plus a one-time cost to retrieve it into HBM. Cost per token per second = (bytes/token / tier capacity) × GPU $/s. Retrieve cost = bytes/token / tier BW × GPU $/s.

The optimal tier balances retrieve cost and hold cost. Set retrieval-cost ≈ hold-cost × capacity-fraction, and you find:

The optimal tier for a hold time T is the one whose drain time capacity / bandwidth ≈ T.

Memory tier drain times:

Tier	Drain time
HBM	~20 ms
DDR	~1–10 s
Flash (NVMe)	~1 min
Spinning disk	~1 hour

The 5-minute tier is roughly the residency time for KV in Flash; the 1-hour tier lines up with disk. Pricing tiers really are the cost of moving KV between memory levels — and it’s mildly shocking that spinning disk still has a role at frontier scale.

A cache hit being ~10× cheaper than a cache miss is just the cost ratio between loading KV from storage vs. recomputing it via a forward pass.

Why context length stalled at 100–200k

There are two costs to longer context: bandwidth (linear in L_ctx in dense attention) and compute (linear too, but with a slope so small you only notice it at millions of tokens). The binding constraint is bandwidth. HBM bandwidth is barely growing — there is no obvious path to 100M-token contexts via brute force. Sparse attention buys you a √L factor at the cost of attending to fewer tokens; useful, but not infinite. That’s the structural reason why model context lengths jumped from 8k to ~200k and then plateaued.

6. Convergent Evolution: Cryptography ↔ Neural Nets

The deep reason both fields produce layered, mixing architectures is the same constraint: every output bit (or feature) must depend on every input bit, through transformations complex enough that the dependency can’t be unwound by inspection. A cipher that doesn’t mix all input bits into every output bit leaks structure. A transformer that doesn’t let every output token attend to every input token leaves predictive signal on the table. Both reach for the same answer — stacking many simple-but-nonlinear mixing layers.

What differs is the direction of the goal: ciphers turn structure into randomness; neural nets extract structure from randomness. And the differentiability constraint flips: ciphers are deliberately non-differentiable (differential cryptanalysis — perturbing the input and measuring how much the output changes — is the canonical attack, so cipher designers go to great lengths to make small input perturbations cause maximal output change: the avalanche property). Neural nets are deliberately differentiable for SGD, using residual connections and LayerNorm to prevent an output blowup so gradients can flow.

Adversarial examples in image classification are the cipher-style avalanche property showing up where you don’t want it — a tiny input perturbation flipping the model’s prediction.

Yet NN architecture has borrowed straight from cryptography to dodge memory limits.

Feistel networks

Given a non-invertible function f, build an invertible two-input function:

\[(x, y) \;\longmapsto\; (x, \; y + f(x))\]

To invert: read x directly off the output, then recover y = z - f(x). The construction is invertible regardless of whether f is. This is one of the foundational tricks in cipher design.

RevNets (2017)

The same trick, but f is a transformer layer. Construct each layer as:

\[x_{\text{new}} = y + f(x), \quad y_{\text{new}} = x\]

Now the entire forward pass is exactly invertible. During backprop, you don’t store the activations — you recompute them on the fly by running the forward steps backwards.

In normal training, the activation memory footprint is linear in number of layers and often the dominant HBM consumer. RevNets convert that capacity problem into a compute problem.

This is the mirror image of KV cache: KV cache spends memory to save compute (don’t recompute attention over the prefix); RevNets spend compute to save memory (don’t store activations). Both are profitable in the current regime because HBM is the binding constraint, not FLOPs.

Takeaways

Two equations (compute time, memory time) plus one constant (FLOPs/BW ≈ 300) are enough to derive optimal batch, optimal sparsity, the 20-ms train schedule, why MoE lives in one rack, and why pipelining is for training only.
The optimal batch is 300 × sparsity. DeepSeek’s sparsity ≈ 8 gives ≈ 2,400 sequences. Per-rack throughput ≈ 128 k tokens/s; competing with Gemini means ~1,000 racks.
Hardware ratios are remarkably stable. A100 → Rubin: FLOPs and BW grow together, the dimensionless ratio barely moves. The roofline is a long-lived design tool, not a per-generation rewrite.
KV cache is the new bottleneck. Pipelining doesn’t shrink it (the P cancels). Long-context surcharges are it. Cache-tier pricing is the cost of moving it through the memory hierarchy. Context length stalled at ~200k because HBM bandwidth has stalled.
Public prices are an oracle. The 200k cliff leaks ~2 KB KV/token. The 5× decode/prefill ratio leaks how memory-bound decode is. The 5-min and 1-hr cache tiers leak which physical memory level holds KV.
Over-training ~100× past Chinchilla is rational because over the lifecycle of a deployed model, inference compute ≈ pre-training compute. Each model produces roughly as many tokens in inference as it consumed in pre-training.
Compute ↔ memory trades are everywhere. KV cache trades memory for compute. RevNets (borrowed from cipher Feistel networks) trade compute for memory. Both are wins right now because HBM is the binding resource.

References:

Podcast (video): Reiner Pope on Dwarkesh — YouTube
Episode page: dwarkesh.com — Reiner Pope
Full transcript: Gist by @dwarkeshsp
Flashcards: flashcards.dwarkesh.com/reiner-pope
MatX: matx.com