How to Think About GPUs for LLM Scaling
Read How to Scale Your Model — Section 12: How to Think About GPUs. The book is a systems-level walkthrough of LLM scaling on TPUs; Section 12 is the bonus chapter that re-derives the same rooflines on NVIDIA GPUs (H100 / B200 / GB200 NVL72) and contrasts them with TPUs at every level — chip, node, scale-out network, and collectives.
1. Chip Level: SMs, Tensor Cores, and Why GPUs Look Like TPUs Now
A modern ML GPU is “a bunch of compute cores that specialize in matmul (SMs) connected to a stick of fast memory (HBM).” Components inside an H100/B200 SM:
- CUDA Cores: SIMT vector ALUs, 32 fp32 per subpartition. Do ReLUs, pointwise ops, reductions. Analogous to TPU VPU lanes, but each thread has its own instruction pointer (warp divergence is real).
- Tensor Core (TC): dedicated matmul unit. On H100, ~1024 bf16 FLOPs/cycle/TC (≈ 8×8×8 matmul). The TC carries the vast majority of FLOPs (990 bf16 TFLOPs/s on H100 vs. 66 TFLOPs/s from CUDA cores).
- Warp Scheduler: dispatches up to 64 resident warps per SM, hides memory latency by switching between them.
Each SM has 4 identical subpartitions, each with its own TC + Warp Scheduler + 16k 32-bit registers. The B200 introduces TMEM (256kB/SM) because the TC has grown so large its accumulators no longer fit in SMEM.
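As a sanity check, the per-TC figure roughly reproduces the chip-level headline. A quick sketch (the ~1.76 GHz boost clock is our assumption, not a number from the chapter):

```python
# Back-of-envelope: per-Tensor-Core throughput -> chip-level bf16 FLOPs/s.
sms = 132                      # H100 SXM
tcs_per_sm = 4                 # one TC per subpartition
flops_per_tc_per_cycle = 1024  # bf16 FLOPs/cycle (an 8x8x8 matmul)
clock_hz = 1.76e9              # ASSUMED boost clock, not from the chapter

peak = sms * tcs_per_sm * flops_per_tc_per_cycle * clock_hz
print(f"{peak / 1e12:.0f} TFLOPs/s")  # ~951, in the ballpark of the 990 headline
```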
GPU ↔ TPU Cheat Sheet
| GPU | TPU | What it is |
|---|---|---|
| Streaming Multiprocessor (SM) | Tensor Core | Core “cell” containing other units |
| Warp Scheduler | VPU | SIMD vector arithmetic unit |
| CUDA Core | VPU ALU | SIMD ALU |
| SMEM (L1) | VMEM | Fast on-chip cache |
| Tensor Core | MXU | Matmul unit |
| HBM | HBM | Main high-BW memory |
The instructive comparison is the count: an H100 has 132 SMs × 4 subpartitions = 528 independent SIMD units (~16k ALUs total). A TPU v5p has 2 Tensor Cores × 4 VPU slots = 8. GPUs are far more modular; TPUs are far more monolithic. Modularity buys flexibility (“just launch dozens of kernels”) at the cost of harder-to-reach peak performance — the L2 cache is shared, memory coalescing is fragile, and the compiler controls less.
Memory Hierarchy
| GPU | SMs | SMEM/SM | L2 | HBM | HBM BW | bf16 TFLOPs/s |
|---|---|---|---|---|---|---|
| V100 | 80 | 96 kB | 6 MB | 32 GB | 0.9 TB/s | — |
| A100 | 108 | 192 kB | 40 MB | 80 GB | 2.0 TB/s | 312 |
| H100 | 132 | 256 kB | 50 MB | 80 GB | 3.4 TB/s | 990 |
| H200 | 132 | 256 kB | 50 MB | 141 GB | 4.8 TB/s | 990 |
| B200 | 148 | 256 kB | 126 MB | 192 GB | 8.0 TB/s | 2,250 |
TPU VMEM is roughly 2× larger than GPU L2 and has ~7× more bandwidth. That's why TPUs are often easier to keep close to roofline for inference: weights and activations can be parked in VMEM, where GPU code has to fight the L2 cache.
2. Networking: Node → Leaf → Spine
A DGX H100 SuperPod has three levels:
| Level | # GPUs | Fan-out | Switch BW (full-duplex) | Collective BW |
|---|---|---|---|---|
| Node | 8 | 8 GPUs/node | 6.4 TB/s (NVLink/NVSwitch) | 450 GB/s per GPU |
| Leaf (SU) | 256 | 32 nodes/SU | 25.6 TB/s | 400 GB/s per node |
| Spine | 1024 | 4 SUs/pod | 51.2 TB/s | 400 GB/s per node |
Within a node, GPUs talk over NVLink with full all-to-all connectivity through NVSwitch. Beyond a node, it’s a fat-tree InfiniBand network (8×400 Gbps IB links per node = 400 GB/s node egress).
Important caveat: NVIDIA advertises 450 GB/s of NVLink bandwidth, but in practice AllReduce throughput tops out around 370 GB/s even on multi-GB messages, and on more typical sizes (e.g. a LLaMA-3 70B MLP shard of ~58 MB) only ~150 GB/s. Derate the advertised bandwidth in your rooflines by ~20%.
SHARP (In-Network Reductions)
Hopper-era NVIDIA switches support SHARP: the switches themselves perform reductions and multicast the results, which theoretically halves AllReduce cost. In practice the speedup is ~30%, not the theoretical 2×, so it roughly compensates for the empirical bandwidth gap rather than meaningfully scaling things further.
3. Collectives: The Formulas Worth Memorizing
For an array of $B$ bytes, with $N$ GPUs per node, $W$ = per-GPU egress bandwidth:
- Intra-node AllGather / ReduceScatter: $T \approx \frac{B(N-1)}{NW} \approx \frac{B}{W}$
- Intra-node AllToAll: $T \approx \frac{B}{NW}$ (2× faster than on TPUs at the node level)
- Cross-node AllGather / RS: $T \approx \frac{B}{W_\text{node egress}}$ — driven by node egress, not GPU egress
- Cross-node AllToAll: $T \approx \frac{B}{M \cdot W_\text{node egress}}$ where $M = N/8$ is the number of nodes. The effective bandwidth drops from $8 \cdot 450$ GB/s within a node to $2 \cdot 400$ GB/s across just 2 nodes: a >4× degradation.
- AllReduce: 2× the AG/RS cost (unless SHARP).
A useful sub-rule: when an array is sharded along an inner axis $Y$, the outer reduction’s cost drops by roughly the number of nodes spanned by $Y$ — which is why DeepSeek-V3’s 2-way DP across nodes lets it dodge the leaf-level bottleneck.
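These formulas transcribe directly into a few lines of Python. A sketch (helper names and the 1 GB example are ours; constants are the H100 SuperPod numbers from above):

```python
# Collective-time estimates from the formulas above (H100 SuperPod numbers).
W_GPU = 450e9    # per-GPU NVLink egress, bytes/s (spec; ~370e9 empirically)
W_NODE = 400e9   # per-node IB egress, bytes/s
N_NODE = 8       # GPUs per node

def allgather_intra(b, n=N_NODE, w=W_GPU):
    return b * (n - 1) / (n * w)          # ~= b/w for large n (same for RS)

def alltoall_intra(b, n=N_NODE, w=W_GPU):
    return b / (n * w)

def allgather_cross(b, w_node=W_NODE):
    return b / w_node                     # node egress is the bottleneck

def alltoall_cross(b, n_nodes, w_node=W_NODE):
    return b / (n_nodes * w_node)

def allreduce_cross(b, sharp=False, w_node=W_NODE):
    t = allgather_cross(b, w_node)
    return t if sharp else 2 * t          # AR = RS + AG without SHARP

# The >4x AllToAll degradation from the last bullet, for a 1 GB array:
B = 1e9
print(alltoall_cross(B, n_nodes=2) / alltoall_intra(B))  # ~4.5
```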
4. LLM Rooflines on GPUs
When does each parallelism strategy stop being compute-bound? Let $C$ = peak FLOPs/s, $W$ = collective bandwidth, $\alpha = C / W$.
For an H100, intra-node $\alpha = 990\text{e}12 / 450\text{e}9 \approx 2200$. Beyond a node, $\alpha = 990\text{e}12 / 400\text{e}9 \approx 2475$.
Data Parallelism / FSDP
Compute-bound requires per-GPU token batch size:
\[\frac{B}{X} > \frac{C}{W_\text{collective}} \approx 2500 \text{ tokens / GPU}\]

For comparison, TPUs need ~850. This is why LLaMA-3 (16k H100s) needs a ~16M-token global batch, and why DeepSeek-V3 (2048 H800s with only 300 GB/s NVLink) needs ~3300 tokens/GPU, i.e. ~6.7M total; their actual ~62.9M-token batch clears this comfortably.
MoE penalty: the bound inflates by $E/k$. For a model with $E=128$ experts and $k=4$ active, the per-GPU batch jumps to ~80k tokens — borderline absurd. This is the structural reason MoE training relies so heavily on pipeline + expert parallelism instead of pure FSDP.
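A minimal calculator for this roofline, with the MoE inflation factor bolted on (the function name is ours; constants are the cross-node H100 figures from above):

```python
# Critical per-GPU token batch for DP/FSDP to stay compute-bound.
def dp_critical_batch(c_flops, w_coll, n_experts=1, k_active=1):
    # Dense bound C/W, inflated by E/k for MoE (fewer FLOPs per byte moved).
    return (c_flops / w_coll) * (n_experts / k_active)

print(dp_critical_batch(990e12, 400e9))                             # ~2475 dense
print(dp_critical_batch(990e12, 400e9, n_experts=128, k_active=4))  # ~79,200 MoE
```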
Tensor Parallelism
Compute-bound when $Y < F \cdot W / C$. For LLaMA-3 ($F = 28{,}672$), this allows ~11-way TP intra-node — rounded down to 8-way (a single NVLink domain). Spanning 2 nodes barely buys you anything (~16-way max). Beyond that you’re comms-bound.
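A sketch of the same bound in code (the function name is ours; we read the "~11-way" figure as coming from the empirical ~370 GB/s NVLink number rather than the 450 GB/s spec, which is our assumption):

```python
# Max compute-bound TP degree: Y < F * W / C.
def max_tp_degree(f_dim, w, c=990e12):
    return f_dim * w / c

F = 28_672  # LLaMA-3 70B MLP dim
print(max_tp_degree(F, 450e9))  # ~13.0 at NVLink spec bandwidth
print(max_tp_degree(F, 370e9))  # ~10.7 at empirical bandwidth -> "~11-way"
```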
Expert Parallelism
For an MoE with $F < 8C/W_\text{node}$ (~19.8k on H100): EP is limited to 1–2 nodes (DeepSeek-V3 territory, with small $F$). For $F > 8C/W_\text{node}$: full multi-node EP up to $E$ nodes is feasible.
Pipeline Parallelism
Comms cost is basically free — the per-layer cost scales as $\frac{2BD}{W \cdot N_\text{layers}}$. So why isn’t everyone PP-maxxed?
- Code complexity (zero-bubble schedules don’t fit GSPMD well).
- PP fights ZeRO-3. Each microbatch can’t amortize a full weight AllGather.
- Bubbles + step imbalance still leave waste even with careful scheduling.
But PP buys a hidden win: each PP stage spans more nodes, which scales up the DP AllReduce bandwidth. That’s why LLaMA-3 uses 16-way PP — it cuts the FSDP critical batch size by 16×.
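To see just how lopsided the compute-to-comms ratio is, a rough estimate with LLaMA-3-70B-shaped numbers (the microbatch size is an assumed illustration, not a figure from the chapter):

```python
# Per-layer PP comms cost vs. per-layer compute, using the 2BD/(W*n_layers) rule.
B, D = 4096, 8192         # tokens per microbatch (ASSUMED), hidden dim
n_layers, params = 80, 70e9
W, C = 400e9, 990e12      # node egress bytes/s, bf16 FLOPs/s

t_comms = 2 * B * D / (W * n_layers)         # 2BD bytes of bf16 activations
t_compute = 6 * B * (params / n_layers) / C  # ~6 * tokens * params FLOPs/layer
print(t_compute / t_comms)                   # ~1e4: PP comms is noise
```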
5. Two Worked Examples
LLaMA-3 (16k H100, 16M tokens):
- 8-way TP within a node
- 16-way PP
- 128-way ZeRO-1 DP
Dense model, small per-GPU batch (~1k tokens). The 16-way PP is what makes it work — without it, the FSDP roofline would demand a much larger global batch.
DeepSeek V3 (2048 H800, 62.9M tokens):
- 64-way EP across 8 nodes
- 16-way PP
- 2-way ZeRO-1 DP
Sparse MoE ($k=8$, $E=256$). With 1024-way model parallelism (EP × PP), the residual DP AllReduce happens at the spine level over only 2 nodes, which picks up the $(N-1)/N = 1/2$ cost factor, i.e. a 2× effective-bandwidth bonus.
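Both layouts are easy to sanity-check mechanically (a sketch; the dict layout is ours):

```python
# Degrees must multiply out to the GPU count; per-GPU batch falls out directly.
from math import prod

layouts = {
    "LLaMA-3":     dict(tp=8,  pp=16, dp=128, gpus=16_384, batch=16e6),
    "DeepSeek-V3": dict(ep=64, pp=16, dp=2,   gpus=2_048,  batch=62.9e6),
}
for name, l in layouts.items():
    degrees = [v for k, v in l.items() if k not in ("gpus", "batch")]
    assert prod(degrees) == l["gpus"]        # every GPU covered exactly once
    print(name, f"{l['batch'] / l['gpus']:.0f} tokens/GPU")
# LLaMA-3 -> ~977 (the "~1k" above); DeepSeek-V3 -> ~30.7k
```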
6. GB200 NVL72: What Changes
Blackwell with NVLink 5 doubles intra-node bandwidth (450 → 900 GB/s) but on a B200 SuperPod the node egress stays at 400 GB/s while FLOPs/chip ~2.25×. Net effect: cross-node rooflines get harder — the DP critical batch size grows from 2475 to ~5625 tokens/GPU.
GB200 NVL72 fixes this with a 72-GPU NVLink domain and 3.6 TB/s of rack egress (9× an H100 node's 400 GB/s, or ~4× after netting out the 2.25× FLOPs increase). That widens the cross-node compute-bound region by ~4× and makes EP across nodes substantially cheaper. Within a node, however, doubled FLOPs ≈ doubled BW, so intra-node rooflines look largely unchanged.
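The generational story compresses into one ratio, per-GPU FLOPs over NVLink-domain egress (a sketch using the figures above; the domain's GPU count cancels out of the per-GPU bound):

```python
# Per-GPU DP critical batch across generations: C_per_gpu / W_domain_egress.
domains = {
    "H100 SuperPod node": (990e12, 400e9),
    "B200 SuperPod node": (2250e12, 400e9),   # 2.25x FLOPs, same egress
    "GB200 NVL72 rack":   (2250e12, 3.6e12),  # 72-GPU domain, 3.6 TB/s egress
}
for name, (c, w) in domains.items():
    print(f"{name}: {c / w:.0f} tokens/GPU")
# -> 2475, 5625, 625: NVL72 widens the compute-bound region ~4x vs. H100
```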
A bonus from Grace Hopper / Grace Blackwell: the CPU↔GPU NVLink-C2C is at full GPU-to-GPU bandwidth, so host memory becomes a viable offload target without a bandwidth cliff.
TL;DR
- GPU rooflines are stricter than TPU rooflines for cross-chip work. Per-GPU FSDP batch ≈ 2500 tokens (vs. ~850 on TPU); only ~8-way TP fits in an NVLink domain.
- Empirical NVLink BW is ~80% of spec — re-derive your batch sizes accordingly.
- SHARP doesn’t save you. ~30% in practice, not the 2× theoretical.
- MoE forces you off pure DP. $E/k$ penalty makes pipeline + expert parallelism mandatory.
- PP is ~free in comms but expensive in code. The real reason to use it: it shrinks the DP critical batch size by $N_\text{stages}$.
- GB200 NVL72 is the structural fix for cross-node bottlenecks; B200 SuperPod alone makes them slightly worse.
The chapter does for GPUs what the rest of the book did for TPUs: gives you a small set of inequalities you can plug into during model design, before any code is written. Worth a careful read alongside Section 5 (Training) and Section 7 (Inference) for the TPU side of the same picture.