<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://jianyuh.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://jianyuh.github.io/" rel="alternate" type="text/html" /><updated>2026-04-12T21:05:14+00:00</updated><id>https://jianyuh.github.io/feed.xml</id><title type="html">Jianyu Huang’s Blog</title><subtitle>Record the technical thoughts.
</subtitle><author><name>Jianyu Huang</name><email>jianyu0huang@gmail.com</email></author><entry><title type="html">NVIDIA Blackwell SM100: TMEM, TMA, and the New Tensor Core Roofline</title><link href="https://jianyuh.github.io/cuda/2026/04/12/blackwell-sm100.html" rel="alternate" type="text/html" title="NVIDIA Blackwell SM100: TMEM, TMA, and the New Tensor Core Roofline" /><published>2026-04-12T00:00:00+00:00</published><updated>2026-04-12T00:00:00+00:00</updated><id>https://jianyuh.github.io/cuda/2026/04/12/blackwell-sm100</id><content type="html" xml:base="https://jianyuh.github.io/cuda/2026/04/12/blackwell-sm100.html"><![CDATA[<p>Reading notes based primarily on:</p>
<ul>
  <li><a href="https://newsletter.semianalysis.com/p/dissecting-nvidia-blackwell-tensor">Dissecting NVIDIA Blackwell Tensor Cores</a></li>
</ul>

<p>Blackwell is not just Hopper with bigger Tensor Cores. The software contract changed. On SM100, kernel performance now depends on explicitly managing <strong>Tensor Memory (TMEM)</strong>, understanding CTA-scoped tensor instructions, and recognizing that the limiting resource is often no longer raw tensor math, but the bandwidth needed to feed it.</p>

<p>This also helps explain why several recent Blackwell-era kernels, including <a href="/flashattention/2026/03/06/FA4.html">FlashAttention-4</a> and <a href="/mxfp8/2025/12/07/MXFP8-Train.html">MXFP8 Training</a>, had to substantially rethink their pipelines instead of simply porting Hopper code.</p>

<hr />

<h2 id="1-blackwell-changes-who-owns-mma-execution">1. Blackwell changes who “owns” MMA execution</h2>

<p>Three architectural shifts matter most.</p>

<p>First, Blackwell introduces <strong>Tensor Memory (TMEM)</strong>, a new software-managed level in the memory hierarchy used to hold MMA accumulators. On Hopper, MMA results were tightly coupled to the issuing warpgroup and register file. On Blackwell, accumulators live in TMEM, which decouples result ownership from any particular thread. That sounds subtle, but it fundamentally changes kernel structure: epilogues, accumulator lifetimes, and overlap strategies all need to be redesigned around TMEM.</p>

<p>Second, <code class="language-plaintext highlighter-rouge">tcgen05</code> instructions are <strong>CTA-scoped</strong>. A single thread issues the instruction on behalf of the entire CTA. This is a major departure from Hopper’s warpgroup-scoped <code class="language-plaintext highlighter-rouge">wgmma</code> model. The practical consequence is that threads are no longer symmetric participants in tensor-core issue. Some threads orchestrate, others move data, and the whole CTA acts as the execution unit.</p>

<p>Third, Blackwell adds <strong>TPC-scoped TMA and MMA instructions</strong> through <code class="language-plaintext highlighter-rouge">cta_group::2</code> in PTX, or <code class="language-plaintext highlighter-rouge">2CTA</code> in SASS. Two CTAs, spanning two SMs, can collaboratively execute the same <code class="language-plaintext highlighter-rouge">tcgen05.mma</code>. Combined with native support for sub-byte and microscaled datatypes, this gives Blackwell a much more flexible tensor-core pipeline, but only if the kernel is explicitly written to use it.</p>

<p>The broader pattern is clear: Blackwell rewards kernels that think in terms of <strong>asynchronous clusters and shared on-chip resources</strong>, not warp-synchronous loops.</p>

<h2 id="2-physical-topology-now-leaks-into-software-decisions">2. Physical topology now leaks into software decisions</h2>

<p>The execution hierarchy above the SM matters more than it used to.</p>

<p>CTAs grouped into clusters are guaranteed to co-schedule on the same <strong>Graphics Processing Cluster (GPC)</strong>. That is essential for 2CTA execution and for efficient use of distributed shared memory. But there is a catch: if a persistent kernel launches one CTA per SM and the chosen cluster size does not evenly divide the number of SMs in each GPC, the leftover CTAs can serialize. The result is that “one CTA per SM” is no longer automatically the right launch policy.</p>
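
<p>As a back-of-the-envelope illustration (with a hypothetical SM-per-GPC count, since the real figure depends on the SKU and floorsweeping), the leftover-SM effect can be checked in a few lines of Python:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Toy check of cluster size vs. GPC packing for a persistent "one CTA per SM" launch.
# The SM-per-GPC count below is illustrative, not a measured Blackwell value.
def leftover_sms_per_gpc(sms_per_gpc, cluster_size):
    """SMs per GPC that cannot host a complete cluster and may idle or serialize."""
    return sms_per_gpc % cluster_size

for cluster in (2, 4, 8):
    waste = leftover_sms_per_gpc(sms_per_gpc=18, cluster_size=cluster)
    print(f"cluster_size={cluster}: {waste} SM(s) per GPC left without a full cluster")
</code></pre></div></div>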

<p>On B200, the topology is even more visible because the package spans two dies. Pointer-chasing measurements that intentionally fill L2 expose an average <strong>die-to-die latency penalty of roughly 300 cycles</strong>. For kernels relying on cluster-local reuse, this means that topology-blind scheduling can turn a theoretically local path into a very real latency tax.</p>

<p>In other words, Blackwell kernel tuning now includes <strong>placement</strong> as well as instruction selection.</p>

<h2 id="3-ldgsts-and-tma-serve-different-regimes">3. LDGSTS and TMA serve different regimes</h2>

<p>Blackwell offers two strong but very different ways to move data into shared memory.</p>

<h3 id="ldgsts-fast-to-ramp-fragile-at-scale">LDGSTS: fast to ramp, fragile at scale</h3>

<p><code class="language-plaintext highlighter-rouge">cp.async</code> / <strong>LDGSTS</strong> writes directly into shared memory without staging through registers, which reduces register pressure and makes it attractive for irregular data movement. Its DRAM throughput saturates at roughly <strong>6.6 TB/s</strong> with about <strong>32 KiB in-flight per SM</strong>.</p>

<p>The problem is latency scaling. The baseline latency is about <strong>600 ns</strong>, but once the in-flight footprint grows, the MIO subsystem becomes the bottleneck:</p>

<ul>
  <li>At <strong>8 KiB in-flight</strong>, latency rises to about <strong>1229 cycles</strong></li>
  <li>At <strong>12 KiB in-flight</strong>, latency can spike to about <strong>2177 cycles</strong></li>
</ul>

<p>So LDGSTS is excellent when the kernel needs responsiveness and flexibility, but it becomes increasingly fragile once too many warps are competing for the copy path.</p>

<h3 id="tma-higher-ceiling-slower-to-fill">TMA: higher ceiling, slower to fill</h3>

<p><strong>Tensor Memory Accelerator (TMA)</strong> is issued by a single thread, handles address generation in hardware, and can perform the swizzling needed by tensor-core layouts asynchronously. Its peak throughput is higher, approaching <strong>7.2 TB/s</strong>, but it needs much more data in flight, typically <strong>more than 64 KiB</strong>, before it reaches that ceiling.</p>

<p>That makes TMA the better fit for large, regular, deeply pipelined tiles, while LDGSTS remains attractive for sparse or irregular patterns.</p>

<p>This tradeoff shows up in real kernels. A reasonable rule of thumb is:</p>

<ul>
  <li>Use <strong>TMA</strong> for large, predictable tiles with enough buffering to hide setup cost</li>
  <li>Use <strong>LDGSTS</strong> for irregular or dynamic page fetches where responsiveness matters more than peak bandwidth</li>
</ul>

<p>Even within LDGSTS-heavy kernels, adding more stages and more copy-participating threads continues to help until register allocation becomes the limiting factor.</p>

<h2 id="4-multicast-and-dsmem-are-powerful-but-not-free">4. Multicast and DSMEM are powerful, but not free</h2>

<p>Blackwell’s cluster features are only as good as the access pattern driving them.</p>

<h3 id="tma-multicast-and-the-l2-request-coalescer">TMA multicast and the L2 Request Coalescer</h3>

<p>With <strong>TMA multicast</strong>, a single load can populate shared memory on multiple SMs in a cluster. This is serviced through the <strong>L2 Request Coalescer (LRC)</strong>. There is also an “implicit” form of multicast where multiple CTAs simply request the same data and rely on the hardware to merge requests.</p>

<p>Implicit multicast can reach roughly the same effective shared-memory fill throughput as explicit multicast, but the LRC stops saving much L2 traffic once the implicit requests exceed about <strong>64 bytes in-flight</strong>. So if the objective is not just SMEM fill rate but also lower L2 pressure, explicit multicast remains the cleaner tool.</p>

<h3 id="remote-shared-memory-is-not-local-shared-memory">Remote shared memory is not local shared memory</h3>

<p>The gap between local and remote shared-memory access is severe:</p>

<ul>
  <li>Local <code class="language-plaintext highlighter-rouge">ld.shared</code>: about <strong>128 B/clk</strong></li>
  <li>Naive remote <code class="language-plaintext highlighter-rouge">ld.shared::cluster</code>: about <strong>21 B/clk</strong></li>
</ul>

<p>The reason is painful but simple: the compiler often lowers remote loads to generic <code class="language-plaintext highlighter-rouge">LD</code> instructions rather than optimized <code class="language-plaintext highlighter-rouge">LDS</code> instructions. For high-throughput inter-CTA exchange, developers should rely on <strong><code class="language-plaintext highlighter-rouge">cp.async.bulk</code></strong> (<code class="language-plaintext highlighter-rouge">UBLKCP</code> in SASS), which pushes distributed shared-memory throughput up to about <strong>32 B/clk</strong>.</p>

<p>The lesson is that Blackwell’s cluster features are not self-optimizing. The fast path usually has to be spelled out explicitly.</p>

<h2 id="5-the-real-roofline-is-often-shared-memory-bandwidth">5. The real roofline is often shared-memory bandwidth</h2>

<p>One of the most important Blackwell insights is that many MMA instructions are no longer math-bound.</p>

<p>For 1SM MMA, under-sized shapes are heavily penalized:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">M=64</code> uses only about half the datapath</li>
  <li><code class="language-plaintext highlighter-rouge">M=128</code> reaches near-full utilization</li>
</ul>

<p>For 2SM MMA, <code class="language-plaintext highlighter-rouge">M=256</code> is the sweet spot because it maps to <strong>128 rows per SM</strong>, which keeps both SMs well utilized.</p>

<p>The deeper issue is operand movement. Blackwell supports:</p>

<ul>
  <li><strong>SS mode</strong>: both A and B come from shared memory</li>
  <li><strong>TS mode</strong>: A comes from TMEM, B comes from shared memory</li>
</ul>

<p>In <strong>SS mode</strong>, the instruction is entirely bound by shared-memory bandwidth for <code class="language-plaintext highlighter-rouge">N &lt; 128</code>.</p>

<p>Consider an FP16 1SM MMA with shape <code class="language-plaintext highlighter-rouge">M=128, N=64, K=16</code>:</p>

\[\text{A bytes} = 2 \times M \times K = 4096\]

\[\text{B bytes} = 2 \times N \times K = 2048\]

\[\text{Total FLOPs} = 2 \times M \times N \times K = 262{,}144\]

<p>Assuming Blackwell shared memory sustains <strong>128 B/clk</strong>, the shared-memory service time is:</p>

\[\text{SMEM cycles} = \frac{4096 + 2048}{128} = 48\]

<p>If the effective Tensor Core throughput for this instruction regime is <strong>8,192 FLOPs/clk</strong>, the math time is:</p>

\[\text{Math cycles} = \frac{262{,}144}{8{,}192} = 32\]

<p>So shared memory still dominates:</p>

\[48 \text{ SMEM cycles} &gt; 32 \text{ Math cycles}\]

<p>That is the core result. For <code class="language-plaintext highlighter-rouge">N=64</code>, the instruction is physically <strong>SMEM-bound</strong>, not Tensor-Core-bound. Only when <code class="language-plaintext highlighter-rouge">N=128</code> do the two sides align at roughly <strong>64 cycles each</strong>, marking the transition into a math-limited regime.</p>

<p>This produces a distinctly sloped roofline at exactly <strong>128 B/clk</strong>. On Blackwell, feeding the Tensor Cores is often harder than using them.</p>
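
<p>The same arithmetic can be scripted to see where the crossover sits. A minimal sketch, reusing the 128 B/clk shared-memory rate and 8,192 FLOPs/clk tensor throughput quoted above (both are the assumed figures of this worked example, not universal constants):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>BYTES_PER_ELEM = 2          # FP16 operands
SMEM_BW = 128               # bytes per clock (assumed sustained shared-memory rate)
TC_FLOPS = 8192             # FLOPs per clock (assumed for this instruction regime)

def mma_cycles(M, N, K):
    smem = BYTES_PER_ELEM * (M * K + N * K) / SMEM_BW   # operand service time
    math = 2 * M * N * K / TC_FLOPS                     # tensor-core time
    return smem, math

for n in (64, 128, 256):
    smem, math = mma_cycles(M=128, N=n, K=16)
    regime = "SMEM-bound" if smem &gt; math else "math-bound"
    print(f"N={n:3d}: SMEM={smem:5.1f} cyc, math={math:5.1f} cyc ({regime})")
</code></pre></div></div>

<p>For <code class="language-plaintext highlighter-rouge">N=64</code> the loop reports 48 versus 32 cycles; at <code class="language-plaintext highlighter-rouge">N=128</code> the two sides meet at 64 cycles, matching the transition described above.</p>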

<h2 id="6-why-2sm-mma-can-scale-by-more-than-2x">6. Why 2SM MMA can scale by more than 2x</h2>

<p>This shared-memory bottleneck explains a counterintuitive result: in <strong>SS mode</strong> and for small shapes, <strong>2SM MMA can achieve greater than 2x strong scaling over 1SM MMA</strong>.</p>

<p>That is not magic. It is bottleneck removal.</p>

<p>When the work is split across two SMs, each SM contributes its own shared-memory bandwidth. The kernel is no longer constrained by the single-SM SMEM ceiling that held back the 1SM path. In effect, the architecture doubles both the compute resources and the on-chip bandwidth feeding them, so the observed speedup can exceed the naive 2.0x expectation.</p>
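
<p>One way to see how the speedup can exceed 2x is a toy model (my own simplification, not taken from the article) in which the B tile is fetched once and served to both CTAs of the pair, while each SM contributes its own shared-memory bandwidth and tensor throughput:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def issue_cycles(M, N, K, num_sm):
    smem_bw, tc_flops = 128 * num_sm, 8192 * num_sm      # each SM adds its own resources
    operand_bytes = 2 * (M * K + N * K)                  # FP16; B assumed fetched only once
    return max(operand_bytes / smem_bw, 2 * M * N * K / tc_flops)

t_1sm = issue_cycles(M=128, N=64, K=16, num_sm=1)        # 48 cycles, SMEM-bound
t_2sm = issue_cycles(M=256, N=64, K=16, num_sm=2)        # 40 cycles for twice the work
print(f"strong scaling: {2 * t_1sm / t_2sm:.2f}x")       # 2.40x under these assumptions
</code></pre></div></div>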

<p>In <strong>TS mode</strong>, where operand A comes from TMEM rather than shared memory, scaling behaves much more cleanly and sits near the expected <strong>2.0x</strong>.</p>

<h2 id="7-latency-and-data-format-still-matter">7. Latency and data format still matter</h2>

<p>Single-instruction latency reveals more of the underlying machine behavior.</p>

<p>Latency grows roughly linearly from <code class="language-plaintext highlighter-rouge">N=64</code> to <code class="language-plaintext highlighter-rouge">N=128</code>, then spikes at <code class="language-plaintext highlighter-rouge">N=256</code>. Data format also changes the ordering:</p>

\[\text{S8} &lt; \text{BF16} = \text{E4M3} = \text{F4} &lt; \text{MXF8} = \text{MXF4}\]

<p>The intuition is straightforward:</p>

<ul>
  <li><strong>S8</strong> is fastest because integer tensor operations are power-efficient and simple</li>
  <li><strong>Microscaled formats</strong> pay a small extra cost to derive and apply scale factors</li>
</ul>

<p>Even if an instruction is well chosen, issue efficiency remains a separate problem. To truly approach speed-of-light throughput, a kernel likely needs on the order of <strong>256 to 1024 in-flight MMA instructions</strong> so that issue overhead and commit waits are fully amortized.</p>

<p>Most real kernels are nowhere near that. They often carry only <strong>1 to 4 in-flight MMAs</strong>, which artificially caps throughput around <strong>78% to 80% of speed-of-light</strong>. That is why maximizing MMA instruction size per shared-memory tile is not optional on Blackwell; it is one of the few levers strong enough to move the roofline.</p>

<h2 id="8-practical-rules-for-kernel-writers">8. Practical rules for kernel writers</h2>

<p>Blackwell tuning can be summarized in a few rules:</p>

<ol>
  <li><strong>Design around TMEM explicitly.</strong> Accumulators no longer belong to registers or warps, so the pipeline has to be structured around TMEM residency and transfer boundaries.</li>
  <li><strong>Treat CTA clusters as first-class hardware.</strong> Launch geometry, cluster size, and GPC packing all affect whether the kernel actually runs in parallel.</li>
  <li><strong>Choose the copy path by access pattern, not ideology.</strong> TMA wins on large regular tiles; LDGSTS wins on responsiveness and irregularity.</li>
  <li><strong>Do not treat remote DSMEM like local SMEM.</strong> Use <code class="language-plaintext highlighter-rouge">cp.async.bulk</code> for real inter-CTA throughput.</li>
  <li><strong>Expect shared memory to be the bottleneck before tensor math.</strong> For SS-mode kernels, shape selection and operand staging dominate achievable performance.</li>
  <li><strong>Use larger MMA shapes and deeper in-flight pipelines whenever possible.</strong> Blackwell leaves a lot of performance stranded when kernels are too shallow.</li>
</ol>

<p>The headline message is simple: <strong>Blackwell’s Tensor Cores got faster, but the software problem got harder</strong>. The best kernels are no longer the ones that merely maximize FLOPs. They are the ones that understand where the new bottlenecks moved, and then reorganize the entire pipeline around those new limits.</p>]]></content><author><name>Jianyu Huang</name><email>jianyu0huang@gmail.com</email></author><category term="CUDA" /><category term="CUDA" /><category term="NVIDIA" /><category term="Blackwell" /><category term="GEMM" /><category term="TMEM" /><summary type="html"><![CDATA[Reading notes based primarily on: Dissecting NVIDIA Blackwell Tensor Cores]]></summary></entry><entry><title type="html">Gemma 4: Architecture and Multimodal Innovations</title><link href="https://jianyuh.github.io/architecture/2026/04/05/Gemma4.html" rel="alternate" type="text/html" title="Gemma 4: Architecture and Multimodal Innovations" /><published>2026-04-05T00:00:00+00:00</published><updated>2026-04-05T00:00:00+00:00</updated><id>https://jianyuh.github.io/architecture/2026/04/05/Gemma4</id><content type="html" xml:base="https://jianyuh.github.io/architecture/2026/04/05/Gemma4.html"><![CDATA[<p>Reading notes on <a href="https://ai.google.dev/gemma/docs/core/model_card_4">Gemma 4 Model Card</a>.</p>

<p>Gemma 4 represents a comprehensive effort in optimizing parameter efficiency, memory bandwidth, and long-context multimodal processing across edge and server deployments.</p>

<hr />

<h2 id="1-model-lineup">1. Model Lineup</h2>

<p>Gemma 4 introduces four variants, categorized by structural paradigm:</p>

<table>
  <thead>
    <tr>
      <th style="text-align: left">Model</th>
      <th style="text-align: left">Total / Active Params</th>
      <th style="text-align: left">Layers</th>
      <th style="text-align: left">Context</th>
      <th style="text-align: left">Modalities</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: left"><strong>E2B</strong></td>
      <td style="text-align: left">5.1B / 2.3B effective</td>
      <td style="text-align: left">35</td>
      <td style="text-align: left">128K</td>
      <td style="text-align: left">Text, Image, Audio</td>
    </tr>
    <tr>
      <td style="text-align: left"><strong>E4B</strong></td>
      <td style="text-align: left">8B / 4.5B effective</td>
      <td style="text-align: left">42</td>
      <td style="text-align: left">128K</td>
      <td style="text-align: left">Text, Image, Audio</td>
    </tr>
    <tr>
      <td style="text-align: left"><strong>26B A4B</strong></td>
      <td style="text-align: left">25.2B / 3.8–4B active (MoE)</td>
      <td style="text-align: left">—</td>
      <td style="text-align: left">256K</td>
      <td style="text-align: left">Text, Image</td>
    </tr>
    <tr>
      <td style="text-align: left"><strong>31B</strong></td>
      <td style="text-align: left">30.7B dense</td>
      <td style="text-align: left">—</td>
      <td style="text-align: left">256K</td>
      <td style="text-align: left">Text, Image</td>
    </tr>
  </tbody>
</table>

<hr />

<h2 id="2-text-architecture">2. Text Architecture</h2>

<h3 id="interleaved-local-and-global-attention">Interleaved Local and Global Attention</h3>

<p>Instead of standard full attention, Gemma 4 interleaves local (sliding window) with global (full) attention:</p>
<ul>
  <li><strong>Sliding Window:</strong> 512 tokens for E2B/E4B, 1024 for 26B/31B.</li>
  <li><strong>Interleaving Ratio:</strong> 5:1 (5 local layers per 1 global layer), except E2B which uses 4:1.</li>
  <li><strong>Final Layer Constraint:</strong> The final layer is always global, ensuring full-sequence synthesis before output.</li>
</ul>
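
<p>A tiny sketch of how such a schedule might be generated (my reading of the description above; the exact placement of global layers in the released models may differ):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def attention_pattern(num_layers, ratio=5):
    """Return 'L' (sliding-window) or 'G' (global) for each layer, with a
    ratio:1 local:global interleave and the final layer forced to global."""
    pattern = ["G" if (i + 1) % (ratio + 1) == 0 else "L" for i in range(num_layers)]
    pattern[-1] = "G"                      # final layer is always global
    return pattern

print("".join(attention_pattern(12)))      # LLLLLGLLLLLG
</code></pre></div></div>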

<h3 id="shared-kv-cache-and-kv-trick">Shared KV Cache and K=V Trick</h3>

<p>Two major optimizations for global attention memory overhead:</p>
<ul>
  <li><strong>GQA with K=V:</strong> Local layers use standard GQA (2 Query heads per 1 KV head). Global layers scale to 8 Query heads per KV head with doubled Key dimensions. For global attention layers, <strong>Keys are set equal to Values ($K=V$)</strong>, collapsing the KV-cache into a K-cache only and halving the memory footprint for those states.</li>
  <li><strong>Shared KV Cache:</strong> The last $N$ layers reuse $K$ and $V$ tensors from the previous non-shared layer of the same attention type (sliding or full), cutting both redundant compute and memory.</li>
</ul>

<h3 id="proportional-rope-p-rope">Proportional RoPE (p-RoPE)</h3>

<p>Standard RoPE applies rotation across all embedding pairs. With 256K contexts, low-frequency pair rotations accumulate and add noise to semantic tracking.</p>

<p>Gemma 4 uses <strong>p-RoPE</strong> on global attention layers only: with $p = 0.25$, only the first 25% of coordinate pairs receive RoPE positional information, while the remaining 75% receive zero rotation. This isolates positional data to high-frequency dimensions, leaving low-frequency dimensions clean for semantic meaning.</p>
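
<p>A hypothetical sketch of what such a partial rotation looks like (the pairing convention and frequency schedule here are simplified guesses, not the released implementation):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

def p_rope(x, positions, p=0.25, base=10000.0):
    """Rotate only the first p fraction of coordinate pairs (the high-frequency
    ones); leave the remaining pairs unrotated. Pair i = dims (i, i + rot)."""
    num_pairs = x.shape[-1] // 2
    rot = int(p * num_pairs)                               # pairs that receive RoPE
    inv_freq = base ** (-np.arange(rot) / num_pairs)       # highest frequencies first
    angles = positions[:, None] * inv_freq[None, :]        # (seq, rot)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :rot], x[..., rot:2 * rot]
    out = x.copy()
    out[..., :rot] = x1 * cos - x2 * sin
    out[..., rot:2 * rot] = x1 * sin + x2 * cos
    return out                                             # remaining 75% untouched

q = p_rope(np.random.randn(8, 128), positions=np.arange(8))
</code></pre></div></div>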

<hr />

<h2 id="3-per-layer-embeddings-ple">3. Per-Layer Embeddings (PLE)</h2>

<p>The “E” in E2B/E4B stands for “Effective” parameters. Instead of adding depth or width, these models use Per-Layer Embeddings.</p>

<ul>
  <li><strong>Architecture:</strong> Beyond the standard embedding lookup ($V \times d_{model}$), PLE adds a massive lookup table of dimensions $V \times d_{PLE} \times L$ (Vocabulary $\times$ PLE dim $\times$ Number of Layers). For E2B: $262{,}144 \times 256 \times 35$.</li>
  <li><strong>Flow:</strong> At ingestion, the model fetches a per-layer $d_{PLE}$ embedding for every layer. During the forward pass at layer $l$, a gating function weights this layer-specific embedding, projects it back to $d_{model}$ (1,536 for E2B), and combines it with the residual stream via a lightweight block after attention and FFN.</li>
  <li><strong>Hardware Insight:</strong> The PLE lookup table is queried only once per token at inference start, so it can reside in flash memory rather than VRAM. This allows a 5.1B parameter model to run at the speed and VRAM cost of a 2.3B model — hence the “effective” parameter metric.</li>
</ul>
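
<p>A minimal sketch of that flow for a single token and layer, with toy dimensions and a guessed gating form (the model card does not spell out the exact gate or projection):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

# Toy dimensions for the sketch; the real E2B table is 262,144 x 35 x 256.
V, L, D_MODEL, D_PLE = 1000, 35, 1536, 256

ple_table = (np.random.randn(V, L, D_PLE) * 0.02).astype(np.float16)  # can live in flash
proj = np.random.randn(L, D_PLE, D_MODEL).astype(np.float32) * 0.02

def ple_update(hidden, token_id, layer):
    """Fetch the layer-specific embedding for this token, gate it (assumed
    sigmoid form), project back to d_model, and add it to the residual stream."""
    e = ple_table[token_id, layer].astype(np.float32)      # one lookup per token/layer
    gate = 1.0 / (1.0 + np.exp(-(hidden[:D_PLE] @ e)))     # scalar gate (assumption)
    return hidden + gate * (e @ proj[layer])

h = ple_update(np.random.randn(D_MODEL), token_id=42, layer=7)
</code></pre></div></div>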

<hr />

<h2 id="4-mixture-of-experts-moe">4. Mixture of Experts (MoE)</h2>

<p>The 26B A4B model uses sparse activation to run at 4B-model speed:</p>
<ul>
  <li><strong>Routing:</strong> FFN is split into 128 total experts; a router selects 8 experts per token. The token embedding is scaled by the router’s probability for each expert’s contribution.</li>
  <li><strong>Shared Expert:</strong> 1 shared expert is always activated for every token, formulated at 3x the size of a standard expert to capture broad, general knowledge.</li>
</ul>
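
<p>A toy sketch of this routing scheme (expert internals, dimensions, and initialization are placeholders; only the top-8-of-128 routing plus the 3x-sized shared expert follow the description above):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

def make_expert(d, scale=1.0):
    W = np.random.randn(int(scale * d), d) * 0.05
    V = np.random.randn(d, int(scale * d)) * 0.05
    return lambda x: V @ np.tanh(W @ x)

def moe_ffn(x, router_w, experts, shared_expert, top_k=8):
    """Top-k router: pick 8 of 128 experts, weight each by its router
    probability, and always add the shared expert."""
    logits = router_w @ x
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    chosen = np.argsort(probs)[-top_k:]                    # indices of selected experts
    out = shared_expert(x)                                 # shared expert: every token
    for i in chosen:
        out = out + probs[i] * experts[i](x)               # scale by router probability
    return out

d, n_exp = 64, 128
experts = [make_expert(d) for _ in range(n_exp)]
shared = make_expert(d, scale=3.0)                         # 3x the size of a standard expert
y = moe_ffn(np.random.randn(d), np.random.randn(n_exp, d), experts, shared)
</code></pre></div></div>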

<hr />

<h2 id="5-multimodal-encoders">5. Multimodal Encoders</h2>

<h3 id="vision-encoder">Vision Encoder</h3>

<p>Based on a ViT (~150M params for E-models, ~550M for larger models):</p>

<ul>
  <li><strong>Variable Aspect Ratios:</strong> Adaptively resizes input while maintaining aspect ratio, applying padding only where the image doesn’t perfectly divide into 16x16 pixel patches (no warping into squares).</li>
  <li><strong>2D RoPE:</strong> Patch embeddings are split into two halves — one receives RoPE tracking width ($w$), the other tracks height ($h$), baking 2D coordinates into the transformer.</li>
  <li><strong>Soft Token Budget &amp; Spatial Pooling:</strong> Users define a “soft token budget” ($B$, range 70–1120). The image is divided into at most $B \times 9$ patches. Every $3 \times 3$ grid of neighboring patches is averaged (pooled) into a single embedding. A linear projection and RMSNorm then map vision space into text embedding space.</li>
</ul>

<h3 id="audio-encoder-e2b-and-e4b-only">Audio Encoder (E2B and E4B only)</h3>

<p>A ~300M parameter encoder processes raw audio (up to 30 seconds) into LLM-compatible tokens:</p>
<ol>
  <li>Extract Mel-spectrogram features (time vs. frequency).</li>
  <li>Group features into chunks.</li>
  <li>Overlap and downsample chunks via two 2D convolutional layers.</li>
  <li>Process through a <strong>Conformer</strong> (Transformer Encoder with convolutional module).</li>
  <li>Linear projection to align with the Gemma 4 embedding space.</li>
</ol>

<hr />

<h2 id="6-deployment-notes">6. Deployment Notes</h2>

<ul>
  <li><strong>Sampling:</strong> DeepMind recommends <code class="language-plaintext highlighter-rouge">temperature=1.0</code>, <code class="language-plaintext highlighter-rouge">top_p=0.95</code>, <code class="language-plaintext highlighter-rouge">top_k=64</code> across all models.</li>
  <li><strong>Thinking Mode:</strong> Activated by placing a <code class="language-plaintext highlighter-rouge">&lt;|think|&gt;</code> token in the system prompt. The model outputs reasoning in <code class="language-plaintext highlighter-rouge">&lt;|channel&gt;thought\n ... &lt;channel|&gt;</code> tags. In multi-turn conversations, historical thoughts must be stripped — only final answers remain in context history.</li>
  <li><strong>Modality Order:</strong> Image and audio soft tokens should always be placed <em>before</em> the text prompt. For PLE conditioning on multimodal inputs, audio/image positions use the <code class="language-plaintext highlighter-rouge">pad</code> token ID, passing neutral per-layer signals.</li>
</ul>]]></content><author><name>Jianyu Huang</name><email>jianyu0huang@gmail.com</email></author><category term="Architecture" /><category term="Architecture" /><category term="Multimodal" /><category term="MoE" /><category term="Google" /><summary type="html"><![CDATA[Reading notes on Gemma 4 Model Card.]]></summary></entry><entry><title type="html">Residual Matrix Transformers – Scaling the Residual Stream</title><link href="https://jianyuh.github.io/architecture/2026/04/03/RMT.html" rel="alternate" type="text/html" title="Residual Matrix Transformers – Scaling the Residual Stream" /><published>2026-04-03T00:00:00+00:00</published><updated>2026-04-03T00:00:00+00:00</updated><id>https://jianyuh.github.io/architecture/2026/04/03/RMT</id><content type="html" xml:base="https://jianyuh.github.io/architecture/2026/04/03/RMT.html"><![CDATA[<p>Paper: <a href="https://arxiv.org/abs/2505.14550">Residual Matrix Transformers: Scaling the Size of the Residual Stream</a>.</p>

<h2 id="core-motivation">Core Motivation</h2>

<p>The current paradigm of LLM scaling (Kaplan et al., 2020) relies heavily on expanding model size, data, and compute. However, the AI field is rapidly approaching the physical limits of available data and energy. While sparse modifications like Mixture of Experts (MoE) scale parameters without scaling per-example compute, the <strong>Residual Matrix Transformer (RMT)</strong> proposes an entirely new axis for scaling: <strong>the residual stream size</strong>.</p>

<p>In a standard transformer, the residual stream is a vector of dimension $D$ that acts as a memory bus across layers. Scaling $D$ linearly scales the size of all parameter matrices, inflating parameter and FLOP counts quadratically. The RMT solves this by replacing the residual stream vector with an <strong>outer product memory matrix</strong>, decoupling the residual stream’s bandwidth from the model’s compute and parameter footprint.</p>

<h2 id="mathematical-framework-outer-product-memory">Mathematical Framework: Outer Product Memory</h2>

<p>The architecture builds on outer product memory stores (Kohonen, 1972; Anderson, 1972).</p>

<p>Given a set of $N$ key vectors $q^{(p)} \in \mathbb{R}^{D_k}$ and data vectors $x^{(p)} \in \mathbb{R}^{D_v}$ for $p = 1, \ldots, N$, an outer product store $M \in \mathbb{R}^{D_k \times D_v}$ is constructed by summing their outer products:</p>

\[M = \text{Norm}\left(\sum_{p=1}^N q^{(p)} \otimes x^{(p)}\right)\]

<p>where $u \otimes v = uv^T$, and Norm is LayerNorm.</p>

<p>To retrieve a specific data vector $x^{(r)}$ from $M$, we perform a tensor contraction over the first dimension using its associated key vector $q^{(r)}$:</p>

\[x^{(r)} \approx q^{(r)} \cdot_1 M\]
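
<p>A quick numeric sanity check of these two equations (LayerNorm omitted for clarity; retrieval is exact when the key vectors are orthonormal and only approximate otherwise):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

rng = np.random.default_rng(0)
N, Dk, Dv = 8, 64, 32
Q = np.linalg.qr(rng.standard_normal((Dk, N)))[0].T        # N orthonormal keys, shape (N, Dk)
X = rng.standard_normal((N, Dv))                           # data vectors

M = sum(np.outer(Q[p], X[p]) for p in range(N))            # M = sum_p q^(p) (x^(p))^T
x_hat = Q[3] @ M                                           # contract over the first dimension
print(np.allclose(x_hat, X[3]))                            # True for orthonormal keys
</code></pre></div></div>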

<h2 id="rmt-architecture">RMT Architecture</h2>

<p>In the RMT, the batched residual stream for $N$ tokens is represented as a tensor $X \in \mathbb{R}^{D_k \times D_v \times N}$, rather than the standard matrix $X \in \mathbb{R}^{D \times N}$.</p>

<h3 id="attention-layer">Attention Layer</h3>

<p>In a standard transformer, features are retrieved using linear transformations (e.g., $Q^{(h)} = W_Q^{(h)} X$). In the RMT, the expensive $W_Q, W_K, W_V \in \mathbb{R}^{D_h \times D}$ weight matrices are <strong>completely removed and replaced by learned key vectors</strong> $r_Q^{(h)}, r_K^{(h)}, r_V^{(h)} \in \mathbb{R}^{D_k}$.</p>

<p>The attention inputs are retrieved via tensor contraction:</p>

\[Q^{(h)} = r_Q^{(h)} \cdot_1 X\]

\[K^{(h)} = r_K^{(h)} \cdot_1 X\]

\[V^{(h)} = r_V^{(h)} \cdot_1 X\]

<p>These resulting matrices belong to $\mathbb{R}^{D_v \times N}$ (where $D_v$ acts as the attention head dimension $D_h$). Standard Head Attention (SHA) is applied normally, and the output is written back into the residual matrix using an output key vector $w_O^{(h)} \in \mathbb{R}^{D_k}$:</p>

\[MHA(X) = \sum_{h=1}^R w_O^{(h)} \otimes SHA(Q^{(h)}, K^{(h)}, V^{(h)})\]
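
<p>The following sketch shows how these contractions look in code for a single RMT attention layer (shapes follow the notation above; initialization is arbitrary and the causal mask is omitted for brevity):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

Dk, Dv, N, R = 64, 128, 16, 8                              # R = number of heads
X = np.random.randn(Dk, Dv, N)                             # residual tensor
r_Q, r_K, r_V, w_O = (np.random.randn(R, Dk) for _ in range(4))

def sha(Q, K, V):                                          # standard head attention
    A = (Q.T @ K) / np.sqrt(Q.shape[0])                    # (N, N) scores; no causal mask
    A = np.exp(A - A.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)
    return V @ A.T                                         # (Dv, N)

out = np.zeros_like(X)
for h in range(R):
    Qh = np.einsum("k,kvn-&gt;vn", r_Q[h], X)                 # retrieve via tensor contraction
    Kh = np.einsum("k,kvn-&gt;vn", r_K[h], X)
    Vh = np.einsum("k,kvn-&gt;vn", r_V[h], X)
    out += np.einsum("k,vn-&gt;kvn", w_O[h], sha(Qh, Kh, Vh)) # write back via outer product
</code></pre></div></div>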

<h3 id="feed-forward-layer-ffn">Feed-Forward Layer (FFN)</h3>

<p>Unlike attention, the FFN retains its standard linear transformations $W_1$ and $W_2$, as evidence suggests these matrices store critical factual information rather than just routing it. The RMT uses key vector “adapters”: it retrieves $R$ data vectors from the matrix, concatenates them for the standard FFN operation, and then splits (un-vecs) the output to store it back into the residual matrix via outer products with $w_{FF}^{(h)}$.</p>

<h2 id="variance-propagation">Variance Propagation</h2>

<p>For deep networks to initialize and train stably, the mean and variance of activations and gradients must propagate effectively (Glorot &amp; Bengio, 2010). The paper provides a closed-form derivation proving that outer product storage and retrieval maintain healthy variance.</p>

<p>Let the forward storage operation for a single token be $X_{out} = \sum_{h=1}^R w^{(h)} \otimes x_{in}^{(h)}$, where weights are initialized independently with mean 0.</p>

\[E[X_{out, ij}] = \sum_{h=1}^R E[w_i^{(h)}] E[x_{in, j}^{(h)}] = 0\]

<p>The variance propagates as:</p>

\[Var(X_{out, ij}) = \sum_{h=1}^R Var\left(w_i^{(h)} x_{in, j}^{(h)}\right)\]

<p>Assuming independence and $\mu_w = 0$:</p>

\[Var(X_{out, ij}) = R \sigma_w^2 (\sigma_{x_{in}}^2 + \mu_{x_{in}}^2)\]

<p>By choosing standard initialization dimensions, the ratio $\frac{\sigma_{x_{out}}^2}{\sigma_{x_{in}}^2}$ can be kept close to 1, avoiding vanishing or exploding gradients.</p>
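
<p>A quick Monte Carlo check of this expression (a sanity check of the formula, not an experiment from the paper):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

rng = np.random.default_rng(0)
R, Dk, Dv = 8, 64, 64
var_w, var_x, mu_x = 1.0 / R, 1.0, 0.3

w = rng.normal(0.0, np.sqrt(var_w), size=(R, Dk))
x = rng.normal(mu_x, np.sqrt(var_x), size=(R, Dv))
X_out = sum(np.outer(w[h], x[h]) for h in range(R))

# Empirical vs. predicted variance: both should land roughly at 1.09.
print(X_out.var(), R * var_w * (var_x + mu_x**2))
</code></pre></div></div>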

<h2 id="scaling-performance">Scaling Performance</h2>

<p>The replacement of standard weight matrices with vectors transforms the economics of the network:</p>

<ul>
  <li><strong>Cost of Scaling:</strong> Increasing the residual stream size by 100% in a standard transformer yields ~94% more FLOPs and 100% more parameters. In the RMT, increasing residual matrix capacity ($D_k$) by 100% increases both parameters and FLOPs by <strong>&lt; 1%</strong>.</li>
  <li><strong>Efficiency:</strong> To reach the identical target loss, the RMT uses <strong>58% fewer FLOPs, 25% fewer parameters, and 41% fewer training tokens</strong> compared to a Chinchilla-optimal baseline transformer.</li>
  <li><strong>Zero-Shot Dominance:</strong> Evaluated on LAMBADA, PIQA, and ARC, an RMT trained on 28% fewer FLOPs outperformed a standard transformer that was 33% larger.</li>
  <li><strong>“Free” Scaling Axis:</strong> When holding parameter count, dataset size, and compute budget constant, expanding the residual stream size $D_k$ monotonically decreased validation loss.</li>
</ul>

<h2 id="caveats">Caveats</h2>

<ul>
  <li><strong>Memory:</strong> Roughly equivalent during training (larger residual activations for gradient checkpointing offset by fewer model parameters), but strictly more efficient during inference.</li>
  <li><strong>Wall-Clock Time:</strong> In current PyTorch/JAX ecosystems, highly optimized GEMM kernels run significantly faster than unoptimized tensor contractions. Despite needing dramatically fewer FLOPs, the RMT takes ~43% longer per training step. A custom CUDA kernel for contracting over small key vectors could bridge this gap.</li>
</ul>

<h2 id="summary">Summary</h2>

<p>The RMT unlocks a new “free” scaling dimension by replacing the residual stream vector with an outer product memory matrix. Weight matrices in attention are replaced by learned key vectors, making residual stream scaling nearly cost-free in parameters and FLOPs. This promises significant reductions in the energy and data required for frontier LLM training, pending kernel-level optimization to close the wall-clock gap.</p>]]></content><author><name>Jianyu Huang</name><email>jianyu0huang@gmail.com</email></author><category term="Architecture" /><category term="Architecture" /><category term="Residual" /><summary type="html"><![CDATA[Paper: Residual Matrix Transformers: Scaling the Size of the Residual Stream.]]></summary></entry><entry><title type="html">Nvidia Inference: Disaggregated Decode, LPU Integration, and Datacenter Macro-Architectures</title><link href="https://jianyuh.github.io/ai/2026/03/30/nvidia-inference.html" rel="alternate" type="text/html" title="Nvidia Inference: Disaggregated Decode, LPU Integration, and Datacenter Macro-Architectures" /><published>2026-03-30T00:00:00+00:00</published><updated>2026-03-30T00:00:00+00:00</updated><id>https://jianyuh.github.io/ai/2026/03/30/nvidia-inference</id><content type="html" xml:base="https://jianyuh.github.io/ai/2026/03/30/nvidia-inference.html"><![CDATA[<p>Reading notes based on:</p>
<ul>
  <li><a href="https://newsletter.semianalysis.com/p/nvidia-the-inference-kingdom-expands">Nvidia – The Inference Kingdom Expands (SemiAnalysis)</a></li>
</ul>

<p>At GTC 2026, Nvidia aggressively expanded its hardware and inference ecosystem to address the emerging bottlenecks of memory-bound LLM decode phases, massive CPU demands in reinforcement learning, and KV cache storage limits. The company announced three entirely new system architectures: <strong>Groq LPX</strong> (integrating the LP30 chip), <strong>Vera ETL256</strong>, and <strong>STX</strong>, alongside major updates to the <strong>Kyber rack architecture (NVL144, NVL576, and NVL1152)</strong>.</p>

<hr />

<h2 id="1-the-groq-acquisition-and-lp30-architecture">1. The Groq “Acquisition” and LP30 Architecture</h2>

<p>Nvidia functionally acquired Groq via a $20B IP licensing and acqui-hire deal, sidestepping a drawn-out antitrust review. Groq’s hardware, specifically the LPU (Language Processing Unit), is used to build a disaggregated decode system.</p>

<h3 id="lp30-groq-3-lpu-hardware-details">LP30 (Groq 3 LPU) Hardware Details</h3>

<ul>
  <li><strong>Silicon &amp; Manufacturing:</strong> Designed on <strong>Samsung’s SF4X node</strong>, skipping the failed LPU 2 which suffered from 112G SerDes malfunctions. Using Samsung SF4X allows Nvidia to bypass TSMC N3 logic constraints and HBM allocation constraints, enabling incremental revenue scale-up.</li>
  <li><strong>Memory Hierarchy:</strong> Features a single-level memory hierarchy with <strong>500MB of on-chip SRAM</strong> (up from 230MB in Gen 1) providing an ultra-fast <strong>150 TB/s memory bandwidth</strong>. It lacks HBM entirely.</li>
  <li><strong>Compute:</strong> Dedicated to tensor-first compute with <strong>1.2 PFLOPs of FP8</strong>, which is a fraction of standard GPU compute but highly optimized for deterministic execution. The chip pumps instructions vertically and streams data horizontally across functional slices (VXM, MEM, SXM, MXM).</li>
  <li><strong>Form Factor:</strong> Deployed in the <strong>LPX Compute Tray</strong>, featuring a belly-to-belly PCB design (8 LPUs on top, 8 on bottom) to minimize X/Y trace distances, alongside <strong>2 Altera “Fabric Expansion Logic” FPGAs</strong>, 1 Intel Granite Rapids CPU, and a BlueField-4 module.</li>
</ul>

<hr />

<h2 id="2-decoding-acceleration-techniques">2. Decoding Acceleration Techniques</h2>

<p>The integration of the LPU is designed to accelerate the <strong>latency-sensitive, memory-bounded decode phase</strong> of LLM inference, leaving the <strong>compute-intensive prefill phase to GPUs</strong>.</p>

<h3 id="attention-ffn-disaggregation-afd">Attention-FFN Disaggregation (AFD)</h3>

<p><strong>The Problem:</strong> During decode, GPU utilization for the Attention mechanism barely improves as batch sizes scale because it is bounded by loading KV cache. Conversely, Feed Forward Network (FFN) utilization scales effectively with larger batch sizes. In state-of-the-art sparse Mixture-of-Expert (MoE) models, utilization drops further as tokens route to a larger pool of experts.</p>

<p><strong>The Solution:</strong> Attention operations are <strong>stateful</strong> (relying on dynamic KV cache) and are thus mapped to HBM-heavy Rubin GPUs. FFN operations are <strong>stateless</strong> (depending only on token inputs) and are mapped to the SRAM-heavy, deterministic LPUs.</p>

<p><strong>Network Optimization:</strong> To hide the communication latency of dispatching tokens from GPU to LPU experts and combining them back, the system relies on <strong>ping-pong pipeline parallelism</strong>, allowing tokens to continuously bounce between GPUs and LPUs over Spectrum-X Ethernet.</p>

<p>The key insight here is a further decomposition beyond <a href="/llm%20inference/2025/03/30/prefill-decoding-disagg.html">prefill-decode disaggregation</a>. Within the decode phase itself, attention and FFN have fundamentally different computational profiles:</p>

<table>
  <thead>
    <tr>
      <th>Property</th>
      <th>Attention (Decode)</th>
      <th>FFN (Decode)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>State</td>
      <td>Stateful (KV cache)</td>
      <td>Stateless</td>
    </tr>
    <tr>
      <td>Bottleneck</td>
      <td>Memory bandwidth</td>
      <td>Compute</td>
    </tr>
    <tr>
      <td>Batch scaling</td>
      <td>Poor (KV-bound)</td>
      <td>Good</td>
    </tr>
    <tr>
      <td>Best hardware</td>
      <td>HBM-heavy GPU</td>
      <td>SRAM-heavy LPU</td>
    </tr>
  </tbody>
</table>
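
<p>To make the batch-scaling asymmetry concrete, here is a toy decode-step model with made-up numbers (illustrative only, not measurements of any real system): attention time grows linearly with batch size because every sequence streams its own KV cache, while FFN time stays roughly flat until the batch is large enough to become compute-bound.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>HBM_BW = 8e12                  # bytes/s per GPU (illustrative)
FLOPS = 2e15                   # FP8 FLOPs/s (illustrative)
KV_BYTES_PER_SEQ = 2e9         # KV cache streamed per decode step, per sequence
FFN_WEIGHT_BYTES = 20e9        # FFN weights streamed once per decode step
FFN_FLOPS_PER_TOKEN = 40e9

for batch in (1, 8, 64, 256):
    attn_ms = batch * KV_BYTES_PER_SEQ / HBM_BW * 1e3      # each sequence reads its own KV
    ffn_ms = max(FFN_WEIGHT_BYTES / HBM_BW,                # weights amortized over the batch
                 batch * FFN_FLOPS_PER_TOKEN / FLOPS) * 1e3
    print(f"batch={batch:4d}: attention {attn_ms:7.2f} ms, FFN {ffn_ms:5.2f} ms")
</code></pre></div></div>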

<h3 id="speculative-decoding--memory-management">Speculative Decoding &amp; Memory Management</h3>

<ul>
  <li>LPUs can host draft models or Multi-Token Prediction (MTP) layers to predict $k$ new tokens, which the main model verifies in a single “warm prefill” step.</li>
  <li>Unlike stateless FFNs, draft models/MTP layers require tens of gigabytes of dynamic KV cache. To support this, <strong>the Altera FPGAs provide up to 256GB of additional DDR5 memory per FPGA</strong> for the LPUs to access.</li>
</ul>

<p>This is a natural evolution of the <a href="/llm/inference/speculative%20decoding/2024/12/15/speculative-decoding.html">speculative decoding</a> paradigm: instead of running the draft model on the same GPU as the verifier, offloading it to a dedicated LPU eliminates the resource contention entirely.</p>

<hr />

<h2 id="3-networking-topologies-and-bandwidth-math">3. Networking Topologies and Bandwidth Math</h2>

<p>Nvidia’s systems push the physical limits of copper to keep TCO down, orchestrating incredibly dense electrical networks before resorting to optical interconnects.</p>

<h3 id="lpx-rack-network-math">LPX Rack Network Math</h3>

<ul>
  <li><strong>Intra-Tray:</strong> 16 LPUs connect via an all-to-all PCB mesh. Each LPU routes to 15 others via $4 \times 100G$ C2C links.</li>
  <li><strong>Intra-Rack (Inter-Node):</strong> Each LPU routes $2 \times 100G$ to 15 other nodes. With FPGAs connecting at 25G/50G, a node features 1,020 differential pairs. Across 16 nodes, the <strong>copper backplane supports 8,160 differential pairs</strong> ($16 \times 1020 / 2$).</li>
  <li><strong>Total intra-rack scale-up bandwidth:</strong></li>
</ul>

\[\text{BW} = 256 \text{ LPUs} \times 90 \text{ lanes} \times 112\text{ Gbps} / 8 \times 2 \text{ directions} = 645 \text{ TB/s}\]
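
<p>Spelled out as a quick calculation, the quoted figure follows directly from the lane counts above:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Reproducing the intra-rack scale-up bandwidth arithmetic quoted above.
lpus, lanes, gbps, directions = 256, 90, 112, 2
gb_per_s = lpus * lanes * gbps / 8 * directions    # divide by 8 to convert Gbit to GB
print(f"{gb_per_s / 1000:.0f} TB/s")               # ~645 TB/s
</code></pre></div></div>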

<h3 id="rubin-ultra-nvl144-kyber-rack-scale-up-math">Rubin Ultra NVL144 (Kyber Rack) Scale-up Math</h3>

<p>The Kyber rack fits 144 Rubin Ultra GPUs and 72 NVLink 7 switches.</p>

<ul>
  <li><strong>GPU Bandwidth:</strong> Each GPU uses 72 Differential Pairs (DPs). At $200\text{ Gbit/s bi-di}$ per channel, each GPU achieves <strong>14.4 Tbit/s uni-directional</strong> scale-up bandwidth.</li>
  <li><strong>Switch Bandwidth:</strong> Each NVSwitch 7 uses 144 lanes of 200G, totaling <strong>28.8 Tbit/s uni-directional</strong> bandwidth. Connecting these requires midplanes and copper flyover cables.</li>
  <li><strong>NVL288 Constraints:</strong> Scaling to 288 GPUs across two racks via copper would require 20,736 additional DPs, acting as a massive upper bound on cable content, unless higher radix switches are introduced.</li>
</ul>

<hr />

<h2 id="4-co-packaged-optics-cpo-vs-copper-roadmap">4. Co-Packaged Optics (CPO) vs. Copper Roadmap</h2>

<p>A key architectural insight from GTC 2026: <strong>Nvidia uses copper where it can, and optics where it must</strong>.</p>

<table>
  <thead>
    <tr>
      <th>System</th>
      <th>Generation</th>
      <th>Scale-up Interconnect</th>
      <th>CPO?</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>NVL72</td>
      <td>Rubin</td>
      <td>Intra-rack copper</td>
      <td>No</td>
    </tr>
    <tr>
      <td>NVL144</td>
      <td>Rubin Ultra</td>
      <td>Intra-rack copper (Kyber)</td>
      <td>No</td>
    </tr>
    <tr>
      <td>NVL576</td>
      <td>Rubin Ultra</td>
      <td>Intra-rack copper + inter-rack CPO</td>
      <td>Partial</td>
    </tr>
    <tr>
      <td>NVL1152</td>
      <td>Feynman</td>
      <td>Full rack-to-rack CPO</td>
      <td>Yes</td>
    </tr>
  </tbody>
</table>

<ul>
  <li><strong>Rubin / Rubin Ultra:</strong> Scale-up within NVL72 and NVL144 Kyber racks remains strictly copper.</li>
  <li><strong>NVL576 (Rubin Ultra):</strong> This 8-rack system will be the first introduction of <strong>CPO scale-up</strong>, utilizing a two-tier all-to-all network between racks, though intra-rack networking stays copper.</li>
  <li><strong>Feynman NVL1152:</strong> Will fully adopt CPO for rack-to-rack scale-up, overcoming the physical reach/shoreline limits of bumping electrical SerDes from 224G to 448G.</li>
</ul>

<p>The transition is driven by physics: as SerDes rates double, the reach of copper decreases and the shoreline (pin count per die edge) becomes the binding constraint. CPO sidesteps both by converting electrical signals to photons at the package boundary.</p>

<hr />

<h2 id="5-ancillary-infrastructure-vera-etl256-and-stx">5. Ancillary Infrastructure: Vera ETL256 and STX</h2>

<p>To prevent non-GPU components from bottlenecking system performance, Nvidia released auxiliary racks:</p>

<h3 id="vera-etl256">Vera ETL256</h3>

<p>A standalone, liquid-cooled rack packing <strong>256 Vera CPUs</strong> to handle the surging preprocessing and simulation demands of Reinforcement Learning workloads. It utilizes a single-tier Spectrum-6 multiplane topology.</p>

<p>This reflects a growing reality: RL training pipelines (reward model evaluation, environment simulation, data preprocessing) impose enormous CPU demands that steal GPU cycles if co-located. Dedicated CPU racks eliminate this contention.</p>

<h3 id="cmx-and-stx-context-memory-storage">CMX and STX (Context Memory Storage)</h3>

<p>To combat the exponential growth of KV Cache, Nvidia introduced <strong>Tier G3.5 NVMe storage</strong>. The STX reference rack utilizes <strong>BlueField-4 DPUs</strong> (featuring a Vera CPU, 2x CX-9 NICs, and 2x SOCAMM modules) to offload “warm” KV cache from expensive GPU HBM and system DRAM, optimizing inference efficiency.</p>

<p>This creates a multi-tier memory hierarchy for KV cache:</p>

<table>
  <thead>
    <tr>
      <th>Tier</th>
      <th>Medium</th>
      <th>Capacity</th>
      <th>Bandwidth</th>
      <th>Use Case</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Hot</td>
      <td>GPU HBM</td>
      <td>~192 GB/GPU</td>
      <td>~8 TB/s</td>
      <td>Active decoding</td>
    </tr>
    <tr>
      <td>Warm</td>
      <td>System DRAM / FPGA DDR5</td>
      <td>~512 GB–1 TB</td>
      <td>~200-400 GB/s</td>
      <td>Recent context, draft models</td>
    </tr>
    <tr>
      <td>Cold</td>
      <td>NVMe (STX)</td>
      <td>Multi-TB</td>
      <td>~50-100 GB/s</td>
      <td>Long context, session persistence</td>
    </tr>
  </tbody>
</table>

<hr />

<h2 id="6-strategic-synthesis">6. Strategic Synthesis</h2>

<p>Nvidia is fundamentally transitioning from selling standalone AI accelerators to orchestrating entire <strong>datacenter macro-architectures</strong>. Several strategic threads emerge:</p>

<ol>
  <li>
    <p><strong>Supply chain arbitrage:</strong> By disaggregating decode tasks to SRAM-heavy LPUs manufactured on Samsung’s SF4X (not TSMC N3), Nvidia bypasses critical industry supply constraints (TSMC N3 capacity and HBM allocation) while preserving high-margin GPU allocations strictly for compute-heavy prefill.</p>
  </li>
  <li>
    <p><strong>Copper maximalism:</strong> The aggressive densification of copper flyover cables and midplanes in Kyber and LPX architectures proves that TCO optimization remains paramount. Nvidia delays the transition to expensive CPO interconnects until physical electrical bounds absolutely mandate it in the Feynman generation.</p>
  </li>
  <li>
    <p><strong>Full-stack lock-in:</strong> With dedicated GPU racks (Kyber), CPU racks (Vera ETL256), storage racks (STX), and decode accelerator racks (LPX), plus the networking fabric (Spectrum-X, NVLink 7) tying them together, Nvidia is selling complete datacenter blueprints rather than individual chips.</p>
  </li>
  <li>
    <p><strong>Inference economics:</strong> The LPU integration directly addresses the economic pain of decode. During decode, GPUs are massively underutilized on compute but bottlenecked on memory bandwidth. Offloading FFN to cheap, deterministic LPUs improves the cost-per-token by utilizing the right silicon for the right workload.</p>
  </li>
</ol>]]></content><author><name>Jianyu Huang</name><email>jianyu0huang@gmail.com</email></author><category term="AI" /><category term="NVIDIA" /><category term="GTC" /><category term="Inference" /><category term="LPU" /><category term="Groq" /><category term="Rubin" /><category term="NVLink" /><category term="SemiAnalysis" /><summary type="html"><![CDATA[Reading notes based on: Nvidia – The Inference Kingdom Expands (SemiAnalysis)]]></summary></entry><entry><title type="html">Appointment with Spring: Our Seventh Year Among the Blossoms</title><link href="https://jianyuh.github.io/life/non-tech/2026/03/29/cherry-blossom.html" rel="alternate" type="text/html" title="Appointment with Spring: Our Seventh Year Among the Blossoms" /><published>2026-03-29T00:00:00+00:00</published><updated>2026-03-29T00:00:00+00:00</updated><id>https://jianyuh.github.io/life/non-tech/2026/03/29/cherry-blossom</id><content type="html" xml:base="https://jianyuh.github.io/life/non-tech/2026/03/29/cherry-blossom.html"><![CDATA[<blockquote>
  <p>“Year after year, the flowers bloom the same; year after year, the people are not those of old.” (年年岁岁花相似，岁岁年年人不同)</p>
</blockquote>

<p><img src="/assets/images/cherry_blossom.png" alt="Cherry Blossoms" /></p>

<p>My first encounter with these blossoms was in March 2020. My wife had just begun her doctoral studies, and the world was tilting into the surreal uncertainty of a global pandemic. I remember standing together under the pale canopy, our faces hidden behind masks, breathless at the sheer scale of the bloom. Even amidst the collective shadow of that year, the vitality of the trees felt like a promise—a glimmer of hope for a new chapter. Back then, I was still living a life in transit, commuting long distances for work. Those swirling petals felt like a fairy tale, a gentle welcome to the city we had decided to call home.</p>

<p>In the years that followed, the spring bloom became our annual pilgrimage. Each passing season brought new friends into our circle and saw our lives grow more settled and rooted. My phone turned into a digital archive of this evolution, filled with photos of my wife framed by those same pink clouds, year after year—until the masks were finally put away and the world moved on.</p>

<p>Then, life bloomed in a more literal sense. Two years ago, in the heart of spring, our son was born at a nearby hospital. When he was just two weeks old, the blossoms reached their peak. Though my wife was still navigating the fragile days of recovery, we walked together with my parents to capture a family portrait under the trees. Our son was a tiny, drowsy observer, taking in the bustling crowds with the wide, uncomprehending eyes of a newborn.</p>

<p>A year later, the scene shifted. He was a sturdy one-year-old in a blue sweater vest, sporting a look of adorable reluctance as we tried to capture the perfect photo. He had just learned to walk—steadying himself with our hands, stumbling, and picking himself back up. Though he couldn’t speak yet, he pointed at the falling petals with sheer excitement, narrating his joy in a language of babbles.</p>

<p>Now, at two years old, the trip to the grove has become a sacred family ritual—a living yardstick of our growth. This year, his excitement began the moment we left the house; he knew exactly where we were headed. Once we arrived, the roles reversed: instead of carrying him, we were the ones giving chase. He ran through the grass, his eyes darting from the crowds to the trees, then up to the drones and planes roaring in the distance. To him, the world is a gallery of wonders. I hope these moments settle into the foundation of his memory—a subconscious blueprint of beauty he can carry forever.</p>

<p>I do not know how much longer our path will keep us in this city. Life may eventually take us elsewhere, and a blooming season like this might one day become a rare luxury. But for now, I deeply cherish this seventh year.</p>

<p>A flowering tree can live for over a century. To the cherry blossoms, I am merely one of a million passing shadows in their long lives. But to me, they are a permanent part of my soul’s geography—a witness to the years we grew, the years we loved, and the beautiful, fleeting gift of being alive.</p>]]></content><author><name>Jianyu Huang</name><email>jianyu0huang@gmail.com</email></author><category term="Life" /><category term="Non-tech" /><category term="Parenting" /><category term="Reflection" /><category term="Memoir" /><category term="Spring" /><summary type="html"><![CDATA[“Year after year, the flowers bloom the same; year after year, the people are not those of old.” (年年岁岁花相似，岁岁年年人不同)]]></summary></entry><entry><title type="html">Attention Residuals (AttnRes) – Generalizing Depth-wise Information Flow in LLMs</title><link href="https://jianyuh.github.io/residual/2026/03/16/attention-residuals.html" rel="alternate" type="text/html" title="Attention Residuals (AttnRes) – Generalizing Depth-wise Information Flow in LLMs" /><published>2026-03-16T00:00:00+00:00</published><updated>2026-03-16T00:00:00+00:00</updated><id>https://jianyuh.github.io/residual/2026/03/16/attention-residuals</id><content type="html" xml:base="https://jianyuh.github.io/residual/2026/03/16/attention-residuals.html"><![CDATA[<p>Paper: <a href="https://github.com/MoonshotAI/Attention-Residuals/blob/master/Attention_Residuals.pdf">Attention Residuals</a></p>

<p>This paper asks a simple question: if attention replaced recurrence along the sequence dimension, why are we still using a fixed additive recurrence along the depth dimension?</p>

<h2 id="motivation-residuals-behave-like-depth-wise-recurrence">Motivation: Residuals Behave Like Depth-Wise Recurrence</h2>

<p>Standard residual connections can be written as:</p>

\[h^l = h^{l-1} + f_{l-1}(h^{l-1})\]

<p>Unrolling over depth gives:</p>

\[h^l = h^1 + \sum_{i=1}^{l-1} f_i(h^i)\]

<p>That view makes the paper’s core observation clear: every layer sees a uniformly weighted sum of all earlier layer updates. The authors call this a <strong>time-depth duality</strong>. In an RNN, all past tokens are compressed into a single hidden state over time. In a Transformer with standard residuals, all past layer outputs are compressed into a single hidden state over depth.</p>

<p>The paper argues that this becomes especially problematic in PreNorm LLMs. Because the residual stream keeps accumulating unweighted updates, hidden-state magnitudes tend to grow with depth, leading to <strong>PreNorm dilution</strong>. Later layers then have to produce larger and larger updates just to maintain the same influence on the final representation.</p>

<h2 id="full-attention-residuals">Full Attention Residuals</h2>

<p>The proposed fix is to replace fixed accumulation with learned softmax attention over previous layers:</p>

\[h^l = \sum_{i=0}^{l-1} \alpha_{i \to l} \cdot v_i\]

<p>Here, the values $v_i$ are the token embedding ($v_0$) and the outputs of the prior layers, while the attention weights are:</p>

\[\alpha_{i \to l} = \frac{\phi(q^l, k_i)}{\sum_{j=0}^{l-1} \phi(q^l, k_j)}\]

<p>Instead of deriving the query from the current hidden state, each layer uses a learned parameter vector:</p>

\[q^l = w^l \in \mathbb{R}^d\]

<p>This detail matters. The query is static, so depth-wise attention can still be computed in parallel across layers. The keys are RMS-normalized before scoring,</p>

\[\phi(q, k) = \exp(q^\top \mathrm{RMSNorm}(k))\]

<p>which prevents large-magnitude activations from dominating the attention distribution.</p>

<p>Conceptually, Full AttnRes turns the residual path into a learned depth mixer. A layer is no longer forced to treat all earlier layers equally; it can emphasize whichever layers are most useful.</p>
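
<p>A per-token sketch of this depth mixer (shapes and initialization are illustrative; the paper’s exact normalization and parameterization may differ in detail):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

def rms_norm(v, eps=1e-6):
    return v / np.sqrt(np.mean(v * v) + eps)

def attn_residual_input(values, w_l):
    """values: [v_0 (embedding), v_1, ..., v_{l-1}], each a d-dim vector.
    w_l: learned static query for layer l. Returns the mixed hidden state h^l."""
    scores = np.array([w_l @ rms_norm(v) for v in values])  # phi(q, k) = exp(q . RMSNorm(k))
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                                    # softmax over depth
    return sum(a * v for a, v in zip(alpha, values))

d, L = 64, 6
values = [np.random.randn(d) for _ in range(L)]             # embedding + 5 layer outputs
h_next = attn_residual_input(values, w_l=np.random.randn(d))
</code></pre></div></div>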

<h2 id="why-the-full-version-does-not-scale-cleanly">Why the Full Version Does Not Scale Cleanly</h2>

<p>From an arithmetic perspective, Full AttnRes is manageable. The paper quotes $O(L^2 d)$ work, which is acceptable for realistic layer counts. The real issue is systems cost: all previous layer outputs must remain available, so activation storage and cross-stage communication grow as $O(Ld)$. Under activation recomputation or pipeline parallelism, that becomes the bottleneck.</p>

<h2 id="block-attention-residuals">Block Attention Residuals</h2>

<p>To make the idea practical, the paper introduces <strong>Block AttnRes</strong>. The $L$ layers are partitioned into $N$ blocks of size $S$, and each block is summarized by a simple sum:</p>

\[b_n = \sum_{j \in B_n} f_j(h^j)\]

<p>A layer then attends to:</p>

<ul>
  <li>completed block summaries $b_0, b_1, \dots, b_{n-1}$</li>
  <li>the running partial sum of its current block</li>
</ul>

<p>This is the key compression step. Instead of exposing every earlier layer explicitly, the model exposes a smaller set of block summaries, which reduces memory and communication overhead from $O(Ld)$ to $O(Nd)$. The reported scaling results suggest that a small number of blocks, roughly $N \approx 8$, captures most of the gains of Full AttnRes.</p>
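
<p>A minimal sketch of the blockwise variant, reusing the same scoring form (again illustrative rather than a faithful reimplementation):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

def rms_norm(v, eps=1e-6):
    return v / np.sqrt(np.mean(v * v) + eps)

def block_attn_residual(layer_updates, block_size, w_l):
    """layer_updates: per-layer update vectors f_j(h^j) seen so far; the last
    block is the running partial sum of the current block."""
    blocks = [sum(layer_updates[s:s + block_size])
              for s in range(0, len(layer_updates), block_size)]
    scores = np.array([w_l @ rms_norm(b) for b in blocks])
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()
    return sum(a * b for a, b in zip(alpha, blocks))

d = 64
updates = [np.random.randn(d) for _ in range(13)]            # 13 layer updates so far
h = block_attn_residual(updates, block_size=4, w_l=np.random.randn(d))
</code></pre></div></div>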

<h2 id="training-and-inference-optimizations">Training and Inference Optimizations</h2>

<p>The paper also does the systems work needed to make Block AttnRes usable at scale.</p>

<p>For training, the main trick is <strong>cross-stage caching</strong> under pipeline parallelism. Rather than repeatedly sending already-seen block summaries across virtual stages, each rank caches what it has received and only transmits the incremental blocks. That avoids redundant communication and reduces the peak communication cost enough to overlap it with computation.</p>

<p>For inference, the authors use a <strong>two-phase computation</strong>:</p>

<ol>
  <li>compute inter-block attention for all layers in a block together</li>
  <li>compute intra-block attention sequentially and merge it with an online softmax</li>
</ol>

<p>Because the layer queries are static parameters, this organization lowers memory traffic substantially. The paper reports total residual-path memory I/O of about $5.5d$ reads per layer, compared with $3d$ for standard residuals and much higher cost for multi-stream alternatives such as mHC.</p>

<h2 id="residual-connections-as-a-mixing-matrix">Residual Connections as a Mixing Matrix</h2>

<p>One of the nicest parts of the paper is the matrix view. If we define a depth mixing matrix $M \in \mathbb{R}^{L \times L}$, where $M_{i \to l}$ is the weight assigned by layer $l$ to layer $i$’s output, several architectures fall into the same template:</p>

<ul>
  <li><strong>Standard residuals:</strong> an all-ones lower-triangular matrix</li>
  <li><strong>Highway networks:</strong> input-dependent scalar gates, but still low-complexity depth mixing</li>
  <li><strong>Multi-stream methods such as mHC:</strong> depth-wise linear attention with a higher-rank structured state</li>
  <li><strong>Attention residuals:</strong> full depth-wise softmax attention</li>
</ul>

<p>Under this lens, Block AttnRes smoothly interpolates between standard residuals and Full AttnRes by changing how many block summaries are exposed.</p>
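
<p>To make the template concrete, here is a toy sketch of the two extremes of that mixing matrix; the values are illustrative, not from the paper:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch

L = 6
# Standard residuals: layer l sees the unweighted sum of all earlier outputs,
# i.e. M[l, i] = 1 for every i at or below l, and 0 otherwise.
M_standard = torch.tril(torch.ones(L, L))

# Full AttnRes: each row becomes a learned softmax over the visible depth
# (random logits here, just to show the shape of the object).
logits = torch.randn(L, L)
mask = torch.tril(torch.ones(L, L)).bool()
M_attnres = torch.where(mask, logits, torch.tensor(float('-inf'))).softmax(dim=-1)
</code></pre></div></div>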

<h2 id="takeaways">Takeaways</h2>

<p>The empirical story is straightforward:</p>

<ul>
  <li><strong>PreNorm dilution is reduced.</strong> Activation magnitudes stay bounded more cleanly, and gradients are distributed more evenly over depth.</li>
  <li><strong>The model learns nontrivial skip patterns.</strong> Attention maps remain strongly local, but deeper layers sometimes jump back to very early layers or even the embedding.</li>
  <li><strong>The preferred architecture shifts.</strong> In the paper’s iso-compute and iso-parameter sweeps, AttnRes favors deeper, narrower models than standard residual designs do.</li>
</ul>

<p>That last point is especially interesting. If depth is no longer handicapped by uniform residual accumulation, then making a model deeper becomes a better tradeoff than it is in a baseline Transformer.</p>

<p>My main takeaway is that this paper treats residual connections as an architectural choice rather than a fixed law. Standard residuals hard-code a very specific depth mixing rule: every earlier layer contributes equally through simple addition. AttnRes replaces that rule with learned attention, then introduces a blockwise approximation that keeps the idea deployable at scale.</p>

<p>Whether this becomes a standard recipe will depend on implementation complexity and robustness across model families, but the framing is strong: residual paths are not just optimization scaffolding, they are a depth-wise information routing mechanism that can be redesigned.</p>]]></content><author><name>Jianyu Huang</name><email>jianyu0huang@gmail.com</email></author><category term="Residual" /><category term="Residual" /><summary type="html"><![CDATA[Paper: Attention Residuals]]></summary></entry><entry><title type="html">Reunion: A Gentle Reconciliation with Time</title><link href="https://jianyuh.github.io/life/non-tech/2026/03/14/reunion.html" rel="alternate" type="text/html" title="Reunion: A Gentle Reconciliation with Time" /><published>2026-03-14T00:00:00+00:00</published><updated>2026-03-14T00:00:00+00:00</updated><id>https://jianyuh.github.io/life/non-tech/2026/03/14/reunion</id><content type="html" xml:base="https://jianyuh.github.io/life/non-tech/2026/03/14/reunion.html"><![CDATA[<p>It sometimes feels like twenty years of life are merely the brief, beautiful intervals between a few meaningful reunions.</p>

<p>Recently, I had the chance to catch up with an old high school classmate who had flown in from overseas for a conference. We both happened to be in the same city on a business trip. Looking back, our lives have always shared a curious, almost poetic symmetry: our mothers were high school classmates, and we spent our three years of high school together as friends and fellow students. In our small hometown school, we were both dedicated to our studies, often finding ourselves at the top of our class. They were undeniably brilliant, eventually heading off to a prestigious university. I, on the other hand, faced some setbacks during my entrance exams and followed a different path that, through a series of unexpected turns, led me to another university in the same city.</p>

<p>Our rare encounters during those university years now feel like warm memories from a past life—gathering near their campus to welcome a visiting friend, or taking a long walk through a historic park. After that, our lives became like two rays of light, diverging as we pursued our own careers, seemingly never to intersect again.</p>

<p>That changed a couple of years ago when they visited my city for work. We reunited over dinner for the first time in over a decade—a moment filled with excitement and the natural, quiet curiosity that comes with time. We had both started families by then. I remember us trying to explain our respective professional fields to each other, though our industries were so different it was a bit of a challenge to grasp the details. But little did we know, life was unfolding in parallel for us once again: shortly after that meeting, my first child was born, and just a few months later, they welcomed their first child into the world.</p>

<p>Our meeting this time around felt remarkably comfortable, filled with a quiet, steady ease. Sitting together over a warm meal, we realized it had been exactly twenty years since we first started high school together. Time truly flies. They have since pivoted into a new role in the tech space, and it was wonderful to see how sharp, curious, and passionate they remain about their work.</p>

<p>An old joke we’d shared online about introducing our kids someday took on a sense of genuine warmth over our meal. After dinner, we wandered into a toy store and bought two identical little plushies—sweet tokens of our friendship to bring back to our children. Our farewell was a classic modern mishap: their phone was nearly dead, and getting them back to their hotel became a bit of an adventure. I stayed with them until we got enough of a charge to book a ride, and I happily saw them off.</p>

<p>Watching their car disappear into the evening, I couldn’t help but smile. The spirited energy of our youth has finally been gently smoothed over by twenty years of living, learning, and growing. When old friends meet, we inevitably talk of the past, but it’s the present we truly cherish. I don’t know when our paths will cross next, but it reminded me of a simple truth: all we can do is treasure the wonderful people and the beautiful moments right in front of us.</p>

<p><img src="/assets/images/reunion.png" alt="Reunion" /></p>]]></content><author><name>Jianyu Huang</name><email>jianyu0huang@gmail.com</email></author><category term="Life" /><category term="Non-tech" /><category term="Reflection" /><category term="Memoir" /><category term="Reunion" /><summary type="html"><![CDATA[It sometimes feels like twenty years of life are merely the brief, beautiful intervals between a few meaningful reunions.]]></summary></entry><entry><title type="html">Scalable Training of Mixture-of-Experts Models with Megatron Core</title><link href="https://jianyuh.github.io/moe/2026/03/11/MoE-Megatron.html" rel="alternate" type="text/html" title="Scalable Training of Mixture-of-Experts Models with Megatron Core" /><published>2026-03-11T00:00:00+00:00</published><updated>2026-03-11T00:00:00+00:00</updated><id>https://jianyuh.github.io/moe/2026/03/11/MoE-Megatron</id><content type="html" xml:base="https://jianyuh.github.io/moe/2026/03/11/MoE-Megatron.html"><![CDATA[<p>Reading the following paper:</p>
<ul>
  <li><a href="https://arxiv.org/pdf/2603.07685">Scalable Training of Mixture-of-Experts Models with Megatron Core</a></li>
</ul>

<p>This paper presents <strong>Megatron-Core MoE</strong>, the MoE training stack within NVIDIA’s Megatron-Core framework. It addresses the fundamental systems challenges of training trillion-parameter-class MoE models at high throughput on NVIDIA GPU clusters. The headline results: <strong>1,233 TFLOPS/GPU</strong> on GB300 and <strong>1,048 TFLOPS/GPU</strong> on GB200 for DeepSeek-V3-685B.</p>

<hr />

<h2 id="1-moe-fundamentals">1. MoE Fundamentals</h2>

<h3 id="architecture">Architecture</h3>

<p>Given an input token representation <strong>x</strong>, the router computes:</p>

\[\mathbf{p}(\mathbf{x}) = \text{Softmax}(\mathbf{W}_r \mathbf{x})\]

<p>The MoE layer output is:</p>

\[\text{MoE}(\mathbf{x}) = \sum_{i \in \text{TopK}(\mathbf{p}(\mathbf{x}))} p_i(\mathbf{x}) \cdot E_i(\mathbf{x})\]

<p>where $E_i$ is the $i$-th expert network. Three key advantages:</p>
<ul>
  <li><strong>Scalable capacity</strong>: model size grows independently of per-token compute</li>
  <li><strong>Computational efficiency</strong>: only $K$ of $E$ experts activate per token</li>
  <li><strong>Specialization</strong>: different experts learn different input patterns</li>
</ul>
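
<p>A minimal sketch of this routing-and-combine step in PyTorch, with illustrative shapes and names rather than Megatron-Core’s actual implementation:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch
import torch.nn.functional as F

def moe_layer(x, W_r, experts, k):
    # x: token hidden states, shape (T, d); W_r: router weights, shape (E, d)
    # experts: list of E callables, each mapping a (n, d) batch to a (n, d) batch
    p = F.softmax(x @ W_r.t(), dim=-1)        # p(x) = Softmax(W_r x), shape (T, E)
    topk_p, topk_idx = p.topk(k, dim=-1)      # Top-K expert selection per token
    out = torch.zeros_like(x)
    for e, expert in enumerate(experts):
        rows, slots = (topk_idx == e).nonzero(as_tuple=True)
        if rows.numel() == 0:
            continue                          # no tokens routed to this expert
        out[rows] += topk_p[rows, slots].unsqueeze(-1) * expert(x[rows])
    return out
</code></pre></div></div>

<p>Real implementations never loop over experts like this; the dispatch, Grouped GEMM, and combine stages described below exist precisely to turn this loop into a few large, batched operations.</p>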

<h3 id="the-parameter-compute-mismatch">The Parameter-Compute Mismatch</h3>

<p>This is the central insight of the paper. In a dense transformer with $N_\text{total}$ parameters, FLOPs per token is approximately $6N_\text{total}$, so parameters and computation scale in lockstep.</p>

<p>For MoE, per-token computation is approximately $6N_\text{active}$ where $N_\text{active} \propto K$ while $N_\text{total} \propto E$, and $K \ll E$.</p>

<p><strong>Concrete example</strong>: DeepSeek-V3 has 685B total parameters but only 37B active per token, an <strong>18x gap</strong>.</p>

<p>This creates the <strong>Three Walls</strong>:</p>

<table>
  <thead>
    <tr>
      <th>Wall</th>
      <th>Root Cause</th>
      <th>Manifestation</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Memory</strong></td>
      <td>All $E$ experts’ params/grads/optimizer states in memory, but only $K$ activate</td>
      <td>199.5 GB per GPU for DeepSeek-V3</td>
    </tr>
    <tr>
      <td><strong>Communication</strong></td>
      <td>EP requires all-to-all collectives to route tokens across GPUs</td>
      <td>20-60% of training time</td>
    </tr>
    <tr>
      <td><strong>Compute</strong></td>
      <td>Small per-expert GEMMs underutilize Tensor Cores; many kernel launches</td>
      <td>Host-boundedness</td>
    </tr>
  </tbody>
</table>

<hr />

<h2 id="2-moe-layer-architecture-four-stage-forward-pass">2. MoE Layer Architecture: Four-Stage Forward Pass</h2>

<p><strong>Stage 1: Route.</strong> A learned linear projection maps each token’s hidden state to $E$ logits: $\mathbf{l} = \mathbf{W}_r^\top \mathbf{x} \in \mathbb{R}^E$. A score function, either softmax or sigmoid as used in DeepSeek-V3, converts these into probabilities. Top-$k$ selection chooses the active experts.</p>

<p><strong>Stage 2: Dispatch.</strong> Tokens are permuted so those destined for the same expert are contiguous. Three backends are described: AllGather, All-to-All, and Flex via DeepEP/HybridEP.</p>

<p><strong>Stage 3: Expert Computation.</strong> All local experts run in a single Grouped GEMM call. Each expert is a two-layer MLP with optional gating:
\(E_i(\mathbf{x}) = \mathbf{W}_2^{(i)} \phi(\mathbf{W}_1^{(i)} \mathbf{x})\)</p>

<p><strong>Stage 4: Combine.</strong> Inverse communication returns tokens, unpermutation restores order, and shared expert output is added.</p>

<hr />

<h2 id="3-parallel-folding-and-multi-dimensional-parallelism">3. Parallel Folding and Multi-Dimensional Parallelism</h2>

<h3 id="the-dense-sparse-mismatch">The Dense-Sparse Mismatch</h3>

<p>A single Transformer block contains two fundamentally different computation patterns:</p>

<table>
  <thead>
    <tr>
      <th>Aspect</th>
      <th>Attention (Dense)</th>
      <th>MoE (Sparse)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>TP</strong></td>
      <td>Large QKV matrices benefit from high TP</td>
      <td>Small per-expert dims make high TP counterproductive</td>
    </tr>
    <tr>
      <td><strong>CP</strong></td>
      <td>Long sequences benefit from high CP</td>
      <td>No sequence dependency; CP is irrelevant</td>
    </tr>
    <tr>
      <td><strong>EP</strong></td>
      <td>Not applicable</td>
      <td>Essential for distributing experts</td>
    </tr>
  </tbody>
</table>

<p>Prior frameworks forced <code class="language-plaintext highlighter-rouge">World Size = TP x CP x PP x DP</code>, where <code class="language-plaintext highlighter-rouge">EP &lt;= DP</code>. This creates three problems:</p>
<ol>
  <li><strong>Multiplicative GPU requirements</strong>: EP=8 forces DP&gt;=8, and with CP=8 the minimum becomes 64 GPUs.</li>
  <li><strong>Forced suboptimal parallelism</strong>: high TP fragments small experts, while low TP underparallelizes attention.</li>
  <li><strong>Cross-node communication</strong>: EP constrained within DP forces all-to-all across slow interconnects.</li>
</ol>

<h3 id="parallel-folding-solution">Parallel Folding Solution</h3>

<p><strong>Core idea</strong>: decouple attention and MoE parallelism mappings.</p>

<ul>
  <li><strong>Attention layers</strong> form groups over <code class="language-plaintext highlighter-rouge">TP x CP x DP x PP</code></li>
  <li><strong>MoE layers</strong> form groups over <code class="language-plaintext highlighter-rouge">ETP x EP x EDP x PP</code></li>
  <li><strong>Only constraint</strong>: PP must remain consistent</li>
</ul>

<p>Key benefits:</p>
<ol>
  <li><strong>Breaks EP &lt;= DP</strong>: EP can fold across TP x CP groups. Example: attention TP=4, CP=2, DP=8, PP=4 on 256 GPUs. Traditionally EP&lt;=8; with folding EP=64 becomes possible.</li>
  <li><strong>Reduces minimum GPUs</strong>: CP=8 and EP=8 traditionally requires 64 GPUs; with folding, only 8.</li>
  <li><strong>Independent optimization</strong>: attention uses high TP, while MoE uses ETP=1 for full expert width.</li>
  <li><strong>NVLink locality</strong>: both CP and EP all-to-all stay within the NVLink domain.</li>
</ol>

<h3 id="gradient-handling">Gradient Handling</h3>

<p>Expert gradients are scaled by <code class="language-plaintext highlighter-rouge">edp_size / dp_size</code> to account for the different effective batch sizes seen by experts versus dense layers.</p>

<hr />

<h2 id="4-breaking-the-memory-wall">4. Breaking the Memory Wall</h2>

<h3 id="memory-anatomy-deepseek-v3-pp4-x-vpp4-x-ep64-256-gpus">Memory Anatomy (DeepSeek-V3, PP4 x VPP4 x EP64, 256 GPUs)</h3>

<table>
  <thead>
    <tr>
      <th>Component</th>
      <th>Memory/GPU</th>
      <th>Optimization</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Weights &amp; Gradients</td>
      <td>36.4 GB</td>
      <td>PP, EP, TP sharding</td>
    </tr>
    <tr>
      <td>Main Weights &amp; Optimizer States</td>
      <td>32.1 GB</td>
      <td>Distributed optimizer, BF16 moments</td>
    </tr>
    <tr>
      <td>Activations</td>
      <td><strong>131.0 GB</strong></td>
      <td>Low precision, recomputation, offloading</td>
    </tr>
    <tr>
      <td><strong>Total</strong></td>
      <td><strong>199.5 GB</strong></td>
      <td> </td>
    </tr>
  </tbody>
</table>

<p><strong>Key insight</strong>: activations dominate, exceeding weights and optimizer states combined.</p>

<h3 id="memory-efficient-permutation-zero-overhead">Memory-Efficient Permutation (Zero Overhead)</h3>

<p>Standard formulation applies routing weights <strong>after</strong> expert computation:
\(y = \sum_{i \in \mathcal{T}(\mathbf{x})} p_i \cdot \mathbf{W}_2^{(i)} \phi(\mathbf{W}_1^{(i)} \mathbf{x})\)</p>

<p>Memory-efficient version absorbs $p_i$ <strong>before</strong> the second linear layer:
\(y = \sum_{i \in \mathcal{T}(\mathbf{x})} \mathbf{W}_2^{(i)} \left(p_i \cdot \phi(\mathbf{W}_1^{(i)} \mathbf{x})\right)\)</p>

<p>Since $\mathbf{W}_2^{(i)}$ is a pure linear map with no bias, scalar multiplication commutes:
\(p_i \cdot \mathbf{W}_2^{(i)} \mathbf{h} = \mathbf{W}_2^{(i)} (p_i \cdot \mathbf{h})\)</p>

<p><strong>Why this saves memory</strong>: in the standard version, computing $\partial\mathcal{L}/\partial p_i$ requires retaining each expert output $E_i(\mathbf{x})$. In the efficient version, $p_i$ multiplies $\phi(\mathbf{z}_i)$ directly, so $\partial\mathcal{L}/\partial p_i$ only depends on $\phi(\mathbf{z}_i)$, which can be recomputed from $\mathbf{z}_i = \mathbf{W}_1^{(i)} \mathbf{x}$ already saved for SwiGLU backward. This saves roughly <strong>26.3 GB per GPU</strong> for DeepSeek-V3 with essentially zero extra compute.</p>
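
<p>The commutation is easy to check numerically. A tiny single-expert sketch with illustrative names:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch

torch.manual_seed(0)
d, h = 8, 16
x = torch.randn(d)
W1, W2 = torch.randn(h, d), torch.randn(d, h)
p = torch.rand(())                       # routing probability for this expert
phi = torch.nn.functional.silu           # any elementwise activation

standard  = p * (W2 @ phi(W1 @ x))       # scale after the expert output: must keep E_i(x)
efficient = W2 @ (p * phi(W1 @ x))       # scale before W2: only phi(W1 x) is needed,
                                         # and it can be recomputed from the saved W1 x
assert torch.allclose(standard, efficient, atol=1e-5)
</code></pre></div></div>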

<h3 id="fp8fp4-activations">FP8/FP4 Activations</h3>

<p>Linear layer inputs stored in FP8 instead of BF16 reduce memory by 50% per tensor. For DeepSeek-V3, that is roughly <strong>16 GB saved</strong>. FP4 pushes this further to a 75% reduction.</p>

<h3 id="fine-grained-recomputation">Fine-Grained Recomputation</h3>

<p>Two composable techniques:</p>
<ol>
  <li><strong>Granular recomputation</strong>: selectively recompute only specific operations such as activation functions, LayerNorm, and MLA up-projection, typically with under 5% compute overhead.</li>
  <li><strong>Output-discarding recomputation</strong>: release checkpointed module outputs immediately after downstream consumption and restore them via recomputation during backward.</li>
</ol>

<table>
  <thead>
    <tr>
      <th>Recomputation Target</th>
      <th>Memory Saved/GPU</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>MLA Up-Projection</td>
      <td>30.4 GB</td>
    </tr>
    <tr>
      <td>SwiGLU Activation</td>
      <td>3.8 GB</td>
    </tr>
    <tr>
      <td>LayerNorm</td>
      <td>8.2 GB</td>
    </tr>
    <tr>
      <td><strong>Total</strong></td>
      <td><strong>42.4 GB</strong></td>
    </tr>
  </tbody>
</table>

<p><strong>Critical insight</strong>: full-layer recomputation of MoE is especially expensive because it re-triggers EP all-to-all communication. Fine-grained recomputation avoids that penalty.</p>

<h3 id="fine-grained-activation-offloading">Fine-Grained Activation Offloading</h3>

<p><strong>Forward</strong>: input activations are offloaded to CPU through a dedicated D2H stream, overlapping with the next module’s computation.</p>

<p><strong>Backward</strong>: Layer-Staggered Reload reloads the same module type from the <strong>next</strong> layer while gradients are computed for the current layer. Only one activation per module type resides on GPU at a time.</p>

<p><strong>Peak memory advantage over full recomputation</strong>:</p>
<ul>
  <li>Full recomputation: $L \times \text{layer_input} + 1 \times \text{layer_intermediate}$</li>
  <li>Offloading: $1 \times \text{layer_input} + 1 \times \text{layer_intermediate}$</li>
</ul>

<p>Results: <strong>10-18% memory reduction</strong> with only <strong>1.6-2% throughput overhead</strong>. For Qwen3-235B, offloading enabled a lower TP degree and about <strong>15% throughput improvement</strong>.</p>

<h3 id="precision-aware-optimizer">Precision-Aware Optimizer</h3>

<p>Adam stores first and second moments. The optimization here is to store moments in BF16 or FP8, then cast to FP32 inside TransformerEngine’s FusedAdam kernel for the actual update.</p>

<p>Memory per parameter per DP rank decreases from $6 + 12/d$ bytes to $6 + 8/d$ bytes, saving on the order of <strong>10-12 GB</strong> from the 32.1 GB optimizer-state budget.</p>
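
<p>A rough sketch of the idea, storing Adam moments in BF16 and upcasting only inside the update. This is an illustration of the concept, not TransformerEngine’s FusedAdam, and bias correction is omitted:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch

def adam_step_bf16_moments(p, grad, m_bf16, v_bf16, lr=1e-3, b1=0.9, b2=0.95, eps=1e-8):
    # p, grad: FP32 master weight and gradient; m_bf16, v_bf16: moments stored in BF16
    m = m_bf16.float().mul_(b1).add_(grad, alpha=1 - b1)           # upcast, update in FP32
    v = v_bf16.float().mul_(b2).addcmul_(grad, grad, value=1 - b2)
    p.sub_(lr * m / (v.sqrt() + eps))                              # FP32 parameter update
    m_bf16.copy_(m)                                                # store moments back in BF16
    v_bf16.copy_(v)
</code></pre></div></div>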

<h3 id="fsdp-for-moe">FSDP for MoE</h3>

<p><strong>Dual DeviceMesh design</strong>: dense layers shard across the full DP group, while expert layers shard across the EDP group.</p>

<p>Two key optimizations:</p>
<ol>
  <li><strong>Non-uniform sharding</strong>: flatten and concatenate module parameters, then shard non-uniformly so shard boundaries align with communication buffers for zero-copy collectives.</li>
  <li><strong>Persistent double buffers with NCCL User Buffer Registration</strong>: pre-allocate two persistent buffers and cycle between them. This reduces SM footprint from 8-32 SMs to 1-4 SMs.</li>
</ol>

<hr />

<h2 id="5-breaking-the-communication-wall">5. Breaking the Communication Wall</h2>

<h3 id="communication-anatomy">Communication Anatomy</h3>

<p>For DeepSeek-V3: 58 MoE layers times 2 operations per layer gives <strong>116 dispatch/combine operations per forward pass</strong>. Backward doubles this. At 50 GB/s inter-node bandwidth, a single 200 MB dispatch already costs milliseconds, which compounds rapidly over an iteration.</p>
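
<p>A back-of-the-envelope version of that arithmetic, with purely illustrative numbers:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Crude cost model for fully exposed (non-overlapped) EP all-to-all
inter_node_bw = 50e9                       # bytes/s across the slow inter-node link
dispatch_bytes = 200e6                     # one ~200 MB dispatch
t_one = dispatch_bytes / inter_node_bw     # 4 ms for a single dispatch
t_fwd = 116 * t_one                        # 116 dispatch/combine ops per forward pass
print(f"{t_one * 1e3:.1f} ms each, {t_fwd * 1e3:.0f} ms per forward pass if nothing overlaps")
</code></pre></div></div>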

<h3 id="hybridep">HybridEP</h3>

<p>Developed by NVIDIA for NVLink-rich topologies such as NVL72.</p>

<p><strong>Dispatch</strong>: reads data from global memory into shared memory based on routing info, then writes to destinations via FIFO queues. For inter-node traffic, GPUs with the same local index across nodes exchange first, then forward within the node.</p>

<p><strong>Combine</strong>: fuses reduction into the communication kernel itself. Cross-node data is reduced first, then the remaining intra-node reduction completes locally.</p>

<p>Performance on GB200 with EP64:</p>
<ul>
  <li>HybridEP dispatch: <strong>675 us</strong> vs all-to-all <strong>930 us</strong></li>
  <li>HybridEP combine: <strong>744 us</strong> vs all-to-all <strong>827 us</strong></li>
</ul>

<h3 id="ep-communication-overlapping">EP Communication Overlapping</h3>

<p><strong>1F1B forward-backward overlap</strong> merges the forward pass of one microbatch with the backward pass of another.</p>

<p>Two key optimizations:</p>
<ol>
  <li><strong>Stream Separation</strong>: compute stream and communication stream run in parallel.</li>
  <li><strong>W/D Split</strong>: backward MLP is split into weight gradient (<code class="language-plaintext highlighter-rouge">W/mlp</code>) and data gradient (<code class="language-plaintext highlighter-rouge">D/mlp</code>). Only <code class="language-plaintext highlighter-rouge">D/mlp</code> depends on backward dispatch, which opens more room to hide communication.</li>
</ol>

<p>Result: EP communication overhead drops from <strong>30-40% to under 5%</strong> of iteration time for DeepSeek-V3 on H100.</p>

<hr />

<h2 id="6-breaking-the-compute-efficiency-wall">6. Breaking the Compute Efficiency Wall</h2>

<h3 id="grouped-gemm">Grouped GEMM</h3>

<p>DeepSeek-V3’s 256 small experts produce GEMMs with M dimensions around 128 tokens per expert, far below the regime needed for peak Tensor Core efficiency.</p>

<p>Four implementations are discussed:</p>
<ol>
  <li><strong>Multi-stream cuBLASLt GEMMs</strong>: individual GEMMs launched into multiple CUDA streams</li>
  <li><strong>CUTLASS Grouped GEMM</strong>: a fused single-kernel path</li>
  <li><strong>cuBLASLt Grouped GEMM</strong> (device-initiated): reads shapes from device memory, making it CUDA-Graph-compatible</li>
  <li><strong>cuteDSL Grouped GEMM</strong>: fuses SwiGLU activation and FP8 quantization into the GEMM epilogue</li>
</ol>
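
<p>For reference, the first (multi-stream) variant is easy to approximate in plain PyTorch; the fused Grouped GEMM paths replace this whole loop with a single kernel launch. This is a sketch of the concept only, not the Megatron-Core code, and it assumes a CUDA device is available:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch

def grouped_gemm_multistream(token_groups, expert_weights, num_streams=4):
    # token_groups[i]: (m_i, k) tokens routed to expert i; expert_weights[i]: (k, n)
    streams = [torch.cuda.Stream() for _ in range(num_streams)]
    outputs = [None] * len(token_groups)
    for i, (a, w) in enumerate(zip(token_groups, expert_weights)):
        with torch.cuda.stream(streams[i % num_streams]):
            outputs[i] = a @ w             # one small GEMM per expert
    torch.cuda.synchronize()               # join all streams before using the results
    return outputs
</code></pre></div></div>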

<h3 id="permutation-fusion">Permutation Fusion</h3>

<p>Three-stage pipeline:</p>
<ul>
  <li>Preprocessing: generate the Row ID map once</li>
  <li>Permute: move tokens according to offset maps</li>
  <li>Unpermute: inverse permutation plus FP32 accumulation</li>
</ul>

<h3 id="router-and-aux-loss-fusion">Router and Aux-Loss Fusion</h3>

<p>Three fused kernels are described:</p>
<ul>
  <li>score computation with top-$k$ and softmax/sigmoid</li>
  <li>score computation for auxiliary loss</li>
  <li>auxiliary loss computation</li>
</ul>

<h3 id="cuda-graphs">CUDA Graphs</h3>

<p><strong>Full vs Partial</strong>: Full CUDA Graphs capture the entire forward-backward pass, but only for drop-and-pad MoE. <strong>Partial CUDA Graphs</strong> capture only the static components:</p>
<ul>
  <li><strong>Graphable</strong>: attention, router, EP preprocessing, shared experts, dense MLP</li>
  <li><strong>Not graphable</strong>: token dispatch, expert GEMM with dynamic M dimension, token combine</li>
</ul>

<p>Memory optimizations:</p>
<ul>
  <li><strong>Graph count reduction</strong>: without PP, microbatches share graphs (<code class="language-plaintext highlighter-rouge">L x 2</code>); with PP, each microbatch needs its own (<code class="language-plaintext highlighter-rouge">L x M x 2</code>). An <code class="language-plaintext highlighter-rouge">is_first_microbatch</code> GPU flag controls microbatch-specific behavior.</li>
  <li><strong>Pool sharing</strong>: all graphs share one pool when captured in execution order.</li>
  <li><strong>Buffer reuse</strong>: static I/O buffers are reused across graphs per PP execution order.</li>
</ul>

<p>Result: about <strong>10% end-to-end speedup</strong> with roughly <strong>7 GB</strong> extra memory on DeepSeek-V3 GB200.</p>

<h3 id="full-cuda-graphs-for-dropless-moe-three-complementary-techniques">Full CUDA Graphs for Dropless MoE: Three Complementary Techniques</h3>

<p><strong>Challenge 1</strong>: kernel launch without knowing problem size.
Solution: <strong>device-initiated Grouped GEMM</strong>, where cuBLASLt reads shapes directly from device memory. The cuteDSL path can also fuse SwiGLU and quantization into the epilogue.</p>

<p><strong>Challenge 2</strong>: memory allocation without knowing actual size. Without mitigation, worst-case buffers waste $O(\text{EP_size})$ more memory.</p>

<h4 id="echo-elastic-cloning-for-hot-experts">ECHO (Elastic Cloning for Hot Experts)</h4>

<p>Popular experts can receive far more tokens than others. ECHO dynamically clones hot experts onto spare slots on underutilized ranks.</p>

<p><strong>Forward</strong>: the ECHO planner identifies hot experts, generates the hot expert map and updated routing map, copies weights to spare slots through HybridEP, and routes tokens to both the home and cloned experts.</p>

<p><strong>Backward</strong>: expert-gradient dispatch collects gradients from cloned experts back to the home experts.</p>

<h4 id="paged-stashing">Paged Stashing</h4>

<p>Decouples worst-case computation buffers from activation storage:</p>
<ul>
  <li><strong>Single tmp buffer</strong> sized for worst case and shared across all layers</li>
  <li><strong>Paged stashing buffer</strong> stores only the actual tokens used per layer</li>
</ul>

<p>This reduces memory from $O(\text{layers} \times \text{worst_case})$ to $O(\text{worst_case} + \text{actual_total})$.</p>

<p>Implementation detail: <code class="language-plaintext highlighter-rouge">PagedStashBuffer</code> uses 64 tokens per page with a circular-buffer free list. Stash and reload kernels are device-initiated and overlap with computation via dedicated Pack and Unpack CUDA streams.</p>

<hr />

<h2 id="7-reduced-precision-training-fp8fp4">7. Reduced-Precision Training (FP8/FP4)</h2>

<h3 id="strategy-selective-precision">Strategy: Selective Precision</h3>

<p>Three principles:</p>
<ol>
  <li><strong>Protect routing</strong>: keep the router in FP32</li>
  <li><strong>Preserve key components</strong>: embeddings, output layers, main gradients, master weights, and optimizer states stay high precision</li>
  <li><strong>Quantize bulk computation</strong>: expert GEMMs and activations go low precision</li>
</ol>

<h3 id="fp8-recipes">FP8 Recipes</h3>

<p><strong>Per-Tensor FP8</strong> (Hopper and Blackwell): one scale per tensor. Two variants are discussed:</p>
<ul>
  <li><strong>Delayed scaling</strong>: uses historical <code class="language-plaintext highlighter-rouge">amax</code>, but is not the recommended path</li>
  <li><strong>Current/live scaling</strong>: computes scale just in time</li>
</ul>

<p>The hybrid format uses E4M3 for inputs and weights and E5M2 for gradients.</p>

<p><strong>Blockwise FP8</strong> (recommended on Hopper): uses E4M3 for all tensors, with activations and gradients quantized in <code class="language-plaintext highlighter-rouge">1 x 128</code> tiles and weights in <code class="language-plaintext highlighter-rouge">128 x 128</code> blocks. The paper notes this recipe has already been proven at scale on models such as DeepSeek-V3, Minimax-M2, and Ant Ling-2.0.</p>
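
<p>A rough sketch of the blockwise scaling geometry (scales are simulated in FP32 here; real kernels store the values in E4M3 and use TransformerEngine’s fused quantization paths):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch

FP8_E4M3_MAX = 448.0

def quantize_1x128(x):
    # Activations/gradients: one scale per 1 x 128 tile along the last dimension
    t, d = x.shape                                     # assumes d divisible by 128
    tiles = x.view(t, d // 128, 128)
    scale = tiles.abs().amax(dim=-1, keepdim=True) / FP8_E4M3_MAX
    return tiles / scale, scale                        # values would be cast to E4M3 here

def quantize_128x128(w):
    # Weights: one scale per 128 x 128 block
    o, i = w.shape                                     # assumes both dims divisible by 128
    blocks = w.view(o // 128, 128, i // 128, 128)
    scale = blocks.abs().amax(dim=(1, 3), keepdim=True) / FP8_E4M3_MAX
    return blocks / scale, scale
</code></pre></div></div>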

<p><strong>MXFP8</strong> (recommended on Blackwell): uses <code class="language-plaintext highlighter-rouge">1 x 32</code> element granularity with native Blackwell Tensor Core support and hardware-accelerated scaling. The important caveat is that parameter AllGather still communicates in BF16 because MXFP8 uses different quantization directions for forward and backward.</p>

<h3 id="nvfp4">NVFP4</h3>

<p>Uses E2M1 with <strong>two-level microscaling</strong>:</p>
<ul>
  <li><strong>Per-tensor FP32 scale</strong>: remaps the overall distribution into a range compatible with block scaling</li>
  <li><strong>Per-block E4M3 scale</strong>: blocks of 16 elements then map into FP4 range</li>
</ul>
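
<p>A simplified sketch of the two-level scale computation; the FP4 cast itself and the Blackwell-specific layouts are omitted, and the per-tensor scale formula is my reading of the recipe rather than verified kernel code:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch

FP4_E2M1_MAX = 6.0          # largest representable E2M1 magnitude
FP8_E4M3_MAX = 448.0

def nvfp4_scales(x, block=16):
    # Level 1: one FP32 scale for the whole tensor, chosen so that the
    # per-block scales themselves land inside the E4M3 range.
    tensor_scale = x.abs().amax() / (FP4_E2M1_MAX * FP8_E4M3_MAX)
    # Level 2: one E4M3 scale per block of 16 elements, relative to level 1.
    blocks = (x / tensor_scale).reshape(-1, block)
    block_scale = blocks.abs().amax(dim=-1, keepdim=True) / FP4_E2M1_MAX
    return tensor_scale, block_scale    # elements then map into the FP4 range
</code></pre></div></div>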

<p>Three critical algorithmic additions are required for stable training:</p>
<ol>
  <li><strong>Random Hadamard Transforms (RHT)</strong>: applied to weight-gradient computation to reduce outlier impact</li>
  <li><strong>2D scaling</strong>: <code class="language-plaintext highlighter-rouge">16 x 16</code> weight-block scaling keeps forward and backward consistent</li>
  <li><strong>Stochastic rounding</strong>: applied on gradients to reduce rounding bias during FP4 conversion</li>
</ol>

<h3 id="fp8fp4-primary-weights">FP8/FP4 Primary Weights</h3>

<p>The framework eliminates a redundant BF16 copy by casting directly from FP32 to FP8 or FP4. The distributed-optimizer quantization flow is:</p>
<ol>
  <li>Get local abs-max from the master weights</li>
  <li>AllReduce to get global abs-max</li>
  <li>Use global abs-max plus master weights for the partial cast</li>
</ol>

<p>For the blockwise recipe, specialized kernels aware of the 2D weight layout compute abs-max over <code class="language-plaintext highlighter-rouge">128 x 128</code> blocks.</p>

<h3 id="moe-specific-fp8fp4-challenges">MoE-Specific FP8/FP4 Challenges</h3>

<p><strong>Padding alignment</strong>: FP8 GEMMs require alignment to 16 for per-tensor and blockwise FP8, or 32 for MXFP8 and NVFP4. Because token dimension varies dynamically, the paper describes two solutions: routing-map padding and fusing padding directly into permutation.</p>

<p><strong>Grouped quantization</strong>: multiple expert input tensors are quantized in a single fused kernel, and the implementation is CUDA-Graphable.</p>

<p><strong>NVFP4 quantization fusion</strong>: RHT is fused with quantization to avoid extra BF16 traffic. Hadamard is still computed twice, once for <code class="language-plaintext highlighter-rouge">amax</code> and once inside the fused quantization path, but this remains faster than materializing a full BF16 buffer. Stochastic rounding uses <code class="language-plaintext highlighter-rouge">cuRANDDx</code> for in-kernel random-number generation.</p>

<hr />

<h2 id="8-long-context-moe-training">8. Long-Context MoE Training</h2>

<h3 id="the-computational-shift">The Computational Shift</h3>

<p>At 64K tokens, <strong>SDPA consumes 69% of FLOPs</strong>, versus roughly 10-15% at short sequence lengths. SDPA scales as $O(s^2)$ while MoE scales as $O(s)$, so attention becomes the dominant cost.</p>

<p><strong>Key recommendation</strong>: do not recompute core attention at long sequence lengths. At 64K, SDPA recomputation adds about <strong>18% compute overhead</strong> but saves only <strong>9 GB</strong> of memory. Recomputing non-SDPA components instead saves <strong>89.8 GB</strong> with lower performance impact.</p>

<h3 id="cp-vs-tp-trade-offs">CP vs TP Trade-offs</h3>

<ul>
  <li><strong>P2P CP</strong>: preferred across nodes because ring-style KV exchange overlaps naturally with SDPA</li>
  <li><strong>All-to-all CP</strong>: converts sequence-sharded layouts to head-sharded layouts before SDPA</li>
  <li><strong>TP</strong>: preferred within nodes for sharding linear weights</li>
</ul>

<p>Practical guideline: <strong>all-to-all CP + TP inside nodes; P2P CP across nodes</strong>.</p>

<h3 id="dynamic-context-parallelism">Dynamic Context Parallelism</h3>

<p>For variable-length sequences, the system selects CP degree per microbatch based on actual sequence lengths. Multiple CP groups are pre-constructed per rank during initialization. The per-token loss is:</p>

\[\mathcal{L} = \frac{\sum_{t \in \mathcal{V}} \ell_t}{|\mathcal{V}|}\]

<p>where $\mathcal{V}$ is the set of valid non-padding tokens.</p>
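
<p>In code this is just masking out padding tokens before averaging (a trivial sketch):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch

def per_token_loss(token_losses, valid_mask):
    # token_losses: (T,) per-token losses; valid_mask: (T,) bool, False for padding
    return token_losses[valid_mask].sum() / valid_mask.sum()
</code></pre></div></div>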

<hr />

<h2 id="9-production-features">9. Production Features</h2>

<h3 id="load-balancing-strategies">Load Balancing Strategies</h3>

<p>Three approaches are discussed:</p>
<ul>
  <li><strong>Auxiliary loss</strong>: gradient-based, differentiable, soft balance</li>
  <li><strong>Sinkhorn</strong>: assignment-based, non-differentiable, hard balance; iterates row and column normalization to convergence</li>
  <li><strong>Aux-loss-free / Expert Bias</strong>: feedback-based, non-differentiable, adaptive; updates expert bias based on token-count feedback</li>
</ul>

<h3 id="latent-moe">Latent MoE</h3>

<p>Latent MoE inserts a shared down-projection before expert dispatch and an up-projection after combine:</p>

\[\text{output}(\mathbf{x}) = \mathbf{W}_\uparrow \cdot \left(\sum_{i \in \mathcal{T}_{K,E}} p_i E_i(\mathbf{W}_\downarrow \cdot \mathbf{x}; \ell)\right) + \sum_j E_j^\text{shared}(\mathbf{x}; d)\]

<p>The compression ratio $\alpha = d / \ell$ reduces both all-to-all volume and per-expert weight size by a factor of $\alpha$.</p>
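
<p>A minimal sketch of the latent-MoE wrapper around an existing MoE layer; module names are illustrative, not Megatron-Core’s implementation:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch

class LatentMoE(torch.nn.Module):
    def __init__(self, d, latent, moe_layer, shared_expert):
        super().__init__()
        self.down = torch.nn.Linear(d, latent, bias=False)   # shared down-projection before dispatch
        self.up = torch.nn.Linear(latent, d, bias=False)     # shared up-projection after combine
        self.moe = moe_layer                                  # routed experts run at width `latent`
        self.shared = shared_expert                           # shared experts stay at width d

    def forward(self, x):
        # All-to-all volume and per-expert weight size shrink by alpha = d / latent
        return self.up(self.moe(self.down(x))) + self.shared(x)
</code></pre></div></div>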

<h3 id="flexible-asymmetric-vpp">Flexible Asymmetric VPP</h3>

<p>Allows different numbers and types of layers per virtual pipeline stage. For DeepSeek-V3 with 61 decoder layers plus 1 MTP layer, <code class="language-plaintext highlighter-rouge">PP=16</code>, and <code class="language-plaintext highlighter-rouge">VPP=2</code>, the first stage holds the embedding plus 3 dense decoder layers to match the cost of 2 MoE layers, while the MTP layer sits in its own standalone stage and the loss is separated.</p>

<h3 id="upcycling">Upcycling</h3>

<p>Converts a dense checkpoint into MoE via virtual-group initialization: shard MLP weights in the intermediate dimension (<code class="language-plaintext highlighter-rouge">4h -&gt; 2h</code>), duplicate the shards, then initialize half the router weights and duplicate them. This guarantees Top-2 routing initially selects one expert from each shard pair, so the MoE output exactly matches the dense model at the start of training.</p>
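
<p>The weight-sharding half of this construction is easy to verify numerically: splitting a dense two-layer MLP along its intermediate dimension yields two “experts” whose unweighted sum reproduces the dense output exactly. The sketch below checks that identity under the simplifying assumption of unit routing weights; the router initialization described above is what makes Top-2 routing actually select one expert from each shard pair:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch

torch.manual_seed(0)
h = 32
x = torch.randn(h)
W1 = torch.randn(4 * h, h)            # dense up-projection  (4h x h)
W2 = torch.randn(h, 4 * h)            # dense down-projection (h x 4h)
phi = torch.nn.functional.gelu

dense = W2 @ phi(W1 @ x)

# Shard the intermediate dimension 4h into two halves of 2h and form two experts
W1_a, W1_b = W1[:2 * h], W1[2 * h:]
W2_a, W2_b = W2[:, :2 * h], W2[:, 2 * h:]
expert_sum = W2_a @ phi(W1_a @ x) + W2_b @ phi(W1_b @ x)

assert torch.allclose(dense, expert_sum, atol=1e-5)
</code></pre></div></div>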

<h3 id="multi-token-prediction-mtp">Multi-Token Prediction (MTP)</h3>

<p>MTP optimizes the model to predict multiple consecutive future tokens at each position, densifying the supervision signal. Unlike parallel independent predictions, it preserves <strong>causal dependencies</strong> between predictions through hidden-state transitions, which improves convergence and generation quality. During inference, the model falls back to ordinary single-token prediction for compatibility.</p>

<p>Flexible pipeline parallelism allows MTP layers to be placed strategically within the VPP layout. In the DeepSeek-V3 example with <code class="language-plaintext highlighter-rouge">PP=16</code> and <code class="language-plaintext highlighter-rouge">VPP=2</code>, the MTP layer is placed in a standalone pipeline stage on PP rank 14 to balance the workload.</p>

<h3 id="muon-optimizer">Muon Optimizer</h3>

<p>Unlike AdamW, which performs element-wise updates, Muon applies a <strong>matrix-aware</strong> optimization by orthogonalizing entire weight matrices. The production integration provides:</p>
<ol>
  <li><strong>Split QKV support</strong>: efficient orthogonalization even when attention projection matrices are stored as separate Q, K, and V tensors</li>
  <li><strong>Distributed optimizer integration</strong>: optimizer states are sharded across data-parallel ranks while preserving correct orthogonalization semantics</li>
  <li><strong>CPU offloading</strong>: Muon’s orthogonalization buffers can be offloaded when GPU memory is tight</li>
</ol>

<p><strong>MuonClip</strong> addresses a separate stability problem in trillion-parameter training, where query-key dot products can grow without bound and trigger attention explosions. The paper notes hardware-accelerated implementations in cuDNN, <code class="language-plaintext highlighter-rouge">cudnn-frontend</code>, and Transformer Engine.</p>

<hr />

<h2 id="10-performance-results">10. Performance Results</h2>

<table>
  <thead>
    <tr>
      <th>Model</th>
      <th>System</th>
      <th>#GPUs</th>
      <th>Dtype</th>
      <th>TFLOPS/GPU</th>
      <th>Tokens/s/GPU</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>DeepSeek-V3</td>
      <td>GB300</td>
      <td>256</td>
      <td>MXFP8</td>
      <td><strong>1,233</strong></td>
      <td>4,730</td>
    </tr>
    <tr>
      <td>DeepSeek-V3</td>
      <td>GB200</td>
      <td>256</td>
      <td>MXFP8</td>
      <td><strong>1,048</strong></td>
      <td>4,020</td>
    </tr>
    <tr>
      <td>DeepSeek-V3</td>
      <td>GB200</td>
      <td>256</td>
      <td>BF16</td>
      <td>857</td>
      <td>3,298</td>
    </tr>
    <tr>
      <td>DeepSeek-V3</td>
      <td>H100</td>
      <td>1,024</td>
      <td>FP8-BLK</td>
      <td>368</td>
      <td>1,412</td>
    </tr>
    <tr>
      <td>Qwen3-235B</td>
      <td>GB300</td>
      <td>256</td>
      <td>MXFP8</td>
      <td><strong>974</strong></td>
      <td>6,583</td>
    </tr>
    <tr>
      <td>Qwen3-235B</td>
      <td>GB200</td>
      <td>256</td>
      <td>MXFP8</td>
      <td>919</td>
      <td>6,212</td>
    </tr>
    <tr>
      <td>Qwen3-235B</td>
      <td>GB300</td>
      <td>128</td>
      <td>MXFP8 (131K seq)</td>
      <td>1,150</td>
      <td>1,556</td>
    </tr>
  </tbody>
</table>

<p>GB200 and GB300 deliver roughly <strong>3x higher token throughput</strong> than H100. Long-context DeepSeek-V3 at 256K tokens still reaches <strong>88% of short-context MFU</strong>.</p>

<hr />

<h2 id="11-case-study-deepseek-v3-gb200-vs-h100">11. Case Study: DeepSeek-V3 GB200 vs H100</h2>

<table>
  <thead>
    <tr>
      <th>Config</th>
      <th>GB200 (256 GPUs)</th>
      <th>H100 (1,024 GPUs)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>TP/PP/EP</strong></td>
      <td>1/4/64</td>
      <td>2/8/64</td>
    </tr>
    <tr>
      <td><strong>Precision</strong></td>
      <td>MXFP8</td>
      <td>FP8-Blockwise</td>
    </tr>
    <tr>
      <td><strong>Dispatcher</strong></td>
      <td>HybridEP</td>
      <td>DeepEP</td>
    </tr>
    <tr>
      <td><strong>Recompute</strong></td>
      <td>mlp only</td>
      <td>mlp, mla_up_proj, moe_act, layernorm</td>
    </tr>
    <tr>
      <td><strong>CUDA Graphs</strong></td>
      <td>Enabled</td>
      <td>-</td>
    </tr>
    <tr>
      <td><strong>EP Overlap</strong></td>
      <td>-</td>
      <td>Enabled</td>
    </tr>
    <tr>
      <td><strong>Performance</strong></td>
      <td>1,048 TFLOPS/GPU</td>
      <td>368 TFLOPS/GPU</td>
    </tr>
  </tbody>
</table>

<p><strong>Key insight</strong>: the same model requires fundamentally different strategies on different hardware. GB200’s 192 GB memory and NVL72 topology largely eliminate the communication wall, shifting the bottleneck toward CPU overhead and making CUDA Graphs essential. H100’s 80 GB memory and NVL8 topology instead make cross-node communication the main bottleneck, so EP overlap becomes essential, and FP8 is what frees enough memory to hold the extra overlap buffers.</p>

<hr />

<h2 id="12-systematic-optimization-workflow">12. Systematic Optimization Workflow</h2>

<p>A three-phase workflow emerges from tuning Mixtral, DeepSeek-V3, and Qwen3 across GB200 and H100. The process is inherently <strong>iterative</strong>: solving one bottleneck often exposes the next.</p>

<h3 id="phase-1-establish-memory-feasible-parallelism">Phase 1: Establish Memory-Feasible Parallelism</h3>

<p>Memory feasibility is the first hard constraint. The paper explicitly frames the impact of each parallelism strategy on per-GPU memory:</p>

<table>
  <thead>
    <tr>
      <th>Strategy</th>
      <th>Peak Activation</th>
      <th>Weight Memory</th>
      <th>Optimizer States</th>
      <th>Comm (Per-Layer)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>TP</td>
      <td>$1/d$ (with SP)</td>
      <td>$1/d$</td>
      <td>$1/d$</td>
      <td>High</td>
    </tr>
    <tr>
      <td>EP</td>
      <td>~1 (load-dependent)</td>
      <td>$1/d$ (MoE only)</td>
      <td>$1/d$</td>
      <td>Medium</td>
    </tr>
    <tr>
      <td>PP</td>
      <td>1 (&gt;1 with VPP)</td>
      <td>$1/d$</td>
      <td>$1/d$</td>
      <td>Medium</td>
    </tr>
    <tr>
      <td>CP</td>
      <td>$1/d$</td>
      <td>1</td>
      <td>$1/d$*</td>
      <td>Medium</td>
    </tr>
    <tr>
      <td>DP</td>
      <td>1</td>
      <td>1</td>
      <td>$1/d$*</td>
      <td>Low</td>
    </tr>
  </tbody>
</table>

<p><code class="language-plaintext highlighter-rouge">*</code> Requires distributed optimizer.</p>

<p><strong>Practical tip</strong>: use <code class="language-plaintext highlighter-rouge">--fake-init-process-group</code> to emulate distributed training on a single GPU for rapid iteration on parallelism configurations without allocating a full cluster.</p>

<h3 id="phase-2-select-optimal-parallelism-strategy">Phase 2: Select Optimal Parallelism Strategy</h3>

<p>Five guidelines:</p>
<ol>
  <li><strong>Minimize model parallelism, maximize data parallelism</strong>: keep TP, EP, PP, and CP as small as possible while still avoiding OOM. Use the distributed optimizer (<code class="language-plaintext highlighter-rouge">--use-distributed-optimizer</code>) to shard optimizer states across DP ranks.</li>
  <li><strong>Keep EP and TP within the NVLink domain</strong>: make sure <code class="language-plaintext highlighter-rouge">EP x TP</code> fits inside the local NVLink island, typically 8 GPUs per node or 72 GPUs for NVL72. When scaling beyond that, prefer PP over stretching TP or EP across nodes.</li>
  <li><strong>Use pipeline parallelism for multi-node scaling</strong>: PP distributes layers across nodes. Enable VPP to reduce pipeline bubbles when <code class="language-plaintext highlighter-rouge">PP &gt;= 2</code>, and balance work across VPP ranks.</li>
  <li><strong>Prefer EP over TP for expert layers</strong>: EP yields larger local GEMMs, lower communication overhead, simpler computation graphs, and eliminates local token permutation when <code class="language-plaintext highlighter-rouge">EP = num_experts</code>. The paper gives a concrete example: Mixtral-8x7B with <code class="language-plaintext highlighter-rouge">EP8 x TP1</code> outperforms <code class="language-plaintext highlighter-rouge">EP4 x TP2</code>.</li>
  <li><strong>Enable context parallelism for long sequences</strong>: use CP when sequence length is at least about 8K tokens. For sequences shorter than about 4K, the CP overhead can exceed the benefit.</li>
</ol>

<h3 id="phase-3-profile-and-optimize-bottlenecks">Phase 3: Profile and Optimize Bottlenecks</h3>

<p>Diagnose which wall dominates, then apply targeted fixes.</p>

<p><strong>Memory bottleneck</strong>: symptom is forced full recomputation or overly aggressive parallelism just to avoid OOM.</p>

<table>
  <thead>
    <tr>
      <th>Optimization</th>
      <th>Overhead</th>
      <th>Config Flag</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>FP8 Training</td>
      <td>Low</td>
      <td><code class="language-plaintext highlighter-rouge">--fp8-format --fp8-recipe</code></td>
    </tr>
    <tr>
      <td>Selective Recomputation</td>
      <td>Low</td>
      <td><code class="language-plaintext highlighter-rouge">--recompute-granularity --recompute-modules</code></td>
    </tr>
    <tr>
      <td>Precision-Aware Optimizer</td>
      <td>Low</td>
      <td><code class="language-plaintext highlighter-rouge">--use-precision-aware-optimizer</code></td>
    </tr>
    <tr>
      <td>Activation Offloading</td>
      <td>Medium</td>
      <td><code class="language-plaintext highlighter-rouge">--fine-grained-activation-offloading --offload-modules</code></td>
    </tr>
    <tr>
      <td>Optimizer Offloading</td>
      <td>Medium</td>
      <td><code class="language-plaintext highlighter-rouge">--offload-optimizer-states</code></td>
    </tr>
  </tbody>
</table>

<p><strong>Communication bottleneck</strong>: symptom is a profile dominated by collectives.</p>

<table>
  <thead>
    <tr>
      <th>Communication Type</th>
      <th>Config Flag</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>DP gradient reduce/param gather</td>
      <td><code class="language-plaintext highlighter-rouge">--overlap-grad-reduce --overlap-param-gather</code></td>
    </tr>
    <tr>
      <td>TP communication</td>
      <td><code class="language-plaintext highlighter-rouge">--tp-comm-overlap</code></td>
    </tr>
    <tr>
      <td>EP dispatcher</td>
      <td><code class="language-plaintext highlighter-rouge">--moe-token-dispatcher-type</code></td>
    </tr>
    <tr>
      <td>EP all-to-all hiding</td>
      <td><code class="language-plaintext highlighter-rouge">--overlap-moe-expert-parallel-comm</code></td>
    </tr>
    <tr>
      <td>PP send/recv</td>
      <td><code class="language-plaintext highlighter-rouge">--pipeline-model-parallel-layout</code></td>
    </tr>
  </tbody>
</table>

<p><strong>CPU overhead bottleneck</strong>: symptom is Nsight Systems showing gaps between GPU kernels.</p>

<table>
  <thead>
    <tr>
      <th>Optimization</th>
      <th>Config Flag</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Disable Python GC</td>
      <td><code class="language-plaintext highlighter-rouge">--manual-gc --manual-gc-interval 10</code></td>
    </tr>
    <tr>
      <td>Reduce kernel launches</td>
      <td>Decrease TP or increase MBS</td>
    </tr>
    <tr>
      <td>Enable CUDA Graphs</td>
      <td><code class="language-plaintext highlighter-rouge">--cuda-graph-impl transformer_engine</code></td>
    </tr>
  </tbody>
</table>

<p><strong>Computation bottleneck</strong>: symptom is low SM utilization even after communication and CPU gaps are under control.</p>

<table>
  <thead>
    <tr>
      <th>Optimization</th>
      <th>Config Flag</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Grouped GEMM</td>
      <td><code class="language-plaintext highlighter-rouge">--moe-grouped-gemm</code></td>
    </tr>
    <tr>
      <td>Kernel fusions</td>
      <td><code class="language-plaintext highlighter-rouge">--moe-router-fusion --moe-permute-fusion</code></td>
    </tr>
    <tr>
      <td>FP8 precision</td>
      <td><code class="language-plaintext highlighter-rouge">--fp8-format --fp8-recipe</code></td>
    </tr>
  </tbody>
</table>

<p><strong>Key insight</strong>: the same model on different hardware needs different optimization priorities. On NVL8, where EP crosses nodes, the Communication Wall dominates and can consume 30-50% of step time in all-to-all. On NVL72, where EP stays inside NVLink, enabling FP8 often shifts the bottleneck to CPU overhead instead.</p>

<h3 id="iterative-nature">Iterative Nature</h3>

<p>The ordering matters: memory in Phase 1, then parallelism in Phase 2, then profiling in Phase 3. But the process is cyclical. Memory optimizations may enable smaller parallelism degrees, which pushes you back to Phase 1. Phase 3 optimizations such as EP communication overlap and CUDA Graphs also consume memory, which can force you to revisit earlier choices.</p>

<hr />

<h2 id="13-rl-post-training-insights">13. RL Post-Training Insights</h2>

<ul>
  <li><strong>Router Replay</strong>: logs expert assignments during inference and replays them during training for more stable optimization</li>
  <li><strong>Packing-aware dynamic batch size</strong>: keeps total effective tokens per batch more consistent</li>
  <li><strong>Attention cost metric</strong>: sorts microbatches by $\sum (\text{seq_len})^2$ in serpentine order to reduce synchronization bubbles (a small sketch follows this list)</li>
  <li><strong>Dynamic CP</strong>: selects CP degree per microbatch rather than provisioning for the worst-case sequence mix</li>
</ul>
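
<p>A small sketch of the serpentine assignment mentioned above, reflecting my interpretation of the scheme rather than the paper’s code: sort microbatches by their attention cost, then deal them out to buckets back and forth so every bucket ends up with a similar total.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def serpentine_assign(microbatches, num_buckets):
    # microbatches: list of lists of sequence lengths; cost ~ sum of seq_len^2
    order = sorted(range(len(microbatches)),
                   key=lambda i: sum(s * s for s in microbatches[i]),
                   reverse=True)
    buckets = [[] for _ in range(num_buckets)]
    for rank, idx in enumerate(order):
        pos = rank % (2 * num_buckets)
        # snake back and forth: 0, 1, ..., B-1, B-1, ..., 1, 0, 0, 1, ...
        bucket = pos if pos &lt; num_buckets else 2 * num_buckets - 1 - pos
        buckets[bucket].append(idx)
    return buckets
</code></pre></div></div>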

<hr />

<h2 id="14-conclusion">14. Conclusion</h2>

<p>MoE sparsity introduces two fundamental challenges:</p>
<ol>
  <li><strong>Parameter-compute mismatch</strong>, which creates the Three Walls of memory, communication, and compute efficiency</li>
  <li><strong>Dense-sparse mismatch</strong>, which requires decoupled parallelism between attention and expert layers</li>
</ol>

<p>Key contributions summarized:</p>

<table>
  <thead>
    <tr>
      <th>Contribution</th>
      <th>Impact</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Parallel Folding</td>
      <td>Breaks <code class="language-plaintext highlighter-rouge">EP &lt;= DP</code>; enables flexible parallelism mapping</td>
    </tr>
    <tr>
      <td>Memory Optimization</td>
      <td>199.5 GB to under 80 GB per GPU for DeepSeek-V3</td>
    </tr>
    <tr>
      <td>Communication Optimization</td>
      <td>All-to-all shifts from foreground bottleneck to mostly background work</td>
    </tr>
    <tr>
      <td>Compute Efficiency</td>
      <td>Grouped GEMM plus CUDA Graphs plus sync-free execution</td>
    </tr>
    <tr>
      <td>FP8/FP4 Training</td>
      <td>Cross-cutting improvements across all three walls</td>
    </tr>
    <tr>
      <td>Long-Context Training</td>
      <td>Keeps sub-sequence lengths manageable through CP and TP scaling</td>
    </tr>
    <tr>
      <td>RL Support</td>
      <td>Packed sequences, Dynamic-CP, and router replay</td>
    </tr>
  </tbody>
</table>

<p>Final performance: DeepSeek-V3 reaches <strong>1,233 / 1,048 TFLOPS/GPU</strong> on GB300 and GB200 with 256 GPUs, versus <strong>368 TFLOPS/GPU</strong> on H100 with 1,024 GPUs. GB200 and GB300 deliver about <strong>3x higher token throughput</strong> than H100.</p>

<p>The broader takeaway is that Megatron-Core MoE is a full-stack systems response to these two mismatches. Memory savings from FP8, recomputation, and offloading are not isolated wins; they enable communication overlap, better parallelism choices, and eventually CUDA Graph coverage. Large-scale MoE training becomes feasible only when routing, parallelism, kernels, optimizer states, and long-context execution are all tuned together.</p>]]></content><author><name>Jianyu Huang</name><email>jianyu0huang@gmail.com</email></author><category term="MoE" /><category term="MoE" /><category term="Megatron" /><category term="NVIDIA" /><category term="distributed-training" /><category term="FP8" /><summary type="html"><![CDATA[Reading the following paper: Scalable Training of Mixture-of-Experts Models with Megatron Core]]></summary></entry><entry><title type="html">A Two-Year-Old’s Milestone: A Heartfelt Reflection on Time</title><link href="https://jianyuh.github.io/life/non-tech/2026/03/10/two-year-old.html" rel="alternate" type="text/html" title="A Two-Year-Old’s Milestone: A Heartfelt Reflection on Time" /><published>2026-03-10T00:00:00+00:00</published><updated>2026-03-10T00:00:00+00:00</updated><id>https://jianyuh.github.io/life/non-tech/2026/03/10/two-year-old</id><content type="html" xml:base="https://jianyuh.github.io/life/non-tech/2026/03/10/two-year-old.html"><![CDATA[<h3 id="chapter-i-the-night-life-broke-through-the-shell">Chapter I: The Night Life Broke Through the Shell</h3>

<p>Time moves so quickly — before I could even begin to count the days, you are already two years old.</p>

<p>Looking back to the night before you were born, it feels like only yesterday. That morning, your mother felt unusually intense movement, and contractions began arriving in waves. I hurriedly packed our bags and drove her to the hospital on the west side. The night that followed was long and heart-wrenching: in the ward, your mother lay tethered to tubes and monitors, holding on with the help of anesthesia and an epidural.</p>

<p>It wasn’t until the early hours of the next morning that the doctors began the induction. Your mother summoned every ounce of her strength, and finally — you arrived. I cut the umbilical cord myself, witnessing the very first moment of your independent life. But before the joy could even settle, a complication arose, and your mother was rushed back to the operating room for an emergency procedure. Thankfully, she came through safely. Meanwhile, you announced your arrival with a thunderous wail, and we — guided by patient nurses — clumsily learned to feed you and change diapers every three hours. On the day we left the hospital, the moment you were buckled into your car seat, your constant crying miraculously stopped. It was your first truce with this world.</p>

<h3 id="chapter-ii-the-chaos-of-new-parenthood">Chapter II: The Chaos of New Parenthood</h3>

<p>That first night home was the start of a frantic scramble. You cried so hard it felt like our hearts were being torn apart, and as rookie parents, we didn’t even realize you were simply hungry. For the entire first month, you woke every three hours, and we bid farewell to unbroken sleep.</p>

<p>From there, life shifted rapidly through countless firsts: we took you to see the cherry blossoms in Seattle; we watched your first rollover, your first crawl, your first wobbly steps. We heard you call out “Mama” and “Papa” for the first time in your small, stumbling voice. That year, you accompanied your mother to her PhD graduation ceremony and gazed up at the Aurora Borealis with us. Grandparents took turns flying in from afar to help care for you. You went through weaning off the pacifier, doctor’s check-ups, and gleeful swinging at the amusement park. Life unfolded in small details, but it became vivid because of you.</p>

<p>What remains most unforgettable was the end of 2024. At only nine months old, you followed us on cross-continental flights between China and the U.S., spending a month in our home country. I also remember a deep winter night when the power went out. With no warm milk to drink, you simply sat there in the dark, laughing and content. In that moment, your effortless optimism dissolved all our exhaustion.</p>

<h3 id="chapter-iii-waiting-at-the-window">Chapter III: Waiting at the Window</h3>

<p>On your first birthday, we invited friends and colleagues to celebrate. You sat in a little push-car, spinning happily around the living room.</p>

<p>After turning one, you started daycare. With it came the inevitable cycle of illnesses and trips to the urgent care, but it was also where you learned to socialize and play with children your age. Because of work, I began traveling frequently between California and Washington. Your mother tells me that midweek, you often press yourself against the window, watching and waiting for Papa to come home. Because of this, I hold our weekends sacred — even if it’s just a walk in the park or taking you to your little soccer class on Saturday mornings.</p>

<p>At home these days, you are a perpetual motion machine. Aside from when you’re sleeping, you almost never stop — running laps between the kitchen and the living room. You’ve grown especially attached to your mother lately, occasionally declaring in your baby voice, “Don’t want Papa, want Mama.” It stings a little, I won’t lie. But I’m deeply comforted to see you expressing your feelings so boldly.</p>

<h3 id="chapter-iv-a-second-birthday-and-an-ordinary-day">Chapter IV: A Second Birthday and an Ordinary Day</h3>

<p>Your second birthday was spent, once again, in a rush. I flew back from California just the day before and hurried home from work at 5 PM on your birthday.</p>

<p>Your mother had set up a backdrop with your favorite excavator balloons, and you dragged them around for ages. We gathered to sing “Happy Birthday,” and you asked with a perfectly serious face, “Where did the cake go?” When the cake finally appeared, you blew out the candles with delight alongside Papa, Mama, and Grandma. You ate until your face was covered in cream — a little frosted kitten.</p>

<p>The celebration was simple and grounded. After blowing out the candles, we went to Costco as usual to restock for the week. The day after your birthday, we took you to a nearby park and then out for hot pot. You didn’t wake from your afternoon nap until late. Grandma and Mama spent the afternoon playing with you, and after dinner, I took you for one last walk by the water. As the sun set and the sky darkened, watching you run and pause along the shore, I felt a tranquility that was almost unbearably precious.</p>

<h3 id="chapter-v-the-cycles-of-time">Chapter V: The Cycles of Time</h3>

<p>Time flies. I often think that in perhaps sixteen years, you will be as I once was — packing your bags and leaving this warm little nest to chase your own mountains and seas.</p>

<p>Today, I find myself having lived through another full cycle: seventeen years since I first left my hometown at eighteen. The world is changing fast, especially amid the surging wave of AI, where everything seems to accelerate. But I know that some things never change.</p>

<p>I will cherish this time together. I hope to walk beside you in health and happiness through all the years to come.</p>

<p><strong>Not to squander our time; not to waste the beauty of these years.</strong></p>]]></content><author><name>Jianyu Huang</name><email>jianyu0huang@gmail.com</email></author><category term="Life" /><category term="Non-tech" /><category term="Birthday" /><category term="Parenting" /><category term="Reflection" /><category term="Memoir" /><summary type="html"><![CDATA[Chapter I: The Night Life Broke Through the Shell]]></summary></entry><entry><title type="html">FlashAttention-4 and the Challenge of Asymmetric Hardware Scaling</title><link href="https://jianyuh.github.io/flashattention/2026/03/06/FA4.html" rel="alternate" type="text/html" title="FlashAttention-4 and the Challenge of Asymmetric Hardware Scaling" /><published>2026-03-06T00:00:00+00:00</published><updated>2026-03-06T00:00:00+00:00</updated><id>https://jianyuh.github.io/flashattention/2026/03/06/FA4</id><content type="html" xml:base="https://jianyuh.github.io/flashattention/2026/03/06/FA4.html"><![CDATA[<p>Reading the following paper:</p>
<ul>
  <li><a href="https://arxiv.org/pdf/2603.05451">FlashAttention-4: Algorithm and Kernel Pipelining Co-Design for Asymmetric Hardware Scaling</a></li>
</ul>

<p><strong>Overview</strong></p>

<p>FlashAttention-4 is the latest iteration of the wildly successful hardware-aware attention algorithm, designed specifically to tackle the architectural shifts in NVIDIA’s Blackwell (B200/GB200) GPUs. While FlashAttention-3 was heavily optimized for the Hopper architecture (H100), the transition to Blackwell introduces a phenomenon the authors call “asymmetric hardware scaling.” This paper provides a masterclass in algorithm and kernel co-design, demonstrating how to shift compute paradigms when the hardware bottlenecks unexpectedly change.</p>

<p><strong>The Core Problem: Asymmetric Hardware Scaling</strong></p>

<p>The fundamental challenge addressed in FlashAttention-4 is that not all parts of the GPU got faster at the same rate. On the Blackwell B200, the Matrix Multiply-Accumulate (MMA) tensor core throughput doubled compared to Hopper, reaching a massive 8192 ops/clock/SM (2.25 PFLOPS for FP16/BF16). However, the shared memory (SMEM) bandwidth remained flat at 128 bytes/clock/SM, and the multi-function unit (MUFU), which handles exponential operations for softmax, remained at 16 ops/clock/SM.</p>

<p>Roofline analysis in the paper reveals a surprising reality for Blackwell: matrix multiplication is no longer the primary bottleneck for attention. Instead, <strong>shared memory traffic and exponential operations now dominate execution time</strong>, exceeding MMA compute by 25-60%.</p>
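
<p>To make the asymmetry concrete, here is a rough back-of-envelope in Python using the per-SM rates quoted above. The 128x128 tile and head dimension of 128 are my own illustrative assumptions (not numbers from the paper), and the model ignores operand reuse and epilogue traffic, so it only indicates the order of magnitude.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Back-of-envelope per-SM cycle counts for one 128x128 attention tile.
# Throughput numbers are the per-SM rates quoted above; the tile and head
# sizes are illustrative assumptions.
MMA_FLOPS_PER_CLK  = 8192    # FP16/BF16 tensor-core throughput
MUFU_OPS_PER_CLK   = 16      # hardware exponential throughput
SMEM_BYTES_PER_CLK = 128     # shared-memory bandwidth

TILE_M, TILE_N, HEAD_DIM = 128, 128, 128     # assumed tile shape

qk_cycles  = 2 * TILE_M * TILE_N * HEAD_DIM / MMA_FLOPS_PER_CLK   # S = Q @ K^T
pv_cycles  = 2 * TILE_M * HEAD_DIM * TILE_N / MMA_FLOPS_PER_CLK   # O += P @ V
exp_cycles = TILE_M * TILE_N / MUFU_OPS_PER_CLK                   # one exp per score
# Q, K, V tiles staged through SMEM once in BF16 (2 bytes each), ignoring reuse
smem_cycles = (TILE_M + 2 * TILE_N) * HEAD_DIM * 2 / SMEM_BYTES_PER_CLK

print(f"MMA : {qk_cycles + pv_cycles:.0f} cycles")   # 1024
print(f"exp : {exp_cycles:.0f} cycles")              # 1024
print(f"SMEM: {smem_cycles:.0f} cycles")             # 768
</code></pre></div></div>

<p>Even in this crude model the exponentials alone take as many cycles as both GEMMs combined, which is exactly the regime the roofline analysis describes.</p>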

<p><strong>Key Technical Innovations</strong></p>

<p><strong>1. Software-Emulated Exponential &amp; Conditional Rescaling</strong></p>

<p>Because the hardware MUFU cannot keep up with the doubled tensor core speed, the exponential calculation in the softmax becomes a major chokepoint.</p>

<ul>
  <li>
    <p><strong>Polynomial Approximation:</strong> To bypass the MUFU bottleneck, the kernel emulates $2^x$ in software on the floating-point FMA units via a degree-3 polynomial approximation. By distributing the exponential computations across both the MUFU and FMA units (using emulation for 10-25% of entries), it effectively raises the aggregate exponential throughput. The genius here is recognizing that while a degree-3 polynomial has a higher FP32-level error than the hardware MUFU, once the output is rounded to BF16 (the standard precision for attention), the quantization error dominates, making the software emulation virtually indistinguishable from hardware. A toy numerical sketch of this idea (and of the rescaling skip below) follows this list.</p>
  </li>
  <li>
    <p><strong>Conditional Rescaling:</strong> FlashAttention relies on online softmax, which requires rescaling previous results when a new maximum value is encountered. FlashAttention-4 introduces a threshold-based skip mechanism: if the new maximum is not significantly larger than the old one, it skips the vector multiplication for rescaling and resolves the normalization at the very end.</p>
  </li>
</ul>
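
<p>Neither the polynomial coefficients nor the rescaling threshold are given in this summary, so the sketch below is a toy NumPy reconstruction of the two ideas rather than the kernel’s actual code: a degree-3 fit of $2^x$ on $[0, 1)$ with range reduction, and an online-softmax update (in base 2, matching the $2^x$ emulation) that skips rescaling when the running row maximum barely moves. The coefficients and the <code class="language-plaintext highlighter-rouge">threshold</code> value are my own illustrative choices, and the MUFU/FMA work split is not modeled.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

# --- Degree-3 emulation of 2**x (the FMA-friendly part) ----------------------
# Range-reduce x = n + f with f in [0, 1), evaluate a cubic in f, scale by 2**n.
# Coefficients below come from a plain least-squares fit; the kernel's actual
# coefficients and fit criterion are not specified in this post.
_f = np.linspace(0.0, 1.0, 512)
_coef = np.polyfit(_f, 2.0 ** _f, 3)            # degree-3 fit on [0, 1)

def exp2_emulated(x):
    n = np.floor(x)
    f = x - n                                   # fractional part in [0, 1)
    poly = np.polyval(_coef, f)                 # 3 FMAs in a real kernel
    return poly * 2.0 ** n                      # exact power-of-two scaling

x = np.random.uniform(-20.0, 0.0, 10_000)
rel_err = np.abs(exp2_emulated(x) - 2.0 ** x) / 2.0 ** x
print("cubic emulation max rel. error:", rel_err.max())  # ~1e-4, well below BF16's ~2**-8 step

# --- Online softmax (base 2) with a threshold-based rescaling skip -----------
def online_softmax_rows(scores, threshold=8.0):
    """Consume KV blocks left to right. Skip the O(n) rescale of the running
    accumulator unless the row maximum grows by more than `threshold`
    (in log2 units); the final normalization absorbs the stale base."""
    m = np.full(scores.shape[0], -np.inf)       # running (possibly stale) row max
    acc = np.zeros(scores.shape[0])             # running sum of 2**(s - m)
    for blk in np.split(scores, 4, axis=1):     # pretend there are 4 KV tiles
        new_m = np.maximum(m, blk.max(axis=1))
        rescale = (new_m - m) &gt; threshold       # only rescale when it matters
        acc = acc * np.where(rescale, 2.0 ** (m - new_m), 1.0)
        m = np.where(rescale, new_m, m)         # otherwise keep the stale max
        acc = acc + (2.0 ** (blk - m[:, None])).sum(axis=1)
    return acc, m                               # caller rebases and normalizes

scores = 3.0 * np.random.randn(4, 64)
acc, m = online_softmax_rows(scores)
true_max = scores.max(axis=1)
exact = (2.0 ** (scores - true_max[:, None])).sum(axis=1)
print("mismatch vs. exact row sums:", np.abs(acc * 2.0 ** (m - true_max) - exact).max())
</code></pre></div></div>

<p>The skip is safe because the accumulator always stores sums relative to whatever (possibly stale) maximum is currently held, so a single rescale at the very end rebases everything to the true maximum.</p>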

<p><strong>2. Taming Shared Memory with 2-CTA MMA Mode</strong></p>

<p>In the backward pass, SMEM bandwidth becomes an acute bottleneck because the algorithm requires five different MMA operations, forcing operands to be repeatedly read from shared memory.</p>

<p>To solve this, FlashAttention-4 leverages a new Blackwell feature: <strong>2-CTA tensor core MMA mode</strong>. In this mode, two Cooperative Thread Arrays (CTAs) within the same cluster cooperatively execute a single MMA.</p>

<ul>
  <li>
    <p><strong>Halving SMEM Traffic:</strong> Each CTA stages only half of operand B in its own shared memory, while the hardware consumes the combined B tile during the multiply.</p>
  </li>
  <li>
    <p><strong>Halving Atomic Reductions:</strong> In the gradient accumulation step (dQ), the researchers use distributed shared memory (DSMEM) to exchange half of the gradient of the softmax (dS) between the two CTAs. This repacking allows each CTA to write only half of the dQ tile, cutting the number of expensive global atomic reductions in half; a small back-of-envelope for both savings follows this list.</p>
  </li>
</ul>
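
<p>To put numbers on the two savings: the sketch below assumes a 128x128x128 MMA tile with BF16 operands, which are my illustrative choices rather than figures from the paper.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Illustrative accounting for the two savings described above. The tile shape
# and BF16 operands are assumptions for this sketch, not figures from the paper.
BYTES_BF16 = 2
TILE_M = TILE_N = TILE_K = 128                 # assumed MMA tile shape

# (a) Operand-B staging per MMA: with cta_group::2 each CTA holds half of B.
b_bytes_1cta = TILE_N * TILE_K * BYTES_BF16    # 32 KiB per CTA
b_bytes_2cta = b_bytes_1cta // 2               # 16 KiB per CTA

# (b) dQ accumulation: after swapping half of dS over DSMEM, each CTA writes
# only half of the dQ tile, so the global atomic adds per tile are halved.
dq_atomics_1cta = TILE_M * TILE_K              # one atomic add per dQ element
dq_atomics_2cta = dq_atomics_1cta // 2

print(b_bytes_1cta, b_bytes_2cta)              # 32768 16384
print(dq_atomics_1cta, dq_atomics_2cta)        # 16384 8192
</code></pre></div></div>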

<p><strong>3. Exploiting TMEM and Full Asynchrony</strong></p>

<p>Blackwell introduces Tensor Memory (TMEM), a 256 KB on-chip memory per SM specifically for storing intermediate tensor core results. Unlike Hopper, where MMAs wrote to registers and caused massive register pressure, Blackwell’s MMAs write asynchronously directly to TMEM. FlashAttention-4 redesigns the software pipeline to aggressively overlap the fully asynchronous tensor core operations with softmax and memory operations, utilizing the larger 128x128 MMA tiles.</p>
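
<p>A quick capacity check helps explain why the larger tiles are comfortable here; the 256 KB figure is from the paper, while FP32 accumulation is an assumption on my part.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># TMEM capacity check: 256 KB per SM is quoted above; FP32 accumulation is assumed.
TMEM_BYTES = 256 * 1024
TILE_M = TILE_N = 128                     # the larger Blackwell MMA tile
acc_tile_bytes = TILE_M * TILE_N * 4      # one FP32 accumulator tile = 64 KiB
print(TMEM_BYTES // acc_tile_bytes)       # 4 -- room for several in-flight accumulators
</code></pre></div></div>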

<p><strong>4. A Pythonic Shift: CuTe-DSL</strong></p>

<p>From a developer ecosystem standpoint, one of the most exciting updates is the move away from notoriously slow C++ template metaprogramming. FlashAttention-4 is written entirely in <strong>CuTe-DSL embedded in Python</strong>. This framework lowers to PTX and compiles just-in-time (JIT), reducing compile times by 20-30x (down to ~1.4–2.5 seconds from 45–55 seconds) while retaining full low-level expressivity.</p>

<p><strong>Performance Outcomes</strong></p>

<p>By directly addressing the shifting hardware bottlenecks, FlashAttention-4 achieves impressive performance on B200 GPUs:</p>
<ul>
  <li>Up to <strong>1.3x speedup over cuDNN 9.13</strong> and <strong>2.7x over Triton</strong> for BF16.</li>
  <li>Achieves up to <strong>1613 TFLOPs/s</strong>, utilizing 71% of the theoretical maximum compute.</li>
</ul>

<p><strong>Insights</strong></p>

<ol>
  <li>
    <p><strong>The End of “Compute-Bound” Attention (For Now):</strong> We are entering an era where keeping the tensor cores fed is harder than the matrix multiplications themselves. Kernel developers can no longer treat GPUs as uniformly scaling compute machines; they must track how MMA throughput compares to memory bandwidth and to the throughput of the non-linear function units.</p>
  </li>
  <li>
    <p><strong>Software Emulation is Viable at Lower Precisions:</strong> The decision to emulate the exponential function using FMA units is a brilliant application of precision-aware optimization. It highlights a broader insight for AI system design: if your target data type (like BF16) has a high quantization error, you have “budget” to use cheaper, faster mathematical approximations without harming the end result.</p>
  </li>
  <li>
    <p><strong>Python for Low-Level GPU Kernels is Maturing:</strong> The use of CuTe-DSL proves that writing bare-metal, highly optimized GPU kernels no longer strictly requires the painful compilation cycles of complex C++ libraries like CUTLASS. This lowering of the barrier to entry will likely accelerate community experimentation with new attention variants.</p>
  </li>
</ol>]]></content><author><name>Jianyu Huang</name><email>jianyu0huang@gmail.com</email></author><category term="FlashAttention" /><category term="FlashAttention" /><summary type="html"><![CDATA[Reading the following paper: FlashAttention-4: Algorithm and Kernel Pipelining Co-Design for Asymmetric Hardware Scaling]]></summary></entry></feed>