NVIDIA Blackwell SM100: TMEM, TMA, and the New Tensor Core Roofline
Reading notes based primarily on:
Blackwell is not just Hopper with bigger Tensor Cores. The software contract changed. On SM100, kernel performance now depends on explicitly managing Tensor Memory (TMEM), understanding CTA-scoped tensor instructions, and recognizing that the limiting resource is often no longer raw tensor math, but the bandwidth needed to feed it.
This also helps explain why several recent Blackwell-era kernels, including FlashAttention-4 and MXFP8 Training, had to substantially rethink their pipelines instead of simply porting Hopper code.
1. Blackwell changes who “owns” MMA execution
Three architectural shifts matter most.
First, Blackwell introduces Tensor Memory (TMEM), a new software-managed level in the memory hierarchy used to hold MMA accumulators. On Hopper, MMA results were tightly coupled to the issuing warpgroup and register file. On Blackwell, accumulators live in TMEM, which decouples result ownership from any particular thread. That sounds subtle, but it fundamentally changes kernel structure: epilogues, accumulator lifetimes, and overlap strategies all need to be redesigned around TMEM.
Second, tcgen05 instructions are CTA-scoped. A single thread issues the instruction on behalf of the entire CTA. This is a major departure from Hopper’s warpgroup-scoped wgmma model. The practical consequence is that threads are no longer symmetric participants in tensor-core issue. Some threads orchestrate, others move data, and the whole CTA acts as the execution unit.
Third, Blackwell adds TPC-scoped TMA and MMA instructions through cta_group::2 in PTX, or 2CTA in SASS. Two CTAs, spanning two SMs, can collaboratively execute the same tcgen05.mma. Combined with native support for sub-byte and microscaled datatypes, this gives Blackwell a much more flexible tensor-core pipeline, but only if the kernel is explicitly written to use it.
The broader pattern is clear: Blackwell rewards kernels that think in terms of asynchronous clusters and shared on-chip resources, not warp-synchronous loops.
2. Physical topology now leaks into software decisions
The execution hierarchy above the SM matters more than it used to.
CTAs grouped into clusters are guaranteed to co-schedule on the same Graphics Processing Cluster (GPC). That is essential for 2CTA execution and for efficient use of distributed shared memory. But there is a catch: if a persistent kernel launches one CTA per SM and the chosen cluster size does not evenly divide the number of SMs each GPC provides, the leftover CTAs can serialize. The result is that “one CTA per SM” is no longer automatically the right launch policy.
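The packing arithmetic is easy to sketch. A minimal model follows, using a hypothetical per-GPC SM count (real counts vary by part and by floorsweeping, and are not given in this note):

```python
# Toy model of cluster packing within one GPC. The per-GPC SM count used
# below is hypothetical; real counts depend on the part and floorsweeping.
def gpc_packing(sms_per_gpc: int, cluster_size: int) -> tuple[int, int]:
    """Return (clusters that fit concurrently, SMs left stranded)."""
    fitting = sms_per_gpc // cluster_size
    stranded = sms_per_gpc % cluster_size
    return fitting, stranded

# With 15 SMs per GPC (hypothetical), a cluster size of 4 strands 3 SMs:
# their CTAs must wait for a later wave, so "one CTA per SM" serializes.
print(gpc_packing(15, 2))  # (7, 1)
print(gpc_packing(15, 4))  # (3, 3)
```

Any cluster size that divides the per-GPC SM count evenly leaves nothing stranded, which is why launch geometry has to be chosen with the topology in mind.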
On B200, the topology is even more visible because the package spans two dies. Pointer-chasing measurements that intentionally fill L2 expose an average die-to-die latency penalty of roughly 300 cycles. For kernels relying on cluster-local reuse, this means that topology-blind scheduling can turn a theoretically local path into a very real latency tax.
In other words, Blackwell kernel tuning now includes placement as well as instruction selection.
3. LDGSTS and TMA serve different regimes
Blackwell offers two strong but very different ways to move data into shared memory.
LDGSTS: fast to ramp, fragile at scale
cp.async / LDGSTS writes directly into shared memory without staging through registers, which reduces register pressure and makes it attractive for irregular data movement. Its DRAM throughput saturates at roughly 6.6 TB/s with about 32 KiB in-flight per SM.
The problem is latency scaling. The baseline latency is about 600 ns, but once the in-flight footprint grows, the MIO subsystem becomes the bottleneck:
- At 8 KiB in-flight, latency rises to about 1229 cycles
- At 12 KiB in-flight, latency can spike to about 2177 cycles
So LDGSTS is excellent when the kernel needs responsiveness and flexibility, but it becomes increasingly fragile once too many warps are competing for the copy path.
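A quick bandwidth-delay check suggests the saturation figures above hang together. The sketch assumes 148 SMs on B200; that SM count is not stated in this note and is used only for the estimate:

```python
# Bandwidth-delay product for LDGSTS: how many bytes must be in flight
# per SM to cover the baseline latency at the measured throughput.
# The 148-SM count for B200 is an assumption, not a figure from this note.
total_bw = 6.6e12            # B/s, measured LDGSTS DRAM saturation
sms = 148                    # assumed SM count
latency_s = 600e-9           # ~600 ns baseline LDGSTS latency
per_sm_bw = total_bw / sms   # ~44.6 GB/s per SM
in_flight = per_sm_bw * latency_s
print(round(in_flight / 1024, 1))  # ~26.1 KiB, near the ~32 KiB measured knee
```

The estimate landing in the same ballpark as the measured ~32 KiB saturation point is a useful sanity check that the latency and throughput numbers describe the same regime.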
TMA: higher ceiling, slower to fill
Tensor Memory Accelerator (TMA) is issued by a single thread, handles address generation in hardware, and can perform the swizzling needed by tensor-core layouts asynchronously. Its peak throughput is higher, approaching 7.2 TB/s, but it needs much more data in flight, typically more than 64 KiB, before it reaches that ceiling.
That makes TMA the better fit for large, regular, deeply pipelined tiles, while LDGSTS remains attractive for sparse or irregular patterns.
This tradeoff shows up in real kernels. A reasonable rule of thumb is:
- Use TMA for large, predictable tiles with enough buffering to hide setup cost
- Use LDGSTS for irregular or dynamic page fetches where responsiveness matters more than peak bandwidth
Even within LDGSTS-heavy kernels, adding more stages and more copy-participating threads continues to help until register allocation becomes the limiting factor.
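The rule of thumb above can be written down as a tiny heuristic. This is purely illustrative, not an API; the 64 KiB threshold echoes the in-flight footprint quoted above for TMA to approach its ceiling:

```python
# Illustrative copy-path chooser based on the regimes described above.
# Thresholds echo the measurements quoted in this section; the function
# itself is a sketch, not a real interface.
def pick_copy_path(bytes_in_flight: int, regular_access: bool) -> str:
    if not regular_access:
        return "LDGSTS"              # responsiveness beats peak bandwidth
    if bytes_in_flight > 64 * 1024:
        return "TMA"                 # deep enough pipeline to hide TMA fill
    return "LDGSTS"                  # small regular tiles: LDGSTS ramps faster
```

The point of writing it out is that the decision depends on two inputs, not one: both the access pattern and the achievable in-flight footprint have to favor TMA before it wins.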
4. Multicast and DSMEM are powerful, but not free
Blackwell’s cluster features are only as good as the access pattern driving them.
TMA multicast and the L2 Request Coalescer
With TMA multicast, a single load can populate shared memory on multiple SMs in a cluster. This is serviced through the L2 Request Coalescer (LRC). There is also an “implicit” form of multicast where multiple CTAs simply request the same data and rely on the hardware to merge requests.
Implicit multicast can reach roughly the same effective shared-memory fill throughput as explicit multicast, but the LRC stops saving much L2 traffic once the implicit requests exceed about 64 bytes in-flight. So if the objective is not just SMEM fill rate but also lower L2 pressure, explicit multicast remains the cleaner tool.
Remote shared memory is not local shared memory
The gap between local and remote shared-memory access is severe:
- Local ld.shared: about 128 B/clk
- Naive remote ld.shared::cluster: about 21 B/clk
The reason is painful but simple: the compiler often lowers remote loads to generic LD instructions rather than optimized LDS instructions. For high-throughput inter-CTA exchange, developers should rely on cp.async.bulk (UBLKCP in SASS), which pushes distributed shared-memory throughput up to about 32 B/clk.
The lesson is that Blackwell’s cluster features are not self-optimizing. The fast path usually has to be spelled out explicitly.
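Put as ratios, using the B/clk figures quoted in this section:

```python
# Relative throughput of the three DSMEM paths, from the B/clk figures above.
local_bpc = 128        # local ld.shared
naive_remote_bpc = 21  # ld.shared::cluster lowered to generic LD
bulk_remote_bpc = 32   # cp.async.bulk (UBLKCP)

naive_penalty = local_bpc / naive_remote_bpc    # ~6.1x slower than local
bulk_gain = bulk_remote_bpc / naive_remote_bpc  # ~1.5x over the naive path
print(round(naive_penalty, 1), round(bulk_gain, 1))
```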
5. The real roofline is often shared-memory bandwidth
One of the most important Blackwell insights is that many MMA instructions are no longer math-bound.
For 1SM MMA, under-sized shapes are heavily penalized:
- M=64 uses only about half the datapath
- M=128 reaches near-full utilization
For 2SM MMA, M=256 is the sweet spot because it maps to 128 rows per SM, which keeps both SMs well utilized.
The deeper issue is operand movement. Blackwell supports:
- SS mode: both A and B come from shared memory
- TS mode: A comes from TMEM, B comes from shared memory
In SS mode, the instruction is entirely bound by shared-memory bandwidth for N < 128.
Consider an FP16 1SM MMA with shape M=128, N=64, K=16. At 2 bytes per element, the A tile is 128 × 16 × 2 = 4096 B and the B tile is 64 × 16 × 2 = 2048 B.
Assuming Blackwell shared memory sustains 128 B/clk, the shared-memory service time is:
\[\text{SMEM cycles} = \frac{4096 + 2048}{128} = 48\]

If the effective Tensor Core throughput for this instruction regime is 8,192 FLOPs/clk, the math time is:

\[\text{Math cycles} = \frac{262{,}144}{8{,}192} = 32\]

So shared memory still dominates:

\[48 \text{ SMEM cycles} > 32 \text{ Math cycles}\]

That is the core result. For N=64, the instruction is physically SMEM-bound, not Tensor-Core-bound. Only at N=128 do the two sides align at roughly 64 cycles each, marking the transition into a math-limited regime.
This produces a distinctly sloped roofline at exactly 128 B/clk. On Blackwell, feeding the Tensor Cores is often harder than using them.
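The arithmetic generalizes to a small sweep over N. A sketch reproducing the worked example, under the same assumptions (FP16 operands, 128 B/clk sustained SMEM bandwidth, 8,192 FLOPs/clk):

```python
# SS-mode service times for a 1SM FP16 MMA, per the worked example above.
def ss_mode_cycles(M, N, K, bytes_per_elem=2, smem_bpc=128, flops_per_clk=8192):
    a_bytes = M * K * bytes_per_elem          # A tile staged from SMEM
    b_bytes = N * K * bytes_per_elem          # B tile staged from SMEM
    smem = (a_bytes + b_bytes) / smem_bpc     # SMEM service cycles
    math = (2 * M * N * K) / flops_per_clk    # Tensor Core cycles
    return smem, math

print(ss_mode_cycles(128, 64, 16))   # (48.0, 32.0): SMEM-bound
print(ss_mode_cycles(128, 128, 16))  # (64.0, 64.0): the crossover point
```

Sweeping N makes the sloped roofline visible: below N=128 the SMEM term dominates, and at N=128 the two sides meet at 64 cycles.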
6. Why 2SM MMA can scale by more than 2x
This shared-memory bottleneck explains a counterintuitive result: in SS mode and for small shapes, 2SM MMA can achieve greater than 2x strong scaling over 1SM MMA.
That is not magic. It is bottleneck removal.
When the work is split across two SMs, each SM contributes its own shared-memory bandwidth. The kernel is no longer constrained by the single-SM SMEM ceiling that held back the 1SM path. In effect, the architecture doubles both the compute resources and the on-chip bandwidth feeding them, so the observed speedup can exceed the naive 2.0x expectation.
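A toy model shows where the extra factor can come from. It relies on an assumption that is ours, not a measurement from this note: that in 2SM SS mode each SM stages its own half of A but only half of B, because B is split across the pair:

```python
# Toy model of >2x strong scaling for SS-mode 2SM MMA. Key assumption
# (ours, not from the measurements above): each SM in the pair stages
# only half of the B tile, so per-SM SMEM traffic drops.
def mma_cycles(a_bytes, b_bytes, flops, smem_bpc=128, flops_per_clk=8192):
    return max((a_bytes + b_bytes) / smem_bpc, flops / flops_per_clk)

# 1SM FP16, M=128, N=64, K=16: SMEM-bound at 48 cycles.
one_sm = mma_cycles(4096, 2048, 262144)
# 2SM, M=256: per SM, the same A bytes but only half of B (1024 B).
per_sm_2sm = mma_cycles(4096, 1024, 262144)
speedup = 2 * one_sm / per_sm_2sm
print(one_sm, per_sm_2sm, speedup)  # 48.0 40.0 2.4
```

Under this assumption the per-SM instruction is still SMEM-bound, but less so, which is exactly the "bottleneck removal" described above.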
In TS mode, where operand A comes from TMEM rather than shared memory, scaling behaves much more cleanly and sits near the expected 2.0x.
7. Latency and data format still matter
Single-instruction latency reveals more of the underlying machine behavior.
Latency grows roughly linearly from N=64 to N=128, then spikes at N=256. Data format changes the ordering as well, and the intuition is straightforward:
- S8 is fastest because integer tensor operations are power-efficient and simple
- Microscaled formats pay a small extra cost to derive and apply scale factors
Even if an instruction is well chosen, issue efficiency remains a separate problem. To truly approach speed-of-light throughput, a kernel likely needs on the order of 256 to 1024 in-flight MMA instructions so that issue overhead and commit waits are fully amortized.
Most real kernels are nowhere near that. They often carry only 1 to 4 in-flight MMAs, which artificially caps throughput around 78% to 80% of speed-of-light. That is why maximizing MMA instruction size per shared-memory tile is not optional on Blackwell; it is one of the few levers strong enough to move the roofline.
8. Practical rules for kernel writers
Blackwell tuning can be summarized in a few rules:
- Design around TMEM explicitly. Accumulators no longer belong to registers or warps, so the pipeline has to be structured around TMEM residency and transfer boundaries.
- Treat CTA clusters as first-class hardware. Launch geometry, cluster size, and GPC packing all affect whether the kernel actually runs in parallel.
- Choose the copy path by access pattern, not ideology. TMA wins on large regular tiles; LDGSTS wins on responsiveness and irregularity.
- Do not treat remote DSMEM like local SMEM. Use cp.async.bulk for real inter-CTA throughput.
- Expect shared memory to be the bottleneck before tensor math. For SS-mode kernels, shape selection and operand staging dominate achievable performance.
- Use larger MMA shapes and deeper in-flight pipelines whenever possible. Blackwell leaves a lot of performance stranded when kernels are too shallow.
The headline message is simple: Blackwell’s Tensor Cores got faster, but the software problem got harder. The best kernels are no longer the ones that merely maximize FLOPs. They are the ones that understand where the new bottlenecks moved, and then reorganize the entire pipeline around those new limits.