Notes on "Outperforming cuBLAS on H100: a Worklog" and "Distributed GEMM".

Summary of CUDA matmul kernel optimization on H100

The goal is to write a simple matmul kernel that beats cuBLAS for large matrices. The kernel targets general large, fixed matrix sizes rather than the widely varying shapes seen in LLM inference; the author notes that cuBLAS is hard to beat when matrix sizes vary.

The H100 is more powerful than previous GPU generations and calls for different algorithms. It has more SMs, faster global memory, a faster clock, more shared memory, and a larger, faster L2 cache. Most importantly, the H100 exposes its tensor cores, special hardware units that execute small matrix-matrix multiplications in a single instruction, through new warpgroup-level (wgmma) instructions.

The Hopper tensor core (wgmma) instructions are asynchronous: a warpgroup can issue a batch of them and continue, and the batch executes on the tensor cores while other work proceeds. Keeping many of these instructions in flight is what keeps the tensor cores fully utilized.
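
A minimal sketch of this batching pattern, assuming an sm_90a build; the wgmma.mma_async instructions themselves are elided (their accumulator operand lists are long), so only the fence/commit/wait scaffolding is shown.

    // Batching asynchronous tensor core work on Hopper (compile for sm_90a).
    __device__ void wgmma_batch_sketch() {
        // Make prior register/shared-memory writes visible to the async proxy.
        asm volatile("wgmma.fence.sync.aligned;\n" ::: "memory");

        // ... issue a batch of wgmma.mma_async.sync.aligned.m64nNk16
        //     instructions here, all accumulating into the same registers ...

        // Close the batch: everything issued since the last commit is one group.
        asm volatile("wgmma.commit_group.sync.aligned;\n" ::: "memory");

        // Wait until at most 0 groups are still pending, i.e. the whole batch
        // has finished, before reading the accumulators.
        asm volatile("wgmma.wait_group.sync.aligned 0;\n" ::: "memory");
    }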

TMA (Tensor Memory Accelerator) loads support the swizzled shared memory layouts that the tensor core instructions expect, and they are much faster than hand-written global-to-shared copies. A TMA load is issued by a single thread, which frees the remaining threads for other work.
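
As a host-side sketch, the driver API call below builds the TMA descriptor that the kernel later consumes; the bf16 element type, 64x64 box, and 128-byte swizzle are illustrative assumptions rather than the article's exact configuration.

    #include <cuda.h>        // driver API: cuTensorMapEncodeTiled
    #include <cuda_bf16.h>   // __nv_bfloat16

    // Build a TMA descriptor for a row-major M x K bf16 matrix in global
    // memory (assumes the driver/context is already initialized and that K is
    // a multiple of 8 so the row pitch is 16-byte aligned). The swizzle mode
    // is what makes TMA deposit tiles in SMEM in the layout wgmma expects.
    CUtensorMap make_tma_descriptor(void* gmem_ptr, cuuint64_t M, cuuint64_t K) {
        CUtensorMap desc;
        cuuint64_t global_dim[2]    = {K, M};                      // innermost dim first
        cuuint64_t global_stride[1] = {K * sizeof(__nv_bfloat16)}; // row pitch in bytes
        cuuint32_t box_dim[2]       = {64, 64};                    // SMEM tile shape
        cuuint32_t elem_stride[2]   = {1, 1};
        cuTensorMapEncodeTiled(&desc,
                               CU_TENSOR_MAP_DATA_TYPE_BFLOAT16,
                               /*tensorRank=*/2, gmem_ptr,
                               global_dim, global_stride,
                               box_dim, elem_stride,
                               CU_TENSOR_MAP_INTERLEAVE_NONE,
                               CU_TENSOR_MAP_SWIZZLE_128B,
                               CU_TENSOR_MAP_L2_PROMOTION_NONE,
                               CU_TENSOR_MAP_FLOAT_OOB_FILL_NONE);
        return desc;
    }

The descriptor is then passed to the kernel, where a single thread issues the bulk tensor copy against it while the rest of the block does other work.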

The steps to optimize the CUDA matmul kernel on the H100 are as follows:

  • Use tensor core instructions: Moving the inner loop onto tensor core (wgmma) instructions gives a large speedup over a kernel that only uses the regular CUDA cores.
  • Handle larger output tiles: Bigger tiles allow larger wgmma shapes and more reuse of each memory load, but shared memory capacity and the register file put a ceiling on how large they can be (see the tile-size check after this list).
  • Hide load latencies: With warp specialization, producer warps issue loads while consumer warpgroups run the tensor core math, so load latency is hidden behind compute.
  • Use 2 consumer warpgroups: Splitting the output tile across 2 consumer warpgroups keeps per-thread register usage low enough to avoid register spilling, which would otherwise slow the kernel down.
  • Hide store latencies: Scheduling thread blocks so that the stores for one output tile overlap with the loads for the next tile hides store latency.
  • Use the L2 cache: Scheduling nearby output tiles at the same time lets their shared input tiles be served from the L2 cache instead of global memory (a block-scheduling sketch follows this list).
  • Use faster barriers: Replacing the stock barriers with a leaner barrier implementation written directly against the PTX barrier primitives cuts synchronization overhead.
  • Use thread block clusters: Grouping thread blocks into clusters enables TMA multicast, which loads a tile from global memory once and broadcasts it to the shared memory of several SMs (a launch sketch follows this list).
  • Micro-optimizations: Reordering stores, skipping the L1/L2 cache for writes, and avoiding unnecessary zeroing of accumulator registers each buy a little more performance.
  • Async stores: Writing results to shared memory and letting TMA store them to global memory asynchronously improves performance.
  • Hilbert curves: Ordering the output tiles along a Hilbert curve further improves L2 hit rates and overall performance (see the Hilbert-curve sketch after this list).
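
The tile-size ceiling can be sanity-checked with quick arithmetic. The 128x256 output tile with a 64-wide K slice below is an assumed shape, not necessarily the article's configuration; the per-SM capacities are the H100's published limits.

    // Back-of-the-envelope check for one candidate tile shape (assumed):
    // bf16 inputs, f32 accumulators.
    constexpr int BM = 128, BN = 256, BK = 64;

    // One pipeline stage of A and B input tiles in shared memory (2 B per bf16).
    constexpr int smem_per_stage  = (BM * BK + BN * BK) * 2;  // 49,152 bytes

    // f32 accumulators for the whole output tile live in registers.
    constexpr int accum_registers = BM * BN;                  // 32,768 registers

    // An H100 SM offers about 228 KB of shared memory (room for ~4 such
    // stages) and 65,536 32-bit registers, so the accumulators alone occupy
    // half the register file; this is why the tile is split across 2 consumer
    // warpgroups and why a much larger tile would spill.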
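
The "schedule nearby tiles together" idea can be sketched as a remapping of the linear block index; GROUP_M is a hypothetical tuning knob, and the article's kernel may group tiles differently.

    // Remap the linear block index into an output-tile coordinate so that
    // blocks scheduled close in time compute nearby (m, n) tiles and their
    // shared A/B input tiles are likely still resident in L2.
    __device__ void tile_coords(int num_tiles_m, int num_tiles_n,
                                int* tile_m, int* tile_n) {
        constexpr int GROUP_M = 16;                  // hypothetical group height
        int bid        = blockIdx.x;
        int group_size = GROUP_M * num_tiles_n;      // blocks per row group
        int group      = bid / group_size;
        int first_m    = group * GROUP_M;
        int rows       = min(GROUP_M, num_tiles_m - first_m);  // last group may be short
        *tile_m = first_m + (bid % group_size) % rows;
        *tile_n = (bid % group_size) / rows;
    }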
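
A host-side sketch of the cluster launch; the cluster size of 2 and the kernel name are placeholders. Inside such a kernel, the TMA load can multicast one tile into the shared memory of every block in the cluster.

    #include <cuda_runtime.h>

    // Kernel compiled for a fixed cluster shape; the body is elided. The
    // __cluster_dims__ attribute must agree with the launch attribute below.
    __global__ void __cluster_dims__(2, 1, 1) gemm_kernel() {}

    // Launch with an explicit cluster dimension so pairs of thread blocks can
    // share TMA-multicast loads.
    void launch_with_clusters(dim3 grid, dim3 block) {
        cudaLaunchConfig_t config = {};
        config.gridDim  = grid;
        config.blockDim = block;

        cudaLaunchAttribute attr{};
        attr.id = cudaLaunchAttributeClusterDimension;
        attr.val.clusterDim.x = 2;   // 2 blocks per cluster along x
        attr.val.clusterDim.y = 1;
        attr.val.clusterDim.z = 1;
        config.attrs    = &attr;
        config.numAttrs = 1;

        cudaLaunchKernelEx(&config, gemm_kernel);
    }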
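
The Hilbert-curve schedule boils down to converting each block's position along the curve into (x, y) tile coordinates. Below is the standard iterative conversion for a power-of-two n x n grid of tiles; the article's kernel may implement the mapping differently.

    // Map distance d along a Hilbert curve covering an n x n grid (n a power
    // of two) to tile coordinates (x, y). Consecutive d values land on
    // spatially adjacent tiles, which keeps their shared inputs hot in L2.
    __host__ __device__ void hilbert_d2xy(int n, int d, int* x, int* y) {
        int rx, ry, t = d;
        *x = 0;
        *y = 0;
        for (int s = 1; s < n; s *= 2) {
            rx = 1 & (t / 2);
            ry = 1 & (t ^ rx);
            if (ry == 0) {                         // rotate the quadrant
                if (rx == 1) {
                    *x = s - 1 - *x;
                    *y = s - 1 - *y;
                }
                int tmp = *x; *x = *y; *y = tmp;   // swap x and y
            }
            *x += s * rx;
            *y += s * ry;
            t /= 4;
        }
    }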

The author implemented these optimizations and ended up outperforming cuBLAS by 7% at N = 4096, with the final kernel reaching 764 TFLOPS. The advantage is not uniform across matrix sizes; at N = 512, for example, the kernel was only 2% faster than cuBLAS. The author also notes that further gains are possible by auto-tuning the kernel parameters and by tweaking GPU settings to favor tensor core operations over L2 cache operations.

Summary of Distributed GEMM

Distributed GEMM leverages the high-bandwidth, low-latency interconnect provided by NVLink to implement tensor parallelism for GEMMs. The approach builds on NVIDIA's open-source linear algebra framework, CUTLASS, plus a few additional calls to the CUDA runtime and driver APIs.

Tensor parallelism is a model parallelism technique where weights are distributed across multiple processors. This is important for larger models, like LLMs, where memory capacity is a concern.

Distributed GEMM aims to overlap communication with computation to reduce communication overhead. Tensor parallelism relies heavily on collectives such as AllGather and ReduceScatter to gather distributed data or scatter partial results, respectively. These collectives can be broken into multiple stages, which allows communication and computation to be pipelined. By replacing the collectives with point-to-point transfers, the first pipeline stage can start computing immediately on the shard that is already local while the remaining shards stream in. In theory, this eliminates all exposed communication.
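
A purely conceptual sketch of that pipeline; pull_peer_shard() and local_gemm() are hypothetical helpers, and the real implementation drives the stages with CUDA graphs and P2P copies rather than a host loop.

    // Hypothetical helpers: pull_peer_shard() starts an asynchronous NVLink
    // P2P copy of a peer's shard; local_gemm() runs the GEMM stage on a shard
    // that is already on-device.
    void pull_peer_shard(int peer);
    void local_gemm(int shard);

    // Conceptual stage loop for an AllGather-style Distributed GEMM on one of
    // tp ranks.
    void pipelined_tp_gemm(int tp, int rank) {
        for (int stage = 0; stage < tp; ++stage) {
            int shard = (rank + stage) % tp;
            if (stage + 1 < tp) {
                // Prefetch the next shard over NVLink while this stage
                // computes, hiding the communication behind the local GEMM.
                pull_peer_shard((rank + stage + 1) % tp);
            }
            local_gemm(shard);   // stage 0 uses the locally resident shard
        }
    }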

The traditional way to implement tensor parallelism relies on explicit communication primitives from libraries such as NCCL, or on fusing communication into specific kernels. In contrast, Distributed GEMM uses host-side CuTe layout logic, peer-to-peer access, and the CUDA graph APIs to pipeline local GEMMs. This approach is more extensible and sidesteps the register pressure, and potential spills, that fusing communication into compute kernels can introduce.

Distributed GEMM works by prepending a small prologue of instructions to the GEMM kernel. These instructions are independent of the kernel's internal structure and programming model, so existing GEMM kernels can be transformed into Distributed GEMM kernels with little effort.

The following are some of the key advantages of using Distributed GEMMs:

  • Implemented entirely using CUTLASS, so they can take advantage of the performance-optimized kernels provided by NVIDIA.
  • Require very few code changes, so they are easy to implement.
  • Allow for the pipelining of local GEMMs, which can improve performance by hiding communication latency.

Implementation using CUTLASS:

CUTLASS is a framework for writing high-performance kernels that target NVIDIA Tensor Cores. CUTLASS kernels are typically designed to run on a single GPU, but they can be easily adapted to run in a distributed setting using Distributed GEMMs.

To create a Distributed GEMM kernel, the first step is to define a tensor parallelism strategy using CuTe primitives; this strategy fixes the sharding, tiling, and buffering patterns. A GEMM kernel is then instantiated through a Distributed GEMM wrapper. The rest of the implementation looks like running a single-GPU GEMM, except that contexts and streams are set up on every participating device and peer access is enabled between them.
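
A sketch of the multi-device setup step, assuming one stream per GPU and all-pairs peer access (which NVLink-connected H100s support); the CUTLASS-side types are omitted.

    #include <cuda_runtime.h>
    #include <vector>

    // Create one stream per GPU and enable peer (P2P) access between every
    // pair of devices so each GPU can read its neighbors' shards over NVLink.
    std::vector<cudaStream_t> setup_tp_devices(int tp_size) {
        std::vector<cudaStream_t> streams(tp_size);
        for (int dev = 0; dev < tp_size; ++dev) {
            cudaSetDevice(dev);
            for (int peer = 0; peer < tp_size; ++peer) {
                if (peer != dev) {
                    cudaDeviceEnablePeerAccess(peer, 0);  // flags must be 0
                }
            }
            cudaStreamCreate(&streams[dev]);
        }
        return streams;
    }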

Performance:

Distributed GEMM has been shown to deliver significant speedups over traditional tensor parallelism implementations, up to 24% in some cases. Its performance depends on the problem size and on the communication latency of the underlying network.

Distributed GEMM is a promising new approach for implementing tensor parallelism on NVLink-enabled systems. It offers a number of advantages over traditional approaches, including ease of implementation, performance, and extensibility. It has the potential to become a key technology for scaling up large language models and other deep learning applications.

Usage of Programmatic Dependent Launch (PDL)

PDL (Ref1 and Ref2) is a feature introduced with the Hopper architecture.

PDL further reduces the latency between the pipelined GEMM kernel launches. CUDA Graphs make the launches asynchronous, but each launch still carries a small amount of latency. PDL lets a dependent kernel begin executing before the kernel it depends on has fully completed, which can reduce or eliminate this remaining latency and further improve Distributed GEMM performance.
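
A sketch of the PDL pattern for two dependent kernels on one stream; the device-side calls and the launch attribute are the CUDA 12 API for Hopper, while the kernel bodies and launch shapes are placeholders.

    #include <cuda_runtime.h>

    // First kernel in the pipeline: signals that dependent kernels may begin
    // their preamble before this grid has fully finished.
    __global__ void producer_kernel() {
        // ... main work ...
        cudaTriggerProgrammaticLaunchCompletion();
        // ... epilogue that no dependent kernel reads ...
    }

    // Second kernel: may start early (e.g. set up descriptors, prefetch), then
    // must wait on the dependency before touching the producer's output.
    __global__ void consumer_kernel() {
        // ... preamble work that does not depend on the producer ...
        cudaGridDependencySynchronize();   // wait for the producer's results
        // ... work that consumes the producer's output ...
    }

    void launch_with_pdl(dim3 grid, dim3 block, cudaStream_t stream) {
        producer_kernel<<<grid, block, 0, stream>>>();

        cudaLaunchAttribute attr{};
        attr.id = cudaLaunchAttributeProgrammaticStreamSerialization;
        attr.val.programmaticStreamSerializationAllowed = 1;

        cudaLaunchConfig_t config = {};
        config.gridDim  = grid;
        config.blockDim = block;
        config.stream   = stream;
        config.attrs    = &attr;
        config.numAttrs = 1;
        cudaLaunchKernelEx(&config, consumer_kernel);
    }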