Breaking the FSDP Bottleneck with veScale-FSDP for Structure-Aware Training
Reading the following paper:
The evolution of LLM has heavily relied on Fully Sharded Data Parallel (FSDP) and ZeRO systems for memory-efficient training. However, as model architectures and training recipes grow more sophisticated, conventional FSDP frameworks like DeepSpeed, PyTorch FSDP1/FSDP2, and Megatron-FSDP are hitting structural and performance walls.
This reading note dives into veScale-FSDP, an open-source system developed by ByteDance that redesigns FSDP to natively support modern structure-aware training methods while scaling linearly to tens of thousands of GPUs.
1. The Core Problem: The Misalignment of Sharding and Structure
Modern state-of-the-art models (e.g., Gemini, DeepSeek-V3) increasingly rely on structure-aware training techniques, such as:
- Non-element-wise optimizers: Matrix-based optimizers like Shampoo and Muon require calculations on the original 2D shape of the parameter matrix.
- Block-wise quantization: Optimizers like 8-bit Adam require slicing tensors into discrete 2D blocks to calculate localized scaling factors.
The problem is that existing FSDP systems break these structural boundaries. DeepSpeed and FSDP1 use element-wise sharding, flattening and partitioning parameters arbitrarily across devices, which destroys tensor shape and stride information. PyTorch FSDP2 uses row-wise sharding, dividing tensors evenly along a dimension, but this still fails to align with the specific block sizes required by quantization or matrix optimizers. Consequently, model developers are forced to write complex, intrusive code to handle padding, cross-boundary metadata synchronization, and interleaved memory copies.
2. The Solution Space: Three Pillars of veScale-FSDP
To bridge the gap between PyTorch-native eager APIs and massive-scale cluster performance, veScale-FSDP introduces three tightly coupled innovations:
A. RaggedShard: A Flexible DTensor Placement
Inspired by single-device Jagged/Nested Tensors, veScale introduces RaggedShard, a new placement format for PyTorch Distributed Tensors (DTensor).
- Custom Block Granularity: Instead of blind element or row splits, RaggedShard defines the atomic, non-shardable unit as a customizable tensor block (e.g., a 32x32 matrix tile).
- Composability: It integrates seamlessly with existing DTensor placements. If a model applies Tensor Parallelism (TP) as
Shard(0), RaggedShard adapts via aStridedRaggedShardmechanism to preserve the correct granularity, avoiding cuts into the parallelized dimension.
B. Structure-Aware Planning Algorithm
Collective communications require grouping or “bucketing” tensors to maximize network bandwidth utilization. However, natively packing block-wise ragged shards into a contiguous communication buffer creates issues: it can fracture blocks, cause interleaved memory access, and create imbalanced workloads across devices.
veScale models this grouping as an NP-hard optimization problem (reducible from the classic Partition problem). To solve it practically at runtime (taking less than 0.3 seconds), they implemented a polynomial-time dynamic-programming heuristic.
- How it works: By keeping the default model tensor order (leveraging Transformer regularity) and applying a dynamic programming search, the algorithm finds optimal endpoints to place tensors into a globally balanced communication buffer.
- The Result: It strictly pads between tensors rather than within them, minimizing padding overhead to less than 3% even for massive MoE models like DeepSeek-v3-671B, keeping blocks perfectly intact for zero-communication local quantization.
C. Distributed Buffer (DBuffer)
To back the planned RaggedShard layouts, veScale introduces the DBuffer.
- Zero-Copy Collectives: By leveraging the planning algorithm’s layout, DBuffer provides persistent address mappings to the tensor’s data pointers, achieving zero-copy in-place collective communications.
- Batched Memory Allocation: Standard FSDP systems suffer from memory fragmentation due to implicit stream dependencies and eager per-parameter allocations, artificially inflating peak reserved memory. DBuffer drastically reduces fragmentation via deterministic, batched allocations.
- Group-Level Operators: Instead of launching separate fragmented CUDA kernels for each tensor (like scale or add), DBuffer fuses identical kernels across the group prior to communication, minimizing host blocking time.
3. Key Insights and Performance Gains
The integration of these systems yields remarkable results across various benchmarks (LLaMA-3-70B, GPT-OSS-120B, and internal 2.4T MoE models):
- Throughput & Scaling: veScale-FSDP achieves 5% to 66% higher throughput than existing baseline systems. It scales linearly to 10,000 GPUs and effortlessly maintains Model FLOPS Utilization (MFU) on models up to 2.4 Trillion parameters without performance degradation.
- Memory Efficiency: Peak per-GPU memory usage is reduced by 16% to 30%. By circumventing PyTorch caching allocator fragmentation and eliminating Megatron’s massive 33% padding-inflated buffers, veScale prevents expensive device-side frees that typically stall training.
- Enabling the Cutting Edge: Because RaggedShard naturally preserves block boundaries, implementing 8-bit Adam requires no manual cross-device communication for quantization factors. Similarly, complex optimizers like Muon are implemented cleanly in standard SPMD fashion: the system natively unshards parameters to a root device, executes the Newton-Schulz update on the full 2D matrix, and dynamically redistributes the result.
Conclusion
veScale-FSDP represents a pivotal shift in distributed training infrastructure. Instead of treating large-scale distributed memory as a monolithic flat array to be sliced aggressively, it elevates the system to be structure-aware. By bridging custom block granularity with heavily optimized collective buffers, it allows AI researchers to decouple complex algorithmic innovations (like block-wise quantization and non-element-wise optimizers) from distributed systems engineering.