Summary of StreamRL
Unlocking Scalable and Cost-Effective LLM Training with Reinforcement Learning: StreamRL’s Disaggregated Approach
Reinforcement Learning (RL) has emerged as a crucial post-training technique for large language models (LLMs). It allows LLMs to learn by trial and error from reward signals, complementing traditional pre-training objectives such as next-token prediction, and has been adopted by leading models to strengthen capabilities on complex tasks like coding and mathematics.

The RL workflow for LLMs typically involves two main stages: generation and training. First, the LLM generates samples online from given prompts. These samples are then scored to produce rewards, which drive a parameter update of the LLM. This two-stage loop forms the core of modern RL training for LLMs.

Early RL training frameworks for LLMs such as OpenRLHF and NeMo adopted a disaggregated architecture, in which dedicated computational resources are assigned separately to the generation and training stages. For instance, the generation stage might run on an inference framework like vLLM, while the training stage runs on DeepSpeed or Megatron-LM; updated model weights are transferred back to the generation stage for the next iteration. This architecture allowed quick deployment by reusing existing infrastructure, but it suffered from a notable drawback: resource idleness, where the resources of one stage sit idle while the other stage is active, due to the serial nature of the workflow.
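To make the two-stage loop concrete, here is a minimal sketch in Python. `ToyPolicy`, `reward`, and the stubbed method bodies are illustrative stand-ins, not the paper's implementation: real systems pair an inference engine such as vLLM with a training backend such as DeepSpeed or Megatron-LM.

```python
class ToyPolicy:
    """Stand-in for an LLM with generate/train interfaces."""

    def __init__(self):
        self.version = 0

    def generate(self, prompt):
        # Stage 1: autoregressive decoding (stubbed out here).
        return f"{prompt} -> response from weights v{self.version}"

    def train_step(self, samples, rewards):
        # Stage 2: a PPO-style update from reward signals (stubbed out).
        self.version += 1


def reward(sample):
    # Stand-in for a reward model or rule-based verifier.
    return float(len(sample))


policy = ToyPolicy()
for _ in range(3):                                   # three RL iterations
    samples = [policy.generate(p) for p in ["p1", "p2"]]
    rewards = [reward(s) for s in samples]
    policy.train_step(samples, rewards)              # updated weights feed
                                                     # the next generation
print("policy version:", policy.version)             # -> 3
```

Each iteration is strictly serial here, which is exactly why, in a disaggregated deployment, one stage's hardware idles while the other runs.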
To combat resource idleness, recent frameworks such as verl, ReaL, and RLHFuse shifted to a colocated architecture, in which the generation and training stages share the same GPUs via time-division multiplexing, context-switching between the two. The conventional wisdom became that colocation was superior because of its improved resource utilization.
The Limitations of Colocation at Scale
However, as RL training scales and model sizes grow, the colocated architecture reveals a fundamental limitation: resource coupling. The two stages are forced onto the same resources, which compromises scalability and cost-efficiency, because their workloads are fundamentally different. Generation is often memory-bandwidth-bound: decoding emits one token at a time, so throughput is limited mainly by how fast model weights can be streamed from GPU memory. Training, in contrast, processes entire sequences in parallel and is typically compute-bound. Forcing the two onto shared resources prevents tailoring resource allocation and parallelism strategies to their distinct needs, limiting overall efficiency.
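A back-of-the-envelope roofline calculation illustrates the contrast. The hardware figures (an A100's roughly 312 TFLOPS at fp16 and about 2 TB/s of HBM bandwidth) and the simple intensity model below are my assumptions for illustration, not numbers from the paper.

```python
# Roofline arithmetic: a kernel is compute-bound only if its arithmetic
# intensity (FLOPs per byte moved) exceeds the hardware's ridge point.
peak_flops = 312e12   # assumed A100 fp16 throughput, FLOPs/s
peak_bw = 2.0e12      # assumed A100 HBM bandwidth, bytes/s
ridge = peak_flops / peak_bw
print(f"ridge point: ~{ridge:.0f} FLOPs/byte")        # ~156

# Decoding one token for a batch of B sequences reads all weights once
# (2 bytes/param in fp16) and does ~2 FLOPs/param per sequence, so its
# arithmetic intensity is roughly B FLOPs/byte.
for batch in (8, 64, 256):
    bound = "compute" if batch > ridge else "bandwidth"
    print(f"batch={batch:4d}: intensity ~{batch} FLOPs/byte -> {bound}-bound")
```

Small decode batches sit far below the ridge point and are bandwidth-bound, whereas training (like prefill) processes thousands of tokens per weight read and lands comfortably on the compute-bound side, which is why one resource allocation cannot fit both stages.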
Reconsidering Disaggregation: Flexibility and Scalability
Despite its initial challenges, the disaggregated architecture offers unique advantages that warrant a second look. By eliminating resource coupling, it allows for flexible resource allocation tailored to each stage’s workload. It also supports the use of heterogeneous hardware, enabling the selection of the most suitable machines for generation and training, potentially improving cost-effectiveness.
Furthermore, disaggregation enhances scalability, particularly for training across multiple data centers. Traditional LLM training requires extensive communication among GPUs, making cross-datacenter deployment difficult due to high bandwidth demands. Disaggregated RL, however, needs relatively little inter-stage communication: primarily transferring generated samples (which are small) and model weights (a point-to-point transmission). This traffic pattern suits inter-datacenter links, making cross-datacenter training feasible, and independent generation instances can be distributed across data centers to tap global resource pools.
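A rough estimate shows why this traffic is manageable. All the numbers below (model size, link speed, batch shape) are illustrative assumptions, not figures from the paper.

```python
# Inter-stage traffic per iteration in a disaggregated setup (illustrative).
params = 70e9                        # assumed model size, parameters
weight_bytes = params * 2            # fp16 weights: ~140 GB
link_bps = 100e9                     # assumed 100 Gbps inter-datacenter link
weight_seconds = weight_bytes * 8 / link_bps

batch = 1024                         # assumed prompts per iteration
tokens = 8192                        # assumed average tokens per sample
sample_bytes = batch * tokens * 4    # ~4 bytes per token id
sample_seconds = sample_bytes * 8 / link_bps

print(f"weights: {weight_bytes / 1e9:.0f} GB -> ~{weight_seconds:.1f} s")
print(f"samples: {sample_bytes / 1e6:.0f} MB -> ~{sample_seconds * 1e3:.1f} ms")
```

Under these assumptions a full fp16 weight broadcast takes on the order of ten seconds per iteration, and the samples themselves are tens of megabytes, both small relative to generation and training times that span minutes.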
StreamRL: Unlocking Disaggregation’s Full Potential
To fully realize the potential of the disaggregated architecture and overcome its drawbacks, the paper introduces StreamRL. Designed from first principles with disaggregation in mind, StreamRL specifically addresses two key performance bottlenecks in existing disaggregated RL frameworks: pipeline bubbles and skewness bubbles.
- Tackling Pipeline Bubbles: Pipeline bubbles arise from the serial execution dependency between stages, which leaves GPUs idle. StreamRL addresses this with stream generation: instead of waiting for all samples to be generated before sending them to the trainer, its Stream Generation Service (SGS) returns completed samples to the Trainer in a stream fashion. The Trainer can begin processing samples as soon as they are ready, overlapping the two stages. This "dynamic-batch pipelining" significantly reduces idle time in the training stage. For asynchronous RL algorithms, streaming enables "fully asynchronous pipelining," which removes weight transmission from the critical path and lets generation and training proceed truly in parallel, minimizing bubbles even under workload fluctuations. (A producer-consumer sketch of this idea follows the list.)
- Tackling Skewness Bubbles: Skewness bubbles result from the long-tail distribution of output lengths in LLM generation. A small number of samples are much longer than the rest, and as generation proceeds only these long-tail samples remain, leaving GPUs under-utilized. Decoding, a memory-bandwidth-bound operation, needs large batch sizes for efficiency, but long sequences consume more memory, forcing smaller batches and lower utilization on these long samples. StreamRL tackles this with skewness-aware dispatching and scheduling. A key innovation is the output-length ranker model, which estimates the relative ranks of output lengths before generation starts, identifying the long-tail samples. Predicting exact lengths is hard, but ranking prompts by difficulty is feasible and accurate enough to single out the longest samples. With this information, StreamRL dispatches long-tail samples to dedicated generation instances, where smaller batches let them decode faster, while regular samples are batched together for high utilization. Within each instance, samples are scheduled with a greedy longest-processing-time-first (LPT) heuristic to minimize completion time. (A sketch of dispatching and LPT scheduling follows the pipelining sketch below.)
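First, a producer-consumer sketch of the streaming idea from the pipeline-bubbles item. The function names and the micro-batching policy are assumptions for illustration, not StreamRL's actual SGS/Trainer interfaces.

```python
import queue
import random
import threading
import time

# The generation side pushes each finished sample as soon as it completes;
# the trainer consumes micro-batches while generation is still in flight.
sample_stream = queue.Queue()

def generation_service(num_samples):
    for i in range(num_samples):
        time.sleep(random.uniform(0.01, 0.05))  # uneven decode times
        sample_stream.put(f"sample-{i}")        # stream the sample out
    sample_stream.put(None)                     # end-of-iteration marker

def trainer(micro_batch_size):
    batch = []
    while True:
        item = sample_stream.get()
        if item is None:
            break
        batch.append(item)
        if len(batch) == micro_batch_size:
            # This training step overlaps with generation of the
            # samples that have not finished yet.
            print("train step on", batch)
            batch.clear()
    if batch:
        print("final partial step on", batch)

gen = threading.Thread(target=generation_service, args=(10,))
trn = threading.Thread(target=trainer, args=(4,))
gen.start(); trn.start()
gen.join(); trn.join()
```

Without the queue, the trainer could not start until all ten samples were done; with it, training steps interleave with the remaining generation, which is the essence of dynamic-batch pipelining.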
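Second, a sketch of skewness-aware dispatching plus greedy LPT scheduling from the skewness-bubbles item. The ranker is stubbed as a lookup of hypothetical predicted lengths, and all names and the 10% tail cutoff are my assumptions; only relative order matters to the dispatcher, matching the paper's observation that exact lengths need not be predicted.

```python
# Hypothetical predicted lengths standing in for the ranker's scores.
predicted_len = {f"p{i}": (i * 37) % 100 + 1 for i in range(20)}

def dispatch(prompts, score, tail_fraction=0.1):
    """Split prompts into (long-tail, regular) by predicted length rank."""
    ordered = sorted(prompts, key=score, reverse=True)
    cut = max(1, int(len(ordered) * tail_fraction))
    return ordered[:cut], ordered[cut:]

def lpt_schedule(jobs, length, num_workers):
    """Greedy longest-processing-time-first assignment to workers."""
    loads = [0.0] * num_workers
    plan = [[] for _ in range(num_workers)]
    for job in sorted(jobs, key=length, reverse=True):
        w = loads.index(min(loads))   # least-loaded worker takes the job
        plan[w].append(job)
        loads[w] += length(job)
    return plan, loads

tail, regular = dispatch(list(predicted_len), predicted_len.get)
plan, loads = lpt_schedule(regular, predicted_len.get, num_workers=4)
print("long-tail prompts (dedicated instances):", tail)
print("per-worker loads on regular instances:", loads)
```

Placing the longest jobs first lets the greedy scheduler balance per-worker completion times, so no instance is left decoding a straggler alone at the end of the iteration.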
Performance and Benefits
Experiments demonstrate StreamRL's effectiveness. Compared with existing state-of-the-art colocated systems such as verl, StreamRL improves throughput by up to 2.66×, and it delivers up to 1.33× better cost-effectiveness in heterogeneous, cross-datacenter settings. Communication overhead for cross-datacenter training was found to be minimal, especially for weight updates, which require only point-to-point transmission. Furthermore, the one-step asynchronous RL approach used in StreamRL-Async achieved reward curves and convergence comparable to synchronous PPO, indicating that algorithm-system co-design can improve efficiency without sacrificing model performance on the evaluated tasks.
Conclusion
StreamRL challenges the conventional view that colocated architecture is always superior for RL training of LLMs. By effectively addressing the critical issues of pipeline bubbles and skewness bubbles through streaming and skewness-aware techniques, StreamRL fully unleashes the inherent advantages of the disaggregated architecture: flexible resource allocation, support for heterogeneous hardware, and cross-datacenter scalability. This work highlights the potential benefits of revisiting disaggregated designs and encourages the community to explore this promising direction for future large-scale LLM training.