This overview is based on the following papers: ByteDance’s DAPO, Ant Research’s AReaL, Mistral AI’s Magistral, MiniMax’s MiniMax-M1, and Google DeepMind’s Gemini 2.5 technical reports.

Optimizing large-scale RL training of LLMs for efficiency and performance is a complex endeavor that involves innovations across reinforcement learning algorithms, model architectures, and underlying infrastructure.

Reinforcement Learning (RL) Training Recipes/Algorithms

The core of LLM training often involves large-scale Reinforcement Learning, which aims to elicit complex reasoning and align model behavior. Each system introduces novel algorithmic modifications to enhance this process.

  • DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization): This algorithm, fully open-sourced, is designed to overcome common issues like entropy collapse, reward noise, and training instability observed in traditional PPO/GRPO.

    • Clip-Higher: Promotes diversity and prevents entropy collapse by decoupling the lower ($\epsilon_{low}$) and higher ($\epsilon_{high}$) clipping ranges for the importance sampling ratio, specifically giving low-probability “exploration” tokens more room to increase their probability. This enhances the policy’s entropy and leads to more diverse samples. (A code sketch combining several of these DAPO modifications follows the DAPO bullets.)

    • Dynamic Sampling: Improves training efficiency and stability by filtering out prompts where all outputs are either fully correct (accuracy=1) or fully incorrect (accuracy=0), ensuring that each training batch contains samples with effective gradients. This helps maintain a consistent number of prompts with meaningful gradients and can even accelerate convergence.

    • Token-Level Policy Gradient Loss: Addresses challenges in long Chain-of-Thought (CoT) reasoning by calculating loss at the token level rather than the sample level. This ensures that tokens within longer responses contribute proportionally to the overall loss, preventing unhealthy increases in entropy and response length from low-quality long samples.

    • Overlong Reward Shaping: Reduces reward noise and stabilizes training by introducing a soft length penalty for excessively long responses instead of a punitive reward for truncation. This helps the model avoid excessively long outputs without penalizing sound reasoning that happens to be lengthy.

    • Rule-based Reward Modeling: Directly uses the final accuracy of a verifiable task (e.g., math problems) as the outcome reward, transforming answers into integers for accurate rule-based signals and avoiding reward hacking.
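
Since several of these modifications are easiest to see in code, below is a minimal PyTorch-style sketch of a DAPO-like update step: dynamic sampling filters out zero-gradient prompt groups, the loss uses decoupled clip ranges with token-level aggregation, and a soft overlong penalty shapes the reward. Function and tensor names are illustrative, the default clip values follow the paper’s reported settings, and the length constants are placeholders rather than the exact released configuration.

```python
import torch

def dapo_token_level_loss(logp_new, logp_old, advantages, mask,
                          eps_low=0.2, eps_high=0.28):
    """Token-level policy-gradient loss with decoupled clip ranges (Clip-Higher).

    logp_new / logp_old: [batch, seq] log-probs of the sampled tokens under the
    current and behavior policies; advantages: [batch, seq] (sequence-level
    reward broadcast to tokens); mask: 1 for response tokens, 0 for padding.
    """
    ratio = torch.exp(logp_new - logp_old)                       # importance ratio r_t
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high)  # asymmetric clipping
    per_token = -torch.min(ratio * advantages, clipped * advantages)
    # Token-level aggregation: every token in the batch carries equal weight,
    # rather than first averaging within each (possibly very long) sample.
    return (per_token * mask).sum() / mask.sum().clamp(min=1)

def dynamic_sampling_filter(group_rewards):
    """Keep only prompt groups whose outcomes are neither all-correct nor
    all-wrong, so every retained group yields a non-zero gradient.
    group_rewards: [num_prompts, group_size] binary outcome rewards."""
    acc = group_rewards.float().mean(dim=-1)
    return (acc > 0.0) & (acc < 1.0)

def soft_overlong_penalty(length, max_len=20480, cache=4096):
    """Soft length penalty: zero below max_len - cache, decreasing linearly to
    -1 inside the buffer zone, and -1 once the hard limit is exceeded."""
    if length <= max_len - cache:
        return 0.0
    if length <= max_len:
        return (max_len - cache - length) / cache
    return -1.0
```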

  • AReaL (Asynchronous Reinforcement Learning System): Focuses on stabilizing RL training in an asynchronous environment where data staleness is a significant concern.

    • Staleness-Aware Training: Introduces a hyperparameter ($\eta$) to control the maximum permitted data staleness in each training batch. This is implemented by dynamically controlling the throughput of generation requests and prioritizing older trajectories from the data buffer.

    • Decoupled PPO Objective: Modifies the PPO algorithm by disentangling the behavior policy (used for sampling trajectories) from a proximal policy (a recent policy that serves as the regularization target). This crucial innovation allows AReaL to effectively leverage samples generated from much older model versions without performance degradation, which is vital for asynchronous systems where training batches may contain data from multiple policy versions. It also helps maintain algorithmic correctness when combining interruptible generation with policy updates. A minimal sketch of this objective follows the AReaL bullets.

    • Rule-based Reward Function: Similar to DAPO, it uses a rule-based reward that provides feedback on the final action, indicating answer correctness.
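
A minimal sketch of the decoupled PPO loss, assuming per-token log-probabilities are available under three policies: the behavior policy that generated the trajectory, a recent proximal policy snapshot, and the current policy. Tensor names and the clip value are illustrative, not AReaL’s released code.

```python
import torch

def decoupled_ppo_loss(logp_new, logp_prox, logp_behav, advantages, mask,
                       eps=0.2):
    """Decoupled PPO: clip against a recent *proximal* policy instead of the
    (possibly stale) behavior policy, and correct for sampling under the
    behavior policy with a separate, non-differentiated importance weight."""
    # Behavior correction term, treated as a constant.
    behav_weight = torch.exp(logp_prox - logp_behav).detach()
    # Standard PPO ratio and clipping, but measured against the proximal policy.
    ratio = torch.exp(logp_new - logp_prox)
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    surrogate = torch.min(ratio * advantages, clipped * advantages)
    per_token = -behav_weight * surrogate
    return (per_token * mask).sum() / mask.sum().clamp(min=1)
```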

  • Magistral: Builds upon Group Relative Policy Optimization (GRPO) with several modifications for improved stability and performance.

    • Eliminating KL Divergence: Removes the KL penalty term, since the policy diverges substantially from the reference model regardless, and maintaining a copy of the reference model incurs unjustified compute cost.

    • Loss Normalization: Normalizes loss by the total length of generations in a group to prevent length biases between responses within a single group.

    • Advantage Normalization: Estimates the advantage of each token as the reward relative to the group mean, and further normalizes advantages within each minibatch (sketched in code after the Magistral bullets).

    • Relaxing Trust Region’s Upper Bound (Clip-Higher): Adopts the Clip-Higher strategy, similar to DAPO, by increasing $\epsilon_{high}$ (tuned between 0.26 and 0.28) to promote exploration of rare but insightful reasoning steps and prevent entropy collapse.

    • Eliminating Non-Diverse Groups: Filters out groups of generations where all responses are either entirely correct or entirely wrong (resulting in zero advantage) to reduce noise and ensure effective gradients in training batches.

    • Sophisticated Reward Shaping: Evaluates model generations on four axes: formatting, correctness, length, and language consistency.

      • Formatting: Rewards adherence to specific output structures (e.g., <think>…</think> tags around the reasoning, \boxed{} for final math answers, markdown code blocks for code).

      • Correctness: Utilizes rule-based verifiers for math (SymPy) and execution-based unit tests for code.

      • Length Penalty: Applies a soft length penalty for excessively long responses.

      • Language Consistency: Incorporates a novel reward based on a fastText classifier to ensure all parts of a response (problem, thoughts, answer) use the same language, promoting multilingual reasoning without code-switching.

    • Adaptive Training/Curriculum Learning: Involves gradually increasing data difficulty, maximal allowed completion length, and adjusting concurrent requests/batch sizes as model performance improves to manage memory burden.
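
The advantage and loss normalization choices above amount to a few lines of tensor code. Here is a hedged sketch; the function names, the small epsilon constant, and the exact normalization order are assumptions rather than Magistral’s implementation.

```python
import torch

def magistral_group_advantages(rewards):
    """rewards: [num_groups, group_size] scalar rewards per generation.

    Returns advantages (reward minus group mean) for the surviving groups and a
    boolean mask over groups: non-diverse groups (zero within-group variance,
    i.e. all-correct or all-wrong) are eliminated, and the remaining advantages
    are normalized across the minibatch."""
    adv = rewards - rewards.mean(dim=-1, keepdim=True)
    keep = rewards.std(dim=-1) > 0          # drop all-correct / all-wrong groups
    adv = adv[keep]
    adv = (adv - adv.mean()) / (adv.std() + 1e-8)
    return adv, keep

def length_normalized_loss(per_token_loss, mask):
    """Normalize the summed token loss by the total number of generated tokens
    across the group, so long and short responses are weighted consistently."""
    return (per_token_loss * mask).sum() / mask.sum().clamp(min=1)
```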

  • MiniMax-M1: Introduces a novel RL algorithm and strategies to manage instability during scaling.

    • CISPO (Clipped IS-weight Policy Optimization): A novel algorithm that clips importance sampling (IS) weights instead of token updates. This design avoids dropping tokens (even those associated with large updates), inherently maintains entropy for stable exploration, and leverages all tokens in gradient computations. It empirically achieves a 2x speedup compared to DAPO. Similar to other recent works, it removes the KL penalty term and incorporates DAPO’s dynamic sampling and length penalty. A minimal sketch of the CISPO loss follows the MiniMax-M1 bullets.

    • Early Truncation via Repetition Detection: Implements a heuristic to halt generation if 3,000 consecutive tokens each have a probability above 0.99, which prevents pathologically long and repetitive responses and improves generation throughput.

    • Hybrid Reward Models: Uses rule-based verifiers for reasoning-intensive tasks (mathematical, logical, competitive programming, software engineering), and generative reward models for general domain tasks (QA, creative writing).

    • Addressing Bias of Generative Reward Models: Employs continuous online monitoring of length bias during RL training, triggering GenRM recalibration upon detecting detrimental length-seeking behavior. Supplements with RL-side techniques like reward shaping, value clipping, and normalization.

    • Curriculum of Diverse Data: Implements a carefully managed curriculum, starting with reasoning-intensive tasks with rule-based rewards and gradually mixing in general domain tasks. This prevents catastrophic forgetting of specialized skills while fostering broader generalization.

    • Staged Length Scaling Strategy: Gradually increases the output length (e.g., from 40K to 80K tokens) based on empirical indicators like perplexity convergence and the 99th percentile of output lengths approaching the current context window limit.

    • Addressing Training Instability During Scaling: Identifies pattern collapse (incoherent text in later stages of long-sequence training) caused by disproportionately large negative gradients. Solutions include repetitive pattern detection with early stopping, combined sample-level and token-level normalization, and decreasing gradient clipping thresholds and $\epsilon^{IS}_{high}$.
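
A minimal sketch of the CISPO idea (clip and stop-gradient the IS weight, keep a REINFORCE-style term for every token) together with the repetition-based early stop. The clip bound, tensor layout, and function names are assumptions for illustration, not the released implementation.

```python
import torch

def cispo_loss(logp_new, logp_old, advantages, mask, eps_is_high=2.0):
    """Clip the importance-sampling weight itself and detach it, so it scales
    the gradient but is never used to drop a token's update entirely."""
    ratio = torch.exp(logp_new - logp_old)
    # Upper clip only (the ratio is already bounded below by 0); the detached
    # weight multiplies a plain policy-gradient term for every token.
    weight = torch.clamp(ratio, max=1.0 + eps_is_high).detach()
    per_token = -weight * advantages * logp_new
    return (per_token * mask).sum() / mask.sum().clamp(min=1)

def repetition_early_stop(token_probs, window=3000, threshold=0.99):
    """Heuristic from the text: stop generation once `window` consecutive
    tokens each exceed `threshold` probability, a signature of degenerate
    repetition. token_probs: iterable of per-token probabilities."""
    run = 0
    for p in token_probs:
        run = run + 1 if p > threshold else 0
        if run >= window:
            return True
    return False
```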

  • Gemini 2.5: Focuses on post-training advancements including RL for model refinement.

    • Thinking Models: A key innovation where models are trained with RL to utilize additional inference-time compute (“Thinking” stage) to arrive at more accurate answers. This “Thinking” is natively integrated across all domains and allows for a controllable “thinking budget” to trade off performance and cost.

    • Verifiable and Model-based Generative Rewards: Emphasizes leveraging both verifiable rewards and sophisticated model-based generative rewards to provide scalable and nuanced feedback signals for RL.

    • Model-Assisted Quality Control: Leverages the model itself to assist in data quality control during post-training, enhancing efficiency.

    • Algorithmic Enhancements for Stability: Includes algorithmic changes to the RL process that have improved stability during longer training.

Model Architectures

The architectural choices directly impact a model’s capacity for reasoning, context handling, and computational efficiency.

  • DAPO: While DAPO is a system for LLM RL, it primarily operates on a pre-trained base model, specifically Qwen2.5-32B. It does not introduce a novel core LLM architecture.

  • AReaL: Similar to DAPO, AReaL is an RL system designed for existing LLMs. Its architectural innovation lies in its system design (decoupling generation from training) rather than in the LLM’s internal structure itself.

  • Magistral: Built on Mistral Medium 3 and Mistral Small 3 base models. Its contribution is in the RL framework and system architecture that enables efficient training of these existing models for reasoning.

  • MiniMax-M1: Introduces a novel model architecture explicitly designed for efficiency and long-context reasoning.

    • Hybrid Mixture-of-Experts (MoE): Features a sparse MoE architecture. The model has 456 billion parameters in total, with 45.9 billion parameters activated per token, across 32 experts.

    • Lightning Attention: Combines the MoE with Lightning Attention, an I/O-aware linear attention variant. The design alternates between a transformer block with softmax attention and multiple transnormer blocks with lightning attention; a structural sketch of this interleaving follows the MiniMax-M1 architecture bullets. This enables efficient scaling of reasoning lengths to hundreds of thousands of tokens while consuming significantly fewer FLOPs (e.g., 25% of DeepSeek R1’s FLOPs at a 100K-token generation length).

    • Native Long Context: This hybrid attention mechanism allows MiniMax-M1 to natively support up to 1 million tokens context length and 80K tokens generation length, greatly exceeding other open-weight LRMs.
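
As a rough structural illustration of that interleaving, the sketch below stacks several lightning (linear) attention blocks per softmax attention block. The block classes are stubs, and the interleaving ratio here is an assumption for illustration, not the published layer configuration; the actual lightning attention kernel and MoE feed-forward layers are not reproduced.

```python
from torch import nn

# Stub block types; real implementations would hold the attention and MoE layers.
class LightningAttentionBlock(nn.Module): ...
class SoftmaxAttentionBlock(nn.Module): ...

def build_hybrid_stack(num_layers: int, softmax_every: int = 8) -> nn.ModuleList:
    """Interleave linear-attention (lightning) blocks with a periodic softmax-
    attention block, e.g. one softmax block per `softmax_every` layers."""
    layers = []
    for i in range(num_layers):
        if (i + 1) % softmax_every == 0:
            layers.append(SoftmaxAttentionBlock())
        else:
            layers.append(LightningAttentionBlock())
    return nn.ModuleList(layers)
```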

  • Gemini 2.5: Leverages Google’s flagship model family with integrated architectural advancements.

    • Sparse Mixture-of-Experts (MoE) Transformer: Built as sparse MoE transformers, which decouples total model capacity from per-token computation and serving cost by dynamically activating only a subset of parameters (a generic top-k routing sketch follows the Gemini architecture bullets).

    • Natively Multimodal: Supports text, vision, and audio inputs. Architectural changes improved image and video understanding, allowing processing of up to 3 hours of video content within a 1M token context window.

    • Enhanced Training Stability: The Gemini 2.5 series made considerable progress in enhancing large-scale training stability, signal propagation, and optimization dynamics, boosting performance directly out of pre-training.

    • Long Context Processing: Continues to excel in processing long-context queries, capable of handling sequences up to 1 million tokens.
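
To make the capacity/compute decoupling concrete, here is a generic top-k expert-routing layer. It is not Gemini’s or MiniMax’s actual router; the dimensions, expert count, and absence of load balancing or capacity limits are all simplifications.

```python
import torch
import torch.nn.functional as F
from torch import nn

class TopKMoELayer(nn.Module):
    """Generic sparse-MoE feed-forward layer: route each token to its top-k
    experts and combine their outputs with renormalized router weights."""

    def __init__(self, d_model=512, d_ff=2048, num_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                          # x: [tokens, d_model]
        weights, idx = torch.topk(F.softmax(self.router(x), dim=-1), self.k, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for slot in range(self.k):                 # only k experts run per token
            for e, expert in enumerate(self.experts):
                sel = idx[:, slot] == e
                if sel.any():
                    out[sel] += weights[sel, slot, None] * expert(x[sel])
        return out
```

With 8 experts and k=2, each token touches only a quarter of the expert parameters, so total capacity can grow with the number of experts while per-token compute and serving cost stay roughly fixed.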

Infrastructure Support

Efficient infrastructure is critical for handling the immense computational demands of LLM training, especially for large-scale RL.

  • DAPO:

    • Framework: Built on the verl framework, which is open-sourced.

    • Optimizer: Uses AdamW optimizer.

    • Dataset: Employs a carefully curated and processed dataset, which is also open-sourced.

  • AReaL: Focuses on a completely asynchronous and decoupled system design.

    • Asynchronous System Architecture: Fundamentally decouples LLM generation (rollout) from training. Rollout workers continuously generate outputs without waiting, while training workers update the model whenever a batch is collected. This resolves GPU underutilization and poor scalability of synchronous systems.

    • Components: Comprises Interruptible Rollout Workers, Reward Service, Trainer Workers, and a Rollout Controller.

    • Interruptible Rollout Workers: These workers can be interrupted mid-generation to load new model parameters; KV caches computed with the old weights are discarded and recomputed with the new weights before the interrupted generations resume.

    • System-Level Optimizations: Decouples GPU computation from CPU operations (like reward computation and data transfer) by running them in separate threads and pipelining the workflow.

    • Dynamic Batching: Employs a padding-free sequence packing strategy and a dynamic allocation algorithm to balance token distribution across micro-batches, maximizing GPU memory utilization and minimizing the number of forward-backward passes for variable-length outputs (a simplified sketch follows the AReaL infrastructure bullets).

    • Hardware: Evaluated on an H800 GPU cluster, utilizing NVLink for intra-node and RoCE for inter-node communication.

    • Framework Integration: Implemented using Python and PyTorch, combining SGLang for generation serving and Megatron-Core for training backend, managed by SLURM.
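
A simplified stand-in for that allocation step, using a greedy longest-first heuristic; the paper’s exact algorithm may differ. With padding-free packing, per-micro-batch cost tracks total tokens, so each sequence is assigned, longest first, to the micro-batch currently holding the fewest tokens.

```python
def balance_microbatches(seq_lengths, num_microbatches):
    """Greedy token balancing across micro-batches for variable-length outputs.

    Returns, for each micro-batch, the indices of the sequences packed into it
    and the resulting token totals."""
    order = sorted(range(len(seq_lengths)), key=lambda i: -seq_lengths[i])
    buckets = [[] for _ in range(num_microbatches)]
    totals = [0] * num_microbatches
    for i in order:
        j = totals.index(min(totals))      # least-loaded micro-batch so far
        buckets[j].append(i)
        totals[j] += seq_lengths[i]
    return buckets, totals

# Example: highly variable RL rollout lengths end up roughly token-balanced.
lengths = [8192, 512, 3000, 20000, 1024, 15000, 256, 7000]
buckets, totals = balance_microbatches(lengths, num_microbatches=2)
print(totals)   # -> [27512, 27472]
```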

  • Magistral: Emphasizes a scalable, online RL infrastructure.

    • Asynchronous Pipeline: Generators operate continuously at maximum throughput without waiting for trainers. New weights are broadcast via NCCL to generators without discarding in-flight sequences, reducing update time. Key-value caches may be slightly outdated, but recomputing them is found to be unnecessary thanks to off-policy corrections.

    • Trainer Optimization: Defines a batch as a fixed number of completions rather than tokens. Uses a greedy collation algorithm for microbatches by sorting sequences by descending size, reducing padding by 19% and ensuring homogeneous workload.

    • Data Selection/Filtering: Implements a two-stage difficulty filtering pipeline for math problems, using an initial weaker model and then a stronger RL-trained model to curate “goldilocks” difficulty datasets (a simplified sketch follows the Magistral infrastructure bullets). For code, it processes contest data, generates tests, and ensures test agreement.

    • Model Variants & Distillation: Explores training Magistral Medium with pure RL and Magistral Small with SFT on traces from Magistral Medium followed by RL, showing that distillation plus RL allows smaller models to approximate the teacher’s performance.
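
A simplified sketch of the two-stage difficulty filtering: grade each problem by pass rate under a grader model, keep only problems of intermediate difficulty, and repeat with a stronger checkpoint. The thresholds, sample counts, and grader names are placeholders, not the paper’s values.

```python
def difficulty_filter(problems, pass_count_fn, n_samples=16,
                      min_pass=0.1, max_pass=0.9):
    """Keep problems of intermediate ("goldilocks") difficulty for one grader.

    pass_count_fn(problem, n) -> number of correct answers out of n attempts.
    Problems that are always solved are too easy; problems that are never
    solved are likely too hard (or mis-labeled) to give useful gradients."""
    kept = []
    for p in problems:
        pass_rate = pass_count_fn(p, n_samples) / n_samples
        if min_pass <= pass_rate <= max_pass:
            kept.append(p)
    return kept

# Two-stage pipeline: a first pass with the initial, weaker model, then a
# second pass re-grading the survivors with a stronger RL-trained checkpoint.
# stage1 = difficulty_filter(all_problems, weak_model_pass_count)
# stage2 = difficulty_filter(stage1, strong_model_pass_count)
```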

  • MiniMax-M1: Aims for efficient and scalable RL training on a novel architecture.

    • Continual Pre-training: Builds on MiniMax-Text-01 with 7.5T additional tokens from a carefully curated, reasoning-intensive corpus. This includes refined Web and PDF parsing, heuristic cleaning rules for math and code, semantic deduplication, and an increased proportion (70%) of STEM, code, book, and reasoning-related data. It uses a smooth extension of context length (32K to 1M) across four stages to avoid gradient explosion.

    • Supervised Fine-Tuning (SFT): Conducts SFT to instill specific Chain-of-Thought (CoT) reasoning patterns using high-quality examples, covering diverse domains like math, coding, STEM, writing, and QA.

    • Hardware: Full RL training was completed in three weeks on 512 H800 GPUs, at a rental cost of approximately $0.53M USD.

    • Computational Precision Mismatch Fix: Identified and resolved a precision mismatch between training-mode and inference-mode probability calculations, particularly in the LM head, by increasing its precision to FP32. This improved the correlation between training and inference probabilities and enabled rewards to increase successfully (see the short sketch after the MiniMax-M1 infrastructure bullets).

    • Optimizer Hyperparameter Tuning: Fine-tuned AdamW parameters ($\beta_1=0.9$, $\beta_2=0.95$, $\epsilon=10^{-15}$) to stabilize training, especially for the wide range of gradient magnitudes observed in MiniMax-M1.

    • Diverse Data Curation: Includes verifiable problems (math, logical reasoning, competitive programming, real-world software engineering with sandboxes for execution-based rewards) and general domain tasks with model-based generative rewards.
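
The two stability fixes above reduce to a few lines of code. A hedged sketch, assuming a standard causal LM with an `lm_head` attribute; the attribute path and the learning rate are placeholders, while the betas/eps values are the ones reported above.

```python
import torch
from torch import nn
from torch.optim import AdamW

def apply_stability_fixes(model: nn.Module) -> AdamW:
    """Apply the precision and optimizer fixes described above to a causal LM."""
    # 1) Run the LM head in FP32 so training-mode and inference-mode token
    #    probabilities stay closely correlated.
    if hasattr(model, "lm_head"):
        model.lm_head.to(torch.float32)
    # 2) AdamW tuned for a very wide range of gradient magnitudes.
    return AdamW(model.parameters(),
                 lr=1e-6,                    # placeholder learning rate
                 betas=(0.9, 0.95), eps=1e-15)
```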

  • Gemini 2.5: Leverages Google’s advanced TPU infrastructure and comprehensive data strategies.

    • TPUv5p Architecture: The first models trained on the TPUv5p architecture, utilizing synchronous data-parallel training across multiple chip pods distributed across data centers.

    • Pathways System: The single-controller design of Google’s Pathways system enables slice-granularity elasticity and split-phase SDC (Silent Data Corruption) detection.

      • Slice-Granularity Elasticity: Automatically continues training with fewer TPU slices during localized hardware failures, reducing downtime from over 10 minutes to tens of seconds per interruption and maintaining high throughput (around 97%).

      • Split-Phase SDC Detection: Uses lightweight deterministic replay and per-device intermediate checksums to quickly detect and localize data corruption errors, preventing long debugging and rollback delays.

    • Data Quality: Significant focus on improved data quality through enhanced filtering and deduplication for both pre-training and post-training datasets. Leverages the model itself to assist in data quality control.

    • Distillation for Smaller Models: For smaller models (Flash size and below), distillation is used to transfer knowledge from larger teacher models, approximating the teacher’s next-token prediction distribution with a k-sparse distribution to reduce cost and improve quality (a sketch follows the Gemini infrastructure bullets).

    • Code Data & Evaluation: Intensified focus on incorporating a greater volume and diversity of code data from repository and web sources, along with enhanced evaluation metrics for assessing code capabilities.

    • Multilinguality: Comprehensive strategy refining pre- and post-training data quality, advancing tokenization, and innovating core modeling across over 400 languages.

    • Audio and Video Data: Expanded pre-training data (over 200 languages) and improved post-training recipes for audio. Significantly expanded pre-training and post-training video understanding data, enabling efficient processing of long video content.
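
A minimal sketch of distilling against a k-sparse teacher distribution. The value of k, the renormalization choice, and the plain cross-entropy objective are assumptions for illustration, not Gemini’s recipe.

```python
import torch
import torch.nn.functional as F

def k_sparse_distillation_loss(student_logits, teacher_logits, k=64):
    """Keep only the teacher's top-k next-token probabilities (renormalized)
    and minimize the student's cross-entropy against that sparse target.

    student_logits, teacher_logits: [batch, vocab]. Only k probabilities and
    indices per position need to be stored or transferred from the teacher."""
    teacher_probs = F.softmax(teacher_logits, dim=-1)
    topk_probs, topk_idx = torch.topk(teacher_probs, k, dim=-1)
    topk_probs = topk_probs / topk_probs.sum(dim=-1, keepdim=True)
    student_logprobs = F.log_softmax(student_logits, dim=-1)
    gathered = torch.gather(student_logprobs, dim=-1, index=topk_idx)
    return -(topk_probs * gathered).sum(dim=-1).mean()
```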

Summary

  1. Innovating RL Algorithms (DAPO’s dynamic clipping and sampling, AReaL’s decoupled PPO, Magistral’s GRPO refinements, MiniMax-M1’s CISPO) to improve training stability, sample efficiency, and exploration.
  2. Designing Efficient Architectures (MiniMax-M1’s hybrid MoE with Lightning Attention, Gemini’s sparse MoE with native multimodality) to handle long contexts and scale test-time compute more effectively.
  3. Building Robust and Scalable Infrastructures (AReaL’s asynchronous decoupling, Gemini’s TPUv5p training with elasticity and SDC detection, Magistral’s continuous generation with in-flight weight updates) to maximize hardware utilization, minimize downtime, and manage complex training pipelines for diverse data.
  4. Implementing Sophisticated Data Strategies (Magistral’s and MiniMax-M1’s multi-axis reward shaping, two-stage difficulty filtering, curriculum learning, Gemini’s model-assisted quality control) to provide high-quality feedback signals and curated datasets.