DAPO Reading Note
Notes on DAPO:
The paper tackles the “reproducibility crisis” in reasoning-focused LLMs (such as OpenAI o1 and DeepSeek-R1). While Reinforcement Learning (RL) is known to elicit Chain-of-Thought (CoT) capabilities, the community has struggled to reproduce the reported results using naive GRPO (Group Relative Policy Optimization), often encountering entropy collapse and training instability.
The authors propose DAPO (Decoupled Clip and Dynamic sAmpling Policy Optimization). Using Qwen2.5-32B as the base model, DAPO achieves 50% accuracy on AIME 2024, significantly outperforming the naive GRPO baseline (30%) and surpassing the reproduced DeepSeek-R1-Zero-Qwen-32B (47%) while using 50% fewer training steps.
Technical Methodology: The 4-Step Recipe
The authors argue that naive GRPO fails because standard RL safeguards (like symmetric clipping) are ill-suited for the “Eureka” moments required in mathematical reasoning. DAPO introduces four specific technical modifications.
A. Clip-Higher (Solving Entropy Collapse)
The Problem: Standard PPO/GRPO uses a symmetric clipping range, typically $\epsilon=0.2$. The authors identify this as the primary cause of entropy collapse (where the model becomes deterministic too early).
- Mathematical Intuition: If an exploration token has a low probability $p=0.01$, the upper bound $(1+\epsilon)p = 0.012$ restricts its growth significantly. Conversely, a high-probability token $p=0.9$ can easily grow to $1.0$. This asymmetry makes it mathematically difficult for the model to “change its mind” and explore unlikely paths.
The DAPO Solution: Decouple the clipping range to create an asymmetric trust region (a short code sketch follows this list).
- $\epsilon_{low} = 0.2$: Keeps the lower bound tight to prevent the sampling space from collapsing to zero.
- $\epsilon_{high} = 0.28$: Loosens the upper bound, allowing “low-probability exploration tokens” to increase their likelihood more rapidly.
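A minimal PyTorch sketch of the decoupled clipping (tensor names and shapes are my own assumptions; the actual implementation lives in the verl trainer):

```python
import torch

def clip_higher_surrogate(logp_new, logp_old, advantages,
                          eps_low: float = 0.2, eps_high: float = 0.28):
    """Per-token clipped surrogate with a decoupled (asymmetric) range.

    Sketch only, not the DAPO source: the importance ratio is clipped to
    [1 - eps_low, 1 + eps_high], so previously unlikely tokens with positive
    advantage can grow faster, while the lower bound stays tight.
    All tensors are assumed to share the same shape (e.g. [G, T]).
    """
    ratio = torch.exp(logp_new - logp_old)                  # pi_theta / pi_old
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high)
    # Maximize min(ratio * A, clipped * A); negate this if used as a loss.
    return torch.minimum(ratio * advantages, clipped * advantages)
```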
B. Dynamic Sampling (Solving Zero Gradients)
The Problem: In GRPO, advantages are normalized within a group of $G$ outputs: \(\hat{A}_{i,t} = \frac{R_i - \text{mean}(\{R_j\})}{\text{std}(\{R_j\})}\) If a prompt results in all correct answers ($R_i=1$) or all incorrect answers ($R_i=0$), the standard deviation is 0 and every advantage is 0. As the model improves, the number of “fully solved” prompts per batch increases, causing the effective batch size to shrink and gradient signals to vanish.
The DAPO Solution: The algorithm introduces a constraint to the objective function: \(0 < |\{o_i \mid \text{is_equivalent}(a, o_i)\}| < G\) The system over-samples and buffers prompts, filtering out instances where accuracy is exactly 0 or 1. This ensures every training step utilizes data with high variance, functioning as an automated “curriculum learning” that focuses compute on problems at the boundary of the model’s capability.
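A sketch of the filtering rule under an assumed data layout (a list of per-prompt groups, each carrying $G$ binary rewards); in the real pipeline the trainer keeps over-sampling until the filtered batch reaches its target size:

```python
def dynamic_sampling_filter(groups):
    """Drop prompt groups whose G rewards are all 0 or all 1.

    `groups`: list of dicts with a "rewards" list of G binary rewards
    (hypothetical structure, for illustration). Groups with zero reward
    variance produce zero advantages and thus zero gradient signal.
    """
    kept = []
    for group in groups:
        n_correct = sum(group["rewards"])
        if 0 < n_correct < len(group["rewards"]):
            kept.append(group)
    return kept
```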
C. Token-Level Policy Gradient Loss (Solving Length Bias)
The Problem: Naive GRPO computes the loss at the sample level (it first averages the per-token losses within each response, then averages across the group).
- Consequence: This treats a 100-token response and a 5,000-token response equally. It fails to penalize repetitive patterns or gibberish hidden within long responses, leading to unhealthy length bloating.
The DAPO Solution: DAPO shifts to a summation of token-level losses across the entire group: \(J_{DAPO}(\theta) = \mathbb{E} \left[ \frac{1}{\sum_{i=1}^G |o_i|} \sum_{i=1}^G \sum_{t=1}^{|o_i|} \min(\dots) \right]\) By normalizing by the total token count $\sum |o_i|$ rather than the group size $G$, longer reasoning chains contribute more to the gradient update, providing finer-grained feedback on complex reasoning steps.
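The two aggregation schemes side by side in a small PyTorch sketch (the padded `token_losses` tensor and 0/1 `mask` over a group of $G$ responses are assumed shapes, not the paper’s code):

```python
import torch

def sample_level_loss(token_losses, mask):
    """GRPO-style: average within each response, then across the group,
    so a 5,000-token response weighs the same as a 100-token one."""
    per_sample = (token_losses * mask).sum(dim=1) / mask.sum(dim=1)
    return per_sample.mean()

def token_level_loss(token_losses, mask):
    """DAPO-style: one average over all tokens in the group, so longer
    responses contribute proportionally more tokens to the gradient."""
    return (token_losses * mask).sum() / mask.sum()
```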
D. Overlong Reward Shaping (Solving Reward Noise)
The Problem: Truncating a valid reasoning chain at the max length ($L_{max}$) acts as an arbitrary penalty, confusing the model (valid logic gets punished solely for length).
The DAPO Solution:
- Filtering: Mask the loss of truncated samples so they don’t contribute noise.
- Soft Punishment: A piecewise linear penalty applied within the final punishment buffer of $L_{cache}=4096$ tokens before $L_{max}$ (sketched in code below): \(R_{length}(y) = \begin{cases} 0, & |y| \le L_{max} - L_{cache} \\ \frac{(L_{max} - L_{cache}) - |y|}{L_{cache}}, & L_{max} - L_{cache} < |y| \le L_{max} \\ -1, & |y| > L_{max} \end{cases}\) This signals the model to wrap up its reasoning naturally rather than having it chopped off abruptly.
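The same shaping term as a small Python function (the constants follow the reported setup; the function name is hypothetical):

```python
def soft_overlong_penalty(length: int, l_max: int = 20480, l_cache: int = 4096) -> float:
    """Piecewise-linear length penalty added to the correctness reward.

    Zero until the response enters the final L_cache-token buffer, then a
    linear ramp from 0 down to -1 at L_max, and -1 once the response is
    truncated (sketch; truncated samples are also loss-masked by filtering).
    """
    threshold = l_max - l_cache
    if length <= threshold:
        return 0.0
    if length <= l_max:
        return (threshold - length) / l_cache
    return -1.0
```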
Implementation Details & Dataset
- Base Model: Qwen2.5-32B.
- Infrastructure: Built on verl (HybridFlow).
- Hyperparameters:
  - Group size $G=16$.
  - Max length = 20,480 tokens (16k generation + 4k buffer).
  - Learning rate: $1 \times 10^{-6}$ (constant with warmup).
- KL Divergence: Explicitly removed ($\beta=0$), as reasoning models must diverge significantly from the prior.
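The settings above, condensed into a hypothetical config dict (key names are illustrative and do not correspond to the actual verl configuration schema):

```python
# Hypothetical condensed view of the setup above; key names are illustrative
# and do not match the real verl configuration schema.
dapo_config = {
    "base_model": "Qwen2.5-32B",
    "group_size": 16,             # G responses sampled per prompt
    "max_response_len": 20480,    # 16k generation + 4k soft-punishment buffer
    "learning_rate": 1e-6,        # constant schedule after warmup
    "clip_eps_low": 0.2,
    "clip_eps_high": 0.28,
    "kl_coef": 0.0,               # KL regularization removed (beta = 0)
}
```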
Dataset Transformation (DAPO-Math-17K): To avoid “reward hacking,” where models exploit weaknesses in complex rule-based answer parsers (e.g., an answer like $11-2\sqrt{6}$ matched against an expected format $k-m\sqrt{n}$), the authors transformed the dataset.
- Method: They prompted an LLM to rewrite questions such that all answers are integers (e.g., “Find $k+m+n$”).
- Benefit: Reduces the reward function to a binary integer check, ensuring high-fidelity reward signals (a minimal check is sketched below).
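With integer-only gold answers, the verifier can be as simple as the following sketch (the parsing details are my assumptions; the paper only states that the reward reduces to an exact integer match):

```python
def integer_match_reward(model_answer: str, gold_answer: int) -> float:
    """Binary reward for the integer-answer setup (sketch).

    Returns 1.0 on an exact integer match and 0.0 otherwise; malformed
    outputs simply score 0.0 instead of tripping up a regex parser.
    """
    try:
        return 1.0 if int(model_answer.strip()) == gold_answer else 0.0
    except ValueError:
        return 0.0
```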
Quantitative Analysis & Results
The ablation study (Table 1) quantifies the marginal gain of each technique, showing that stability is a prerequisite for capability.
| Technique (applied cumulatively) | AIME 2024 Accuracy | Analysis |
|---|---|---|
| Naive GRPO | 30% | Baseline failure due to instability. |
| + Overlong Filtering | 36% | (+6%) Removes noise from truncation. |
| + Clip-Higher | 38% | (+2%) Prevents early entropy collapse. |
| + Soft Punishment | 41% | (+3%) Guides natural conclusion. |
| + Token-level Loss | 42% | (+1%) Minor score gain, major stability gain. |
| + Dynamic Sampling | 50% | (+8%) Largest single jump in capability. |
Efficiency: DAPO reaches 50% accuracy in roughly 1,500 steps, whereas the DeepSeek-R1 reproduction required ~3,000 steps to reach 47%.
Emergent Behaviors & Insights
- Spontaneous Reflection: The model began generating phrases like “Wait a moment, let’s rethink…” and backtracking on incorrect geometric assumptions, despite this reflective behavior being absent from the base model at the start of RL training.
- The “Entropy” Hypothesis: The success of Clip-Higher supports the view that standard PPO clipping is too conservative for reasoning tasks. If the policy becomes deterministic (entropy drops) before the model discovers the complex reasoning chain, it never learns it. DAPO maintains higher entropy throughout training.
- Data Efficiency: The success of Dynamic Sampling suggests that RL for LLMs is increasingly a data selection problem. Discarding “solved” or “impossible” prompts to focus compute on the “Zone of Proximal Development” markedly reduces the number of training steps needed to reach a given accuracy.