Experience Replay for LLM RL: Breaking the Generate-Then-Discard Paradigm
Paper: Efficient RL Training for LLMs with Experience Replay.
Modern LLM post-training relies heavily on RL pipelines (PPO, GRPO) to unlock advanced reasoning capabilities, as seen in DeepSeek-R1 and OpenR1. However, state-of-the-art RL post-training is notoriously compute-heavy because it defaults to strict on-policy training: trajectories are generated, used for a single gradient update, and immediately discarded. Inference costs for generating these rollouts can consume over 80% of the training budget.
In classical deep RL (e.g., DQN, SAC), Experience Replay—storing and reusing past trajectories—is a foundational method to improve sample efficiency. Yet, it has been largely ignored in LLM post-training due to the prevailing assumption that off-policy data degrades performance.
This work challenges that consensus. By incorporating a well-designed replay buffer into an asynchronous LLM training pipeline, it demonstrates that we can cut inference compute by up to 40% without losing accuracy, while actually stabilizing training dynamics.
1. The Architecture: Asynchronous RL with a Replay Buffer
In a compute-efficient LLM pipeline, hardware is split into $W$ inference workers and $T$ trainers.
- Without a buffer (Baseline): Workers push rollouts to a queue, trainers pull them, compute gradients, and discard them. Total compute per parameter update is $C(1 + \mu)$, where $\mu$ is the ratio of rollout generation cost to training cost. To avoid GPU downtime, the optimal hardware ratio $W/T$ is exactly $\mu$.
- With a Replay Buffer: Workers continuously push rollouts to a buffer of size $N$, and trainers continuously sample from it (without removing the data). This decouples rollout generation from consumption. The total compute per parameter update drops to $C(1 + W/T)$.
By decreasing the $W/T$ ratio relative to $\mu$, we reduce the compute cost of each gradient step. However, this introduces a critical three-way trade-off:
- Compute Efficiency: Reusing samples makes steps cheaper.
- Off-Policiness (Staleness): Reused samples are generated by older versions of the policy, inducing variance.
- Sample Diversity: High replay ratios lead to the same samples being reused frequently, reducing local and global data diversity.
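The decoupled producer/consumer loop above can be sketched in a few lines; `ReplayBuffer` and `compute_per_update` are illustrative names for this post, not the paper's implementation:

```python
import random
from collections import deque

class ReplayBuffer:
    """FIFO buffer of size N; trainers sample without removing data."""
    def __init__(self, capacity):
        self.data = deque(maxlen=capacity)  # oldest rollouts evicted first

    def push(self, rollout):
        # Inference workers push rollouts as they finish.
        self.data.append(rollout)

    def sample(self, batch_size):
        # Trainers sample uniformly with replacement; data stays in the
        # buffer, so each rollout can feed several gradient updates.
        return [random.choice(self.data) for _ in range(batch_size)]

def compute_per_update(C, mu, W=None, T=None, use_buffer=False):
    # Baseline: C * (1 + mu); with a buffer: C * (1 + W/T).
    return C * (1 + (W / T if use_buffer else mu))
```

With $\mu = 3$, the baseline costs $4C$ per update, while a $(W, T) = (5, 3)$ split with a buffer costs only $C(1 + 5/3) \approx 2.67C$.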
2. Mathematical Framework and Derivations
This trade-off is formalized using a non-convex stochastic optimization framework to study the convergence of policy gradient with stale data. We want to minimize an objective $F(\theta)$. At step $t$, we generate $R$ new rollouts and sample a minibatch of size $B$ from a buffer of size $N$. The update is $\theta_{t+1} = \theta_t - \eta g_t$, where $g_t = \frac{1}{B} \sum_{j=1}^B G(\theta_t, z_{t, i_j})$.
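A toy instance of this update makes the mechanics concrete: a quadratic $F$ with synthetic "rollouts" standing in for policy-gradient samples (all constants here are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy objective F(theta) = 0.5 * (theta - 1)^2 with per-sample gradient
# G(theta, z) = theta - z, where rollouts z ~ N(1, 0.1) play the role of
# noisy policy-gradient estimates generated into the buffer.
def G(theta, z):
    return theta - z

N, R, B, eta = 64, 8, 16, 0.1   # buffer size, rollouts/step, minibatch, lr
buffer, theta = [], 5.0
for t in range(200):
    buffer.extend(rng.normal(1.0, 0.1, R))  # R new rollouts enter the buffer
    buffer = buffer[-N:]                    # FIFO eviction keeps size <= N
    batch = rng.choice(buffer, size=B)      # sample B (possibly stale) items
    g = np.mean([G(theta, z) for z in batch])
    theta -= eta * g                        # theta_{t+1} = theta_t - eta*g_t
```

Despite reusing each rollout across multiple updates, `theta` converges toward the optimum at 1.0.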
Fundamental Assumptions
To bound the optimization trajectory, three key assumptions are established:
- Assumption 4.1 ($L$-Smoothness): $\|\nabla F(y) - \nabla F(x)\| \le L \|y - x\|$.
- Assumption 4.2 (Bias from Staleness): Off-policy data introduces gradient bias, bounded by the parameter drift: $\|\mathbb{E}[\epsilon_{t,i} \mid \mathcal{F}_t]\| \le \kappa \|\theta_t - \theta_{t_i}\|$, where $t_i$ is the rollout generation time.
- Assumption 4.3 (Variance and Coupling): The per-sample variance grows with staleness: $\mathbb{E}[\|\epsilon_{t,i}\|^2] \le \sigma^2(t - t_i)$. Furthermore, samples in the buffer influence future parameters, creating a statistical correlation bounded by $\rho \frac{|t_i - t_j|}{N}$.
Derivation of the Convergence Bound (Theorem 4.4)
Using a Taylor expansion bounded by $L$-smoothness, the parameter update yields:
\[F(\theta_{t+1}) \le F(\theta_t) + \langle \nabla F(\theta_t), \theta_{t+1} - \theta_t \rangle + \frac{L}{2} \|\theta_{t+1} - \theta_t\|^2\]
Substituting $\theta_{t+1} - \theta_t = -\eta(\nabla F(\theta_t) + \epsilon_t)$:
\[F(\theta_{t+1}) \le F(\theta_t) - \left(\eta - \frac{L\eta^2}{2}\right)\|\nabla F(\theta_t)\|^2 - (\eta - L\eta^2)\langle \nabla F(\theta_t), \epsilon_t \rangle + \frac{L\eta^2}{2}\|\epsilon_t\|^2\]
Summing this over $T$ steps, taking expectations, and applying Cauchy-Schwarz bounds the gradient norm via two key terms: the bias and the variance.
- Bounding the Bias: The drift $\|\theta_t - \theta_{t-\tau}\|^2$ is bounded by the accumulated past gradients. Under the step-size condition $2H^2\kappa^2\eta^2 \le 1/4$ (where $H = N/R$ is the staleness horizon), the bias remains controlled.
- Bounding the Variance: The total variance $\mathbb{E}[\|\xi_t\|^2]$ combines the per-sample variance $\bar{\sigma}^2(H)$ with the cross-sample correlation $\rho$, summed over the minibatch.
Combining these gives Theorem 4.4, showing the gradient norm decays but is bottlenecked by a variance parameter $\mathcal{V} = \bar{\sigma}^2(N/R)\left(\frac{1}{B} + \frac{1}{N} + \frac{\rho}{R}\right)$.
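The bottleneck term $\mathcal{V}$ is easy to evaluate numerically. A sketch with an assumed power-law form for $\bar{\sigma}^2$ (illustrative, not the paper's measured values) shows how growing the buffer shrinks the $1/N$ term while increasing staleness:

```python
def variance_bottleneck(sigma_bar_sq, N, R, B, rho):
    """V = sigma_bar^2(N/R) * (1/B + 1/N + rho/R)  (Theorem 4.4)."""
    return sigma_bar_sq(N / R) * (1.0 / B + 1.0 / N + rho / R)

# Illustrative staleness variance sigma_bar^2(x) = x**0.5: a larger buffer
# reduces 1/N but raises the staleness horizon N/R, and here the staleness
# penalty dominates.
v_small = variance_bottleneck(lambda x: x**0.5, N=32,  R=8, B=16, rho=0.05)
v_large = variance_bottleneck(lambda x: x**0.5, N=256, R=8, B=16, rho=0.05)
```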
Derivation of Optimal Buffer Design (Theorem 4.5)
Given an asymptotic compute budget $C = (B + \mu R)T$, taking the small-learning-rate limit $\eta \to 0$ reduces the problem to minimizing a core functional $\mathcal{J}$ of the staleness horizon $x = N/R$ and the replay ratio $y = B/R$:
\[\mathcal{J}(x, y) = \bar{\sigma}^2(x) \left(1 + \frac{y}{\mu}\right) \left(\frac{1}{y} + \frac{1}{x} + \rho\right)\]
Setting the partial derivative $\partial_y \mathcal{J}(x,y) = 0$ gives a closed-form expression for the optimal replay ratio:
\[y^* = \sqrt{\frac{\mu}{\rho + 1/x^*}}\]
Insight: this proves mathematically that as the inference cost $\mu$ grows, the optimal strategy is to increase the replay ratio $y^*$, scaling roughly as $\sqrt{\mu}$.
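A minimal helper for this closed form (parameter values below are illustrative) makes the $\sqrt{\mu}$ scaling visible: quadrupling the inference cost doubles the optimal replay ratio.

```python
import math

def optimal_replay_ratio(mu, rho, x_star):
    """y* = sqrt(mu / (rho + 1/x*)), the optimal B/R from Theorem 4.5."""
    return math.sqrt(mu / (rho + 1.0 / x_star))

y_cheap = optimal_replay_ratio(mu=4.0,  rho=0.1, x_star=10.0)
y_pricey = optimal_replay_ratio(mu=16.0, rho=0.1, x_star=10.0)  # 2x y_cheap
```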
Plugging $y^*$ back into $\mathcal{J}$, the optimal staleness horizon $x^*$ minimizes:
\[\mathcal{I}(x) = \bar{\sigma}^2(x) \left(\frac{1}{\sqrt{\mu}} + \sqrt{\rho + \frac{1}{x}}\right)^2\]
If the variance scales as a power law $\sigma(x) \propto x^\alpha$, setting $\mathcal{I}'(x) = 0$ reduces to a quadratic equation, mapping out the optimal buffer size solely from the generation cost and the off-policy decay properties.
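Even without solving that quadratic analytically, a simple grid search over $\mathcal{I}(x)$ recovers the interior optimum; the parameters below are illustrative, not taken from the paper:

```python
import math

def I(x, mu, rho, alpha):
    """I(x) = sigma_bar^2(x) * (1/sqrt(mu) + sqrt(rho + 1/x))^2, with the
    power-law assumption sigma(x) ~ x**alpha, i.e. sigma^2(x) = x**(2*alpha)."""
    return x**(2 * alpha) * (1.0 / math.sqrt(mu) + math.sqrt(rho + 1.0 / x))**2

# I(x) blows up both as x -> 0 (the 1/x variance term) and as x -> infinity
# (the staleness power law), so a finite optimum x* exists in between.
mu, rho, alpha = 8.0, 0.05, 0.25
xs = [0.1 * k for k in range(1, 2001)]
x_star = min(xs, key=lambda x: I(x, mu, rho, alpha))
```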
3. Empirical Insights: Why Does it Actually Work?
While the math proves we should use a buffer to save compute, the empirical results reveal fascinating side effects on the training dynamics of models like Qwen2.5-7B on mathematical reasoning (MATH) datasets.
- The 40% Compute Savings: A finely tuned buffer (e.g., $(W,T) = (5,3)$ instead of baseline $(6,2)$) cuts inference compute dramatically, matching or beating the baseline on identical compute budgets.
- Regularization Against Collapse: In standard RL fine-tuning, models often reach a peak accuracy and then catastrophically overfit or “collapse” into nonsensical policies. The staleness introduced by the buffer acts as a regularizer. Mixing older trajectories limits the policy from aggressively overfitting to its own immediate outputs, delaying or entirely preventing training collapse.
- Preservation of Output Diversity (Entropy): RL fine-tuning tends to shrink output entropy and thus diversity (hurting pass@$k$ for $k>1$). Remarkably, experience replay improves pass@$k$ metrics even more than pass@$1$: the diverse historical rollouts in the buffer prevent the model from narrowing down to single monolithic reasoning paths.
- Wall-time Smoothness: Standard asynchronous loops are prone to stalls (e.g., inference workers wait on trainers, or vice versa, due to variable reward latency). A replay buffer smooths out these latency spikes, yielding significant wall-clock speedups purely from eliminating hardware idle time.
4. Advanced Refinements: Pushing the Pareto Frontier
Beyond a basic FIFO uniform sampling buffer, advanced heuristics push efficiency even further:
- Positive-Bias Sampling: Off-policy variance is most harmful on failed/incorrect trajectories. By reserving a fraction $\delta$ of the buffer strictly for correct past rollouts, the model extracts more learning signal from stale data.
- Alternative Loss Functions (AsymRE): Standard GRPO uses importance sampling, which exacerbates variance under high staleness. Replacing it with Asymmetric REINFORCE (AsymRE) avoids explicit importance ratio correction, yielding demonstrably stabler learning curves at high replay ratios.
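Positive-bias sampling can be sketched as a buffer with a hard partition: a reserved pool holding only correct rollouts, sized to a fraction $\delta$ of capacity (a minimal illustration; the paper's exact sampling scheme may differ):

```python
import random
from collections import deque

class PositiveBiasBuffer:
    """Buffer of size N where a fraction delta of slots is reserved for
    correct (e.g., reward = 1) rollouts. Illustrative sketch only."""
    def __init__(self, capacity, delta):
        n_pos = int(delta * capacity)
        self.pos = deque(maxlen=n_pos)              # reserved: correct rollouts
        self.rest = deque(maxlen=capacity - n_pos)  # everything else

    def push(self, rollout, correct):
        (self.pos if correct else self.rest).append(rollout)

    def sample(self, batch_size):
        # Uniform sampling over the union; the reserved pool guarantees
        # correct trajectories are never fully evicted by failed ones.
        pool = list(self.pos) + list(self.rest)
        return [random.choice(pool) for _ in range(batch_size)]
```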
Takeaways
The “generate-then-discard” philosophy in LLM RL is computationally suboptimal. By rigorously balancing inference cost ($\mu$), staleness-induced variance ($\sigma^2$), and temporal correlation ($\rho$), the addition of a simple replay buffer fundamentally transitions LLM post-training from an optimization of “performance per step” to “performance per unit of compute”. For teams trying to run reasoning fine-tuning pipelines (like R1-Zero variants) without frontier-level GPU clusters, decoupling training from inference through replay buffers will likely become standard practice.