
1. Core Question and Motivation

Reinforcement Learning with Verifiable Rewards (RLVR) has been widely successful in improving the reasoning performance of large language models (LLMs) on tasks like mathematics and programming. It is commonly believed that RLVR enables LLMs to continuously self-improve and acquire novel reasoning abilities that surpass those of their corresponding base models.

This study systematically investigates this assumption, posing the fundamental question: Does current RLVR truly enable LLMs to acquire novel reasoning abilities, or does it merely utilize reasoning patterns already present in the base model?

2. Methodology

To rigorously assess the reasoning capacity boundaries, the authors adopted the pass@k metric.

  • pass@k reflects the proportion of problems a model can solve at least once within $k$ sampled attempts, providing a robust view of its reasoning boundary, unlike average-case metrics, which may underestimate a model’s true potential.
  • The study conducted extensive experiments across various LLM families (e.g., Qwen2.5, LLaMA-3.1-8B), model sizes, RL algorithms (including PPO, GRPO, RLOO, ReMax, Reinforce++, and DAPO), and three domains: mathematics, code generation, and visual reasoning.
  • For math tasks, the zero-RL setting (applying RL directly to the pretrained base model) was used.
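The pass@k metric above is typically computed with the standard unbiased estimator from the code-generation literature: draw $n \geq k$ samples per problem, count the $c$ that are correct, and estimate the probability that at least one of $k$ randomly chosen samples succeeds. A minimal sketch (the specific sample counts are illustrative, not from the paper):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator.

    n: total samples drawn for a problem, c: number of correct samples.
    Returns 1 - C(n-c, k) / C(n, k), the probability that at least one
    of k samples drawn without replacement from the n is correct.
    """
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k-subset
        # must contain a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative numbers: 256 samples per problem, 8 of them correct.
print(pass_at_k(256, 8, 1))    # average-case accuracy (= 8/256)
print(pass_at_k(256, 8, 64))   # coverage at large k, much higher
```

Reporting pass@k at large $k$ (e.g., $k=256$) is what exposes the coverage gap between base and RLVR models that average-case pass@1 hides.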

3. Key Findings: Reduced Coverage and Bounded Capacity

The research uncovered several surprising findings regarding the effectiveness of current RLVR training:

  1. Increased Sampling Efficiency at Low $k$: RLVR-trained models generally outperform their base models when $k$ is small (e.g., $k=1$). This suggests that RLVR significantly improves the sampling efficiency towards correct paths.
  2. Narrower Reasoning Coverage at High $k$: Surprisingly, as the number of samples ($k$) increases, base models consistently surpass RLVR-trained models across all benchmarks and LLM families. This indicates that current RLVR training does not expand, and may even shrink, the set of solvable problems (referred to as Reduced Scope of Reasoning Capacity in Figure 1).
  3. No Novel Reasoning Introduced: The core finding is that current RLVR training rarely elicits fundamentally new reasoning patterns.
    • Reasoning Paths are Bounded: Analysis suggests that the reasoning paths generated by RLVR models are already included in the base models’ sampling distribution.
    • Perplexity Analysis: The perplexity distribution of RL-generated responses, when measured against the base model, closely matched the responses the base model already tends to generate, indicating that RLVR sharpens the distribution within the base model’s prior rather than expanding beyond it.
    • Coverage Analysis: Problems solved by RLVR models are nearly a subset of those solvable by the base model. On the AIME24 benchmark, $13.3\%$ of problems were solvable by the base model but not by the RLVR model, while $0.0\%$ were solvable exclusively by the RLVR model.

4. Deep Analysis and Comparison with Distillation

RL Algorithm Performance:

  • Six popular RLVR algorithms (PPO, GRPO, RLOO, ReMax, Reinforce++, DAPO) performed similarly.
  • The Sampling Efficiency Gap ($\Delta$SE) quantifies how far the RL model’s pass@1 falls short of the base model’s upper bound (pass@k, with $k=256$ used as a proxy). It remained consistently large (above 40 points) across all algorithms, highlighting that existing methods are far from fully leveraging the base model’s potential.
  • As RL training progresses, average performance (pass@1) improves, but the coverage (pass@256) decreases.
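Under one plausible reading of the metric, $\Delta$SE is simply the base model's empirical upper bound minus the RL model's average-case accuracy, in percentage points. A sketch with illustrative numbers (not taken from the paper):

```python
def sampling_efficiency_gap(base_pass_at_256: float, rl_pass_at_1: float) -> float:
    """Delta-SE: how far the RL model's pass@1 falls short of the base
    model's empirical upper bound (pass@256 as a proxy), in points."""
    return 100.0 * (base_pass_at_256 - rl_pass_at_1)

# Illustrative: base model covers 75% of problems at k=256, while the
# RL model solves 30% on the first attempt.
print(sampling_efficiency_gap(0.75, 0.30))  # ~45 points, i.e. above 40
```

A gap above 40 points means the RL model reliably reaches less than half of what the base model can in principle solve.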

RLVR vs. Distillation:

  • RLVR and distillation are fundamentally different.
  • While RLVR improves scores by sampling existing high-reward outputs more efficiently, distillation can introduce new reasoning patterns from a stronger teacher to the student.
  • Consequently, distilled models demonstrate an expanded reasoning scope that surpasses the reasoning boundary of the base model, unlike RLVR-trained models.

Underlying Factors for RLVR Limitations: The paper posits that the limitations of current RLVR stem from two main differences compared to traditional RL (e.g., AlphaGo):

  1. Vast Action Space: The action space in LLMs is exponentially larger, making effective exploration difficult.
  2. Pretrained Prior: RLVR starts from a pretrained base model whose useful prior acts as a “double-edged sword”. In this combinatorial space, policy gradient algorithms maximize the log-likelihood of positive-reward responses sampled from within the base model’s prior, constraining the trained policy from discovering truly novel, out-of-prior reasoning patterns.
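The "sharpening within the prior" argument follows directly from the REINFORCE-style gradient: only responses actually sampled from the current policy are ever reinforced, so a correct response with zero probability under the initial prior can never be discovered. A toy sketch over a discrete action space of five candidate "responses" (all numbers here are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Five candidate responses; 2 and 4 are correct, but response 4 has
# zero probability under the initial (pretrained-prior) policy.
logits = np.array([1.0, 0.5, 0.0, -0.5, -np.inf])
reward = np.array([0.0, 0.0, 1.0, 0.0, 1.0])

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

lr = 0.5
for _ in range(200):
    p = softmax(logits)
    a = rng.choice(5, p=p)                 # sample from the current policy
    # REINFORCE for a softmax policy: grad = r(a) * (onehot(a) - p)
    logits = logits + lr * reward[a] * (np.eye(5)[a] - p)

p_final = softmax(logits)
# Mass concentrates on the reachable correct response (index 2),
# while the out-of-prior correct response (index 4) stays at exactly 0.
print(p_final)
```

The policy sharpens onto the in-prior solution it can sample, which raises pass@1, but the out-of-prior solution is never explored, mirroring the bounded-capacity finding.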

5. Conclusion and Future Directions

The study concludes that current RLVR methods, while effective at sharpening existing skills and improving sampling efficiency, have not yet realized the potential of reinforcement learning to elicit genuinely novel reasoning abilities in LLMs.

To unlock this potential, future research should focus on improved RL paradigms, specifically addressing the exploration challenge:

  • Efficient Exploration: Developing high-level exploration mechanisms (e.g., in program-level abstraction) to facilitate the discovery of out-of-prior reasoning patterns.
  • Data Scale via Curriculum: Implementing a curriculum that trains on easier subproblems first to hierarchically reduce the exploration space and lift performance on challenging tasks.
  • Process Reward: Incorporating fine-grained credit assignment and intermediate signals (process reward) instead of relying purely on binary outcome rewards to guide the reasoning trajectory.
  • Agentic RL: Utilizing a multi-turn agentic RL framework, featuring richer interactions and feedback with the environment, to allow models to generate novel experiences and learn from them, initiating an “era of experience”.

Note: The key finding that RLVR models suffer from reduced coverage relative to their base models, despite achieving higher average accuracy (pass@1), highlights that optimization focusing purely on immediate reward success can inadvertently prune diverse and viable reasoning pathways already known to the underlying model. This is like a chef who, after mastering one popular recipe (high pass@1), forgets all the other niche dishes they used to know (low pass@k coverage). While they execute the popular recipe perfectly, their overall culinary range has shrunk.