Read Scaling Laws – O1 Pro Architecture, Reasoning Training Infrastructure, Orion and Claude 3.5 Opus “Failures”.

Summary

Scaling laws continue to drive the rapid improvement in LLM capabilities. Despite concerns that scaling has hit a wall, large AI labs and hyperscalers are increasing their data-center investments and capital expenditure, signaling confidence that scaling laws remain relevant. New scaling paradigms, such as reasoning models, have emerged and are pushing the boundaries of AI development.

One of the key challenges in scaling pre-training is data collection: the internet’s data, while still expanding, is not growing in proportion to available compute. This has led labs to use synthetic data generation to augment training datasets.

Post-training techniques are crucial for optimizing LLMs. Supervised Fine-Tuning (SFT) uses curated input-output pairs to improve model performance in specific domains, but the limited availability of high-quality human-generated data has led to increased adoption of synthetic data. Reinforcement Learning (RL) techniques, particularly Reinforcement Learning from Human Feedback (RLHF), have been instrumental in aligning LLMs and making them more useful. Because RLHF is expensive and hard to scale, labs are exploring Reinforcement Learning from AI Feedback (RLAIF), where AI models provide the feedback instead of humans.

Reasoning models, which utilize Chain of Thought (CoT) reasoning, are a new frontier in AI development. These models break down complex problems into a series of reasoning steps, allowing for better accuracy and the ability to self-correct. Inference-time scaling, where increased compute power at inference time leads to better results, has become a significant area of focus. Techniques such as Monte Carlo Tree Search and Self-Consistency/Majority Vote are used to explore multiple reasoning paths and enhance accuracy.

OpenAI’s o1 and o1 Pro models are examples of reasoning models that leverage inference-time scaling. o1 follows a single chain of thought, while o1 Pro utilizes Self-Consistency/Majority Vote to improve reliability. o1’s training infrastructure, referred to as berry training, involves generating vast amounts of data using Monte Carlo tree search and pruning it using functional verifiers. This process is computationally intensive, requiring multiple forward passes for each backward pass during post-training.

Scaling inference-time compute is more expensive than scaling training, primarily due to 1) the increased memory requirements of larger KV Caches and 2) the quadratic scaling of attention FLOP requirements with respect to sequence length. However, scaling pre-training can still significantly reduce inference costs by enabling overtraining (training a smaller model on far more data than is compute-optimal so that it is cheaper to serve). This underscores the need for continued innovation in AI hardware and architectures to support the growing compute demands of reasoning models and other advanced techniques.
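
To make these two cost drivers concrete, here is a rough back-of-the-envelope calculation. The model dimensions below are hypothetical placeholders (not o1’s actual configuration); the point is only that KV cache memory grows linearly with sequence length while attention FLOPs grow quadratically.

```python
# Back-of-the-envelope cost of long chains of thought.
# All model dimensions are hypothetical placeholders, not any real model's config.

N_LAYERS = 80          # transformer layers
N_KV_HEADS = 8         # KV heads (assumes grouped-query attention)
HEAD_DIM = 128         # dimension per head
BYTES_PER_ELEM = 2     # fp16/bf16

def kv_cache_bytes(seq_len: int, batch: int = 1) -> int:
    """KV cache = 2 (K and V) * layers * kv_heads * head_dim * seq_len * bytes."""
    return 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * seq_len * BYTES_PER_ELEM * batch

def attention_flops(seq_len: int) -> int:
    """Attention score + value FLOPs per token batch grow quadratically with sequence length."""
    n_heads = 64  # hypothetical number of query heads
    return 2 * 2 * N_LAYERS * n_heads * HEAD_DIM * seq_len ** 2

for s in (4_096, 32_768):  # short answer vs. long reasoning chain
    print(f"seq_len={s:>6}: KV cache ≈ {kv_cache_bytes(s) / 1e9:.1f} GB, "
          f"attention FLOPs ≈ {attention_flops(s):.2e}")
```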

o1 reasoning model

OpenAI’s o1 is a groundbreaking reasoning model that showcases the power of inference-time scaling and Chain of Thought (CoT) reasoning. Here’s what makes it special:

  • Navigates a Single Chain of Thought: Unlike other reasoning models that explore multiple reasoning paths, o1 follows a singular chain of thought to arrive at an answer. While this limits it to a “pass@1” approach at inference time, it allows for focused reasoning and potentially deeper analysis along a chosen path.
  • Emergent Self-Correction (Backtracking): One of o1’s most remarkable abilities is its capacity to self-correct and backtrack on its reasoning chain. This wasn’t specifically engineered but rather emerged as a consequence of scaling inference time compute. When o1 encounters an illogical conclusion or dead end, it can revisit earlier steps and adjust its reasoning path, showcasing an emergent form of problem-solving.
  • Trained Using “Berry Training” Infrastructure: o1 is trained using a unique system called “berry training” which involves generating massive amounts of data using a Monte Carlo tree search with numerous concurrent “rollouts.” This data, comprising trillions of tokens and thousands of answer “trajectories” for millions of problems, is then pruned using functional verifiers and Outcome Reward Models (ORMs). This approach is computationally intensive and highlights the emphasis on data quality and model verification in o1’s training.
  • Reliance on Functional Verifiers: o1’s training process heavily utilizes functional verifiers to provide feedback during training. These verifiers act as independent “sandboxes” to check the accuracy of generated data, whether it involves running code or verifying mathematical calculations. This focus on verification contributes to o1’s accuracy, particularly in domains like coding and mathematics where answers can be objectively assessed.
  • High Cost Due to Inference-Time Compute: o1’s reasoning capabilities come at a computational cost. The model’s reliance on long sequence lengths for CoT reasoning leads to increased memory requirements (due to larger KV Caches) and quadratic scaling of attention FLOP requirements. This makes inference significantly more expensive than for a conventional LLM like GPT-4o, explaining the roughly 6x per-token price difference between GPT-4o and o1, even though the two reportedly share the same architecture and size.

In essence, o1 represents a significant advancement in AI, demonstrating the potential of reasoning models and inference-time scaling. However, its computational demands and associated costs highlight the ongoing challenges in making such powerful models widely accessible and economically viable for deployment.

Alignment

Reference: Deliberative alignment: reasoning enables safer language models

Alignment, in the context of LLMs, refers to the process of ensuring that a model’s behavior and outputs accord with human values and intentions. It involves techniques that guide the model to generate responses that are helpful, harmless, and consistent with desired ethical and societal norms.

One of the key methods employed in alignment is Reinforcement Learning from Human Feedback (RLHF). Human annotators express preferences between candidate responses, and those preferences are used to train a reward model that assesses the quality and alignment of the model’s outputs. The reward model then acts as a guide during RL training, pushing the LLM toward outputs that better match human expectations.
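
Reward models of this kind are commonly trained on pairwise preferences with a Bradley-Terry style loss. The sketch below is a generic illustration, not any particular lab’s implementation; `reward_model` is a hypothetical module that maps an encoded (prompt, response) pair to a scalar score.

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_model, chosen_inputs, rejected_inputs):
    """Bradley-Terry pairwise loss for training a reward model from human preferences.

    `reward_model` is a hypothetical torch.nn.Module returning one scalar score per example;
    `chosen_inputs` / `rejected_inputs` are batches of encoded (prompt, response) pairs.
    """
    r_chosen = reward_model(chosen_inputs)      # shape: (batch,)
    r_rejected = reward_model(rejected_inputs)  # shape: (batch,)
    # Push the model to score the human-preferred response higher than the rejected one.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```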

Another important aspect of alignment is the use of synthetic data. Synthetic data, generated using various methods, plays a crucial role in post-training fine-tuning, especially for tasks like coding, math, and reasoning. By using synthetic data, researchers can train models on a wide range of scenarios and edge cases, improving their ability to generalize and respond appropriately in various situations.

RLAIF, or Reinforcement Learning from AI Feedback, replaces human feedback with AI-generated feedback. This offers scalability and speed, since AI models can generate feedback far faster than humans, and it can also be applied to more nuanced aspects of alignment, such as ethical dilemmas and cultural norms.

The goal of alignment is to create AI systems that are not only powerful but also responsible and beneficial to society. The techniques discussed above represent important steps in this direction, helping to ensure that LLMs like OpenAI’s o1 are aligned with human values and intentions.

PPO vs. DPO

Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO) are two techniques used in the alignment and fine-tuning of LLMs.

1. PPO

PPO is a reinforcement learning algorithm. The “policy” is the LLM being optimized, whose outputs are treated as actions; “proximal” refers to the algorithm’s strategy of updating the policy in small, constrained steps; and “optimization” is the iterative refinement of the policy using feedback from a reward model to maximize expected cumulative reward. PPO is an actor-critic method, combining policy-based and value-based learning: an “actor” (the policy) chooses actions and a “critic” (the value model) evaluates them. PPO is well suited to tasks like alignment, where the objective is to steer the LLM’s behavior toward desired outcomes.
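
At the heart of PPO is the clipped surrogate objective, which is what keeps each update “proximal” to the previous policy. Below is a minimal sketch of that objective in PyTorch, omitting the critic’s value loss and the entropy bonus for brevity; the inputs are generic placeholders.

```python
import torch

def ppo_clip_loss(logprobs_new, logprobs_old, advantages, clip_eps: float = 0.2):
    """Clipped surrogate objective used by PPO.

    logprobs_new / logprobs_old: per-token log-probabilities under the current
    and previous policies; advantages: estimates produced with the critic (value model).
    """
    ratio = torch.exp(logprobs_new - logprobs_old)          # probability ratio pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    # Take the pessimistic (element-wise minimum) term so updates stay close to the old policy.
    return -torch.min(unclipped, clipped).mean()
```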

Strengths:

  • Effective at aligning LLMs and improving their performance on complex tasks.
  • Can scale well, particularly when using synthetic data instead of human feedback.
  • Offers more stable learning compared to some other RL algorithms.

Weaknesses:

  • Can be slow and computationally expensive, especially when relying on human feedback (RLHF).
  • Gathering and processing high-quality human preference data can be a challenge.

2. DPO

DPO, on the other hand, simplifies the process by directly optimizing the policy to favor preferred outputs, as determined by human preference data, without relying on a separate reward model. It employs a binary cross-entropy loss function to compare probability ratios between the current model and a reference model (usually the pre-fine-tuned version). This approach ensures that the LLM learns to produce preferred responses while staying close to its original behavior.
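
Concretely, the DPO loss applies a logistic (binary cross-entropy) loss to the difference of log-probability ratios between the current policy and the frozen reference model. A minimal sketch, with generic placeholder inputs and an assumed β of 0.1:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta: float = 0.1):
    """Direct Preference Optimization loss.

    Each argument is the summed log-probability of a response (chosen or rejected)
    under the current policy or the frozen reference model.
    """
    chosen_ratio = policy_chosen_logps - ref_chosen_logps        # log pi/pi_ref for preferred response
    rejected_ratio = policy_rejected_logps - ref_rejected_logps  # log pi/pi_ref for rejected response
    # Binary cross-entropy on the margin: favor the chosen response while
    # penalizing drift from the reference model (strength controlled by beta).
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```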

Strengths:

  • Simpler to implement and often less computationally expensive than PPO, especially when avoiding a full reward model.
  • Can achieve comparable or better results than RLHF in certain cases.

Weaknesses:

  • Doesn’t scale as well as PPO, especially for larger models and complex alignment tasks.
  • Relies heavily on the quality of the preference data set, making data gathering and curation crucial.

Several models have utilized PPO in their training process. These include:

  • OpenAI’s InstructGPT
  • Anthropic’s Claude models
  • Google’s Gemini

Llama 3.1 used DPO in its post-training process, citing advantages over PPO for its specific circumstances: the team “explored on-policy algorithms such as PPO, but found that DPO required less compute for large-scale models and performed better”. In other words, DPO was judged more efficient and more effective than PPO for Llama 3.1, a decision driven by practical considerations and the desire to maximize performance while minimizing compute cost. The paper notes that DPO “performed better, especially on instruction following benchmarks”, which were likely a key priority in Llama 3.1’s development.

Despite a general industry trend toward favoring PPO for its scalability in handling larger models and complex tasks, Llama 3.1 demonstrates a case where DPO proved to be the superior choice. This highlights that the selection of alignment techniques is not always straightforward and depends heavily on the specific goals, model scale, and computational constraints of the project.

However, DPO did not scale as effectively as PPO (as seen with Llama 3.3). This shift suggests that PPO is generally preferred for scaling post-training efforts, particularly for aligning larger and more complex models.

PPO vs. RLHF

  • PPO is a specific type of Reinforcement Learning (RL) algorithm, while RLHF is a broader technique for aligning language models with human preferences.

  • Think of it this way: RLHF provides the framework and the data for training, while PPO is one of the tools used to implement the RLHF training process.

    • RLHF involves collecting human feedback on model outputs and using that data to train a reward model. This reward model then guides the LLM during the reinforcement learning process.
    • PPO is one algorithm that can be used within the RLHF framework to optimize the model’s policy based on the rewards provided by the reward model.

Several other RL algorithms besides PPO can be utilized within RLHF. The choice of algorithm depends on factors like the complexity of the task, the size of the model, and the desired balance between efficiency and performance.

Reasoning model: CoT + Reinforcement Learning

Reinforcement learning is applied to steer a base LLM’s behavior toward CoT and improve its accuracy, using several separate models (often themselves LLMs):

  1. The Generator: This model is responsible for producing reasoned-out solutions in multiple steps, forming a Chain-of-Thought. It is typically separate from the base LLM and fine-tuned specifically for generating these reasoning steps.
  2. The Verifier Model: This model evaluates the correctness of the solutions generated by the Generator and provides a corresponding reward. There are different ways to train a Verifier Model:
    • Human annotation: Humans can directly assess the reasoning steps.
    • Automatic process annotation: This involves generating multiple reasoning paths and then evaluating them based on whether they lead to the correct answer.
  3. Process Annotation: This refers to labeling or scoring the reasoning steps produced by the Generator; these labels are used to train the Reward Model. The source mentions different methods of process annotation (a minimal sketch follows this list), such as:
    • Hard Estimation: Marking a step as good if it leads to the correct final answer.
    • Soft Estimation: Assigning a score based on the frequency with which a step leads to the correct solution.
  4. The Reward Model: It learns from the labeled reasoning steps (process annotation) and is used to guide the Generator in producing better, more accurate reasoning chains. There are two types: Outcome Reward Models (ORMs) and Process Reward Models (PRMs).
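
As a concrete illustration of automatic process annotation, the sketch below labels each intermediate step by sampling several continuations from it and checking whether they reach a known correct answer. The `generator` callable is a hypothetical stand-in for the model; hard estimation keeps a binary label, while soft estimation keeps the success frequency.

```python
from typing import Callable, List

def annotate_steps(problem: str,
                   steps: List[str],
                   generator: Callable[[str], str],  # hypothetical: continues a partial CoT to a final answer
                   correct_answer: str,
                   n_rollouts: int = 8):
    """Label each reasoning step by how often rollouts from it reach the correct answer."""
    labels = []
    prefix = problem
    for step in steps:
        prefix = prefix + "\n" + step
        hits = sum(generator(prefix).strip() == correct_answer for _ in range(n_rollouts))
        soft = hits / n_rollouts          # soft estimation: frequency of reaching the correct answer
        hard = 1 if hits > 0 else 0       # hard estimation: did any rollout reach the correct answer?
        labels.append({"step": step, "soft": soft, "hard": hard})
    return labels
```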

ORM vs. PRM

Outcome Reward Models (ORMs) and Process Reward Models (PRMs) are two types of reward models used in the context of RL, particularly in training reasoning models that utilize CoT.

  • ORMs focus on the final outcome of a reasoning process. They evaluate the correctness of the ultimate solution generated by the model and provide a reward based solely on whether that solution is right or wrong. ORMs are commonly used in ranking-based approaches, where multiple possible answers are generated and the highest-ranked one is selected.

  • PRMs, on the other hand, evaluate and assign scores to each individual step within the reasoning chain. This allows for a more fine-grained assessment of the model’s reasoning process. PRMs can identify specific errors or weaknesses in the chain of thought, even if the final answer happens to be correct.

  • Relationship with RLHF and CoT: Both ORMs and PRMs play crucial roles in the RLHF process, particularly when training reasoning models that employ CoT prompting.
    • RLHF provides a framework for aligning LLMs with human preferences by incorporating human feedback into the reward model.
    • CoT prompting encourages the model to explicitly generate a step-by-step reasoning process before arriving at a final answer.

Here’s how ORMs and PRMs fit into this:

  • ORMs within RLHF: Human feedback can be used to train an ORM by having humans rank different model outputs or simply label them as correct or incorrect. The ORM then learns to predict which solutions are more likely to be preferred by humans.
  • PRMs within RLHF: Human annotators can evaluate the individual reasoning steps generated by the model, providing feedback on the quality and logic of each step. This feedback is used to train a PRM that rewards models for generating coherent and accurate reasoning chains.
  • ORMs and CoT: ORMs can be used with CoT by evaluating the final answer produced at the end of the thought chain. However, they do not provide insights into the reasoning process itself.
  • PRMs and CoT: PRMs are particularly valuable when used with CoT as they can evaluate each step in the chain, providing more specific guidance for improving the model’s reasoning abilities.
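
The practical difference shows up in how the two reward models score the same chain of thought: an ORM returns a single score for the final answer, while a PRM returns a score per step that must then be aggregated. The sketch below is purely illustrative; `orm` and `prm` stand for hypothetical trained scoring models, and taking the minimum step score is just one common aggregation choice.

```python
from typing import Callable, List

def score_with_orm(orm: Callable[[str, str], float],
                   problem: str, final_answer: str) -> float:
    """ORM: one scalar reward for the final outcome only."""
    return orm(problem, final_answer)

def score_with_prm(prm: Callable[[str, List[str]], List[float]],
                   problem: str, steps: List[str]) -> float:
    """PRM: a score per reasoning step, aggregated here by taking the minimum."""
    step_scores = prm(problem, steps)
    return min(step_scores)  # the chain is treated as only as good as its weakest step
```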

Majority vote vs. best-of-N sampling

Both majority vote and best-of-N sampling are techniques used to improve the performance of LLMs, especially in the context of reasoning tasks that involve CoT prompting. They represent different approaches to leveraging multiple output generations from the model to arrive at a more reliable and accurate final solution.

Majority Vote (Self-Consistency)

In majority vote, also known as self-consistency, the same prompt is run through the model multiple times, generating a set of independent responses. The final answer is then chosen based on the response that appears most frequently among the generated samples. This method assumes that the most common answer is more likely to be the correct one. OpenAI’s o1 Pro utilizes this technique.
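
A minimal sketch of self-consistency, assuming a hypothetical `sample_answer` function that runs the same prompt through the model with temperature sampling and returns only the extracted final answer:

```python
from collections import Counter
from typing import Callable

def self_consistency(prompt: str,
                     sample_answer: Callable[[str], str],  # hypothetical: one sampled completion -> final answer
                     n_samples: int = 16) -> str:
    """Majority vote: sample the model several times and return the most frequent answer."""
    answers = [sample_answer(prompt) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]
```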

Best-of-N Sampling

Best-of-N sampling involves generating N different solutions for a given prompt and then employing a verifier model to identify the chain-of-thought that leads to the most likely correct answer. This approach differs from majority vote in a couple of key aspects:

  • Focus on reasoning process: Best-of-N emphasizes selecting the best reasoning path rather than simply the most frequent answer. It relies on the verifier model to assess the quality and logic of each chain of thought.
  • Reliance on a verifier: The effectiveness of best-of-N is heavily dependent on the capabilities of the verifier model. This verifier needs to be able to reliably distinguish between correct and incorrect reasoning chains.
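
A minimal sketch of best-of-N under similar assumptions: `sample_chain` is a hypothetical function returning a full chain of thought plus its final answer, and `verifier` is a hypothetical model that scores a chain’s quality.

```python
from typing import Callable, Tuple

def best_of_n(prompt: str,
              sample_chain: Callable[[str], Tuple[str, str]],  # hypothetical: -> (chain_of_thought, answer)
              verifier: Callable[[str, str], float],           # hypothetical: scores a chain of thought
              n: int = 16) -> str:
    """Generate N chains of thought and return the answer whose chain the verifier scores highest."""
    candidates = [sample_chain(prompt) for _ in range(n)]
    best = max(candidates, key=lambda c: verifier(prompt, c[0]))
    return best[1]  # the final answer from the highest-scoring chain
```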

Key Differences and Considerations

Here is a table summarizing the key differences between majority vote and best-of-N sampling:

| Feature | Majority Vote | Best-of-N Sampling |
| --- | --- | --- |
| Selection criteria | Frequency of a specific answer | Quality of the reasoning chain, assessed by a verifier model |
| Verifier model | Not required | Essential |
| Applicability | Suitable for various tasks | More limited to tasks with reliable verifiers |
| Computational cost | Relatively lower | Potentially higher due to the verifier model |
| Sensitivity to verifier | Not applicable | Highly sensitive |

The choice between majority vote and best-of-N sampling depends on several factors:

  • Task Characteristics: Majority vote is more versatile and can be applied to a wider range of tasks, while best-of-N is more suitable for problems that have reliable and efficient verification methods.
  • Verifier Availability: The availability of a robust verifier model is crucial for best-of-N sampling.
  • Computational Constraints: Best-of-N can be more computationally expensive due to the need to run the verifier model on multiple candidate solutions.

Overall, both techniques offer valuable ways to improve the reliability and accuracy of LLMs, particularly for reasoning tasks. Majority vote provides a simple and efficient approach for tasks where identifying the correct answer is sufficient, while best-of-N sampling offers a more nuanced approach for tasks where evaluating the reasoning process is critical.