Read DeepSeek-R1 and Kimi-k1.5.

DeepSeek-R1 Summary

DeepSeek-R1 presents a paradigm shift in training reasoning (Chain-of-Thought) models. Unlike previous approaches that rely heavily on Supervised Fine-Tuning (SFT) with human-annotated data or Process Reward Models (PRM), DeepSeek demonstrates that reasoning capabilities can emerge purely through Reinforcement Learning (RL), given sufficient scale and appropriate incentives.

The work introduces two primary models:

  1. DeepSeek-R1-Zero: A model trained via pure RL on the base model without any SFT cold start. It exhibits powerful reasoning but suffers from poor readability and language mixing.
  2. DeepSeek-R1: A refined pipeline incorporating a “cold start” (small SFT dataset) followed by multi-stage RL and rejection sampling. It achieves performance comparable to OpenAI-o1-1217.

Technical Methodology

A. The Core Algorithm: Group Relative Policy Optimization (GRPO)

Instead of standard PPO (Proximal Policy Optimization), the authors utilize GRPO.

  • Critic-Less Architecture: Standard RL requires a value function (critic) model, usually the same size as the policy model, doubling memory/compute costs. GRPO eliminates the critic.
  • Baseline Estimation: It estimates the baseline using the average reward of a group of outputs sampled from the same query.
  • Objective Function: The objective maximizes the advantage of sampled outputs relative to the group average, clipped to prevent drastic policy shifts, minus a KL-divergence term that keeps the policy close to the reference model: \(J_{GRPO}(\theta) = \frac{1}{G} \sum_{i=1}^{G} \left( \min\left( \frac{\pi_\theta(o_i|q)}{\pi_{\theta_{old}}(o_i|q)} A_i,\ \text{clip}\left(\frac{\pi_\theta(o_i|q)}{\pi_{\theta_{old}}(o_i|q)},\ 1-\varepsilon,\ 1+\varepsilon\right) A_i \right) - \beta\, D_{KL}\left(\pi_\theta \,\|\, \pi_{ref}\right) \right)\), where the advantage \(A_i = \frac{r_i - \text{mean}(\{r_1,\dots,r_G\})}{\text{std}(\{r_1,\dots,r_G\})}\) is the group-normalized reward. A code sketch of this computation follows below.
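Below is a minimal sketch of the group-relative update in PyTorch, assuming per-output (sequence-level) log-probabilities have already been gathered. The function name, the KL estimator, and the default coefficients are illustrative choices rather than values from the paper, which applies the objective per token.

```python
import torch

def grpo_loss(logp_new, logp_old, logp_ref, rewards, clip_eps=0.2, beta=0.04):
    """Minimal GRPO-style loss for one query with a group of G sampled outputs.

    logp_new / logp_old / logp_ref: summed log-probs of each output under the
    current, old (sampling), and frozen reference policies, each of shape (G,).
    rewards: scalar rule-based reward per output, shape (G,).
    """
    # Group-relative advantage: normalize rewards within the group, which
    # replaces the baseline a learned critic would otherwise provide.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # PPO-style clipped surrogate using importance ratios against the old policy.
    ratio = torch.exp(logp_new - logp_old)
    surrogate = torch.min(ratio * adv,
                          torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv)

    # KL penalty toward the reference policy (unbiased, non-negative estimator).
    log_ratio_ref = logp_ref - logp_new
    kl = torch.exp(log_ratio_ref) - log_ratio_ref - 1

    # Maximize (surrogate - beta * KL) -> minimize its negated mean.
    return -(surrogate - beta * kl).mean()
```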

B. DeepSeek-R1-Zero (Pure RL)

  • Input: DeepSeek-V3-Base.
  • Reward System: Purely rule-based, to avoid the “reward hacking” common with neural reward models (both checks are sketched after this list):
    1. Accuracy Reward: Deterministic checks (e.g., LeetCode compilers, math answers in specific boxes).
    2. Format Reward: Enforcing the <think> and </think> tags to capture the reasoning process.
  • Result: The model autonomously increased its “thinking time” (context length), evolving from thousands of tokens to tens of thousands, enabling self-verification and error correction.
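A minimal sketch of what such rule-based checks might look like. The regular expressions, the \boxed{...} answer convention, and the additive combination are illustrative assumptions rather than the paper's exact rules.

```python
import re

def format_reward(response: str) -> float:
    """Reward responses that wrap their reasoning in <think>...</think>
    and then emit a final answer (illustrative check, not the exact rule)."""
    pattern = r"^<think>.+?</think>\s*\S+"
    return 1.0 if re.match(pattern, response, flags=re.DOTALL) else 0.0

def accuracy_reward(response: str, ground_truth: str) -> float:
    """Deterministic accuracy check for math-style tasks: the final answer is
    expected inside \\boxed{...}; coding tasks would instead run test cases."""
    match = re.search(r"\\boxed\{([^}]*)\}", response)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0

def total_reward(response: str, ground_truth: str) -> float:
    # The training signal is simply the sum of the two rule-based terms.
    return accuracy_reward(response, ground_truth) + format_reward(response)
```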

C. DeepSeek-R1 (The Hybrid Pipeline)

To fix the readability issues of R1-Zero, the authors devised a 4-stage pipeline:

  1. Cold Start (SFT): Fine-tuning V3-Base on a small set of high-quality, readable Chain-of-Thought (CoT) data (thousands of samples) to define output structure and readability.
  2. Reasoning-Oriented RL: Applying GRPO to the cold-started model. A language consistency reward is added here to fix the “language mixing” issue observed in R1-Zero.
  3. Rejection Sampling & General SFT:
    • The checkpoint from Stage 2 generates 600k reasoning samples via rejection sampling, keeping only correct answers (sketched after this list).
    • Combined with 200k non-reasoning samples (writing, QA) to protect general capabilities.
    • The base model is retrained on this combined dataset.
  4. RL for All Scenarios: A final RL stage using reward models (not just rule-based) to align with human preferences for helpfulness and harmlessness.
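A rough sketch of the Stage-3 data construction, assuming hypothetical `generate` and `is_correct` helpers that stand in for model inference and the rule-based checker; the per-prompt sample count and the keep-one-trace policy are simplifications.

```python
import random

def build_reasoning_sft(prompts, generate, is_correct, n_samples=16):
    """Illustrative Stage-3 rejection sampling: draw several completions from
    the Stage-2 RL checkpoint per prompt and keep only verifiably correct ones."""
    dataset = []
    for prompt in prompts:
        completions = generate(prompt, n=n_samples)
        correct = [c for c in completions if is_correct(prompt, c)]
        if correct:
            # Keep one correct trace per prompt; the paper collects ~600k in total.
            dataset.append({"prompt": prompt, "response": random.choice(correct)})
    return dataset

# The reasoning set is then mixed with ~200k non-reasoning samples (writing, QA)
# before the base model is retrained on the combined data.
```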

Key Insights & “Aha Moments”

1. Emergence of Self-Verification: The most striking finding is the “Aha Moment” (Table 3). Without being explicitly taught to verify its work, R1-Zero learned to interrupt its generation, re-evaluate previous steps, and correct errors to maximize the reward. This suggests that extended test-time compute is not just about length, but about the quality of the search trajectory.

2. The Failure of Process Reward Models (PRM) & MCTS: The authors explicitly note unsuccessful attempts with PRMs and Monte Carlo Tree Search (MCTS).

  • PRM issues: Defining fine-grained steps is hard; model-based rewards lead to hacking; manual annotation is unscalable.
  • MCTS issues: The search space for token generation is exponentially larger than Chess/Go, making it difficult to optimize the value model iteratively.
  • Insight: Simple RL on the final outcome (Rule-Based Reward) proved more effective than complex step-by-step supervision.

3. Distillation vs. RL on Small Models: The paper presents a counter-intuitive finding regarding small models.

  • Running massive RL on a small base model (e.g., Qwen-32B) yields minimal gains.
  • Distillation is superior: Fine-tuning a small model on the outputs of DeepSeek-R1 (the large teacher) results in state-of-the-art performance for that size class (a minimal sketch follows this list).
  • Implication: Large models are required to discover reasoning patterns, but small models can effectively learn them once discovered.
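Since distillation here is plain SFT on teacher-generated reasoning traces (no RL stage for the student), a minimal sketch is just ordinary causal-LM fine-tuning. The checkpoint path, the learning rate, and the single-example step are placeholders for illustration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder student checkpoint; in the paper the students are Qwen and Llama
# models ranging from 1.5B to 70B parameters.
student_path = "path/to/small-student-model"
tokenizer = AutoTokenizer.from_pretrained(student_path)
student = AutoModelForCausalLM.from_pretrained(student_path)
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-5)

def distill_step(prompt: str, teacher_trace: str) -> float:
    """One SFT step: standard next-token cross-entropy on the teacher's full
    reasoning trace; distillation here uses no reward signal at all."""
    text = prompt + teacher_trace + tokenizer.eos_token
    batch = tokenizer(text, return_tensors="pt", truncation=True, max_length=4096)
    out = student(**batch, labels=batch["input_ids"])
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()
```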

Performance & Benchmarks

  • Reasoning: DeepSeek-R1 achieves 79.8% Pass@1 on AIME 2024 and 97.3% on MATH-500, effectively matching OpenAI-o1-1217.
  • Coding: Achieves a 2029 Elo rating on Codeforces (96.3rd percentile).
  • Distilled Models: The DeepSeek-R1-Distill-Qwen-7B achieves 55.5% on AIME 2024, surpassing the much larger QwQ-32B-Preview and non-reasoning models like GPT-4o-0513.

Domain Expert Critique

Strength: Democratization of RLHF/RL

The use of GRPO is a significant engineering win. By removing the need for a critic model, DeepSeek reduces the VRAM requirements for training large-scale RL models by roughly half. This makes replicating these results more feasible for the open-source community.

The “Black Box” of Distillation

The paper confirms that “reasoning” as a capability can be compressed. The fact that a 1.5B model can outperform GPT-4o on math benchmarks (28.9% vs. 9.3% on AIME 2024) solely through SFT on R1-generated data suggests that reasoning chains are highly transferable features, unlike “world knowledge,” which requires parameter mass.

Limitation: The Language Barrier

A noted drawback is language mixing. R1-Zero, having no SFT constraints, would often switch languages mid-thought. While R1 fixes this via a “language consistency reward,” it indicates that without SFT constraints, the “optimal path” through the model’s latent space may naturally traverse cross-lingual tokens.


Analogy for Understanding R1-Zero vs. R1

Think of training a reasoning model like teaching a student to solve complex proofs:

  • DeepSeek-R1-Zero is like locking a genius student in a room with a textbook and an answer key, but no teacher. They are told, “Keep trying until you get the right answer.” The student eventually figures out brilliant, chaotic methods to solve problems—scribbling all over the walls, sometimes switching languages, or inventing new notations—because all that matters is the final answer. They become incredibly smart, but communicating with them is difficult.

  • DeepSeek-R1 takes that same student but first gives them a Style Guide (Cold Start SFT) on how to write clearly. Then, during the practice phase (RL), the teacher not only checks the answer but also deducts points if the student switches languages randomly or writes illegibly (Language Consistency Reward). The result is a student who is just as brilliant but creates solutions that are structured, readable, and polite.

Kimi-k1.5 Summary

Kimi k1.5 is a new multimodal LLM trained with reinforcement learning (RL). The researchers behind Kimi k1.5 aimed to explore RL as a new axis for AI scaling, as language model pretraining with next-token prediction is limited by the amount of available training data.

Key Ingredients of Kimi k1.5

Kimi k1.5’s design and training revolve around several key ingredients:

  • Long Context Scaling: The researchers scaled Kimi k1.5’s context window for RL to 128k tokens and found that performance continued to improve as the context length increased. To make training with such a large context window efficient, they used partial rollouts, which sample new trajectories by reusing large chunks of previously generated ones (a sketch follows this list).
  • Improved Policy Optimization: Kimi k1.5 uses a formulation of RL with long chain-of-thought (CoT) and a variant of online mirror descent for robust policy optimization.
  • Simple Framework: Combining long context scaling and improved policy optimization yields a simple but effective RL framework. As the context length increases, so does the number of search steps the model can take, allowing Kimi k1.5 to achieve strong performance without complex techniques like Monte Carlo tree search, value functions, or process reward models.
  • Multimodalities: Kimi k1.5 is trained on both text and vision data, enabling it to reason over both modalities.
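A rough sketch of the partial-rollout idea: cap how many tokens each trajectory may add per RL iteration, store unfinished trajectories, and resume them in a later iteration instead of regenerating them. The `policy.generate` interface, the length accounting, and the budget numbers are hypothetical.

```python
def partial_rollout(policy, prompt, saved_prefix="", max_total_tokens=65536,
                    per_iteration_budget=4096):
    """Illustrative partial rollout: continue a previously saved trajectory chunk
    rather than regenerating it, and stop once this iteration's budget is spent.

    `policy.generate(text, max_tokens=...)` is a hypothetical interface that
    returns the newly generated text and whether the trajectory finished."""
    context = prompt + saved_prefix                      # reuse earlier work
    remaining = max_total_tokens - len(saved_prefix)     # crude length accounting
    new_text, finished = policy.generate(
        context, max_tokens=min(per_iteration_budget, remaining))

    trajectory = saved_prefix + new_text
    if finished or remaining <= len(new_text):
        return {"status": "complete", "trajectory": trajectory}
    # Unfinished: keep the prefix so the next RL iteration can resume from here.
    return {"status": "pending", "trajectory": trajectory}
```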

Long-CoT Supervised Fine-tuning and RL

Before beginning RL, Kimi k1.5 underwent a long-CoT supervised fine-tuning stage. The researchers created a small but high-quality dataset of reasoning paths for both text and image inputs using prompt engineering. This primed the model to internalize key cognitive processes such as planning, evaluation, reflection, and exploration.

RL then trains a policy model to solve problems in a dataset by generating a sequence of intermediate reasoning steps (a chain of thought) followed by a final answer. The model is rewarded for arriving at the correct answer, which encourages it to explore different reasoning paths. The researchers argue that traditional RL methods that rely on value functions for credit assignment may be ill-suited here, because penalizing intermediate reasoning steps that look wrong can discourage the exploration needed to reach correct answers.
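In the paper’s framing, this reduces to maximizing an expected outcome reward over sampled chains of thought and final answers, roughly:

\[
\max_\theta \; \mathbb{E}_{x \sim \mathcal{D},\; (z, y) \sim \pi_\theta}\big[\, r(x, y, y^*) \,\big],
\]

where \(x\) is the problem, \(z\) the generated chain of thought, \(y\) the final answer, \(y^*\) the ground truth, and \(r\) an outcome reward that is 1 for a correct answer and 0 otherwise.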

Improving Training Efficiency

The authors used several techniques to make RL training more efficient:

  • Length Penalty: To prevent the model from generating excessively long responses, they introduced a length reward that penalizes longer responses and promotes shorter ones, especially among responses that arrive at the correct answer (sketched after this list).
  • Sampling Strategies:
    • Curriculum Sampling: Starts by training on easier tasks and gradually introduces harder tasks.
    • Prioritized Sampling: Focuses on problems with low success rates to improve the model’s weakest areas.
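A sketch of a group-wise length reward in the spirit described above: within the responses sampled for one problem, shorter ones are favored, and incorrect responses are never rewarded merely for being short. The specific linear form and the 0.5 offset are an approximation of the paper’s rule and should be treated as illustrative.

```python
def length_rewards(lengths, correct):
    """Compute a length-based reward for each sampled response to one problem.

    lengths: token counts of the sampled responses.
    correct: booleans indicating whether each response reached the right answer.
    """
    min_len, max_len = min(lengths), max(lengths)
    span = max(max_len - min_len, 1)  # avoid division by zero when lengths match
    rewards = []
    for length, is_correct in zip(lengths, correct):
        lam = 0.5 - (length - min_len) / span   # in [-0.5, 0.5]; shorter -> higher
        # Correct responses get the full (possibly negative) term; incorrect
        # responses are only ever penalized, never rewarded for brevity.
        rewards.append(lam if is_correct else min(0.0, lam))
    return rewards
```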

Long2short: Bringing Long-CoT Benefits to Short-CoT Models

Though long-CoT models perform well, they use more tokens during inference. To address this, the researchers developed several “long2short” methods to transfer insights from long-CoT models to shorter ones (the first two are sketched in code after the list). These methods included:

  • Model Merging: Averages the weights of a long-CoT model and a shorter model.
  • Shortest Rejection Sampling: Samples the same question multiple times and selects the shortest correct response for supervised fine-tuning.
  • Direct Preference Optimization (DPO): Uses the shortest correct response as a positive sample and longer responses (both correct and incorrect) as negative samples to train the model.
  • Long2short RL: Applies a length penalty during a second RL training phase to further penalize overly long responses.
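Minimal sketches of the first two long2short methods. The mixing coefficient `alpha`, the `generate`/`is_correct` helpers, and the keep-shortest policy are illustrative assumptions.

```python
import copy

def merge_models(long_cot_model, short_cot_model, alpha=0.5):
    """Model merging: a simple element-wise average of the long-CoT model's
    weights with a shorter model's weights (alpha is an illustrative mix)."""
    merged = copy.deepcopy(short_cot_model)
    long_sd, short_sd = long_cot_model.state_dict(), short_cot_model.state_dict()
    merged.load_state_dict({
        name: alpha * long_sd[name] + (1 - alpha) * short_sd[name]
        for name in short_sd
    })
    return merged

def shortest_correct_response(prompt, generate, is_correct, n=8):
    """Shortest rejection sampling: sample the same question n times and keep
    the shortest correct response as an SFT target. `generate` and `is_correct`
    are hypothetical helpers for inference and answer checking."""
    candidates = [c for c in generate(prompt, n=n) if is_correct(prompt, c)]
    return min(candidates, key=len) if candidates else None
```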

Results and Conclusions

Kimi k1.5 achieved state-of-the-art results on various reasoning and vision benchmarks in both its long-CoT and short-CoT versions. Notably, the researchers demonstrated the importance of scaling context length for improving reasoning ability: their experiments showed that smaller models could achieve performance comparable to larger ones by leveraging long CoTs optimized through RL, though larger models were more token-efficient. They also found that their RL method was more sample-efficient than ReST, which does not apply negative gradients to penalize incorrect responses.

The authors conclude that scaling context length and improving policy optimization are crucial for continued LLM improvement. They believe future research should focus on improving credit assignment in RL and on reducing overthinking without hindering the model’s ability to explore different reasoning paths, and they see potential in long2short methods for further improving short-CoT model performance.