Reading notes on the following blog posts:

This note summarizes three blog posts from Thinking Machines Lab concerning advanced techniques in LLM fine-tuning, parameter efficiency, and optimization geometry.


I. LoRA Without Regret (September 29, 2025)

This blog investigates the effectiveness of Low-Rank Adaptation (LoRA), a leading Parameter Efficient Fine-Tuning (PEFT) method, compared to full fine-tuning (FullFT).

LoRA Fundamentals and Advantages

LoRA modifies a weight matrix $W$ to $W' = W + \gamma BA$, where the combined parameter count of $B$ and $A$ is far smaller than that of $W$. LoRA is motivated by the intuition that post-training updates, which use small datasets focused on narrow domains, should not require adjusting more than a small fraction of the information capacity of the original base weights.
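As a concrete illustration of this parametrization, here is a minimal NumPy sketch of a LoRA-style linear layer, using the $\alpha/r$ scaling discussed later in the post; all names and sizes are hypothetical, chosen for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

d_out, d_in, rank, alpha = 64, 32, 4, 8   # hypothetical sizes

W = rng.normal(size=(d_out, d_in))                 # frozen base weight
A = rng.normal(size=(rank, d_in)) / np.sqrt(d_in)  # trainable, small random init
B = np.zeros((d_out, rank))                        # trainable, zero init so W' = W at start

def lora_forward(x):
    """Apply W' = W + (alpha / rank) * B @ A without materializing W'."""
    return W @ x + (alpha / rank) * (B @ (A @ x))

x = rng.normal(size=(d_in,))
assert np.allclose(lora_forward(x), W @ x)   # B = 0, so output matches the base model
assert rank * (d_in + d_out) < d_in * d_out  # far fewer trainable parameters than W
```

Because $B$ starts at zero, the adapted model is exactly the base model at initialization, and only the small $A$ and $B$ matrices receive gradient updates.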

Operational advantages of LoRA include:

  1. Multi-tenant serving: A single inference server can hold many adapters (the A and B matrices) in memory and sample from them simultaneously.
  2. Memory footprint for training: LoRA requires far less memory than FullFT since it trains (and keeps optimizer state for) far fewer weights, making training more accessible and efficient.
  3. Ease of loading and transfer: LoRA adapters are fast and easy to set up or move between machines due to their smaller size.

Key Findings: The Low-Regret Regime

The core finding is that LoRA can match the performance and sample efficiency of FullFT when certain critical details are correctly implemented, characterizing a “low-regret regime” which covers most post-training scenarios.

Conditions for LoRA to match FullFT performance:

  1. LoRA must be applied to all layers of the network, particularly the MLP (and MoE) layers. Attention-only LoRA significantly underperforms even when attempting to match the parameter count using higher ranks. This preference for MLP layers is also supported by approximations based on the empirical neural tangent kernel (eNTK).
  2. LoRA must not be capacity constrained; the number of trainable parameters must exceed the amount of information to be learned, which is typical for small-to-medium post-training datasets. LoRA underperforms FullFT in settings that resemble pre-training (very large datasets).

Training Dynamics and Hyperparameters

  • Optimal Learning Rate (LR) Ratio: The optimal LR for LoRA is consistently approximately 10 times higher than the optimal LR for FullFT in both supervised learning and reinforcement learning settings.
  • Rank and Learning Rate Invariance: The inclusion of the $1/r$ scaling factor (where $r$ is the rank) in the LoRA parametrization $W’ = W + \frac{\alpha}{r}BA$ makes the optimal learning rate approximately independent of the rank.
  • Batch Size Effects: LoRA shows less tolerance for large batch sizes than FullFT, incurring a larger penalty in loss as batch size increases beyond some point. This is thought to be a property of the product-of-matrices parametrization ($BA$), independent of rank.
  • Compute Efficiency: LoRA requires slightly more than ⅔ of the FLOPs of full fine-tuning per forward-backward pass: the frozen base weights still need a forward pass and an input-gradient pass, but no weight-gradient pass (roughly 4 of the 6 matmul FLOPs per token), plus a small overhead for the LoRA matrices themselves. This gives LoRA a compute-efficiency advantage when performance is plotted against FLOPs instead of training steps.

RL Specifics

LoRA performs equivalently to FullFT for RL even when using very small ranks (as low as 1). This is attributed to an information-theoretic argument: policy gradient methods provide only $O(1)$ bits of information per episode, regardless of the number of tokens, meaning RL inherently requires very low capacity.


II. Modular Manifolds (September 26, 2025)

This post introduces the idea of applying manifold constraints to neural network weight matrices to ensure “healthy” tensors (preventing numerical instability and making training algorithms easier to design).

Manifold Optimization Concepts

Manifold optimization involves constraining weight tensors to a curved surface (a manifold). The optimization steps are taken within the tangent space (the local flat approximation of the manifold) to equate the learning rate with the actual length of the optimization step.

The general procedure for a first-order manifold optimizer involves three steps:

  1. Find the unit-norm tangent vector that maximizes the inner product with the gradient (the steepest direction within the tangent space).
  2. Multiply this direction by the learning rate and apply the update.
  3. Use a “retraction map” to project the updated weights back to the manifold.
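The three steps above can be sketched on the simplest possible example, the unit sphere with the Euclidean norm (a toy illustration, not the post's Stiefel setting):

```python
import numpy as np

def sphere_step(w, grad, lr):
    """One first-order manifold-optimizer step on the unit sphere ||w|| = 1.

    1. Steepest direction in the tangent space: project the gradient onto
       the tangent plane at w and normalize to unit length.
    2. Move by exactly `lr` along that direction (learning rate = step length).
    3. Retraction: renormalize back onto the sphere.
    """
    tangent = grad - (grad @ w) * w          # remove the radial component
    direction = tangent / np.linalg.norm(tangent)
    w_new = w - lr * direction               # step of length exactly lr
    return w_new / np.linalg.norm(w_new)     # retract to the manifold

w = np.array([1.0, 0.0, 0.0])
g = np.array([0.0, 1.0, 0.0])
w = sphere_step(w, g, lr=0.1)
assert abs(np.linalg.norm(w) - 1.0) < 1e-12  # stays on the manifold
```

Because the step is taken along a unit-length tangent direction, the learning rate really is the length of the move, which is the property the post is after.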

Manifold Muon and the Stiefel Manifold

The post proposes Manifold Muon as an optimizer where the weight matrix $W$ is constrained to the Stiefel manifold.

  • Stiefel Manifold ($\mathsf{Stiefel}(m,n)$): This constraint requires that $W^T W = I_n$ (for tall matrices $m \geq n$). A matrix constrained to the Stiefel manifold has all its singular values equal to exactly one, which prevents the matrix from having excessively small or large stretching effects on input vectors.
  • Distance Measure: Manifold Muon uses the spectral norm as the distance function, which measures the largest singular value of the update matrix $A$.
  • Solution: Solving the constrained optimization problem for Manifold Muon involves reformulating it as a dual problem and solving it via dual ascent. Experiments show that Manifold Muon can achieve higher accuracy than AdamW and forces the final singular value distribution of the weights to cluster around 1.
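A hedged NumPy sketch of the retraction onto the Stiefel manifold: snap every singular value to 1 by keeping only the orthogonal polar factor of the SVD. (In practice, Muon-style methods approximate this map with Newton–Schulz iterations rather than a full SVD; the SVD version here is just the easiest to verify.)

```python
import numpy as np

def stiefel_retract(W):
    """Project a tall matrix onto Stiefel(m, n) by setting every singular
    value to 1, i.e. keeping the orthogonal polar factor U @ Vt of the SVD."""
    U, _, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ Vt

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4))
Q = stiefel_retract(W)

# On the Stiefel manifold: W^T W = I_n, so all singular values equal 1 and
# the matrix neither shrinks nor stretches any input direction.
assert np.allclose(Q.T @ Q, np.eye(4), atol=1e-10)
```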

The Theory of Modular Manifolds

The theory of modular manifolds provides an abstraction for combining layers and budgeting learning rates across a complex network.

  • A neural network module is defined by three attributes: a forward function $f$, a weight submanifold constraint $\mathcal{M}$, and a norm $|\cdot|$.
  • When composing two modules, the new manifold ($\mathcal{M}_3$) is the Cartesian product of the two existing manifolds.
  • The new norm ($|\cdot|_3$) is determined by the max of the two existing norms weighted by special scalar coefficients ($s_1, s_2$). These coefficients budget the learning rates across the composed layers. This construction links the budget of learning rates directly to the Lipschitz sensitivity of the network output with respect to the weights.
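Written out, the composition rule described above might look as follows (a sketch following the text, with $s_1, s_2$ the learning-rate-budgeting coefficients):

```latex
\mathcal{M}_3 = \mathcal{M}_1 \times \mathcal{M}_2,
\qquad
\|(w_1, w_2)\|_3 = \max\bigl(s_1\,\|w_1\|_{(1)},\; s_2\,\|w_2\|_{(2)}\bigr)
```

Bounding the composed norm then bounds each layer's update size $\|w_i\|_{(i)}$ by $1/s_i$ times the overall budget, which is how the coefficients apportion learning rate across layers.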

III. On-Policy Distillation (October 27, 2025)

This blog explores On-Policy Distillation (OPD) as an efficient post-training method that overcomes the limitations of traditional SFT and RL.

Combining On-Policy Sampling and Dense Reward

Post-training methods are typically divided into:

  • Off-policy (SFT/Distillation): Uses external teacher outputs (trajectories), providing a dense reward signal. Drawback: the student learns in contexts frequented by the teacher, leading to compounding error (exposure bias) when the student diverges.
  • On-policy (RL): Samples rollouts from the student itself, providing a sparse reward signal. Drawback: inefficiency due to sparse feedback ($O(1)$ bits per episode).

OPD samples trajectories from the student model (on-policy) but uses a high-performing teacher model to grade each token of the trajectory (dense reward).
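A toy, self-contained sketch of this sampling-and-grading loop, with the two policies stood in by fixed next-token distributions (all names and sizes hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
V = 4  # toy vocabulary size

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical stand-ins for the policies: fixed next-token distributions.
student = softmax(rng.normal(size=V))
teacher = softmax(rng.normal(size=V))

rewards = []
for _ in range(8):                  # one short rollout
    tok = rng.choice(V, p=student)  # on-policy: sample from the student
    # Dense reward: the teacher grades this exact token; the per-token
    # reward is the negative reverse-KL contribution.
    rewards.append(-(np.log(student[tok]) - np.log(teacher[tok])))

mean_reward = np.mean(rewards)  # estimates -KL(student || teacher)
```

Every token yields a graded signal, in contrast to a sparse RL reward that arrives once per episode.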

Implementation and Efficiency

  • Loss Function: OPD minimizes the per-token reverse KL divergence ($\text{KL}(\pi_\theta || \pi_\text{teacher})$). Minimizing reverse KL is “mode seeking” (it learns the teacher’s specific behavior) and reduces exposure bias.
  • Efficiency Gains: OPD is highly compute-efficient. The Qwen team reported achieving superior performance (74.4% on AIME’24) at one-tenth the cost of RL using OPD. Overall, OPD demonstrated a cost reduction of 9-30x compared to estimated SFT costs and an approximate 50-100x reduction in total compute compared to RL.
  • Nature of Learning: The source suggests that RL spends most compute on search (rolling out policies and assigning credit) to explore the space of semantic strategies, whereas OPD acts as a shortcut by distilling the final strategy learned, saving the compute required for modeling intermediate strategies.

Applications: Reasoning and Continual Learning

  1. Reasoning: OPD significantly improves efficiency for training models on mathematical reasoning tasks compared to SFT extrapolation or RL. It also allows for data efficiency, as multi-epoch training on a single complex prompt can be sufficient to distill the teacher’s performance, contrasting with RL’s tendency toward memorization in multi-epoch training.
  2. Continual Learning/Personalization: OPD is highly effective for recovering specialized behaviors (like instruction following) that were lost due to catastrophic forgetting during mid-training on new domain knowledge (e.g., internal company documents). SFT on a model’s own samples degrades performance over time due to accumulation of non-zero gradient updates, while OPD inherently stays on-policy, making it a promising tool for continual learning.

Analogy: If fine-tuning a language model is like learning to pilot a plane:

  • Full Fine-Tuning means replacing every part of the plane after every flight, regardless of whether a small adjustment or a complete overhaul was needed.
  • LoRA is like installing specialized modular actuators (the A and B matrices) only where adjustments are needed, making the plane much lighter, easier to store, and faster to update, while still achieving the same flight precision as a rebuilt plane.
  • RL is trying to learn pilot skill through crashing or succeeding (sparse feedback). It takes many flights (compute) to figure out the right set of controls (strategy).
  • On-Policy Distillation (OPD) is flying the plane yourself (on-policy) while an expert copilot (teacher) instantaneously grades every single action you take (dense reward), allowing you to learn the expert’s successful strategy much faster than through trial and error.

More notes about OPD

On-Policy Distillation is proposed as a post-training paradigm that hybridizes the on-policy sampling characteristic of RL with the dense reward signal obtained via knowledge distillation. This synthesis overcomes key limitations in both conventional RL and Supervised Fine-Tuning (SFT).

I. Core Mechanism and Objective Function

OPD functions by executing rollouts from the student policy ($\pi_\theta$) (i.e., on-policy sampling) and then leveraging a high-performing teacher model ($\pi_{\text{teacher}}$) to grade every token of the student-generated trajectory. This provides a dense reward signal throughout the sequence, unlike sparse environment rewards typical in policy gradient methods.

The Loss Function: Reverse KL Divergence

The core objective minimizes the per-token reverse KL divergence ($\text{KL}(\pi_\theta || \pi_{\text{teacher}})$). The loss pushes the student to approximate the teacher’s behavior conditional on the trajectory that the student itself sampled:

\[\text{KL}\bigl(\pi_\theta \,\|\, \pi_\text{teacher}\bigr) = \mathbb{E}_{x \sim \pi_\theta} \Bigl[ \log \pi_\theta(x_{t+1} \mid x_{1..t}) - \log \pi_\text{teacher}(x_{t+1} \mid x_{1..t}) \Bigr]\]

In practice, the per-token advantage used for the policy update is set to the negative reverse KL.
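A minimal sketch of that advantage computation, assuming we already have each sampled token's log-probability under the student and the teacher (array shapes and names hypothetical):

```python
import numpy as np

def opd_advantages(student_logp, teacher_logp):
    """Per-token advantage for the policy update: the negative reverse-KL
    contribution log pi_theta(x_t) - log pi_teacher(x_t), evaluated at the
    tokens the student actually sampled."""
    student_logp = np.asarray(student_logp)
    teacher_logp = np.asarray(teacher_logp)
    return -(student_logp - teacher_logp)

# Tokens where the student is more confident than the teacher get a negative
# advantage; tokens the teacher assigns more probability get a positive one.
adv = opd_advantages([-0.1, -2.0], [-0.5, -0.7])
assert adv[0] < 0 and adv[1] > 0
```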

Key theoretical properties of minimizing the reverse KL in this context include:

  1. Mode Seeking: Reverse KL is “mode seeking,” meaning the student learns one specific desired behavior—that of the teacher—rather than spreading its probability mass across several potentially suboptimal options.
  2. Exposure Bias Reduction: By sampling from the student’s own policy, OPD ensures that the model learns to recover from mistakes or navigate states that the teacher may not typically visit in its trajectories, directly addressing the compounding error (exposure bias) observed in off-policy distillation.
  3. Reward Integrity: The reverse KL is considered “unhackable” in the sense that low KL divergence universally corresponds to a high probability of desirable behavior from the teacher model’s perspective. For simplicity, a discount factor of zero is used, optimizing only the immediate next token.
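The mode-seeking property can be seen in a tiny discrete example (distributions chosen purely for illustration): a student that commits to one of the teacher's two modes has finite reverse KL, while its forward KL diverges.

```python
import numpy as np

def kl(p, q):
    """KL(p || q) over a discrete support, with the convention 0 * log(0/q) = 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    with np.errstate(divide="ignore"):
        return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

teacher = [0.5, 0.5, 0.0]  # two equally likely modes
student = [1.0, 0.0, 0.0]  # commits to a single teacher mode

reverse = kl(student, teacher)  # finite: log 2, committing to one mode is fine
forward = kl(teacher, student)  # infinite: teacher has mass where student has none

assert np.isclose(reverse, np.log(2))
assert np.isinf(forward)
```

Minimizing the reverse KL therefore never forces the student to cover every teacher mode; it only penalizes the student for putting mass where the teacher puts little.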

II. Information Density and Compute Efficiency Analysis

For a seasoned RL researcher, the primary advantage of OPD lies in the dramatic increase in information absorbed per gradient step compared to conventional policy gradient methods.

Information-Theoretic Comparison

The sources contrast the information absorption capacity of the two on-policy methods:

  • Policy Gradient RL: $O(1)$ bits per episode. Learning is driven by the scalar advantage function; the mutual information $I(G; R \mid \text{history})$ is bounded by the entropy of the advantage, $H(\text{Adv})$. This low density holds regardless of the number of tokens in the episode.
  • On-Policy Distillation (OPD): $O(N)$ bits per episode, where $N$ is the token count. Distillation provides dense, per-token supervision.

This severe difference in information density means RL spends most compute on search (rolling out policies and assigning credit to explore semantic strategies), whereas OPD serves as a shortcut by directly distilling the final strategy learned, saving the compute required for modeling intermediate strategies.

Quantitative Efficiency Gains

OPD translates this information density into massive empirical cost reductions compared to RL:

  • Gradient Steps: When compared starting from the same initialization, OPD reaches the teacher’s performance in approximately 7 to 10 times fewer gradient steps than policy gradient RL.
  • Total Compute: This reduction in steps, combined with other efficiencies (like working effectively with shorter context lengths), leads to an approximate cumulative compute reduction on the order of 50 to 100 times compared to RL.
  • Industry Benchmark: The Qwen team demonstrated that OPD achieved superior performance (74.4% on AIME’24) at one-tenth the cost of RL (1,800 GPU hours versus 17,920 GPU hours reported for RL).

III. Application to Continual Learning and Strategy Recovery

OPD is presented as a powerful tool for continual learning and personalization, specifically for recovering specialized behaviors lost during mid-training on new domain knowledge (catastrophic forgetting).

While SFT on a model’s own samples (acting as a forward KL regularizer) often fails, leading to performance degradation even when the expected KL divergence is zero, OPD succeeds because it leverages a fixed teacher policy.

In the case of personalization, OPD successfully restored instruction-following behavior—originally trained with expensive RL—after a subsequent fine-tuning phase on internal documents. The ability to use an earlier, highly-skilled version of the model as the fixed teacher policy to “re-invoke” lost capabilities makes OPD promising for phase-alternating continual learning.

This robustness stems from the fact that OPD always stays on-policy relative to the fixed teacher, leading to convergence on the teacher’s desirable behavior without the regression seen when SFT trains on noisy finite batches of its own samples.


In Summary: OPD effectively replaces the “search” component of conventional policy gradients with “distillation,” converting the sparse sequence-level reward into a dense, token-level reward derived from the teacher’s log probabilities. This drastically improves sample and compute efficiency while retaining the on-policy sampling of the student’s own trajectories needed to overcome exposure bias. OPD thus functions as a highly efficient path for distilling expert semantic strategies.