Yaxiang Zhang*, Yingru Li* †, Jiacai Liu, Ziniu Li, Jiawei Xu, Qian Liu
† Corresponding Author *Co-first Authors First published at Dec. 20.

Figure 1(a): Training-inference mismatch indicator under the baseline (green line, constant learning rate) and learning rate decay (yellow line). We train Qwen3-4B with an initial learning rate of 1e-6 and batch size = ppo_mini_batch_size = 64, i.e., completely on-policy. Oversampling and rejection sampling (for groups with all-0 or all-1 rewards) are applied. Our experiments suggest that the training-inference mismatch can be effectively suppressed near 3k steps just by shrinking the update size. This indicates that the mismatch is not static random noise stemming solely from numerical precision limits, but rather a dynamic issue in the training process.

Figure 1(b): Corresponding validation performance.

Figure 1(c): Pseudo-code for proposed lr scheduler.
TL;DR
The Problem: Reinforcement Learning (RL) training for LLMs is notoriously unstable. While recent studies attribute this to “training-inference mismatch” (caused by hybrid engines), standard fixes like Importance Sampling might fail during longer training runs.
The Insight: We analyze this instability through an optimization lens. We find that as training progresses, gradient noise and training-inference mismatch increase simultaneously. This suggests that the “mismatch” is not merely a static numerical issue, but a dynamic problem coupled with the model’s optimization trajectory.
The Solution: A specialized Learning Rate (LR) Scheduler.
Mechanism: By decaying the learning rate as gradient noise rises, we can consistently stabilize RL training and keep the training-inference mismatch at a safe level.
Heuristic: We propose a novel method to time this decay based on Response Length. The surge in response length serves as a reliable early indicator of impending instability, signaling exactly when to reduce the learning rate.
Citation
@online{
  title = {Beyond Precision: Why Training-Inference Mismatch is an Optimization Problem and How Simple LR Scheduling Fixes It},
  author = {Yaxiang Zhang, Yingru Li, Jiacai Liu, Ziniu Li, Jiawei Xu, Qian Liu},
  year = {2025},
  month = Dec,
  url = {https://richardli.xyz/mismatch-lr-schedule}
}
1. Noisy Gradient Causes Training Collapse
Reinforcement Learning (RL) has proven capable of incentivizing LLMs to perform better on reasoning and other complex tasks. Nevertheless, RL is also famous for its training instability. Some prior work suggests that this issue may stem from the use of hybrid engines in RL training, which introduces a mismatch between training and inference. To measure the degree of this mismatch, we first introduce the Log Perplexity (Log ppl) of a trajectory:
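A standard way to write this quantity, in notation assumed here (prompt $x$, response $y$ of length $|y|$, and policy $\pi$ as evaluated by a given engine), is the negative average token log-probability:

$$
\text{log-ppl}_{\pi}(y \mid x) = -\frac{1}{|y|} \sum_{t=1}^{|y|} \log \pi(y_t \mid x, y_{<t})
$$

The mismatch indicator can then be read as the mean absolute gap between the training and inference engines over $N$ sampled sequences. This reading matches the log_ppl_abs_diff metric named in Figure 5, but the exact form is an assumption:

$$
\text{log\_ppl\_abs\_diff} = \frac{1}{N} \sum_{i=1}^{N} \left| \text{log-ppl}_{\pi_{\text{train}}}\big(y^{(i)} \mid x^{(i)}\big) - \text{log-ppl}_{\pi_{\text{infer}}}\big(y^{(i)} \mid x^{(i)}\big) \right|
$$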
This measures the average perplexity discrepancy among different sequences. Taking the training of Qwen3-4B on the dapo_filter dataset as an example, we observe two key phenomena:
Validation performance degrades significantly after reaching a peak: Figure 1 verifies this pattern. There is an obvious drop in accuracy around 350-400 steps.
Training-inference mismatch grows: As Figure 2 shows, the sharp increase in the difference between the log perplexity of the training and inference engines around 350-400 steps matches the drop in performance.
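As a rough illustration of how such an indicator could be computed, the sketch below takes per-token log-probabilities of the same sampled responses from both engines. The function names and the exact aggregation (mean absolute difference over sequences) are assumptions for illustration, not the authors' implementation.

```python
from typing import Sequence


def log_ppl(token_logprobs: Sequence[float]) -> float:
    """Negative mean token log-probability of one response."""
    if not token_logprobs:
        raise ValueError("response must contain at least one token")
    return -sum(token_logprobs) / len(token_logprobs)


def log_ppl_abs_diff(
    train_logprobs: Sequence[Sequence[float]],
    infer_logprobs: Sequence[Sequence[float]],
) -> float:
    """Mean over sequences of |log_ppl(train engine) - log_ppl(inference engine)|.

    Both arguments hold per-token log-probs for the same sampled responses,
    one inner sequence per response (assumed layout).
    """
    if len(train_logprobs) != len(infer_logprobs):
        raise ValueError("both engines must score the same set of responses")
    if not train_logprobs:
        raise ValueError("need at least one response")
    diffs = [
        abs(log_ppl(t) - log_ppl(i))
        for t, i in zip(train_logprobs, infer_logprobs)
    ]
    return sum(diffs) / len(diffs)
```

A value near zero means the two engines assign nearly the same likelihood to the sampled responses; a rising value is the kind of growth shown in Figure 2(c).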

Figure 2 (a): validation accuracy on aime24 dataset.

Figure 2 (b): validation accuracy on aime25 dataset.

Figure 2(c): The figure shows the indicator for the degree of training-inference mismatch.
Noise includes bias and variance, which both grow with response length. For a more detailed discussion on why noise of gradient estimation is severe in RL, readers may refer to https://richardli.xyz/post/rl-collapse-part1/, https://richardli.xyz/post/rl-collapse-part2/, and The Optimal Token Baseline.

Figure 3: Time-smoothed gradient norm of Qwen3-4B Base vs. training step. The gradient norm is clipped at 1 to avoid the influence of extreme values in smoothing. Because the gradient norm can also increase due to longer responses, it is normalized by response length before smoothing. The increase in gradient norm matches the degradation of performance and the growth of training-inference mismatch.
Finding
If we assume the true signal should not increase during the training process (since data samples become less informative over time), then the increasing L2 norm can only come from increasing noise.
2. The Solution: Learning Rate Scheduler
Key Finding 1: Simply Reducing LR Helps Stabilize Training
A simple theoretical analysis shows that reducing the learning rate can drastically shrink the effect of gradient noise (including bias and variance, detailed in Appendix B). Inspired by this, we reduced our constant learning rate from the default 1e-6 to 1e-7.
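One standard way to see this is the textbook descent bound for a smooth objective. This is a sketch under assumed conditions (plain SGD on a loss $f$ with $L$-Lipschitz gradient), not necessarily the argument in Appendix B. Write the gradient estimate as signal plus bias plus zero-mean noise:

$$
g_t = \nabla f(\theta_t) + b_t + \xi_t, \qquad \mathbb{E}[\xi_t] = 0, \qquad \mathbb{E}\|\xi_t\|^2 = \sigma_t^2
$$

For the update $\theta_{t+1} = \theta_t - \eta\, g_t$, smoothness gives

$$
\mathbb{E}\big[f(\theta_{t+1})\big] \le f(\theta_t) - \eta \|\nabla f(\theta_t)\|^2 - \eta \langle \nabla f(\theta_t), b_t \rangle + \frac{L \eta^2}{2} \left( \|\nabla f(\theta_t) + b_t\|^2 + \sigma_t^2 \right)
$$

The variance penalty $\frac{L\eta^2}{2}\sigma_t^2$ scales with $\eta^2$, while the useful descent term $\eta\|\nabla f(\theta_t)\|^2$ scales with $\eta$. Under these assumptions, cutting $\eta$ by 10x shrinks the per-step variance penalty by 100x but the descent term by only 10x. The bias term is linear in $\eta$, so its per-step effect also shrinks with the learning rate.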

Figure 4(a): validation accuracy on aime24 dataset. The green line is lr=1e-6; the blue line is lr=1e-7.

Figure 4(b): validation accuracy on aime25 dataset.

Figure 5 (a): log_ppl_abs_diff of different learning rates (baseline:1e-6; 1e-7). Smaller learning rate significantly reduces training-inference mismatch.

Figure 5 (b): Gradient norm of different learning rates (baseline:1e-6; 1e-7).
This change had a significant impact:
Stability: As seen in Figure 4, lowering the LR significantly stabilizes training.
Mismatch Reduction: As illustrated in Figure 5, our approach (somewhat unexpectedly) alleviates the training-inference mismatch.
It’s worth noting that the blue line isn’t simply 10x slower than the green line.
This suggests that the mismatch is often a U-shaped trajectory—decreasing initially before escalating. It is not merely a static background noise caused by numerical precision, but a dynamic instability coupled with the optimization path.
Possible Explanation.
Insight:
The mismatch is not static background noise, but rather a dynamic instability coupled with the optimization path. This is also mentioned in Section 3.6 of “When Speed Kills Stability: Demystifying RL Collapse from the Training-Inference Mismatch”.
Key Finding 2: Schedule by Collapse Time, Not by Epoch
Of course, simply lowering the LR globally prevents the model from learning rapidly during the benign early stages. This motivates the use of an LR scheduler.
Why not a Traditional Scheduler?
A schedule with a predefined horizon (like standard cosine decay) cannot adapt the LR to the actual training situation. It might decay too fast in the benign stage or too slowly when the model starts to collapse. Furthermore, as shown in Figure 6 and Figure 7, our experiments (using fractions of the dataset) show that the “collapse step” is not proportional to the dataset size, making epochs an unreliable metric for timing the decay.

Figure 6(a): validation accuracy on aime24 dataset with only a quarter of training dataset. An obvious drop around 500 steps.

Figure 6(b): validation accuracy on aime25 dataset with only a quarter of training dataset. An obvious drop around 600 steps.

Figure 7(a): validation accuracy on aime24 dataset with only 2.5% of training dataset.

Figure 7(b): validation accuracy on aime25 dataset with only 2.5% of training dataset.
Our Approach
We customize a step-decay scheduler: set a hyper-parameter decay_period, and decay LR by half every decay_period until it reaches a minimal floor (e.g., 1/10th of the initial LR). We’ll discuss more about how to choose decay_period in Section 3.
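The rule described above can be sketched in a few lines. This is a minimal illustration, not the authors' code from Figure 1(c): the function name, the parameter names other than decay_period, and the choice to apply the first halving exactly at step decay_period are all assumptions.

```python
def step_decay_lr(
    step: int,
    initial_lr: float,
    decay_period: int,
    decay_factor: float = 0.5,   # "decay LR by half"
    min_lr_ratio: float = 0.1,   # floor at 1/10th of the initial LR (example value)
) -> float:
    """Learning rate at a given training step under step decay with a floor."""
    if decay_period <= 0:
        raise ValueError("decay_period must be positive")
    if step < 0:
        raise ValueError("step must be non-negative")
    # Number of completed decay periods so far (assumed: first decay at step == decay_period).
    num_decays = step // decay_period
    lr = initial_lr * (decay_factor ** num_decays)
    # Never go below the floor.
    return max(lr, initial_lr * min_lr_ratio)
```

With initial_lr=1e-6 and decay_period=204, this gives 1e-6 for steps 0 to 203, 5e-7 from step 204, 2.5e-7 from step 408, 1.25e-7 from step 612, and the 1e-7 floor from step 816 onward. In PyTorch, the same rule could be passed to torch.optim.lr_scheduler.LambdaLR as a multiplicative factor by calling it with initial_lr=1.0.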

Figure 8(a): validation accuracy on aime24 dataset. The yellow line (with LR decay) effectively prevents collapse while maintaining good peak performance.

Figure 8(b): validation accuracy on aime25 dataset.

Figure 8(c): The yellow line (with LR decay) keeps training-inference mismatch at a safe level.

Figure 8(d): The yellow line (with LR decay) significantly reduces spikes in gradient norm.
To further validate our method, we also conducted multi-turn RL training experiments with Qwen3-8B on the DAPO-Math dataset. In the following experiment, the initial learning rate is 1e-6; the batch size is 128 and equal to ppo_mini_batch_size, i.e., completely on-policy; decay_period is set to 50.

Figure 9(a): validation accuracy on aime24 dataset. The pink line (with LR decay) effectively prevents collapse while maintaining good peak performance.

Figure 9(b): validation accuracy on aime25 dataset.

Figure 9(c): The pink line (with LR decay) keeps training-inference mismatch at a safe level.

Figure 9(d): The pink line (with LR decay) significantly reduces spikes in gradient norm.
As shown in Figure 8 and Figure 9, experimental evidence suggests this scheduler works well across settings, consistently stabilizing training and keeping the mismatch at a safe level.
Key Finding 3: LR Scheduler Fixes Importance Sampling
Is an LR scheduler necessary if we already use techniques like Importance Sampling (IS)? We conducted ablation experiments on Masked Importance Sampling (MIS) and Truncated Importance Sampling (TIS).
MIS Results (Figure 10): Applying token-level MIS alone did not solve the mismatch or the collapse; it only slightly postponed the crash. Our customized LR scheduler effectively stabilized training regardless of whether MIS was used.
TIS Results (Figure 11): Token-level TIS extends the stable training window. However, even in this scenario, LR decay further helps to reduce the training-inference mismatch.
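For readers unfamiliar with the two corrections, the sketch below shows one common token-level formulation: the importance ratio between the training engine and the inference engine is either clipped from above (TIS) or zeroed out when it exceeds a threshold (MIS). The function names, the threshold value of 2.0, and the one-sided masking are assumptions for illustration; the exact variants used in these experiments may differ.

```python
import math
from typing import List, Sequence


def is_ratios(train_logprobs: Sequence[float], infer_logprobs: Sequence[float]) -> List[float]:
    """Per-token ratio pi_train(y_t) / pi_infer(y_t), computed from log-probabilities."""
    if len(train_logprobs) != len(infer_logprobs):
        raise ValueError("log-prob sequences must have the same length")
    return [math.exp(t - i) for t, i in zip(train_logprobs, infer_logprobs)]


def tis_weights(train_logprobs: Sequence[float], infer_logprobs: Sequence[float],
                clip_max: float = 2.0) -> List[float]:
    """Truncated IS: cap each token's ratio at clip_max (assumed threshold)."""
    return [min(r, clip_max) for r in is_ratios(train_logprobs, infer_logprobs)]


def mis_weights(train_logprobs: Sequence[float], infer_logprobs: Sequence[float],
                clip_max: float = 2.0) -> List[float]:
    """Masked IS: drop (weight 0) tokens whose ratio exceeds clip_max, keep the rest unchanged."""
    return [r if r <= clip_max else 0.0 for r in is_ratios(train_logprobs, infer_logprobs)]
```

Each weight would multiply the corresponding token's policy-gradient term. Both corrections act on the per-token ratio only, which is consistent with the observation above that they do not by themselves remove the growth in gradient noise.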

Figure 10(a): validation accuracy on aime24 dataset. The purple line applies MIS only; the beige line applies both the customized LR scheduler and MIS.

Figure 10(b): validation accuracy on aime25 dataset.

Figure 10(c): The indicator for training-inference mismatch. The purple line applies MIS only; the beige line applies both the customized LR scheduler and MIS.

Figure 10(d): The gradient norm. The purple line applies MIS only; the beige line applies both the customized LR scheduler and MIS.

Figure 11(a): validation accuracy on aime24 dataset. The green line applies TIS only; the beige line applies both the customized LR scheduler and TIS.

Figure 11(b): validation accuracy on aime25 dataset.

Figure 11(c): validation accuracy on simplelr_math_35 dataset. The TIS-only run shows an obvious drop in the final stage.

Figure 11(d): The indicator for training-inference mismatch. The green line applies TIS only; the beige line applies both the customized LR scheduler and TIS.

Figure 11(e): The gradient norm. The green line applies TIS only; the beige line applies both the customized LR scheduler and TIS.
3. Response Length Determines the Decay Schedule
To make our LR scheduler practical, we need a reliable, adaptive signal that indicates when optimization instability is approaching. Through our experiments, we identified Average Response Length as the optimal indicator.
The Signal: Response Length Surge
Unlike other training metrics, the average response length exhibits a distinct, reliable pattern: Figure 12 shows a sharp “surge” that consistently precedes validation performance degradation in single-turn RL.

Figure 12: The average response length of rollouts. There’s a clear surge around 100 steps.
We attribute this instability to the increased variance inherent in longer trajectories. As the sequence length grows, the variance of the policy gradient estimate increases. This effectively raises the noise term in our optimization dynamics, necessitating a lower learning rate to maintain stability.
Determining the Decay Period
To operationalize this, we tested different decay_period settings relative to the start of this surge (training Qwen3-4B on dapo_filter). The corresponding validation performance is shown in Figure 13.
Period = 125 (Too Early): Decaying immediately as the surge begins is too conservative; the model stabilizes but achieves lower peak performance.
Period = 204 (Optimal): Decaying shortly after the surge captures the best balance.
Period = 250 (Too Late): Decaying too late allows the accumulated gradient noise to cause irreversible collapse.

Figure 13 (a): validation accuracy on aime24 dataset.

Figure 13 (b): validation accuracy on aime25 dataset.
Recommendation:
We recommend setting decay_period proportional to the step at which the response-length surge begins. Empirically, a delay factor of about 1.8x that step provides the best balance between maximizing learning in the early phase and dampening variance before collapse.
Insight: The surge in response length serves as a proxy for rising gradient variance, signaling exactly when to start decaying the learning rate.
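The recommendation can be turned into a small helper. The sketch below is one possible way to do it: the surge detector (comparing the mean response length of the latest window against the window before it) and its window and rel_increase parameters are hypothetical choices, since the post does not specify how the surge is located. Only the 1.8x factor comes from the text.

```python
from typing import Optional, Sequence


def detect_surge_step(
    avg_response_lengths: Sequence[float],
    window: int = 10,            # hypothetical smoothing window, in training steps
    rel_increase: float = 0.2,   # hypothetical threshold: 20% jump between windows
) -> Optional[int]:
    """Return the first step index at which a response-length surge is detected.

    A surge is flagged when the mean length over the latest `window` steps
    exceeds the mean over the preceding `window` steps by `rel_increase`.
    Returns None if no surge is found.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    for end in range(2 * window, len(avg_response_lengths) + 1):
        prev = avg_response_lengths[end - 2 * window:end - window]
        curr = avg_response_lengths[end - window:end]
        prev_mean = sum(prev) / window
        curr_mean = sum(curr) / window
        if curr_mean > prev_mean * (1.0 + rel_increase):
            return end - 1  # index of the most recent step in the current window
    return None


def suggest_decay_period(surge_step: int, delay_factor: float = 1.8) -> int:
    """decay_period = delay_factor x surge step, rounded to a whole number of steps."""
    return max(1, round(delay_factor * surge_step))
```

For example, a surge detected at step 100 would give a suggested decay_period of 180. The detector thresholds will likely need tuning per task, so the logged response-length curve should be checked by eye before trusting the automatic value.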
Conclusion
Training stability remains the “elephant in the room” for LLM Reinforcement Learning. In this post, we demonstrated that the notorious training-inference mismatch is not just an engineering artifact of hybrid engines, but a symptom of deteriorating optimization dynamics characterized by high gradient noise.
Our solution—a customized Step-Decay Learning Rate Scheduler—offers a simple yet effective remedy. By pivoting away from epoch-based schedules and instead using response length surges as a signal for noise accumulation, we can preemptively stabilize training. This approach not only outperforms standalone Importance Sampling but also suggests a broader lesson for Post-Training: as models reason deeper and response lengths grow, our optimization strategies must become more conservative to prevent collapse.