Reinforcement learning researchers often wake up to a nightmare where their training curves have suddenly diverged. One day the model is converging beautifully, and the next, the log probabilities are spiking or drifting in a direction that defies the reward function. For many in the community, this frustration recently peaked during the transition of vLLM, the industry-standard engine for high-throughput LLM inference, from its V0 architecture to the revamped V1. What initially looked like a failure of the learning algorithm was actually a fundamental shift in how the underlying engine handled numerical precision and scheduling.

The V1 Migration and the Numerical Gap

The technical friction began when research teams attempted to migrate their reinforcement learning environments from vLLM version 0.8.5 (V0) to version 0.18.1 (V1). The goal was to leverage the performance gains of the new architecture while maintaining the stability of their Group Sequence Policy Optimization (GSPO) pipelines. However, the results were immediate and alarming. When the training trajectories were plotted, the initial V1 runs produced a red curve that deviated sharply from the established green V0 reference line. The divergence was most prominent in the trainer's log probabilities (the log of the probability the model assigns to each sampled token) and the corresponding reward values, both of which drifted away from the V0 baseline almost immediately after training commenced.

To bridge this gap, the team had to dissect the V1 engine to identify where the numerical drift was originating. They discovered that the issue was not a single bug but a combination of four distinct architectural shifts. First, the handling of log probabilities had changed. V1 returns log probabilities computed from the raw model output by default, while PipelineRL, the orchestration tool connecting data generation to model updates, expected log probabilities that had already passed through the sampler's transformations, such as temperature scaling and top-p filtering. This required a specific configuration change to ensure the engine provided the expected processed values.

```python
# Configure the engine to return the sampler-processed log probabilities
# that PipelineRL expects, instead of V1's default raw values
processed_logprobs = True
```
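
For intuition, the gap between the two conventions can be reproduced outside the engine. The sketch below is purely illustrative: the function names, the temperature of 0.7, and the top-p of 0.9 are assumptions for the example, not settings from the original pipeline. It contrasts log probabilities computed from the raw logits with log probabilities computed after sampler-style temperature scaling and top-p filtering.

```python
import torch

def raw_logprobs(logits: torch.Tensor) -> torch.Tensor:
    # Log probabilities taken directly from the model's raw logits (V1's default convention)
    return torch.log_softmax(logits, dim=-1)

def processed_logprobs(logits: torch.Tensor, temperature: float = 0.7, top_p: float = 0.9) -> torch.Tensor:
    # Log probabilities after sampler-style processing: temperature scaling followed by
    # top-p (nucleus) filtering, the convention PipelineRL expects
    scaled = logits / temperature
    probs = torch.softmax(scaled, dim=-1)
    sorted_probs, sorted_idx = probs.sort(dim=-1, descending=True)
    cumulative = sorted_probs.cumsum(dim=-1)
    keep_sorted = cumulative - sorted_probs < top_p   # keep tokens until the nucleus is filled
    keep = torch.zeros_like(probs, dtype=torch.bool).scatter(-1, sorted_idx, keep_sorted)
    filtered = scaled.masked_fill(~keep, float("-inf"))
    return torch.log_softmax(filtered, dim=-1)

# Even for the single most likely token, the two conventions disagree
logits = torch.randn(1, 32000)            # one position over a 32k-token vocabulary
top_token = logits.argmax(dim=-1)
print(raw_logprobs(logits)[0, top_token], processed_logprobs(logits)[0, top_token])
```

If the trainer scores sequences under one convention while the rollout engine reports the other, the two sides are measuring different quantities before training even starts.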

Second, the team had to contend with V1's new runtime defaults. To maximize speed, V1 enables prefix caching and asynchronous scheduling by default. Prefix caching lets the engine reuse the key-value cache computed for shared prompt prefixes, but in the sensitive context of reinforcement learning, where the weights behind that cache keep changing, it can introduce subtle inconsistencies. To match the V0 environment, the team disabled caching and synchronized the weight update timing. Third, they performed a precision audit of the weight update paths to ensure that gradients were being applied identically to the previous version. Finally, they addressed the projection layer: by forcing the lm_head (the final output layer that maps hidden states to token logits) to use fp32 (32-bit floating point) precision, they eliminated the rounding errors that were plaguing the output.
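
A configuration sketch along those lines is shown below, assuming a vLLM rollout engine and a Hugging Face training copy of the model. The `enable_prefix_caching` argument is a standard vLLM engine option; the checkpoint name and the `FP32LMHead` wrapper are placeholders for illustration, not the team's actual code.

```python
import torch
from transformers import AutoModelForCausalLM
from vllm import LLM

# Rollout engine: turn off prefix caching so no key-value cache computed under
# older weights is reused after an update (V1 enables it by default for speed).
rollout_engine = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",   # placeholder checkpoint for the example
    dtype="bfloat16",
    enable_prefix_caching=False,
)

# Trainer side: keep the transformer body in bf16 for throughput, but run the
# final projection in fp32 so token log probabilities avoid rounding error.
class FP32LMHead(torch.nn.Module):
    def __init__(self, lm_head: torch.nn.Linear):
        super().__init__()
        self.inner = lm_head.to(torch.float32)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return self.inner(hidden_states.to(torch.float32))

policy = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct", torch_dtype=torch.bfloat16
)
policy.lm_head = FP32LMHead(policy.lm_head)
```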

The Hidden Cost of Inference Imprecision

For years, the prevailing wisdom in reinforcement learning was that performance gains came from tuning the reward function or refining the mathematical formulation of the optimization algorithm. This transition to vLLM V1 reveals a different reality: the inference engine is not a passive pipe but a critical component of the mathematical chain. When a model's final output layer operates at lower precision, it introduces microscopic errors in token probability calculations. In standard chat applications, these errors are invisible. In reinforcement learning, however, they are catastrophic.
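
As a toy illustration of the scale involved (the hidden size, vocabulary size, and weight scale below are arbitrary assumptions, not measurements from the actual model), the same final projection can be run once in fp32 and once in bf16 and the resulting log probabilities compared:

```python
import torch

torch.manual_seed(0)
hidden = torch.randn(1, 4096)               # final hidden state for one token position
lm_head = torch.randn(32000, 4096) * 0.02   # projection weights over a 32k-token vocabulary

# Identical computation, once in fp32 and once with the projection done in bf16
logp_fp32 = torch.log_softmax(hidden @ lm_head.T, dim=-1)
logp_bf16 = torch.log_softmax((hidden.bfloat16() @ lm_head.bfloat16().T).float(), dim=-1)

# The worst-case per-token error looks harmless on its own, but it is exactly this
# kind of discrepancy that compounds across thousands of sampled tokens.
print((logp_fp32 - logp_bf16).abs().max())
```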

These tiny discrepancies accumulate within the policy ratio, which compares the probability the current policy assigns to a token against the probability assigned by the previous policy, and within the KL divergence, which tracks how far the probability distribution has shifted. As noted in the MiniMax-M1 technical report, failing to use fp32 precision in the final projection layer creates a ripple effect that distorts the entire learning process. By forcing the V1 engine to adhere to the fp32 path, the research team found that the reward graphs aligned perfectly with the V0 reference values, confirming that the perceived algorithmic failure was actually an engine precision issue.
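
A back-of-the-envelope calculation shows how quickly this compounds; the per-token error and sequence length below are invented for illustration:

```python
import math

per_token_error = 2e-3   # assumed per-token log-probability gap from low-precision rounding
num_tokens = 200         # length of one sampled response

# With identical weights, the sequence-level importance ratio should be exactly 1.0.
# The precision gap alone inflates it to exp(sum of per-token errors).
ratio = math.exp(num_tokens * per_token_error)
print(f"policy ratio from precision error alone: {ratio:.3f}")   # ~1.492
```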

This shift changes the fundamental order of operations for AI developers. The instinct is often to add compensation terms or importance-sampling corrections to the algorithm to fix a drifting model. But doing so is a mistake; it patches at the algorithm level an error that originates in the engine's numerics, effectively blaming the algorithm for the engine's instability. The realization here is that numerical consistency in the inference engine must be established before a single hyperparameter is tuned. Attempting to optimize a reinforcement learning agent on an inconsistent engine is equivalent to pouring water into a leaking bucket.

The path to stable reinforcement learning begins not with the reward function, but with the numerical integrity of the engine that serves it.