The Friction of AI Agent Reinforcement Learning
For developers attempting to apply reinforcement learning (RL) to AI agents, the process often hits a wall during the integration phase. Traditional RL frameworks require agents to be refactored into rigid API structures, such as the OpenAI Gym standard, necessitating the manual rewriting of `env.init()`, `env.step()`, and `env.reset()` functions. This process is not only labor-intensive but often leads to the loss of critical execution context and tool-use logic inherent to the original agent harness. Developers are frequently forced to choose between maintaining their custom agent architecture or gaining the benefits of RL, creating a significant bottleneck in the model improvement cycle.
Implementing Polar at the API Boundary
NVIDIA has introduced Polar, a rollout framework designed to bypass the need for harness modification entirely. Instead of forcing agents to conform to a specific framework, Polar places a gateway proxy at the model API boundary. This proxy intercepts streaming requests, converts them into non-streaming responses to capture the full token output, and then re-emits them as synthetic streams. By simply updating the model base URL in the agent's configuration to point to the Polar gateway, developers can integrate existing tools—such as Codex CLI or Claude Code—directly into an RL training pipeline.
Polar supports a wide array of harnesses, including `codex`, `claude_code`, `gemini_cli`, `qwen_code`, `opencode`, and `pi`. The system manages the entire lifecycle of an agent session—from runtime initialization and trajectory construction to evaluation and termination—using isolated worker pools powered by Docker or rootless Apptainer. By offloading CPU-intensive tasks like runtime preparation and evaluator pre-warming outside the GPU execution path, Polar minimizes latency and ensures that even if a harness times out, partial trajectories are captured and preserved for training, significantly reducing data loss.
Prefix Merging and Trajectory Reconstruction
To address the fragmentation of multi-turn agent sessions, Polar employs a strategy called prefix merging. Because agent sessions accumulate conversation history sequentially, Polar identifies the token prefix relationships between adjacent completed responses to reconstruct them into a coherent chain. This allows the system to distinguish between sub-agents and context-compression boundaries. Within these merged trajectories, only the sampled assistant tokens are marked for learning, while auxiliary tokens are masked to prevent the model from assigning weight to irrelevant information. This approach ensures that the model learns from the actual action protocols and patch-submission methods used during evaluation, rather than an abstracted approximation.
Performance Gains and Large-Scale Data Generation
In practical testing, the framework has demonstrated clear performance benefits. When training a Qwen3.5-4B model using Group Relative Policy Optimization (GRPO) on the SkyRL-v0-293-data SWE-Gym dataset, the model showed significant improvements in environments with unfamiliar action protocols. Even in environments where the model was already well-aligned, such as the Qwen Code harness, Polar facilitated an additional 0.6-point performance gain by directly connecting reward signals to the actual sampling tokens used by the agent.
Beyond real-time training, Polar serves as a robust tool for large-scale offline data generation. Researchers utilized a Qwen3.5-122B-A10B model across eight H100 GPU servers to process 1,638 instances from the SWE-Bench environment. The process required 64 GPU-hours and yielded a high-quality SFT (Supervised Fine-Tuning) dataset, with sessions averaging 104 messages and 51 assistant turns. By automating the feedback loop between agent execution and data collection, Polar provides a scalable path for refining complex, multi-turn agent capabilities without the engineering overhead of traditional RL integration.
By decoupling the training infrastructure from the agent's internal logic, Polar allows developers to focus on model performance rather than environment compatibility.




