Qwen3.7-Max: The 35-Hour Autonomous Sprint to the Agent Frontier

Imagine a senior systems engineer staring at a blank terminal, tasked with optimizing a kernel for a piece of hardware they have never seen before. There are no manuals, no example code, and no pre-existing profiles. The process usually involves days of grueling trial and error, manual profiling, and sleepless nights spent debugging memory leaks and race conditions. This is the traditional bottleneck of high-performance computing. But this week, that entire workflow was compressed into a 35-hour window of total autonomy. An AI did not just suggest the code; it lived in the terminal, failed, corrected itself, and eventually delivered a 10x performance boost without a single human prompt. This is the operational reality of Qwen3.7-Max.

The Benchmark Shift from Answerer to Executor

Alibaba Cloud has positioned Qwen3.7-Max not as a chatbot, but as an agent-centric engine designed for long-term reasoning persistence. While previous generations of large language models focused on the accuracy of a single response, Qwen3.7-Max is built to maintain a loop of tool calls, evaluations, and self-corrections until a complex goal is achieved. The data suggests this shift is already yielding results that challenge the current industry leaders. In the Terminal Bench 2.0-Terminus, which measures an agent's ability to operate within a real terminal environment, Qwen3.7-Max scored 69.7, surpassing the 67.9 recorded by DS-V4-Pro Max.

In the realm of software engineering, the model demonstrates a level of proficiency that places it in the top tier of available tools. On the SWE-Verified benchmark, it achieved 80.4, putting it on par with the 80.8 of Opus-4.6 Max and the 80.6 of DS-V4-Pro Max. Its practical utility is further evidenced by a score of 60.6 on SWE-Pro and 53.5 on SciCode, a benchmark specifically designed for scientific code generation. These numbers indicate that the model is moving beyond simple code completion and into the territory of autonomous software maintenance.

Reasoning capabilities have seen an even more aggressive climb. Qwen3.7-Max recorded 92.4 on the GPQA Diamond benchmark, which tests PhD-level scientific reasoning, beating the 91.3 of Opus-4.6. It also outperformed Opus-4.6 on the Hard Logic Evaluation (HLE) with a score of 41.4 against 40.0, and on the HMMT 2026 Feb mathematical reasoning benchmark with 97.1 against 96.2. This suggests a fundamental leap in the model's ability to design complex logical structures rather than relying on pattern matching from its training data.

Global accessibility and precision are also core to this release. The model scored 79.1 on IFBench for precise instruction following, beating the 77.0 of DS-V4-Pro. Its translation capabilities were validated by a score of 85.8 on WMT24++, while its accuracy across 23 different English and multilingual prompt settings reached 89.2 on MAXIFE. To bring these capabilities to production, Alibaba Cloud is offering the model via the Alibaba Cloud Model Studio. Developers can access these frontier-grade reasoning and coding capabilities through the compatible mode API base URL at https://dashscope-intl.aliyuncs.com/compatible-mode/v1.

Cross-Harness RL and the End of Benchmark Overfitting

The secret to these gains is not simply a larger parameter count, but a fundamental change in how the model is trained to generalize. Most modern LLMs suffer from a form of academic cheating where they memorize the paths to correct answers in popular benchmarks. To combat this, the Qwen team implemented a Cross-Harness RL (Reinforcement Learning) architecture. They decoupled the learning instance into three orthogonal components: the Task, the Harness (the execution environment), and the Verifier. By randomly combining different harnesses and verifiers during training, the model is forced to learn universal problem-solving strategies rather than environment-specific shortcuts.

This architectural choice was put to a brutal test using the T-Head ZW-M890 PPU (Processing Unit). The model was dropped into a completely alien environment with no hardware documentation, no prior profiling data, and no example kernels. The only inputs were the task description, an existing SGLang implementation, and an evaluation script. Over 35 hours of autonomous execution, Qwen3.7-Max performed 1,158 tool calls and 432 kernel evaluations. It diagnosed its own compilation failures, fixed consistency bugs, and identified bottlenecks through runtime profiling.

The resulting optimization trajectory was a masterclass in systems engineering. The model first implemented Split-KV parallelization, dividing the prefix KV-cache across multiple thread blocks and introducing a reduction kernel based on online softmax rescaling. It then optimized memory management by replacing frequent `cudaMalloc` and `cudaFree` calls with pre-allocated `torch::empty` tensors and applied 2x loop unrolling to strip away overhead. To maximize the SM wave occupancy of the 36-SM architecture, it transitioned from a fixed split divisor to a workload-based heuristic. Finally, it developed a specialized MTP (Multi-Token Prediction) kernel that processes four query tokens per block while sharing K/V loads. The result was a 10.0x speedup based on the Triton geometric mean, dwarfing the 7.3x of GLM 5.1 and the 3.3x of DeepSeek V4 Pro.

From Virtual Kernels to Autonomous Enterprise Operations

The implications of this autonomy extend far beyond the compiler. In the YC-Bench, a simulation of a startup's first year of operation, Qwen3.7-Max generated 2.08 million dollars in total revenue, nearly doubling the 1.05 million dollars achieved by Qwen3.6-Plus. The model did not just write marketing copy; it managed personnel, reviewed legal contracts, and identified malicious customers. By completing 237 distinct tasks while maintaining profit margins despite rising simulated labor costs, the model proved it could handle the messy, non-linear decision-making required to run a business.

This agency is now bleeding into the physical world. Through the Qwen-RobotClaw robotics harness and the Qwen-RobotNav navigation model, the AI is now controlling quadruped robots. By integrating first-person visual trajectory control with long-term memory and planning, the model can manage 20-minute interaction flows. This suggests a future where the same reasoning engine that optimizes a PPU kernel can also optimize a warehouse logistics flow or a manufacturing line in real-time.

To accelerate adoption, Alibaba Cloud has released a suite of integration tools. The Qwen Code tool can be installed globally via the following command:

bash

npm install -g @qwen-code/qwen-code@latest

For teams already embedded in other ecosystems, the model is designed for low-friction migration. To integrate with Claude Code, users can set the environment variable `ANTHROPIC_BASE_URL=https://dashscope-intl.aliyuncs.com/apps/anthropic`. For those using OpenClaw, the model can be specified in the `~/.openclaw/openclaw.json` file as `modelstudio/qwen3.7-max` with the `reasoning true` flag enabled to unlock its full cognitive depth. A critical feature for long-term projects is the `preserve_thinking` function, which ensures that the reasoning chain from previous turns is maintained, preventing the context loss that typically plagues agents during thousand-step tasks.

By combining high-level reasoning with the ability to manipulate low-level hardware and business logic, Qwen3.7-Max is moving the industry toward a new paradigm. We are no longer looking at a tool that helps humans work faster, but at an autonomous operating system capable of managing the entire lifecycle of a technical project.

Qwen3.7-Max: The 35-Hour Autonomous Sprint to the Agent Frontier

The Benchmark Shift from Answerer to Executor

Cross-Harness RL and the End of Benchmark Overfitting

From Virtual Kernels to Autonomous Enterprise Operations

Related Articles