This week, the developer community is buzzing about a model that doesn't need a human holding its hand. OpenAI dropped GPT-5.5 simultaneously to ChatGPT and Codex for Plus, Pro, Business, and Enterprise subscribers, and the early verdict is clear: this is the first model that can take a goal and run with it. On Terminal-Bench 2.0, a benchmark that measures planning, iteration, and tool coordination in terminal environments, GPT-5.5 scored 82.7% — a lead of more than 13 percentage points over Claude Opus 4.7 (69.4%) and Gemini 3.1 Pro (68.5%). ML engineers and data scientists are calling it a genuine leap forward.
GPT-5.5 Leads Most Benchmarks, Topping Terminal-Bench at 82.7%
OpenAI published a full set of benchmark results that paint a picture of broad capability. On Terminal-Bench 2.0, which evaluates a model's ability to plan, iterate, and coordinate tools in a terminal environment, GPT-5.5 scored 82.7%. That's 13.3 points ahead of Claude Opus 4.7 (69.4%) and 14.2 points ahead of Gemini 3.1 Pro (68.5%). On SWE-Bench Pro, which tests real-world GitHub issue resolution, GPT-5.5 scored 58.6%, trailing Claude Opus 4.7's 64.3%. OpenAI noted that Anthropic reported signs of memorization on some SWE-Bench problems, suggesting the gap may be narrower than it appears. On an internal Expert-SWE benchmark measuring long-duration coding tasks with a median completion time of 20 hours, GPT-5.5 outperformed GPT-5.4. Additional scores include OSWorld-Verified (real computer environment manipulation) at 78.7%, GDPval (knowledge tasks across 44 occupations) at 84.9%, and a Pro variant of BrowseComp (web information tracking) at 90.1%.
Pricing for the standard GPT-5.5 API is $5 per million input tokens and $30 per million output tokens — double the rates of GPT-5.4. The Pro variant costs $30 per million input tokens and $180 per million output tokens. These are production prices, not research preview rates, and the model is immediately available to a large user base.
The Old Way: Step-by-Step Instructions. The New Way: Just the Goal
Previous language models responded to single prompts. Complex tasks required a human to interrupt, re-prompt, or correct direction mid-stream. GPT-5.5 is designed as an agent model: it uses tools — web search, code writing, script execution, software manipulation — on its own, verifies its own work, and does not stop until the goal is achieved. OpenAI describes the difference as "the assistant who needs a checklist versus the assistant who understands the goal and figures out the steps."
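The control flow OpenAI describes can be sketched as a plan-act-verify loop. The sketch below is illustrative only: the function names, message format, and toy tools are assumptions for exposition, not OpenAI's actual agent API.

```python
def run_agent(goal, model, tools, max_steps=20):
    """Drive a tool-using model toward `goal` without human re-prompting."""
    history = [{"role": "user", "content": goal}]
    for _ in range(max_steps):
        action = model(history, tools)  # model plans its own next step
        if action["type"] == "finish":  # model judged the goal achieved
            return action["result"]
        # Execute the chosen tool (search, script execution, etc.)
        output = tools[action["tool"]](**action["args"])
        # Feed the result back so the model can verify its own work
        history.append({"role": "tool", "content": output})
    return None  # step budget exhausted

# Toy stand-ins to show the control flow: a "model" that calls one
# tool, inspects the result, then declares the goal met.
def toy_model(history, tools):
    if history[-1]["role"] == "user":
        return {"type": "tool", "tool": "run_script", "args": {"code": "2 + 2"}}
    return {"type": "finish", "result": history[-1]["content"]}

toy_tools = {"run_script": lambda code: eval(code)}  # stand-in for script execution

print(run_agent("compute 2 + 2", toy_model, toy_tools))  # prints 4
```

The key difference from a single-prompt model is that the loop, not the human, decides when to stop: the model keeps acting and checking its own output until it emits a finish action.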
Early testers report that GPT-5.5 demonstrates a deeper understanding of software systems. One developer noted, "It understands the 'shape' of a software system better — why it failed, where to fix it, and what impact the fix will have on other parts of the codebase." Codex usage has surged to approximately 4 million developers per week.
The Real Cost Story: Token Efficiency Changes the Math
OpenAI states that GPT-5.5 maintains the same latency as GPT-5.4 while using significantly fewer tokens to complete the same Codex task. The per-token price has doubled, but the total tokens consumed per task have dropped. For example, a task that consumed 1 million input tokens with GPT-5.4 at $2.50 per million cost $2.50; if GPT-5.5 finishes the same task in 400,000 tokens at $5 per million, the cost falls to $2.00, a 20% reduction. For large-scale Codex operations, this efficiency gain is critical. The analysis is clear: look at the actual spend per completed task, not the price list.
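The arithmetic above can be checked directly. The per-million rates are the published input-token prices; the 1M-to-400K token drop is the article's example figure, not a measured number.

```python
def task_cost(tokens, price_per_million):
    """Cost in dollars for a task consuming `tokens` at a per-million rate."""
    return tokens / 1_000_000 * price_per_million

old = task_cost(1_000_000, 2.50)  # GPT-5.4: $2.50 per million input tokens
new = task_cost(400_000, 5.00)    # GPT-5.5: $5.00 per million, fewer tokens
savings = 1 - new / old           # fractional reduction in spend per task

print(old, new, savings)  # prints 2.5 2.0 0.2
```

The price per token doubled, yet the completed task costs 20% less, which is why comparing price lists alone is misleading for agent workloads.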
GPT-5.5 has redrawn the baseline for agent models. The competition is no longer about who is smarter. It is about who can finish more work with less intervention.