GPT-5.5 shows up the moment your team tries to automate a messy, multi-step workflow, because it doesn’t just answer questions anymore—it plans, uses tools, checks itself, and keeps going across apps until the task is done.

Section 1

On April 23, 2026, OpenAI announced the release of GPT‑5.5, positioning it as “our smartest and most intuitive to use model yet” and the next step toward a new way of getting work done on a computer. The company frames GPT‑5.5 as a model that understands what a user is trying to do faster, then carries more of the work itself. In OpenAI’s description, GPT‑5.5 excels at writing and debugging code, researching online, analyzing data, creating documents and spreadsheets, operating software, and moving across tools until a task is finished.

The key product claim is that GPT‑5.5 can handle messy, multi-part instructions without requiring careful step-by-step management. Instead of users micromanaging every action, OpenAI says they can give GPT‑5.5 a “messy, multi-part task” and trust it to plan, use tools, check its work, navigate ambiguity, and continue until completion. OpenAI also says the strongest gains show up in “agentic coding, computer use, knowledge work, and early scientific research,” where progress depends on reasoning across context and taking action over time.
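The plan–act–check loop OpenAI describes can be sketched in a few lines. Everything below is a hypothetical illustration of that loop pattern (the `Step`, `run_agent`, and `verify` names are invented, not an OpenAI API):

```python
from dataclasses import dataclass

@dataclass
class Step:
    tool: str
    args: dict

def run_agent(steps, tools, verify, max_retries=2):
    """Toy plan-act-check loop: execute each planned step with its tool,
    re-run a step when verification fails, give up after max_retries."""
    results = []
    for step in steps:
        for _attempt in range(max_retries + 1):
            out = tools[step.tool](**step.args)  # act: invoke the chosen tool
            if verify(step, out):                # check the work before moving on
                results.append(out)
                break
        else:
            raise RuntimeError(f"step {step.tool!r} failed verification")
    return results

# Toy usage: two "tools" and a trivial verifier.
tools = {"add": lambda a, b: a + b, "upper": lambda s: s.upper()}
plan = [Step("add", {"a": 2, "b": 3}), Step("upper", {"s": "done"})]
print(run_agent(plan, tools, verify=lambda step, out: out is not None))
# prints [5, 'DONE']
```

The point of the sketch is the control flow, not the stubs: the model's claimed value is in the planning, verification, and retry decisions that this loop only fakes.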

OpenAI adds that GPT‑5.5 delivers this jump in intelligence without sacrificing speed. The company says larger, more capable models are often slower to serve, but GPT‑5.5 “matches GPT‑5.4 per-token latency in real-world serving,” while performing at a “much higher level of intelligence.” It also claims GPT‑5.5 uses significantly fewer tokens to complete the same Codex tasks, making it both more efficient and more capable.

Safety and rollout details come next. OpenAI says it is releasing GPT‑5.5 with “our strongest set of safeguards to date,” designed to reduce misuse while preserving access for beneficial work. The company says it evaluated GPT‑5.5 across its full suite of safety and preparedness frameworks, worked with internal and external red teamers, added targeted testing for advanced cybersecurity and biology capabilities, and collected feedback on real use cases from nearly 200 trusted early-access partners before release.

In terms of availability, OpenAI says GPT‑5.5 is rolling out today to Plus, Pro, Business, and Enterprise users in ChatGPT and Codex. It also says GPT‑5.5 Pro is rolling out to Pro, Business, and Enterprise users in ChatGPT. The company notes that API deployments require different safeguards and that it is working with partners and customers on safety and security requirements for serving GPT‑5.5 at scale. OpenAI then states: “We’ll bring GPT‑5.5 and GPT‑5.5 Pro to the API very soon.”

The announcement includes a benchmark table comparing GPT‑5.5, GPT‑5.4, GPT‑5.5 Pro, GPT‑5.4 Pro, and several other systems. The table lists results for Terminal-Bench 2.0, Expert-SWE (Internal), GDPval (wins or ties), OSWorld-Verified, Toolathlon, BrowseComp, FrontierMath Tier 1–3, FrontierMath Tier 4, and CyberGym. The numbers shown are:

| Benchmark | GPT‑5.5 | GPT‑5.4 | GPT‑5.5 Pro | GPT‑5.4 Pro | Claude Opus 4.7 | Gemini 3.1 Pro |
| --- | --- | --- | --- | --- | --- | --- |
| Terminal-Bench 2.0 | 82.7% | 75.1% | - | - | 69.4% | 68.5% |
| Expert-SWE (Internal) | 73.1% | 68.5% | - | - | - | - |
| GDPval (wins or ties) | 84.9% | 83.0% | 82.3% | 82.0% | 80.3% | 67.3% |
| OSWorld-Verified | 78.7% | 75.0% | - | - | 78.0% | - |
| Toolathlon | 55.6% | 54.6% | - | - | - | 48.8% |
| BrowseComp | 84.4% | 82.7% | 90.1% | 89.3% | 79.3% | 85.9% |
| FrontierMath Tier 1–3 | 51.7% | 47.6% | 52.4% | 50.0% | 43.8% | 36.9% |
| FrontierMath Tier 4 | 35.4% | 27.1% | 39.6% | 38.0% | 22.9% | 16.7% |
| CyberGym | 81.8% | 79.0% | - | - | 73.1% | - |

OpenAI then shifts from rollout to capability framing. It says it is building “the global infrastructure for agentic AI,” and that over the past year AI has accelerated software engineering. With GPT‑5.5 in Codex and ChatGPT, OpenAI says that transformation is extending into scientific research and broader work people do on computers.

The company claims GPT‑5.5 is not just more intelligent but more efficient in how it works through problems. It says GPT‑5.5 often reaches higher-quality outputs with fewer tokens and fewer retries.

OpenAI also cites a specific claim about cost and performance: “On Artificial Analysis’s Coding Index, GPT‑5.5 delivers state-of-the-art intelligence at half the cost of competitive frontier coding models.” It describes the Artificial Analysis Intelligence Index as a weighted average of 10 evals run by an external party: AA-LCR, AA-Omniscience, CritPt, GDPval-AA, GPQA Diamond, Humanity’s Last Exam, IFBench, SciCode, Terminal-Bench Hard, τ²-Bench Telecom.
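A composite index like this is just a weighted average over per-eval scores. The sketch below shows the arithmetic; the eval names, scores, and weights are invented for illustration, not Artificial Analysis data:

```python
# Illustrative only: how a weighted average over eval scores works.
# The scores and weights below are made up, not Artificial Analysis numbers.

def weighted_index(scores, weights):
    """Weighted average of benchmark scores; weights need not sum to 1."""
    total_weight = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total_weight

scores = {"eval_a": 80.0, "eval_b": 60.0, "eval_c": 90.0}
weights = {"eval_a": 2.0, "eval_b": 1.0, "eval_c": 1.0}
print(weighted_index(scores, weights))  # prints 77.5
```

The practical consequence of weighting is that a model's index position depends on which evals the index maker emphasizes, which is why OpenAI notes the evals are run by an external party.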

Finally, OpenAI calls out agentic coding as GPT‑5.5’s strongest area, saying it is the company’s strongest agentic coding model to date. On Terminal-Bench 2.0, which tests complex command-line workflows requiring planning, iteration, and tool coordination, OpenAI says GPT‑5.5 achieves state-of-the-art accuracy of 82.7%. On SWE-Bench Pro, which evaluates real-world GitHub issue resolution, OpenAI says it reaches 58.6%, solving more tasks end-to-end.

One-sentence conclusion: OpenAI’s GPT‑5.5 launch ties a new agentic workflow style to concrete rollout plans and a benchmark set that emphasizes tool coordination and end-to-end completion.

Section 2

So what is actually different about GPT‑5.5, beyond the usual “smarter model” phrasing? The twist is that OpenAI is selling a shift in how work gets executed, not just how well the model answers.

First, the company explicitly changes the user contract. Earlier generations often rewarded careful prompting and step-by-step guidance; GPT‑5.5 is presented as tolerant of messy, multi-part instructions where the model must decide what to do next, use tools, check its work, and keep moving through ambiguity. That is a different kind of capability: it’s less about producing a single correct response and more about sustaining a task across time and interfaces.

Second, OpenAI tries to remove a common tradeoff that teams feel in production. The announcement acknowledges a pattern: bigger models are often slower to serve. GPT‑5.5 is positioned as an exception to that pattern by claiming it matches GPT‑5.4 per-token latency in real-world serving while delivering higher intelligence. That matters because agentic systems often require multiple internal steps; if latency balloons, the whole “let the model do the work” approach becomes impractical.
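The latency compounding argument is easy to make concrete with back-of-envelope arithmetic. All numbers below are invented for illustration; the only point is that per-token latency gets multiplied by every internal step an agent takes:

```python
# Invented numbers: why per-token latency parity matters once an agent
# chains many internal steps. Serial wall time scales multiplicatively.

def task_wall_time_ms(steps, tokens_per_step, ms_per_token):
    """Rough serial wall time (milliseconds) for a multi-step agent run."""
    return steps * tokens_per_step * ms_per_token

fast = task_wall_time_ms(steps=12, tokens_per_step=800, ms_per_token=10)
slow = task_wall_time_ms(steps=12, tokens_per_step=800, ms_per_token=30)
print(fast // 1000, slow // 1000)  # prints 96 288 (seconds)
```

A 3x difference in per-token latency that is tolerable in a single chat reply turns a 1.5-minute agent run into a nearly 5-minute one, which is the "balloons" effect the announcement is addressing.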

Third, OpenAI’s efficiency claim is not generic. It says GPT‑5.5 uses significantly fewer tokens to complete the same Codex tasks. In practice, that’s the difference between an agent that can run repeatedly inside a workflow and one that burns budget or hits rate limits. The benchmark table reinforces this emphasis on end-to-end behavior: Terminal-Bench 2.0 jumps to 82.7% for GPT‑5.5 versus 75.1% for GPT‑5.4, and BrowseComp shows GPT‑5.5 Pro at 90.1% and GPT‑5.4 Pro at 89.3%, suggesting that the “agentic” improvements are not confined to one narrow coding metric.
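The budget side of that argument can be made concrete the same way. The per-task token counts, run rates, and price below are invented; the sketch only shows how per-task token savings compound when an agent runs repeatedly inside a workflow:

```python
# Invented numbers to illustrate why per-task token counts compound:
# an agent invoked repeatedly in a workflow multiplies any savings.

def monthly_token_cost(tokens_per_task, runs_per_day, price_per_mtok, days=30):
    """Total monthly spend for an agent invoked repeatedly in a workflow."""
    total_tokens = tokens_per_task * runs_per_day * days
    return total_tokens / 1_000_000 * price_per_mtok

# Same workload, two hypothetical per-task token footprints.
baseline = monthly_token_cost(50_000, runs_per_day=200, price_per_mtok=10.0)
efficient = monthly_token_cost(30_000, runs_per_day=200, price_per_mtok=10.0)
print(baseline, efficient)  # prints 3000.0 1800.0
```

At the same per-token price, a 40% reduction in tokens per task is a 40% reduction in spend, and it also pushes the agent further from rate limits for the same workload.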

Fourth, the safety and rollout structure hints at how OpenAI expects GPT‑5.5 to be used. The company says it added targeted testing for advanced cybersecurity and biology capabilities and collected feedback from nearly 200 trusted early-access partners. It also distinguishes between ChatGPT/Codex availability and API availability, stating that API deployments require different safeguards and that it will bring GPT‑5.5 and GPT‑5.5 Pro to the API “very soon.” That implies OpenAI is treating agentic capability as something that changes risk profiles, not just performance.

When you connect those points, the “so what” becomes clearer: GPT‑5.5 is designed to make tool-using agents feel like a normal part of work rather than a fragile experiment. The model’s claimed speed parity with GPT‑5.4, its token-efficiency for Codex tasks, and its benchmark emphasis on planning and tool coordination all point to a single goal—reducing the operational friction that usually blocks agentic automation.

One-sentence conclusion: GPT‑5.5’s real shift is that it’s engineered to execute multi-step tasks with tool coordination at production-friendly latency and cost.

Section 3

OpenAI’s own capability framing doubles down on where it expects GPT‑5.5 to matter most: agentic coding, computer use, knowledge work, and early scientific research. The company describes GPT‑5.5 as understanding user intent faster and then carrying more of the work itself, including researching online, analyzing data, and producing documents and spreadsheets.

The announcement also situates GPT‑5.5 inside a broader platform narrative. OpenAI says it is building “the global infrastructure for agentic AI,” and it points to the past year’s acceleration of software engineering. With GPT‑5.5 in Codex and ChatGPT, OpenAI argues that the same transformation is beginning to extend into scientific research and broader computer-based work.

In that context, the benchmark table is not just a scoreboard; it is meant to validate the agentic story across domains. Terminal-Bench 2.0 is explicitly described as testing complex command-line workflows that require planning, iteration, and tool coordination, and GPT‑5.5’s 82.7% is presented as state-of-the-art accuracy. OSWorld-Verified at 78.7% for GPT‑5.5 versus 75.0% for GPT‑5.4 is offered as evidence that the model can handle verified “computer use” tasks. FrontierMath Tier 1–3 and Tier 4 results are included to show reasoning performance across difficulty bands, while CyberGym at 81.8% suggests capability in cybersecurity-oriented environments.

OpenAI also brings in an external evaluation framing through Artificial Analysis. It claims GPT‑5.5 delivers state-of-the-art intelligence at half the cost of competitive frontier coding models on Artificial Analysis’s Coding Index. It then defines the Artificial Analysis Intelligence Index as a weighted average of 10 evals run by an external party: AA-LCR, AA-Omniscience, CritPt, GDPval-AA, GPQA Diamond, Humanity’s Last Exam, IFBench, SciCode, Terminal-Bench Hard, τ²-Bench Telecom.

Even the rollout language reinforces the “real work” positioning. GPT‑5.5 is rolling out to Plus, Pro, Business, and Enterprise users in ChatGPT and Codex, while GPT‑5.5 Pro is rolling out to Pro, Business, and Enterprise users in ChatGPT. That tiering suggests OpenAI expects different usage patterns and risk tolerances across user groups, especially as agentic behavior increases the chance that models will interact with external systems.

One-sentence conclusion: OpenAI is trying to prove GPT‑5.5 can handle the full arc of knowledge work, from planning and tool use to verified execution.

Section 4

The final piece is how OpenAI connects GPT‑5.5’s capabilities to the next step in deployment: the API. The company says API deployments require different safeguards and that it is working with partners and customers on safety and security requirements for serving GPT‑5.5 at scale, then promises that GPT‑5.5 and GPT‑5.5 Pro will arrive on the API “very soon.”

That matters because agentic systems live or die by integration. A model that can plan and use tools is only as useful as the surrounding orchestration—auth, permissions, logging, and safety controls. By emphasizing safeguards and staged rollout, OpenAI signals that it expects GPT‑5.5 to become a building block for production agents, not just a chat experience.
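One common shape for that orchestration layer is a gate in front of every tool call that checks an allowlist and logs the invocation. The sketch below is a generic pattern under assumed names (`ALLOWED_TOOLS`, `gated_call`), not anything OpenAI ships:

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("agent-gateway")

ALLOWED_TOOLS = {"search", "read_file"}  # hypothetical allowlist

def gated_call(tool_name, fn, *args, **kwargs):
    """Check permissions and log before letting an agent invoke a tool."""
    if tool_name not in ALLOWED_TOOLS:
        log.warning("blocked tool call: %s", tool_name)
        raise PermissionError(f"tool {tool_name!r} is not allowed")
    log.info("tool call: %s args=%r", tool_name, args)
    return fn(*args, **kwargs)

# Usage: an allowed call goes through; anything off the list raises.
result = gated_call("search", lambda q: f"results for {q}", "GPT-5.5")
```

The gate is deliberately outside the model: whatever GPT‑5.5 decides to do, the surrounding system still controls which actions actually execute and keeps an audit trail.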

One-sentence conclusion: GPT‑5.5 is positioned as the model layer that makes agentic automation practical enough to ship, not just to demo.