This week, developers building agentic workflows keep repeating the same frustration: “If the model calls a tool, why do we still have to wait so long for the next step?” The complaint lands hardest in code-fixing agents, where Codex-style systems don’t just think; they scan a codebase, read files to build context, apply edits, and then run tests. When that entire loop is implemented as many separate Responses API requests, the round trips become the bottleneck—especially as faster models raise expectations for how quickly the next action should start.
OpenAI’s latest update tries to address that mismatch directly. Instead of treating tool calls as a sequence of independent HTTP exchanges, it rethinks how the agent loop maintains state across turns, and it does so with a WebSockets-based approach that ultimately improves end-to-end speed by 40%.
Section 1: GPT-5.3-Codex-Spark aims for 1,000 TPS, while GPT-5 ran near 65 TPS
OpenAI says the Responses API has so far been served primarily by flagship models such as GPT-5 and GPT-5.2, which decode at roughly 65 tokens per second (TPS). That number matters because it caps how quickly the model can generate output once a request reaches the inference stage.
For GPT-5.3-Codex-Spark, OpenAI’s “fast coding” model, the goal is not a small bump but a step change: it targets a 10x-level improvement, aiming to exceed 1,000 TPS. The company attributes the jump to Cerebras hardware optimized for LLM inference, positioning the model as a system designed to move quickly once it has the right context.
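To make the gap concrete, a rough back-of-the-envelope calculation (token counts are illustrative, not from OpenAI) shows what the decode rate alone means for a moderately sized code edit:

```python
def generation_seconds(tokens: int, tps: float) -> float:
    """Wall-clock time to emit `tokens` output tokens at a given decode rate."""
    return tokens / tps

# A hypothetical 2,000-token patch at flagship speed vs. the Codex-Spark target:
flagship = generation_seconds(2000, 65)    # ~30.8 s of pure decoding
spark = generation_seconds(2000, 1000)     # 2.0 s
```

At 2 seconds of decoding, even a few hundred milliseconds of per-request API overhead becomes a meaningful fraction of the total, which is exactly the dynamic the rest of the article describes.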
But OpenAI also acknowledges a second, less glamorous constraint: API overhead. In late 2025, around November, it began a performance sprint focused on the Responses API's critical-path latency, meaning the time from request arrival to the earliest point at which useful output can be produced.
As part of that sprint, OpenAI says it implemented multiple optimizations to improve time to first token (TTFT), the metric that often determines how “snappy” an API feels to users. It reports an approximately 45% improvement in TTFT.
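TTFT is simple to measure from the client side: it is the delay between issuing a streaming request and receiving the first chunk. A minimal sketch, using a fake generator in place of a real streaming API call (the `startup_delay` stands in for queueing, API overhead, and prefill):

```python
import time
from typing import Iterator

def time_to_first_token(stream: Iterator[str]) -> float:
    """Seconds until the stream yields its first chunk (time to first token)."""
    start = time.perf_counter()
    next(stream)  # block until the first token arrives
    return time.perf_counter() - start

def fake_stream(startup_delay: float) -> Iterator[str]:
    # Stand-in for a streaming Responses API call: nothing is produced
    # until the server has done its pre-inference work and prefill.
    time.sleep(startup_delay)
    yield "first"
    yield "second"

ttft = time_to_first_token(fake_stream(0.05))
```

The same harness applied to a real streaming endpoint is how a 45% TTFT improvement would show up in practice: the measured delay before the first chunk shrinks, regardless of total generation time.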
Even with those gains, OpenAI says GPT-5.3-Codex-Spark still runs into a structural issue: the Responses API overhead remains significant enough that users may still end up waiting on CPU-side API handling rather than benefiting fully from the faster GPU inference. In other words, the model can be fast, but the agent loop can still stall if the system keeps paying the cost of rebuilding state and reprocessing context across many requests.
The tension here is straightforward: faster inference makes the overhead more visible, and the agent experience becomes dominated by how the API orchestrates tool calls and follow-up reasoning.
Section 2: The real change is state reuse across a connection lifetime, not just faster inference
So what actually changes in OpenAI's approach? The key shift is in how the system treats repeated agent steps. In the old design, as OpenAI describes it, each Codex request is handled as an independent unit, so follow-up requests repeatedly reprocess conversation state and any context that could have been reused.
As conversations get longer, the waste grows. Developers effectively pay for “most of the same history” again and again, even when only a small portion of the conversation changes between steps. OpenAI frames this as a structural problem: the system keeps incurring work tied to the full history, even though the incremental value of each new request is smaller than the cost of rebuilding everything.
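The waste compounds quadratically, which a small simulation makes visible (the step sizes are invented for illustration):

```python
def tokens_resent(step_tokens: list[int]) -> int:
    """Total tokens transmitted when each request replays all prior steps."""
    total = 0
    history = 0
    for new_tokens in step_tokens:
        history += new_tokens
        total += history  # each request carries the whole conversation so far
    return total

steps = [500] * 10                 # ten agent steps, 500 new tokens each
full_replay = tokens_resent(steps)  # 27,500 tokens on the wire
delta_only = sum(steps)             # 5,000 if only new tokens were sent
```

Ten steps of 500 new tokens each cost 27,500 tokens of transmission and reprocessing under full replay, versus 5,000 if only the increment were sent; longer conversations widen that gap further.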
To reduce that repeated cost, OpenAI revisits the transport and protocol strategy. Instead of sending the entire conversation history every time over HTTP—creating a fresh connection and re-sending context—it explores whether it can maintain a “persistent connection + state cache.” The goal is to send only the new information required for the next step, while caching reusable state in memory for the duration of the connection.
OpenAI says it evaluated multiple options, including WebSockets and gRPC bidirectional streaming. The final choice is WebSockets, largely because it doesn’t require changing the input/output shape of the Responses API. In OpenAI’s telling, WebSockets provides a developer-friendly messaging model that can slot into existing architectures without forcing a major redesign of the API contract.
However, the first WebSocket prototype did more than shave latency—it changed the way the agent loop is modeled, and that created a new kind of complexity. OpenAI reports that the prototype reduced latency dramatically, but it also made the API “less familiar” and more complex for developers.
In the prototype, OpenAI models an agent rollout as a single long-running Response. Tool calling is handled inside an asyncio-based sampling loop: when the model decides a tool call should happen, the server sends a `response.done` event to the client and then suspends (awaits) while the tool runs. The client executes the tool and returns the result via a `response.append` event, which unblocks the sampling loop so the model can continue.
OpenAI describes this as treating local tool calls similarly to hosted tool calls. In the hosted case, the inference loop pauses when the model requests a web search, the web search service returns results, and the inference continues with the new context. In the prototype, the remote service call is replaced by a WebSocket message to the client, and the client’s tool output is fed back into the context over the same connection.
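The pause/resume mechanic can be sketched with an asyncio future. This is a hypothetical reconstruction, not OpenAI's code: only the event names `response.done` and `response.append` come from the article, and the `Rollout` class and its methods are invented for illustration.

```python
import asyncio

class Rollout:
    """Toy model of the prototype: one long-running response per rollout."""

    def __init__(self):
        self._pending = None  # future resolved by the client's tool output

    async def sampling_loop(self, steps: list[str]) -> list[str]:
        transcript = []
        for step in steps:
            if step.startswith("tool:"):
                # Emit a `response.done`-style event, then await until the
                # client sends the tool output back over the same connection.
                self._pending = asyncio.get_running_loop().create_future()
                transcript.append(f"response.done -> {step}")
                result = await self._pending  # paused; no re-prefill on resume
                transcript.append(f"resumed with {result}")
            else:
                transcript.append(step)
        return transcript

    def append(self, tool_output: str) -> None:
        """Client-side `response.append`: unblocks the sampling loop."""
        self._pending.set_result(tool_output)

async def demo() -> list[str]:
    rollout = Rollout()
    task = asyncio.create_task(
        rollout.sampling_loop(["think", "tool:run_tests", "answer"])
    )
    await asyncio.sleep(0)            # let the loop reach the tool call
    rollout.append("3 tests passed")  # the client reports its tool result
    return await task

transcript = asyncio.run(demo())
```

The key property is that the sampling loop never exits during the tool call; it simply awaits, so the conversation state stays live in memory instead of being torn down and rebuilt per request.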
The causal mechanism is important: by removing repeated API work during the tool-execution window, OpenAI says the system can avoid running certain pre-inference and post-inference steps multiple times. Instead, it aims to run pre-inference once, pause during tool execution, and finish post-inference at the end, so the system pays that overhead once per rollout rather than once per tool boundary.
But there’s a practical twist. OpenAI says it couldn’t ship the prototype as-is because developers would have to adopt a new interaction mode, and that would require reworking how existing integrations structure agent loops.
So the released version returns to the familiar shape: it keeps `response.create` with the same request body, but it changes how state is carried forward. Rather than relying on the long-running Response event model, the system uses `previous_response_id` so follow-up calls can continue from earlier state.
This is where the WebSockets design shows its value without forcing a new developer mental model. In the WebSocket connection, the server maintains an in-memory cache scoped to the connection lifetime. When a follow-up `response.create` includes `previous_response_id`, OpenAI says the server retrieves the prior response state directly from the cache instead of reconstructing the entire conversation from scratch.
OpenAI emphasizes that “reusing previous response state from memory” is the core optimization rationale. The point isn’t that WebSockets magically make inference faster; it’s that the system stops paying the same orchestration and context-rebuild costs repeatedly across tool calls.
Community reaction reflects the tradeoff. Some developers ask why the prototype’s done/append event approach wasn’t kept, since it seems like it could be faster. Others argue that the real-world win comes from reducing overhead without forcing a breaking change to the API integration pattern.
The insight that emerges is that agent latency isn’t only a model-performance problem. It’s also a protocol and orchestration problem: the cost of splitting an agent loop into multiple requests can dominate end-to-end time, even when the model itself runs at high TPS.
Section 3: Connection-lifetime caching cuts TTFT and end-to-end agent waiting by 40%
OpenAI frames the overall objective as a balancing act. It wants to get as close as possible to the low-overhead behavior of the prototype while preserving the API shape developers already understand and have already built around.
The result, according to OpenAI, is that Responses API agent loops run 40% faster end to end. OpenAI also ties the experience back to inference-speed expectations, describing a shift away from the 65 TPS feel toward behavior closer to the 1,000 TPS target.
The resolution of the story is not that the model suddenly becomes “instant.” Instead, OpenAI’s update points to a more actionable diagnosis: when agents feel slow, the slowdown often originates not in the model’s reasoning speed but in the way the system breaks the workflow into repeated request/response cycles.
By caching state across a connection lifetime and reusing previous response state via `previous_response_id`, the system reduces the repeated orchestration work that accumulates during tool-driven agent loops.
This is the direction agent builders need to watch next: as inference accelerates, the protocol layer becomes the bottleneck, and WebSockets-backed state reuse is one of the clearest ways to keep agent loops responsive.