Step 3.7 Flash Hits 400 Tokens Per Second to Solve Agent Latency

The current state of AI agent deployment is defined by a frustrating paradox. We have models capable of reasoning through complex financial audits or cross-referencing a dozen disparate data sources, yet the actual user experience is often a screen that sits idle. Developers have built sophisticated loops in which an agent thinks, searches, and acts, but the latency between those steps breaks the user's concentration and erodes productivity. In high-frequency production environments, a five-second delay on a tool call is not just a nuisance; it is a systemic failure that makes AI agents feel like slow external consultants rather than integrated software.

The Architecture of Instantaneous Execution

StepFun has addressed this latency wall with the release of Step 3.7 Flash, a model specifically engineered to prioritize execution speed without sacrificing the intelligence of a massive parameter count. The core of this efficiency is a sparse Mixture of Experts (MoE) architecture. While the model boasts a total of 198 billion parameters, it does not engage the entire network for every token. Instead, it activates only approximately 11 billion parameters per token during the inference process. This surgical approach to computation allows the model to maintain the broad knowledge base of a large-scale model while slashing the operational overhead that typically slows down inference.
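The sparsity described above can be put in numbers with a quick back-of-the-envelope calculation. The sketch below uses only the two figures quoted in this article (198 billion total parameters, roughly 11 billion active per token); the function name is illustrative and not part of any StepFun tooling.

```python
def active_fraction(total_params_b: float, active_params_b: float) -> float:
    """Return the share of parameters engaged per token in a sparse MoE model."""
    if total_params_b <= 0:
        raise ValueError("total_params_b must be positive")
    return active_params_b / total_params_b


# Figures quoted in the article, in billions of parameters.
TOTAL_PARAMS_B = 198.0
ACTIVE_PARAMS_B = 11.0  # approximate, per the article

fraction = active_fraction(TOTAL_PARAMS_B, ACTIVE_PARAMS_B)
print(f"Active share per token: {fraction:.1%}")
```

Under these figures, only about one parameter in eighteen takes part in computing any single token.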

The result is a throughput of up to 400 tokens per second. The 198 billion total is made up of a 196-billion-parameter language backbone and a dedicated 1.8-billion-parameter vision encoder. This native multimodal integration means the model does not rely on a separate, slower translation layer to understand images; it processes visual data as a primary input. This is paired with a 256k-token context window, providing enough headroom to ingest dozens of technical manuals or massive system logs in a single pass without losing the thread of the conversation.
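To see what that throughput means for an agent loop, a rough estimate of decoding time is enough. The sketch below assumes the quoted peak of 400 tokens per second is sustained and ignores prompt processing, network round trips, and tool execution time. The workload (10 steps of 300 generated tokens) and the 50 tokens-per-second comparison rate are hypothetical, not measured figures.

```python
PEAK_TOKENS_PER_SECOND = 400.0  # upper bound quoted in the article


def generation_seconds(output_tokens: int,
                       tokens_per_second: float = PEAK_TOKENS_PER_SECOND) -> float:
    """Decoding time only; prefill, network, and tool time are ignored."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return output_tokens / tokens_per_second


def agent_loop_seconds(steps: int, tokens_per_step: int,
                       tokens_per_second: float = PEAK_TOKENS_PER_SECOND) -> float:
    """Total decoding time for a loop that generates the same amount each step."""
    return steps * generation_seconds(tokens_per_step, tokens_per_second)


# Hypothetical workload: 10 agent steps, 300 generated tokens per step.
fast = agent_loop_seconds(10, 300)        # at the quoted peak rate
slow = agent_loop_seconds(10, 300, 50.0)  # hypothetical slower rate for contrast
print(f"400 tok/s: {fast:.1f} s, 50 tok/s: {slow:.1f} s")
```

Because the per-step delay is multiplied by the number of steps, throughput gains compound over long agent trajectories.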

For developers, the model introduces a granular control mechanism through three distinct reasoning levels: low, medium, and high. This allows a system architect to assign the low setting to simple data extraction tasks to maximize speed, while reserving the high setting for complex logical deductions. This flexibility transforms the model from a static tool into a scalable resource that can be tuned to the cognitive depth a given task requires.
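One way to use the three levels is a small routing table that maps each kind of task to a reasoning setting before the request is sent. This is a minimal sketch: the task names are hypothetical, and the article does not say how the level is passed to the model, so the request parameter should be taken from the provider's documentation.

```python
from enum import Enum


class ReasoningLevel(str, Enum):
    """The three reasoning levels named in the article."""
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


# Hypothetical task categories; adjust to the workloads of a real system.
TASK_LEVELS = {
    "extract_fields": ReasoningLevel.LOW,        # simple data extraction
    "summarize_logs": ReasoningLevel.MEDIUM,
    "plan_multi_step_fix": ReasoningLevel.HIGH,  # complex logical deduction
}


def pick_reasoning_level(task_kind: str) -> ReasoningLevel:
    """Return the configured level, falling back to MEDIUM for unknown tasks."""
    return TASK_LEVELS.get(task_kind, ReasoningLevel.MEDIUM)


for kind in ("extract_fields", "plan_multi_step_fix", "unlisted_task"):
    print(kind, "->", pick_reasoning_level(kind).value)
```

Defaulting unknown tasks to the middle setting is a design choice made for this sketch; a latency-sensitive system might prefer to default to low.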

This architectural efficiency translates directly into a competitive pricing structure. Input tokens cost $0.20 per million on a cache miss and $0.04 per million on a cache hit. Output tokens are priced at $1.15 per million. The five-fold reduction for cached input is particularly important for agentic workflows, where the same system prompts and context are often reused across many iterative loops. By easing the dual constraints of latency and token cost, StepFun is removing the practical barriers that previously forced companies to limit their agents' capabilities to save on budget or time.
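The effect of caching on an agent loop can be estimated directly from the listed prices. The sketch below uses the three prices quoted above. The token counts (a 50,000-token prompt with 45,000 tokens served from cache and 500 output tokens) are hypothetical and chosen only for illustration.

```python
# Prices quoted in the article, in US dollars per million tokens.
PRICE_INPUT_MISS = 0.20
PRICE_INPUT_HIT = 0.04
PRICE_OUTPUT = 1.15


def request_cost_usd(input_tokens: int, cached_tokens: int, output_tokens: int) -> float:
    """Cost of one request, given how many input tokens were served from cache."""
    if not 0 <= cached_tokens <= input_tokens:
        raise ValueError("cached_tokens must be between 0 and input_tokens")
    uncached_tokens = input_tokens - cached_tokens
    total = (uncached_tokens * PRICE_INPUT_MISS
             + cached_tokens * PRICE_INPUT_HIT
             + output_tokens * PRICE_OUTPUT)
    return total / 1_000_000


# Hypothetical agent step: 50,000 input tokens, 500 output tokens.
with_cache = request_cost_usd(50_000, 45_000, 500)
without_cache = request_cost_usd(50_000, 0, 500)
print(f"with cache: ${with_cache:.6f}, without cache: ${without_cache:.6f}")
```

In this hypothetical case the cached request costs roughly a third of the uncached one, and the saving repeats on every iteration that reuses the same prefix.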

From Reasoning to Reliable Action

Speed is irrelevant if the agent cannot accurately perceive its environment. The true shift in Step 3.7 Flash is not just how fast it talks, but how accurately it sees and acts. In the SimpleVQA search category, the model secured the top spot with a score of 79.2. In V*, a benchmark that evaluates visual reasoning through Python code execution, it scored 95.3. These results point to a specialized ability to transform dense visual information (such as UI wireframes, complex GUI layouts, and data charts) into structured, executable code. Unlike models that simply describe an image, Step 3.7 Flash can identify missing data in a visual asset, trigger an external query to verify the context, and then finalize a conclusion.

Reliability in multi-step orchestration is where most agents fail, often falling into recursive loops or ignoring system constraints. Step 3.7 Flash demonstrated significant resilience in the ClawEval-1.1 benchmark, scoring 67.1, which is 7.3 points ahead of the second-place group's score of 59.8. This gap suggests a superior ability to follow strict system policies and maintain trajectory integrity during long-running tasks. Scores of 49.5 in Toolathlon and 48.1 in HLE w. Tool lend further support, suggesting that the model can interact with various Application Programming Interfaces (APIs) without deviating from the original instructions.

The model's capacity for autonomous software engineering is perhaps its most potent application. In the SWE-Bench PRO benchmark, which simulates real-world development complexity, Step 3.7 Flash ranked second overall with a score of 56.3. It does not merely suggest code snippets; it can independently navigate repositories spanning multiple files, isolate bugs from issue reports, and generate functional patches that pass automated unit tests. Scores of 59.5 in Terminal-Bench 2.1 and 45.8 in GDPVal-AA indicate that the model can operate stably within a terminal environment, modifying files and correcting errors in a closed loop.

To facilitate immediate adoption, StepFun has made the model available via the StepFun Open Platform, OpenRouter, and NVIDIA NIM. For organizations with strict data privacy requirements, the model supports local deployment. It is optimized for NVIDIA DGX Station and systems powered by the AMD Ryzen AI Max+ 395. It can also run on Mac Studio and MacBook Pro hardware equipped with 128 GB of RAM or more. This local capability allows enterprises to deploy software engineering agents that can modify proprietary internal codebases without ever sending sensitive data to an external cloud server.

The bottleneck for AI agents has shifted. It is no longer a question of whether a model can reason through a problem, but whether it can execute the solution fast enough to be useful in a live production environment. By optimizing the relationship between parameter activation and throughput, Step 3.7 Flash moves the industry closer to agents that act as instantaneous extensions of the user's intent.
