Every developer who has attempted to deploy a production-grade AI agent has hit the same wall. The experience is familiar: the agent starts a complex multi-step task with promise, but as the conversation grows, the model begins to hallucinate or lose the original goal. Then comes the technical collapse. The GPU memory spikes as the KV cache expands, the context window overflows, and the model enters a loop of repetitive tool calls that eventually crash the session. For years, the industry has treated this as an inevitable tax of long-context reasoning, assuming that the only way forward was more VRAM or aggressive truncation.
The Architecture of Extreme Efficiency
DeepSeek V4 arrives as a direct response to this memory bottleneck, pairing a 1 million token context window with an architecture built around computational efficiency rather than raw capacity. The technical leap is most evident in the resource consumption of its variants. DeepSeek V4-Pro reduces single-token inference FLOPs to just 27% of what V3.2 required, while slashing KV cache usage to 10%. The V4-Flash variant pushes these boundaries further, dropping FLOPs to 10% and KV cache usage to 7%. Measured against a standard Grouped Query Attention (GQA) baseline with 8 KV heads and bfloat16 storage, V4's KV cache footprint is approximately 2% of the baseline's.
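To make these percentages concrete, here is a back-of-the-envelope sizing of the GQA baseline at a full 1 million token context. The head dimension (128) and layer count (61, taken from the stack described below) are assumptions for illustration; only the 8 KV heads, bfloat16 storage, and the ~2% ratio come from the article.

```python
# Rough KV cache sizing for a 1M-token context.
# Assumed for illustration: head_dim=128, 61 layers.

def gqa_kv_bytes_per_token(kv_heads=8, head_dim=128, layers=61, dtype_bytes=2):
    """Per-token KV cache for a GQA baseline: K and V stored in bfloat16."""
    return 2 * kv_heads * head_dim * layers * dtype_bytes  # 2 = K plus V

baseline = gqa_kv_bytes_per_token()   # bytes per token
v4_estimate = baseline * 0.02         # ~2% of the GQA baseline, per the article

context = 1_000_000
print(f"GQA baseline @1M tokens: {baseline * context / 2**30:.1f} GiB")
print(f"V4 estimate  @1M tokens: {v4_estimate * context / 2**30:.2f} GiB")
```

Under these assumptions the baseline lands in the hundreds of gigabytes for a single 1M-token sequence, which is exactly why a 2% footprint changes what hardware can serve such contexts.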
This efficiency is not the result of a single tweak but a fundamental restructuring of the attention mechanism. DeepSeek V4 splits attention into two distinct paths and interleaves them across its 61-layer stack. The first path, Compressed Sparse Attention (CSA), compresses KV items by a factor of 4 along the sequence dimension. To retrieve from this compressed store, a Lightning Indexer operating in FP4 precision selects the top-k compressed blocks for each query. The second path, Heavily Compressed Attention (HCA), applies a massive 128x compression to KV items and bypasses sparse selection entirely. To maintain precision where it matters, the model stores most KV items in FP8, reserving BF16 exclusively for the Rotary Positional Embedding (RoPE) dimensions. In the final architecture, layers 0 and 1 utilize HCA, while layers 2 through 60 alternate between CSA and HCA.
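The interleaving schedule above can be sketched directly. Note one assumption: the article says layers 2 through 60 alternate between CSA and HCA but does not state which path starts the alternation, so the sketch below assumes layer 2 begins on CSA.

```python
# Sketch of the 61-layer attention schedule described above.
# Assumption: the alternation over layers 2-60 starts on CSA.

NUM_LAYERS = 61

def attention_schedule():
    schedule = []
    for layer in range(NUM_LAYERS):
        if layer < 2:
            schedule.append("HCA")   # layers 0 and 1 use HCA
        elif (layer - 2) % 2 == 0:
            schedule.append("CSA")   # assumed starting phase
        else:
            schedule.append("HCA")
    return schedule

sched = attention_schedule()
print(sched[:6])  # ['HCA', 'HCA', 'CSA', 'HCA', 'CSA', 'HCA']
```

Either starting phase yields 30 layers of one path and 29 of the other across the alternating range; the structural point is that the cheap 128x-compressed HCA path is woven through the entire stack rather than confined to a few layers.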
Solving the Agentic Loop
Increasing the context window is a quantitative win, but DeepSeek V4 introduces qualitative shifts specifically designed for agentic workflows. The most significant friction point in current agent design is the loss of internal reasoning. In V3.2, the model maintained reasoning traces between tool calls, but these were discarded the moment a new user message arrived. DeepSeek V4 changes this by preserving reasoning content across user message boundaries whenever tool calls are involved, ensuring the agent does not forget its plan mid-execution.
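A minimal sketch of that retention rule, using a hypothetical message schema (the `role`, `reasoning`, and `tool_calls` keys are illustrative assumptions, not DeepSeek's actual API shape):

```python
# Hypothetical history-pruning rule: reasoning survives a user-message
# boundary only when the assistant turn that produced it made tool calls.

def prune_reasoning(history):
    pruned = []
    for i, msg in enumerate(history):
        msg = dict(msg)  # avoid mutating the caller's messages
        followed_by_user = any(m["role"] == "user" for m in history[i + 1:])
        # V3.2 behavior: discard reasoning once a later user message arrives.
        # V4 behavior (sketched): keep it if the turn involved tool calls.
        if (msg["role"] == "assistant" and followed_by_user
                and not msg.get("tool_calls")):
            msg.pop("reasoning", None)
        pruned.append(msg)
    return pruned

history = [
    {"role": "user", "content": "Refactor the build."},
    {"role": "assistant", "reasoning": "plan: step 1...",
     "tool_calls": [{"name": "read_file"}]},
    {"role": "assistant", "reasoning": "scratch work", "content": "Done."},
    {"role": "user", "content": "Now run the tests."},
]
pruned = prune_reasoning(history)
```

In this sketch the tool-calling turn keeps its plan across the next user message, while the plain answer's scratch reasoning is dropped, which is the behavioral difference the article describes.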
To solve the persistent issue of parsing errors, DeepSeek has moved away from relying purely on JSON. The model now utilizes the `|DSML|` special token and an XML-based tool calling format, eliminating the common failure point of escape characters inside JSON strings. The system distinguishes between parameter types explicitly: string parameters are passed with `string="true"`, while structured parameters are passed as JSON with `string="false"`. This architectural choice removes the parsing ambiguity that often causes agents to fail during complex API interactions.
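A hypothetical rendering of what such a call might look like. The tag names (`tool_call`, `param`) are illustrative assumptions; only the `string="true"`/`string="false"` distinction comes from the article, and the `|DSML|` token is omitted since its exact placement is not documented here.

```python
# Illustrative XML-style tool call emitter: string parameters pass through
# with XML escaping only, structured parameters are serialized as JSON.
import json
from xml.sax.saxutils import escape

def render_tool_call(name, params, is_string):
    """is_string maps each parameter to True (plain string) or False (JSON)."""
    lines = [f'<tool_call name="{name}">']
    for key, value in params.items():
        if is_string[key]:
            body, flag = escape(value), "true"
        else:
            body, flag = escape(json.dumps(value)), "false"
        lines.append(f'  <param name="{key}" string="{flag}">{body}</param>')
    lines.append("</tool_call>")
    return "\n".join(lines)

xml_out = render_tool_call(
    "write_file",
    {"path": "notes.txt", "entries": [1, 2]},
    {"path": True, "entries": False},
)
print(xml_out)
```

The payoff is visible in the string path: a parameter containing quotes or newlines never passes through a JSON string literal, so there is no escape layer for the model to get wrong.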
Beyond the model weights, the development of the DSec (DeepSeek Elastic Compute) platform provides the infrastructure necessary for these agents to learn. Built in Rust, DSec allows the model to undergo reinforcement learning (RL) within actual tool environments rather than simulated ones. The platform provides a unified Python SDK that abstracts four distinct execution environments: simple function calls, containers, microVMs via Firecracker, and full virtual machines via QEMU. This allows DeepSeek to run hundreds of thousands of concurrent sandboxes in a single cluster, training the model on the actual consequences of its actions in a real OS environment.
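The unified-SDK idea can be sketched as a single interface dispatching over the four isolation tiers named above. The class and method names here are illustrative assumptions, not DSec's actual Python API.

```python
# Hypothetical sketch: one sandbox interface, four execution backends.
from enum import Enum

class Backend(Enum):
    FUNCTION = "function"    # simple in-process function call
    CONTAINER = "container"  # container runtime
    MICROVM = "microvm"      # Firecracker microVM
    VM = "vm"                # full QEMU virtual machine

class Sandbox:
    def __init__(self, backend: Backend):
        self.backend = backend

    def run(self, command: str) -> str:
        # A real implementation would dispatch to the chosen isolation
        # layer; this sketch only records the routing decision.
        return f"[{self.backend.value}] {command}"

print(Sandbox(Backend.MICROVM).run("pytest -q"))
```

The design point is that RL rollouts can be moved between tiers (a cheap function call for unit-level tools, a full VM for OS-level actions) without changing the training loop's code, which is what makes hundreds of thousands of concurrent sandboxes tractable.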
While some in the community remain skeptical of whether benchmark gains translate to real-world reliability, the technical shift is undeniable. By reducing the operational cost of 1 million tokens—with V4-Pro requiring 73% fewer operations and 90% less KV cache than V3.2—DeepSeek has moved the conversation from how to manage memory to how to utilize it. The combination of interleaved compressed attention and environment-aware RL marks a transition toward models that do not just process text, but operate software.
This blueprint for compressed attention and specialized agent post-training now sets the standard for the next generation of open-weights models.