Modern software engineering is increasingly a battle against context limits. Developers frequently find themselves attempting to feed entire repositories, sprawling technical manuals, and thousands of lines of legacy code into a prompt, only to hit the ceiling of a model's memory or suffer a precipitous drop in reasoning quality. The industry has long accepted a brutal trade-off: as input length grows, attention compute scales quadratically and the key-value cache balloons, so memory consumption spikes and inference speeds crawl. This bottleneck has turned the quest for a truly usable million-token context window into the primary frontier for large language model efficiency.

The Architecture of DeepSeek-V4

DeepSeek-V4 arrives as a dual-model series designed to break this memory-performance deadlock using a Mixture-of-Experts (MoE) architecture. This design allows the models to maintain massive knowledge bases while only activating a fraction of their parameters for any given token, drastically reducing the computational overhead. The series is split into two distinct tiers to serve different deployment needs. DeepSeek-V4-Pro is the heavyweight, boasting a total of 1.6 trillion parameters, though it only activates 49 billion parameters during inference. For environments where agility is paramount, DeepSeek-V4-Flash offers a leaner profile with 284 billion total parameters and 13 billion active parameters.
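
To make the total-versus-active distinction concrete, here is a minimal sketch of top-k expert routing in PyTorch. The dimensions, expert count, and router below are toy assumptions chosen for readability; they do not reflect DeepSeek-V4's actual layer design.

```python
# Toy Mixture-of-Experts layer: a router picks top_k of n_experts per token,
# so only a fraction of the layer's parameters run in any forward pass.
import torch
import torch.nn as nn

class ToyMoELayer(nn.Module):
    def __init__(self, d_model=64, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)  # gating network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):                         # x: (n_tokens, d_model)
        weights, idx = self.router(x).topk(self.top_k, dim=-1)
        weights = weights.softmax(dim=-1)         # renormalize over chosen experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):               # dispatch tokens to their experts
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k, None] * expert(x[mask])
        return out

y = ToyMoELayer()(torch.randn(10, 64))  # each token activates 2 of 8 expert FFNs
```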

Both models support a context window of up to 1 million tokens. To manage the precision required at such scale, DeepSeek employs a sophisticated quantization strategy: the models run in FP8 mixed precision, dropping to FP4 for specific MoE expert parameters to compress the footprint further without a significant loss of accuracy. This hardware-aware approach ensures the models can actually fit into available VRAM during high-load inference.
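
The standard recipe behind this kind of mixed precision is blockwise scaling: each block of weights is rescaled so its largest magnitude fits the FP8 E4M3 range before the cast. The sketch below illustrates that general idea; the block size is an arbitrary assumption, the torch.float8_e4m3fn dtype requires a recent PyTorch build, and nothing here claims to reproduce DeepSeek's actual kernels.

```python
# Blockwise FP8-style quantization sketch: scale each 128-weight block so its
# max magnitude maps to E4M3's largest finite value (448), then cast down.
import torch

E4M3_MAX = 448.0

def quantize_blockwise_fp8(w: torch.Tensor, block: int = 128):
    blocks = w.reshape(-1, block)
    scale = blocks.abs().amax(dim=1, keepdim=True) / E4M3_MAX
    scale = scale.clamp(min=1e-12)                   # guard all-zero blocks
    q = (blocks / scale).to(torch.float8_e4m3fn)     # needs PyTorch >= 2.1
    return q, scale

def dequantize(q, scale, shape):
    return (q.to(torch.float32) * scale).reshape(shape)

w = torch.randn(1024, 1024)
q, s = quantize_blockwise_fp8(w.flatten())
w_hat = dequantize(q, s, w.shape)
print((w - w_hat).abs().max())  # reconstruction error stays small
```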

The training regime for DeepSeek-V4 was equally rigorous, beginning with a pre-training phase involving over 32 trillion high-quality tokens. The post-training pipeline followed a two-stage paradigm. First, the team used Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO) based reinforcement learning to cultivate specialized domain experts. Second, they applied on-policy distillation, where the model's own high-quality outputs were fed back into the training loop to coalesce these experts into a single, unified model. To ensure the training didn't collapse under its own weight, the team integrated the Muon optimizer for faster convergence and a Manifold-Constrained Hyper-Connections (mHC) structure to stabilize signal transmission across the deep layers of the network.
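
Of those pieces, GRPO is the most self-contained to illustrate: it scores each sampled completion relative to a group drawn from the same prompt, which removes the need for a separate learned critic. A minimal sketch of the group-relative advantage computation, with toy reward values:

```python
# GRPO's core trick: normalize each completion's reward against the mean and
# std of its own sampling group, yielding a critic-free advantage signal.
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6):
    """rewards: (n_prompts, group_size) scalar reward per sampled completion."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)  # completions above the group mean get A > 0

rewards = torch.tensor([[0.2, 0.9, 0.5, 0.4]])  # four completions of one prompt
print(group_relative_advantages(rewards))
# These advantages then weight the policy-gradient update, clipped PPO-style.
```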

The Hybrid Attention Breakthrough

While the parameter counts are impressive, the real technical shift lies in how DeepSeek-V4 handles the attention mechanism. The core innovation is a Hybrid Attention architecture that combines Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA). This combination attacks the primary pain point of long-context LLMs: the unchecked growth of the Key-Value (KV) cache. In traditional transformers, the KV cache grows linearly with sequence length, eventually consuming all available GPU memory and slowing inference to a crawl.
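
To see why that linear growth becomes ruinous at this scale, a quick back-of-envelope calculation helps. The layer and head counts below are illustrative assumptions, not DeepSeek-V4's published configuration:

```python
# Back-of-envelope KV-cache sizing for a vanilla transformer: 2 tensors
# (keys and values) cached per layer, per KV head, per token.
def kv_cache_bytes(seq_len, n_layers=60, n_kv_heads=8, head_dim=128, bytes_per=2):
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per

for n in (8_000, 128_000, 1_000_000):
    print(f"{n:>9,} tokens -> {kv_cache_bytes(n) / 1e9:6.1f} GB per sequence")
# 8,000 tokens -> 2.0 GB ... 1,000,000 tokens -> 245.8 GB: linear, but ruinous at 1M
```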

By implementing this hybrid approach, DeepSeek-V4-Pro achieves a staggering reduction in resource consumption compared to its predecessor, DeepSeek-V3.2. When processing a 1 million token context, the floating-point operations (FLOPs) required for single-token inference dropped to just 27% of the V3.2 levels. More critically, the KV cache usage—the primary memory hog—was slashed to only 10% of what the previous version required. This means that the same hardware can now process ten times the context or handle significantly more concurrent users without a corresponding increase in infrastructure costs.
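
Taken at face value, those ratios translate directly into capacity planning. The baseline numbers in this sketch are hypothetical placeholders, used only to make the proportions concrete:

```python
# Applying the reported V4-Pro ratios at a 1M-token context: 27% of the
# per-token FLOPs and 10% of the KV cache of V3.2. Baselines are invented.
baseline_kv_gb = 200.0            # assumed V3.2 KV cache at 1M tokens
baseline_flops_per_token = 1.0    # normalized V3.2 per-token inference cost

v4_kv_gb = baseline_kv_gb * 0.10              # -> 20 GB: ~10x context headroom
v4_flops = baseline_flops_per_token * 0.27    # -> 0.27: ~3.7x fewer FLOPs

print(f"KV cache at 1M tokens: {baseline_kv_gb:.0f} GB -> {v4_kv_gb:.0f} GB")
print(f"Normalized FLOPs/token: {baseline_flops_per_token:.2f} -> {v4_flops:.2f}")
```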

This efficiency does not come at the cost of intelligence. On the MMLU benchmark, which tests general knowledge and problem-solving, the Pro-Base model scored 90.1%, surpassing the 87.8% achieved by V3.2-Base. The gap is even more pronounced on the more challenging MMLU-Pro set, where Pro-Base reached 73.5% against V3.2-Base's 65.5%. For workloads that demand maximum reasoning capability, the Pro-Max mode pushes the model into the top tier of coding benchmarks, narrowing the gap between open-weights models and proprietary, closed-source giants. Interestingly, the Flash-Max model shows that, given a sufficient reasoning budget, it can match the inference performance of the Pro version, making high-tier reasoning accessible even in resource-constrained environments.

Developers looking to integrate these models into their pipelines can pull them directly via the Hugging Face CLI using the following commands:

```bash
huggingface-cli download deepseek-ai/DeepSeek-V4-Pro
huggingface-cli download deepseek-ai/DeepSeek-V4-Flash
```
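
For scripted pipelines, the same snapshots can also be fetched from Python via the huggingface_hub library's snapshot_download helper; the repo IDs are the ones from the commands above, and the local_dir destination here is an arbitrary choice.

```python
# Fetching a full model snapshot from Python rather than the shell.
# Omit local_dir to fall back to the default Hugging Face cache.
from huggingface_hub import snapshot_download

snapshot_download("deepseek-ai/DeepSeek-V4-Pro", local_dir="./deepseek-v4-pro")
```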

DeepSeek-V4 effectively decouples context length from memory exhaustion, establishing a new efficiency benchmark for the open-source community.