The current arms race in large language models has shifted from raw parameter counts to the battle for the context window. For developers, the promise of feeding an entire codebase or a library of legal documents into a single prompt is often met with the harsh reality of the memory wall. As the input grows, the key-value (KV) cache expands linearly with sequence length while the cost of full attention grows quadratically, often leading to out-of-memory failures or prohibitive API costs. The industry has largely accepted this trade-off: if you want a million-token window, you must pay for it in massive VRAM overhead and slower inference speeds.
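To see why long contexts hit a wall, a back-of-envelope estimate of a dense KV cache is enough. The layer and head counts below are illustrative placeholders, not DeepSeek's actual configuration:

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Rough size of a dense KV cache: keys + values for every layer and position."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

# Hypothetical dense-attention config at a 1M-token context with an FP16 cache:
size = kv_cache_bytes(num_layers=60, num_kv_heads=8, head_dim=128, seq_len=1_000_000)
print(f"{size / 1e9:.1f} GB per sequence")  # ~245.8 GB for these placeholder numbers
```

Even with generous rounding, a single million-token sequence can demand hundreds of gigabytes of cache, which is why naive scaling of the context window is not an option.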
The Architecture of DeepSeek-V4
DeepSeek-V4 arrives as a direct challenge to this bottleneck, utilizing a Mixture-of-Experts (MoE) architecture to decouple total model capacity from active computational cost. The series is split into two primary tiers designed for different operational needs. The DeepSeek-V4-Pro model is the heavyweight of the family, boasting a total of 1.6 trillion parameters, yet it only activates 49 billion parameters during any single inference step. In contrast, DeepSeek-V4-Flash is built for speed and agility, featuring 284 billion total parameters with only 13 billion active during inference.
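As a rough illustration of how an MoE layer decouples total capacity from per-token compute, here is a minimal top-k routed layer in PyTorch. The expert count, dimensions, and routing details are generic placeholders, not DeepSeek-V4's actual design:

```python
import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    """Generic top-k routed MoE layer: each token is processed by only k of the experts."""
    def __init__(self, d_model=1024, d_ff=4096, num_experts=64, k=4):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )
        self.k = k

    def forward(self, x):                      # x: (num_tokens, d_model)
        gate = self.router(x).softmax(dim=-1)  # routing probabilities over all experts
        weights, idx = gate.topk(self.k, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):             # naive loops; production kernels batch tokens by expert
            for e in idx[:, slot].unique():
                mask = idx[:, slot] == e
                out[mask] += weights[mask, slot].unsqueeze(-1) * self.experts[int(e)](x[mask])
        return out

moe = TopKMoE()
print(moe(torch.randn(32, 1024)).shape)        # torch.Size([32, 1024])
```

The full parameter count grows with the number of experts, but each token only ever touches k of them, which is the same principle that lets a 1.6-trillion-parameter model run with 49 billion active parameters per step.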
Both models support a context length of 1 million tokens. To make this feasible on existing hardware, DeepSeek implemented a sophisticated precision strategy: the models run in FP8 mixed precision or a hybrid FP4 + FP8 mixed scheme, with the expert parameters stored in 4-bit precision. This drastically reduces the memory footprint without sacrificing the quality required for complex tasks.
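The exact FP4 format is not spelled out here, but the memory arithmetic can be illustrated with a toy block-wise 4-bit quantizer. Real FP4 schemes (e.g., e2m1 values with per-block scales) differ in detail; this sketch only shows why storing experts in 4 bits shrinks the footprint by roughly 4x versus FP16:

```python
import math
import torch

def quantize_int4_blockwise(w, block=128):
    """Toy symmetric 4-bit quantization with one scale per block of weights.
    Values live in [-8, 7] (4 bits); stored in int8 here for simplicity, while a
    real kernel would pack two values per byte to realize the memory saving."""
    flat = w.flatten().float()
    pad = (-flat.numel()) % block
    flat = torch.cat([flat, flat.new_zeros(pad)])
    blocks = flat.view(-1, block)
    scales = (blocks.abs().amax(dim=1, keepdim=True) / 7.0).clamp_min(1e-8)
    q = (blocks / scales).round().clamp_(-8, 7).to(torch.int8)
    return q, scales

def dequantize_int4_blockwise(q, scales, shape):
    flat = (q.float() * scales).flatten()
    return flat[: math.prod(shape)].view(shape)

w = torch.randn(4096, 1024)              # stand-in for one expert's weight matrix
q, s = quantize_int4_blockwise(w)
w_hat = dequantize_int4_blockwise(q, s, w.shape)
print((w - w_hat).abs().mean())          # small reconstruction error
```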
The training regime for DeepSeek-V4 was equally rigorous, beginning with a pre-training phase involving over 32 trillion high-quality tokens. This was followed by a two-stage post-training pipeline. The first stage focused on domain-specific expertise through Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) based on Group Relative Policy Optimization (GRPO). This allowed the model to cultivate independent experts across various fields. The second stage employed on-policy distillation, a process of knowledge transfer that consolidated these disparate domain abilities into a unified, cohesive model.
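The defining trick of GRPO is that it replaces a learned value model with group-relative scoring: several responses are sampled for the same prompt, and each one is judged against the group's mean reward. A minimal sketch of that advantage computation (not the full RL loop) looks like this:

```python
import torch

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: normalize each sampled response's reward
    against the mean and std of its own group instead of a critic's estimate."""
    rewards = torch.as_tensor(rewards, dtype=torch.float32)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Hypothetical 0/1 rewards for six sampled answers to one prompt:
print(grpo_advantages([1.0, 0.0, 0.0, 1.0, 1.0, 0.0]))
```

These advantages then weight the token-level probability ratios in a PPO-style clipped objective, which keeps the pipeline simple enough to run at the scale described above.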
Breaking the Memory Bottleneck
While the parameter efficiency is impressive, the true technical breakthrough lies in how DeepSeek-V4 handles attention. The model introduces a hybrid attention architecture that combines Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA), and this is the mechanism that addresses the memory crisis of long-context windows. When processing 1 million tokens, DeepSeek-V4 reduces the floating-point operations (FLOPs) required to generate a single token to just 27% of what DeepSeek-V3.2 needed. More critically, the KV cache occupancy, the primary driver of VRAM consumption, is cut to only 10% of the previous version's requirements.
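The CSA/HCA internals aside, the general idea behind sparse long-context attention can be sketched generically: score the cached positions cheaply, then run the expensive softmax and value aggregation over only a small selected subset of the cache. The snippet below is a conceptual illustration under that assumption, not DeepSeek's implementation:

```python
import torch
import torch.nn.functional as F

def topk_sparse_attention(q, k_cache, v_cache, keep=256):
    """Pick the top-`keep` cached positions per head with a cheap scoring pass,
    then attend only over those, so the softmax and value aggregation no longer
    touch the whole million-token cache."""
    # q: (heads, d); k_cache, v_cache: (heads, seq, d)
    d = q.shape[-1]
    scores = torch.einsum("hd,hsd->hs", q, k_cache) / d ** 0.5
    idx = scores.topk(keep, dim=-1).indices                     # (heads, keep)
    gather_idx = idx.unsqueeze(-1).expand(-1, -1, d)
    k_sel = torch.gather(k_cache, 1, gather_idx)
    v_sel = torch.gather(v_cache, 1, gather_idx)
    attn = F.softmax(torch.einsum("hd,hkd->hk", q, k_sel) / d ** 0.5, dim=-1)
    return torch.einsum("hk,hkd->hd", attn, v_sel)

heads, seq, d = 8, 4096, 128
out = topk_sparse_attention(torch.randn(heads, d), torch.randn(heads, seq, d), torch.randn(heads, seq, d))
print(out.shape)   # torch.Size([8, 128])
```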
This efficiency is augmented by Manifold-Constrained Hyper-Connections (mHC), which stabilize signal transmission between layers, preventing the gradient degradation often seen in ultra-deep MoE models. The team also integrated the Muon optimizer to accelerate convergence and enhance training stability. These architectural choices translate directly into benchmark performance. The DeepSeek-V4-Pro-Base model recorded an accuracy of 90.1% on the MMLU benchmark and 73.5% on MMLU-Pro, effectively closing the gap between open-weights models and the most powerful proprietary systems.
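For context on the optimizer, Muon's core idea is to orthogonalize the momentum-smoothed gradient of each 2-D weight matrix before applying it. The sketch below uses the classic cubic Newton-Schulz iteration for readability; the published Muon implementation uses a tuned quintic variant, and the hyperparameters here are illustrative:

```python
import torch

def newton_schulz_orthogonalize(g, steps=5, eps=1e-7):
    """Approximate the nearest (semi-)orthogonal matrix to g via Newton-Schulz
    iterations (cubic form X <- 1.5*X - 0.5*X*X^T*X)."""
    x = g / (g.norm() + eps)   # Frobenius norm is a cheap bound keeping the spectral norm <= 1
    for _ in range(steps):
        x = 1.5 * x - 0.5 * x @ x.T @ x
    return x

def muon_style_step(weight, grad, momentum, lr=0.02, beta=0.95):
    """One Muon-flavored update for a 2-D weight: momentum first, then orthogonalize."""
    momentum.mul_(beta).add_(grad)
    weight.add_(newton_schulz_orthogonalize(momentum), alpha=-lr)
    return weight, momentum
```

Normalizing the update's singular values in this way keeps step sizes well conditioned across layers, which is the stability property the training recipe is leaning on.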
From a practical deployment perspective, the model offers varying modes to balance cost and intelligence. The DeepSeek-V4-Pro-Max mode is optimized for high-stakes agentic workflows and complex coding tasks where precision is non-negotiable. Meanwhile, the Flash-Max model demonstrates a fascinating property: when provided with a sufficient thinking budget, its reasoning capabilities approach those of the Pro version. This suggests that the gap between lightweight and heavyweight models is no longer just about parameter count, but about the computational time allocated to the reasoning process.
Developers can access these models via the deepseek-ai repository on Hugging Face. The available versions include DeepSeek-V4-Flash-Base, DeepSeek-V4-Flash, DeepSeek-V4-Pro-Base, and DeepSeek-V4-Pro, allowing users to select the specific precision and scale that fits their hardware constraints.
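Loading follows the usual transformers pattern. The repository id below is inferred from the model names above (verify it against the deepseek-ai organization on Hugging Face), and a model of this scale will realistically need multiple GPUs or a quantized build:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Repo id assumed from the naming above; check the deepseek-ai org for the exact card.
model_id = "deepseek-ai/DeepSeek-V4-Flash"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # keep the checkpoint's native dtypes where supported
    device_map="auto",    # shard across available GPUs; a 284B-parameter MoE will not fit on one card
    trust_remote_code=True,
)

inputs = tokenizer("Summarize the attention design of DeepSeek-V4 in two sentences.",
                   return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```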
DeepSeek-V4 establishes a new equilibrium where massive context windows no longer require massive hardware sacrifices.




