This week, four DeepSeek-V4 checkpoints appeared on Hugging Face's trending page simultaneously. The largest variant packs 1.6 trillion parameters, but developers immediately questioned whether a 1-million-token context could actually run in practice. In standard transformers, doubling context length quadruples attention computation — a scaling law that has kept long-context models mostly theoretical.

1.6T Parameters, 1M-Token Context

DeepSeek-AI released two preview models in the DeepSeek-V4 series, both built on a Mixture-of-Experts (MoE) architecture. DeepSeek-V4-Pro totals 1.6T parameters with 49B activated per token. DeepSeek-V4-Flash comes in at 284B total parameters with 13B activated per token. Both models support a default context length of 1 million tokens. V4-Pro was pre-trained on 33T tokens, V4-Flash on 32T tokens. Checkpoints for all four variants — DeepSeek-V4-Pro, DeepSeek-V4-Pro-Base, DeepSeek-V4-Flash, and DeepSeek-V4-Flash-Base — are available on Hugging Face.

Four innovations drive the efficiency gains: a hybrid attention architecture, a new residual connection design, a novel optimizer, and FP4 quantization-aware training. At 1M-token context, V4-Pro's per-token inference FLOPs drop to 27% of V3.2's, and its KV cache shrinks to 10% of V3.2's footprint. V4-Flash needs just 10% of V3.2's per-token FLOPs and 7% of its KV cache.

Compressed Attention and a New Optimizer

Previously, attention computation scaled quadratically with context length, making 1M tokens practically impossible. DeepSeek-V4 alternates between Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) across transformer layers. CSA compresses the KV cache of m tokens into a single entry, then each query token attends only to the top-k compressed KV items. A component called Lightning Indexer scores queries against compressed KV blocks to handle sparse selection. HCA goes further: it compresses KV entries from m′ tokens (where m′ ≫ m) into one and applies dense attention. No sparse selection step is needed — the compression ratio itself reduces KV cache size.
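The CSA/HCA split can be sketched in a few lines. This is a minimal illustration, not the published method: mean pooling stands in for the unspecified compression function, and a plain dot product stands in for the Lightning Indexer's scoring.

```python
import numpy as np

def compress_kv(kv, m):
    """Compress every m consecutive KV entries into one.

    Mean pooling is an illustrative stand-in; the actual compression
    function used by DeepSeek-V4 is not reproduced here.
    """
    T, d = kv.shape
    T_trim = (T // m) * m                         # drop any ragged tail
    return kv[:T_trim].reshape(T_trim // m, m, d).mean(axis=1)

def csa_select(query, compressed_kv, k):
    """Score the query against compressed KV blocks (a dot-product
    stand-in for the Lightning Indexer) and keep the top-k blocks."""
    scores = compressed_kv @ query                # one score per block
    topk = np.argsort(scores)[-k:]                # indices of the k best blocks
    return compressed_kv[topk]

rng = np.random.default_rng(0)
kv = rng.normal(size=(1024, 64))                  # 1024 cached tokens, head dim 64
query = rng.normal(size=64)

# CSA-style: moderate compression, then sparse top-k selection.
blocks = compress_kv(kv, m=16)                    # 1024 tokens -> 64 entries
selected = csa_select(query, blocks, k=8)

# HCA-style: much heavier compression (m' >> m), dense attention over all
# remaining entries, so no top-k selection step is needed.
hca_blocks = compress_kv(kv, m=64)                # 1024 tokens -> 16 entries

print(blocks.shape, selected.shape, hca_blocks.shape)  # (64, 64) (8, 64) (16, 64)
```

Either way, a query never touches the raw 1024-entry cache, which is where the KV-cache and FLOPs savings come from.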

The residual connection has also been redesigned. Manifold-Constrained Hyper-Connections (mHC) replace standard residual connections. Hyper-connections expand the residual stream width by a factor of n_hc (set to 4 for both models) and introduce learnable input, residual, and output mapping matrices. mHC constrains the residual mapping matrix to the Birkhoff polytope — the manifold of doubly stochastic matrices where all rows and columns sum to 1 and all entries are non-negative — which bounds its spectral norm at 1. The constraint is enforced via the Sinkhorn-Knopp algorithm with t_max=20 iterations. Mapping parameters are generated dynamically per input.
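The Sinkhorn-Knopp projection itself is simple to sketch: alternate row and column normalization until the matrix is (approximately) doubly stochastic. The exponential parameterization below is an illustrative way to guarantee positive entries; the dynamic generation of the logits from the input is omitted.

```python
import numpy as np

def sinkhorn_knopp(logits, t_max=20):
    """Project a square matrix of logits onto the Birkhoff polytope
    (doubly stochastic matrices) by alternating row/column normalization."""
    P = np.exp(logits)                      # exp ensures all entries > 0
    for _ in range(t_max):
        P /= P.sum(axis=1, keepdims=True)   # make rows sum to 1
        P /= P.sum(axis=0, keepdims=True)   # make columns sum to 1
    return P

rng = np.random.default_rng(0)
H = sinkhorn_knopp(rng.normal(size=(4, 4)))   # n_hc = 4, as in both models

# Rows and columns now each sum to (approximately) 1.
print(H.sum(axis=0), H.sum(axis=1))
```

Because a doubly stochastic matrix has spectral norm 1, repeated application of the residual mapping can neither explode nor collapse the residual stream, which is the point of the constraint.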

The optimizer switches to Muon, which orthogonalizes gradient update matrices via Newton-Schulz iterations before applying weight updates. A hybrid two-stage schedule is used: 8 iterations for fast convergence (coefficients 3.4445, −4.7750, 2.0315), then 2 iterations for stabilization (coefficients 2, −1.5, 0.5). Embedding modules, prediction heads, static biases, mHC gating factors, and all RMSNorm weights retain AdamW.

Inference Cost and Performance Benchmarks

The most immediate change for developers is inference cost. FP4 (MXFP4) quantization-aware training (QAT) is applied to MoE expert weights and the CSA Lightning Indexer query-key path. During inference and reinforcement learning rollouts, actual FP4 weights are used directly, reducing memory traffic and sampling latency.
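To make the memory-traffic argument concrete, here is a rough sketch of MXFP4-style block quantization: one shared power-of-two scale per 32-element block plus one 4-bit (E2M1) value per element. The round-to-nearest rule and scale choice here are illustrative assumptions, not DeepSeek's QAT recipe.

```python
import numpy as np

# The 8 representable FP4 (E2M1) magnitudes; sign takes the remaining bit.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def mxfp4_quantize(block):
    """Sketch of MXFP4 quantization for one 32-element block: a shared
    power-of-two scale plus a round-to-nearest FP4 value per element."""
    amax = np.abs(block).max()
    if amax == 0:
        return block.copy()
    # Shared scale chosen so the block maximum lands inside FP4's range.
    scale = 2.0 ** (np.floor(np.log2(amax)) - 2)
    scaled = np.clip(block / scale, -6.0, 6.0)
    idx = np.abs(np.abs(scaled)[:, None] - FP4_GRID).argmin(axis=1)
    return np.sign(scaled) * FP4_GRID[idx] * scale

rng = np.random.default_rng(0)
w = rng.normal(size=32)
wq = mxfp4_quantize(w)

# Each block needs only 8 distinct magnitudes, signs, and one scale:
# 4 bits per weight instead of 16, cutting weight memory traffic ~4x.
print(len(np.unique(np.abs(wq))))
```

Because the rollout path reads the same FP4 weights the model was trained to tolerate, there is no separate post-training quantization step to introduce a train/inference mismatch.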

Two techniques ensure training stability. Anticipatory Routing decouples backbone and routing network updates: routing indices at step t are computed using past parameters θ_{t−Δt}, breaking the cycle where routing decisions amplify outliers in MoE layers. SwiGLU Clamping constrains the linear component of SwiGLU to [−10, 10] and caps the gate component at 10, directly suppressing abnormal activations.
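The SwiGLU clamping is straightforward to sketch. One caveat: the article does not say whether the gate cap is applied before or after the activation, so applying it after the SiLU below is an assumption.

```python
import numpy as np

def silu(x):
    # Numerically stable SiLU: sigmoid expressed via tanh avoids exp overflow.
    return x * 0.5 * (1.0 + np.tanh(x / 2.0))

def swiglu_clamped(x, w_gate, w_lin, w_out, bound=10.0):
    """SwiGLU FFN with the stabilization described in the article:
    the linear branch is clamped to [-bound, bound] and the gate branch
    is capped at bound before the elementwise product."""
    gate = np.minimum(silu(x @ w_gate), bound)   # cap gate component at 10
    lin = np.clip(x @ w_lin, -bound, bound)      # clamp linear to [-10, 10]
    return (gate * lin) @ w_out

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8)) * 50.0               # deliberately huge activations
w_gate, w_lin = rng.normal(size=(8, 16)), rng.normal(size=(8, 16))
w_out = rng.normal(size=(16, 8))

y = swiglu_clamped(x, w_gate, w_lin, w_out)
print(y.shape, np.isfinite(y).all())             # (4, 8) True
```

No matter how large the incoming activations get, each gated element is bounded by 10 × 10 before the output projection, which is exactly the outlier suppression the clamping is meant to provide.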

The post-training pipeline replaces V3.2's mixed reinforcement learning stage with On-Policy Distillation (OPD). Independent domain experts in math, coding, agent tasks, and instruction following are first trained via supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO) reinforcement learning. Then, 10 or more teacher models distill into a single student model. The student minimizes reverse KL divergence between its output distribution and each teacher's distribution over trajectories the student generates, with full-vocabulary logit distillation ensuring stable gradient estimates.
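The per-token loss in this setup is a reverse KL over the full vocabulary, evaluated on tokens the student itself sampled. A minimal sketch, with random logits standing in for real student and teacher models:

```python
import numpy as np

def reverse_kl(student_logits, teacher_logits):
    """Per-position reverse KL, KL(student || teacher), over the full
    vocabulary. Logits have shape (seq_len, vocab_size)."""
    def log_softmax(z):
        z = z - z.max(axis=-1, keepdims=True)
        return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    log_p = log_softmax(student_logits)          # student distribution p
    log_q = log_softmax(teacher_logits)          # teacher distribution q
    p = np.exp(log_p)
    return (p * (log_p - log_q)).sum(axis=-1)    # KL(p || q) per position

rng = np.random.default_rng(0)
seq, vocab = 5, 100
student = rng.normal(size=(seq, vocab))
teacher = rng.normal(size=(seq, vocab))

kl_self = reverse_kl(student, student)           # zero against itself
kl_other = reverse_kl(student, teacher)          # non-negative everywhere
print(np.allclose(kl_self, 0.0), (kl_other >= 0).all())  # True True
```

Reverse KL is mode-seeking: the student is penalized wherever it places probability the teacher does not, which pushes it toward teacher-approved behavior on its own rollouts rather than toward averaging all teachers' outputs.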

Three inference effort modes are supported: Non-think (fast, no explicit reasoning), Think High (deliberate reasoning), and Think Max (maximum inference effort with a dedicated system prompt and reduced length penalty during RL training).

In Think Max mode, DeepSeek-V4-Pro achieves a Codeforces rating of 3206, surpassing GPT-5.4-xHigh (3168) and Gemini-3.1-Pro-High (3052). On SimpleQA Verified, it scores 57.9 Pass@1, beating Claude Opus 4.6 Max (46.2) and GPT-5.4-xHigh (45.3) but trailing Gemini-3.1-Pro-High (75.6). On SWE-bench Verified, it solves 80.6% of tasks, narrowly behind GPT-5.4-xHigh (81.2%).

With 1M-token inference now practical, the model approaches the threshold at which many RAG pipelines could be replaced by simply loading the relevant documents into context.