The modern LLM developer is locked in a constant war with the KV cache. As context windows expand to millions of tokens, the memory overhead of standard softmax attention grows linearly with sequence length, eventually hitting a hardware wall that slows inference to a crawl. This has pushed the community toward linear attention and recurrent states, which promise constant-time decoding and fixed memory footprints. Yet, these models have long struggled with a fundamental tension: how to write new information into a compressed memory state without accidentally erasing the critical associations the model already learned. This struggle is where the current frontier of sequence modeling resides.
The Architecture of Gated DeltaNet-2
NVIDIA has entered this fray with Gated DeltaNet-2, a model designed to refine how linear attention updates its internal state. To establish a controlled baseline, the development team built the model with 1.3B parameters and trained it on 100B tokens from the FineWeb-Edu dataset. The same specifications were used for all comparison models, including Mamba-2, Gated DeltaNet, KDA, and Mamba-3. Holding the parameter count and data volume fixed makes it far more likely that any performance gains come from the architecture rather than from brute-force scaling.
At the heart of the model is a recurrent state fixed at 262,144 floats per layer (2^18 values, about 1 MiB per layer if stored in 32-bit precision). This design choice eliminates the unbounded growth of the KV cache in softmax attention, so memory occupancy stays constant regardless of input sequence length. Training used a sequence length of 4K tokens and the AdamW optimizer with a peak learning rate of 4e-4. For regularization and training stability, the team applied a weight decay of 0.1 and a gradient clipping threshold of 1.0. The learning rate followed a cosine schedule with a warmup period of 1B tokens to prevent early divergence, and the global batch size was 0.5M tokens.
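To make the schedule concrete, the token budgets can be converted into optimizer steps. The sketch below is only an illustration: it assumes that 0.5M means exactly 500,000 tokens per step, that warmup is linear, and that the cosine curve decays to zero. The article states none of these details, and the names `lr_at`, `TOTAL_STEPS`, and `WARMUP_STEPS` are hypothetical.

```python
import math

# Values taken from the article.
TOTAL_TOKENS = 100_000_000_000   # 100B training tokens
WARMUP_TOKENS = 1_000_000_000    # 1B warmup tokens
PEAK_LR = 4e-4

# Assumptions not stated in the article.
BATCH_TOKENS = 500_000           # "0.5M" read as exactly 500,000 tokens per step
MIN_LR = 0.0                     # cosine floor

TOTAL_STEPS = TOTAL_TOKENS // BATCH_TOKENS     # optimizer steps in the full run
WARMUP_STEPS = WARMUP_TOKENS // BATCH_TOKENS   # steps spent warming up


def lr_at(step: int) -> float:
    """Learning rate at a 0-indexed optimizer step."""
    if step < WARMUP_STEPS:
        # Linear warmup (assumed shape): reaches PEAK_LR on the last warmup step.
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    # Cosine decay from PEAK_LR to MIN_LR over the remaining steps.
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    progress = min(progress, 1.0)
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    print("total steps:", TOTAL_STEPS, "warmup steps:", WARMUP_STEPS)
    for s in (0, WARMUP_STEPS - 1, TOTAL_STEPS // 2, TOTAL_STEPS):
        print(s, lr_at(s))
```

Under these assumptions the run is 200,000 steps long and warmup covers the first 2,000 of them, or 1% of training.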
To address the precision loss inherent in purely recurrent models, NVIDIA implemented a hybrid configuration. In this setup, the model cycles through Gated DeltaNet-2 cells, MLP layers, and Sliding-Window Attention (SWA) blocks. The SWA component uses a window size of 2K, allowing the model to handle high-precision local interactions while keeping the overall cost linear in sequence length. This hybrid pipeline (Gated DeltaNet-2, MLP, SWA, MLP) creates a division of labor in which the recurrent mixer compresses long-term history and the SWA preserves short-term accuracy.
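The layer ordering and the windowed attention pattern can be sketched in a few lines. This is a toy illustration: the function names and layer labels are hypothetical, and the mask assumes the window counts the current token, which the article does not specify.

```python
def hybrid_layer_pattern(num_units: int) -> list[str]:
    """Repeat the four-layer unit from the text: GDN-2, MLP, SWA, MLP."""
    unit = ["gdn2", "mlp", "swa", "mlp"]
    return unit * num_units


def sliding_window_mask(seq_len: int, window: int) -> list[list[bool]]:
    """Causal sliding-window mask.

    Entry [i][j] is True when query position i may attend to key position j,
    i.e. j is not in the future and lies within the last `window` positions
    (the current position included, by assumption).
    """
    return [
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


if __name__ == "__main__":
    print(hybrid_layer_pattern(2))
    # Tiny example; the article's configuration uses a 2K window on 4K sequences.
    for row in sliding_window_mask(seq_len=5, window=2):
        print(["x" if allowed else "." for allowed in row])
```

Each query attends to at most `window` keys, so the attention cost per token is bounded by the window size rather than by the full sequence length.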
Breaking the Scalar Bottleneck with Gated Delta Rule-2
To understand why Gated DeltaNet-2 represents a leap forward, one must look at the limitations of its predecessor, Kimi Delta Attention (KDA). In KDA, the process of active memory editing was controlled by a single scalar value, $\beta_t$. This meant that the decision to erase old information from the Key axis and the decision to write new information to the Value axis were tethered to the same number. In a complex linguistic environment, this is a severe constraint; the model cannot choose to keep the existing key while updating the value, or vice versa, with granular precision.
Gated DeltaNet-2 solves this by introducing channel-wise vector gates, effectively decoupling the erase and write operations. The model employs a channel-wise erase gate $b_t \in [0,1]^{d_k}$ for the Key axis and a channel-wise write gate $w_t \in [0,1]^{d_v}$ for the Value axis. Both gates are generated through a sigmoid projection of the token representation. This allows the model to exercise selective channel-level control: the erase gate determines which specific dimensions of the memory should be cleared, while the write gate controls which dimensions of the new value are recorded.
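As a rough illustration of the two gates, the sketch below produces $b_t$ and $w_t$ from a token representation with one sigmoid projection each. The plain bias-free linear maps `W_b` and `W_w`, the toy dimensions, and the function name `make_gates` are assumptions made for this example; the article says only that the gates come from a sigmoid projection.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def make_gates(x, W_b, W_w):
    """Channel-wise gates for one token.

    x:   (d_model,)      token representation
    W_b: (d_k, d_model)  hypothetical projection for the erase gate
    W_w: (d_v, d_model)  hypothetical projection for the write gate
    """
    b = sigmoid(W_b @ x)  # erase gate: one value per key channel
    w = sigmoid(W_w @ x)  # write gate: one value per value channel
    return b, w


rng = np.random.default_rng(0)
d_model, d_k, d_v = 16, 4, 6          # toy sizes, not the model's real dimensions
x = rng.standard_normal(d_model)
W_b = rng.standard_normal((d_k, d_model))
W_w = rng.standard_normal((d_v, d_model))

b, w = make_gates(x, W_b, W_w)
print("erase gate:", b.shape, "write gate:", w.shape)
```

The shapes show the decoupling: the erase gate has $d_k$ entries and the write gate has $d_v$ entries, so neither can be derived from a single shared scalar.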
This logic is formalized in the following recurrent equation:
$$S_t = \left(I - k_t (b_t \odot k_t)^\top\right) D_t S_{t-1} + k_t (w_t \odot v_t)^\top$$
In this formula, $S_t \in \mathbb{R}^{d_k \times d_v}$ is the memory state and $D_t = \text{Diag}(\alpha_t)$ represents the channel-wise decay inherited from KDA. By maintaining $k_t$ as the left factor in the erase matrix, the model preserves the directional nature of the delta rule, while the $b_t \odot k_t$ term allows for surgical precision in what is removed. The write term $w_t \odot v_t$ similarly ensures that value updates are gated at the channel level. Mathematically, the update can be read as an online gradient step on a local regression loss: the new state stays close to the decayed previous memory, while the gated targets drive the residual edits.
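A single step of this recurrence can be written directly in NumPy. The sketch below is a token-by-token reference, not the chunked kernel: the function names, the toy dimensions, and the unit-norm key are assumptions made for illustration. It also shows that the update is a rank-1 edit along $k_t$, so the $d_k \times d_k$ erase matrix never has to be formed.

```python
import numpy as np


def gdr2_step(S_prev, k, v, alpha, b, w):
    """One recurrent update in rank-1 form.

    Shapes: S_prev (d_k, d_v); k, alpha, b (d_k,); v, w (d_v,).
    """
    decayed = alpha[:, None] * S_prev      # D_t S_{t-1}: per-key-channel decay
    gated_read = (b * k) @ decayed         # (b*k)^T D_t S_{t-1}, shape (d_v,)
    gated_target = w * v                   # w*v, shape (d_v,)
    # Residual edit along k; algebraically equal to the explicit formula below.
    return decayed + np.outer(k, gated_target - gated_read)


def gdr2_step_explicit(S_prev, k, v, alpha, b, w):
    """Literal transcription of the equation, forming the erase matrix."""
    erase = np.eye(k.shape[0]) - np.outer(k, b * k)
    return erase @ (np.diag(alpha) @ S_prev) + np.outer(k, w * v)


rng = np.random.default_rng(0)
d_k, d_v = 4, 3                            # toy sizes
S_prev = rng.standard_normal((d_k, d_v))
k = rng.standard_normal(d_k)
k /= np.linalg.norm(k)                     # assumption: unit-norm key
v = rng.standard_normal(d_v)
alpha = rng.uniform(0.5, 1.0, d_k)
b = rng.uniform(0.0, 1.0, d_k)
w = rng.uniform(0.0, 1.0, d_v)

S_fast = gdr2_step(S_prev, k, v, alpha, b, w)
S_ref = gdr2_step_explicit(S_prev, k, v, alpha, b, w)

# Special case: no decay and fully open gates. Reading the new state with k
# then returns v, which is the classic delta-rule overwrite.
ones_k, ones_v = np.ones(d_k), np.ones(d_v)
S_full = gdr2_step(S_prev, k, v, ones_k, ones_k, ones_v)
readout = k @ S_full
print(np.allclose(S_fast, S_ref), np.allclose(readout, v))
```

Setting every entry of $b_t$ and $w_t$ to the same scalar $\beta_t$ recovers the single-scalar update described earlier, which is one way to see the vector gates as a strict generalization.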
Implementing this at scale required significant hardware optimization. NVIDIA utilized a chunk size of 64 and fused Triton kernels to maintain computational efficiency. However, the introduction of separate diagonal gates meant that the scalar shortcuts used in KDA's backward pass were no longer viable. The team had to explicitly derive a gate-aware vector-Jacobian product to handle gradient accumulation. Furthermore, to avoid Triton WGMMA (Warpgroup Matrix Multiply-Accumulate) layout assertion errors on Hopper GPUs, the number of warps in the fused WY backward kernel was restricted to between 2 and 4.
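To give a sense of what a gate-aware vector-Jacobian product involves, the sketch below derives the gradients with respect to the two gates for one unchunked recurrent step and checks them against finite differences. It is a hand-derived toy under stated assumptions, not NVIDIA's kernel: the real backward pass works over chunks of 64 with a WY representation, which is not reproduced here, and all function names are hypothetical.

```python
import numpy as np


def step(S_prev, k, v, alpha, b, w):
    """S_t = (I - k (b*k)^T) Diag(alpha) S_prev + k (w*v)^T in rank-1 form."""
    M = alpha[:, None] * S_prev
    return M + np.outer(k, w * v - (b * k) @ M)


def gate_vjp(G, S_prev, k, v, alpha):
    """Given upstream gradient G = dL/dS_t, return (dL/db, dL/dw)."""
    M = alpha[:, None] * S_prev
    Gk = G.T @ k                   # (d_v,): upstream gradient projected onto k
    grad_b = -(M @ Gk) * k         # one entry per key channel
    grad_w = Gk * v                # one entry per value channel
    return grad_b, grad_w


def numeric_grad(f, x, eps=1e-6):
    """Central finite differences of a scalar function f at x."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g


rng = np.random.default_rng(1)
d_k, d_v = 4, 3                    # toy sizes
S_prev = rng.standard_normal((d_k, d_v))
k = rng.standard_normal(d_k)
v = rng.standard_normal(d_v)
alpha = rng.uniform(0.5, 1.0, d_k)
b = rng.uniform(0.0, 1.0, d_k)
w = rng.uniform(0.0, 1.0, d_v)
G = rng.standard_normal((d_k, d_v))   # stand-in for dL/dS_t

grad_b, grad_w = gate_vjp(G, S_prev, k, v, alpha)
num_b = numeric_grad(lambda bb: np.sum(G * step(S_prev, k, v, alpha, bb, w)), b)
num_w = numeric_grad(lambda ww: np.sum(G * step(S_prev, k, v, alpha, b, ww)), w)
print(np.allclose(grad_b, num_b, atol=1e-6), np.allclose(grad_w, num_w, atol=1e-6))
```

Both gradients are vectors, one entry per channel. With a single scalar gate they would each collapse to one number, which is consistent with the article's point that the scalar shortcuts no longer apply.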
Benchmarking Retrieval and Long-Context Memory
When tested on language modeling and common-sense reasoning in a recurrent setting, Gated DeltaNet-2 achieved an average score of 53.11, surpassing Mamba-3 MIMO (52.39) and KDA (52.28). The gap widened in the hybrid setting, where Gated DeltaNet-2 scored 53.97 compared to Mamba-3 MIMO's 52.72. Because the recurrent state size was held constant across all tests at 262,144 floats, these results suggest that the performance boost stems from the update rule rather than from an increase in memory capacity.
The most striking evidence of the model's superiority appears in the RULER benchmark, which measures long-context retrieval. In the S-NIAH-2 (4K) task, Gated DeltaNet-2 scored 93.0, beating KDA's 89.0. The disparity became massive in the S-NIAH-3 (2K) task, where KDA struggled at 63.2 while Gated DeltaNet-2 surged to 89.8. Similarly, in the MK-NIAH-1 (4K) test, Gated DeltaNet-2 recorded 37.8 against KDA's 28.0. These numbers suggest that the single scalar gate in KDA acted as a bottleneck during complex information extraction, whereas the decoupled gates in Gated DeltaNet-2 allowed for far more effective selective preservation of data.
Real-world retrieval performance across datasets like SWDE, SQuAD, FDA, TriviaQA, NQ, and DROP points in the same direction. The model averaged 29.88 in the recurrent setting and 42.28 in the hybrid setting. The large jump in the hybrid score is most plausibly attributable to the SWA's ability to handle local interactions. Purely recurrent models often lose information when compressing long histories, and the 42.28 score suggests that the hybrid configuration mitigates much of the forgetting problem traditionally associated with linear attention, offering a practical path toward long-context processing without the memory growth of softmax attention.
By replacing the unbounded KV cache with a fixed-size recurrent state and refining the update logic through channel-wise gating, NVIDIA has shifted the conversation from how much memory a model has to how intelligently that memory is edited. The fused Triton kernels and Hopper-specific adjustments are intended to carry this theoretical efficiency over into practical wall-clock speed, positioning Gated DeltaNet-2 as a candidate blueprint for the next generation of linear-time sequence models.




