Together AI's OSCAR Breaks the 2-Bit KV Cache Accuracy Barrier

For any developer serving long-context LLMs, the experience is often a race against the inevitable out-of-memory error. As context windows push toward 100K tokens and beyond, the KV cache, which stores the keys and values of previous tokens, grows with every token and every concurrent request until it consumes nearly all available GPU memory. This creates a brutal trade-off: you can either support a massive context for a single user or support many users with a tiny context. For years, the industry has looked toward low-bit quantization as the escape hatch, but 2-bit KV caching has long been considered a forbidden zone where model accuracy simply collapses.

Together AI and the Architecture of OSCAR

Together AI has now challenged this limitation with the release of OSCAR, or Offline Spectral Covariance-Aware Rotation. The system is designed to solve the memory bottleneck of long-context inference by compressing the KV cache to 2 bits without the catastrophic accuracy loss typically associated with such extreme quantization. This technology has already been integrated into the production stack of SGLang, a high-performance LLM serving framework available at https://github.com/sgl-project/sglang.

The technical objective of OSCAR is to cut KV cache memory traffic by 8x relative to BF16 (16 bits per element down to 2), effectively removing the KV-bandwidth bottleneck that slows down decoding in long-context scenarios. To test this, Together AI evaluated the system across a range of model scales, including Qwen3-4B-Thinking-2507, Qwen3-8B, Qwen3-32B, and the massive 358B-parameter GLM-4.7-FP8. The results on H100 GPUs are stark. At a 100K context and a batch size of 32, the Qwen3-4B-Thinking model achieved a 6.17x increase in throughput compared to BF16, while the GLM-4.7-FP8 model saw an improvement of 7.83x.

These gains are not merely theoretical. By slashing the memory footprint of the KV cache, OSCAR allows developers to increase batch sizes significantly on the same hardware, maximizing GPU utilization. Furthermore, the system preserves the model's retrieval capabilities. In RULER-NIAH (Needle In A Haystack) tests, the GLM-4.7-FP8 model maintained BF16-level accuracy even as the context length reached 128K tokens, indicating that 2-bit compression does not have to mean a loss of contextual awareness.
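To make the footprint argument concrete, here is a back-of-the-envelope sketch of KV cache size at the 100K-context, batch-32 setting mentioned above. The model shape (`layers`, `kv_heads`, `head_dim`) is a hypothetical placeholder, not the configuration of any model named in this article, and the calculation ignores quantization metadata and any tokens kept in BF16.

```python
def kv_cache_bytes(tokens, batch, layers, kv_heads, head_dim, bits_per_element):
    """Bytes needed to hold keys and values for `batch` sequences of `tokens` tokens."""
    # Factor of 2 covers both the key tensor and the value tensor.
    elements = 2 * layers * kv_heads * head_dim * tokens * batch
    return elements * bits_per_element / 8

# Hypothetical model shape, chosen only for illustration.
cfg = dict(layers=36, kv_heads=8, head_dim=128)

bf16 = kv_cache_bytes(100_000, 32, bits_per_element=16, **cfg)
int2 = kv_cache_bytes(100_000, 32, bits_per_element=2, **cfg)

print(f"BF16 KV cache:  {bf16 / 2**30:.1f} GiB")
print(f"2-bit KV cache: {int2 / 2**30:.1f} GiB")
print(f"ratio: {bf16 / int2:.1f}x")
```

The ratio is the ideal 16-to-2 reduction; real bit-widths are slightly higher once scales, zero points, and high-precision tokens are counted.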

The Shift from Data-Oblivious to Covariance-Aware Rotation

To understand why OSCAR succeeds where previous 2-bit attempts failed, one must look at the nature of outliers in LLM activations. A 2-bit (INT2) format has only four representable levels. When a few outlier values dominate the scale factor, the remaining ordinary values are crushed into the same level, erasing the nuance the model needs to function. Previous attempts to fix this used rotation methods like the Hadamard transform to spread the energy of these outliers evenly across dimensions. While this worked for 4-bit quantization, it failed at 2-bit because it was data-oblivious: it treated all data distributions the same, regardless of how the attention mechanism actually read the information.
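The outlier problem is easy to reproduce on a toy vector. The sketch below is an illustration on synthetic data, not OSCAR's code: it quantizes a 64-channel vector with one large outlier to four levels, then repeats the experiment after an orthonormal Hadamard rotation built with Sylvester's construction. The vector, the outlier value, and the helper names are all assumptions made for the example.

```python
import numpy as np

def quantize_int2(x):
    """Asymmetric 2-bit quantization: 4 levels between min and max, returned dequantized."""
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / 3 if hi > lo else 1.0
    codes = np.clip(np.round((x - lo) / scale), 0, 3)
    return codes * scale + lo

def hadamard(n):
    """Orthonormal Hadamard matrix via Sylvester's construction (n must be a power of two)."""
    h = np.ones((1, 1))
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)

# Synthetic activation vector: small values plus one outlier channel.
x = np.linspace(-1.0, 1.0, 64)
x[5] = 40.0

plain_deq = quantize_int2(x)
plain_err = np.linalg.norm(x - plain_deq)

H = hadamard(64)
rot_deq = H.T @ quantize_int2(H @ x)   # rotate, quantize, rotate back
rot_err = np.linalg.norm(x - rot_deq)

print("distinct values after plain INT2:", np.unique(plain_deq).size)
print(f"reconstruction error, no rotation:       {plain_err:.2f}")
print(f"reconstruction error, Hadamard rotation: {rot_err:.2f}")
```

Without rotation, every non-outlier channel lands on the same level, which is the collapse described above. The Hadamard rotation spreads the outlier across channels, but it does so identically for any input, which is the data-oblivious limitation.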

OSCAR introduces a spectral covariance-aware approach. Instead of blindly mixing values, it analyzes the statistical direction of the attention mechanism. For the Key (K) rotation, OSCAR minimizes the error in attention logits rather than simple Euclidean reconstruction error. It derives the query covariance `C_Q = (1/N) Σ_n q_n^⊤ q_n` and uses its eigenvectors `U_Q` as the rotation basis. For the Value (V) rotation, it utilizes the score-weighted value covariance `C_S = (1/N) V^⊤ S^⊤ S V` to derive the eigenvectors `U_S`. The resulting rotations are defined as `R_K = U_Q · H_Had · P_br` and `R_V = U_S · H_Had · P_br`. This ensures that quantization errors are pushed into directions that the attention mechanism considers unimportant.
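A minimal NumPy sketch of how such rotations could be assembled from calibration data follows. It is an illustration under stated assumptions, not Together AI's implementation: queries and values are treated as row vectors, `S` is taken to be a row-wise softmax attention score matrix, and because the text does not define the third factor of the rotation, a bit-reversal permutation stands in for it purely as a placeholder. All data is synthetic.

```python
import numpy as np

def hadamard(n):
    """Orthonormal Hadamard matrix via Sylvester's construction (n a power of two)."""
    h = np.ones((1, 1))
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)

def bit_reversal_permutation(n):
    """Permutation matrix ordering indices by reversed bit pattern (placeholder third factor)."""
    bits = n.bit_length() - 1
    idx = [int(format(i, f"0{bits}b")[::-1], 2) for i in range(n)]
    return np.eye(n)[idx]

def key_rotation(Q):
    """Eigenvectors of the query covariance Q^T Q / N, followed by Hadamard and permutation."""
    n, d = Q.shape
    C_Q = Q.T @ Q / n
    _, U_Q = np.linalg.eigh(C_Q)          # columns are orthonormal eigenvectors
    return U_Q @ hadamard(d) @ bit_reversal_permutation(d)

def value_rotation(V, S):
    """Eigenvectors of the score-weighted value covariance V^T S^T S V / N, then the same factors."""
    n, d = V.shape
    SV = S @ V
    C_S = SV.T @ SV / n
    _, U_S = np.linalg.eigh(C_S)
    return U_S @ hadamard(d) @ bit_reversal_permutation(d)

# Synthetic calibration data (hypothetical shapes).
rng = np.random.default_rng(0)
N, d = 256, 64
Q = rng.normal(size=(N, d)) @ rng.normal(size=(d, d))   # non-trivial query covariance
K = rng.normal(size=(N, d))
V = rng.normal(size=(N, d))

logits = Q @ K.T / np.sqrt(d)
S = np.exp(logits - logits.max(axis=1, keepdims=True))
S /= S.sum(axis=1, keepdims=True)                       # row-wise softmax

R_K = key_rotation(Q)
R_V = value_rotation(V, S)

# Orthogonal rotations leave exact attention logits unchanged.
same_logits = np.allclose((Q @ R_K) @ (K @ R_K).T, Q @ K.T)
print("R_K orthogonal:", np.allclose(R_K @ R_K.T, np.eye(d)))
print("logits preserved:", same_logits)
```

Because both rotations are orthogonal, they change nothing before quantization; their only effect is on where the rounding error ends up once the rotated keys and values are stored in 2 bits.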

Another critical innovation is the selective preservation of high-precision data. OSCAR does not quantize everything. It keeps the sink tokens and the most recent window of tokens in BF16 precision. In a 128K context window, these high-precision elements account for only 0.24% of the total data, yet they are essential for preserving the model's accuracy.
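The 0.24% figure translates into a small absolute number of tokens and a negligible increase in average bit-width, as the arithmetic sketch below shows. It assumes that 128K means 131,072 tokens and that every other element is stored at exactly 2 bits; it deliberately leaves out scales and zero points, so it will not reproduce the 2.28 or 2.38 bits-per-element figures quoted later in the article.

```python
CONTEXT_TOKENS = 128 * 1024      # assumption: "128K" means 131,072 tokens
BF16_FRACTION = 0.0024           # 0.24% of the cache kept in BF16, from the article

# Number of tokens that stay in high precision.
bf16_tokens = BF16_FRACTION * CONTEXT_TOKENS

# Blended bit-width, ignoring quantization metadata such as scales and zero points.
avg_bits = BF16_FRACTION * 16 + (1 - BF16_FRACTION) * 2

print(f"tokens kept in BF16: about {round(bf16_tokens)}")
print(f"average bits per element (payload only): {avg_bits:.4f}")
```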

This mathematical framework is paired with highly optimized fused Triton kernels to ensure that the rotation and quantization do not introduce new latency. In the write path, tokens are rotated and then clipped based on calibration thresholds (`c_K = 0.96` and `c_V = 0.92`) before undergoing per-token asymmetric INT2 quantization with a group size of `G_K = 64` channels. In the read path, the INT2 kernel unpacks the bytes and performs inverse quantization and inverse rotation in a single fused pass. Most impressively, the Value rotation matrix `R_V` is absorbed into the model's projection weights offline, so the rotation happens as part of the existing weight multiplication and adds no extra work to real-time decoding.
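The write and read paths can be sketched in plain NumPy to show the data layout, leaving the rotation and the Triton fusion aside. This is an assumption-laden illustration rather than the production kernel: in particular, the article does not say how the clipping thresholds are applied, so the sketch interprets `clip` as shrinking each group's min/max range around its midpoint, and the function names are hypothetical.

```python
import numpy as np

def quantize_int2_grouped(x, clip=0.96, group=64):
    """Per-token asymmetric INT2 quantization over groups of `group` channels.

    Assumption: `clip` shrinks each group's [min, max] range around its midpoint.
    """
    g = x.reshape(-1, group)
    lo = g.min(axis=1, keepdims=True)
    hi = g.max(axis=1, keepdims=True)
    mid, half = (hi + lo) / 2, (hi - lo) / 2 * clip
    lo = mid - half                                  # clipped lower bound (zero point)
    scale = np.maximum(2 * half / 3, 1e-8)           # 4 levels span 3 steps
    codes = np.clip(np.round((g - lo) / scale), 0, 3).astype(np.uint8)
    return codes, scale, lo

def pack_int2(codes):
    """Pack four 2-bit codes into each byte, lowest bits first."""
    c = codes.reshape(-1, 4)
    return (c[:, 0] | (c[:, 1] << 2) | (c[:, 2] << 4) | (c[:, 3] << 6)).astype(np.uint8)

def unpack_int2(packed):
    """Inverse of pack_int2: recover four 2-bit codes from each byte."""
    shifts = np.array([0, 2, 4, 6], dtype=np.uint8)
    return ((packed[:, None] >> shifts) & 3).astype(np.uint8)

def dequantize(codes, scale, lo):
    """Map 2-bit codes back to floating point using per-group scale and zero point."""
    return codes.astype(np.float32) * scale + lo

# One synthetic key vector with 128 channels, i.e. two groups of 64.
rng = np.random.default_rng(0)
key = rng.normal(size=128).astype(np.float32)

codes, scale, lo = quantize_int2_grouped(key, clip=0.96, group=64)
packed = pack_int2(codes)
restored = dequantize(unpack_int2(packed).reshape(codes.shape), scale, lo).reshape(-1)

print("packed bytes per token:", packed.nbytes)
print("max abs error:", float(np.abs(key - restored).max()))
```

Packing four codes per byte is what turns 128 channels into 32 bytes of payload; the per-group scale and zero point are stored separately and account for the fractional bits above 2.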

Quantifying the Gap: OSCAR vs. Naive Quantization

The performance gap between OSCAR and traditional methods is not a matter of marginal gains; it is the difference between a functioning model and a broken one. In tests using the Qwen3-4B and Qwen3-8B models, naive INT2 quantization scored 0.00, indicating a total collapse of the model's reasoning capabilities. Even QuaRot-INT2, which uses the Hadamard transform, struggled significantly, scoring only 1.40 on Qwen3-4B and 10.14 on Qwen3-8B.

When compared to more modern techniques, the efficiency of OSCAR becomes even more apparent. TurboQuant, which uses 3.25 bits, suffered a massive 43.90-point drop in performance on the Qwen3-4B-Thinking model. In contrast, OSCAR achieved a score of 71.86 using only 2.28 bits. To put this in perspective, Saw-INT4 requires 4.25 bits to achieve a score of 73.11. OSCAR delivers nearly the same score with about 54% of the bits per element (2.28 versus 4.25).

This trend continues in complex reasoning benchmarks. On the AIME25 mathematics benchmark, the Qwen3-8B model powered by OSCAR scored 66.67 with 2.38 BPE (bits per element). This comfortably outperformed KIVI-KV2, which scored 57.67 at 2.26 BPE, and Kitty, which scored 59.67 at 2.39 BPE. The reason for this superiority is the fundamental shift in strategy: OSCAR does not just flatten the distribution of values; it actively protects the directions of highest importance to the attention mechanism.

By combining spectral analysis with hardware-level Triton optimization, Together AI has moved 2-bit KV caching from a theoretical curiosity to a production-ready tool. The ability to maintain BF16-level robustness while reducing memory traffic by 8x fundamentally changes the economics of long-context serving.

This breakthrough effectively signals the end of the KV-bandwidth bottleneck as the primary constraint for long-context LLMs, opening the door for real-time, high-throughput applications that were previously cost-prohibitive.
