Developers trying to run 70B-class large language models locally keep hitting the same VRAM squeeze. Four-bit quantization has become the baseline, but dropping to 2 bits has historically meant watching accuracy fall off a cliff. That tradeoff has been the hard ceiling on local LLM deployment for years.

Intel AutoRound Achieves 97.9% Accuracy at 2-Bit Quantization

Intel Research has released AutoRound, an advanced quantization toolkit for large language models and vision-language models that maintains high accuracy even at bit widths of 2 to 4. The INT2 mixed-precision DeepSeek-R1 model released in March 2025, roughly 200GB in size, preserved 97.9% of the original model's accuracy. For 7B models, quantization completes in about 10 minutes on a single GPU. The core technique is sign-gradient descent, which minimizes rounding error without requiring fine-tuning.
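
A minimal sketch of that workflow through the Python API might look like the following; the model ID, bit width, and group size are illustrative choices, and argument names may differ across AutoRound versions.

```python
# Hedged sketch: quantizing a 7B model with AutoRound's Python API.
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "Qwen/Qwen2.5-7B-Instruct"  # illustrative 7B model
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Sign-gradient descent tunes the rounding block by block; the original
# weights are not fine-tuned.
autoround = AutoRound(model, tokenizer, bits=4, group_size=128)
autoround.quantize()

# Export in the AutoRound format so supported inference engines can load it directly.
autoround.save_quantized("./qwen2.5-7b-int4", format="auto_round")
```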

The SignRoundV2 paper, published in December 2025, reproduced state-of-the-art results on LLaMA model evaluations using the `enable_alg_ext` flag and the AutoScheme API, a tool that automatically generates mixed-precision quantization configurations. In November 2025, AutoRound was integrated into LLM-Compressor, a model compression and management toolkit, and the quantization algorithm for GGUF, the binary format used for LLM inference, was also improved via `enable_alg_ext`. In October 2025, AutoRound was integrated into SGLang, the high-speed LLM inference engine, and gained an algorithm that cuts mixed-precision scheme generation time to minutes. September 2025 brought support for the MXFP4 and NVFP4 data types, and August saw improvements to the INT2 algorithm. July added GGUF format support, and May brought integration with vLLM, the high-performance LLM inference server, and with Hugging Face's Transformers library.

The toolkit ships with an `auto-round` command-line interface; its options can be listed with:

```bash
auto-round -h
```

Setting an environment variable also enables downloads from ModelScope, the Chinese model hub:

```bash
export AR_USE_MODELSCOPE=1
```

Block-wise FP8 quantization can be run by passing the following flags to the `auto-round` CLI; the model argument shown is a placeholder:

```bash
# <model_name_or_path> is a placeholder for a Hugging Face model ID or a local path
auto-round --model <model_name_or_path> --scheme FP8_BLOCK --iters 0 --disable_opt_rtn
```

Four Bits Used to Be the Limit; Now Two Bits Is Practical Territory

Previously, quantization below 4 bits meant accepting accuracy loss as a necessary tradeoff. AutoRound breaks that pattern by maintaining 97.9% accuracy at 2 bits. Competing tools such as AutoGPTQ, a 4-bit quantization library, and AutoAWQ, a weight-only quantization tool, are strong at 4 bits but fall behind AutoRound in the 2-bit regime. The `enable_alg_ext` flag delivers additional gains on MXFP4 and on W2A16 schemes, the latter pairing 2-bit weights with 16-bit activations.
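
The same 2-bit configuration can be expressed through the Python API, roughly as sketched below; the model ID and group size are illustrative, and treating `enable_alg_ext` as a keyword argument is an assumption, since in some releases it may only be exposed as a CLI flag.

```python
# Hedged sketch: W2A16 quantization (2-bit weights, 16-bit activations).
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "meta-llama/Llama-3.1-8B"  # illustrative model choice
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

autoround = AutoRound(
    model,
    tokenizer,
    bits=2,               # 2-bit weights; activations stay at 16-bit
    group_size=64,        # smaller groups typically help at very low bit widths
    enable_alg_ext=True,  # assumed kwarg; per the docs it benefits W2A16 and MXFP4
)
autoround.quantize()
autoround.save_quantized("./llama3.1-8b-w2a16", format="auto_round")
```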

AutoRound supports export to multiple formats, including AutoRound, AutoAWQ, AutoGPTQ, and GGUF, and it automatically detects and selects the optimal inference backend from more than ten options. Support for mixture-of-experts models and vision-language models is still maturing, but more than ten VLMs can already be quantized out of the box. Three quantization recipes are available: `auto-round-best` for maximum accuracy, `auto-round` for balance, and `auto-round-light` for maximum speed.
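
Multi-format export might look roughly like the following sketch; the format strings are assumptions based on the export format names above, and both they and the `inplace` argument should be checked against the installed AutoRound version.

```python
# Hedged sketch: exporting one quantized model to several formats.
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "Qwen/Qwen2.5-7B-Instruct"  # illustrative model choice
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

autoround = AutoRound(model, tokenizer, bits=4, group_size=128)
autoround.quantize()

# inplace=False (assumed supported) keeps the in-memory model untouched
# so the same quantization result can be exported more than once.
autoround.save_quantized("./out-autoround", format="auto_round", inplace=False)
autoround.save_quantized("./out-gptq", format="auto_gptq", inplace=False)
autoround.save_quantized("./out-awq", format="auto_awq", inplace=False)
# GGUF export uses its own format strings (e.g. "gguf:q4_k_m"); see the project README.
```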

The change developers will feel immediately is inference engine integration. Integration with vLLM, SGLang, and Transformers is complete, allowing quantized models to load directly without a separate conversion step. Use cases were published in November 2025 on the vLLM blog and the Red Hat blog, and in October 2025 on the LMSYS blog. Developers should note that manually moving a quantized model to another device during inference, such as by calling `model.to('cpu')`, can trigger exceptions.
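
As a rough illustration of the direct-load path, the sketch below serves an AutoRound-quantized checkpoint with vLLM's offline API; the directory path is illustrative and refers to the output of a prior quantization run.

```python
# Hedged sketch: loading an AutoRound-quantized checkpoint directly in vLLM.
from vllm import LLM, SamplingParams

# Illustrative path: the output directory of an earlier quantization run.
llm = LLM(model="./qwen2.5-7b-int4")  # loads without a separate conversion step
params = SamplingParams(temperature=0.7, max_tokens=64)

outputs = llm.generate(["Explain 2-bit quantization in one sentence."], params)
print(outputs[0].outputs[0].text)

# Note: manually moving a quantized model across devices during inference,
# e.g. model.to('cpu'), can trigger exceptions.
```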

Two-bit quantization is showing signs of becoming the new standard for local LLM deployment.