For years, the industry standard for processing multi-page documents has been a tedious loop. Developers feed a PDF into an OCR engine page by page, clearing the memory buffer between each slice to avoid a system crash. This fragmented approach not only kills throughput but destroys the model's ability to maintain context across a fifty-page report. The developer experience has been a trade-off between memory stability and document coherence, where the longer the text, the more the system struggles to keep up.
The Architecture of Unlimited OCR
Baidu has introduced Unlimited OCR to break this cycle, building upon the DeepSeek OCR baseline to create an end-to-end model capable of processing dozens of pages in a single forward pass. The architecture utilizes a hybrid design combining a DeepEncoder with a Mixture of Experts (MoE) decoder. While the model possesses 3B total parameters, it optimizes efficiency by activating only 500M parameters during inference. This lean activation allows for high-speed processing without sacrificing the depth required for complex document understanding.
Performance metrics support this approach. On the OmniDocBench v1.6 benchmark, Unlimited OCR achieved a state-of-the-art accuracy of 93.92%. On the earlier v1.5 version of the benchmark it scored 93%, and it holds a 6% advantage over the original DeepSeek OCR. Speed is where the gains become most apparent. In Base mode, operating at a 1024×1024 resolution, the model recorded 5580 tokens per second (TPS), a 12.7% improvement over the 4951 TPS delivered by DeepSeek OCR.
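The quoted speedup can be checked directly from the two throughput figures. The snippet below is only a sanity check on the arithmetic; the function name is illustrative.

```python
# Throughput figures quoted above (tokens per second, Base mode, 1024x1024).
UNLIMITED_OCR_TPS = 5580
DEEPSEEK_OCR_TPS = 4951


def relative_gain(new: float, old: float) -> float:
    """Return the relative improvement of `new` over `old` as a percentage."""
    return (new - old) / old * 100.0


gain = relative_gain(UNLIMITED_OCR_TPS, DEEPSEEK_OCR_TPS)
print(f"{gain:.1f}% faster")  # 12.7% faster
```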
The training phase involved approximately 2 million document OCR samples. Baidu maintained a 9:1 ratio between single-page and multi-page data, with multi-page samples ranging from 2 to 50 pages. These were packed into sequences of up to 32K tokens. The training was executed on 8×16 A800 GPUs, employing a strategy where the DeepEncoder remained frozen while the LLM parameters underwent 4,000 additional training steps to refine the output.
Solving the KV Cache Bottleneck with R-SWA
The fundamental shift in Unlimited OCR is the replacement of standard Multi-Head Attention (MHA) with Reference Sliding Window Attention, or R-SWA. In traditional MHA models, the Key-Value (KV) cache grows linearly with the output length. As the output grows, memory consumption climbs with it and the tokens-per-second rate falls. R-SWA addresses this by restricting the attention range to two distinct windows: a prefix window (m) and a decode window (n).
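A minimal sketch of the two-window idea, written as a function that lists which positions a token may attend to. It is an illustration under stated assumptions, not Baidu's implementation: the position layout, the function name, and the choice to count the current token as part of the decode window are all assumptions.

```python
def attended_positions(pos: int, prefix_len: int, window: int) -> list[int]:
    """Positions visible to the token at absolute index `pos`.

    Assumptions (not from the source): positions 0..prefix_len-1 hold the
    prefix (visual tokens + prompt), generated tokens follow, and the decode
    window counts the current token as one of its `window` entries.
    """
    if pos < prefix_len:
        # Inside the prefix: plain causal attention over the prefix so far.
        return list(range(pos + 1))
    prefix = list(range(prefix_len))
    # Causal sliding window over generated tokens, clipped at the prefix boundary.
    start = max(prefix_len, pos - window + 1)
    return prefix + list(range(start, pos + 1))


visible = attended_positions(pos=9, prefix_len=4, window=3)
print(visible)  # [0, 1, 2, 3, 7, 8, 9]
```

With a 4-token prefix and a window of 3, the token at position 9 sees the whole prefix plus positions 7 to 9; generated positions 4 to 6 have already dropped out of view.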
The prefix window (m) captures the visual tokens and the initial prompt. This section remains fixed throughout a single inference session, meaning its memory footprint depends solely on the number of pages and the image resolution, not the length of the generated text. The decode window (n) is a causal sliding window that only tracks the most recent n tokens, with a default value of 128. Mathematically, with $L_m$ the length of the prefix and $T$ the number of tokens generated so far, the KV cache size is $L_m + \min(n, T) \le L_m + n$. Because the cache size is capped, memory consumption stays bounded no matter how long the output sequence becomes.
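The bound $L_m + \min(n, T)$ can be illustrated with a toy cache that keeps the prefix entries forever and holds generated entries in a bounded queue. This is a sketch only: the class name is hypothetical, the entries are placeholder tuples rather than real key/value tensors, and per-layer or per-head storage is ignored.

```python
from collections import deque


class RSWACache:
    """Toy KV cache: a fixed prefix plus a bounded queue of recent entries."""

    def __init__(self, prefix_entries, window=128):
        self.prefix = list(prefix_entries)   # L_m entries, never evicted
        self.recent = deque(maxlen=window)   # at most n entries

    def append(self, kv):
        # deque(maxlen=...) drops the oldest entry automatically when full.
        self.recent.append(kv)

    def __len__(self):
        return len(self.prefix) + len(self.recent)


prefix_len, window, steps = 256, 128, 6000
cache = RSWACache(prefix_entries=range(prefix_len), window=window)
sizes = []
for t in range(1, steps + 1):
    cache.append(("k", "v", t))  # placeholder for the step-t key/value pair
    sizes.append(len(cache))

# A full-attention cache would instead hold every generated entry.
full_attention_size = prefix_len + steps
print(max(sizes), full_attention_size)  # 384 6256
```

After 6,000 generated tokens the toy cache still holds 384 entries, while a cache that never evicts would hold 6,256.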
To further reduce the memory load, the DeepEncoder employs a cascade of SAM-ViT and CLIP-ViT, applying a 16x token compression at the bridge. This compresses a 1024×1024 page image into just 256 tokens. When measured using Flash Attention v3 kernels, the difference is stark. DeepSeek OCR shows rising latency as the sequence grows, with spikes at specific alignment boundaries. In contrast, R-SWA maintains a flat latency curve. For developers, this means that when output reaches 6,000 tokens, standard MHA models are typically 35% slower than Unlimited OCR. The latter treats the KV cache as a queue and evicts old values to maintain a constant generation speed.
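The 256-token figure follows from simple arithmetic if the encoder splits the image into 16×16-pixel patches before the 16x compression. The patch size is an assumption here (it is a common ViT choice and is not stated in the article), and the function name is illustrative.

```python
def visual_tokens(side_px: int, patch_px: int = 16, compression: int = 16) -> int:
    """Visual tokens for a square page image after bridge compression.

    patch_px=16 is an assumption (a common ViT patch size), not a figure
    from the article; compression=16 is the 16x ratio quoted in the text.
    """
    if side_px % patch_px:
        raise ValueError("image side must be a multiple of the patch size")
    patches = (side_px // patch_px) ** 2  # 64 * 64 = 4096 for a 1024px side
    return patches // compression


print(visual_tokens(1024))  # 256
```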
Despite these gains, physical constraints remain. The current maximum sequence length is 32K tokens. While the DeepEncoder's compression is aggressive, an extremely high page count will eventually stretch the prefill length, so the parsing is not unlimited in practice. Furthermore, some repetition errors occur when identifying very small text; however, analysis shows this is a limitation of the Base mode's spatial resolution rather than a failure of the R-SWA mechanism. The model is currently available via Hugging Face and ModelScope, with full technical details provided in the arXiv paper.
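To make the prefill constraint concrete, the rough budget below estimates how much of the sequence limit the page tokens consume. It assumes "32K" means 32,768 tokens, that each Base mode page costs 256 visual tokens, and that generated tokens count against the same limit; none of these details are confirmed by the article, and the names are illustrative.

```python
MAX_SEQ_LEN = 32 * 1024   # assumption: "32K" read as 32,768 tokens
TOKENS_PER_PAGE = 256     # Base mode figure quoted earlier in the article


def remaining_budget(pages: int, prompt_tokens: int = 0) -> int:
    """Tokens left for generated text after the prefix is placed."""
    prefix = pages * TOKENS_PER_PAGE + prompt_tokens
    if prefix > MAX_SEQ_LEN:
        raise ValueError(f"{pages} pages need {prefix} prefix tokens, over the limit")
    return MAX_SEQ_LEN - prefix


print(remaining_budget(50))            # 19968
print(MAX_SEQ_LEN // TOKENS_PER_PAGE)  # 128
```

Under these assumptions a 50-page document leaves just under 20K tokens for output, and at 128 pages the visual tokens alone would fill the window.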
Baidu plans to expand this framework to 128K context windows and implement a prefill pool for automatic KV chunk fetching, signaling a move toward truly seamless long-form document intelligence.




