An engineer building document automation for a global service uploads a PDF containing a mix of Korean and Japanese text. The output is a disaster: garbled characters and text boxes scattered randomly across the page. To fix this, the team expands the model's supported character set, assuming that adding more symbols to the dictionary will solve the problem. Yet, the recognition accuracy barely budges. This is the wall most developers hit when moving from English-centric OCR to the complex visual landscape of East Asian languages.
The Benchmarks of Nemotron OCR v2
Nvidia addresses this failure with Nemotron OCR v2, a multilingual optical character recognition model trained on 12 million synthetic images across six languages. The primary goal was to drive down Normalized Edit Distance (NED), the standard metric for OCR error rates, where lower means fewer errors. On non-English languages, NED scores fell from a poor 0.56-0.92 range to a highly precise 0.035-0.069. This shift represents a fundamental change in how the model perceives and interprets non-Latin scripts.
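To make the metric concrete, here is a minimal sketch of how a normalized edit distance is commonly computed: the Levenshtein distance between prediction and reference, divided by the length of the longer string. This is an illustration of the general metric, not Nvidia's exact evaluation code.

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance (insert/delete/substitute).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def ned(pred: str, ref: str) -> float:
    # Normalized edit distance: 0.0 is a perfect match, 1.0 total mismatch.
    if not pred and not ref:
        return 0.0
    return levenshtein(pred, ref) / max(len(pred), len(ref))
```

Under this definition, a score of 0.05 means roughly one character error per twenty characters of text, which puts the reported 0.035-0.069 range in perspective.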
Performance is not sacrificed for accuracy: on a single A100 GPU, the model processes 34.7 pages per second, making it viable for high-throughput enterprise pipelines. As the primary text source for training, Nvidia leveraged the mOSCAR multilingual web corpus, which spans 163 language subsets. Visual diversity came from a massive font pool incorporating Google Fonts and the Noto family, providing between 165 and 1,258 unique fonts per language. For developers looking to implement these tools, the model and dataset are available as `nvidia/nemotron-ocr-v2` and `nvidia/OCR-Synthetic-Multilingual-v1` respectively.
Why More Characters Were Not Enough
The transition from v1 to v2 reveals a critical insight into the nature of OCR. In the first version, Nvidia attempted to solve the recognition problem by simply expanding the supported character set from 855 to 14,244 characters. The accuracy gain was negligible. The failure was not a lack of vocabulary but a lack of visual experience: the model knew which characters existed, but not how those characters actually looked when rendered across real-world fonts, sizes, and distortions.
To bridge this gap, Nvidia moved away from traditional data collection and labeling, which is prohibitively expensive for thousands of characters, and turned to synthetic data generation instead. By modifying SynthDoG, a document image synthesis generator, they built a pipeline that produces three simultaneous levels of bounding boxes: word, line, and paragraph. These are not static boxes; they are linked by a relation graph that defines the reading order of the text. This choice addresses the perennial problem of multi-column layouts and complex tables, where a model might otherwise read across columns instead of down them.
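The idea of a relation graph over bounding boxes can be sketched as follows: boxes are nodes, directed edges mean "read this before that", and a topological sort recovers the reading order regardless of spatial position. This is an illustrative data structure, not Nvidia's actual annotation format; the `Box` class and edge encoding are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Box:
    level: str              # "word", "line", or "paragraph"
    bbox: tuple             # (x0, y0, x1, y1) in page pixels
    text: str = ""
    children: list = field(default_factory=list)

def reading_order(boxes, edges):
    # edges: (a, b) pairs meaning "read box a before box b".
    # Kahn's algorithm: topological sort of the relation graph.
    indeg = {i: 0 for i in range(len(boxes))}
    adj = {i: [] for i in range(len(boxes))}
    for a, b in edges:
        adj[a].append(b)
        indeg[b] += 1
    queue = sorted(i for i in indeg if indeg[i] == 0)
    order = []
    while queue:
        n = queue.pop(0)
        order.append(n)
        for m in adj[n]:
            indeg[m] -= 1
            if indeg[m] == 0:
                queue.append(m)
    return order
```

In a two-column page, a naive top-to-bottom, left-to-right sweep would interleave the columns; the relation graph keeps the first column's lines ahead of the second's.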
The synthetic pipeline was designed to mimic the chaos of real documents. It applies templates for multi-column text, scattered annotations, and the vertical writing styles essential for Japanese and Chinese. It also simulates headers, bordered tables, dotted tables of contents, and the specific styling found in PowerPoint slides and Word documents. By training on this randomized layout variety, the model develops an invariance to document structure, allowing it to maintain accuracy regardless of how the page is organized.
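A randomized layout stage like the one described above might look like this in outline: each synthetic page draws one template from a pool plus randomized parameters, so that no single structure dominates the training distribution. The `TEMPLATES` list and parameters here are hypothetical placeholders mirroring the layouts named in the text, not Nvidia's actual configuration.

```python
import random

# Hypothetical template pool mirroring the layouts described above.
TEMPLATES = [
    "single_column", "multi_column", "scattered_annotations",
    "vertical_cjk", "bordered_table", "dotted_toc", "slide", "word_doc",
]

def sample_layout(rng: random.Random) -> dict:
    # Draw a template, then randomize its structural parameters so the
    # model never sees the same page organization twice in a row.
    template = rng.choice(TEMPLATES)
    return {
        "template": template,
        "columns": rng.randint(2, 4) if template == "multi_column" else 1,
        "vertical": template == "vertical_cjk",
    }
```

Seeding the generator makes each synthetic page reproducible, which matters when regenerating a 12-million-image corpus.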
For CJK languages specifically, Nvidia abandoned the word-level recognition unit in favor of line-level recognition. Because these languages often lack clear spacing between words, attempting to segment individual words creates unnecessary noise and errors. Recognizing the entire line as a single unit is more efficient and aligns with the linguistic structure of the scripts. To keep the system fast, the architecture uses a shared detection backbone, so the recognizer and the relation model do not repeat the same computations.
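The segmentation problem is easy to demonstrate: whitespace splitting works for Latin text but collapses a CJK line into a single "word", so word-level supervision adds nothing over line-level targets there. A minimal sketch:

```python
def word_targets(line: str) -> list:
    # Word-level units: split on whitespace (reasonable for Latin scripts).
    return line.split()

def line_targets(line: str) -> list:
    # Line-level units: the whole line is one recognition target.
    return [line]

latin = "the quick brown fox"
japanese = "東京都の天気は晴れです"  # no spaces between words

# Latin text yields four word targets; the Japanese line yields one,
# so for CJK the word level degenerates to the line level anyway.
```

Since proper CJK word segmentation would require a language-specific tokenizer inside the OCR loop, treating the line as the unit sidesteps the problem entirely.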
The bottleneck for multilingual OCR has shifted from the sophistication of the model architecture to the precision with which visual diversity can be programmed into synthetic data.