DharmaOCR Cuts Text Degeneration by 87.6% Using DPO

Every developer working with generative OCR has encountered the nightmare of the infinite loop. You feed a complex document into a model, and instead of a clean transcription, the output suddenly breaks. A single phrase begins to repeat, over and over, cascading into a wall of redundant text that continues until the model hits its maximum token limit. This is not a simple glitch or a fluke of sampling; it is a systemic failure known as text degeneration, and for years, it has remained one of the most stubborn hurdles in structured document extraction.

The Structural Ceiling of Supervised Fine-Tuning

In April, DharmaOCR addressed this persistent flaw by releasing a specialized model and a corresponding methodology paper via Hugging Face. The research focused on structured document extraction for Brazilian Portuguese text, using this specific linguistic environment to benchmark how models handle the transition from visual pixels to structured strings. The core of their investigation was the prevalence of text degeneration, where the model ceases to transcribe and instead enters a self-reinforcing loop of identical tokens.

Before the introduction of new optimization techniques, the researchers observed a wide variance in failure rates across open-source model families. Vanilla models exhibited text degeneration rates ranging from under 1% to as high as 33%. To combat this, the team initially employed Supervised Fine-Tuning (SFT), the industry standard for adapting a general model to a specific task. While SFT succeeded in lowering the degeneration rate, it failed to push the error rate down to a level viable for production-grade services.

The reason for this failure lies in how SFT computes its training signal. SFT minimizes a token-level cross-entropy loss: each position is scored on its own, conditioned on the reference transcription that precedes it. The model's own looping generations never appear in that loss, so a repetitive sequence is never penalized as a whole; at most, each repeated token counts as one more individual prediction. This creates a performance ceiling where the model's ability to perform the OCR task and its resistance to degeneration move independently, meaning that making the model better at transcription does not necessarily make it less likely to loop.
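A toy illustration of that ceiling, as a sketch rather than anything taken from the DharmaOCR paper: the bigram table `MODEL`, its probabilities, and the example tokens are all invented for this example. The per-token loss on the reference transcription looks moderate, yet greedy decoding from the same table falls into a loop that the loss never scores as a sequence.

```python
import math

# Hypothetical toy "model": P(next token | previous token). Invented numbers.
MODEL = {
    "<s>":    {"Rua": 0.90, "das": 0.05, "</s>": 0.05},
    "Rua":    {"das": 0.80, "Rua": 0.15, "</s>": 0.05},
    "das":    {"das": 0.50, "Flores": 0.45, "</s>": 0.05},  # self-loop slightly ahead
    "Flores": {"</s>": 0.90, "Flores": 0.05, "das": 0.05},
}

def per_token_nll(reference):
    """Negative log-likelihood of each reference token (teacher forcing)."""
    prev, losses = "<s>", []
    for tok in reference:
        losses.append(-math.log(MODEL[prev][tok]))
        prev = tok  # condition on the reference token, not on the model's own output
    return losses

def sft_loss(reference):
    """Mean of the per-token terms: no term ever sees a generated loop."""
    losses = per_token_nll(reference)
    return sum(losses) / len(losses)

def greedy_decode(max_tokens=8):
    """Generate by always taking the most likely next token."""
    prev, out = "<s>", []
    while len(out) < max_tokens:
        tok = max(MODEL[prev], key=MODEL[prev].get)
        if tok == "</s>":
            break
        out.append(tok)
        prev = tok
    return out

reference = ["Rua", "das", "Flores", "</s>"]
print("per-token NLL:", [round(x, 3) for x in per_token_nll(reference)])
print("SFT loss:", round(sft_loss(reference), 3))
print("greedy output:", greedy_decode())
```

In this toy, the only loss term that touches the loop is the single prediction after "das"; the run of repeated tokens that greedy decoding produces is invisible to the objective.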

Breaking the Attractor with Preference Optimization

To break through this ceiling, DharmaOCR implemented Direct Preference Optimization (DPO) as a second stage of training following the initial SFT. Across all tested model groups, DPO reduced text degeneration by an average of 59.4%, with the largest improvement reaching a reduction of 87.6%. With the Nanonets-OCR2-3B model, for example, the degeneration rate fell from 1.61% to 0.20%, a relative reduction of roughly 87.6%.
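The percentages here are relative reductions in the degeneration rate, not percentage-point drops. A short check of the Nanonets-OCR2-3B figures quoted above (the helper name `relative_reduction` is just for illustration):

```python
def relative_reduction(before, after):
    """Relative reduction in percent: how much of the original rate was removed."""
    return (before - after) / before * 100

# Degeneration rates quoted for Nanonets-OCR2-3B, in percent.
print(round(relative_reduction(1.61, 0.20), 1))  # 87.6
```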

The effectiveness of DPO stems from where it applies its training signal. Unlike SFT, which scores each next-token prediction separately, DPO compares the model's likelihood of entire outputs. It uses a preference-based framework where the model is presented with a chosen result and a rejected result for the same input. By labeling a looping sequence as rejected and a clean transcription as chosen, the model learns to treat the entire degenerate sequence as a failure. This allows training to shift probability away from the token choices that lead into loops.
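For readers who want to see the objective, here is a minimal sketch of the standard published DPO loss for a single preference pair, written in plain Python. It is not DharmaOCR's training code: the value `beta=0.1` and the example log-probabilities are assumptions, since the article does not give hyperparameters. Each argument is a list of per-token log-probabilities for one full output.

```python
import math

def sequence_logprob(token_logprobs):
    """Log-probability of a whole output: the sum of its token log-probs."""
    return sum(token_logprobs)

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair.

    policy_* come from the model being trained, ref_* from the frozen
    reference model (here, that would be the SFT checkpoint).
    """
    chosen_margin = sequence_logprob(policy_chosen) - sequence_logprob(ref_chosen)
    rejected_margin = sequence_logprob(policy_rejected) - sequence_logprob(ref_rejected)
    logits = beta * (chosen_margin - rejected_margin)
    return softplus(-logits)  # equals -log(sigmoid(logits))

# Invented numbers: the policy likes the clean transcription more than the
# reference does, and the looping output less.
loss = dpo_loss(
    policy_chosen=[-0.1, -0.2], policy_rejected=[-3.0, -3.0],
    ref_chosen=[-0.5, -0.5],    ref_rejected=[-1.0, -1.0],
)
print(round(loss, 4))
```

When the policy and the reference agree on both outputs, the loss sits at log 2; it falls as the policy raises the clean transcription and lowers the looping one relative to the reference. The looping output enters the loss as one summed score, which is what lets the whole sequence be penalized.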

This approach addresses the root of the problem: the attractor. Text degeneration occurs when a specific token begins to dominate its own conditional distribution, creating a high-probability zone that acts like a gravitational pull. Once the model's inference path enters this attractor, it becomes nearly impossible to escape, as each repeated token further reinforces the probability of the next identical token. Most developers attempt to fix this at the inference layer using repetition penalties, temperature adjustments, or early-abort logic. However, these are cosmetic filters. They mask the symptom by altering or cutting off the output, but they leave the internal attractor intact.
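To make the "cosmetic filter" point concrete, here is a small sketch of one common form of repetition penalty: logits of already-generated tokens are divided by the penalty when positive and multiplied by it when negative. The logit values and the `penalty=1.3` setting are invented for illustration.

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.3):
    """Return a penalized copy of the logits; the input list is left untouched."""
    adjusted = list(logits)
    for tok in set(generated_ids):
        if adjusted[tok] > 0:
            adjusted[tok] /= penalty   # shrink positive logits
        else:
            adjusted[tok] *= penalty   # push negative logits further down
    return adjusted

def argmax(values):
    return max(range(len(values)), key=values.__getitem__)

# Invented example: token 0 is the looping token and was just generated.
raw_logits = [4.0, 3.5, -1.0]
adjusted = apply_repetition_penalty(raw_logits, generated_ids=[0])

print("model prefers token", argmax(raw_logits))        # the loop token
print("after penalty, decoder picks", argmax(adjusted)) # a different token
print("model logits unchanged:", raw_logits)
```

The decoder's choice changes, but the model's own logits do not: remove the filter and the same preference for the looping token is still there.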

DharmaOCR turned the model's own failures into the solution. The pipeline was designed to capture the degenerate outputs generated by the SFT model and pair them with the correct transcriptions. These preference pairs carry an objective binary signal (the transcription either succeeded or failed) rather than the subjective alignment signals typically used in chatbot training. This forced the model to redistribute its probability mass away from the attractors, solving the problem at the structural level rather than the sampling level.
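A rough sketch of how such pairs could be assembled. The article does not describe the actual pipeline, so everything below is an assumption: the loop heuristic `is_degenerate`, its thresholds, the sample fields `image_id`, `sft_output`, and `reference`, and the `prompt`/`chosen`/`rejected` record layout (a common convention for preference data).

```python
def is_degenerate(text, max_ngram=4, min_repeats=5):
    """Heuristic: does the output end in the same n-gram repeated many times?"""
    tokens = text.split()
    for n in range(1, max_ngram + 1):
        if len(tokens) < n * min_repeats:
            continue
        tail = tokens[-n:]
        repeats, pos = 1, len(tokens) - 2 * n
        # Walk backwards in steps of n while the same n-gram keeps appearing.
        while pos >= 0 and tokens[pos:pos + n] == tail:
            repeats += 1
            pos -= n
        if repeats >= min_repeats:
            return True
    return False

def build_preference_pairs(samples):
    """Pair each degenerate SFT output (rejected) with its reference (chosen)."""
    pairs = []
    for s in samples:
        if is_degenerate(s["sft_output"]):
            pairs.append({
                "prompt": s["image_id"],      # stand-in for the document image
                "chosen": s["reference"],
                "rejected": s["sft_output"],
            })
    return pairs

# Invented samples for illustration.
samples = [
    {"image_id": "doc-001", "reference": "Total R$ 10,00",
     "sft_output": "Total R$ 10,00 valor valor valor valor valor valor"},
    {"image_id": "doc-002", "reference": "Rua das Flores 123",
     "sft_output": "Rua das Flores 123"},
]
pairs = build_preference_pairs(samples)
print(len(pairs), "pair(s);", "rejected:", pairs[0]["rejected"])
```

A real detector would need to handle loops cut off mid-phrase at the token limit and legitimately repetitive documents such as tables; this heuristic only shows the shape of the idea.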

This shift from token-matching to sequence-level optimization changes how OCR models achieve stability: resistance to looping is trained into the model's own probability distribution rather than bolted on at decoding time.
