Tax season for a professional accountant is often defined by a grueling race against the clock, characterized by the manual migration of data from fragmented PDFs, handwritten notes, and sprawling spreadsheets into rigid government forms. For a single complex tax return, a practitioner might spend up to eight hours just on data entry, a process where a single keystroke error can lead to significant compliance risks. While the industry has long sought automation, the gap between a laboratory AI demo and a production-ready system has remained wide because tax law is not just about data extraction, but about the application of tacit professional judgment. OpenAI is now attempting to close this gap not by simply refining prompts, but by building a system that learns from the act of correction itself.
Scaling Precision Across 7,000 Tax Returns
OpenAI, in collaboration with Thrive Holdings and the Crete tax accounting network, has deployed Tax AI to transform the workflow of over 30 accounting firms. During a recent tax season, the system processed 7,000 returns, focusing on the complex Forms 1040 and 1041. The primary objective was to reach 100% accuracy, defined as a state where every field in a tax return is completed correctly without requiring any human intervention. While total perfection remains the North Star, the practical gains are already substantial. The time required for tax preparation has dropped to approximately one-third of previous levels, allowing practitioners to shift their focus from rote data entry to high-value client advisory and complex legal interpretation.
The system began by handling standardized documents like W-2s and 1099s, but it quickly expanded its capabilities to encompass high-difficulty filings, including K-1 documents and intricate schedules. Tax AI achieved a draft accuracy rate of up to 97%, which significantly reduced the cognitive load on the reviewing accountants. This efficiency gain translated into a 50% increase in overall throughput for the participating firms. The most critical observation from this deployment is that the system did not remain static. Within three months, the model showed measurable performance improvements that were directly tied to the real-world corrections made by the accountants in the field.
From Manual Patching to the Codex Iteration Loop
The fundamental difference between Tax AI and traditional AI implementations lies in its autonomous improvement loop. In a standard AI deployment, when a user discovers an error, the feedback is routed to an engineer who manually analyzes the failure, tweaks the prompt, and redeploys the model. This cycle is slow and often loses the nuance of the professional's intent. OpenAI replaced this manual process with a three-pillar engineering system that converts expert practitioner feedback into structured signals.
First, the system captures the delta between the AI's suggested value and the final value confirmed by the accountant. This is treated not as a simple correction but as a structured data point that encodes the professional's tacit knowledge. Second, the system uses production traces to map the entire lifecycle of a data point. For example, when processing Schedule E for rental property income, the system tracks the data from the raw source (such as an email or a spreadsheet), through a normalization phase where it becomes a cited field, and finally to its mapping within the tax engine. By preserving this history, the system can pinpoint where a failure occurred: an extraction error from the source document or a logical mapping error in the tax engine.
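To make the first two pillars concrete, here is a minimal Python sketch of how a correction delta and a per-stage trace might be recorded and used to locate a failure. It is an illustration, not Tax AI's actual implementation: the `FieldTrace` class, the stage names in `STAGES`, and the `first_failing_stage` helper are all hypothetical, and the sketch assumes each stage's value is stored in a form that can be compared directly with the accountant's final value.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical stage names in pipeline order. The article describes these
# stages but does not say how the real system labels them.
STAGES = ("extraction", "normalization", "mapping")


@dataclass
class FieldTrace:
    """One field's path through the pipeline, plus the accountant's final value."""
    form: str            # e.g. "Schedule E"
    field_name: str      # e.g. "fair_rental_days"
    stage_values: dict   # stage name -> value produced at that stage
    final_value: str     # value confirmed by the accountant

    @property
    def suggested_value(self) -> Optional[str]:
        # Assumption: the AI's suggestion is whatever the last stage emitted.
        return self.stage_values.get(STAGES[-1])

    @property
    def was_corrected(self) -> bool:
        # The "delta": the suggestion differs from the confirmed value.
        return self.suggested_value != self.final_value


def first_failing_stage(trace: FieldTrace) -> Optional[str]:
    """Return the earliest stage whose output disagrees with the final value.

    Returns None when the accountant accepted the suggestion unchanged.
    """
    if not trace.was_corrected:
        return None
    for stage in STAGES:
        if trace.stage_values.get(stage) != trace.final_value:
            return stage
    return None


# The value was read and normalized correctly but never reached the tax engine.
trace = FieldTrace(
    form="Schedule E",
    field_name="fair_rental_days",
    stage_values={"extraction": "365", "normalization": "365", "mapping": None},
    final_value="365",
)
print(first_failing_stage(trace))  # prints "mapping"
```

Comparing every stage against the confirmed value is what separates an extraction error from a mapping error in this sketch: a misread source document would already disagree at the `extraction` stage.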
Third, these failure patterns are fed into a Codex-driven iteration loop. When the system detects a recurring failure, such as consistently missing the fair rental days field, it automatically aggregates the affected cases into a targeted evaluation set. Codex, OpenAI's code generation model, then analyzes these evaluation sets and the underlying source packages to rewrite the code and verify the fix. This transforms the developer's role from manual debugger to high-level architect who reviews and approves the improvements the system proposes. The result was a dramatic leap in reliability: the share of documents achieving over 75% field accuracy jumped from 25% to 86% in just six weeks.
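The aggregation step and the headline metric can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the production pipeline: the function names, the dictionary keys (`form`, `field`, `stage`, `doc_id`), and the `min_cases` cutoff for what counts as "recurring" are all hypothetical, and the Codex rewrite step itself is out of scope here.

```python
from collections import defaultdict


def build_eval_sets(corrections, min_cases=3):
    """Group corrections by (form, field, stage) and keep groups that recur.

    `corrections` is a list of dicts with hypothetical keys
    "form", "field", "stage", and "doc_id".
    """
    groups = defaultdict(list)
    for c in corrections:
        groups[(c["form"], c["field"], c["stage"])].append(c["doc_id"])
    # A pattern only becomes an evaluation set once it has enough cases.
    return {key: docs for key, docs in groups.items() if len(docs) >= min_cases}


def field_accuracy(suggested, final):
    """Fraction of a document's fields where the suggestion matched the final value."""
    if not final:
        return 1.0
    correct = sum(1 for name, value in final.items() if suggested.get(name) == value)
    return correct / len(final)


def share_above_threshold(doc_accuracies, threshold=0.75):
    """Fraction of documents whose field accuracy exceeds the threshold."""
    if not doc_accuracies:
        return 0.0
    return sum(1 for a in doc_accuracies if a > threshold) / len(doc_accuracies)


corrections = [
    {"form": "Schedule E", "field": "fair_rental_days", "stage": "mapping", "doc_id": "d1"},
    {"form": "Schedule E", "field": "fair_rental_days", "stage": "mapping", "doc_id": "d2"},
    {"form": "Schedule E", "field": "fair_rental_days", "stage": "mapping", "doc_id": "d3"},
    {"form": "1040", "field": "wages", "stage": "extraction", "doc_id": "d4"},
]
eval_sets = build_eval_sets(corrections)
print(list(eval_sets))  # only the recurring Schedule E pattern survives
print(share_above_threshold([0.5, 0.8, 0.9, 0.76]))  # prints 0.75
```

Under this reading, the jump the article reports corresponds to `share_above_threshold` moving from 0.25 to 0.86 over six weeks, with each evaluation set acting as a regression check that a proposed fix must pass before a developer approves it.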
This architecture effectively codifies professional expertise. By treating the accountant's correction as the primary engineering signal, the system bypasses the need for massive, pre-cleaned datasets. Instead, it uses the friction of real-world errors to drive its own improvement, keeping the AI in lockstep with the complexities of actual tax practice.
This shift from static automation to an autonomous learning loop suggests a future where AI does not just assist professionals but actively absorbs their expertise to eliminate the most tedious aspects of their craft.




