Every developer has experienced the frustration of an AI-generated patch that passes the unit tests but fails the human code review. That gap separates code that is technically correct from code that is architecturally sound. For months, the industry has treated LLM output as binary: either the code works or it does not. However, as we move toward more autonomous agentic workflows, the conversation is shifting from simple correctness to maintainability and semantic alignment with human intent.
The Economics of Reasoning Effort
Recent analysis of GPT-5.5 Codex, a model specifically optimized for programming tasks, reveals a consistent relationship between the compute spent at inference time and the practical utility of the resulting code. The study covered 26 real-world tasks in the `GraphQL-go-tools` repository, a complex Go implementation of the GraphQL query language. To measure the impact of reasoning depth, the researchers tested four distinct inference settings: low, medium, high, and xhigh.
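To make the setup concrete, here is a minimal sketch of that sweep in Go, the language of the repository under study. Everything in it is illustrative: the `runTask` helper, the `result` fields, and the task naming are assumptions for the sake of the sketch, not Stet's actual interface.

```go
package main

import "fmt"

type result struct {
	setting   string
	testsPass bool
	costUSD   float64
	seconds   float64
}

// runTask is a stand-in for the real harness: generate a patch at the given
// reasoning setting, apply it in an isolated container, run the repository's
// test suite, and record cost and wall-clock time.
func runTask(taskID, setting string) result {
	// ... call the model, apply the patch, execute the tests ...
	return result{setting: setting}
}

func main() {
	settings := []string{"low", "medium", "high", "xhigh"}
	tasks := make([]string, 26) // one entry per real-world task in the study
	for i := range tasks {
		tasks[i] = fmt.Sprintf("task-%02d", i+1)
	}
	for _, s := range settings {
		for _, t := range tasks {
			r := runTask(t, s)
			fmt.Printf("%s @ %-6s pass=%-5v cost=$%.2f time=%.1fs\n",
				t, s, r.testsPass, r.costUSD, r.seconds)
		}
	}
}
```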
The financial cost of this increased reasoning is immediate. The average cost per task was $2.65 at the low setting, $3.13 at medium, $4.49 at high, and $9.77 at xhigh, meaning each step up the ladder costs roughly 1.18x, 1.43x, and 2.18x the previous one. Inference time follows the same trajectory, stretching from 286.9 seconds at the lowest setting to 753.3 seconds at the highest.
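The step-to-step ratios fall directly out of those averages; a few lines of Go make the shape of the price curve explicit (the dollar figures are the per-task averages quoted above):

```go
package main

import "fmt"

func main() {
	// Average per-task cost at each reasoning setting, from the study.
	costs := map[string]float64{
		"low": 2.65, "medium": 3.13, "high": 4.49, "xhigh": 9.77,
	}
	order := []string{"low", "medium", "high", "xhigh"}
	for i := 1; i < len(order); i++ {
		prev, next := order[i-1], order[i]
		fmt.Printf("%s -> %s: %.2fx\n", prev, next, costs[next]/costs[prev])
	}
	// Output:
	// low -> medium: 1.18x
	// medium -> high: 1.43x
	// high -> xhigh: 2.18x
}
```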
To ensure the validity of the results, the team used Stet, a tool for patch application and test execution, inside isolated container environments. GPT-5.4 acted as a blind judge, assessing the accuracy and maintainability of each patch without knowing which reasoning setting produced it. While the test pass rate held flat at 21/26 between the low and medium settings, semantic equivalence (the degree to which the AI's patch behaved identically to a human-written one) jumped from 4/26 to 11/26, and the code review pass rate climbed from 3/26 to 5/26.
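The essence of the blind-judging protocol is easy to picture: strip each patch of its provenance before the judge sees it. The sketch below is a loose illustration under assumed types; `Patch`, `Verdict`, and `judge` are invented names, not the actual interface of Stet or the judge model.

```go
package main

import (
	"fmt"
	"math/rand"
)

type Patch struct {
	TaskID  string
	Diff    string
	Setting string // the reasoning setting that produced the diff
}

type Verdict struct {
	SemanticEquivalent bool // behaves identically to the human-written patch
	ReviewPass         bool // would survive human code review
}

// blind strips provenance so the judge cannot condition on it.
func blind(p Patch) Patch {
	p.Setting = ""
	return p
}

// judge stands in for the GPT-5.4 call; it sees only the anonymized patch.
func judge(p Patch) Verdict {
	// ... send p.Diff to the judge model and parse its assessment ...
	return Verdict{}
}

func main() {
	patches := []Patch{ /* one per task and setting */ }
	// Shuffle so ordering cannot hint at which setting produced which patch.
	rand.Shuffle(len(patches), func(i, j int) {
		patches[i], patches[j] = patches[j], patches[i]
	})
	for _, p := range patches {
		v := judge(blind(p))
		fmt.Printf("%s: semantic=%v review=%v\n",
			p.TaskID, v.SemanticEquivalent, v.ReviewPass)
	}
}
```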
Detailed data and interactive charts regarding these reasoning curves are available at stet.sh/blog/gpt-55-codex-graphql-reasoning-curve.
From Heuristic Patching to Domain Understanding
The critical insight from this data is that increasing reasoning effort does not just produce more code; it changes the fundamental strategy the model uses to solve a problem. In the low setting, GPT-5.5 Codex tends to rely on heuristics—simple patterns or empirical rules that happen to satisfy the test case without necessarily addressing the underlying logic. This is evident in PR #1297, where the low setting simply added a conditional branch to pass the test, a move that would likely be rejected in any professional production environment for being a superficial fix.
When the setting is shifted to medium, the model moves from heuristics to domain modeling. It begins to reflect the actual business logic and structural dependencies of the system. In the same PR #1297 task, the medium setting correctly modeled the data dependency rules, allowing the patch to pass not just the tests, but the human-centric code review process.
The high setting represents the tipping point where additional token consumption still translates into tangible quality gains. Compared to medium, the high setting improved test pass rates by 15.4 percentage points, semantic equivalence by 26.9 points, and review pass rates by 19.2 points. A clear example is PR #1209: both the low and medium settings passed the tests but failed the review, while the high setting introduced explicit response key handling, meeting the project's strict integration requirements.
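Because the high setting's results are reported as percentage-point deltas over medium rather than raw counts, the counts can be reconstructed; with 26 tasks, one task is roughly 3.85 points. The reconstruction below is an inference from the quoted deltas, not figures stated in the study:

```go
package main

import "fmt"

func main() {
	const tasks = 26.0
	// Medium-setting counts quoted in the article.
	medium := map[string]float64{"tests": 21, "semantic": 11, "review": 5}
	// Percentage-point gains of high over medium, also quoted.
	gains := map[string]float64{"tests": 15.4, "semantic": 26.9, "review": 19.2}
	for _, m := range []string{"tests", "semantic", "review"} {
		high := medium[m] + gains[m]/100*tasks
		fmt.Printf("%s: %g/26 -> %.0f/26 (+%.1fpp)\n", m, medium[m], high, gains[m])
	}
}
```

Working through the arithmetic puts the high setting at roughly 25/26 on tests, 18/26 on semantic equivalence, and 10/26 on review passes.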
However, the transition to the xhigh setting reveals diminishing returns and the emergence of footprint risk. While xhigh achieved the highest raw scores, with semantic equivalence at 23/26 and review pass rates at 18/26, the average cost skyrocketed to $9.77 per task, 2.18x the cost of the high setting. More concerning is the xhigh setting's tendency to over-edit: it often modified fixtures and test files unnecessarily, expanding the surface area of each change and increasing the potential for regression bugs. In some instances, such as PR #1155, xhigh actually produced a worse implementation than high, a performance inversion where more compute led to a lower-quality result.
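One practical response to footprint risk is to measure it. A crude but useful signal is the set of files a patch touches, flagging test and fixture paths for extra scrutiny. The sketch below assumes common Go conventions (`_test.go` suffixes, a `testdata/` directory) and invented file paths; it is not a check the study itself describes.

```go
package main

import (
	"fmt"
	"strings"
)

// touchedFiles extracts the target paths from a unified diff.
func touchedFiles(diff string) []string {
	var files []string
	for _, line := range strings.Split(diff, "\n") {
		if strings.HasPrefix(line, "+++ b/") {
			files = append(files, strings.TrimPrefix(line, "+++ b/"))
		}
	}
	return files
}

func main() {
	// Hypothetical diff: one production file plus a test and a fixture.
	diff := `--- a/execution/plan.go
+++ b/execution/plan.go
--- a/execution/plan_test.go
+++ b/execution/plan_test.go
--- a/testdata/fixtures/response.json
+++ b/testdata/fixtures/response.json`

	for _, f := range touchedFiles(diff) {
		risky := strings.HasSuffix(f, "_test.go") || strings.Contains(f, "testdata/")
		fmt.Printf("%-40s footprint-risk=%v\n", f, risky)
	}
}
```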
The success of AI-driven engineering is no longer measured by whether the model can find the right answer, but by whether it can produce code that a human engineer is willing to maintain.