Compounding Correctness: The New Logic of AI Token Consumption

For a time, the primary measure of AI success in the enterprise was sheer volume. At companies like Meta, this took the form of tokenmaxxing, a culture in which token consumption was tied directly to performance reviews. To hit those KPIs, engineers found themselves in the absurd position of orchestrating two AI agents to converse with each other all day, generating mountains of useless text just to move a needle on a dashboard. It was a blunt instrument for forcing adoption: it pushed reluctant senior engineers past their initial resistance and made tools like Cursor part of the daily workflow by mandate.

The Shift Toward Compute-Driven Accuracy

The industry is now pivoting away from vanity metrics toward a concept known as compounding correctness. In the early days of LLM agents, the prevailing fear was compounding error, where a small hallucination in the first step of a chain would cascade, rendering the final output useless. The current paradigm flips this logic, suggesting that if the architecture is correct, increasing the amount of compute and tokens dedicated to a problem actually increases the probability of a correct answer.

This shift is operationalized through loops. Rather than running a linear chain, an agent in a loop completes a task, evaluates the result, and restarts the process with the previous output as the starting point for refinement. The system can thereby decompose a complex specification into manageable parts and solve them incrementally.

The most striking evidence for this approach comes from high-stakes security research. The AI Safety Institute (AISI) recently used the Mythos model to identify system vulnerabilities, allocating a budget of 100M tokens per attempt. At $12,500 per run, a series of ten executions cost $125,000.
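The evaluate-and-restart loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the setup AISI or any vendor uses: `refine_loop`, `toy_generate`, and `toy_evaluate` are hypothetical names, and the toy functions stand in for real model calls.

```python
from typing import Callable, Tuple

def refine_loop(
    spec: str,
    generate: Callable[[str, str, str], str],
    evaluate: Callable[[str, str], Tuple[bool, str]],
    max_iterations: int = 10,
) -> Tuple[str, int]:
    """Generate, evaluate, and retry until accepted or the budget runs out."""
    draft, feedback = "", ""
    for iteration in range(1, max_iterations + 1):
        # Each pass sees the previous draft and the evaluator's feedback.
        draft = generate(spec, draft, feedback)
        accepted, feedback = evaluate(spec, draft)
        if accepted:
            return draft, iteration
    return draft, max_iterations

# Toy stand-ins (assumptions): the "spec" is satisfied once all parts exist.
REQUIRED_PARTS = ["parse", "validate", "store"]

def toy_generate(spec: str, previous: str, feedback: str) -> str:
    # Adds one missing part per pass, mimicking incremental decomposition.
    done = previous.split()
    missing = [part for part in REQUIRED_PARTS if part not in done]
    return " ".join(done + missing[:1])

def toy_evaluate(spec: str, draft: str) -> Tuple[bool, str]:
    missing = [part for part in REQUIRED_PARTS if part not in draft.split()]
    return (not missing, "missing: " + ", ".join(missing))

result, passes = refine_loop("build pipeline", toy_generate, toy_evaluate)
print(result, passes)
```

In a real pipeline the evaluator would typically be a test suite or another deterministic check, so that each pass has a concrete signal to improve against.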

Crucially, the AISI reported that as the token budget increased, the model's performance continued to improve. There were no signs of diminishing returns, suggesting that for complex reasoning tasks, the ceiling for performance is much higher than previously thought, provided the system is allowed to iterate.

The Mathematical Case for Cheap Models

The realization that iteration beats raw model power has fundamentally changed the procurement strategy for AI pipelines. The goal is no longer to find the single most powerful model, but to find the most efficient ratio of improvement per dollar. When the objective is to run a loop hundreds of times, the cost difference between frontier models and open-weight alternatives becomes the deciding factor.

Consider the current pricing landscape per million tokens:

`GLM 5.2: Input $1.40 / Output $4`

`Haiku 4.5: Input $1 / Output $5`

`Opus 4.X series: Input $5 / Output $25`
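To see what these prices mean for a loop, the sketch below converts them into a cost per pass. The prices are the ones listed above; the token counts per pass (20,000 in, 4,000 out) and the 500-pass run are hypothetical figures chosen only for illustration, as is the helper name `cost_per_pass`.

```python
# Prices in USD per million tokens, copied from the list above.
PRICES = {
    "GLM 5.2": {"input": 1.40, "output": 4.00},
    "Haiku 4.5": {"input": 1.00, "output": 5.00},
    "Opus 4.X": {"input": 5.00, "output": 25.00},
}

def cost_per_pass(model: str, input_tokens: int, output_tokens: int) -> float:
    """USD cost of one loop pass for the given token counts."""
    price = PRICES[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000

# Hypothetical workload: 20k tokens read and 4k tokens written per pass.
INPUT_TOKENS, OUTPUT_TOKENS = 20_000, 4_000
PASSES = 500  # assumed loop count

pass_costs = {m: cost_per_pass(m, INPUT_TOKENS, OUTPUT_TOKENS) for m in PRICES}

for model, cost in pass_costs.items():
    print(f"{model}: ${cost:.3f} per pass, ${cost * PASSES:.2f} per run")
```

The gap between models depends on the input/output mix. On these list prices the Opus-to-GLM ratio ranges from about 3.6x for input-heavy passes to 6.25x for output-heavy ones.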

GLM 5.2 has emerged as a potent contender, showing strength against Haiku and even surpassing GPT 5.5 on specific benchmarks. The strategic advantage here is arithmetic. Suppose a top-tier model like Claude delivers a 1.1x improvement in accuracy per loop, while GLM 5.2 delivers 1.05x at one-fifth of the cost. For the same spend, GLM 5.2 can run five loops, and if the gains compound, 1.05^5 ≈ 1.28x beats a single 1.1x pass. The cumulative gain from more iterations with a cheaper model often outweighs the marginal intelligence of a more expensive one.
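The comparison can be checked with a short calculation. This sketch assumes the per-loop gains multiply and never saturate, which is a simplification: real accuracy is capped at 100%, and per-loop gains are unlikely to stay constant. The multipliers and the 5:1 cost ratio come from the example above; the function names are hypothetical.

```python
import math

def compounded_gain(per_loop_multiplier: float, loops: int) -> float:
    """Total improvement if every loop multiplies accuracy by the same factor."""
    return per_loop_multiplier ** loops

# Costs in arbitrary integer units so the division below is exact.
FRONTIER_COST, CHEAP_COST = 5, 1        # cheap model at one-fifth the cost
FRONTIER_MULT, CHEAP_MULT = 1.10, 1.05  # per-loop gains from the example
BUDGET = 5                              # enough for one frontier loop

frontier_loops = BUDGET // FRONTIER_COST
cheap_loops = BUDGET // CHEAP_COST

frontier_gain = compounded_gain(FRONTIER_MULT, frontier_loops)
cheap_gain = compounded_gain(CHEAP_MULT, cheap_loops)

# Number of cheap loops needed to match a single frontier loop.
break_even_loops = math.ceil(math.log(FRONTIER_MULT) / math.log(CHEAP_MULT))

print(f"frontier: {frontier_loops} loop(s), gain {frontier_gain:.3f}x")
print(f"cheap:    {cheap_loops} loop(s), gain {cheap_gain:.3f}x")
print(f"break-even: {break_even_loops} cheap loops per frontier loop")
```

Under these assumptions two cheap loops already match one frontier loop, so the cheaper model comes out ahead whenever it costs less than about half as much per loop.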

However, this requires a strict distinction between developer spend and pipeline spend. Developer spend, such as using Claude Code to accelerate an engineer's productivity, is an investment in human efficiency and is easily justified. Pipeline spend is different. Many teams try to reduce hallucinations by stacking verifier agents: one agent to check the first, then another to check the verifier. This often triples the cost while actually lowering accuracy, because every added agent is another point of failure in a non-deterministic system. The goal is not to add more agents, but to build a tighter, more iterative loop around a model that can actually self-correct.

The ultimate destination for this trajectory is the software factory: a system where a human provides a high-level specification, and the AI handles the entire lifecycle of generation, review, bug fixing, and test writing. While some industry voices, such as StrongDM, have suggested that companies should aim for a token spend of $1,000 per engineer per day, the reality of current software factory operations is much leaner, often hovering around $600 per month. Spending the equivalent of a senior Google engineer's salary on tokens for a single developer is currently inefficient.
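Putting the two spending figures on the same annual basis makes the gap concrete. The daily and monthly amounts are the ones quoted above; the 250 working days per year is an assumption made for this calculation.

```python
WORKING_DAYS_PER_YEAR = 250  # assumption, not from the source

suggested_per_day = 1_000  # USD per engineer per day (figure quoted above)
observed_per_month = 600   # USD per month (figure quoted above)

suggested_annual = suggested_per_day * WORKING_DAYS_PER_YEAR
observed_annual = observed_per_month * 12
ratio = suggested_annual / observed_annual

print(f"suggested: ${suggested_annual:,} per year")
print(f"observed:  ${observed_annual:,} per year")
print(f"ratio:     {ratio:.1f}x")
```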

Success in the next era of AI implementation will depend on quantifying the relationship between a model's per-loop improvement rate and its token cost, and using it to find the optimal number of iterations.
