A developer opens a fresh session with a frontier model and uploads an entire codebase, three massive technical manuals, and a dozen Jira tickets. The interface confirms the upload, boasting a context window of one million tokens, and the developer feels a surge of confidence that the AI now possesses a complete mental map of the project. However, as the conversation progresses, the model begins to hallucinate basic function names or ignore critical constraints established in the first prompt. The experience is a common frustration in the current AI era: the gap between the theoretical capacity of a model and its actual cognitive reliability.

The Marketing Myth of the Million-Token Window

AI vendors have entered an arms race of context window expansion, with figures like 200k, 1M, and 2M tokens becoming standard marketing pillars. The promise is simple: the larger the window, the more data the model can ingest, and therefore the smarter it becomes. This narrative implies that the attention mechanism, the component that lets a model weigh the importance of different parts of its input, stays equally effective as the window grows. In practice, the architecture may technically support the ingestion of millions of tokens, but the ability to reason across that data does not scale at the same rate.

Evidence from the RULER benchmark and reports from Chroma points to a phenomenon known as Context Rot: the progressive degradation of a model's ability to retrieve and use information as the context window fills up. A model might successfully retrieve a specific fact from the middle of a massive prompt (the classic needle-in-a-haystack test), but performing complex reasoning or following multi-step instructions across that same volume of data is a different challenge entirely. The RULER and Chroma findings indicate that the effective context is often only a fraction of the advertised limit. As the input grows, the model's precision drops, so the million-token window is less a workspace and more a storage bin where information is kept but not necessarily understood.

The 100k Divide and the Failure of Auto-Compaction

There is a tangible tipping point in LLM performance around the 100k-token mark. It splits the user's experience into two zones: the Smart Zone and the Dull Zone. In the Smart Zone, which spans from the start of a session up to roughly 100k tokens, the model remains sharp, adheres to complex system prompts, and maintains a high degree of logical consistency. Once the session crosses into the Dull Zone, the model's attention begins to fray. It starts forgetting the primary objectives defined at the beginning of the session, or it falls back on generic, low-effort responses that ignore the nuances of the provided data.

This degradation is particularly visible in high-intensity environments like AI-driven coding. Coding agents consume tokens at an accelerated rate because they must read dozens of files, process long debugging logs, and take in the output of extensive test suites. A complex bug fix can push a session into the Dull Zone in a matter of minutes. To combat this, tools like Claude Code implement an auto-compact feature that summarizes the existing history to bring the token count back down. While this seems like a logical solution, it introduces a critical failure loop. The auto-compact trigger usually activates after the model has already entered the Dull Zone, so the summary is written by a model that is already experiencing cognitive decline. That degraded summary then becomes the foundation for everything that follows, and whatever it dropped or distorted stays lost, so the model never fully returns to its peak intelligence.
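One way to picture an alternative to the late trigger is a session that checks its own budget and asks for compaction while it is still well inside the Smart Zone. The sketch below is illustrative only: the Session class, the 70% threshold, and the four-characters-per-token estimate are assumptions made for this example, not how Claude Code or any other tool actually works.

```python
from dataclasses import dataclass, field
from typing import List

# Assumption: roughly 4 characters per token. This is a crude rule of thumb;
# a real tokenizer will give different counts.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Rough token estimate using ceiling division on character count."""
    return (len(text) + CHARS_PER_TOKEN - 1) // CHARS_PER_TOKEN


@dataclass
class Session:
    """Hypothetical session tracker that flags compaction early."""

    smart_zone_limit: int = 100_000  # the ~100k mark described above
    compact_at: float = 0.7          # assumed fraction of the limit
    messages: List[str] = field(default_factory=list)

    def used_tokens(self) -> int:
        return sum(estimate_tokens(m) for m in self.messages)

    def should_compact(self) -> bool:
        # Trigger while the model is still expected to be sharp,
        # rather than after the session has crossed the limit.
        return self.used_tokens() >= self.smart_zone_limit * self.compact_at

    def add(self, message: str) -> bool:
        """Record a message and report whether it is time to summarize."""
        self.messages.append(message)
        return self.should_compact()
```

The point of the design is the ordering: the summary is requested before the limit is reached, so it is written while the session is still in the Smart Zone.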

To bypass this limitation, the most effective developers are moving away from long-lived sessions and toward a breadcrumb strategy. Instead of trusting the model's internal memory, they create small, external, named artifacts, such as Product Requirement Documents (PRDs), execution plans, and specific skill definitions, that live outside the live session. Frameworks and methodologies like obra/superpowers and mattpocock/skills exemplify this approach. By treating the LLM as a stateless processor and feeding it only the necessary, verified specifications for a specific task, developers keep each session within the 100k Smart Zone. This shifts the burden of memory from the model's volatile context window to a structured system of external records.
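A minimal sketch of the breadcrumb idea, under stated assumptions: each task starts a fresh prompt built only from a few named artifacts plus the task itself, and the build fails loudly if the result would exceed the budget. The function names, the file names in the usage note, and the four-characters-per-token estimate are all hypothetical; this does not reproduce how obra/superpowers or mattpocock/skills are implemented.

```python
from pathlib import Path
from typing import Dict, List

# Assumption: roughly 4 characters per token; a real tokenizer will differ.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Rough token estimate using ceiling division on character count."""
    return (len(text) + CHARS_PER_TOKEN - 1) // CHARS_PER_TOKEN


def load_artifacts(directory: Path, names: List[str]) -> Dict[str, str]:
    """Read the named artifact files (PRD, plan, skill notes) from disk."""
    return {
        name: (directory / name).read_text(encoding="utf-8")
        for name in names
    }


def build_prompt(
    artifacts: Dict[str, str],
    task: str,
    budget_tokens: int = 100_000,
) -> str:
    """Assemble a fresh, self-contained prompt from external artifacts.

    Raises ValueError instead of silently overflowing the budget, so the
    caller has to trim or split the artifacts.
    """
    sections = [
        f"## {name}\n{body.strip()}" for name, body in artifacts.items()
    ]
    sections.append(f"## Task\n{task.strip()}")
    prompt = "\n\n".join(sections)

    needed = estimate_tokens(prompt)
    if needed > budget_tokens:
        raise ValueError(
            f"prompt needs ~{needed} tokens, budget is {budget_tokens}"
        )
    return prompt
```

A call might look like build_prompt(load_artifacts(Path("docs"), ["PRD.md", "PLAN.md"]), "Implement step 1 of the plan."), where the directory and file names are placeholders. Because nothing carries over from earlier sessions, the artifacts on disk are the only memory, and they can be reviewed and corrected by a human before the next task begins.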

Ultimately, the true utility of an LLM is not determined by the size of the window provided by the vendor, but by the density of the context managed by the user.