A developer at an AI startup stares at a log window where every single turn consumes over 40,000 tokens. The actual task is a simple API call, yet the context window is bloated, with half of the space occupied by dozens of tool descriptions and schemas. This is the invisible friction of the modern AI agent: as the number of available tools grows, the cost of simply telling the model what those tools are begins to outweigh the cost of the actual reasoning. This structural waste has become a recurring bottleneck for teams scaling their agentic workflows, a phenomenon known as the MCP tool tax.
The High Cost of the MCP Tool Tax
When AI agents connect to various external tools, the context window occupancy rises sharply as the number of connections increases. In a real-world operational environment utilizing five Model Context Protocol (MCP) servers and 34 configured tools, the average token consumption per turn was measured at 45,000 tokens. Approximately 50 percent of this total, or 22,000 tokens, is wasted on tool schema overhead that has nothing to do with the immediate task at hand. Essentially, half of the text the model must process to answer a question consists of JSON-formatted tool specifications.
Engineering data from Anthropic reveals that unoptimized tool definitions can occupy up to 134,000 tokens. In typical multi-server deployment environments, this adds an overhead of 15,000 to 60,000 tokens per turn. This MCP tool tax does more than just increase API costs; it actively degrades reasoning quality. When unnecessary tool definitions dominate the context, critical user instructions and the core nuances of the previous conversation are pushed aside, leading to a measurable drop in the model's ability to follow complex prompts.
To combat this, the Tool Search feature was introduced to reduce the token volume used for tool definitions by up to 85 percent. Instead of the traditional method of pre-loading every single tool schema into the prompt, the system adopts an on-demand approach. By loading only the specific tools required at the moment of execution, the system removes irrelevant options from the context window, which suppresses false positive responses and sharpens the model's overall reasoning accuracy.
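The figures above can be combined into a rough back-of-the-envelope estimate. The sketch below assumes, purely for illustration, that the quoted upper bound of 85 percent applies to the 45,000-token deployment described earlier; the article does not report that combination as a measured result.

```python
from __future__ import annotations

# Figures quoted in the article for one measured deployment.
TOTAL_TOKENS_PER_TURN = 45_000
SCHEMA_TOKENS_PER_TURN = 22_000
# Upper bound quoted for Tool Search; applying it here is an assumption.
MAX_REDUCTION = 0.85


def schema_share(total: int, schema: int) -> float:
    """Fraction of one turn's tokens spent on tool schemas."""
    return schema / total


def after_tool_search(total: int, schema: int, reduction: float) -> tuple[int, int]:
    """Return (remaining schema tokens, new per-turn total) after the reduction."""
    remaining = round(schema * (1 - reduction))
    return remaining, total - schema + remaining


if __name__ == "__main__":
    share = schema_share(TOTAL_TOKENS_PER_TURN, SCHEMA_TOKENS_PER_TURN)
    remaining, new_total = after_tool_search(
        TOTAL_TOKENS_PER_TURN, SCHEMA_TOKENS_PER_TURN, MAX_REDUCTION
    )
    print(f"schema share: {share:.1%}")
    print(f"schema tokens after reduction: {remaining}, per-turn total: {new_total}")
```

Under that assumption, schema overhead falls from 22,000 to about 3,300 tokens per turn, and the per-turn total drops to roughly 26,300.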
Solving Decision Paralysis with BM25 and Dynamic Loading
Hermes Agent implements a progressive disclosure architecture that prevents the model from being overwhelmed by a massive array of tools. The core of this design is the replacement of the exhaustive tool list with three bridge tools: `search_tools`, `load_tool`, and `unload_tool`. Rather than reading a full catalog, the model uses these bridges to pull in only the tools it needs, which caps the number of schema tokens that ever enter the context.
The internal search logic of these bridge tools is powered by the BM25 (Best Matching 25) algorithm. When a model inputs a query, the system compares it against tool names, detailed descriptions, and parameter names, filtering and suggesting tools based on their relevance scores. If the BM25 results fail to produce a valid matching score, the system employs a fallback mechanism using literal substring matching to ensure the correct tool is eventually located.
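A minimal sketch of this ranking step follows. It is not the Hermes Agent implementation: the tool record layout (`name`, `description`, `params`), the tokenizer, and the `K1` and `B` constants are assumptions chosen for illustration. It scores each tool with a standard BM25 formula and falls back to literal substring matching when no tool scores above zero.

```python
import math
import re
from collections import Counter

K1, B = 1.5, 0.75  # common BM25 defaults, assumed here


def tokenize(text):
    """Lowercase and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tool_text(tool):
    """Searchable text: tool name, description, and parameter names."""
    return " ".join([tool["name"], tool["description"], *tool["params"]])


def bm25_scores(query, tools):
    """Return one BM25 score per tool, in catalog order."""
    docs = [tokenize(tool_text(t)) for t in tools]
    avg_len = sum(len(d) for d in docs) / len(docs)
    doc_freq = Counter(term for d in docs for term in set(d))
    n = len(docs)
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + K1 * (1 - B + B * len(d) / avg_len)
            score += idf * tf[term] * (K1 + 1) / norm
        scores.append(score)
    return scores


def search_tools(query, tools, limit=3):
    """Rank tools by BM25; fall back to substring matching if nothing scores."""
    if not tools:
        return []
    ranked = sorted(
        zip(bm25_scores(query, tools), (t["name"] for t in tools)), reverse=True
    )
    hits = [name for score, name in ranked if score > 0][:limit]
    if hits:
        return hits
    needle = query.lower()
    return [t["name"] for t in tools if needle in tool_text(t).lower()][:limit]
```

The fallback matters for partial words: a query such as `weath` matches no whole token, so BM25 scores every tool at zero, yet the substring check can still surface a tool whose name contains it.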
To prevent drift bugs, where stored information becomes desynchronized from the actual tool registry, the catalog uses a stateless approach and rebuilds the tool definition list every turn. During the actual tool execution phase, the bridge tools step aside, and the system applies guardrails and approval prompts directly to the actual tool names. Security verification and user approval are only triggered once a tool has been brought into the context via `load_tool` and an execution command is issued.
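The stateless rebuild can be pictured with a short sketch. Everything here is hypothetical: `build_turn_tools`, `needs_approval`, and the dictionary-based registry are illustrative names, not part of Hermes Agent. The point is that the visible tool list is derived from the live registry on every turn, and that approval checks key off the real tool name.

```python
BRIDGE_TOOLS = ("search_tools", "load_tool", "unload_tool")


def build_turn_tools(registry, loaded_names):
    """Rebuild the visible tool list from the live registry on each turn.

    Nothing is cached between turns, so a tool that has been removed
    from the registry simply stops appearing.
    """
    visible = [{"name": name} for name in BRIDGE_TOOLS]
    for name in sorted(loaded_names):
        schema = registry.get(name)
        if schema is not None:
            visible.append(schema)
    return visible


def needs_approval(tool_name, approval_required):
    """Guardrails are checked against the real tool name, not a bridge tool."""
    return tool_name in approval_required
```

Because the list is recomputed rather than stored, there is no second copy of the catalog that could fall out of sync with the registry.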
This operational flow follows a strict three-step sequence: search, load, and execute. The model first calls `search_tools` to identify candidates, then uses `load_tool` to bring the necessary schema into the context. Once the function is performed and the tool is no longer required, `unload_tool` removes it to reclaim space.
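The sequence can be sketched as a small session object. The class below is a hypothetical illustration, not Hermes Agent code: `ToolSession`, its registry layout, and the rule that an unloaded tool cannot run are assumptions, and a plain substring match stands in for the real search logic.

```python
class ToolSession:
    """Tracks which tools are currently loaded into the context."""

    def __init__(self, registry):
        # registry maps tool name -> {"description": str, "run": callable}
        self.registry = registry
        self.loaded = set()

    def search_tools(self, query):
        """Step 1: find candidate tools (substring match as a stand-in)."""
        q = query.lower()
        return [
            name
            for name, tool in self.registry.items()
            if q in name.lower() or q in tool["description"].lower()
        ]

    def load_tool(self, name):
        """Step 2: bring one tool's schema into the context."""
        if name not in self.registry:
            raise KeyError(f"unknown tool: {name}")
        self.loaded.add(name)

    def execute(self, name, **kwargs):
        """Step 3: run a tool, but only if it has been loaded."""
        if name not in self.loaded:
            raise PermissionError(f"{name} must be loaded before it can run")
        return self.registry[name]["run"](**kwargs)

    def unload_tool(self, name):
        """Cleanup: drop the schema to reclaim context space."""
        self.loaded.discard(name)


if __name__ == "__main__":
    session = ToolSession({
        "convert_currency": {
            "description": "Convert an amount using an exchange rate",
            "run": lambda amount, rate: amount * rate,
        }
    })
    candidate = session.search_tools("currency")[0]
    session.load_tool(candidate)
    print(session.execute(candidate, amount=10, rate=2))
    session.unload_tool(candidate)
```

At any moment the context holds only the three bridge tools plus whatever is in `loaded`, so its size tracks the task at hand rather than the size of the catalog.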
This structural change addresses a failure mode often described as decision paralysis. Internal MCP evaluations from Anthropic showed that when a massive tool catalog fills the context window, the model struggles to prioritize the correct action, often calling irrelevant tools. Once this noise was removed, the accuracy of the Opus 4 model jumped from 49 percent to 74 percent. This improvement demonstrates that the primary barrier to agentic performance is often not the model's raw intelligence, but the noise level within its immediate environment.
To manage this process, Hermes Agent provides a `tool_search` option within the `hermes.yaml` configuration file. This setting can be toggled between `auto`, `on`, and `off`, moving the burden of context calculation from the developer to the system.
tool_search: auto
In `auto` mode, the system monitors the context window in real time. If tool schemas occupy less than 10 percent of the active model's context window, the system operates in a pure pass-through mode, delivering all schemas to avoid the computational overhead and potential omission risks of a search step. However, the moment schema occupancy exceeds the 10 percent threshold, the system automatically activates the search layer to filter and deliver only the most relevant tools. This dynamic switch is particularly critical in environments where multiple MCP servers are connected and the number of available tools fluctuates.
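The mode switch reduces to a single comparison, sketched below. The function name `should_use_search` and its parameters are hypothetical, and treating an occupancy of exactly 10 percent as "search on" is an assumption, since the text only specifies behavior below and above the threshold.

```python
def should_use_search(mode, schema_tokens, context_window, threshold=0.10):
    """Decide whether the search layer should replace full schema pass-through.

    mode: "on", "off", or "auto", mirroring the `tool_search` setting.
    schema_tokens: tokens occupied by all tool schemas this turn.
    context_window: the active model's context window size in tokens.
    """
    if mode == "on":
        return True
    if mode == "off":
        return False
    if mode != "auto":
        raise ValueError(f"unsupported tool_search mode: {mode!r}")
    # Below the threshold, pass every schema through unchanged.
    return schema_tokens / context_window >= threshold
```

For example, with a hypothetical 200,000-token context window, 22,000 schema tokens is 11 percent of the window and would activate search, while 15,000 tokens (7.5 percent) would stay in pass-through mode.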
For enterprise environments, especially those dealing with fragmented API ecosystems like ERP or CRM systems, this optimization is a necessity. Connecting dozens of internal APIs via MCP servers without such a mechanism would lead to unsustainable operational costs and degraded performance. By implementing a stateless catalog and BM25-based retrieval, companies can maintain access to a vast library of internal tools without sacrificing the reasoning capabilities of the underlying model.
Ultimately, the 85 percent reduction in MCP tool overhead suggests that slimming down the tool interface can be more effective than simply increasing model parameters. By suppressing unnecessary context consumption, the system creates an environment where the model can focus entirely on the logic of the task. The real-world performance of an AI agent is no longer determined solely by the size of the model, but by the precision with which its tools are managed.




