Thousand Token Wood v2: Why Four SLMs Beat One Large Model

Developers have long assumed that a single high-performance model is sufficient for any multi-agent system. The logic seems sound: give a powerful LLM distinct persona prompts, and it should be able to simulate a diverse cast of characters. In practice, this often leads to behavioral homogenization. No matter how distinct the prompts, agents powered by the same underlying weights tend to converge on similar reasoning patterns, tones, and decision-making logic. They become echoes of the same model rather than independent actors.

The Multi-Lab Architecture of Thousand Token Wood v2

Thousand Token Wood v2 was built specifically to dismantle this assumption. Instead of relying on a single model with multiple masks, the development team integrated four distinct small language models (SLMs) from four different research labs into a single platform. The architecture deploys OpenAI's gpt-oss-20b, OpenBMB's MiniCPM3-4B, NVIDIA's Nemotron-Mini-4B, and a custom fine-tuned version of Qwen 0.5B. By mixing models that were trained on entirely different datasets and went through different post-training alignment processes, the team achieved what they call genuine difference.

This diversity transforms the simulation from a scripted play into a live market. Because the underlying cognitive frameworks differ, the agents do not simply change their speaking style; they change their strategy. In this financial ecosystem, an owl character's method of hoarding assets differs fundamentally from a fox character's approach to speculation. These behaviors emerge not from prompt instructions, but from the inherent biases and logical structures of the respective models. The result is a market driven by real-time arguments and conflicting logic rather than a pre-determined script.

Within this world, the user steps in as the Patron of the Wood, a shadow financier. The user executes interest-based loans, trades insider information, engages in short selling, and brokers alliances through bribery. To maintain tension, the system introduces a magistrate character who tracks the user's transactions. If the user leverages illicit information for profit, the magistrate detects the anomaly and initiates sanctions. This creates a dynamic loop of action and consequence where the agents' reactions are unpredictable and grounded in their specific model origins.

Structure Over Scale: Solving the SLM Reasoning Gap

Integrating four different models created an immediate engineering crisis, but it did not happen at the modeling level. The friction occurred at the serving layer. Because each model utilizes a different tokenizer and possesses unique formatting habits, the system suffered from frequent JSON malformations. One model might close a bracket prematurely, while another might introduce unexpected whitespace, causing the entire simulation to crash during data exchange.

To solve this, the team implemented a tolerant JSON parse-and-repair layer. Every output from every model must pass through this filter, which analyzes grammatical errors in real-time and attempts to reconstruct the JSON structure. If the data is structurally unsalvageable, the layer drops the packet entirely. This design prioritizes system continuity over perfect data retention; it is better to lose a single agent's turn than to crash the entire simulation. Once this layer was established, the cost of adding new models plummeted. The team no longer needed to refactor the codebase to accommodate a new model's quirks; they simply updated a config file.
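The parse-and-repair idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the team's actual layer: the function name `repair_json` and the specific repairs (skipping leading chatter, removing trailing commas, closing unbalanced brackets and strings) are hypothetical choices. Returning `None` stands in for dropping the packet.

```python
import json
import re
from typing import Optional


def repair_json(raw: str) -> Optional[dict]:
    """Try to recover one JSON object from raw model output.

    Returns None when the text is unsalvageable, which the caller
    treats as a dropped turn. (Hypothetical helper, not the real layer.)
    """
    # Skip any chatter the model emitted before the first opening brace.
    start = raw.find("{")
    if start == -1:
        return None
    text = raw[start:].strip()

    # First attempt: parse the first complete object and ignore trailing text.
    try:
        obj, _ = json.JSONDecoder().raw_decode(text)
        return obj if isinstance(obj, dict) else None
    except json.JSONDecodeError:
        pass

    # Second attempt: drop trailing commas before a closing bracket.
    # (Naive: this regex does not check whether the comma sits inside a string.)
    text = re.sub(r",\s*([}\]])", r"\1", text)

    # Track open brackets and string state so missing closers can be appended.
    stack = []
    in_string = False
    escaped = False
    for ch in text:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in "{[":
            stack.append("}" if ch == "{" else "]")
        elif ch in "}]":
            # A mismatched closer means the structure is not recoverable here.
            if not stack or stack.pop() != ch:
                return None
    if in_string:
        text += '"'
    text += "".join(reversed(stack))

    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return None
    return obj if isinstance(obj, dict) else None
```

A caller would skip the agent's turn whenever `repair_json` returns `None`, which matches the stated priority of keeping the simulation running over retaining every output.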

This shift reveals a critical insight into the nature of small language models: once their output passes through a repair layer, SLMs are dependable format generators, but they remain unreliable reasoners. The instinct of many developers is to solve this reasoning gap by scaling up to a larger model. Thousand Token Wood v2 makes the case that the gap is more efficiently closed through structural design and data flow control.

This philosophy extends to the system's security and memory management. In traditional agent design, developers often put secret information in the prompt and tell the model not to reveal it. For SLMs, this is essentially a hope-based security strategy. Because SLMs are prone to leaking any text present in their context window, the team moved all sensitive data, or hidden flags, outside the prompt entirely. These flags exist only in the player's ledger and are physically stripped from the data flow before any public event record is generated. To ensure this, the team implemented a mandatory scan for banned tokens across all agent prompts every turn, treating this as the most critical item in their test suite.
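One way to picture the two safeguards described above is the following sketch. The `LedgerEntry` schema, the flag name `USED_INSIDER_TIP`, and all function names are assumptions made for illustration; the article does not describe the real data model. The public record is built from an allowlist of fields, and a separate check scans every prompt for banned tokens.

```python
from dataclasses import dataclass, field


@dataclass
class LedgerEntry:
    """One player transaction (hypothetical schema).

    `hidden_flags` lives only in the ledger and must never reach a prompt.
    """
    actor: str
    action: str
    amount: int
    hidden_flags: set[str] = field(default_factory=set)


def to_public_event(entry: LedgerEntry) -> dict:
    """Build the public record from an explicit allowlist of fields,
    so hidden flags cannot ride along by accident."""
    return {"actor": entry.actor, "action": entry.action, "amount": entry.amount}


def find_banned_tokens(prompt: str, banned: set[str]) -> list[str]:
    """Return every banned token found in the prompt (case-insensitive)."""
    lowered = prompt.lower()
    return sorted(tok for tok in banned if tok.lower() in lowered)


def assert_prompts_clean(prompts: dict[str, str], banned: set[str]) -> None:
    """Raise if any agent prompt contains a banned token.

    Intended to run every turn and in the test suite.
    """
    for agent, prompt in prompts.items():
        leaked = find_banned_tokens(prompt, banned)
        if leaked:
            raise AssertionError(f"{agent} prompt leaks banned tokens: {leaked}")
```

Building the public event from an allowlist rather than deleting known-bad keys means a newly added secret field stays private by default.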

Memory management followed a similar structural approach to avoid prompt inflation. As conversation histories grow, SLMs often lose the thread of the original instructions, drowning in a sea of tokens. Instead of feeding the entire history back into the model, the system uses one-line bucketed summaries derived from integer-based emotion values. For example, instead of a long transcript of past arguments, a model receives a concise summary: "You feel warmth toward Oona and suspicion toward the Patron." This limits the token count while preserving the emotional valence necessary for consistent character behavior.
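A bucketed summary of this kind might be generated as follows. The score range (assumed here to run from -100 to 100), the bucket boundaries, and labels other than "warmth" and "suspicion" are hypothetical; the article only says that integer emotion values are mapped to a one-line summary.

```python
from typing import Optional

# Hypothetical buckets as (minimum score, label), checked from highest to lowest.
# Scores are assumed to run from -100 to 100.
BUCKETS = [
    (60, "devotion"),
    (20, "warmth"),
    (-19, None),        # near-neutral feelings are omitted to save tokens
    (-59, "suspicion"),
    (-100, "hostility"),
]


def bucket_label(score: int) -> Optional[str]:
    """Map an integer emotion score to a coarse label, or None if neutral."""
    for minimum, label in BUCKETS:
        if score >= minimum:
            return label
    return "hostility"  # anything below the assumed range counts as hostility


def summarize_feelings(scores: dict[str, int]) -> str:
    """Collapse per-character scores into one short line for the prompt."""
    parts = []
    for name, score in scores.items():
        label = bucket_label(score)
        if label is not None:
            parts.append(f"{label} toward {name}")
    if not parts:
        return "You feel nothing strong toward anyone."
    return "You feel " + " and ".join(parts) + "."


print(summarize_feelings({"Oona": 35, "the Patron": -40}))
# prints: You feel warmth toward Oona and suspicion toward the Patron.
```

Because the summary is recomputed from the stored integers each turn, its length depends on the number of characters an agent has feelings about, not on how long the conversation has run.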

Finally, the team implemented deterministic overrides to prevent the stochastic nature of SLMs from breaking the simulation's logic, for example an arch-enemy suddenly approving a loan. Core behaviors are not left to the model's probabilistic generation. If an agent's hostility level exceeds a certain threshold, the system mechanically forces a loan rejection or imposes predatory terms. The final behavior is a hybrid: the model provides the emergent tone and flavor, while the system structure enforces the logical boundaries.
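The override pattern can be sketched as a thin function applied after the model has produced its decision. The thresholds, the interest-rate floor, and the `LoanDecision` type are all assumptions for illustration; the article gives no concrete values.

```python
from dataclasses import dataclass

# Hypothetical values; the article does not state the real thresholds.
HOSTILITY_REJECT = 80      # at or above this, the loan is always refused
HOSTILITY_PREDATORY = 50   # at or above this, approved loans get harsh terms
PREDATORY_RATE = 0.40      # interest-rate floor applied to hostile lenders


@dataclass
class LoanDecision:
    approved: bool
    rate: float
    message: str  # flavor text written by the model


def apply_overrides(model_decision: LoanDecision, hostility: int) -> LoanDecision:
    """Enforce hard rules on top of whatever the model decided.

    The model's message is kept for tone; only the outcome is overridden.
    """
    if hostility >= HOSTILITY_REJECT:
        return LoanDecision(False, 0.0, model_decision.message)
    if hostility >= HOSTILITY_PREDATORY and model_decision.approved:
        harsh_rate = max(model_decision.rate, PREDATORY_RATE)
        return LoanDecision(True, harsh_rate, model_decision.message)
    return model_decision
```

In this sketch the model's message survives an override unchanged, so a forced rejection could carry friendly wording. A fuller version would probably regenerate or template the message after the outcome is fixed.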

By treating the model as a component of a larger machine rather than the machine itself, the developers transformed the limitations of small models into a design advantage. The diversity of the agents is a product of the heterogeneous model mix, and the stability of the system is a product of the rigid serving layer. The result is a complex, functioning society that requires far less compute than a single monolithic LLM would demand.
