North Mini Code Outperforms 120B Models in Coding Benchmarks

Developers building autonomous coding agents currently face a brutal trade-off between intelligence and infrastructure. To achieve the reasoning capabilities required for complex software engineering, teams typically deploy massive models with hundreds of billions of parameters. However, this brute-force approach leads to skyrocketing inference costs and latency that makes real-time terminal interaction nearly impossible. The industry has been waiting for a model that possesses the specialized knowledge of a giant but the operational footprint of a lightweight assistant.

The Architecture of Sparse Efficiency

Cohere has addressed this bottleneck with the release of North Mini Code, a model specifically engineered for agentic software engineering and terminal-based workflows. At its core, North Mini Code utilizes a Mixture-of-Experts (MoE) architecture with 30 billion total parameters, but it only activates 3 billion parameters per token. This sparse design allows the model to maintain a vast internal knowledge base while keeping the actual computational load during inference remarkably low. To ensure maximum accessibility for the developer community, Cohere has released the model under the Apache 2.0 license via Hugging Face, permitting full commercial use and modification.

The technical foundation is a decoder-only Transformer. The model employs 128 expert blocks, but the router only activates the top 8 experts for any given token. Each of these expert blocks consists of a Feed-Forward Network (FFN) using the SwiGLU activation function. To manage the routing process, a sigmoid activation function is applied before the top-k selection. To ensure the model maintains a baseline of general representation, Cohere placed a single dense layer before the sparse MoE layers.
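The routing step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not Cohere's implementation: the 128 experts and top-8 selection come from the article, while the function names, the random logits, and the renormalization of the selected scores are assumptions made for the example.

```python
import math
import random

NUM_EXPERTS = 128  # from the article
TOP_K = 8          # from the article


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def route_token(router_logits: list[float], top_k: int = TOP_K) -> dict[int, float]:
    """Return {expert_index: mixing_weight} for a single token."""
    # Sigmoid scores each expert independently; unlike softmax,
    # one expert's score does not depend on the others.
    scores = [sigmoid(z) for z in router_logits]
    # Keep the top_k highest-scoring experts.
    chosen = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    # Assumption: renormalize the selected scores so the weights sum to 1.
    total = sum(scores[i] for i in chosen)
    return {i: scores[i] / total for i in chosen}


rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]  # stand-in router output
weights = route_token(logits)
print(len(weights), "experts selected out of", NUM_EXPERTS)
```

Only the selected experts' FFNs would run for this token, which is where the gap between 30 billion total and 3 billion active parameters comes from.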

Memory management for long-context windows is handled through a hybrid attention mechanism. The model alternates between sliding window attention, which utilizes Rotary Positional Embedding (RoPE), and global attention without position embeddings in a 3:1 ratio. Because a sliding-window layer attends only to a fixed number of recent tokens, its key-value cache stops growing once the window is full. This configuration allows the model to process extensive codebases without the memory and compute growth that full attention in every layer would incur.
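A rough sketch of such a layer pattern follows. The 3:1 ratio is from the article; the layer count, the window size, the placement of the global layer at the end of each group of four, and all function names are assumptions chosen for illustration.

```python
def layer_schedule(num_layers: int, local_per_global: int = 3) -> list[str]:
    """Repeat `local_per_global` sliding-window layers, then one global layer."""
    period = local_per_global + 1
    return ["global" if (i + 1) % period == 0 else "sliding" for i in range(num_layers)]


def can_attend(query: int, key: int, kind: str, window: int) -> bool:
    """Causal mask: sliding layers see only the most recent `window` tokens."""
    if key > query:
        return False  # never attend to future tokens
    return kind == "global" or query - key < window


def cached_positions(seq_len: int, kind: str, window: int) -> int:
    """Number of key/value positions a layer must keep at a given sequence length."""
    return seq_len if kind == "global" else min(seq_len, window)


schedule = layer_schedule(8)   # hypothetical 8-layer model
window = 4096                  # hypothetical window size
hybrid = sum(cached_positions(131072, kind, window) for kind in schedule)
full = 131072 * len(schedule)
print(schedule)
print(f"hybrid cache is {hybrid / full:.1%} of an all-global cache at 128K tokens")
```

In this toy configuration only the global layers keep the full history, so the cache at 128K tokens is a small fraction of what an all-global stack would need.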

The training pipeline followed a rigorous three-stage progression. It began with two phases of Supervised Fine-Tuning (SFT), followed by Reinforcement Learning from Verifiable Rewards (RLVR). During the first SFT stage, 70% of the training tokens were dedicated to code, with 43% focusing on agent tool-use data and 27% on competitive and scientific programming. This mix was designed to cultivate not just syntax knowledge, but the logical reasoning required to navigate a file system and execute commands.
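The first-stage mix can be written down as a small configuration to make the arithmetic explicit. The 43% and 27% figures are from the article; the remaining 30% is simply the remainder, which the article does not break down, and the total token budget is a made-up number.

```python
# Percentages of first-stage SFT tokens.
SFT_STAGE1_MIX_PCT = {
    "agent_tool_use": 43,                          # from the article
    "competitive_and_scientific_programming": 27,  # from the article
    "other_unspecified": 30,                       # remainder, not detailed in the article
}


def split_budget(total_tokens: int, mix_pct: dict[str, int]) -> dict[str, int]:
    """Split a token budget according to integer percentages."""
    if sum(mix_pct.values()) != 100:
        raise ValueError("mix percentages must sum to 100")
    return {name: total_tokens * pct // 100 for name, pct in mix_pct.items()}


budget = split_budget(1_000_000, SFT_STAGE1_MIX_PCT)  # hypothetical budget
code_share = budget["agent_tool_use"] + budget["competitive_and_scientific_programming"]
print(budget, code_share)
```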

To expand the context window, Cohere implemented a long-to-longer cascade technique. The model was first trained at a 64K context length during the first SFT phase and then expanded to 128K in the second. This staged approach prevents data collisions and ensures the model maintains a consistent coding style across massive files. Interestingly, internal evaluations revealed that models trained with a 64K cutoff actually generated longer final trajectories than those trained on the full length distribution, suggesting that constrained training can sometimes improve the model's ability to sustain long-term goals.

The final RLVR stage served as the critical polish. By using verifiable rewards, Cohere targeted common agent failures such as hallucinated citations or incorrect tool calls. The team performed sample-level filtering to remove samples that exhibited structural generation errors, so that the model learned to follow instructions and call tools accurately rather than simply memorizing templates.
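To make the idea of a verifiable reward concrete, here is a minimal sketch under stated assumptions. The article does not describe Cohere's reward function or filtering rules; the `Rollout` fields, the binary all-tests-pass reward, and the well-formedness check are hypothetical stand-ins.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    """One agent attempt at a task (all fields are hypothetical)."""
    task_id: str
    tool_calls_well_formed: bool  # every tool call parsed successfully
    tests_passed: int
    tests_total: int


def verifiable_reward(rollout: Rollout) -> float:
    """Binary reward: 1.0 only when the task's full test suite passes."""
    if rollout.tests_total == 0:
        return 0.0  # nothing to verify, so no reward
    return 1.0 if rollout.tests_passed == rollout.tests_total else 0.0


def drop_malformed(rollouts: list[Rollout]) -> list[Rollout]:
    """Sample-level filter: discard rollouts with structural errors."""
    return [r for r in rollouts if r.tool_calls_well_formed]


rollouts = [
    Rollout("task-1", True, 12, 12),
    Rollout("task-2", True, 9, 12),
    Rollout("task-3", False, 12, 12),  # passes tests but emitted a broken tool call
]
kept = drop_malformed(rollouts)
rewards = {r.task_id: verifiable_reward(r) for r in kept}
print(rewards)
```

The point of the design is that the reward comes from an executable check rather than from a learned judge, so it cannot be satisfied by output that merely looks plausible.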

Breaking the Parameter Arms Race

The most striking aspect of North Mini Code is its ability to punch far above its weight class. In the Artificial Analysis coding index, North Mini Code scored 33.4. That score places it ahead of other models in its size bracket, including Qwen3.5 (35B-A3B), Gemma 4 (26B-A4B), and Devstral Small 2 (24B Dense).

More significantly, North Mini Code outperformed models roughly four times its size by total parameter count. It beat Nemotron 3 Super (120B-A12B), Mistral Small 4 (119B-A6B), and the 123B-parameter Devstral 2. This result challenges the assumption that parameter count is the primary driver of coding proficiency. It suggests that a sophisticated training strategy, combining sparse MoE with RLVR, can overcome a large deficit in raw scale.

This performance translates directly into real-world software engineering tasks. On the SWE-Bench Verified benchmark, North Mini Code achieved a pass@10 rate of 80.2%, meaning that at least one of 10 attempts solved the task in 80.2% of cases. In the Terminal-Bench v2, which measures the ability to perform agentic tasks within a command-line interface, it recorded a pass@10 rate of 55.1%. These numbers indicate that the model has a high potential for iterative problem solving: it can attempt a fix, observe the error in the terminal, and correct its course, which is the hallmark of a true AI software engineer.
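For readers unfamiliar with the metric, pass@k is the probability that at least one of k sampled attempts solves a task. The article does not say how Cohere computed it; the sketch below uses the widely used unbiased estimator introduced alongside the HumanEval benchmark, where n attempts are sampled per task and c of them succeed.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples with c successes."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset must contain a success
    # 1 minus the probability that a random size-k subset has no success.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical task: 20 attempts sampled, 3 of them passed the tests.
print(round(pass_at_k(20, 3, 1), 4))
print(round(pass_at_k(20, 3, 10), 4))
```

A benchmark-level score is then the average of this value over all tasks, which is why pass@10 is always at least as high as pass@1.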

Generalization is another area where North Mini Code excels. The model supports four distinct harness environments: SWE-Agent, which uses a specialized CLI with commands like `str_replace_editor`; mini-SWE-agent, which relies on standard bash output; OpenCode, which requires structured JSON responses for tools like `edit`, `grep`, and `todowrite`; and Terminus 2, which operates via plain-text chat.

Despite only 6% of the second-stage SFT data consisting of benchmark harness data, the model showed a 10% performance increase in the OpenCode harness through cross-transfer effects. In the mini-SWE-agent environment, it achieved a pass@1 rate of 61.0%. This suggests that the model has learned the underlying logic of tool interaction rather than just the syntax of a specific API. To ensure this robustness, Cohere used over 70,000 verifiable tasks extracted from more than 5,000 unique repositories, with strict deduplication against SWE-Bench and SWE-Bench-Pro sources.
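One simple way to picture deduplication against benchmark sources is a repository-level filter. The article does not describe the actual procedure, so the granularity (whole repositories), the normalization rules, and every repository name below are assumptions for illustration only.

```python
def normalize_repo(url_or_slug: str) -> str:
    """Reduce a clone URL or 'owner/name' slug to a lowercase 'owner/name'."""
    s = url_or_slug.strip().lower().rstrip("/")
    if s.endswith(".git"):
        s = s[:-4]
    s = s.replace(":", "/")  # handles both 'https://' and 'git@host:owner/name'
    parts = [p for p in s.split("/") if p]
    return "/".join(parts[-2:])


def drop_benchmark_repos(tasks: list[dict], benchmark_repos: set[str]) -> list[dict]:
    """Keep only tasks whose source repository is not used by a benchmark."""
    blocked = {normalize_repo(r) for r in benchmark_repos}
    return [t for t in tasks if normalize_repo(t["repo"]) not in blocked]


# Hypothetical data: names are placeholders, not real benchmark contents.
benchmark_repos = {"https://github.com/example-org/bench-lib.git"}
tasks = [
    {"id": 1, "repo": "example-org/bench-lib"},
    {"id": 2, "repo": "git@github.com:Example-Org/Bench-Lib.git"},
    {"id": 3, "repo": "https://github.com/other-org/training-lib/"},
]
clean = drop_benchmark_repos(tasks, benchmark_repos)
print([t["id"] for t in clean])
```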

For the enterprise, these capabilities unlock the possibility of high-performance, on-premise coding agents. The primary barrier to internal AI adoption is often the GPU memory requirement and the security risk of sending proprietary code to an external API. Because North Mini Code only activates 3B parameters per token, it can deliver 100B-level performance on comparatively modest hardware. All 30B parameters generally still need to be held in memory, so the savings come mainly from per-token compute and latency rather than from weight storage. This allows companies in highly regulated sectors, such as finance or manufacturing, to host the model on their own servers and further fine-tune it on their private codebases or domain-specific languages.
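A back-of-the-envelope calculation shows what the 30B-total, 3B-active split means for hardware. The parameter counts are from the article; the bytes-per-parameter figures and the rule of thumb of roughly 2 FLOPs per active parameter per token are common approximations, not published numbers for this model, and the estimate ignores attention, the KV cache, and activations.

```python
TOTAL_PARAMS = 30e9   # from the article
ACTIVE_PARAMS = 3e9   # from the article


def weight_memory_gb(params: float, bytes_per_param: float) -> float:
    """Memory needed just to hold the weights, in gigabytes."""
    return params * bytes_per_param / 1e9


def forward_flops_per_token(active_params: float) -> float:
    """Rule of thumb: about 2 FLOPs per active parameter per generated token."""
    return 2.0 * active_params


bf16_gb = weight_memory_gb(TOTAL_PARAMS, 2.0)   # 16-bit weights
int4_gb = weight_memory_gb(TOTAL_PARAMS, 0.5)   # assumed 4-bit quantization
compute_ratio = forward_flops_per_token(TOTAL_PARAMS) / forward_flops_per_token(ACTIVE_PARAMS)

print(f"weights: {bf16_gb:.0f} GB at 16-bit, {int4_gb:.0f} GB at 4-bit")
print(f"per-token compute vs. a dense 30B model: 1/{compute_ratio:.0f}")
```

Under these assumptions the memory footprint is set by the total parameter count, while the per-token compute is set by the active count.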

By combining the Apache 2.0 license with an efficient MoE architecture, Cohere has removed both the financial and legal barriers to deploying sophisticated coding assistants. The shift is clear: the competitive edge in AI coding is moving away from the size of the model and toward the precision of the training data and the efficiency of the architecture.

North Mini Code demonstrates that a 30B model can outclass a 120B giant when the training is focused on verifiable rewards and agentic workflows. The era of blind scaling is ending, replaced by a new standard of architectural efficiency.
