VibeThinker-3B Outperforms Gemini 3 Pro in Math Reasoning

The prevailing dogma in artificial intelligence has long been that scale is the primary driver of intelligence. For years, the industry operated under a simple, linear assumption: more parameters equal more cognitive capacity. To build a model capable of complex reasoning, developers believed they needed to construct digital brains with hundreds of billions of connections, requiring massive server farms and astronomical energy budgets. This race for size created a barrier to entry where only the wealthiest tech giants could compete at the frontier of reasoning.

The 3B Model Challenging the Giants

Sina Weibo's research team has just challenged this paradigm with the release of VibeThinker-3B. In a 14-page technical report published on arXiv, the team detailed a model with only 3 billion parameters that rivals or exceeds the reasoning capabilities of flagship systems hundreds of times its size. The most striking evidence appears in the AIME 2026 mathematics benchmark, where VibeThinker-3B achieved a score of 94.3. This figure does not just compete with other small models; it surpasses the 91.7 score recorded by Google's Gemini 3 Pro and matches the performance of DeepSeek V3.2, a behemoth boasting 671 billion parameters.

Performance gains extend further when the team implements Claim-Level Reliability Assessment, a technique that verifies the reliability of each step in the reasoning chain. With this addition, the model's score climbs to 97.1. Because of its compact architecture, VibeThinker-3B avoids the need for enterprise-grade H100 clusters for inference, making it capable of running on standard consumer laptops. This shift in efficiency has sparked a divide in the AI community. While many view it as a breakthrough in resource-efficient AI, skeptics argue that such high scores on specific benchmarks may be the result of data contamination or the model learning to game the test rather than developing genuine reasoning capabilities.
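The laptop claim can be sanity-checked with back-of-the-envelope arithmetic. The sketch below estimates weight storage only, assuming common precisions of 2 bytes per parameter for fp16, 1 for int8, and 0.5 for int4. It ignores the KV cache, activations, and runtime overhead, so real memory use is higher. The function name and the precision table are illustrative and do not come from the report.

```python
# Rough weight-only memory estimate for a dense parameter count.
# Assumption: storage = parameters x bytes per parameter; KV cache,
# activations, and framework overhead are deliberately ignored.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Return approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


# Parameter counts as quoted in the article.
models = {"3B model": 3e9, "671B model": 671e9}

for name, params in models.items():
    for precision in BYTES_PER_PARAM:
        gb = weight_memory_gb(params, precision)
        print(f"{name:>10} @ {precision}: {gb:8.1f} GB")
```

Under these assumptions the 3B model needs about 6 GB at fp16 and about 1.5 GB at int4, which fits in ordinary laptop memory, while a 671B-parameter model needs more than a terabyte at fp16 for the weights alone.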

The Logic of Parametric Compression

To explain how a 3B model can match or outperform a 671B model in mathematics, the researchers introduce the Parametric Compression-Coverage Hypothesis. This theory posits a fundamental difference between two types of AI capabilities: parameter-dense abilities and parameter-expansive abilities. Reasoning, particularly in fields with objectively verifiable answers like mathematics and coding, is categorized as a dense ability. The researchers argue that the logic required for these tasks can be compressed into a small number of highly efficient connections if the training is precise.

In contrast, open-domain knowledge (the vast array of facts and common sense required for general conversation) is an expansive ability. Storing such a diverse library of information requires a large number of parameters. The reported results are consistent with this distinction: while VibeThinker-3B dominates in math and code, it scores significantly lower than larger models on the GPQA-Diamond benchmark, which tests graduate-level science knowledge. This suggests that while reasoning can be compressed, a comprehensive world model still requires scale.

The model was built using Qwen2.5-Coder-3B as a foundation and refined through a rigorous four-stage training process. The team applied the Spectrum-to-Signal Principle to narrow the scope of training data and isolate the most critical signals for reasoning. To push the model beyond its current limits, they employed the MGPO (MaxEnt-Guided Policy Optimization) algorithm, which prioritizes the learning of problems that sit exactly at the boundary of the model's current capabilities. By focusing on the edge of its knowledge rather than reinforcing what it already knows, the model achieves a level of efficiency that belies its size.
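The idea of prioritizing boundary problems can be illustrated with a small sketch. The article does not reproduce the MGPO objective, so the weighting below is an assumption: each problem's sampled pass rate is compared with 0.5, the point of maximum uncertainty, and the problem is down-weighted exponentially in the KL divergence from that point. The names boundary_weight and lam are hypothetical and are not taken from the report.

```python
import math


def pass_rate(rewards):
    """Fraction of sampled answers that were correct (rewards are 0 or 1)."""
    return sum(rewards) / len(rewards)


def boundary_weight(p, lam=1.0, eps=1e-6):
    """Illustrative weight that peaks when the pass rate p equals 0.5.

    Assumption: weight = exp(-lam * KL(Bernoulli(p) || Bernoulli(0.5))).
    Problems the model always solves (p near 1) or never solves (p near 0)
    receive the smallest weight.
    """
    p = min(max(p, eps), 1.0 - eps)  # clamp to keep the logarithms finite
    kl = p * math.log(p / 0.5) + (1.0 - p) * math.log((1.0 - p) / 0.5)
    return math.exp(-lam * kl)


# Hypothetical correctness results for eight sampled answers per problem.
samples = {
    "always solved": [1, 1, 1, 1, 1, 1, 1, 1],
    "at the boundary": [1, 0, 1, 0, 0, 1, 1, 0],
    "never solved": [0, 0, 0, 0, 0, 0, 0, 0],
}

for name, rewards in samples.items():
    p = pass_rate(rewards)
    print(f"{name:>16}: pass rate {p:.2f}, weight {boundary_weight(p):.3f}")
```

In a reinforcement learning loop, a weight of this kind could scale each problem's contribution to the policy update, so that training signal concentrates on problems the model solves only some of the time. The larger lam is, the more sharply the emphasis falls on the boundary.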

The era of equating parameter count with intelligence is ending. VibeThinker-3B suggests that reasoning can be decoupled from general knowledge and compressed into a lightweight package. For developers and enterprises, the strategic priority is shifting from seeking the largest possible model to identifying the optimal compression ratio for a specific task. When the goal is a mathematically precise answer, a hyper-optimized small model is no longer just a cheaper alternative; it can be the superior one.
