Developers who serve millions of AI inference requests daily have been waiting for the moment when a small model can deliver advanced reasoning without the hardware tax. This week, Google's Gemma 4 aims squarely at that inflection point.

Four Sizes, One Benchmark Surprise

On April 1, Google released Gemma 4 in four configurations: Effective 2B (E2B), Effective 4B (E4B), a 26B Mixture of Experts (MoE) model, and a 31B Dense model. The 31B variant has already claimed third place among open models on the Arena AI text leaderboard, while the 26B MoE model sits at sixth. According to Google, Gemma 4 can compete with models up to 20 times its size. All models ship under the Apache 2.0 license with no commercial restrictions. Since the first generation launched, the Gemma family has accumulated over 400 million downloads and spawned more than 100,000 derivative models.

The Intelligence-Per-Parameter Shift

For years, model size was synonymous with performance. Gemma 4 introduces a different metric: intelligence per parameter. The 26B MoE model activates only 3.8 billion of its 26 billion total parameters during inference, keeping latency minimal. The 31B dense model, by contrast, uses all of its parameters for maximum quality and is optimized for fine-tuning. The smallest E2B and E4B models run fully offline on edge devices such as smartphones, the Raspberry Pi, and NVIDIA's Jetson Orin Nano. Google worked with the Pixel team, Qualcomm, and MediaTek to ensure these models deliver near-instant responses while conserving battery and RAM.
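The mechanism behind that parameter gap is Mixture-of-Experts routing: a learned gate picks a few experts per token, so only their weights are read. The sketch below is illustrative only; the expert count, top-k value, and gating function are placeholder assumptions, not Gemma 4's actual architecture.

```python
# Toy sketch of Mixture-of-Experts top-k routing, the mechanism that lets a
# model like the 26B MoE variant touch only a fraction of its weights per
# token. NUM_EXPERTS and TOP_K are hypothetical values for illustration.
import random

NUM_EXPERTS = 8   # hypothetical expert count
TOP_K = 2         # experts activated per token

def route(gate_scores: list[float], top_k: int = TOP_K) -> list[int]:
    """Return the indices of the top_k highest-scoring experts."""
    return sorted(range(len(gate_scores)),
                  key=lambda i: gate_scores[i],
                  reverse=True)[:top_k]

# One token's gating scores (in practice these come from a learned router).
scores = [random.random() for _ in range(NUM_EXPERTS)]
active = route(scores)

# Only TOP_K of NUM_EXPERTS expert weight blocks are read for this token,
# so compute and memory-bandwidth cost scale with active, not total, experts.
print(f"active experts: {sorted(active)} ({TOP_K}/{NUM_EXPERTS})")
```

Because cost scales with the active experts rather than the total, the 26B MoE model can serve tokens at roughly the latency of a ~4B dense model while retaining a much larger pool of learned weights.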

The practical difference for developers is immediate. The 31B model's bfloat16 weights fit on a single 80GB NVIDIA H100 GPU. Quantized versions can power IDE integrations, coding assistants, and agentic workflows on consumer-grade GPUs. Android developers can test compatibility with Gemini Nano 4 through the AICore Developer Preview. Yale University has already used Gemma for the Cell2Sentence-Scale project, which discovers cancer treatment pathways, and INSAIT built BgGPT, the first Bulgarian language model, on top of Gemma.
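The hardware claims above reduce to simple arithmetic: bfloat16 stores 2 bytes per parameter, and 4-bit quantization roughly 0.5. The sketch below checks the numbers, ignoring activation memory, KV cache, and quantization metadata overhead; the 4-bit figure is a generic assumption, not a Gemma-specific quantization scheme.

```python
# Back-of-the-envelope weight-memory math for the model sizes quoted above.
# bfloat16 = 2 bytes/param; 4-bit quantization ~= 0.5 bytes/param.
# Ignores activation memory, KV cache, and quantization metadata overhead.

def weight_gb(params_billions: float, bytes_per_param: float) -> float:
    """Approximate weight footprint in GB: params (B) x bytes per param."""
    return params_billions * bytes_per_param

print(f"31B bf16:  {weight_gb(31, 2):.1f} GB  (fits on one 80 GB H100)")
print(f"31B 4-bit: {weight_gb(31, 0.5):.1f} GB  (consumer-GPU territory)")
print(f"4B bf16:   {weight_gb(4, 2):.1f} GB   (edge-device scale)")
```

At 2 bytes per parameter the 31B model needs about 62 GB of weight memory, which is why it clears an 80 GB H100 with headroom, and a 4-bit variant at roughly 15.5 GB lands within reach of high-end consumer cards.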

The real shift is not just that smaller models are catching up — it's that the cost of entry for serious AI work is collapsing. When a 31B model fits on one GPU and a 4B model runs on a phone, the bottleneck moves from hardware access to developer imagination.