The race in generative AI has shifted from simple multimodal capabilities to the pursuit of native omni-modality. For months, the developer community has watched models attempt to stitch together disparate encoders for text, image, and audio, often resulting in a fragmented understanding of the world and massive computational overhead. The struggle is no longer just about whether a model can see or hear, but whether it can process these streams within a single, unified architecture without collapsing under the weight of its own memory requirements. This week, attention turned to a new open-source contender that aims to solve the memory bottleneck while providing the legal freedom necessary for enterprise adoption.
The Architecture of a 1.02 Trillion Parameter Omni-model
Xiaomi has introduced MiMo-V2.5, a native omni-modal model engineered specifically for agentic workflows. At its core, the model uses a Sparse Mixture of Experts (MoE) architecture, a design choice that lets the system maintain a massive knowledge base while keeping inference costs manageable. The Base version carries 310B total parameters but activates only 15B during any single inference pass. For users requiring higher reasoning capability, the Pro version scales significantly, with 1.02T total parameters and 42B active parameters per token.
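To make the sparse-activation idea concrete, here is a minimal top-k routing sketch in the spirit of an MoE layer. The expert count, hidden sizes, and value of k below are illustrative placeholders, not MiMo-V2.5's published hyperparameters.

```python
# Minimal sketch of sparse MoE routing: each token is processed by only the
# top-k experts, so the active parameter count per token stays far below the
# model's total parameter count. All sizes here are illustrative placeholders,
# not MiMo-V2.5's real hyperparameters.
import torch
import torch.nn as nn


class SparseMoELayer(nn.Module):
    def __init__(self, d_model=1024, d_ff=4096, n_experts=64, k=4):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.k = k

    def forward(self, x):                          # x: (tokens, d_model)
        scores = self.router(x)                    # (tokens, n_experts)
        weights, idx = scores.topk(self.k, dim=-1)
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):                 # route each token to its k chosen experts
            for e in idx[:, slot].unique():
                mask = idx[:, slot] == e
                out[mask] += weights[mask, slot, None] * self.experts[int(e)](x[mask])
        return out


x = torch.randn(8, 1024)
print(SparseMoELayer()(x).shape)  # torch.Size([8, 1024])
```

The same principle holds at the Pro scale: of the 1.02T total parameters, only the 42B belonging to the selected experts participate in each token's forward pass.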
The development of MiMo-V2.5 followed a rigorous five-stage pipeline: foundational text pre-training, projector warm-up, multimodal pre-training, supervised fine-tuning (SFT), and finally agentic post-training that pairs reinforcement learning (RL) with Multi-teacher On-Policy Distillation (MOPD). The process consumed 48T tokens in total and used FP8 mixed precision to maximize training throughput.
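Read as a schedule, the pipeline can be summarized in a simple ordered config. The stage names mirror the description above; per-stage token budgets are not broken out here, so only the 48T total is noted.

```python
# Schematic of the reported five-stage training pipeline (order matters).
TRAINING_PIPELINE = [
    "text_pretraining",        # foundational text pre-training
    "projector_warmup",        # warm up the modality projectors
    "multimodal_pretraining",  # joint text / image / audio pre-training
    "supervised_finetuning",   # supervised fine-tuning (SFT)
    "agentic_post_training",   # RL paired with Multi-teacher On-Policy Distillation (MOPD)
]
TOTAL_TOKENS = 48e12           # 48T tokens across the full pipeline
TRAINING_PRECISION = "fp8"     # FP8 mixed precision for throughput
```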
To handle diverse inputs, the model integrates a 729M-parameter Vision Transformer (ViT) as its vision encoder and a 261M-parameter MiMo-Audio-Tokenizer for audio processing. To address the latency issues common in large-scale models, Xiaomi implemented a three-layer Multi-Token Prediction (MTP) module. This module powers speculative decoding, letting the model propose multiple future tokens and verify them in parallel, which significantly accelerates generation. For deployment, the model is officially supported in vLLM, with FP8 quantization and parallel serving available through the SGLang framework.
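The speedup from the MTP module follows the standard speculative-decoding recipe: draft a few tokens cheaply, then have the main model verify them in a single parallel pass and keep the longest agreeing prefix. The sketch below illustrates that control flow; `draft_tokens` and `verify_parallel` are hypothetical stand-ins, not the actual MiMo or vLLM APIs.

```python
# Sketch of speculative decoding with a multi-token prediction (MTP) head.
# draft_tokens() and verify_parallel() are hypothetical stand-ins for the MTP
# module and the main model's parallel verification pass.
from typing import Callable, List


def speculative_step(
    prefix: List[int],
    draft_tokens: Callable[[List[int], int], List[int]],
    verify_parallel: Callable[[List[int], List[int]], List[int]],
    k: int = 3,                                  # a few tokens drafted ahead per step
) -> List[int]:
    """Extend `prefix` by every drafted token the main model agrees with."""
    draft = draft_tokens(prefix, k)              # cheap: MTP head proposes k future tokens
    target = verify_parallel(prefix, draft)      # one parallel pass scores all k positions
    accepted: List[int] = []
    for d, t in zip(draft, target):
        if d != t:                               # first disagreement: keep the main model's
            accepted.append(t)                   # token and stop accepting drafts
            break
        accepted.append(d)
    return prefix + accepted                     # between 1 and k tokens gained per step
```

Because every accepted draft token saves a full sequential decode step, the wall-clock gain grows with how often the MTP head and the main model agree.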
Solving the Memory Wall with Hybrid Attention
While many models claim large context windows, the reality is often a trade-off between window size and VRAM consumption. As the context grows, the KV-cache typically expands linearly, eventually exhausting GPU memory and making long-document analysis or extended video understanding prohibitively expensive. MiMo-V2.5 breaks this trend by introducing Hybrid Attention, a mechanism that blends Sliding Window Attention (SWA) and Grouped Attention (GA) in a 5:1 ratio.
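A rough way to picture the 5:1 arrangement is as an interleaved layer stack in which five sliding-window layers are followed by one full-context layer. The layout and mask construction below are illustrative assumptions, not Xiaomi's published implementation; the 128-token window matches the figure discussed next.

```python
# Illustrative 5:1 hybrid attention layout: five sliding-window (SWA) layers
# for every full-context layer. Layer count and assignment rule are examples.
import torch


def layer_types(n_layers: int, ratio: int = 5) -> list:
    # Every sixth layer attends over the full context; the rest use a sliding window.
    return ["global" if (i + 1) % (ratio + 1) == 0 else "swa" for i in range(n_layers)]


def attention_mask(seq_len: int, kind: str, window: int = 128) -> torch.Tensor:
    q = torch.arange(seq_len)[:, None]          # query positions
    k = torch.arange(seq_len)[None, :]          # key positions
    causal = k <= q
    if kind == "global":
        return causal                           # full causal attention
    return causal & (q - k < window)            # only the most recent `window` keys


print(layer_types(12))  # ['swa', 'swa', 'swa', 'swa', 'swa', 'global', 'swa', ...]
```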
By setting the window size to 128, Xiaomi has cut KV-cache storage to roughly one-sixth of what standard attention requires. This structural efficiency is what enables the model to support a massive context window: the Base version handles 256K tokens, while the Pro version extends this to 1M tokens. The model can therefore ingest entire codebases or hour-long video files without the memory crashes typically associated with long-context processing.
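The roughly sixfold saving follows from simple arithmetic: with five windowed layers per global layer, only one layer in six caches the full sequence, while the rest cache at most 128 positions. The back-of-the-envelope sketch below reproduces the estimate at a 256K context; the 60-layer depth is an assumed, illustrative value rather than MiMo-V2.5's actual layer count.

```python
# Back-of-the-envelope KV-cache estimate for the 5:1 hybrid attention scheme.
# n_layers is an illustrative depth, not MiMo-V2.5's actual layer count.
def kv_cache_reduction(context_len, n_layers=60, swa_ratio=5, window=128):
    n_global = n_layers // (swa_ratio + 1)            # 1 in 6 layers caches the full sequence
    n_swa = n_layers - n_global                       # the rest cache only `window` positions
    hybrid = n_global * context_len + n_swa * min(window, context_len)
    full = n_layers * context_len                     # baseline: every layer caches everything
    return full / hybrid


print(round(kv_cache_reduction(256_000), 1))  # 6.0 -> roughly a 6x smaller KV cache
```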
The technical leap is complemented by a strategic licensing decision. Unlike many open-weights models that impose restrictive commercial terms or user caps, MiMo-V2.5 is released under the MIT license. This allows developers to deploy, modify, and monetize the model without seeking explicit permission from Xiaomi. When combined with the native omni-modal architecture, which eliminates the need for costly data transformation between different modality encoders, the model becomes a highly efficient engine for complex agent workflows. Developers can now build agents that perceive and act across text, image, and audio in a single latent space, reducing the latency and error rates inherent in multi-step pipeline architectures. Detailed model specifications and the associated datasets are available via the official GitHub repository.
MiMo-V2.5 positions itself as a practical alternative to closed-source omni-models by pairing a 1M token window with a truly permissive license.



