The modern developer experience with large language models is defined by a persistent, rhythmic tension: the waiting game. Even with the most powerful frontier models, there is a palpable gap between the speed of human thought and the speed of token generation. We have grown accustomed to the streaming cursor, a digital heartbeat that reminds us the AI is thinking, but for high-stakes applications like real-time trading or autonomous coding agents, this latency is a wall. The industry has long accepted a brutal trade-off where increasing a model's parameter count to improve intelligence inevitably degrades its responsiveness. Until now, the dream of a trillion-parameter model that responds as fast as a local script has remained out of reach.
The Architecture of Extreme Throughput
Xiaomi has challenged this trade-off with the release of MiMo-V2.5-Pro-UltraSpeed, a model that pushes the boundaries of inference efficiency. The headline figure is striking: the system sustains a decoding speed of 1000 tokens per second (tps) on a model with 1 trillion parameters, and peak real-time measurements have reached approximately 1200 tps. Crucially, this performance does not depend on proprietary or exotic hardware. Xiaomi reports these numbers on a single standard 8-GPU node built from commodity GPUs, which suggests that the bottleneck is often software and system design rather than raw silicon.
Access to this capability is currently managed through a strict, application-based promotional window. The API is available from June 9, 2026, to June 23, 2026, at 23:59 (UTC+8), and is restricted to approved users. From a resource perspective, the UltraSpeed variant represents a strategic shift in cost-benefit analysis. While it incurs approximately 3 times the cost of the standard MiMo-V2.5-Pro, it delivers a 10-fold increase in generation speed. This version is provided exclusively via API, meaning traditional token plans are not supported for this specific high-speed tier.
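To make the trade-off concrete, here is a back-of-the-envelope sketch. It assumes the 10-fold figure means the standard tier decodes at roughly 100 tps against UltraSpeed's 1000 tps; the article does not state the standard tier's speed directly, so treat that value as an inference. Prices are expressed only as a ratio, since no absolute pricing is given.

```python
ULTRA_TPS = 1000.0                  # decoding speed quoted in the article
SPEEDUP = 10.0                      # "10-fold" figure quoted in the article
STANDARD_TPS = ULTRA_TPS / SPEEDUP  # assumption: implied standard-tier speed
COST_MULTIPLIER = 3.0               # UltraSpeed cost relative to the standard tier


def compare(num_tokens: int) -> dict:
    """Compare wall-clock decode time for one response on both tiers."""
    standard_s = num_tokens / STANDARD_TPS
    ultra_s = num_tokens / ULTRA_TPS
    return {
        "standard_seconds": standard_s,
        "ultra_seconds": ultra_s,
        "seconds_saved": standard_s - ultra_s,
        "relative_cost": COST_MULTIPLIER,
    }


if __name__ == "__main__":
    print(compare(2000))
```

Under these assumptions, a 2,000-token response drops from about 20 seconds to about 2 seconds of decoding, at three times the per-token cost. Whether that is worthwhile depends on how much the application values the saved time.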
For those seeking access, applications are processed through the official platform at platform.xiaomimimo.com/ultraspeed, with priority given to professional developers and enterprises with documented business needs. Once approved, users gain a two-week free trial of the Chat interface at ultraspeed.xiaomimimo.com. To manage the immense demand on the 8-GPU nodes, the trial accounts are limited to 10 queue entries per day, with sessions capped at 30 minutes and an automatic timeout after 5 minutes of inactivity.
Breaking the Memory Wall through Codesign
To understand how a 1T model can move this fast, one must look at the codesign between the model architecture and the inference engine. The primary enemy of LLM decoding speed is memory bandwidth: every generated token requires streaming the active weights from GPU memory, so smaller weights translate directly into faster decoding. To combat this, Xiaomi implemented FP4 quantization using the MXFP4 format. However, applying aggressive 4-bit quantization uniformly across a trillion parameters typically damages the model's ability to handle complex logic and coding tasks. The solution was a selective approach to quantization within the Mixture of Experts (MoE) architecture. Xiaomi found that the expert modules are more resilient to quantization than the shared attention or routing layers. By applying FP4 only to the experts, keeping the rest of the network at higher precision, and refining the result with Quantization-Aware Training (QAT), they reduced the model's memory footprint while preserving the capability of the original weights.
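The block-scaled idea behind MXFP4 can be illustrated with a minimal pure-Python sketch. It assumes the layout described in the OCP Microscaling format: blocks of 32 values share one power-of-two scale, and each value is stored as a 4-bit E2M1 number. The function names are hypothetical, tie-breaking is simplified, and nothing here reflects Xiaomi's actual kernels or QAT procedure.

```python
import math

# Magnitudes representable by a 4-bit E2M1 element (the sign is a separate bit).
FP4_E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 32   # values that share one scale
ELEM_EMAX = 2     # largest E2M1 exponent: 6.0 == 1.5 * 2**2

# 4 bits per element plus one 8-bit shared scale amortized over the block.
MXFP4_BITS_PER_WEIGHT = 4 + 8 / BLOCK_SIZE


def quantize_block(block):
    """Return (shared_exponent, quantized_values) for one block of floats."""
    max_abs = max(abs(x) for x in block)
    if max_abs == 0.0:
        return 0, [0.0] * len(block)
    # Choose a power-of-two scale so the largest value lands near the top
    # of the E2M1 range.
    shared_exp = math.floor(math.log2(max_abs)) - ELEM_EMAX
    scale = 2.0 ** shared_exp
    quantized = []
    for x in block:
        magnitude = min(abs(x) / scale, FP4_E2M1_VALUES[-1])  # saturate at 6.0
        # Simplification: nearest value, first match wins on ties.
        nearest = min(FP4_E2M1_VALUES, key=lambda v: abs(v - magnitude))
        quantized.append(math.copysign(nearest, x))
    return shared_exp, quantized


def dequantize_block(shared_exp, quantized):
    """Recover approximate floats from the shared exponent and FP4 values."""
    scale = 2.0 ** shared_exp
    return [q * scale for q in quantized]


if __name__ == "__main__":
    weights = [0.03 * i - 0.4 for i in range(BLOCK_SIZE)]
    exp, q = quantize_block(weights)
    restored = dequantize_block(exp, q)
    print("shared exponent:", exp)
    print("max abs error:", max(abs(a - b) for a, b in zip(weights, restored)))
    print("bits per weight:", MXFP4_BITS_PER_WEIGHT)
```

At 4.25 bits per weight, a quantized tensor occupies a little over a quarter of its 16-bit size, which is where the bandwidth savings come from. The rounding error visible in the output is what techniques such as QAT are meant to help the model tolerate.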
This hardware-aware quantization is paired with DFlash Speculative Decoding. Traditional speculative decoding relies on a small draft model that proposes tokens autoregressively, one at a time, before the target model verifies the whole proposal in a single pass. DFlash removes the sequential drafting step by filling an entire masked block in a single forward pass. To make this viable, Xiaomi used the Muon second-order optimizer and model self-distillation to strip away the overhead typically associated with the draft stage. They also integrated Sliding Window Attention (SWA), which bounds the per-token cost of attention by the window size instead of letting it grow with context length. By limiting the block size to 8, they minimized the verification overhead while maintaining high acceptance rates. The effectiveness of this approach is visible in the average acceptance lengths across different domains: Coding reaches 6.30 (peaking at 7.14), Math and Reasoning sit at 5.56, and Agentic tasks average 4.29.
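The draft-then-verify loop can be sketched with a toy example. This is not DFlash itself: the draft and target "models" are hypothetical stand-in functions, verification is greedy, and the per-position loop stands in for what a real system does in one batched forward pass. The only value taken from the article is the block size of 8.

```python
from typing import Callable, List, Tuple

BLOCK_SIZE = 8  # block size quoted in the article


def verify_block(prefix: List[int], draft: List[int],
                 target_next: Callable[[List[int]], int]) -> List[int]:
    """Accept the longest draft prefix the target agrees with, plus one
    token from the target itself (a correction or a bonus token)."""
    accepted: List[int] = []
    for token in draft:
        expected = target_next(prefix + accepted)
        if token != expected:
            accepted.append(expected)  # the target's correction ends the block
            return accepted
        accepted.append(token)
    accepted.append(target_next(prefix + accepted))  # bonus after a full match
    return accepted


def speculative_generate(prefix: List[int],
                         draft_block: Callable[[List[int], int], List[int]],
                         target_next: Callable[[List[int]], int],
                         num_tokens: int) -> Tuple[List[int], int]:
    """Generate num_tokens tokens; return them with the verification count."""
    out = list(prefix)
    steps = 0
    while len(out) - len(prefix) < num_tokens:
        draft = draft_block(out, BLOCK_SIZE)
        out.extend(verify_block(out, draft, target_next))
        steps += 1
    return out[len(prefix):len(prefix) + num_tokens], steps


# Toy target: counts upward modulo 10.
def toy_target(seq: List[int]) -> int:
    return (seq[-1] + 1) % 10


# Toy draft: same rule, but it wrongly emits 0 whenever the answer is 7.
def toy_draft(seq: List[int], k: int) -> List[int]:
    block, t = [], seq[-1]
    for _ in range(k):
        t = (t + 1) % 10
        if t == 7:
            t = 0
        block.append(t)
    return block


if __name__ == "__main__":
    tokens, steps = speculative_generate([0], toy_draft, toy_target, 20)
    print(tokens, "verification steps:", steps)
```

The output matches what the target would produce on its own, but it takes far fewer target passes than tokens. The average number of tokens gained per pass is the acceptance length, which is why the figures above translate so directly into throughput.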
On the system level, the engine is powered by TileRT, a low-latency inference runtime designed to eliminate the execution gaps and operator boundaries that plague standard frameworks. The core innovation here is the Persistent Engine Kernel. Instead of launching kernels one after another, each with its own launch overhead, the pipeline remains resident on the GPU, allowing data movement and computation to overlap almost perfectly. This is further optimized through Warp Specialization, which assigns communication, data movement, and tensor operations to separate groups of warps. The result is a GPU that behaves less like a general-purpose processor and more like a finely tuned, heterogeneous execution system in which idle cycles are kept to a minimum.
For developers who want to verify these claims or implement similar low-latency pipelines, Xiaomi has released the `MiMo-V2.5-Pro-FP4-DFlash` checkpoint on Hugging Face. This resource includes the FP4 quantized weights and the specific DFlash parameters required to run the model within a TileRT environment.
Cutting inference latency this far changes the role of the LLM from a tool we query into something closer to an extension of our own cognitive process. When a 1T model generates 1000 tokens per second, a single response is no longer the natural unit of work. Developers can run Best-of-N sampling or complex Tree Search algorithms in real time, allowing the model to explore dozens of reasoning paths, self-correct, and verify its own logic in the background before the user sees the first word. This opens the door to a new class of high-frequency AI applications, from millisecond-level quant trading signals and real-time fraud prevention to surgical assistants that can analyze medical imaging and predict risks in the middle of a procedure without a perceptible pause.
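A minimal Best-of-N sketch shows why raw decoding speed matters for this pattern. The `generate` and `score` callables are hypothetical placeholders for a model call and a verifier. The budget helper assumes candidates are decoded one after another on a single stream, which is a simplification.

```python
from typing import Callable, List


def max_candidates(latency_budget_s: float, tokens_per_candidate: int,
                   tokens_per_second: float) -> int:
    """How many full candidates fit in the budget if decoded sequentially."""
    return int(latency_budget_s * tokens_per_second // tokens_per_candidate)


def best_of_n(prompt: str,
              generate: Callable[[str, int], str],
              score: Callable[[str], float],
              n: int) -> str:
    """Sample n candidates and keep the one the scorer rates highest."""
    candidates: List[str] = [generate(prompt, seed) for seed in range(n)]
    return max(candidates, key=score)


# Toy stand-ins: each "sample" is a guess, and the "verifier" checks it.
def toy_generate(prompt: str, seed: int) -> str:
    return str(seed)


def toy_score(candidate: str) -> float:
    return -abs(int(candidate) ** 2 - 49)  # closer to sqrt(49) scores higher


if __name__ == "__main__":
    n_fast = max_candidates(2.0, 200, 1000.0)  # article's UltraSpeed figure
    n_slow = max_candidates(2.0, 200, 100.0)   # assumed 10x slower baseline
    print("candidates within 2 s:", n_fast, "vs", n_slow)
    print("best answer:", best_of_n("What is sqrt(49)?", toy_generate,
                                    toy_score, n_fast))
```

With a two-second budget and 200-token candidates, the faster tier leaves room for ten attempts where the assumed baseline allows only one. That difference is what turns search-style inference from a batch technique into an interactive one.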
This shift suggests a future where the intelligence of the largest models is finally decoupled from the frustration of the wait.




