Engineers scaling long-context large language models currently live in a state of constant tension with the network wall. The industry standard for high-performance inference relies on a fragile balance between the prefill stage, where the model processes input tokens, and the decode stage, where it generates the response. To optimize this, teams try to decouple these stages, but they hit a physical limit: the Key-Value Cache (KVCache). This memory, which stores the computations of previous tokens, grows so massive that moving it between servers becomes a bottleneck. This forces developers into a costly architectural prison, requiring expensive Remote Direct Memory Access (RDMA) networks and restricting all operations to a single data center or even a single server rack.
The PrfaaS Framework and Heterogeneous Hardware Scaling
Moonshot AI and researchers from Tsinghua University are attempting to break this hardware dependency with PrfaaS, or Prefill-as-a-Service. This cross-datacenter serving architecture fundamentally changes how inference is distributed by offloading the computationally heavy prefill tasks to a dedicated high-performance computing cluster. Once the prefill is complete, the system transmits the resulting KVCache to a local decode cluster over standard Ethernet, the ubiquitous wired network that forgoes RDMA's specialized low-latency hardware. In a case study involving a 1 trillion parameter hybrid model, this approach yielded a 54% increase in inference throughput compared to traditional single-cluster configurations.
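The request path can be summarized in three stages. The sketch below is illustrative only, with all function names and the placeholder cache stubbed in; it is not the paper's actual API.

```python
# Minimal sketch of the PrfaaS request path with stubbed-out cluster calls.
# All names and payloads here are illustrative, not the system's real interface.

def remote_prefill(prompt_tokens):
    """Stub: the remote prefill cluster processes the prompt and emits a KVCache."""
    return b"\x00" * (len(prompt_tokens) * 256)  # placeholder cache bytes

def transfer_over_ethernet(kvcache):
    """Stub: ship the cache over plain TCP links instead of RDMA."""
    return kvcache

def local_decode(kvcache):
    """Stub: the local decode cluster generates tokens from the received cache."""
    return "<generated response>"

def serve(prompt_tokens):
    cache = remote_prefill(prompt_tokens)   # compute-heavy stage, runs remotely
    cache = transfer_over_ethernet(cache)   # standard Ethernet, no RDMA required
    return local_decode(cache)              # latency-sensitive stage, runs locally
```

The key property is that only the second stage touches the wide-area network, which is why the KVCache transfer rate becomes the number that decides feasibility.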
This performance jump is not merely a result of throwing more hardware at the problem. When the researchers normalized for hardware cost, PrfaaS still delivered a 15% throughput advantage. The remaining gains come from a strategic, heterogeneous hardware deployment: the architecture places H200 GPUs in the prefill cluster to handle the compute-intensive initial pass and H20 GPUs in the decode cluster for token generation. This setup was tested across a suite of high-capacity models, including Kimi Linear, MiMo-V2-Flash, Qwen3.5-397B, and Ring-2.5-1T, demonstrating that the approach scales to some of the largest models currently deployed.
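One way to read the pair of numbers, assuming "normalized" means throughput per unit of hardware cost, is that they imply how much more expensive the heterogeneous deployment is:

```python
# Implied hardware cost premium, derived from the article's two throughput
# figures. Assumes cost normalization means throughput per unit of cost.
raw_gain = 1.54          # 54% raw throughput increase over single-cluster
normalized_gain = 1.15   # 15% advantage after cost normalization
implied_cost_ratio = raw_gain / normalized_gain
print(f"implied hardware cost premium: {implied_cost_ratio:.2f}x")  # 1.34x
```

In other words, the heterogeneous cluster would cost roughly a third more than the baseline, yet still wins after that premium is paid.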
Hybrid Attention as the Catalyst for Network Migration
For years, transmitting KVCache over standard Ethernet was considered impractical because dense models using Grouped Query Attention (GQA) produce caches that are simply too large. The MiniMax-M2.5 model illustrates the limitation: a 32K-token request generates KVCache at approximately 60 Gbps on a single 8xH200 instance. That rate would saturate a standard Ethernet link, causing the system to collapse under its own weight. The breakthrough that makes PrfaaS viable is the shift toward hybrid attention stacks.
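The back-of-envelope arithmetic behind such a figure is simple: KVCache size per token times prefill throughput gives the transfer rate the network must sustain. The configuration below is a hypothetical GQA setup chosen for illustration, not MiniMax-M2.5's actual dimensions.

```python
def kvcache_bytes_per_token(n_layers, n_kv_heads, head_dim, dtype_bytes=2):
    # K and V each store n_kv_heads * head_dim values per layer (fp16 = 2 bytes)
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes

def transfer_rate_gbps(bytes_per_token, prefill_tokens_per_sec):
    # bits per second the network must carry to keep up with prefill
    return bytes_per_token * prefill_tokens_per_sec * 8 / 1e9

# Hypothetical GQA config and prefill speed, for illustration only
per_tok = kvcache_bytes_per_token(n_layers=60, n_kv_heads=8, head_dim=128)
rate = transfer_rate_gbps(per_tok, prefill_tokens_per_sec=30_000)
print(f"{per_tok / 1024:.0f} KiB/token -> {rate:.1f} Gbps")  # 240 KiB/token -> 59.0 Gbps
```

Even modest assumptions land in the tens of gigabits per second for a dense GQA model, which is why the cache itself had to shrink before cross-datacenter transfer became realistic.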
By integrating linear-complexity layers such as Kimi Delta Attention (KDA), Multi-head Latent Attention (MLA), and Sliding Window Attention (SWA), the researchers significantly reduced the rate at which the KVCache grows with sequence length. The results are stark. The MiMo-V2-Flash model, when processing 32K tokens, generates KVCache at only 4.66 Gbps, a 13x reduction compared to MiniMax-M2.5. The Ring-2.5-1T model goes further: MLA provides 4.5x per-layer compression, and a 7:1 hybrid ratio (seven linear-attention layers for every full-attention layer) means only one layer in eight keeps a sequence-length-proportional cache, for a combined memory reduction of 36x. For a 1 trillion parameter model, the KVCache transmission rate drops to 3.19 Gbps, a volume that modern inter-datacenter Ethernet links can handle with ease.
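The 36x figure falls out of the two ingredients directly, assuming the linear-attention layers' fixed-size recurrent state is negligible next to a full-attention layer's growing KVCache:

```python
# Reconstructing Ring-2.5-1T's 36x reduction from its two stated ingredients.
# Assumes linear-attention layers contribute negligible sequence-length-
# proportional cache compared to full-attention layers.
mla_compression = 4.5      # MLA shrinks each full-attention layer's cache 4.5x
hybrid_ratio = 7           # 7 linear layers per 1 full-attention layer
layer_fraction = 1 / (hybrid_ratio + 1)  # only 1 in 8 layers stores a growing cache
total_reduction = mla_compression / layer_fraction
print(total_reduction)  # 36.0
```

The two techniques multiply because they act on different axes: MLA compresses within each retained layer, while the hybrid ratio reduces how many layers retain a cache at all.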
To manage this flow in production, PrfaaS implements length-based threshold routing. The system evaluates the additional prefill length, excluding any cached prefixes; if this length exceeds a threshold t, the request is routed to the PrfaaS cluster, and otherwise it is processed locally. In the researchers' case study, the optimal threshold was 19.4K tokens, which routed roughly 50% of all requests, specifically the long-context ones, to the specialized prefill cluster. To further smooth the flow, the team applied layer-wise prefill pipelining and a multi-connection TCP transmission mechanism to minimize network congestion and keep data streaming steadily.
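The routing rule itself reduces to a one-line comparison. The function names and the prefix-cache accounting below are illustrative; only the 19.4K threshold comes from the case study.

```python
def additional_prefill_length(prompt_len: int, cached_prefix_len: int) -> int:
    """Prefill work that still needs to run after prefix-cache hits."""
    return prompt_len - cached_prefix_len

def route(prompt_len: int, cached_prefix_len: int = 0,
          threshold: int = 19_400) -> str:
    """Send requests with more than `threshold` uncached prefill tokens
    to the remote PrfaaS cluster; handle the rest locally."""
    extra = additional_prefill_length(prompt_len, cached_prefix_len)
    return "prfaas" if extra > threshold else "local"

print(route(32_000))          # long prompt -> prfaas
print(route(32_000, 20_000))  # mostly cached -> local
print(route(19_400))          # exactly at threshold -> local
```

Note that a long prompt with a warm prefix cache stays local: the rule prices the work still to be done, not the nominal prompt length.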
The bottleneck of LLM serving has shifted from the physical limitations of the network to the architectural efficiency of the model itself.




