Imagine a developer deploying a sophisticated coding agent into a production environment. In local tests, the model is flawless, producing clean, executable code. But as the user base scales to millions, the system begins to glitch. Suddenly, the agent outputs strings of nonsensical characters—digital gibberish that looks like an alien language—or falls into an infinite loop, repeating the same word hundreds of times. These failures are ghost-like, appearing only under the crushing pressure of high concurrency, vanishing the moment a developer tries to replicate them in a controlled sandbox.
The Architecture of a High-Concurrency Failure
Z.ai encountered this exact scenario while operating a coding agent based on the GLM-5 large language model. Serving hundreds of millions of requests daily, the team noticed a pattern of corrupted outputs that surfaced only in high-concurrency environments and long-context windows. After an intensive investigation, they traced the issue to a race condition within the KV Cache, the memory region where a model stores the attention keys and values of already-processed tokens so it does not have to re-process the entire prompt for every new token.
The technical fallout was significant. The team discovered that when multiple requests hit the system simultaneously, data within the KV Cache was becoming entangled across requests. By identifying and patching the primary race condition, Z.ai reduced the rate of abnormal outputs from 0.1% to less than 0.03%. But the engineering effort extended beyond bug fixing. By contributing a critical update to the SGLang framework via PR #22811 and implementing a memory optimization called LayerSplit, the team also achieved a major leap in performance: across context windows from 40K to 120K tokens, the throughput improvement grew from a modest 10% to a staggering 132%.
From Regex to Speculative Decoding Metrics
Identifying these failures in real time presented a unique challenge. Traditionally, teams rely on regular expressions to find malformed patterns or deploy a second, smaller LLM to classify whether an output is "garbage." Both methods are flawed; regex is too rigid to catch all forms of AI hallucination, and using another model is prohibitively expensive at a scale of millions of requests. Z.ai pivoted to a more elegant solution by repurposing the metrics from Speculative Decoding.
Speculative Decoding works by using a small, fast draft model to predict the next few tokens, which a larger, more capable model then verifies. Z.ai realized that the verification process itself is a perfect health monitor for the system. They focused on two key metrics: spec_accept_length and spec_accept_rate. If the larger model is forced to rewrite almost everything the draft model suggests, the spec_accept_length drops. This indicates that the model is likely hallucinating or that the memory has been corrupted, leading to "gibberish" output. Conversely, if the spec_accept_rate is unnaturally high, it often signals that the model has entered a repetitive loop, simply confirming the same token over and over.
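To make the mechanics concrete, here is a minimal sketch of how these two signals could be computed from per-step verification results. The SpecStep structure and its field names are illustrative assumptions based on what the metric names imply, not SGLang's actual API:

```python
from dataclasses import dataclass

@dataclass
class SpecStep:
    """One speculative-decoding verification step (illustrative structure)."""
    drafted: int   # tokens proposed by the draft model in this step
    accepted: int  # tokens the target model verified and kept

def spec_accept_length(steps: list[SpecStep]) -> float:
    """Average number of draft tokens accepted per verification step."""
    return sum(s.accepted for s in steps) / max(len(steps), 1)

def spec_accept_rate(steps: list[SpecStep]) -> float:
    """Fraction of all drafted tokens that the target model accepted."""
    drafted = sum(s.drafted for s in steps)
    return sum(s.accepted for s in steps) / max(drafted, 1)
```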
To operationalize this, the team implemented a monitoring strategy that triggers once a generation exceeds 128 tokens. If these speculative metrics cross a specific threshold, the system immediately terminates the generation and restarts the request. This transforms a latent architectural flaw into a detectable event, ensuring the user never sees the corrupted output.
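A hypothetical guardrail along these lines might look as follows. The 128-token floor comes from the article; the two thresholds are illustrative placeholders, not Z.ai's production values:

```python
MIN_TOKENS_BEFORE_CHECK = 128   # from the article: only judge past 128 tokens
MIN_ACCEPT_LENGTH = 0.5         # assumed: very short accept length -> gibberish
MAX_ACCEPT_RATE = 0.99          # assumed: near-total acceptance -> repeat loop

def should_abort(tokens_generated: int,
                 accept_length: float,
                 accept_rate: float) -> bool:
    """Decide whether to kill and re-enqueue a generation mid-stream."""
    if tokens_generated <= MIN_TOKENS_BEFORE_CHECK:
        return False  # too early: speculative stats are still noisy
    if accept_length < MIN_ACCEPT_LENGTH:
        return True   # target model rejects nearly every draft token
    if accept_rate > MAX_ACCEPT_RATE:
        return True   # target model rubber-stamps everything: likely a loop
    return False
```

In a serving loop, a True result would abort the stream and re-enqueue the request, so the client only ever receives the retried, healthy output.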
Solving the PD Separation and HiCache Bottlenecks
The root of the memory corruption lay in the Prefill-Decode (PD) separation architecture. In this setup, the Prefill stage (which processes the initial input) and the Decode stage (which generates the response) are handled by different hardware devices to maximize efficiency. The problem arose during request cancellations. When a timeout occurred, the Decode side would cancel the request and reclaim the memory. However, the cancellation signal often reached the Prefill side too late.
This created a scenario akin to a restaurant where a waiter clears a table because a customer left, but the kitchen—unaware of the cancellation—continues to cook the meal and delivers it to that same table, accidentally dumping food on a new customer's plate. In technical terms, the system was reusing memory addresses before the previous operations were fully cleared. Z.ai resolved this by restructuring the sequence of operations, ensuring that memory is only reused after a definitive safety signal confirms that the RDMA (Remote Direct Memory Access) write operation has completely finished.
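The shape of the fix can be sketched as a block allocator that quarantines freed blocks until the in-flight write is confirmed. This is a toy model of the described ordering, assuming a hypothetical KVBlockPool rather than SGLang's actual internals:

```python
import threading

class KVBlockPool:
    """Toy allocator showing the corrected ordering: a freed KV block becomes
    reusable only after the RDMA write targeting it is confirmed finished.
    (Hypothetical structure; the real allocator is more elaborate.)"""

    def __init__(self, num_blocks: int):
        self._free: list[int] = list(range(num_blocks))
        self._in_flight: set[int] = set()  # freed, but an RDMA write is pending
        self._lock = threading.Lock()

    def allocate(self) -> int:
        with self._lock:
            # In-flight blocks are never handed out, so a new request can no
            # longer land on memory a late Prefill write is about to touch.
            return self._free.pop()

    def release(self, block: int, rdma_done: bool) -> None:
        # Decode-side cancellation path.
        with self._lock:
            if rdma_done:
                self._free.append(block)     # safe: no writer remains
            else:
                self._in_flight.add(block)   # quarantine until confirmation

    def rdma_write_finished(self, block: int) -> None:
        # Safety signal from the Prefill side: its write has fully landed.
        with self._lock:
            if block in self._in_flight:
                self._in_flight.remove(block)
                self._free.append(block)     # only now may the block be reused
```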
Similar synchronization errors were found in HiCache, the hierarchical caching system used to manage memory shortages. The team identified a read-before-ready pattern where the system attempted to begin computations before the data transfer from the CPU to the GPU was complete. By inserting explicit synchronization points to force the correct order of operations, they eliminated the source of the corrupted outputs.
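The fix follows the standard CUDA pattern of recording an event on the transfer stream and making the compute stream wait on it. A minimal sketch in PyTorch, assuming a single pinned CPU buffer (HiCache's real code path is more involved):

```python
import torch

transfer_stream = torch.cuda.Stream()
compute_stream = torch.cuda.current_stream()
transfer_done = torch.cuda.Event()

host_kv = torch.randn(1024, 128, pin_memory=True)   # KV data staged on CPU
device_kv = torch.empty_like(host_kv, device="cuda")

with torch.cuda.stream(transfer_stream):
    device_kv.copy_(host_kv, non_blocking=True)  # async CPU -> GPU copy
    transfer_done.record(transfer_stream)        # mark the copy's completion

# Explicit synchronization point: compute must not start reading device_kv
# until the transfer has actually finished.
compute_stream.wait_event(transfer_done)
result = device_kv.sum()  # stand-in for attention over the restored cache
```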
Scaling Throughput with LayerSplit
While fixing the bugs stabilized the system, the team still faced a bottleneck in Context Parallelism. In standard parallel processing for long sequences, every GPU typically holds a redundant copy of the entire KV Cache. This redundancy is a massive waste of VRAM, which in turn limits the number of requests the system can handle simultaneously.
Z.ai introduced LayerSplit to solve this. Instead of every GPU carrying the full weight of the cache, LayerSplit distributes the cache so that each GPU is responsible for only a specific subset of layers. This is the difference between every team member carrying a full set of encyclopedias versus each member carrying only the volumes they are responsible for, sharing information only when necessary. To prevent the communication overhead of this distributed approach from slowing down the model, the team overlapped the indexer operations with broadcast transmissions, effectively hiding the latency.
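A rough sketch of the two ideas, assuming PyTorch's collective API; layer_range, owner_rank, and run_indexer are hypothetical names standing in for Z.ai's implementation:

```python
import torch
import torch.distributed as dist

def layer_range(rank: int, world_size: int, num_layers: int) -> range:
    """Which transformer layers this rank's KV-cache shard covers.
    (An even split for illustration; the real scheme may balance differently.)"""
    per_rank = (num_layers + world_size - 1) // world_size
    start = rank * per_rank
    return range(start, min(start + per_rank, num_layers))

def decode_layer(layer_id: int, kv_buffer: torch.Tensor,
                 owner_rank: int, run_indexer) -> None:
    """Fetch one layer's KV from its owning rank while hiding the transfer
    behind indexer work, then compute once the data is local."""
    # Non-blocking broadcast: only owner_rank holds the real data; every
    # other rank receives into kv_buffer.
    handle = dist.broadcast(kv_buffer, src=owner_rank, async_op=True)
    run_indexer(layer_id)   # overlapped work executes during the transfer
    handle.wait()           # KV is now resident; safe to attend over it
```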
At a 90% cache hit rate, the optimization grew more effective as context length increased, culminating in the 132% throughput gain for long-context windows. This demonstrates that while scaling laws define the theoretical intelligence of a model, the practical utility of that intelligence in production depends on the precision of the underlying systems engineering.
True AI performance is not just about the size of the parameter count, but the stability of the memory pipeline that supports it.