Zyphra’s ZAYA1-8B-Diffusion-Preview Hits 7.7x Faster Inference Speeds

The Shift from Autoregressive Bottlenecks

For years, large language models have generated text autoregressively, one token at a time. Every step must re-read the model weights and the Key-Value (KV) cache from GPU memory, so memory bandwidth, rather than raw compute power, dictates performance. This week, Zyphra introduced a structural departure from this paradigm with the ZAYA1-8B-Diffusion-Preview. Using discrete diffusion, the model generates a block of 16 tokens in a single forward pass, which shifts inference toward a compute-bound regime. The memory traffic of each pass is amortized over many tokens instead of one, much like opening multiple lanes on a highway to allow simultaneous traffic flow.
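
The memory-bound versus compute-bound distinction can be made concrete with a rough arithmetic-intensity estimate. The sketch below is illustrative only: it assumes a dense model whose weights are read once per forward pass, counts about 2 FLOPs per parameter per token, and ignores KV-cache and activation traffic. The function name and the 8e9 parameter count are hypothetical placeholders, not figures published by Zyphra.

```python
def arithmetic_intensity(tokens_per_pass: int,
                         n_params: float = 8e9,      # hypothetical dense parameter count
                         bytes_per_param: int = 2    # bf16 weights
                         ) -> float:
    """Rough FLOPs per byte of weight traffic for one forward pass."""
    flops = 2.0 * n_params * tokens_per_pass   # ~2 FLOPs per parameter per token
    bytes_moved = n_params * bytes_per_param   # weights are read once per pass
    return flops / bytes_moved


if __name__ == "__main__":
    for k in (1, 16):
        print(f"{k:>2} tokens/pass -> {arithmetic_intensity(k):.1f} FLOPs per byte")
```

Under these assumptions the intensity grows linearly with the number of tokens produced per pass, which is why block generation moves the workload away from the bandwidth limit.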

TiDAR Recipes and Architectural Innovation

Rather than training a diffusion model from scratch, a process for which no established, reliable recipe yet exists, Zyphra chose to convert an existing autoregressive model. The team applied the TiDAR recipe (Think in Diffusion, Talk in Autoregression) to the ZAYA1-8B base checkpoint. Training consisted of a mid-training phase of 600B tokens at a 32k context length, a further 500B tokens to extend the context to 128k, and a final supervised fine-tuning (SFT) stage.

Central to this performance is the CCGQA (Compressed Convolutional Grouped Query Attention) architecture. With a 4:1 ratio of query heads to key heads, Zyphra reduced memory usage while avoiding the excessive computational intensity of Multi-Head Latent Attention (MLA). This design choice leaves the model enough compute headroom to handle the additional arithmetic demands of block diffusion. The implementation details and model weights are available on the GitHub repository.
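
To see what a 4:1 query-to-key head ratio buys in memory terms, the following sketch compares KV-cache sizes with and without head grouping. It models only the grouping, not CCGQA's compression, and every dimension (layer count, head count, head size) is a hypothetical value chosen for illustration rather than the ZAYA1-8B configuration.

```python
def kv_cache_bytes(seq_len: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Bytes needed to cache keys and values for one sequence."""
    # Factor 2 accounts for storing both keys and values.
    return 2 * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_elem


# Hypothetical dimensions, not taken from the ZAYA1-8B config.
N_LAYERS, N_Q_HEADS, HEAD_DIM, SEQ_LEN = 32, 32, 128, 32_768

full = kv_cache_bytes(SEQ_LEN, N_LAYERS, N_Q_HEADS, HEAD_DIM)          # one KV head per query head
grouped = kv_cache_bytes(SEQ_LEN, N_LAYERS, N_Q_HEADS // 4, HEAD_DIM)  # 4:1 query-to-key ratio

if __name__ == "__main__":
    print(f"ungrouped: {full / 2**30:.1f} GiB, grouped 4:1: {grouped / 2**30:.1f} GiB")
```

Because the cache size is linear in the number of key/value heads, a 4:1 grouping cuts it to a quarter regardless of the other dimensions.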

Eliminating the Speculative Decoding Overhead

Traditional acceleration methods, such as EAGLE3 or dFlash, rely on speculative decoding, in which a smaller draft model generates tokens that a larger model then validates. This two-stage process introduces overhead from the constant data exchange and control-flow switching between the two models. ZAYA1-8B-Diffusion-Preview eliminates this overhead by performing both drafting and verification within a single forward pass. Because the same model acts as both the proposer and the verifier, the latency associated with model switching is removed. Furthermore, the model demonstrates that non-causal inference within these blocks can yield higher expressivity than standard causal autoregression, as evidenced by improved results on benchmarks like LiveCodeBench v6.
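
The propose-then-verify idea can be illustrated with the generic greedy acceptance rule used in speculative decoding: keep drafted tokens until the first disagreement with the verifier, then take the verifier's token at that position. This is a minimal sketch of that general rule, not Zyphra's implementation; the function name and token IDs are hypothetical. In a single-model setup, both lists would come out of the same forward pass rather than from two separate models.

```python
from typing import List


def accept_prefix(drafted: List[int], verified: List[int]) -> List[int]:
    """Greedy acceptance: keep drafted tokens until the first mismatch,
    then substitute the verifier's token at that position and stop."""
    accepted: List[int] = []
    for draft_tok, verify_tok in zip(drafted, verified):
        if draft_tok != verify_tok:
            accepted.append(verify_tok)  # verifier's correction ends the block
            break
        accepted.append(draft_tok)
    return accepted


if __name__ == "__main__":
    # Hypothetical token IDs: the draft diverges at the third position.
    print(accept_prefix([5, 9, 2, 7], [5, 9, 4, 1]))
```

Every accepted token beyond the first is one that did not need its own forward pass, which is where the speed-up comes from.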

Hardware Scaling on AMD MI300X and MI355X

The model’s performance is tightly coupled with the hardware it runs on, specifically leveraging the high VRAM capacity of AMD’s MI300X and MI355X GPUs. In bf16 precision, the MI300X supports approximately 3 block proposals per forward pass, while the MI355X scales this to 5 blocks. This scalability is a direct result of the model’s ability to share the KV cache across all tokens in a block, which drastically reduces how often memory must be accessed. By combining CCGQA with Compressed Convolutional Attention (CCA), the model also significantly reduces the compute requirements of the prefill stage. This structural efficiency means that as hardware capabilities increase, the model can process more tokens in parallel without hitting the memory wall that constrains traditional autoregressive architectures.
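
As a back-of-the-envelope illustration of these figures, the sketch below multiplies the reported block proposals per pass by the 16-token block size. It assumes every proposal is a full 16-token block, which the article implies but does not state outright, and it counts candidate tokens evaluated per pass, not tokens ultimately accepted. The names are hypothetical.

```python
BLOCK_SIZE = 16  # tokens per block, as reported in the article

# Approximate block proposals per forward pass in bf16, as reported in the article.
BLOCKS_PER_PASS = {"MI300X": 3, "MI355X": 5}


def candidate_tokens_per_pass(gpu: str) -> int:
    """Candidate tokens evaluated in one forward pass (assumes full 16-token blocks)."""
    return BLOCK_SIZE * BLOCKS_PER_PASS[gpu]


if __name__ == "__main__":
    for gpu in BLOCKS_PER_PASS:
        print(f"{gpu}: {candidate_tokens_per_pass(gpu)} candidate tokens per pass")
```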

This transition to diffusion-based architectures signals a move toward hardware-native AI, where the structure of the model is designed to match the physical throughput limits of modern silicon.