The quest for the perfect local inference model has shifted from a race for raw parameter counts to a battle for architectural efficiency. For months, the developer community has watched as massive models were squeezed into consumer hardware, often sacrificing reasoning capabilities for the sake of fit. This week, the landscape shifted again with the release of ZAYA1-8B under the Apache 2.0 license on Hugging Face. While the model boasts a total of 8.4B parameters, it operates with a lean 760M active parameters, challenging the long-held assumption that high-level mathematical reasoning requires a massive active compute footprint.
The AMD-Powered Architecture of ZAYA1-8B
Zyphra AI developed ZAYA1-8B using a Mixture of Experts (MoE) structure designed to maximize intelligence per FLOP. The training run also marks a significant milestone for non-Nvidia hardware stacks: the entire model was trained on a cluster of 1024 AMD Instinct MI300X GPUs. The infrastructure was built in collaboration with IBM and leverages the AMD Pensando Pollara interconnect to handle the massive data throughput that MoE training requires.
In benchmarks focusing on mathematics and coding, ZAYA1-8B demonstrates performance that defies its size. On the AIME'26 benchmark it scored 89.1, while achieving 71.6 on HMMT Feb.'26. Its capabilities extend to IMO-AnswerBench, with a score of 59.3, and the APEX-shortlist, at 32.2. For coding, the model recorded 65.8 on LiveCodeBench-v6, and it posted 71.0 on the knowledge-focused GPQA-Diamond. Compared against other models in its weight class, such as Qwen3-4B-Thinking-2507 and Gemma-4-E4B-it, ZAYA1-8B consistently came out ahead on every math and coding metric.
The most striking results appear when ZAYA1-8B is pitted against models more than an order of magnitude larger. It outperformed Mistral-Small-4-119B, a model with 119B total parameters and 6B active parameters, in several key areas. Specifically, ZAYA1-8B beat Mistral-Small-4 on AIME'26 (89.1 vs 86.4), HMMT Feb.'26 (71.6 vs 70.6), and LiveCodeBench-v6 (65.8 vs 57.9). However, the larger Mistral model maintains an edge in general knowledge and complex reasoning, leading in GPQA-Diamond (77.2 vs 71.0) and MMLU-Pro (81.6 vs 74.2).
Breaking the Context Barrier with Markovian RSA
Traditional MoE models typically scale by simply adding more experts, but Zyphra introduced the MoE++ architecture to extract more capability from every active parameter. The development of ZAYA1-8B followed a rigorous five-stage pipeline: pre-training, intermediate training, supervised fine-tuning (SFT), a reasoning RL (reinforcement learning) cascade, and Markovian RSA (Recursive Self-Aggregation) for inference-time computation. The research team noted that the RL stage provided the biggest boost to math and coding while also improving knowledge retrieval, reflected in benchmarks like MMLU and GPQA-Diamond, and creative writing.
The real technical breakthrough lies in Markovian RSA, which tackles the problem of sustaining reasoning depth within a fixed context window. The method combines Recursive Self-Aggregation, which generates and aggregates multiple reasoning traces in parallel, with a Markovian thinking process: reasoning is divided into fixed-length chunks, and instead of passing the entire history forward, only the tail end of the previous chunk is passed to the next.
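To make the mechanics concrete, here is a minimal sketch of that chunked, tail-carry loop in Python. Everything in it is an illustrative assumption rather than Zyphra's implementation: complete stands in for a real inference call, the chunk and tail sizes are invented, and the slicing is done on characters where a real harness would slice token IDs.

CHUNK_TOKENS = 4096   # assumed fixed chunk length
TAIL_TOKENS = 512     # assumed carry-over window

def complete(prompt: str, max_tokens: int) -> str:
    # Stand-in for a real model call; swap in an actual inference client.
    return f"<reasoning continued from: {prompt[-40:]!r}>"

def markovian_trace(problem: str, n_chunks: int = 4) -> str:
    carry = ""                                # tail of the previous chunk
    for _ in range(n_chunks):
        chunk = complete(f"{problem}\n{carry}", max_tokens=CHUNK_TOKENS)
        carry = chunk[-TAIL_TOKENS:]          # keep only the tail, drop the rest
    return carry  # the prompt never grows beyond the problem plus one fixed tail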
This architecture ensures that the intermediate thinking process never exceeds the fixed context window size. For every prompt, the model generates multiple parallel reasoning traces and extracts a fixed-length tail from each. A sub-sampled aggregation prompt then triggers the next round of parallel responses. This creates a loop of continuous refinement without the memory overflow typically associated with long-chain reasoning.
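Building on that sketch, the loop below adds the aggregation layer: several traces run per round (sequentially here for clarity, conceptually in parallel), a fixed-length tail is taken from each, and a random sub-sample of tails is folded into the aggregation prompt that seeds the next round. The trace count, sub-sample size, and prompt wording are assumptions; complete, CHUNK_TOKENS, and TAIL_TOKENS are reused from the sketch above.

import random

def rsa_round(problem: str, seed: str, n_traces: int = 8, k_sample: int = 4) -> str:
    # One Markovian RSA round: parallel traces -> fixed tails -> aggregation prompt.
    tails = []
    for _ in range(n_traces):                 # conceptually parallel
        chunk = complete(f"{problem}\n{seed}", max_tokens=CHUNK_TOKENS)
        tails.append(chunk[-TAIL_TOKENS:])    # fixed-length tail per trace
    subset = random.sample(tails, k=min(k_sample, len(tails)))
    return "Aggregate and refine these partial solutions:\n" + "\n---\n".join(subset)

def markovian_rsa(problem: str, n_rounds: int = 3) -> str:
    seed = ""
    for _ in range(n_rounds):                 # continuous refinement loop
        seed = rsa_round(problem, seed)       # per-round context stays bounded
    return complete(f"{problem}\nFinal answer based on:\n{seed}", max_tokens=CHUNK_TOKENS)

In a production harness the n_traces generations would be issued concurrently, which is where the method's wall-clock efficiency comes from.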
Crucially, Zyphra found that this performance stems not just from the inference method but from the co-design of the post-training methodology and the inference harness. From the SFT stage through RL, ZAYA1-8B was trained specifically to understand and respond to Markovian RSA aggregation prompts and chunking. When the same Markovian RSA method was applied to Qwen3-4B-Thinking-2507, the gains were significantly smaller, indicating that a model must be natively trained for this style of recursive thinking.
When the inference-time compute budget was expanded to 5.5 million tokens per problem, the results shifted from competitive to dominant. In this high-compute configuration, ZAYA1-8B surpassed DeepSeek-V3.2 and GPT-OSS-High on the APEX-shortlist math benchmark. On HMMT'25, it achieved a score of 89.6, beating both Claude 4.5 Sonnet (88.3) and GPT-5-High.
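The coverage here does not say how that 5.5 million tokens are divided, but under Markovian RSA a per-problem budget decomposes naturally into traces, chunk lengths, and rounds. The arithmetic below is purely illustrative; only the 5.5M figure comes from the reported results, and every other number is an assumption.

budget = 5_500_000         # per-problem token budget cited above
traces_per_round = 16      # assumed parallel traces
chunk_tokens = 4_096       # assumed chunk length, matching the earlier sketch
agg_prompt_tokens = 2_048  # assumed overhead per aggregation prompt

per_round = traces_per_round * chunk_tokens + agg_prompt_tokens
rounds = budget // per_round
print(f"{rounds} refinement rounds fit ({per_round:,} tokens per round)")
# -> 81 rounds with these assumptions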
Model weights are available on Hugging Face and the full technical report is hosted on arXiv. For those who prefer not to host locally, the model is also available as a serverless endpoint via Zyphra Cloud.
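For local experimentation, loading the checkpoint should look like any other Hugging Face causal LM. The repository id below is an assumption inferred from the article, so confirm it on the model card, and a novel MoE++ architecture may need trust_remote_code enabled.

from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "Zyphra/ZAYA1-8B"  # hypothetical repo id; check the actual model card
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype="auto",      # keep the checkpoint's native precision
    device_map="auto",       # shard across available GPUs, fall back to CPU
    trust_remote_code=True,  # custom MoE++ modeling code may be required
)

inputs = tokenizer("Find all integer solutions of x^2 - 5y^2 = 4.",
                   return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(output[0], skip_special_tokens=True))

Note that the headline Markovian RSA numbers assume the full aggregation harness; plain single-trace generation like this will not reproduce them.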
The fact that 760M active parameters can outpace a 119B-parameter model in specialized reasoning suggests that the future of AI efficiency lies in the tight integration of MoE structures and inference-time compute strategies.




