Every morning, developers scroll through their feeds and see Claude Mythos mentioned somewhere. Anthropic has never published a technical paper on the model, which only fuels speculation. This week, a GitHub project called OpenMythos shot to the top of the trending page, offering the first concrete hypothesis of what Mythos actually is.

Section 1: The Recurrent-Depth Transformer Architecture

Kye Gomez released OpenMythos on GitHub as an open-source reconstruction of Claude Mythos from first principles. This is not a leaked model or a fine-tune — it is a hypothesis implemented in code, written entirely in PyTorch and grounded in peer-reviewed research.

The core assumption is that Claude Mythos belongs to the Recurrent-Depth Transformer (RDT) family, also called Looped Transformers in the literature. Standard transformers — GPT, Llama, Mistral — pass input sequentially through layers with unique weights. RDT applies a fixed set of weights T times within a single forward pass. The same weights execute repeatedly. Inference depth is determined not by stored parameters but by the number of iterations run.

The architecture splits into three components: Prelude → Recurrent Block → Coda. The Prelude and Coda are standard transformer layers that execute exactly once. The Recurrent Block is the computational core, iterating up to T=16 times. At each loop step t, the hidden state updates as:

h_t = A * h_{t-1} + B * e

Here h_t is the hidden state after the t-th iteration, and e is the input encoded by the Prelude, re-injected at every step. Matrices A and B determine how much of the previous hidden state and the encoded input propagate forward.
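The Prelude → Recurrent Block → Coda skeleton and the update rule above can be sketched in a few lines of PyTorch. This is a deliberately minimal illustration, not code from the OpenMythos repo: the class name is invented, and plain linear maps stand in for A, B, and the prelude/coda transformer layers so the recurrence itself is visible.

```python
import torch
import torch.nn as nn

class RecurrentDepthSketch(nn.Module):
    """Illustrative Prelude -> Recurrent Block -> Coda skeleton (not the real model)."""

    def __init__(self, d_model: int, T: int = 16):
        super().__init__()
        self.prelude = nn.Linear(d_model, d_model)        # stand-in for prelude layers
        self.A = nn.Linear(d_model, d_model, bias=False)  # state-transition matrix A
        self.B = nn.Linear(d_model, d_model, bias=False)  # input-injection matrix B
        self.coda = nn.Linear(d_model, d_model)           # stand-in for coda layers
        self.T = T

    def forward(self, x: torch.Tensor, T: int = None) -> torch.Tensor:
        T = T if T is not None else self.T
        e = self.prelude(x)         # encode the input once
        h = torch.zeros_like(e)     # initial hidden state
        for _ in range(T):          # the SAME weights run T times
            h = self.A(h) + self.B(e)   # h_t = A * h_{t-1} + B * e
        return self.coda(h)

model = RecurrentDepthSketch(d_model=64, T=16)
out = model(torch.randn(2, 8, 64))   # (batch, seq, d_model)
print(out.shape)                     # torch.Size([2, 8, 64])
```

Note that `T` is a runtime argument, not a stored parameter: calling `model(x, T=32)` doubles the effective depth without touching the weights.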

Inside the Recurrent Block, the feed-forward network is not standard. It is replaced by a Mixture-of-Experts (MoE) layer, following the design introduced in DeepSeekMoE. A large pool of experts activates only the top K per token, alongside a shared expert that is always active. The router selects a different subset of experts at each loop depth, so every iteration remains computationally distinct while sharing the same base weights. MoE provides breadth across domains; looping provides depth of inference.

Attention uses Multi-Latent Attention from DeepSeek-V2. Instead of caching full key/value tensors, it caches compressed low-rank KV latents, reducing KV cache memory by 10–20x at production scale.
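The caching arithmetic behind that saving can be shown directly. The sketch below compresses hidden states into a low-rank latent, caches only the latent, and re-expands K and V on demand; the dimensions are hypothetical picks, and real MLA (per DeepSeek-V2) adds details such as decoupled RoPE handling that are omitted here:

```python
import torch
import torch.nn as nn

# Hypothetical dimensions chosen for illustration only.
d_model, d_latent, n_heads, d_head = 512, 64, 8, 64
seq_len = 128   # number of cached positions

down = nn.Linear(d_model, d_latent, bias=False)            # compress to KV latent
up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)   # re-expand to keys
up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)   # re-expand to values

h = torch.randn(1, seq_len, d_model)
latent_cache = down(h)          # (1, 128, 64) -- this is all that gets stored

# Full K+V cache vs. compressed latent cache, in float counts per sequence:
full_kv_floats = 2 * seq_len * n_heads * d_head   # K and V, uncompressed
latent_floats = seq_len * d_latent
print(full_kv_floats / latent_floats)             # 16.0x smaller with these dims

k = up_k(latent_cache)          # reconstructed only when attention runs
v = up_v(latent_cache)
```

With these illustrative sizes the latent cache is 16x smaller, which sits inside the 10–20x range the article cites.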

Section 2: What Changes in Inference

In a standard architecture, deeper reasoning requires more layers and more parameters. In an RDT, the same weights handle deeper chains simply by increasing the iteration count. A standard transformer trained on 5-step reasoning chains typically fails to generalize to 10-step chains at test time; a Recurrent-Depth Transformer handles longer chains by running more loops at inference, without retraining. Hard problems receive more compute; easy problems terminate early.


Inference happens entirely in continuous latent space. No intermediate tokens are generated between loop steps. This is structurally different from chain-of-thought prompting, which externalizes reasoning as a token sequence. Saunshi et al. (2025) formally proved that each loop iteration in an RDT is functionally equivalent to one step of a reasoning chain, but operating on real-valued vectors instead of discrete tokens. Continuous latent thought can encode multiple alternative next steps simultaneously, enabling behavior closer to a breadth-first search over reasoning space within a single forward pass.

Training recurrent models has historically been brittle. Hidden states can grow without bound across iterations — the residual explosion problem. OpenMythos solves this with a linear time-invariant (LTI) injection constraint borrowed from the Parcae architecture (Prairie et al., 2026). The spectral radius ρ(A) of matrix A is forced below 1, guaranteeing stability regardless of learning rate or gradient noise.
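One simple way to enforce a constraint of this kind: since the spectral radius ρ(A) is bounded above by the largest singular value, rescaling A whenever that singular value exceeds a sub-1 target guarantees ρ(A) < 1. The 0.99 target and the rescaling approach below are illustrative assumptions, not details from the Parcae paper:

```python
import torch

def constrain_spectral(A: torch.Tensor, target: float = 0.99) -> torch.Tensor:
    """Rescale A so its largest singular value (an upper bound on the
    spectral radius) stays at or below `target`. Illustrative sketch."""
    sigma_max = torch.linalg.matrix_norm(A, ord=2)    # largest singular value
    scale = torch.clamp(target / sigma_max, max=1.0)  # only ever shrink A
    return A * scale

A = torch.randn(64, 64)                 # random init: sigma_max is typically >> 1
A_stable = constrain_spectral(A)

# After the constraint, repeated application of A contracts rather than
# explodes, so h_t = A h_{t-1} + B e stays bounded for any loop count.
print(float(torch.linalg.matrix_norm(A_stable, ord=2)))   # <= 0.99
```

Applying this projection after each optimizer step keeps the recurrence stable no matter how the gradients move A.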

At the opposite extreme, excessive iterations can degrade predictions — the overthinking problem. An adaptive computation time (ACT) halting mechanism uses a per-position learned scalar to dynamically stop the loop. Difficult positions receive more compute; tokens that have already converged stop early.
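A bare-bones version of that halting loop, assuming a classic ACT-style scheme: each iteration, a learned head emits a per-position halting probability; positions stop once their cumulative probability crosses a threshold. The 0.99 threshold, the tiny dimensions, and the tanh stand-in for the recurrent block are all illustrative:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, T_max, n_pos = 32, 16, 4
halt_head = nn.Linear(d_model, 1)     # per-position learned halting scalar
step = nn.Linear(d_model, d_model)    # stand-in for one recurrent-block pass

h = torch.randn(n_pos, d_model)
halted = torch.zeros(n_pos, dtype=torch.bool)
cum_halt = torch.zeros(n_pos)
steps_used = torch.zeros(n_pos, dtype=torch.long)

with torch.no_grad():
    for t in range(T_max):
        active = ~halted
        if not active.any():
            break                                     # everything converged early
        h[active] = torch.tanh(step(h[active]))       # one loop iteration
        p = torch.sigmoid(halt_head(h[active])).squeeze(-1)
        cum_halt[active] += p                         # accumulate halting mass
        steps_used[active] += 1
        halted[active] = cum_halt[active] > 0.99      # converged positions stop

print(steps_used.tolist())   # per-position iteration counts, each <= T_max
```

Positions whose halting mass accumulates quickly drop out of the loop early; only the stragglers pay for the full depth.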

Depth-Wise LoRA adapters introduce small rank-r adaptation matrices at each iteration depth. This gives each loop step slightly different behavior without significantly increasing parameters, bridging the gap between pure weight sharing and fully separate layers.
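The idea can be sketched as one shared weight matrix plus a per-depth rank-r (down, up) pair. Class name, sizes, and the zero-init of the up-projection are illustrative assumptions; zero-init just makes the model start as pure weight sharing, with the per-depth corrections learned from there:

```python
import torch
import torch.nn as nn

class DepthLoRASketch(nn.Module):
    """Shared weights + a rank-r LoRA correction per loop depth (illustrative)."""

    def __init__(self, d_model=64, r=4, T=16):
        super().__init__()
        self.shared = nn.Linear(d_model, d_model, bias=False)        # shared base
        self.down = nn.Parameter(torch.randn(T, d_model, r) * 0.01)  # per-depth A_t
        self.up = nn.Parameter(torch.zeros(T, r, d_model))           # per-depth B_t
        # up is zero-initialized, so at init every depth behaves identically
        # to the shared path; depth-specific behavior emerges during training.

    def forward(self, h: torch.Tensor, t: int) -> torch.Tensor:
        # base path plus the depth-t low-rank correction: W h + (h A_t) B_t
        return self.shared(h) + (h @ self.down[t]) @ self.up[t]

block = DepthLoRASketch()
h = torch.randn(2, 64)
for t in range(16):               # depth index selects the adapter pair
    h = torch.tanh(block(h, t))
print(h.shape)                    # torch.Size([2, 64])
```

With d_model=64, r=4, T=16, the adapters add 16 × 2 × 64 × 4 = 8,192 parameters against the 4,096 shared weights; scaling d_model up makes the adapters a small fraction of the total, far cheaper than 16 separate layers.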

Section 3: Efficiency at 770M Parameters

The Parcae paper (Prairie et al., 2026) provides empirical evidence for the efficiency claims: a 770M-parameter RDT matches the performance of a 1.3B-parameter standard transformer, roughly 40% fewer parameters for the same result. For developers, the immediate impact is inference cost: less memory and fewer operations for the same quality. And because the architecture allocates extra compute only to harder problems, average inference cost drops further.

OpenMythos is available on GitHub. The project presents a falsifiable hypothesis for the structure of Claude Mythos, which Anthropic has never disclosed. The code is public; anyone can experiment with it, verify it, or disprove it.

This is the first time the community has a concrete, testable claim about what Mythos might be. Whether the hypothesis holds or falls, the conversation has shifted from speculation to engineering.