Launch the VICE emulator and load the disk/soulplayer.d64 file. Type a short message in lowercase, hit enter, and the screen border begins to blink. With every short beep from the SID sound chip, a single token appears on the screen. This is the surreal sight of a Transformer model—the very architecture powering the modern AI revolution—operating on an 8-bit computer clocked at a mere 1MHz.

The Architecture of an 8-Bit LLM

Soul Player C64 is a decoder-only Transformer consisting of two layers. Its specifications are lean by modern standards but complex for 1982 hardware: it features four attention heads with 8 dimensions per head, a 32-dimensional embedding, and 64 hidden units in the feed-forward network (FFN). The total parameter count sits at approximately 25,000, all of which have been quantized to int8. To make this possible, the model is written directly in 6502/6510 assembly, the native language of the Commodore 64's central processing unit.
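That ~25,000 figure can be sanity-checked from the stated dimensions. The sketch below assumes an untied output projection and ignores the tiny RMSNorm gain vectors; the model's exact layout may differ:

```python
# Back-of-the-envelope parameter count at the article's dimensions.
# Assumptions: untied output projection, RMSNorm vectors ignored.
vocab, d_model, d_ff, n_layers = 128, 32, 64, 2

embedding = vocab * d_model         # token embedding table:  4096
attn      = 4 * d_model * d_model   # Q, K, V, O projections: 4096
ffn       = 2 * d_model * d_ff      # up + down projections:  4096
out_proj  = d_model * vocab         # final logit projection: 4096

total = embedding + n_layers * (attn + ffn) + out_proj
print(total)  # 24576, in line with the quoted ~25,000
```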

Performance is measured in minutes rather than milliseconds. Inference takes roughly 60 seconds per token, meaning a full response requires a significant amount of patience. Despite this, the entire model is small enough to fit on a single floppy disk. For those looking to replicate the build, the pipeline follows a specific sequence of Python scripts:

```bash
python train.py
python build.py
python test.py
```

The training process builds a BPE tokenizer with a 128-token vocabulary and employs Quantization-Aware Training (QAT) to minimize the errors introduced by 8-bit precision. This workflow produces models/soul.bin and models/tokenizer.json. The build.py script then merges these weights with the 6502 assembly routines to output the final disk/soulplayer.prg and disk/soulplayer.d64 files.
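The export to int8 with per-tensor power-of-2 scaling can be sketched roughly like this; `export_int8_pow2` is an illustrative name, not a function from the repository:

```python
import math

def export_int8_pow2(weights):
    """Quantize a float tensor to int8 with a power-of-2 scale,
    w ~= q * 2**-shift, so dequantization on the 6502 reduces to a
    bit shift. Illustrative sketch, not the repo's export code."""
    max_abs = max(abs(w) for w in weights) or 1.0
    # largest shift that still keeps every value inside int8 range
    shift = math.floor(math.log2(127.0 / max_abs))
    q = [max(-128, min(127, round(w * 2.0**shift))) for w in weights]
    return q, shift

q, shift = export_int8_pow2([0.5, -0.25, 0.9])
# dequantized value of q[i] is q[i] / 2**shift
```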

The Battle Against Numerical Instability

The true technical challenge lies in the fact that the 6502 CPU possesses no native multiplication instruction. Every matrix multiplication must be painstakingly executed via a shift-and-add approach. To manage this, the system uses Q8.8 fixed-point (int16) for all activations, while weights are stored as int8 with power-of-2 shift scaling applied per tensor. Bias values are handled as int16, pre-scaled to match the matrix multiplication accumulator.
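The shift-and-add idea is easy to model: walk the bits of one operand, adding the progressively shifted other operand into the result, which is exactly what a 6502 loop does with ASL/ROL and ADC. An unsigned sketch:

```python
def shift_and_add_mul(a, b):
    """Unsigned 8-bit multiply via shift-and-add, mirroring what a
    6502 routine does with ASL/ROL/ADC since there is no MUL opcode."""
    assert 0 <= a < 256 and 0 <= b < 256
    result = 0
    while b:
        if b & 1:       # low bit of multiplier set:
            result += a # add the shifted multiplicand (ADC)
        a <<= 1         # shift multiplicand left (ASL/ROL)
        b >>= 1         # shift multiplier right (LSR)
    return result       # product always fits in 16 bits

assert shift_and_add_mul(200, 150) == 30000
```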

However, standard quantization was not enough to make the model coherent. The primary failure occurred during the softmax process, where output values are converted into a probability distribution. Initially, a 17-bit shift method was used, but the dynamic range of the 128-item exponential lookup table was insufficient. This caused the attention weights to distribute uniformly, effectively blinding the model to context. The breakthrough came when this was adjusted to a 14-bit shift, which allowed the model to generate meaningful attention weights and actually perceive the relationship between tokens.
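Why the shift width matters can be illustrated with a toy fixed-point softmax. The table contents and index formula below are assumptions for illustration; the article only states that the shift was reduced from 17 to 14 bits:

```python
import math

# 128-entry lookup table for exp(-i/16), scaled to 16-bit fixed point.
# Table layout and indexing are illustrative, not the real routine.
EXP_TABLE = [int(round(math.exp(-i / 16.0) * 65535)) for i in range(128)]

def fixed_softmax(logits, shift):
    """Approximate softmax over integer accumulators via the LUT.
    Index = (max - logit) >> shift: too large a shift collapses every
    logit onto index 0, degenerating attention to a uniform spread."""
    m = max(logits)
    idx = [min(127, (m - x) >> shift) for x in logits]
    w = [EXP_TABLE[i] for i in idx]
    s = sum(w)
    return [x / s for x in w]

logits = [40000, 10000, 0]          # example integer accumulators
coarse = fixed_softmax(logits, 17)  # every index is 0: uniform weights
fine = fixed_softmax(logits, 14)    # distinct indices: usable attention
```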

To further bridge the gap between training and execution, FakeQuantI8 was introduced during the QAT phase. By intentionally inducing a mismatch between the continuous scales used during training and the power-of-2 shift grid used during export, the developer forced the model to learn a wider logit margin. This acted as a form of implicit noise, making the model more resilient to quantization gaps. Additionally, a label smoothing value of 0.15 was applied to prevent the distribution from becoming too sharp for int8 operations to distinguish.
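The forward half of such a fake-quantization step can be sketched in a few lines. The function below is a generic illustration of the technique; FakeQuantI8's actual definition lives in the project's training code, and during QAT the backward pass would route gradients through a straight-through estimator:

```python
def fake_quant_i8(x, scale):
    """Round a float to the int8 grid, then dequantize, so training
    sees the quantization error. Illustrative sketch only."""
    q = max(-128, min(127, round(x / scale)))
    return q * scale

# Training can use a continuous scale while export snaps to the
# nearest power of two, producing the deliberate mismatch
# described above (both values here are made-up examples).
train_scale = 0.013        # continuous, learned or calibrated
export_scale = 2.0 ** -6   # power-of-2 grid: 0.015625
```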

The resulting operational flow is a rigid sequence: it begins with RMSNorm, moves through multi-head causal self-attention, a residual connection, another RMSNorm, a ReLU MLP, a second residual connection, a final RMSNorm, and an output projection, ending with an argmax to determine the final token.
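That sequence maps onto a forward pass roughly as follows. This is a float sketch of the control flow only, with a single attention head for brevity; the real routines use four heads and int8/int16 arithmetic:

```python
import numpy as np

def rmsnorm(x, g):
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + 1e-5) * g

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def block(x, p):
    # RMSNorm -> causal self-attention -> residual
    h = rmsnorm(x, p["g1"])
    q, k, v = h @ p["wq"], h @ p["wk"], h @ p["wv"]
    mask = np.triu(np.full((len(x), len(x)), -1e9), k=1)  # causal mask
    att = softmax(q @ k.T / np.sqrt(k.shape[-1]) + mask)
    x = x + (att @ v) @ p["wo"]
    # RMSNorm -> ReLU MLP -> residual
    h = rmsnorm(x, p["g2"])
    return x + np.maximum(h @ p["w1"], 0.0) @ p["w2"]

def next_token(tokens, emb, layers, g_out, w_out):
    x = emb[tokens]                        # (seq, d_model)
    for p in layers:
        x = block(x, p)
    logits = rmsnorm(x[-1], g_out) @ w_out # final norm + projection
    return int(np.argmax(logits))          # greedy argmax decoding

# Tiny random instantiation at the article's dimensions
rng = np.random.default_rng(0)
d, ff, vocab = 32, 64, 128
mk = lambda *s: rng.normal(0, 0.1, s)
layers = [dict(g1=np.ones(d), g2=np.ones(d),
               wq=mk(d, d), wk=mk(d, d), wv=mk(d, d), wo=mk(d, d),
               w1=mk(d, ff), w2=mk(ff, d)) for _ in range(2)]
tok = next_token([5, 17, 3], mk(vocab, d), layers, np.ones(d), mk(d, vocab))
```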

This experiment demonstrates that the minimum requirements for a functioning Transformer are defined not by raw hardware throughput, but by precise control of numerical stability.