Most developers interacting with Large Language Models today work at a high level of abstraction. The daily workflow typically involves importing a pre-trained model from a library, passing a prompt through an API, and refining the output. While this approach is efficient for shipping products, it creates a dangerous knowledge gap. The internal mechanics of how a token is embedded, or how the attention mechanism weighs the importance of a preceding word, remain hidden behind a curtain of optimized C++ and CUDA kernels. This reliance on black-box libraries means that when a model hallucinates or fails to follow a complex constraint, the engineer is left guessing rather than diagnosing. There is a growing movement among practitioners to strip away these abstractions and return to the fundamental mathematics of the transformer.
The Architecture of a 10M Parameter Training Pipeline
This workshop project addresses the abstraction gap by guiding developers through the creation of a 10M parameter GPT model. The project draws its lineage from nanoGPT, the educational repository created by Andrej Karpathy that distilled the complexity of GPT-2 into a few hundred lines of readable code. While the original nanoGPT aimed to replicate the 124M parameter scale of GPT-2, this specific workshop scales the model down to 10M parameters. This reduction is a deliberate pedagogical choice, ensuring that the entire training cycle can be completed in under an hour on consumer-grade hardware without sacrificing the core structural lessons of the transformer.
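For orientation, the sketch below shows the kind of configuration that lands near the 10M mark. The specific n_layer, n_head, and n_embd values are assumptions modeled on nanoGPT's character-level Shakespeare recipe, not the workshop's published numbers; vocab_size and block_size match the figures given later in this article.

```python
from dataclasses import dataclass

@dataclass
class GPTConfig:
    # n_layer / n_head / n_embd are assumed values in the spirit of
    # nanoGPT's character-level config; the workshop's may differ.
    block_size: int = 256   # context window (characters)
    vocab_size: int = 65    # unique characters in the corpus
    n_layer: int = 6        # transformer blocks
    n_head: int = 6         # attention heads per block
    n_embd: int = 384       # embedding width

# Back-of-envelope count: 12 * n_layer * n_embd**2 for the blocks,
# plus (vocab_size + block_size) * n_embd for the embedding tables,
# comes to roughly 10.7M parameters.
```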
The technical foundation of the project is PyTorch, the open-source machine learning library developed by Meta. By using PyTorch, developers manually define the model's architecture, from the embedding layers to the multi-head attention blocks and the final linear layer. To provide a tangible result, the model is trained on the works of William Shakespeare, allowing the AI to learn the specific rhythmic and linguistic patterns of Early Modern English. To ensure the project is accessible regardless of the user's hardware, the pipeline includes automatic device detection. Whether the developer is using an Apple Silicon GPU via Metal Performance Shaders (MPS), an NVIDIA GPU via CUDA, or a standard CPU, the code adapts to the available compute resources.
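Device detection of this kind is usually a few lines of standard PyTorch. The sketch below is one plausible implementation rather than the workshop's exact code; the helper name detect_device is ours.

```python
import torch

def detect_device() -> str:
    """Pick the best available backend (hypothetical helper)."""
    if torch.cuda.is_available():           # NVIDIA GPU
        return "cuda"
    if torch.backends.mps.is_available():   # Apple Silicon GPU via Metal
        return "mps"
    return "cpu"                            # portable fallback

device = detect_device()
print(f"training on {device}")
```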
For those setting up a local environment, the project recommends using the uv Python package manager for speed and reliability. The installation process begins with the following command:
```bash
# Install uv (Python package manager)
curl -LsSf https://astral.sh/uv/install.sh | sh
```
For developers who prefer a cloud-based approach or lack a dedicated GPU, the project is fully compatible with Google Colab. In the Colab environment, the process is streamlined to a few simple commands to prepare the dependencies and trigger the training script:
```bash
!pip install torch numpy
!python train.py
```

Breaking the Dependency on High-Level APIs
For years, the industry standard for implementing LLMs has been the use of high-level APIs, such as the AutoModel.from_pretrained() function provided by Hugging Face. While these tools are indispensable for production, they obscure the actual process of learning. By forcing the developer to write the model from the ground up, this workshop shifts the focus from consumption to construction. The most critical technical pivot in this project is the decision to abandon Byte Pair Encoding (BPE) in favor of character-level tokenization.
In a massive model like GPT-4, BPE is essential because it handles a vocabulary of tens of thousands of tokens, allowing the model to represent complex words and sub-words efficiently. However, in a 10M parameter model, a massive vocabulary becomes a liability. If a model with limited capacity attempts to learn 50,000 different tokens, the majority of those tokens will appear too infrequently in the training data. This creates a sparsity problem: the model never sees enough examples of most tokens to learn useful representations, leading to poor performance and unstable loss curves.
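A back-of-envelope calculation makes the liability concrete. Assuming the 384-dimensional embedding width from the config sketch above (our assumption, not a published figure), the token-embedding table alone tells the story:

```python
n_embd = 384          # assumed embedding width (see config sketch above)

bpe_vocab = 50_000    # GPT-2-scale BPE vocabulary
char_vocab = 65       # character-level vocabulary

print(bpe_vocab * n_embd)   # 19,200,000: nearly twice the entire 10M budget
print(char_vocab * n_embd)  # 24,960: a rounding error by comparison
```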
By utilizing character-level tokenization, the project reduces the vocabulary size to a manageable 65. This ensures that every single token in the vocabulary is seen thousands of times during the training process, allowing the 10M parameters to be used more effectively for learning the structural relationships between characters rather than wasting capacity on a sparse dictionary. The configuration is strictly tuned for the Shakespearean dataset, with a vocab_size of 65 and a block_size of 256. This block size defines the context window, determining how many previous characters the model can look at to predict the next one. While the workshop focuses on character-level logic for the initial build, it provides a roadmap for transitioning back to BPE when scaling to larger datasets and models.
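In practice, a character-level tokenizer is only a few lines. The sketch below mirrors the approach popularized by nanoGPT's data-preparation script; the input.txt filename is a convention borrowed from that repository, not necessarily the workshop's.

```python
# Build a character-level vocabulary from the corpus itself.
with open("input.txt", "r", encoding="utf-8") as f:
    text = f.read()

chars = sorted(set(text))  # 65 unique characters for the Shakespeare corpus
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

def encode(s: str) -> list[int]:
    return [stoi[c] for c in s]

def decode(ids: list[int]) -> str:
    return "".join(itos[i] for i in ids)

assert decode(encode("To be, or not to be")) == "To be, or not to be"
```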
This transition from API-reliance to manual implementation reveals the hidden costs of computation. When a developer writes the attention mechanism themselves, they begin to understand why the memory requirements of transformers scale quadratically with the sequence length. The tension between model size, vocabulary breadth, and training time becomes a visible engineering trade-off rather than a hidden configuration setting in a JSON file.
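The quadratic cost is easiest to see in the attention score matrix itself. The minimal single-head sketch below is illustrative rather than a copy of the workshop's model.py: every one of the T query positions scores against every one of the T key positions, materializing a (T, T) tensor.

```python
import torch
import torch.nn.functional as F

def causal_attention(x: torch.Tensor) -> torch.Tensor:
    """Single-head causal self-attention over x of shape (B, T, C)."""
    B, T, C = x.shape
    # For brevity, x doubles as queries, keys, and values; a real head
    # would first apply learned linear projections.
    q, k, v = x, x, x
    att = (q @ k.transpose(-2, -1)) / (C ** 0.5)  # (B, T, T): the O(T^2) term
    mask = torch.tril(torch.ones(T, T, dtype=torch.bool))
    att = att.masked_fill(~mask, float("-inf"))   # hide future positions
    att = F.softmax(att, dim=-1)
    return att @ v                                # back to (B, T, C)

out = causal_attention(torch.randn(1, 256, 384))  # builds a 256x256 score matrix
```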
The final output of the workshop is not just a trained weight file, but a complete, transparent pipeline consisting of model.py, train.py, and generate.py. By observing the loss function converge in real-time, developers gain an intuitive sense of how gradients flow through the network and how the model slowly transforms random noise into coherent, Shakespearean-style prose. This experience provides a technical foundation that is far more valuable than simply knowing how to call an API. When the time comes to tune a domain-specific small language model (SLM) or optimize a model for edge deployment, the engineer who has built a GPT from scratch understands exactly which levers to pull.
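To make that convergence concrete, here is a hypothetical skeleton of the inner loop a train.py like this typically contains. The tiny stand-in model and synthetic batch sampler exist only so the snippet runs end to end; the workshop's model.py defines the real GPT.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-ins so the loop runs end to end (hypothetical, not the workshop's code).
vocab_size, block_size, batch_size = 65, 256, 32
model = nn.Sequential(nn.Embedding(vocab_size, 384), nn.Linear(384, vocab_size))
data = torch.randint(0, vocab_size, (10_000,))  # placeholder token stream

def get_batch():
    ix = torch.randint(len(data) - block_size - 1, (batch_size,))
    x = torch.stack([data[i : i + block_size] for i in ix])
    y = torch.stack([data[i + 1 : i + block_size + 1] for i in ix])  # next-char targets
    return x, y

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
for step in range(500):
    xb, yb = get_batch()
    logits = model(xb)                                    # (B, T, vocab_size)
    loss = F.cross_entropy(logits.view(-1, vocab_size), yb.view(-1))
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    if step % 100 == 0:
        print(f"step {step}: loss {loss.item():.4f}")     # watch the curve fall
```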
Understanding the raw operations beneath the abstraction layer transforms the developer from a user of AI into an architect of AI.