# llm-zero-to-trained

A fully manual, from-scratch implementation of a Large Language Model (LLM), guided by the book *Build a Large Language Model from Scratch* by Sebastian Raschka.
Unlike simply running code from the reference repo, this project is about understanding and re-implementing every major component — tokenizer, model, data loader, training loop — by hand.
The goal is to develop a working LLM from first principles by:
- Writing each component from scratch in Python
- Following the structure and logic from Raschka’s book
- Validating ideas through notebooks and experiments
- Building modular, reusable, CLI-driven code
```text
llm-zero-to-trained/
├── src/llmscratch/      ← Modular CLI-driven Python package
│   ├── config/          ← Config loader with .env, CLI, YAML support
│   ├── models/          ← Core dataclasses and SingletonMeta
│   ├── runtime/         ← CLI argument parsing and dispatch
│   ├── launch_host.py   ← Entry point for all commands (e.g., preprocess)
│   └── host.py          ← Command execution coordinator
├── notebooks/           ← Book-aligned exploration notebooks
├── configs/             ← YAML configs for datasets, vocab, etc.
├── datasets/            ← Raw and processed tokenized data
├── pyproject.toml       ← Project metadata and CLI definition
├── README.md            ← This file
└── PROGRESS.md          ← Running log of milestones
```
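The tree above mentions a `SingletonMeta` under `models/`. The actual implementation isn't shown in this README, but a metaclass-based singleton in Python typically looks like the following minimal sketch (the `Config` class here is a hypothetical example, not the project's real config class):

```python
class SingletonMeta(type):
    """Metaclass that caches one instance per class.

    Any class using this metaclass returns the same object on
    every construction, which is a common pattern for app-wide
    configuration holders.
    """

    _instances = {}

    def __call__(cls, *args, **kwargs):
        # Create the instance only on first call; reuse it afterwards.
        if cls not in cls._instances:
            cls._instances[cls] = super().__call__(*args, **kwargs)
        return cls._instances[cls]


class Config(metaclass=SingletonMeta):
    """Hypothetical config holder demonstrating the metaclass."""

    def __init__(self):
        self.settings = {}
```

With this pattern, `Config()` called anywhere in the codebase yields the same shared instance, so settings loaded once (from YAML, `.env`, or CLI flags) are visible everywhere.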
- 📘 Build a Large Language Model from Scratch – Sebastian Raschka (2024)
- 💻 LLMs-from-scratch GitHub Repo
- 🧠 Karpathy’s minGPT and nanoGPT (inspirational, but not reused)
See PROGRESS.md for completed milestones, model checkpoints, and active development notes.
```shell
git clone https://github.com/kjpou1/llm-zero-to-trained.git
cd llm-zero-to-trained
```
## 🧰 Environment Setup (with uv)
```shell
uv venv
source .venv/bin/activate   # macOS/Linux
# OR
.venv\Scripts\activate      # Windows
uv pip install --editable .
uv sync
```
🧪 This enables the `llmscratch` CLI from anywhere and installs all dependencies with reproducible locking via `uv`.
Use the CLI to run modular LLM pipelines:
```shell
llmscratch preprocess --config configs/data_config.yaml
```
More commands like `train`, `sample`, and `evaluate` will follow as the project evolves.
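The `runtime/` package handles argument parsing and dispatch, but its internals aren't shown in this README. As a rough sketch of how an `argparse`-based subcommand dispatcher for a CLI like this could look (the `preprocess` command and `--config` flag come from the example above; the function names are hypothetical):

```python
import argparse


def run_preprocess(args):
    """Hypothetical handler: the real command would load the YAML
    config and tokenize the raw dataset."""
    return f"preprocess with {args.config}"


def build_parser():
    """Build the top-level parser with one subcommand per pipeline stage."""
    parser = argparse.ArgumentParser(prog="llmscratch")
    subparsers = parser.add_subparsers(dest="command", required=True)

    preprocess = subparsers.add_parser(
        "preprocess", help="Tokenize raw text into training data"
    )
    preprocess.add_argument(
        "--config", required=True, help="Path to a YAML config file"
    )
    # Each subcommand binds its handler via set_defaults, so dispatch
    # is just a call to args.func after parsing.
    preprocess.set_defaults(func=run_preprocess)
    return parser


def main(argv=None):
    args = build_parser().parse_args(argv)
    return args.func(args)
```

New commands (`train`, `sample`, `evaluate`) would each add another subparser and handler, keeping dispatch declarative rather than a chain of if/elif branches.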
While the architecture is influenced by great projects, all code is original and written from scratch.
This project includes implementation-focused documentation aligned with academic papers and architectural design:
- `bpe_implementation.md` — Byte Pair Encoding (BPE) training process, aligned with Sennrich et al. (2015)
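The core of the BPE training process described in that document is a simple loop: count adjacent symbol pairs, merge the most frequent pair into a new symbol, and repeat. A minimal sketch of that loop (function names are illustrative, not the project's actual API):

```python
from collections import Counter


def most_frequent_pair(token_seqs):
    """Return the most common adjacent symbol pair across all sequences."""
    pairs = Counter()
    for seq in token_seqs:
        for a, b in zip(seq, seq[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(seq, pair):
    """Replace every occurrence of `pair` in `seq` with one merged symbol."""
    merged, i = [], 0
    while i < len(seq):
        if i < len(seq) - 1 and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged


def train_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words,
    starting from individual characters (per Sennrich et al., 2015)."""
    seqs = [list(w) for w in words]
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(seqs)
        if pair is None:
            break
        merges.append(pair)
        seqs = [merge_pair(s, pair) for s in seqs]
    return merges, seqs
```

For example, training two merges on `["low", "lower", "lowest"]` first merges `("l", "o")`, then `("lo", "w")`, leaving every word starting with the single symbol `low`.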
This project is MIT licensed.