# llm-zero-to-trained

A fully manual, from-scratch implementation of a Large Language Model (LLM), guided by the book *Build a Large Language Model from Scratch* by Sebastian Raschka.
Unlike simply running code from the reference repo, this project is about understanding and re-implementing every major component — tokenizer, model, data loader, training loop — by hand.
The goal is to develop a working LLM from first principles by:
- Writing each component from scratch in Python
- Following the structure and logic from Raschka’s book
- Validating ideas through notebooks and experiments
- Building modular, reusable, CLI-driven code
```text
llm-zero-to-trained/
├── src/llmscratch/      ← Modular CLI-driven Python package
│   ├── config/          ← Config loader with .env, CLI, YAML support
│   ├── models/          ← Core dataclasses and SingletonMeta
│   ├── runtime/         ← CLI argument parsing and dispatch
│   ├── launch_host.py   ← Entry point for all commands (e.g., preprocess)
│   └── host.py          ← Command execution coordinator
├── notebooks/           ← Book-aligned exploration notebooks
├── configs/             ← YAML configs for datasets, vocab, etc.
├── datasets/            ← Raw and processed tokenized data
├── pyproject.toml       ← Project metadata and CLI definition
├── README.md            ← This file
└── PROGRESS.md          ← Running log of milestones
```
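The tree above mentions a `SingletonMeta` under `models/`. The actual implementation isn't shown in this README, but a metaclass-based singleton in Python typically looks like the following minimal sketch (the `Config` class here is a hypothetical example, not the project's real config class):

```python
class SingletonMeta(type):
    """Metaclass that caches one instance per class.

    Any class using this metaclass returns the same object on
    every construction, which is a common pattern for app-wide
    configuration holders.
    """

    _instances = {}

    def __call__(cls, *args, **kwargs):
        # Create the instance only on first call; reuse it afterwards.
        if cls not in cls._instances:
            cls._instances[cls] = super().__call__(*args, **kwargs)
        return cls._instances[cls]


class Config(metaclass=SingletonMeta):
    """Hypothetical config holder demonstrating the metaclass."""

    def __init__(self):
        self.settings = {}
```

With this pattern, `Config()` called anywhere in the codebase yields the same shared instance, so settings loaded once (from YAML, `.env`, or CLI flags) are visible everywhere.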
- 📘 Build a Large Language Model from Scratch – Sebastian Raschka (2024)
- 💻 LLMs-from-scratch GitHub Repo
- 🧠 Karpathy’s minGPT and nanoGPT (inspirational, but not reused)
See PROGRESS.md for completed milestones, model checkpoints, and active development notes.
```shell
git clone https://github.com/kjpou1/llm-zero-to-trained.git
cd llm-zero-to-trained
```
## 🧰 Environment Setup (with uv)
```shell
uv venv
source .venv/bin/activate   # macOS/Linux
# OR
.venv\Scripts\activate      # Windows
uv pip install --editable .
uv sync
```
🧪 This enables the `llmscratch` CLI from anywhere and installs all dependencies with reproducible locking via `uv`.
Use the CLI to run modular LLM pipelines:
```shell
llmscratch preprocess --config configs/data_config.yaml
```
More commands like `train`, `sample`, and `evaluate` will follow as the project evolves.
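The `runtime/` package handles argument parsing and dispatch, but its internals aren't shown in this README. As a rough sketch of how an `argparse`-based subcommand dispatcher for a CLI like this could look (the `preprocess` command and `--config` flag come from the example above; the function names are hypothetical):

```python
import argparse


def run_preprocess(args):
    """Hypothetical handler: the real command would load the YAML
    config and tokenize the raw dataset."""
    return f"preprocess with {args.config}"


def build_parser():
    """Build the top-level parser with one subcommand per pipeline stage."""
    parser = argparse.ArgumentParser(prog="llmscratch")
    subparsers = parser.add_subparsers(dest="command", required=True)

    preprocess = subparsers.add_parser(
        "preprocess", help="Tokenize raw text into training data"
    )
    preprocess.add_argument(
        "--config", required=True, help="Path to a YAML config file"
    )
    # Each subcommand binds its handler via set_defaults, so dispatch
    # is just a call to args.func after parsing.
    preprocess.set_defaults(func=run_preprocess)
    return parser


def main(argv=None):
    args = build_parser().parse_args(argv)
    return args.func(args)
```

New commands (`train`, `sample`, `evaluate`) would each add another subparser and handler, keeping dispatch declarative rather than a chain of if/elif branches.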
While the architecture is influenced by great projects, all code is original and written from scratch.
This project includes implementation-focused documentation aligned with academic papers and architectural design:
- `bpe_implementation.md` — Byte Pair Encoding (BPE) training process, aligned with Sennrich et al. (2015)
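The core of the BPE training process described in that document is a simple loop: count adjacent symbol pairs, merge the most frequent pair into a new symbol, and repeat. A minimal sketch of that loop (function names are illustrative, not the project's actual API):

```python
from collections import Counter


def most_frequent_pair(token_seqs):
    """Return the most common adjacent symbol pair across all sequences."""
    pairs = Counter()
    for seq in token_seqs:
        for a, b in zip(seq, seq[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(seq, pair):
    """Replace every occurrence of `pair` in `seq` with one merged symbol."""
    merged, i = [], 0
    while i < len(seq):
        if i < len(seq) - 1 and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged


def train_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words,
    starting from individual characters (per Sennrich et al., 2015)."""
    seqs = [list(w) for w in words]
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(seqs)
        if pair is None:
            break
        merges.append(pair)
        seqs = [merge_pair(s, pair) for s in seqs]
    return merges, seqs
```

For example, training two merges on `["low", "lower", "lowest"]` first merges `("l", "o")`, then `("lo", "w")`, leaving every word starting with the single symbol `low`.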
This project is MIT licensed.