
LLM: Zero to Trained

A fully manual, from-scratch implementation of a Large Language Model (LLM), guided by the book Build a Large Language Model from Scratch by Sebastian Raschka.

Unlike simply running code from the reference repo, this project is about understanding and re-implementing every major component — tokenizer, model, data loader, training loop — by hand.
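
To give a flavor of what "by hand" means here, consider the data loader: the book builds input/target pairs by sliding a window over a flat stream of token IDs. The following is an illustrative sketch in that spirit (assuming PyTorch; this is not this repo's actual code):

```python
# Illustrative sketch only — a sliding-window dataset in the spirit of
# the book's approach; the repo's actual data loader may differ.
import torch
from torch.utils.data import Dataset


class SlidingWindowDataset(Dataset):
    """Turns a flat list of token IDs into (input, target) pairs,
    where the target is the input shifted one position to the right."""

    def __init__(self, token_ids, max_length, stride):
        self.inputs, self.targets = [], []
        for i in range(0, len(token_ids) - max_length, stride):
            chunk = token_ids[i : i + max_length + 1]
            self.inputs.append(torch.tensor(chunk[:-1]))
            self.targets.append(torch.tensor(chunk[1:]))

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        return self.inputs[idx], self.targets[idx]
```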



🎯 Objective

To develop a working LLM from first principles by:

  • Writing each component from scratch in Python
  • Following the structure and logic from Raschka’s book
  • Validating ideas through notebooks and experiments
  • Building modular, reusable, CLI-driven code

📁 Project Structure

llm-zero-to-trained/
├── src/llmscratch/        ← Modular CLI-driven Python package
│   ├── config/            ← Config loader with .env, CLI, YAML support
│   ├── models/            ← Core dataclasses and SingletonMeta
│   ├── runtime/           ← CLI argument parsing and dispatch
│   ├── launch_host.py     ← Entry point for all commands (e.g., preprocess)
│   └── host.py            ← Command execution coordinator
├── notebooks/             ← Book-aligned exploration notebooks
├── configs/               ← YAML configs for datasets, vocab, etc.
├── datasets/              ← Raw and processed tokenized data
├── pyproject.toml         ← Project metadata and CLI definition
├── README.md              ← This file
└── PROGRESS.md            ← Running log of milestones
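
The models/ package above mentions a SingletonMeta. As an illustration only (the repo's actual implementation isn't reproduced here), a conventional singleton metaclass in Python looks like this:

```python
# Illustrative only — a conventional singleton metaclass; the repo's
# actual SingletonMeta may differ in detail.
class SingletonMeta(type):
    """Metaclass that caches one instance per class."""

    _instances: dict = {}

    def __call__(cls, *args, **kwargs):
        if cls not in cls._instances:
            cls._instances[cls] = super().__call__(*args, **kwargs)
        return cls._instances[cls]


class Config(metaclass=SingletonMeta):
    pass


assert Config() is Config()  # every call returns the same instance
```

This pattern is a natural fit for a config loader: however many modules construct `Config`, they all share the one instance.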

🧠 Learning Sources

  • Build a Large Language Model from Scratch by Sebastian Raschka (Manning)
  • The book's official code repository: https://github.com/rasbt/LLMs-from-scratch

🗒️ Progress Log

See PROGRESS.md for completed milestones, model checkpoints, and active development notes.


🔧 Getting Started

git clone https://github.com/kjpou1/llm-zero-to-trained.git
cd llm-zero-to-trained

🧰 Environment Setup (with uv)

uv venv
source .venv/bin/activate        # macOS/Linux
# OR
.venv\Scripts\activate           # Windows

uv pip install --editable .
uv sync

🧪 This makes the llmscratch CLI available from the activated environment and installs all dependencies with reproducible locking via uv.


🚀 CLI: Start Building

Use the CLI to run modular LLM pipelines:

llmscratch preprocess --config configs/data_config.yaml

More commands like train, sample, and evaluate will follow as the project evolves.
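
To make the command flow concrete, here is a hypothetical sketch of how an argparse-based dispatcher (like the one implied by runtime/ and launch_host.py) might be wired. The names, structure, and use of PyYAML are assumptions for illustration, not the repo's actual code:

```python
# Hypothetical sketch of a subcommand dispatcher; the repo's runtime/
# and launch_host.py are not reproduced here and may differ.
import argparse

import yaml  # PyYAML, assumed as a dependency for config parsing


def preprocess(config: dict) -> None:
    # Placeholder handler: a real one would tokenize the raw dataset.
    print(f"Preprocessing with config: {config}")


def main() -> None:
    parser = argparse.ArgumentParser(prog="llmscratch")
    subparsers = parser.add_subparsers(dest="command", required=True)

    prep = subparsers.add_parser("preprocess", help="Tokenize raw data")
    prep.add_argument("--config", required=True, help="Path to a YAML config")

    args = parser.parse_args()
    if args.command == "preprocess":
        with open(args.config) as f:
            preprocess(yaml.safe_load(f))


if __name__ == "__main__":
    main()
```

Presumably the console script itself is registered under [project.scripts] in pyproject.toml, which is what makes the llmscratch command available after the editable install above.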


📚 References & Inspirations

While the architecture is influenced by great projects, all code in this repository is original and written from scratch.


📄 Internal Documentation

This project includes implementation-focused documentation aligned with academic papers and architectural design:

  • bpe_implementation.md — Byte Pair Encoding (BPE) training process, aligned with Sennrich et al. (2015)
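
To give a feel for the algorithm that document covers, here is a minimal, self-contained sketch of BPE merge steps in the style of Sennrich et al. (2015). It is illustrative and simplified, not the repo's actual implementation:

```python
# Minimal BPE sketch after Sennrich et al. (2015); illustrative only.
from collections import Counter


def get_pair_counts(vocab: dict) -> Counter:
    """Count adjacent symbol pairs across the word vocabulary."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(pair: tuple, vocab: dict) -> dict:
    """Merge the most frequent pair into a single symbol everywhere.
    Uses plain str.replace for brevity; a robust version would respect
    symbol boundaries (e.g., via a regex), as the paper's reference code does."""
    merged = " ".join(pair)
    replacement = "".join(pair)
    return {word.replace(merged, replacement): freq for word, freq in vocab.items()}


# Words are space-separated symbol sequences, as in the original paper.
vocab = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}
for _ in range(3):
    best = get_pair_counts(vocab).most_common(1)[0][0]
    vocab = merge_pair(best, vocab)
    print("merged:", best)
```

Running this prints the merges in order: ('e', 's'), then ('es', 't'), then ('l', 'o').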

📜 License

This project is MIT licensed.
