A learning-focused repo for pre-training small language models from scratch, end to end. It takes inspiration from nanochat but uses PyTorch Lightning and a structure that fits my style better.
- GPT-2 & Llama 3 implementations from scratch
- PyTorch Lightning for the training loop
- rustbpe + tiktoken tokenizer training and inference artifacts
- Config-driven pipeline (model, data, tokenizer, training)
- Trackio for experiment monitoring
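A config-driven pipeline like this can be sketched with plain dataclasses. The names and fields below are illustrative assumptions, not the repo's actual schema:

```python
# Hypothetical config sketch -- field names are assumptions, not the
# repo's real config schema.
from dataclasses import dataclass, asdict

@dataclass
class ModelConfig:
    n_layer: int = 2
    n_head: int = 4
    n_embd: int = 256
    vocab_size: int = 8192  # matches the 8k-vocab TinyStories run

@dataclass
class TrainConfig:
    batch_size: int = 64
    lr: float = 3e-4
    max_steps: int = 10_000

# Scripts can then consume one nested dict for model/data/training.
cfg = {"model": asdict(ModelConfig()), "train": asdict(TrainConfig())}
print(cfg["model"]["vocab_size"])
```

Keeping every knob in one serializable object makes each run reproducible from its config alone.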
```shell
uv run python scripts/tokenizer/train_tokenizer.py
uv run python scripts/data/tokenize_data.py
uv run python scripts/training/train_gpt2.py
uv run python scripts/inference/generate.py
```

Each script has defaults and flags; take a quick look in scripts/.
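At its core, the tokenizer-training step performs repeated byte-pair merges. A minimal pure-Python sketch of one merge round (illustrative only; the actual rustbpe + tiktoken implementation differs):

```python
# Toy BPE merge round -- illustration of the idea, not the rustbpe code.
from collections import Counter

def most_frequent_pair(ids):
    """Return the most common adjacent token pair in a sequence."""
    pairs = Counter(zip(ids, ids[1:]))
    return pairs.most_common(1)[0][0]

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

ids = list(b"aaabdaaabac")          # start from raw bytes
pair = most_frequent_pair(ids)      # (97, 97), i.e. "aa"
ids = merge(ids, pair, 256)         # first new token id after the 256 bytes
print(ids)
```

Training repeats this loop until the target vocab size (8k in the run below) is reached; inference then replays the learned merges in order.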
Artifacts:
- Tokenizers: `tokenizers/`
- Tokenized data: `data/`
- Runs/checkpoints: `runs/`
As a starting point, I validated the code by training a 2-layer GPT-2 model on the TinyStories dataset (see paper) with an 8k-vocab tokenizer.
Training took ~25 minutes on an RTX 5090; there is surely room for optimization.
*(Training curves: tokens/sec, train loss, val loss, and val perplexity.)*
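For reference, validation perplexity is just the exponential of the mean cross-entropy loss (in nats per token):

```python
import math

def perplexity(mean_nll: float) -> float:
    """Perplexity = exp(mean cross-entropy loss in nats per token)."""
    return math.exp(mean_nll)

# e.g. a mean validation loss of 1.5 nats/token
print(round(perplexity(1.5), 2))
```

This is why the val-loss and val-perplexity curves carry the same information on different scales.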
Example text generated:
<bos>Once upon a time there was a little girl called Anna. She was three years old and very curious. One day she decided to take a look. She stepped into the field and saw a beautiful butterfly fluttering around. She felt so happy and she wanted to follow it.
Suddenly, Anna heard a voice. It was her mom, who had said to go to the pond. Anna was so excited! She followed her mom and Dad to the pond. They got a bucket of water and a bucket of water.
Anna carefully put the net in the bucket and then she washed it in the water. Then she added it to the pond. She then swam back to the pond and touched the butterfly. It felt so good! Anna was so proud of herself!<eos>
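Text like the above comes from autoregressive decoding. A minimal sketch of temperature sampling, the kind of decoding a generation script might use (illustrative; the real `generate.py` flags and logic may differ):

```python
# Toy temperature sampling over one logits vector -- an illustration,
# not the repo's actual generate.py implementation.
import math
import random

def sample(logits, temperature=1.0, rng=random):
    """Sample a token id: scale logits by 1/temperature, softmax, draw."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                          # subtract max for stability
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    r, acc = rng.random(), 0.0
    for i, p in enumerate(probs):            # inverse-CDF draw
        acc += p
        if r <= acc:
            return i
    return len(probs) - 1

random.seed(0)
print(sample([2.0, 0.5, 0.1], temperature=0.8))
```

Lower temperatures sharpen the distribution toward the argmax token; higher ones flatten it and increase diversity. In a real loop the sampled id is appended to the context and fed back into the model.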
- Review and optimize the code
- Improve documentation
- Implement more modern architectures, starting with Llama 3
I encourage everyone to write their own version of this repo from scratch. You can use this one as inspiration if it helps (just as I did with nanochat). What matters to me is learning how these models work and how the tokenization and training pipelines fit together. For serious multi-GPU training there are specialized libraries.