A learning-focused repo for pre-training small language models from scratch, end to end. It takes inspiration from nanochat but uses PyTorch Lightning and a structure that fits my style better.
- GPT-2 & Llama 3 implementations from scratch
- PyTorch Lightning for the training loop
- rustbpe + tiktoken tokenizer training and inference artifacts
- Config-driven pipeline (model, data, tokenizer, training)
- Trackio for experiment monitoring
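A config-driven pipeline like this can be sketched with plain dataclasses. The names and fields below are illustrative assumptions, not the repo's actual schema:

```python
# Hypothetical config sketch -- field names are assumptions, not the
# repo's real config schema.
from dataclasses import dataclass, asdict

@dataclass
class ModelConfig:
    n_layer: int = 2
    n_head: int = 4
    n_embd: int = 256
    vocab_size: int = 8192  # matches the 8k-vocab TinyStories run

@dataclass
class TrainConfig:
    batch_size: int = 64
    lr: float = 3e-4
    max_steps: int = 10_000

# Scripts can then consume one nested dict for model/data/training.
cfg = {"model": asdict(ModelConfig()), "train": asdict(TrainConfig())}
print(cfg["model"]["vocab_size"])
```

Keeping every knob in one serializable object makes each run reproducible from its config alone.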
```shell
uv run python scripts/tokenizer/train_tokenizer.py
uv run python scripts/data/tokenize_data.py
uv run python scripts/training/train_gpt2.py
uv run python scripts/inference/generate.py
```

Each script has defaults and flags; take a quick look in scripts/.
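At its core, the tokenizer-training step performs repeated byte-pair merges. A minimal pure-Python sketch of one merge round (illustrative only; the actual rustbpe + tiktoken implementation differs):

```python
# Toy BPE merge round -- illustration of the idea, not the rustbpe code.
from collections import Counter

def most_frequent_pair(ids):
    """Return the most common adjacent token pair in a sequence."""
    pairs = Counter(zip(ids, ids[1:]))
    return pairs.most_common(1)[0][0]

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

ids = list(b"aaabdaaabac")          # start from raw bytes
pair = most_frequent_pair(ids)      # (97, 97), i.e. "aa"
ids = merge(ids, pair, 256)         # first new token id after the 256 bytes
print(ids)
```

Training repeats this loop until the target vocab size (8k in the run below) is reached; inference then replays the learned merges in order.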
Artifacts:
- Tokenizers: `tokenizers/`
- Tokenized data: `data/`
- Runs/checkpoints: `runs/`
As a starting point, I validated the code by training a 2-layer GPT-2 model on the TinyStories dataset (see paper) with an 8k-vocab tokenizer.
Training took ~25 minutes on an RTX 5090; there is surely room for optimization.
*(Training curves: tokens/sec, train loss, val loss, and val perplexity.)*
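For reference, validation perplexity is just the exponential of the mean cross-entropy loss (in nats per token):

```python
import math

def perplexity(mean_nll: float) -> float:
    """Perplexity = exp(mean cross-entropy loss in nats per token)."""
    return math.exp(mean_nll)

# e.g. a mean validation loss of 1.5 nats/token
print(round(perplexity(1.5), 2))
```

This is why the val-loss and val-perplexity curves carry the same information on different scales.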
Example text generated:
<bos>Once upon a time there was a little girl called Anna. She was three years old and very curious. One day she decided to take a look. She stepped into the field and saw a beautiful butterfly fluttering around. She felt so happy and she wanted to follow it.
Suddenly, Anna heard a voice. It was her mom, who had said to go to the pond. Anna was so excited! She followed her mom and Dad to the pond. They got a bucket of water and a bucket of water.
Anna carefully put the net in the bucket and then she washed it in the water. Then she added it to the pond. She then swam back to the pond and touched the butterfly. It felt so good! Anna was so proud of herself!<eos>
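Text like the above comes from autoregressive decoding. A minimal sketch of temperature sampling, the kind of decoding a generation script might use (illustrative; the real `generate.py` flags and logic may differ):

```python
# Toy temperature sampling over one logits vector -- an illustration,
# not the repo's actual generate.py implementation.
import math
import random

def sample(logits, temperature=1.0, rng=random):
    """Sample a token id: scale logits by 1/temperature, softmax, draw."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                          # subtract max for stability
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    r, acc = rng.random(), 0.0
    for i, p in enumerate(probs):            # inverse-CDF draw
        acc += p
        if r <= acc:
            return i
    return len(probs) - 1

random.seed(0)
print(sample([2.0, 0.5, 0.1], temperature=0.8))
```

Lower temperatures sharpen the distribution toward the argmax token; higher ones flatten it and increase diversity. In a real loop the sampled id is appended to the context and fed back into the model.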
- Review and optimize the code
- Improve documentation
- Implement more modern architectures, starting with Llama 3
I encourage everyone to write their own version of this repo from scratch. You can use this one as inspiration if it helps (just as I did with nanochat). What matters to me is learning how these models work and how the tokenization and training pipelines fit together. For serious multi-GPU training there are specialized libraries.