Welcome to the 5-dollar-llm repository! This project is dedicated to pushing the limits of training efficiency for an 88M-parameter model trained on 1 billion tokens (roughly GPT-1 scale).
If you don't have a GPU, you may use a cloud GPU.
- Lightning AI: You can use the free L4 GPU.
- Google Colab: Use the free T4 or paid A100.
- Tip: If the model doesn't fit in your GPU memory, you can reduce the model size (e.g., reduce `batch_size`, `n_layer`, or `n_embd` in `configs/llm_config.py`; see the sketch after this list).
- You may rent a GPU affordably at Salad | Novita (or use our affiliate to help us get more compute ❤️) | VastAI. Many GPU providers give 50% off on spot billing.
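As a rough illustration of the tip above, here is a minimal sketch of what shrinking the model might look like. The real structure of `configs/llm_config.py` may differ; only the field names `batch_size`, `n_layer`, and `n_embd` come from the tip, and all values shown are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical sketch of configs/llm_config.py; check the real file for its
# actual structure and defaults before editing.
@dataclass
class LLMConfig:
    n_layer: int = 12     # fewer layers -> smaller model, less memory
    n_embd: int = 768     # narrower embeddings -> fewer parameters
    batch_size: int = 32  # smaller batches -> lower peak activation memory

# To fit a smaller GPU, you might halve these:
small = LLMConfig(n_layer=6, n_embd=384, batch_size=16)
```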
You may watch our tutorial on the AI Research Setup.
We recommend using Python 3.10+.
```bash
git clone https://github.com/Open-Superintelligence-Lab/5-dollar-llm
cd 5-dollar-llm
pip install -r requirements.txt
```

Download the 40M-token speedrun subset:

```bash
python3 -c "
from datasets import load_dataset
import os
print('Downloading 40M Token Subset...')
ds = load_dataset('vukrosic/blueberry-1B-pretrain', split='train[:20000]')
os.makedirs('processed_data/speedrun_40M', exist_ok=True)
ds.save_to_disk('processed_data/speedrun_40M')
print('✅ Speedrun Data Ready!')
"python3 -c "
from datasets import load_dataset
import os
print('Downloading 1B Pretraining Data...')
ds = load_dataset('vukrosic/blueberry-1B-pretrain')
os.makedirs('processed_data/pretrain_1B', exist_ok=True)
ds.save_to_disk('processed_data/pretrain_1B')
print('✅ Full Data Ready!')
"You need to know how our (current) code performs on your hardware before changing it, so you can measure the impact of your changes.
- This is done by simply running `python train_llm.py`.
- After it finishes running, please run it again.
- Keep note of the `Training Time (⏱️ Speedrun):` and `Final Val Loss` from the second run.
- You may notice that these two runs report different training times, even though they execute the exact same code. This is normal: the first run builds / compiles the model, and the second run is the one you need to beat (see the sketch below). If you can make it compile graphs as needed in a single run, without adding that to the training time, please make a pull request.
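To see why the first run is slower, here is a minimal standalone sketch (not the repo's training loop) of the general effect: `torch.compile` pays its tracing and code-generation cost on the first call, and later calls reuse the cached graph.

```python
import time
import torch

# First call triggers tracing + compilation; subsequent calls hit the cache.
model = torch.compile(torch.nn.Linear(512, 512))
x = torch.randn(64, 512)

for run in (1, 2):
    start = time.time()
    model(x)
    print(f"call {run}: {time.time() - start:.3f}s")  # call 1 is much slower
```

The two-run protocol above amortizes this the same way: the first `train_llm.py` run absorbs compilation, and the second measures steady-state speed.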
Now that you have the exact time you need to beat, you can start making changes.
If you ran `python train_llm.py` as mentioned above, you trained the model on 8 million tokens (the default).
Currently we have 4 benchmarks:
- 8,000,000 Tokens
- 20,000,000 Tokens
- 100,000,000 Tokens
- 1,000,000,000 Tokens
An improvement on just one benchmark is enough to submit, but you may measure multiple.
If you wish to try 20M tokens, please run `python train_llm.py --train_tokens 20000000`.
We are not yet sure whether you need to rerun it twice after you have already built the graphs with 8M tokens. We are working on this. As a safe bet, we recommend running the 20M baseline twice as well and keeping the results of the second run (see the sketch below).
The same goes for 100M and 1B tokens, but make sure you have the full 1-billion-token dataset downloaded.
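If you want to script this double-run procedure, a minimal helper might look like the following; the `--train_tokens` flag comes from the command above, and everything else is illustrative.

```python
import subprocess

# Run the 20M baseline twice; keep Training Time and Final Val Loss from the
# second run, since the first run can include graph compilation.
for attempt in (1, 2):
    print(f"--- baseline run {attempt} of 2 ---")
    subprocess.run(
        ["python", "train_llm.py", "--train_tokens", "20000000"],
        check=True,
    )
```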
Add your code changes.
- Only make a single change at a time, and train the model to measure its impact. If the resulting time is much slower than the baseline, your change may have broken the torch graph, so you will have to run it a second time to get the real results.
- Do not combine multiple experiments into one (e.g., learning rate, fused Adam, attention heads), because you will not know what caused an improvement and what caused a regression.
Confirm that your changes outperform the baseline: check the `Training Time (⏱️ Speedrun):` and `Final Val Loss`.
Create a pull request on GitHub into the `main` branch.
Once you submit your changes, we will measure them ourselves, and if they improve performance, we will add you to the leaderboard. You can leave your X / LinkedIn / GitHub / etc. in the pull request.
- Configs: Modify `configs/llm_config.py` to change configs (keep the parameter size around 88M), learning rates, or optimization schedules.
- Model: Edit `models/llm.py` to experiment with new attention mechanisms or layer types (see the sketch below).
- GPU Memory: If the model doesn't fit on your GPU, you can reduce the model size (e.g., `batch_size` or `n_layer`) for faster local iteration.
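As one example of a layer-type experiment, here is a minimal sketch of a SwiGLU feed-forward block you might try in place of a standard MLP. The class and names are hypothetical: adapt them to whatever `models/llm.py` actually defines, and keep the total parameter count near 88M.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """Hypothetical feed-forward variant: silu(gate(x)) * up(x), projected down."""

    def __init__(self, n_embd: int, hidden_mult: float = 8 / 3):
        super().__init__()
        # hidden = (8/3) * n_embd keeps parameters roughly equal to a 4x MLP,
        # since SwiGLU uses three projections instead of two.
        hidden = int(hidden_mult * n_embd)
        self.w_gate = nn.Linear(n_embd, hidden, bias=False)
        self.w_up = nn.Linear(n_embd, hidden, bias=False)
        self.w_down = nn.Linear(hidden, n_embd, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```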