Smallm (smaLL + LLm) is my attempt at making a tiny toy language model, just for fun and educational purposes. It has about 28M parameters and is trained on 300k samples of the Cosmopedia dataset, roughly 325M tokens. That is tiny by LLM standards, which also explains why it is kinda goofy when you use it (lol), but you can train it on a mid-range card in half a day to two days, and it still generates proper English that stays at least related to the user's prompt.
Set up a venv and install the necessary packages:

```bash
# Create and activate the venv (run this every time you start)
python -m venv venv
source venv/bin/activate
# or "venv\Scripts\activate" if you are on Windows

# Install packages (once)
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu128
pip install tiktoken datasets
```
Of course, you should already have compatible CUDA and Python versions installed. I currently use Python 3.13 and CUDA 13 (which is compatible with the CUDA 12.8 builds mentioned above).
- Download the latest model (`chatbot.pth`) from the releases page.
- Simply run:

```bash
python main.py
```

A prompt will appear for you to chat with the model. You can also import the `ChatBot` class for more control if needed.
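For example, programmatic use could look like this; the constructor and method names below are guesses, so check `main.py` for the actual `ChatBot` API:

```python
# Hypothetical usage sketch -- the constructor and method names are
# assumptions, check main.py for the real ChatBot API.
from main import ChatBot

bot = ChatBot()  # assumed to load ./chatbot.pth by default
reply = bot.generate("Tell me about volcanoes.", temperature=0.8)
print(reply)
```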
Head over to `./main.py` and change `training` to `True`, then run:

```bash
python main.py
```

The model will train for 10 epochs (an estimated 18-20 hours on my laptop RTX 5070), saving the current weights to `./chatbot.pth` after each epoch.

To pick up where you left off, rename your checkpoint to `chatbot_continue.pth` and training will resume from it.
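A rough sketch of how that resume check can work; the file name matches the convention above, everything else is illustrative:

```python
import os
import torch
import torch.nn as nn

# Illustrative resume logic -- the real code lives in main.py and may differ.
CONTINUE_PATH = "chatbot_continue.pth"

model = nn.Linear(8, 8)  # stand-in for the actual model
if os.path.exists(CONTINUE_PATH):
    # Load the weights saved at the end of a previous epoch and keep training.
    model.load_state_dict(torch.load(CONTINUE_PATH, map_location="cpu"))
    print(f"Resuming training from {CONTINUE_PATH}")
else:
    print("No checkpoint found, training from scratch.")
```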
Currently it uses:
- Tokenizer: Tiktoken with GPT-2 encoding (50,257 vocab size)
- Embedding: 256-dimensional token embeddings
- Positional Encoding: Sinusoidal positional encoding
- Input Projection: linear layer from the embeddings into the transformer
- Transformer: 3 encoder layers, 8 attention heads, 1024 block size, 256 d_model
- Output: Linear layer to vocabulary
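For reference, here is a minimal PyTorch sketch of that stack; the shapes follow the list above, but the class and helper names are illustrative rather than the actual `main.py` code:

```python
import math
import torch
import torch.nn as nn

# Illustrative sketch of the stack above -- not the actual main.py code.
VOCAB, D_MODEL, N_HEADS, N_LAYERS, BLOCK = 50257, 256, 8, 3, 1024

def sinusoidal_pe(length: int, d_model: int) -> torch.Tensor:
    """Standard sinusoidal positional encoding, shape (length, d_model)."""
    pos = torch.arange(length).unsqueeze(1)
    div = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(length, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

class TinyLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_MODEL)   # 256-dim token embeddings
        self.register_buffer("pe", sinusoidal_pe(BLOCK, D_MODEL))
        self.in_proj = nn.Linear(D_MODEL, D_MODEL)  # linear input projection
        layer = nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=N_HEADS,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=N_LAYERS)
        self.out = nn.Linear(D_MODEL, VOCAB)        # back to vocabulary logits

    def forward(self, idx):  # idx: (batch, seq) of token ids
        T = idx.size(1)
        x = self.in_proj(self.embed(idx) + self.pe[:T])
        # Causal mask so each position only attends to earlier tokens.
        mask = nn.Transformer.generate_square_subsequent_mask(T).to(idx.device)
        return self.out(self.encoder(x, mask=mask))  # (batch, seq, vocab)
```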
It is trained with:
- Dataset: Cosmopedia (~325M tokens)
- Context Window: 1024 tokens
- Batch Size: 8 (effective batch size: 64 with gradient accumulation)
- Optimizer: AdamW with mixed precision training
- Epochs: 10 (about 18 hours on the laptop-version RTX 5070)
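The effective batch size of 64 comes from accumulating gradients over 8 micro-batches of 8 samples. A generic sketch of that pattern combined with mixed precision (illustrative, not the actual training loop):

```python
import torch
import torch.nn.functional as F

def train_epoch(model, loader, optimizer, accum_steps=8, device="cuda"):
    """Mixed-precision training with gradient accumulation: 8 micro-batches
    of 8 samples give the effective batch size of 64 listed above.
    Illustrative pattern, not the actual main.py loop."""
    scaler = torch.amp.GradScaler("cuda")
    model.train()
    optimizer.zero_grad()
    for step, (x, y) in enumerate(loader):
        x, y = x.to(device), y.to(device)
        with torch.autocast(device_type="cuda", dtype=torch.float16):
            logits = model(x)  # (batch, seq, vocab)
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
        # Divide by accum_steps so the accumulated gradients average correctly.
        scaler.scale(loss / accum_steps).backward()
        if (step + 1) % accum_steps == 0:
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad()
```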
It generates text with:
- Sampling: Top-k sampling (k=50)
- Temperature: 0.8 (configurable)
- Context Window: 1024 tokens
- Stopping: Natural EOS token or conversation breaks
- Simple repetition penalty
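Put together, a single decoding step under those settings looks roughly like this; the repetition penalty here is the common divide-positive/multiply-negative variant, and the exact formulation in `main.py` may differ:

```python
import torch

def sample_next_token(logits, generated, temperature=0.8, top_k=50, penalty=1.2):
    """One decoding step: repetition penalty -> temperature -> top-k sampling.

    `logits` is the (vocab_size,) tensor for the last position and `generated`
    is the list of token ids produced so far. Illustrative sketch only.
    """
    logits = logits.clone()
    # Simple repetition penalty: push down tokens we have already emitted.
    for tok in set(generated):
        logits[tok] = logits[tok] / penalty if logits[tok] > 0 else logits[tok] * penalty
    # Temperature below 1.0 sharpens the distribution; above 1.0 flattens it.
    logits = logits / temperature
    # Keep only the k most likely tokens, renormalize, and sample one of them.
    vals, idx = torch.topk(logits, top_k)
    probs = torch.softmax(vals, dim=-1)
    return idx[torch.multinomial(probs, 1)].item()
```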
Copyright © 2025 Nguyen Phu Minh.
This project is licensed under the GPL 3.0 License.