A modular, research-friendly Transformer architecture for building and experimenting with small-scale language models — designed for learning, exploration, and open research.
💡 This project lets you define your own Transformer architectures (variable layers, heads, activations, dimensions) and train them on your own datasets. The included pipeline demonstrates this end-to-end on the TinyStories dataset.
✅ Fully modular Transformer architecture
- Dynamic number of attention heads per layer
- Configurable activation functions (ReLU, GELU, SiLU)
- Pre-LayerNorm + Residual design for stability
✅ Flexible dataset pipeline
- Uses Hugging Face Datasets
- Tokenizes and chunks datasets into fixed-length training blocks
- Saves & loads efficiently with Arrow format
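The tokenize-and-chunk step above can be sketched as a plain function — a hedged illustration of the idea, not the repo's actual API (the function name and the drop-the-remainder behavior are assumptions):

```python
def chunk_tokens(token_ids, block_size):
    """Split a flat stream of token IDs into fixed-length training blocks.

    Tokens that do not fill a final complete block are dropped here,
    a common simplification in small-scale LM data pipelines.
    """
    n_blocks = len(token_ids) // block_size
    return [
        token_ids[i * block_size : (i + 1) * block_size]
        for i in range(n_blocks)
    ]

# 10 tokens with block_size=4 yield two full blocks; the last 2 tokens are dropped.
blocks = chunk_tokens(list(range(10)), block_size=4)
```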
✅ Research-grade training pipeline
- Config-driven hyperparameters
- AdamW optimizer + gradient clipping
- Warmup + cosine LR scheduler
- Optional mixed precision (AMP)
- Hugging Face model saving/loading compatibility
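The warmup-plus-cosine learning rate schedule listed above can be sketched as a standalone function — a minimal illustration, not the repo's exact implementation (parameter names here are assumptions):

```python
import math

def lr_at_step(step, max_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warmup to max_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Linear ramp: step 0 gets max_lr / warmup_steps, last warmup step gets max_lr.
        return max_lr * (step + 1) / warmup_steps
    # Cosine decay over the remaining steps, clamped at the end of training.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

A schedule like this is typically applied per optimizer step by setting `param_group["lr"]` before `optimizer.step()`.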
✅ Multi-GPU (Distributed) training support
- Out-of-the-box DistributedDataParallel (DDP)
- Sharded sampling for each process
- Rank-aware logging & checkpointing
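The per-process sharded sampling can be illustrated with a small pure-Python sketch of what `torch.utils.data.DistributedSampler` does internally — deterministic shuffle, then a strided partition so each rank sees a disjoint shard (function and parameter names are illustrative):

```python
import random

def shard_indices(num_samples, rank, world_size, epoch=0, seed=0):
    """Return the dataset indices assigned to one DDP process.

    All ranks shuffle with the same seed (so they agree on the order),
    then each rank takes every world_size-th index starting at its rank.
    """
    rng = random.Random(seed + epoch)
    indices = list(range(num_samples))
    rng.shuffle(indices)
    return indices[rank::world_size]
```

Seeding with `seed + epoch` reshuffles the data each epoch while keeping every rank's view consistent.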
✅ Open-source ready
- Easy to modify, extend, and push to the Hugging Face Hub
- Educational and clean codebase
```
SLM_Skeleton/
├── config_slm.py                          # Centralized configuration (model, data, training)
├── data_module.py                         # Dataset loading and tokenization
├── embedding_module.py                    # Token + positional embeddings
├── multihead_self_attention.py            # Scaled dot-product attention (supports variable heads)
├── transformer_block.py                   # Transformer block (Pre-LN + MHA + FFN)
├── train_slm.py                           # Full training script with scheduler
├── prepare_and_save_tokenized_dataset.py  # Preprocess TinyStories
└── trained_slm/                           # Output directory for saved models
```
Clone the repository:

```shell
git clone https://github.com/aditya20t/SLM_Skeleton.git
cd SLM_Skeleton
```

Create and activate a virtual environment:

```shell
python -m venv venv
source venv/bin/activate
```

Install dependencies:

```shell
pip install -r requirements.txt
```
By default, SLM uses the TinyStories dataset. Run the preprocessing script:

```shell
python data_module.py
```

SLM follows a GPT-style decoder-only Transformer with a modular configuration, built on the Pre-LayerNorm design for better training stability.
Architecture Flow:

```
Embedding → [ (LayerNorm → MHA → Residual) + (LayerNorm → FFN → Residual) ] × N → LayerNorm → LM Head
```
Each layer in the stack can have its own:
- number of attention heads
- activation type
- dropout rate
```python
# From config_slm.py
n_heads_per_layer = [4, 4, 8, 8, 16, 16]
activations = ["gelu", "gelu", "silu", "silu", "gelu", "gelu"]
```

To train the model on a single GPU, simply run:

```shell
python train_slm.py
```

The training pipeline uses the AdamW optimizer with gradient clipping and a configurable warmup-plus-cosine-decay learning rate scheduler.
After training, your model is saved to ./trained_slm/ in a format compatible with the Hugging Face ecosystem. You can load it for inference using the standard transformers library:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("./trained_slm")
tokenizer = AutoTokenizer.from_pretrained("./trained_slm")

text = "Once upon a time"
inputs = tokenizer(text, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(output[0]))
```