
🧠 Small Language Model (SLM)

A modular, research-friendly Transformer architecture for building and experimenting with small-scale language models — designed for learning, exploration, and open research.

💡 This project lets you define your own transformer architectures (variable layers, heads, activations, dimensions) and train them on your own datasets; a reference setup using the TinyStories dataset is included.


🚀 Features

Fully modular Transformer architecture

  • Dynamic number of attention heads per layer
  • Configurable activation functions (ReLU, GELU, SiLU)
  • Pre-LayerNorm + Residual design for stability

Flexible dataset pipeline

  • Uses Hugging Face Datasets
  • Tokenizes and chunks datasets into fixed-length training blocks
  • Saves & loads efficiently with Arrow format
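The tokenize-and-chunk step above boils down to concatenating token ids and slicing them into fixed-length blocks. A minimal sketch of that logic in plain Python (the repo applies this pattern through the Hugging Face Datasets `map(batched=True)` API; the function name here is illustrative):

```python
def chunk_token_ids(token_ids, block_size):
    """Slice a flat list of token ids into fixed-length training blocks.

    The trailing remainder shorter than block_size is dropped, so every
    block the model sees has exactly block_size tokens.
    """
    n_blocks = len(token_ids) // block_size
    return [token_ids[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]
```

Dropping the remainder keeps every training example the same length, which avoids padding and attention masks in the training loop.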

Research-grade training pipeline

  • Config-driven hyperparameters
  • AdamW optimizer + gradient clipping
  • Warmup + cosine LR scheduler
  • Optional mixed precision (AMP)
  • Hugging Face model saving/loading compatibility

Multi-GPU (Distributed) training support

  • Out-of-the-box DistributedDataParallel (DDP)
  • Sharded sampling for each process
  • Rank-aware logging & checkpointing
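The pieces above (DDP wrapping, per-process sharding, rank-aware logging) fit together roughly as follows, assuming the script is launched with `torchrun`, which exports `RANK`, `WORLD_SIZE`, and `LOCAL_RANK` for every worker. This is an illustrative sketch, not the exact structure of `train_slm.py`:

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

def is_main_process():
    # torchrun sets RANK for every worker; default to 0 when run standalone
    return int(os.environ.get("RANK", "0")) == 0

def setup_ddp(model, dataset, batch_size):
    dist.init_process_group(backend="nccl" if torch.cuda.is_available() else "gloo")
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    if torch.cuda.is_available():
        torch.cuda.set_device(local_rank)
        model = model.cuda(local_rank)
    model = DDP(model)
    # DistributedSampler gives each rank a disjoint shard of the dataset
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=batch_size, sampler=sampler)
    return model, loader, sampler
```

Two conventions matter in practice: call `sampler.set_epoch(epoch)` at the start of each epoch so shuffling differs across epochs, and guard logging and checkpoint writes with `is_main_process()` so only rank 0 touches disk.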

Open-source ready

  • Easy to modify, extend, and push to the Hugging Face Hub
  • Educational and clean codebase

🧩 Project Structure

SLM_Skeleton/
├── config_slm.py                           # Centralized configuration (model, data, training)
├── data_module.py                          # Dataset loading and tokenization
├── embedding_module.py                     # Token + positional embeddings
├── multihead_self_attention.py             # Scaled dot-product attention (supports variable heads)
├── transformer_block.py                    # Transformer block (Pre-LN + MHA + FFN)
├── train_slm.py                            # Full training script with scheduler
├── prepare_and_save_tokenized_dataset.py   # Preprocess TinyStories
└── trained_slm/                            # Output directory for saved model

⚙️ Installation

  1. Clone the repository:

    git clone https://github.com/aditya20t/SLM_Skeleton.git
    cd SLM_Skeleton
  2. Create and activate a virtual environment:

    python -m venv venv
    source venv/bin/activate
  3. Install dependencies:

    pip install -r requirements.txt

📦 Dataset Preparation

By default, SLM uses the TinyStories dataset. Run the preprocessing script to tokenize, chunk, and save it:

python prepare_and_save_tokenized_dataset.py

🧠 Model Architecture Overview

SLM follows a GPT-style decoder-only transformer with a modular configuration based on the Pre-LayerNorm design for better training stability.

Architecture Flow:

Embedding → [ (LayerNorm → MHA → Residual) + (LayerNorm → FFN → Residual) ] × N → LayerNorm → LM Head

Each layer in the stack can have its own:

  • number of attention heads
  • activation type
  • dropout rate
Example dynamic configuration:

# From config_slm.py
n_heads_per_layer = [4, 4, 8, 8, 16, 16]                        # attention heads per layer
activations = ["gelu", "gelu", "silu", "silu", "gelu", "gelu"]  # activation per layer
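The Pre-LN residual flow described above (LayerNorm → MHA → Residual, then LayerNorm → FFN → Residual) can be sketched in PyTorch. This is an illustrative class, not the repo's actual `transformer_block.py` module; names and defaults are assumptions:

```python
import torch
import torch.nn as nn

class PreLNBlock(nn.Module):
    """One Pre-LayerNorm transformer block: x + MHA(LN(x)), then x + FFN(LN(x))."""

    def __init__(self, d_model, n_heads, activation=nn.GELU, dropout=0.1):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads,
                                          dropout=dropout, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),  # conventional 4x FFN expansion
            activation(),
            nn.Linear(4 * d_model, d_model),
            nn.Dropout(dropout),
        )

    def forward(self, x, attn_mask=None):
        # Normalize *before* each sublayer; residuals carry the raw stream
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)
        x = x + attn_out
        x = x + self.ffn(self.ln2(x))
        return x
```

Because each block takes `n_heads` and `activation` as constructor arguments, a stack can be built directly from per-layer lists like the config above, e.g. `[PreLNBlock(d_model, h, act) for h, act in zip(n_heads_per_layer, activation_modules)]`.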

🏋️‍♂️ Training

To train the model on a single GPU, simply run:

python train_slm.py

The training pipeline uses the AdamW optimizer with gradient clipping, a configurable warmup + cosine-decay learning rate scheduler, and optional mixed precision (AMP).
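A single optimizer step combining AMP, gradient clipping, and AdamW might look like the sketch below. This is a hypothetical helper under assumed names; `train_slm.py` may structure its loop differently:

```python
import torch
import torch.nn as nn

def train_step(model, batch, targets, optimizer, scaler, device="cpu", max_norm=1.0):
    """One AdamW step with optional AMP and gradient-norm clipping."""
    use_amp = device == "cuda"
    optimizer.zero_grad(set_to_none=True)
    # Autocast runs the forward pass in reduced precision when enabled
    with torch.autocast(device_type=device, enabled=use_amp):
        logits = model(batch)
        loss = nn.functional.cross_entropy(logits, targets)
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)  # so clipping sees true (unscaled) gradients
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    scaler.step(optimizer)
    scaler.update()
    return loss.item()
```

The scaler is created once, e.g. `scaler = torch.cuda.amp.GradScaler(enabled=use_amp)`; with `enabled=False` every scaler call is a no-op, so the same loop runs unchanged on CPU.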

💾 Saving and Loading (Hugging Face format)

After training, your model is saved to ./trained_slm/ in a format compatible with the Hugging Face ecosystem. You can load it for inference using the standard transformers library:

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("./trained_slm")
tokenizer = AutoTokenizer.from_pretrained("./trained_slm")

text = "Once upon a time"
inputs = tokenizer(text, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=50)

print(tokenizer.decode(output[0]))
