🚀 SmolHub: Small Language Models from Scratch

Building powerful small language models (100-300M parameters) completely from scratch!

🎯 Overview

SmolHub is a collection of small but mighty language models implemented entirely from scratch using PyTorch. This repository focuses on creating efficient, lightweight LLMs that can run on modest hardware while still delivering impressive performance. All models are designed to be in the 100-300M parameter range, making them perfect for research, experimentation, and resource-constrained environments.

🏗️ Architecture Implementations

📚 Available Models

Model	Parameters	Architecture	Key Features	Training Dataset
SmolMixtral	~124M (8x12M)	Mixture of Experts	Sparse activation, Flash Attention, SwiGLU	TinyStories (1M texts, 14K steps)
SmolTransformer	~150M	Standard Transformer	Classic attention mechanism, RMSNorm	FineWeb
StoryLlama	~88M	Llama-inspired	RoPE, SwiGLU, MQA/GQA, RMSNorm	TinyStories (4B tokens, 5K steps)
StoryMixtral	~200M	Mixtral variant	Story-focused MoE training	Custom story dataset
StoryKimi	~180M	Custom architecture	Optimized for narrative generation	Story corpus

🔧 Key Components Implemented

Attention Mechanisms

Multi-Head Attention (MHA): Classic transformer attention
Multi-Query Attention (MQA): Shared key-value heads for efficiency
Grouped-Query Attention (GQA): Balanced efficiency and performance
Flash Attention: Memory-efficient attention computation

Position Encodings

Rotary Position Embeddings (RoPE): Advanced position encoding
Learned Position Embeddings: Traditional approach

Activation Functions & Normalization

SwiGLU: Gated Linear Unit with Swish activation
Swish: Smooth activation function
RMSNorm: Root Mean Square Layer Normalization
LayerNorm: Standard layer normalization

Advanced Features

Mixture of Experts (MoE): 8 experts with top-2 routing in SmolMixtral
Noisy Top-K Routing: Enhanced expert selection
Weight Tying: Shared embedding and output projection weights
Gradient Checkpointing: Memory optimization during training

🚀 Quick Start

Prerequisites

pip install torch torchvision transformers datasets wandb tqdm

Training a Model

# Navigate to any model directory
cd SmolMixtral

# Install dependencies
bash install.sh

# Start training
python trainer.py

Running Inference

from inference import generate_text

# Load your trained model
model = load_model("path/to/checkpoint")

# Generate text
output = generate_text(
    model=model,
    prompt="Once upon a time",
    max_length=100,
    temperature=0.7
)
print(output)

📊 Model Performance & Results

Training Achievements

SmolMixtral: Successfully trained 124M parameter MoE model
- 📈 View Training Report on WandB
- 💾 Pre-trained weights on HuggingFace

Key Performance Features

All models are optimized for:

Fast inference on consumer GPUs (RTX 3090, RTX 4090)
Low memory footprint (fits in 8GB VRAM)
Quality text generation with coherent outputs
Efficient training with gradient checkpointing and mixed precision

Benchmark Results

Generated high-quality stories and narratives
Competitive perplexity scores for model size
Fast inference speeds (>100 tokens/second on RTX 4090)

🏋️ Training Features

Advanced Training Techniques

Distributed Training: Multi-GPU support with PyTorch DDP
Memory Optimization:
- Gradient checkpointing for large models
- Mixed precision training (FP16/BF16)
- Flash Attention for memory efficiency
Advanced Scheduling:
- Cosine annealing with warmup
- Learning rate scheduling with restarts
Monitoring:
- Wandb integration for experiment tracking
- Real-time loss visualization
- Gradient norm monitoring
Checkpointing:
- Automatic model saving and resuming
- State-dict preservation for optimizers and schedulers

Optimization Features

AdamW Optimizer: With configurable weight decay
Gradient Clipping: Prevents exploding gradients
Dynamic Loss Scaling: For mixed precision training
Expert Load Balancing: For MoE models (auxiliary loss)
Tokenizer Integration: Custom BPE and GPT-2 tokenizers

🗂️ Repository Structure

SmolHub/
├── SmolMixtral/          # Mixture of Experts implementation
├── SmolTransformer/      # Standard transformer architecture  
├── StoryLlama/          # Llama-inspired model for stories
├── StoryMixtral/        # Mixtral variant for narratives
├── StoryKimi/           # Custom story-focused architecture
├── smolhub_hub/         # Package management utilities
└── README.md            # This file

Each model directory contains:

model.py - Core architecture implementation
trainer.py - Training loop and optimization
inference.py - Text generation utilities
config.py - Model configuration
tokenizer.py - Tokenization utilities
gradio/ - Interactive web demos

🎮 Interactive Demos

Each model comes with a Gradio-powered web interface for easy experimentation:

cd SmolMixtral/gradio
python app.py

📈 Training Data & Methodology

Datasets Used

TinyStories: High-quality short stories for narrative models
- SmolMixtral: 1M texts, 14K training steps
- StoryLlama: 4B tokens, 5K training steps
FineWeb: Curated high-quality web text (10BT sample)
OpenWebText: Diverse internet content
Custom datasets: Domain-specific content for specialized models

Training Methodology

From Scratch Training: All models trained without pre-training transfer
Progressive Training: Gradual increase in sequence length and complexity
Data Quality Focus: Careful dataset curation and filtering
Evaluation Strategy: Regular validation on held-out sets
Text Generation: Real-time quality assessment during training

🔬 Research & Experiments

This repository serves as a research platform for:

Architecture exploration: Testing new attention mechanisms
Scaling laws: Understanding parameter efficiency
Training techniques: Experimenting with optimization strategies
Evaluation: Comprehensive benchmarking of small models

🤝 Contributing

Contributions are welcome! Whether you want to:

Add new model architectures
Improve training efficiency
Add evaluation scripts
Fix bugs or improve documentation

Please feel free to open issues and pull requests.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

Inspired by the incredible work on Transformer architectures
Built with PyTorch and the amazing open-source ML community
Special thanks to papers on MoE, RoPE, and efficient attention mechanisms

📞 Contact

GitHub: @YuvrajSingh-mist
Issues: Feel free to open GitHub issues for questions or bugs

"Small models, big possibilities!" 🌟

Name		Name	Last commit message	Last commit date
Latest commit History 20 Commits
SmolLlama		SmolLlama
SmolMixtral		SmolMixtral
SmolTransformer		SmolTransformer
StoryKimi		StoryKimi
StoryLlama		StoryLlama
StoryMixtral		StoryMixtral
hf-spaces		hf-spaces
README.md		README.md

YuvrajSingh-mist/SmolHub

Folders and files

Latest commit

History

Repository files navigation

🚀 SmolHub: Small Language Models from Scratch

🎯 Overview

🏗️ Architecture Implementations

📚 Available Models

🔧 Key Components Implemented

Attention Mechanisms

Position Encodings

Activation Functions & Normalization

Advanced Features

🚀 Quick Start

Prerequisites

Training a Model

Running Inference

📊 Model Performance & Results

Training Achievements

Key Performance Features

Benchmark Results

🏋️ Training Features

Advanced Training Techniques

Optimization Features

🗂️ Repository Structure

🎮 Interactive Demos

📈 Training Data & Methodology

Datasets Used

Training Methodology

🔬 Research & Experiments

🤝 Contributing

📄 License

🙏 Acknowledgments

📞 Contact

🗺️ Roadmap

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages