Building powerful small language models (100-300M parameters) completely from scratch!
SmolHub is a collection of small but mighty language models implemented entirely from scratch using PyTorch. This repository focuses on creating efficient, lightweight LLMs that can run on modest hardware while still delivering impressive performance. All models are designed to be in the 100-300M parameter range, making them perfect for research, experimentation, and resource-constrained environments.
| Model | Parameters | Architecture | Key Features | Training Dataset |
|---|---|---|---|---|
| SmolMixtral | ~124M (8x12M) | Mixture of Experts | Sparse activation, Flash Attention, SwiGLU | TinyStories (1M texts, 14K steps) |
| SmolTransformer | ~150M | Standard Transformer | Classic attention mechanism, RMSNorm | FineWeb |
| StoryLlama | ~88M | Llama-inspired | RoPE, SwiGLU, MQA/GQA, RMSNorm | TinyStories (4B tokens, 5K steps) |
| StoryMixtral | ~200M | Mixtral variant | Story-focused MoE training | Custom story dataset |
| StoryKimi | ~180M | Custom architecture | Optimized for narrative generation | Story corpus |
- Multi-Head Attention (MHA): Classic transformer attention
- Multi-Query Attention (MQA): Shared key-value heads for efficiency
- Grouped-Query Attention (GQA): Balanced efficiency and performance
- Flash Attention: Memory-efficient attention computation
- Rotary Position Embeddings (RoPE): Advanced position encoding
- Learned Position Embeddings: Traditional approach
- SwiGLU: Gated Linear Unit with Swish activation
- Swish: Smooth activation function
- RMSNorm: Root Mean Square Layer Normalization
- LayerNorm: Standard layer normalization
- Mixture of Experts (MoE): 8 experts with top-2 routing in SmolMixtral
- Noisy Top-K Routing: Enhanced expert selection
- Weight Tying: Shared embedding and output projection weights
- Gradient Checkpointing: Memory optimization during training
pip install torch torchvision transformers datasets wandb tqdm# Navigate to any model directory
cd SmolMixtral
# Install dependencies
bash install.sh
# Start training
python trainer.pyfrom inference import generate_text
# Load your trained model
model = load_model("path/to/checkpoint")
# Generate text
output = generate_text(
model=model,
prompt="Once upon a time",
max_length=100,
temperature=0.7
)
print(output)- SmolMixtral: Successfully trained 124M parameter MoE model
All models are optimized for:
- Fast inference on consumer GPUs (RTX 3090, RTX 4090)
- Low memory footprint (fits in 8GB VRAM)
- Quality text generation with coherent outputs
- Efficient training with gradient checkpointing and mixed precision
- Generated high-quality stories and narratives
- Competitive perplexity scores for model size
- Fast inference speeds (>100 tokens/second on RTX 4090)
- Distributed Training: Multi-GPU support with PyTorch DDP
- Memory Optimization:
- Gradient checkpointing for large models
- Mixed precision training (FP16/BF16)
- Flash Attention for memory efficiency
- Advanced Scheduling:
- Cosine annealing with warmup
- Learning rate scheduling with restarts
- Monitoring:
- Wandb integration for experiment tracking
- Real-time loss visualization
- Gradient norm monitoring
- Checkpointing:
- Automatic model saving and resuming
- State-dict preservation for optimizers and schedulers
- AdamW Optimizer: With configurable weight decay
- Gradient Clipping: Prevents exploding gradients
- Dynamic Loss Scaling: For mixed precision training
- Expert Load Balancing: For MoE models (auxiliary loss)
- Tokenizer Integration: Custom BPE and GPT-2 tokenizers
SmolHub/
โโโ SmolMixtral/ # Mixture of Experts implementation
โโโ SmolTransformer/ # Standard transformer architecture
โโโ StoryLlama/ # Llama-inspired model for stories
โโโ StoryMixtral/ # Mixtral variant for narratives
โโโ StoryKimi/ # Custom story-focused architecture
โโโ smolhub_hub/ # Package management utilities
โโโ README.md # This file
Each model directory contains:
model.py- Core architecture implementationtrainer.py- Training loop and optimizationinference.py- Text generation utilitiesconfig.py- Model configurationtokenizer.py- Tokenization utilitiesgradio/- Interactive web demos
Each model comes with a Gradio-powered web interface for easy experimentation:
cd SmolMixtral/gradio
python app.py- TinyStories: High-quality short stories for narrative models
- SmolMixtral: 1M texts, 14K training steps
- StoryLlama: 4B tokens, 5K training steps
- FineWeb: Curated high-quality web text (10BT sample)
- OpenWebText: Diverse internet content
- Custom datasets: Domain-specific content for specialized models
- From Scratch Training: All models trained without pre-training transfer
- Progressive Training: Gradual increase in sequence length and complexity
- Data Quality Focus: Careful dataset curation and filtering
- Evaluation Strategy: Regular validation on held-out sets
- Text Generation: Real-time quality assessment during training
This repository serves as a research platform for:
- Architecture exploration: Testing new attention mechanisms
- Scaling laws: Understanding parameter efficiency
- Training techniques: Experimenting with optimization strategies
- Evaluation: Comprehensive benchmarking of small models
Contributions are welcome! Whether you want to:
- Add new model architectures
- Improve training efficiency
- Add evaluation scripts
- Fix bugs or improve documentation
Please feel free to open issues and pull requests.
This project is licensed under the MIT License - see the LICENSE file for details.
- Inspired by the incredible work on Transformer architectures
- Built with PyTorch and the amazing open-source ML community
- Special thanks to papers on MoE, RoPE, and efficient attention mechanisms
- GitHub: @YuvrajSingh-mist
- Issues: Feel free to open GitHub issues for questions or bugs
"Small models, big possibilities!" ๐
- Add more architecture variants
- Implement quantization techniques
- Add comprehensive benchmarks
- Create model comparison tools
- Add fine-tuning scripts
- Implement RLHF training