SLMS is an open-source research project that integrates Regular Transformers with biologically-inspired Spiking Neural Networks to develop efficient 1B-parameter language models.


SLMS

🎓 Want to train your own model? Check out our comprehensive step-by-step guide: TRAIN.md

📖 The Story & Vision

Origins

The inspiration for this project came after attending the ZurichNLP #17 seminar, where Matteo Saponati mentioned his work on Spiking Neural Networks (SNNs). His talk sparked my curiosity about the potential of neuromorphic computing for language models, leading me to explore how SNNs could be applied to natural language processing.

This curiosity evolved into a concrete question: Can we create highly efficient, specialized language models that run natively on edge devices without sacrificing deep reasoning capabilities? The goal became clear: bridge the gap between heavy cloud-based LLMs and the restricted computational power of mobile hardware, while leveraging the energy efficiency of biologically-inspired spiking architectures.

The Multilingual Experiment

Initially, the project aimed for a massive scope. I attempted to train a multilingual foundation model covering English, Turkish, Russian, French, German, and Chinese.

  • The Complexity of Chinese: During the early iterations, I discovered that the token density and structural complexity of Chinese required a significantly larger vocabulary and deeper embedding space than my target architecture allowed.
  • The Budget Constraint: Training a high-quality multilingual model requires immense computational resources. As an independent developer, I had to make strategic decisions to maximize the "intelligence-per-dollar" ratio. To reach a useful level of mathematical reasoning (as measured on GSM8K) within a realistic budget, I narrowed the focus exclusively to English.

Why Spiking Neural Networks (SNN)?

Beyond standard Transformers, I integrated SNN architectures to explore biological efficiency. SNNs process information through discrete pulses (spikes), potentially offering a path to "always-on" AI with a fraction of the power consumption of traditional dense models.


🌟 Key Features

  • Hybrid Engine: Toggle between Regular Transformer (for stability) and SNN (for efficiency).
  • Strategic Curriculum: A three-phase training approach (256 → 1024 → 2048) to stabilize long-context learning.
  • Edge-First Design: Native CoreML export support for high-speed inference on Apple Neural Engine (ANE).
  • Advanced Optimization: Utilizing the Muon Optimizer for 2D parameters to achieve faster convergence during pre-training.

πŸ—οΈ Architecture Deep Dive

Regular Transformer

  • Attention: Grouped Query Attention (GQA) with Sliding Window mechanisms.

  • Normalization: Pre-RMSNorm for improved gradient stability.

  • Memory Efficiency: Gradient Checkpointing integrated into every block to allow training 1B+ models on consumer GPUs.
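As a concrete illustration of the Pre-RMSNorm bullet, here is a minimal NumPy sketch. The model itself presumably implements this as a PyTorch module with a learned gain parameter; the function name and shapes below are illustrative, not taken from the repo:

```python
import numpy as np

def rms_norm(x, gain, eps=1e-6):
    """Pre-RMSNorm: rescale each vector to unit RMS, then apply a
    learned gain. Unlike LayerNorm there is no mean subtraction and
    no bias, which is cheaper and tends to be stable in low precision."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return (x / rms) * gain

x = np.array([[1.0, 2.0, 2.0]])        # one token, d_model = 3
y = rms_norm(x, gain=np.ones(3))       # gain is initialized to ones
print(float(np.sqrt(np.mean(y * y))))  # ~1.0: unit RMS after the norm
```

Placing the norm before attention and MLP blocks ("pre-norm") keeps residual-stream gradients well scaled, which is why it pairs well with long training runs at low precision.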

Spiking Neural Network

  • Neuronal Model: Leaky Integrate-and-Fire (LIF) neurons from snntorch.
  • Temporal Processing: Multi-step spike integration to capture sequential dependencies without the quadratic cost of full attention.
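The LIF dynamics that snntorch provides can be sketched standalone in a few lines; `beta` and `threshold` below are illustrative defaults, not the repo's configured values:

```python
def lif_step(current, mem, beta=0.9, threshold=1.0):
    """One Leaky Integrate-and-Fire (LIF) step: the membrane potential
    decays by `beta`, integrates the input current, and emits a binary
    spike (with a soft reset) when it crosses `threshold`."""
    mem = beta * mem + current            # leaky integration
    spike = 1.0 if mem >= threshold else 0.0
    mem -= spike * threshold              # soft reset after firing
    return spike, mem

# Drive one neuron with a constant input current of 0.5:
mem, spikes = 0.0, []
for _ in range(5):
    spk, mem = lif_step(0.5, mem)
    spikes.append(spk)
print(spikes)  # [0.0, 0.0, 1.0, 0.0, 1.0]
```

The sparse, event-driven output is what makes SNN inference attractive on neuromorphic hardware: computation only happens when a spike fires.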

📈 Training Methodology & Progress

Training Pipeline

The training followed a two-stage approach combining pre-training and knowledge distillation:

Stage 1: Foundation Pre-training

The model was trained through a strategic curriculum across three phases:

| Phase   | Objective            | Context | Dataset     | Status                 |
|---------|----------------------|---------|-------------|------------------------|
| Phase 1 | Language Foundations | 256     | phase1-256  | Completed (100k steps) |
| Phase 2 | Reasoning & Logic    | 1,024   | phase2-1024 | Completed              |
| Phase 3 | Long Context         | 2,048   | phase3-2048 | Completed              |

Result: The Flow model reached a final validation loss of roughly 4.1 (perplexity ≈ 60) after the initial pre-training phase, a solid result given the ~$50 training budget.

Stage 2: Knowledge Distillation from Qwen3-4B

To enhance mathematical reasoning without extensive compute requirements, I employed knowledge distillation:

  1. Teacher Model: Used Qwen3-4B-Instruct-2507 (4B parameters) as the teacher
  2. Data Generation: Generated solutions to GSM8K math problems using the teacher model
  3. Distillation Training: Fine-tuned the Flow-1B and Pulse-1B models on the teacher's reasoning traces

This approach let the 1B-parameter student models learn from a larger, more capable teacher at a fraction of the compute that reaching the same reasoning ability from raw data alone would require.
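The distillation described above is sequence-level: the students are fine-tuned with an ordinary language-modeling loss on text the teacher generated. For comparison, the classic logit-matching formulation (Hinton et al.) softens both models' output distributions with a temperature T and minimizes their KL divergence. A NumPy sketch of that objective, not taken from this repo:

```python
import numpy as np

def softmax(z, T=1.0):
    z = np.asarray(z, dtype=float) / T
    z -= z.max(axis=-1, keepdims=True)        # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) between temperature-softened distributions,
    scaled by T^2 so gradient magnitudes stay comparable across T."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    return float(T * T * (p_t * (np.log(p_t) - np.log(p_s))).sum(axis=-1).mean())

teacher = [[3.0, 1.0, 0.5]]
print(kd_loss([[3.0, 1.0, 0.5]], teacher))      # 0.0: identical logits, no loss
print(kd_loss([[0.5, 1.0, 3.0]], teacher) > 0)  # True: mismatch is penalized
```

Sequence-level distillation trades the richer soft-label signal for simplicity: it needs only the teacher's generated text, not access to its logits, which suits a small-budget pipeline.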

Training Metrics (Flow Model)

Below are the training curves from Weights & Biases for the Flow-1B model during pre-training:

(W&B charts: training loss, validation loss, and learning rate schedule)

Key Observations:

  • Training Loss: Steady convergence from ~7.0 to ~4.3 over 20k steps
  • Validation Loss: Final validation loss of ~4.1 (perplexity = e^4.1 ≈ 60)
  • Learning Rate: Cosine annealing schedule with warm restarts to help escape poor local minima
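The cosine-with-warm-restarts schedule mentioned above can be sketched in a few lines. The base/min learning rates and cycle length here are illustrative assumptions, not the values from config.yaml, and any linear warmup is omitted:

```python
import math

def cosine_warm_restarts(step, base_lr=3e-4, min_lr=3e-5, cycle=10_000):
    """SGDR-style schedule: cosine-decay from base_lr to min_lr within
    each cycle, then restart at base_lr to kick the model out of
    sharp minima."""
    t = (step % cycle) / cycle   # position in the current cycle, in [0, 1)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * t))

print(cosine_warm_restarts(0))        # cycle start: 3e-4
print(cosine_warm_restarts(5_000))    # mid-cycle: (base_lr + min_lr) / 2
print(cosine_warm_restarts(10_000))   # restart: back to 3e-4
```

PyTorch ships this as `torch.optim.lr_scheduler.CosineAnnealingWarmRestarts`; the standalone function just makes the shape of the curve explicit.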

Training Metrics (Pulse Model - SNN)

The Pulse-1B model uses Spiking Neural Network architecture with Leaky Integrate-and-Fire (LIF) neurons for energy-efficient inference:

(W&B charts: Pulse training loss, validation loss, and learning rate schedule)

Key Observations:

  • SNN Architecture: Uses spiking neurons instead of continuous activations
  • Energy Efficiency: Designed for ultra-low power inference on neuromorphic hardware
  • Training Challenge: Surrogate gradient methods are used to backpropagate through the non-differentiable spike function
  • Performance: Targets competitive accuracy at significantly reduced power consumption
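The surrogate-gradient trick behind SNN training can be sketched as follows: the forward pass keeps the hard, non-differentiable threshold, while the backward pass substitutes the derivative of a fast sigmoid (snntorch exposes this as `surrogate.fast_sigmoid()`; the slope value below is an illustrative default, not the repo's setting):

```python
def spike_forward(mem, threshold=1.0):
    """Forward pass: a hard step function, zero gradient almost everywhere."""
    return 1.0 if mem >= threshold else 0.0

def spike_surrogate_grad(mem, threshold=1.0, slope=25.0):
    """Backward pass: derivative of a fast sigmoid, slope/(1+|x|)^2,
    used in place of the step function's true (useless) gradient."""
    x = slope * (mem - threshold)
    return slope / (1.0 + abs(x)) ** 2

# The surrogate gradient peaks at the firing threshold and decays away from it,
# so learning signal flows mainly to neurons near their firing point:
print(spike_surrogate_grad(1.0))   # 25.0 at the threshold
print(spike_surrogate_grad(0.5))   # small well below the threshold
```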

Budget & Infrastructure

This project was built with extreme resource constraints in mind:

  • Total Budget: Approximately $50 USD for all training phases
  • GPU Provider: vast.ai - rented affordable GPU instances (primarily RTX 3090/4090 on the spot market)
  • Training Duration: Due to the limited budget, training was kept minimal with carefully selected checkpoints rather than exhaustive long runs
  • Strategic Choices: Every decision, from curriculum design to architecture choices, was optimized for maximum "intelligence per dollar"

This constraint forced innovation: using gradient checkpointing, mixed precision, and the Muon optimizer allowed me to achieve respectable reasoning capabilities on a shoestring budget that would typically be unrealistic for LLM training.


🤗 Trained Models & Datasets

All trained models and datasets are publicly available on Hugging Face:

Pre-trained Models

  • Flow-1B-gsm8k - Regular Transformer model fine-tuned on GSM8K reasoning tasks
  • Pulse-1B-gsm8k - Spiking Neural Network (SNN) variant trained on GSM8K

Custom Tokenizer

  • flow-pulse-tokenizer - Custom BPE tokenizer shared by the Flow and Pulse models

Training Datasets

  • phase1-256 - Foundational language training (256 context length)
  • phase2-1024 - Reasoning and logic training (1,024 context length)
  • phase3-2048 - Long-context polish training (2,048 context length)

πŸ‘¨β€πŸ’» Author

This project was developed by Cihan YalΓ§Δ±n (Chan-Y).

Connect with me:


📂 Project Structure

.
├── data/               # Cleaning, Tokenization, and HF Upload scripts
├── models/             # Transformer & SNN implementations
├── tokenizer/          # Custom BPE Tokenizer training
├── distill/            # Knowledge distillation scripts
├── config.yaml         # Centralized training & model configuration
├── multigpu_train.py   # Distributed training entry point
├── inference.py        # Inference script for pretrained models
└── convert2coreml.py   # CoreML / Edge deployment utility


🚀 How to Use

Inference with Pretrained Models

The easiest way to get started is to use our pretrained models directly:

# Interactive chat with Flow-1B model
python inference.py --model_type flow

# Interactive chat with Pulse-1B (SNN) model
python inference.py --model_type pulse

# Single question
python inference.py --model_type flow --prompt "What is 15 * 23?"

# Batch generation from file
python inference.py --model_type flow --prompts_file questions.txt --output results.jsonl

Available Models: Flow-1B-gsm8k (--model_type flow) and Pulse-1B-gsm8k (--model_type pulse). Both use the flow-pulse-tokenizer, which is downloaded automatically.

The inference script will also download the model weights on first run and cache them locally.

Training Your Own Model

  1. Configure: Update config.yaml with your HuggingFace tokens and local paths.
  2. Setup: Run accelerate config to map your local GPUs.
  3. Train: Start the curriculum with accelerate launch multigpu_train.py.

Full Training Guide

📖 For a complete step-by-step training guide, see TRAIN.md

This comprehensive guide covers:

  • Environment setup and dependencies
  • Data pipeline (download, prepare, tokenize)
  • Custom tokenizer training
  • Multi-phase curriculum learning
  • Knowledge distillation
  • Model deployment and sharing

This framework is built upon the synthesis of several state-of-the-art research papers and architectures. The implementation of specific modules (Attention, RoPE, SNN) is informed by the following literature:

Core Transformer Architectures

Spiking Neural Networks (SNN)

  • snntorch (Eshraghian et al.): The fundamental library used for spiking neuron dynamics and surrogate gradient learning.
  • SpikeGPT: Generative Pre-trained Language Model with Spiking Neural Networks
  • SpikeLLM: Scaling up Spiking Neural Network to Large Language Models via Saliency-based Spiking
  • Attention Spiking Neural Networks: Advanced attention mechanisms for spiking architectures

Optimization & Scaling

  • Muon Optimizer (Keller Jordan et al.): Implementation of the Momentum Orthogonalized by Newton-Schulz (Muon) optimizer for faster convergence on 2D weight matrices during the pre-training phase.
  • Chinchilla Scaling Laws (Hoffmann et al.): Used to determine the optimal balance between parameter count (1B) and dataset size (30GB) for maximum efficiency.
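A back-of-envelope version of the Chinchilla check is easy to reproduce. The ~20 tokens-per-parameter ratio comes from Hoffmann et al.; the bytes-per-token figure below is an assumption that depends on the tokenizer:

```python
# Chinchilla-style sanity check for a 1B-parameter model.
params = 1e9
tokens_per_param = 20                       # Hoffmann et al.'s ~20:1 ratio
optimal_tokens = params * tokens_per_param  # 20B tokens

bytes_per_token = 4                         # rough estimate for English BPE
dataset_tokens = 30e9 / bytes_per_token     # 30 GB -> ~7.5B tokens

print(f"Chinchilla-optimal: {optimal_tokens / 1e9:.0f}B tokens")
print(f"30 GB corpus:      ~{dataset_tokens / 1e9:.1f}B tokens")
```

Under these assumptions a 30 GB corpus sits below the ~20B-token optimum for 1B parameters, which is exactly the kind of compute/data trade-off the budget section above describes.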

πŸ› οΈ Technical Implementation Notes

  • Weight Tying: To reduce memory footprint and improve convergence, the model uses tied weights between the token_embedding and the lm_head.
  • Mixed Precision (BF16): All training is conducted in bfloat16 to leverage the Tensor Core acceleration of RTX 30/40-series GPUs while maintaining numerical stability.
  • Gradient Checkpointing: Strategically applied to Transformer blocks to enable training with sequence lengths up to 2,048 on 16GB/32GB VRAM hardware.
  • Hybrid Optimizer Strategy (Muon + AdamW): The training employs a dual-optimizer approach that significantly accelerates convergence:
    • Muon Optimizer is applied to 2D parameters (weight matrices): token embeddings, attention projection weights (W_q, W_k, W_v, W_o), and feed-forward network weights. Muon orthogonalizes the momentum update via Newton-Schulz iterations, which converges noticeably faster on these high-dimensional matrices than standard AdamW.
    • AdamW Optimizer is used for 1D parameters: biases, layer-normalization parameters, and other low-dimensional tensors where first-order methods remain efficient.
    • This hybrid approach balances computational efficiency with convergence speed, which was especially important given the limited training budget (~$50). Muon's stronger performance on embeddings and attention matrices translates directly into better language modeling with fewer training steps.
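The 2D/1D split described above amounts to partitioning parameters by tensor rank. In the real code this would iterate over `model.named_parameters()` and check `p.ndim`; the parameter names and shapes below are hypothetical stand-ins:

```python
def split_param_groups(named_shapes):
    """Route rank-2+ tensors (weight matrices) to Muon and rank-1
    tensors (biases, norm gains) to AdamW, as described above."""
    muon_params, adamw_params = [], []
    for name, shape in named_shapes.items():
        (muon_params if len(shape) >= 2 else adamw_params).append(name)
    return muon_params, adamw_params

params = {
    "token_embedding.weight": (50_000, 2048),  # 2D -> Muon
    "attn.w_q.weight":        (2048, 2048),    # 2D -> Muon
    "norm.gain":              (2048,),         # 1D -> AdamW
    "lm_head.bias":           (50_000,),       # 1D -> AdamW
}
muon, adamw = split_param_groups(params)
print(muon)   # ['token_embedding.weight', 'attn.w_q.weight']
print(adamw)  # ['norm.gain', 'lm_head.bias']
```

In practice the two groups are handed to two optimizer instances, and both are stepped every iteration; only the per-parameter update rule differs.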
