Want to train your own model? Check out our comprehensive step-by-step guide: TRAIN.md
The inspiration for this project came after attending the ZurichNLP #17 seminar, where Matteo Saponati mentioned his work on Spiking Neural Networks (SNNs). His talk sparked my curiosity about the potential of neuromorphic computing for language models, leading me to explore how SNNs could be applied to natural language processing.
This curiosity evolved into a concrete question: Can we create highly efficient, specialized language models that run natively on edge devices without sacrificing deep reasoning capabilities? The goal became clear: bridge the gap between heavy cloud-based LLMs and the restricted computational power of mobile hardware, while leveraging the energy efficiency of biologically-inspired spiking architectures.
Initially, the project aimed for a massive scope. I attempted to train a multilingual foundation model covering English, Turkish, Russian, French, German, and Chinese.
- The Complexity of Chinese: During the early iterations, I discovered that the token density and structural complexity of Chinese required a significantly larger vocabulary and deeper embedding space than my target architecture allowed.
- The Budget Constraint: Training a high-quality multilingual model requires immense computational resources. As an independent developer, I had to make strategic decisions to maximize the "intelligence-per-dollar" ratio. To ensure the model reached a high level of reasoning (GSM8K standards) within a realistic budget, I narrowed the focus exclusively to English.
Beyond standard Transformers, I integrated SNN architectures to explore biological efficiency. SNNs process information through discrete pulses (spikes), potentially offering a path to "always-on" AI with a fraction of the power consumption of traditional dense models.
- Hybrid Engine: Toggle between Regular Transformer (for stability) and SNN (for efficiency).
- Strategic Curriculum: A three-phase training approach (256 → 1024 → 2048) to stabilize long-context learning.
- Edge-First Design: Native CoreML export support for high-speed inference on Apple Neural Engine (ANE).
- Advanced Optimization: Utilizing the Muon Optimizer for 2D parameters to achieve faster convergence during pre-training.
- Attention: Grouped Query Attention (GQA) with Sliding Window mechanisms.
- Normalization: Pre-RMSNorm for improved gradient stability.
- Memory Efficiency: Gradient Checkpointing integrated into every block to allow training 1B+ models on consumer GPUs.
- Neuronal Model: Leaky Integrate-and-Fire (LIF) neurons from `snntorch`.
- Temporal Processing: Multi-step spike integration to capture sequential dependencies without the quadratic cost of full attention.
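In discrete time, an LIF neuron leaks, integrates its input current, and fires whenever its membrane potential crosses a threshold. Below is a minimal pure-Python sketch of these dynamics; it is illustrative only (the project itself uses `snntorch`'s neuron implementations, and the `beta` and `threshold` values here are arbitrary):

```python
def lif_step(current, mem, beta=0.9, threshold=1.0):
    """One discrete-time step of a Leaky Integrate-and-Fire neuron.

    The membrane potential decays by `beta`, integrates the input
    current, and emits a spike (1.0) when it crosses `threshold`,
    after which it is reset by subtraction.
    """
    mem = beta * mem + current
    spike = 1.0 if mem >= threshold else 0.0
    mem -= spike * threshold
    return spike, mem

# Drive one neuron with a constant sub-threshold current: it fires
# periodically (here at steps 2, 5, and 8).
mem, spikes = 0.0, []
for _ in range(10):
    spk, mem = lif_step(0.4, mem)
    spikes.append(spk)
```

Because information is carried by these sparse binary events rather than dense activations, most units are silent at any given step, which is where the energy savings on neuromorphic hardware come from.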
The training followed a two-stage approach combining pre-training and knowledge distillation:
The model was trained through a strategic curriculum across three phases:
| Phase | Objective | Context | Dataset | Status |
|---|---|---|---|---|
| Phase 1 | Language Foundations | 256 | phase1-256 | Completed (100k steps) |
| Phase 2 | Reasoning & Logic | 1,024 | phase2-1024 | Completed |
| Phase 3 | Long Context | 2,048 | phase3-2048 | Completed |
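The phase schedule above can be expressed as plain data. In this sketch the dataset names match the published Hugging Face datasets, but `batch_size` is a hypothetical value chosen only to illustrate how the per-step token budget grows across phases:

```python
# Three-phase curriculum from the table above. Dataset names are the
# published ones; batch_size is a hypothetical example value.
PHASES = [
    {"dataset": "phase1-256",  "context": 256,  "objective": "Language Foundations"},
    {"dataset": "phase2-1024", "context": 1024, "objective": "Reasoning & Logic"},
    {"dataset": "phase3-2048", "context": 2048, "objective": "Long Context"},
]

def tokens_per_step(context_len, batch_size=32):
    """Tokens consumed by one optimizer step at a given context length."""
    return context_len * batch_size

schedule = [(p["dataset"], tokens_per_step(p["context"])) for p in PHASES]
```

Starting with short contexts keeps early steps cheap and stable; the longer phases then only need to teach the model to extend patterns it has already learned.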
Result: The Flow model reached a validation loss of ~4.1 (perplexity ≈ 60) after the initial pre-training phase, demonstrating solid language modeling capabilities.
To enhance mathematical reasoning without extensive compute requirements, I employed knowledge distillation:
- Teacher Model: Used Qwen3-4B-Instruct-2507 (4B parameters) as the teacher
- Data Generation: Generated solutions to GSM8K math problems using the teacher model
- Distillation Training: Fine-tuned the Flow-1B and Pulse-1B models on the teacher's reasoning traces
This approach allowed the 1B parameter student models to "learn" from a larger, more capable teacher without requiring the computational budget to train on raw data alone.
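The project distills by fine-tuning on teacher-generated text (sequence-level distillation). For reference, the classic logit-level formulation of knowledge distillation is a temperature-softened KL divergence between teacher and student distributions; here is a NumPy sketch of that standard objective (the temperature value is illustrative):

```python
import numpy as np

def softmax(logits, T=1.0):
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)   # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) over temperature-softened distributions.

    The T**2 factor keeps gradient magnitudes comparable across
    temperatures (Hinton et al., 2015).
    """
    p_t = softmax(teacher_logits, T)
    log_p_s = np.log(softmax(student_logits, T))
    return float((p_t * (np.log(p_t) - log_p_s)).sum(axis=-1).mean() * T**2)

s = np.array([[2.0, 0.5, -1.0]])   # toy student logits
t = np.array([[2.0, 0.5, -1.0]])   # toy teacher logits
# Identical logits give zero loss; any divergence gives a positive loss.
```

Training on generated traces instead of logits trades some signal for simplicity: only the teacher's sampled outputs are needed, not access to its full output distribution.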
Below are the training curves from Weights & Biases for the Flow-1B model during pre-training:
Key Observations:
- Training Loss: Steady convergence from ~7.0 to ~4.3 over 20k steps
- Validation Loss: Final validation loss of ~4.1 (perplexity = e^4.1 ≈ 60)
- Learning Rate: Cosine annealing schedule with warm restarts to prevent local minima
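A warm-restart cosine schedule can be sketched as follows; the cycle length and learning-rate bounds below are illustrative, not the project's actual hyperparameters:

```python
import math

def cosine_with_restarts(step, cycle_len=5000, lr_max=3e-4, lr_min=3e-5):
    """Cosine-annealed learning rate that jumps back to lr_max every
    `cycle_len` steps (SGDR-style warm restarts, Loshchilov & Hutter).
    """
    t = (step % cycle_len) / cycle_len   # position within the current cycle
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t))

lr_start = cosine_with_restarts(0)      # peak LR at the start of a cycle
lr_mid = cosine_with_restarts(2500)     # decayed LR halfway through
lr_restart = cosine_with_restarts(5000) # back to peak after a restart
```

The periodic jump back to the peak rate is what lets the optimizer escape shallow local minima that a monotonically decaying schedule would settle into.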
The Pulse-1B model uses Spiking Neural Network architecture with Leaky Integrate-and-Fire (LIF) neurons for energy-efficient inference:
Key Observations:
- SNN Architecture: Uses spiking neurons instead of continuous activations
- Energy Efficiency: Designed for ultra-low power inference on neuromorphic hardware
- Training Challenge: Surrogate gradient methods are used to bypass the non-differentiable spike function during backpropagation
- Performance: Competitive accuracy with significantly reduced power consumption
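The surrogate-gradient trick keeps the hard threshold in the forward pass but substitutes a smooth, bell-shaped derivative in the backward pass. A sketch using the fast-sigmoid surrogate, one of the surrogates provided by `snntorch` (the `slope` value here is illustrative):

```python
def spike_forward(v, threshold=1.0):
    """Non-differentiable forward pass: a spike fires iff v >= threshold."""
    return 1.0 if v >= threshold else 0.0

def spike_surrogate_grad(v, threshold=1.0, slope=25.0):
    """Fast-sigmoid surrogate used in place of the Heaviside derivative
    during backprop; `slope` controls how sharply the surrogate peaks
    at the threshold.
    """
    x = slope * (v - threshold)
    return slope / (1.0 + abs(x)) ** 2
```

The surrogate is largest exactly at the threshold and falls off symmetrically on either side, so neurons near firing receive the strongest learning signal while saturated ones are mostly ignored.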
This project was built with extreme resource constraints in mind:
- Total Budget: Approximately $50 USD for all training phases
- GPU Provider: vast.ai - rented affordable GPU instances (primarily RTX 3090/4090 on the spot market)
- Training Duration: Due to the limited budget, training was kept minimal with carefully selected checkpoints rather than exhaustive long runs
- Strategic Choices: Every decision, from curriculum design to architecture choices, was optimized for maximum "intelligence per dollar"
This constraint forced innovation: using gradient checkpointing, mixed precision, and the Muon optimizer allowed me to achieve respectable reasoning capabilities on a shoestring budget that would typically be unrealistic for LLM training.
All trained models and datasets are publicly available on Hugging Face:
- Flow-1B-gsm8k - Regular Transformer model fine-tuned on GSM8K reasoning tasks
- Pulse-1B-gsm8k - Spiking Neural Network (SNN) variant trained on GSM8K
- flow-pulse-tokenizer - BPE tokenizer with 32,768 vocab size, optimized for English
- phase1-256 - Foundational language training (256 context length)
- phase2-1024 - Reasoning and logic training (1,024 context length)
- phase3-2048 - Long-context polish training (2,048 context length)
This project was developed by Cihan YalΓ§Δ±n (Chan-Y).
Connect with me:
- LinkedIn: linkedin.com/in/chanyalcin
- GitHub: github.com/g-hano
- Hugging Face: huggingface.co/Chan-Y
```
.
├── data/              # Cleaning, Tokenization, and HF Upload scripts
├── models/            # Transformer & SNN implementations
├── tokenizer/         # Custom BPE Tokenizer training
├── distill/           # Knowledge distillation scripts
├── config.yaml        # Centralized training & model configuration
├── multigpu_train.py  # Distributed training entry point
├── inference.py       # Inference script for pretrained models
└── convert2coreml.py  # CoreML / Edge deployment utility
```
The easiest way to get started is to use our pretrained models directly:
```bash
# Interactive chat with Flow-1B model
python inference.py --model_type flow

# Interactive chat with Pulse-1B (SNN) model
python inference.py --model_type pulse

# Single question
python inference.py --model_type flow --prompt "What is 15 * 23?"

# Batch generation from file
python inference.py --model_type flow --prompts_file questions.txt --output results.jsonl
```

Available Models:

- `flow`: Flow-1B-gsm8k - Regular Transformer (1B params)
- `pulse`: Pulse-1B-gsm8k - Spiking Neural Network (1B params)
Both models use the flow-pulse-tokenizer which will be automatically downloaded.
The inference script will automatically download model weights on first run and cache them locally.
1. Configure: Update `config.yaml` with your HuggingFace tokens and local paths.
2. Setup: Run `accelerate config` to map your local GPUs.
3. Train: Start the curriculum with `accelerate launch multigpu_train.py`.
For a complete step-by-step training guide, see TRAIN.md
This comprehensive guide covers:
- Environment setup and dependencies
- Data pipeline (download, prepare, tokenize)
- Custom tokenizer training
- Multi-phase curriculum learning
- Knowledge distillation
- Model deployment and sharing
This framework is built upon the synthesis of several state-of-the-art research papers and architectures. The implementation of specific modules (Attention, RoPE, SNN) is informed by the following literature:
- Llama Series (Meta AI): Implementation of Rotary Positional Embeddings (RoPE) and the use of RMSNorm for pre-normalization stability.
- Gemma 2 & 3 (Google DeepMind): Inspiration for the Sliding Window Attention mechanism and the integration of local/global attention layers.
- Qwen 2 & 3 (Alibaba Cloud): Insights into high-token-density training and architectural scaling for small-parameter models (SLMs).
- snntorch (Eshraghian et al.): The fundamental library used for spiking neuron dynamics and surrogate gradient learning.
- SpikeGPT: Generative Pre-trained Language Model with Spiking Neural Networks
- SpikeLLM: Scaling up Spiking Neural Network to Large Language Models via Saliency-based Spiking
- Attention Spiking Neural Networks: Advanced attention mechanisms for spiking architectures
- Muon Optimizer (Keller Jordan et al.): Implementation of the Muon (MomentUm Orthogonalized by Newton-Schulz) optimizer for faster convergence on 2D weight matrices during the pre-training phase.
- Chinchilla Scaling Laws (Hoffmann et al.): Used to determine the optimal balance between parameter count (1B) and dataset size (30GB) for maximum efficiency.
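As a back-of-the-envelope check, the Chinchilla result is often summarized as a rule of thumb of roughly 20 training tokens per parameter:

```python
def chinchilla_optimal_tokens(n_params, tokens_per_param=20):
    """Compute-optimal token budget per the ~20 tokens/parameter rule
    of thumb from Hoffmann et al. (2022).
    """
    return n_params * tokens_per_param

# For a 1B-parameter model, the compute-optimal budget is ~20B tokens.
budget = chinchilla_optimal_tokens(1_000_000_000)
```

Under a fixed compute budget, this trade-off is why a smaller model trained on more tokens can outperform a larger, under-trained one.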
- Weight Tying: To reduce memory footprint and improve convergence, the model uses tied weights between the `token_embedding` and the `lm_head`.
- Mixed Precision (BF16): All training is conducted in `bfloat16` to leverage the hardware acceleration of RTX 40-series Tensor Cores while maintaining numerical stability.
- Gradient Checkpointing: Strategically applied to Transformer blocks to enable training with sequence lengths up to 2,048 on 16GB/32GB VRAM hardware.
- Hybrid Optimizer Strategy (Muon + AdamW): The training employs a dual-optimizer approach that significantly accelerates convergence:
  - Muon Optimizer is applied to 2D parameters (weight matrices): token embeddings, attention projection weights (`W_q`, `W_k`, `W_v`, `W_o`), and feed-forward network weights. Muon orthogonalizes the momentum update via Newton-Schulz iterations, enabling markedly faster convergence on these high-dimensional matrices than standard AdamW.
  - AdamW Optimizer is used for 1D parameters: biases, layer normalization parameters, and other low-dimensional tensors where first-order methods remain efficient.
  - This hybrid approach balances computational efficiency with convergence speed, which is particularly crucial given the limited training budget (~$50). Muon's superior performance on embeddings and attention matrices directly translates to better language modeling capabilities with fewer training steps.
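The core idea behind Muon is to orthogonalize each 2D update matrix without an explicit SVD. A NumPy sketch of that underlying mechanism, using the simple cubic Newton-Schulz iteration (Muon in practice uses a tuned quintic variant plus momentum handling, so this is a conceptual illustration only):

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=30):
    """Approximate the nearest (semi-)orthogonal matrix to G via the
    cubic Newton-Schulz iteration: X <- 1.5*X - 0.5*X @ X.T @ X.

    Scaling by the Frobenius norm first puts all singular values in
    (0, 1], which guarantees convergence of the iteration toward
    singular values of 1 while preserving G's singular vectors.
    """
    X = G / (np.linalg.norm(G) + 1e-7)
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

G = np.array([[2.0, 1.0],
              [0.0, 1.0]])
R = newton_schulz_orthogonalize(G)
# R is numerically orthogonal (R @ R.T ~ I): the update's direction is
# kept while its spectrum is flattened, which is Muon's key property.
```

Flattening the spectrum means every direction in the weight matrix receives a comparably sized step, rather than the few dominant singular directions absorbing most of the update.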