Skip to content

Latest commit

 

History

History
673 lines (532 loc) · 18.8 KB

File metadata and controls

673 lines (532 loc) · 18.8 KB

Getting Started with superGPT

A complete guide to training, fine-tuning, and deploying your own GPT models using superGPT.


Table of Contents

  1. What is superGPT?
  2. Installation
  3. Architecture Overview
  4. Quick Start: Train Your First Model
  5. Understanding Model Presets
  6. Data Preparation
  7. Training Deep Dive
  8. Text Generation
  9. Fine-Tuning with LoRA
  10. Knowledge Distillation
  11. Multi-GPU Training (FSDP)
  12. Advanced Features
  13. Monitoring & Debugging
  14. Exporting & Serving
  15. Troubleshooting

1. What is superGPT?

superGPT is a production-grade LLM training framework that implements the latest architectural innovations from leading labs:

Feature Description From
Multi-Head Attention (MHA) Standard transformer attention GPT-2/3
Grouped Query Attention (GQA) Reduced KV heads for efficiency Llama 2
Multi-head Latent Attention (MLA) Low-rank KV compression DeepSeek-V3
Mixture of Experts (MoE) Sparse expert routing DeepSeek-V3
Multi-Token Prediction (MTP) Predict N future tokens simultaneously DeepSeek-V3
Rotary Position Embeddings (RoPE) Relative position encoding LLaMA
SwiGLU Activation Gated linear unit with Swish LLaMA
Sliding Window Attention Fixed window for efficiency Mistral
Native Sparse Attention (NSA) Learned sparse patterns DeepSeek (2025)
FP8 Training 8-bit floating point on H100 DeepSeek-V3
LoRA Fine-Tuning Low-rank adapter training Microsoft
Speculative Decoding Fast inference with draft model Google
FSDP Multi-GPU model parallelism PyTorch

Project Structure

supergpt/
├── core/
│   ├── config.py          # All model & training configurations
│   ├── model.py           # GPT model (MHA, GQA, MLA, MoE, etc.)
│   └── flash_mla.py       # FlashAttention for MLA
├── training/
│   ├── train.py           # Main training loop (single & multi-GPU)
│   ├── data_pipeline.py   # Streaming data pipeline + quality filtering
│   ├── finetune.py        # LoRA fine-tuning
│   ├── distill.py         # Knowledge distillation
│   ├── lora.py            # LoRA module implementation
│   ├── fp8_utils.py       # FP8 training utilities (H100)
│   ├── parallel.py        # FSDP parallelism helpers
│   ├── streaming.py       # Streaming dataset utilities
│   └── expert_parallel.py # MoE expert parallelism
├── inference/
│   ├── generate.py        # Text generation with sampling
│   ├── evaluate.py        # Model evaluation & benchmarks
│   ├── export.py          # Export to ONNX/TorchScript
│   └── serve.py           # HTTP API server
├── alignment/
│   ├── align.py           # DPO alignment
│   ├── rlhf.py            # RLHF with PPO
│   └── rlvr.py            # RLVR (rule-based verification)
└── tools/
    └── visualize.py       # Architecture visualization dashboard

2. Installation

Prerequisites

  • Python 3.10+
  • PyTorch 2.0+ (with CUDA for GPU training)
  • A GPU with at least 8GB VRAM (for training; CPU works for inference)

Setup

# Clone the repository
git clone https://github.com/viralcode/superGPT.git
cd superGPT

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Linux/Mac
# venv\Scripts\activate   # Windows

# Install dependencies
pip install torch torchvision torchaudio
pip install transformers datasets tiktoken numpy

# Optional: for advanced features
pip install wandb          # Training monitoring
pip install tensorboard    # Training visualization
pip install flash-attn     # Flash Attention (faster training)
pip install onnx           # Model export

Verify Installation

python -c "import supergpt; print('superGPT loaded successfully!')"
python -c "import torch; print(f'PyTorch {torch.__version__}, CUDA: {torch.cuda.is_available()}')"

3. Architecture Overview

The Transformer Block

Every superGPT model is built from stacked transformer blocks:

Input Tokens
     │
     ▼
┌─────────────────┐
│  Token Embedding │ ← Convert token IDs to vectors
│  + Position Enc  │ ← RoPE (Rotary Positional Embedding)
└────────┬────────┘
         │
         ▼  (repeat N times)
┌─────────────────────────────┐
│    Layer Norm               │
│         │                   │
│    Attention                │
│    ┌────┴─────┐             │
│    │MHA / GQA │ ← Standard  │
│    │  MLA     │ ← DeepSeek  │
│    │  NSA     │ ← Sparse    │
│    └────┬─────┘             │
│         │ + residual        │
│    Layer Norm               │
│         │                   │
│    Feed Forward             │
│    ┌────┴─────┐             │
│    │ SwiGLU   │ ← Dense     │
│    │ MoE      │ ← Sparse    │
│    └────┬─────┘             │
│         │ + residual        │
└─────────┴───────────────────┘
         │
         ▼
┌─────────────────┐
│   Layer Norm    │
│   LM Head       │ ← Project to vocab size
│   (+ MTP Heads) │ ← Multi-Token Prediction
└─────────────────┘

Attention Variants

Variant KV Heads Memory Speed Best For
MHA Same as Q High Baseline Small models (≤1B)
GQA 1/4 of Q ~60% ~1.3× Medium models (1-10B)
MLA Latent ~20% ~1.5× Large models (≥10B)
# MHA: 6 query heads, 6 KV heads (full attention)
GPTConfig(n_head=6, n_kv_head=6)

# GQA: 12 query heads, 4 KV heads (grouped)
GPTConfig(n_head=12, n_kv_head=4)

# MLA: Low-rank latent attention (DeepSeek V3)
GPTConfig(use_mla=True, kv_lora_rank=512, q_lora_rank=1536)

4. Quick Start: Train Your First Model

The fastest way to get started — train on Shakespeare in under 5 minutes:

# Step 1: Prepare Shakespeare data
python -m supergpt.training.data_pipeline --dataset shakespeare --output-dir data/

# Step 2: Train (CPU is fine for this test)
python -m supergpt.training.train \
    --preset small \
    --data-dir data/ \
    --max-iters 1000 \
    --device cpu

# Step 3: Generate text
python -m supergpt.inference.generate \
    --checkpoint checkpoints/best.pt \
    --prompt "To be or not to be" \
    --max-tokens 200

With a GPU (much faster — 2 minutes):

python -m supergpt.training.train \
    --preset small \
    --data-dir data/ \
    --max-iters 2000 \
    --compile \
    --device cuda

5. Understanding Model Presets

superGPT comes with pre-configured model sizes:

Preset Layers Heads Embed Params Context GPU VRAM Training Time
small 6 6 384 ~10M 256 <2 GB Minutes
medium 12 12 768 ~124M 512 ~8 GB Hours
large 24 16 1024 ~350M 2048 ~24 GB Hours
xl 24 16 2048 ~774M 4096 ~40 GB Days
gpt4 32 32 4096 ~1.3B 8192 ~80 GB Days
deepseek 27 16 2048 ~680M 4096 ~40 GB Days
mistral 32 32 4096 ~1.3B 8192 ~80 GB Days
gemma2 28 16 2304 ~500M 8192 ~40 GB Days

Custom Configuration

# Override any preset parameter
python -m supergpt.training.train \
    --preset medium \
    --n-layer 8 \
    --n-head 8 \
    --block-size 1024 \
    --data-dir data/

DeepSeek V3 Architecture

# Train with the full DeepSeek V3 feature set:
# MLA + MoE (64 experts, 6 active) + MTP (3 tokens) + shared experts
python -m supergpt.training.train \
    --preset deepseek \
    --data-dir data/ \
    --max-iters 50000 \
    --compile

6. Data Preparation

Option A: Built-In Datasets

# Shakespeare (tiny, for testing)
python -m supergpt.training.data_pipeline --dataset shakespeare --output-dir data/

# FineWeb-Edu (high-quality web text, any size)
python -m supergpt.training.data_pipeline \
    --dataset HuggingFaceFW/fineweb-edu \
    --tokenizer Qwen/Qwen2.5-0.5B \
    --max-tokens 100000000 \
    --output-dir data/

# Wikipedia
python -m supergpt.training.data_pipeline \
    --dataset wikipedia \
    --subset 20220301.en \
    --tokenizer Qwen/Qwen2.5-0.5B \
    --max-tokens 500000000 \
    --output-dir data/

Option B: Any HuggingFace Dataset

# The data pipeline works with ANY HuggingFace dataset that has a "text" column
python -m supergpt.training.data_pipeline \
    --dataset allenai/dolma \
    --tokenizer Qwen/Qwen2.5-0.5B \
    --max-tokens 1000000000 \
    --output-dir data/ \
    --text-field text

# Custom text field name
python -m supergpt.training.data_pipeline \
    --dataset bigcode/starcoderdata \
    --tokenizer Qwen/Qwen2.5-0.5B \
    --text-field content \
    --output-dir data/

Option C: Your Own Text Files

# Put .txt files in a directory, then tokenize them:
mkdir -p my_data/
# Copy your text files into my_data/
python -m supergpt.training.data_pipeline \
    --input-dir my_data/ \
    --tokenizer Qwen/Qwen2.5-0.5B \
    --output-dir data/

Quality Filtering & Deduplication

The data pipeline includes built-in quality filtering:

# With quality filtering + dedup (default)
python -m supergpt.training.data_pipeline \
    --dataset HuggingFaceFW/fineweb-edu \
    --output-dir data/

# Disable filtering (raw data)
python -m supergpt.training.data_pipeline \
    --dataset HuggingFaceFW/fineweb-edu \
    --no-quality-filter \
    --no-dedup \
    --output-dir data/

Quality filters applied:

  • Minimum 50 characters, 10 words
  • Maximum 100K characters
  • ASCII ratio ≥ 90% (English filter)
  • Special character ratio ≤ 10%
  • Bigram repetition ≤ 30%
  • MinHash deduplication (Jaccard similarity > 0.8)

What Gets Created

data/
├── train.bin    # Tokenized training data (uint32 binary)
├── val.bin      # Tokenized validation data
└── meta.pkl     # Metadata (vocab size, tokenizer, stats)

7. Training Deep Dive

Basic Training

python -m supergpt.training.train \
    --preset small \
    --data-dir data/ \
    --max-iters 10000 \
    --batch-size 32 \
    --lr 3e-4 \
    --device cuda

Key Training Parameters

Parameter Flag Default Description
Model preset --preset small Model architecture
Data directory --data-dir data Path to tokenized data
Max iterations --max-iters 5000 Total training steps
Batch size --batch-size 64 Sequences per batch
Learning rate --lr 3e-4 Peak learning rate
LR schedule --lr-schedule cosine cosine or wsd
Warmup --warmup-iters 100 LR warmup steps
Grad accumulation --grad-accum 1 Effective batch multiplier
Eval interval --eval-interval 500 Evaluate every N steps
Gradient clipping --grad-clip 1.0 Max gradient norm
Weight decay --weight-decay 0.1 AdamW weight decay
Compile --compile off torch.compile() (2× speed)
Device --device auto cuda, cpu, mps
Dtype --dtype auto bfloat16, float16, float32

Using torch.compile (Highly Recommended)

# ~2× faster training on GPU
python -m supergpt.training.train \
    --preset small \
    --data-dir data/ \
    --compile \
    --device cuda

Note: First iteration takes 1-2 minutes to compile. After that, training is ~2× faster.

Gradient Accumulation (Simulate Larger Batches)

# Effective batch size = batch_size × grad_accum = 8 × 16 = 128
python -m supergpt.training.train \
    --preset large \
    --batch-size 8 \
    --grad-accum 16 \
    --device cuda

Auto-Resume

superGPT automatically resumes from the latest checkpoint:

# First run: trains from scratch, writes checkpoints/latest.pt
python -m supergpt.training.train --preset small --max-iters 5000

# If interrupted, just re-run — it auto-resumes from iter 5000:
python -m supergpt.training.train --preset small --max-iters 10000

8. Text Generation

Basic Generation

python -m supergpt.inference.generate \
    --checkpoint checkpoints/best.pt \
    --prompt "The theory of general relativity states that" \
    --max-tokens 200 \
    --temperature 0.8

Sampling Parameters

Parameter Flag Default Description
Temperature --temperature 0.8 Randomness (0=greedy, 1=creative)
Top-k --top-k 50 Keep top K tokens
Top-p --top-p 0.9 Nucleus sampling threshold
Min-p --min-p 0.05 Minimum probability threshold
Repetition penalty --rep-penalty 1.1 Penalize repeated tokens
Max tokens --max-tokens 200 Max tokens to generate

Interactive Mode

# Chat-like interface
python -m supergpt.inference.generate \
    --checkpoint checkpoints/best.pt \
    --interactive

# You:> What is machine learning?
# Model: Machine learning is a branch of artificial intelligence...

Speculative Decoding (2-3× Faster)

# Use a small "draft" model to speed up generation
python -m supergpt.inference.generate \
    --checkpoint checkpoints/large_model.pt \
    --draft-checkpoint checkpoints/small_model.pt \
    --spec-k 5 \
    --prompt "Explain quantum computing"

9. Fine-Tuning with LoRA

LoRA (Low-Rank Adaptation) lets you fine-tune with minimal memory:

# Fine-tune on your custom data
python -m supergpt.training.finetune \
    --checkpoint checkpoints/best.pt \
    --data-dir data/my_finetune_data/ \
    --lora-rank 64 \
    --lora-alpha 16 \
    --max-iters 5000 \
    --lr 2e-5 \
    --batch-size 16 \
    --compile --device cuda

LoRA Parameters

Rank Memory Savings Quality Best For
4-8 ~95% Good Quick experiments
16-32 ~90% Great Domain adaptation
64-128 ~80% Excellent Full fine-tuning substitute

10. Knowledge Distillation

Transfer knowledge from a large "teacher" model to a smaller "student":

python -m supergpt.training.distill \
    --teacher checkpoints/large_model.pt \
    --student-preset small \
    --data-dir data/ \
    --alpha 0.7 \
    --temperature 2.0 \
    --max-iters 10000 \
    --compile --device cuda
  • alpha: Balance between teacher soft targets (higher α) and hard targets (lower α)
  • temperature: Higher = softer teacher output = more knowledge transfer

11. Multi-GPU Training (FSDP)

Scale to multiple GPUs with Fully Sharded Data Parallel:

# 4 GPUs
torchrun --nproc_per_node=4 \
    -m supergpt.training.train \
    --preset large \
    --data-dir data/ \
    --distributed \
    --compile

# 8 GPUs across 2 nodes
torchrun --nproc_per_node=4 --nnodes=2 --node-rank=0 \
    --master-addr=node0 --master-port=29500 \
    -m supergpt.training.train \
    --preset xl \
    --distributed \
    --compile

12. Advanced Features

Mixture of Experts (MoE)

# Train with 8 experts, 2 active per token
python -m supergpt.training.train \
    --preset small \
    --use-moe \
    --n-experts 8 \
    --n-experts-active 2 \
    --data-dir data/ \
    --compile --device cuda

Multi-Token Prediction (MTP)

# Predict 3 tokens ahead simultaneously (DeepSeek V3 style)
python -m supergpt.training.train \
    --preset small \
    --n-predict-tokens 3 \
    --data-dir data/ \
    --compile --device cuda

FP8 Training (H100 Only)

# 2× memory savings on H100 GPUs
python -m supergpt.training.train \
    --preset xl \
    --fp8 \
    --data-dir data/ \
    --compile --device cuda

Sliding Window Attention (Mistral Style)

python -m supergpt.training.train \
    --preset small \
    --sliding-window 1024 \
    --block-size 4096 \
    --data-dir data/

13. Monitoring & Debugging

WandB Integration

pip install wandb
wandb login

python -m supergpt.training.train \
    --preset small \
    --data-dir data/ \
    --wandb

TensorBoard

pip install tensorboard

python -m supergpt.training.train \
    --preset small \
    --data-dir data/ \
    --tensorboard

# View in browser:
tensorboard --logdir runs/

Architecture Visualizer

# Interactive dashboard showing model architecture, attention patterns, weights
python -m supergpt.tools.visualize --checkpoint checkpoints/best.pt
# Opens browser at http://localhost:5000

14. Exporting & Serving

Export to ONNX

python -m supergpt.inference.export \
    --checkpoint checkpoints/best.pt \
    --format onnx \
    --output model.onnx

HTTP API Server

python -m supergpt.inference.serve \
    --checkpoint checkpoints/best.pt \
    --port 8000

# Test:
curl "http://localhost:8000/generate?prompt=Hello%20world&max_tokens=100"

15. Troubleshooting

Common Issues

Problem Solution
CUDA out of memory Reduce --batch-size, add --grad-accum, or use a smaller --preset
torch.compile() slow First iteration takes 1-2 min. Add --compile flag (worth the wait)
Loss not decreasing Check data quality, try lower --lr, ensure data isn't corrupted
Generated text is garbage Train longer, check tokenizer matches between train and generate
uint16/uint32 mismatch Auto-detected now! Ensure meta.pkl has correct vocab_size
Checkpoint size mismatch Delete old checkpoints if switching presets: rm checkpoints/*.pt
NaN in loss Reduce --lr, add --grad-clip 1.0, check for data corruption
Multi-GPU hangs Ensure NCCL_DEBUG=INFO, check --nproc_per_node matches GPU count

Getting Help