Deploy superGPT on RunPod for GPU Training

A step-by-step guide to renting cloud GPUs on RunPod and training LLMs with superGPT — from zero to a trained model.

Why RunPod?
Creating a RunPod Account
Choosing Your GPU
Launching a Pod
Setting Up superGPT
Preparing Training Data
Running Training
Monitoring Training
Testing Your Model
Downloading Your Model
Multi-GPU Training
Automation Script
Cost Optimization Tips
Troubleshooting

1. Why RunPod?

RunPod provides on-demand cloud GPUs at competitive prices. Here's how it compares:

Provider	A100 80GB	H100 80GB	Min Billing	Spot Pricing
RunPod	~$1.64/hr	~$3.29/hr	Per second	Yes (40% off)
AWS SageMaker	~$4.10/hr	~$8.22/hr	Per hour	Yes
Google Cloud	~$3.67/hr	~$11.98/hr	Per minute	Yes
Lambda Labs	~$1.10/hr	~$2.49/hr	Per hour	No

RunPod advantages:

Per-second billing — only pay for what you use
Spot instances — 40-70% cheaper for fault-tolerant training
Pre-built templates — PyTorch, CUDA, cuDNN ready to go
Persistent storage — your data survives pod restarts
SSH access — full control, just like a local machine
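
To make the per-second billing point concrete, here is a small sketch that bills the same short job under each provider's minimum billing increment. The hourly rates and increments come from the table above; the `billed_cost` helper and the 25-minute job length are illustrative assumptions, and real invoices may round differently.

```python
import math

def billed_cost(rate_per_hour, runtime_seconds, increment_seconds):
    """Cost of a job when usage is rounded up to the billing increment."""
    billed_units = math.ceil(runtime_seconds / increment_seconds)
    billed_seconds = billed_units * increment_seconds
    return rate_per_hour * billed_seconds / 3600

runtime = 25 * 60  # hypothetical 25-minute experiment on an A100 80GB

# name -> (hourly rate in USD, billing increment in seconds), from the table
providers = {
    "RunPod": (1.64, 1),
    "AWS SageMaker": (4.10, 3600),
    "Google Cloud": (3.67, 60),
    "Lambda Labs": (1.10, 3600),
}

for name, (rate, increment) in providers.items():
    print(f"{name:14s} ${billed_cost(rate, runtime, increment):.2f}")
```

For short experiments the billing increment can matter as much as the hourly rate, because hourly billing charges a full hour for a 25-minute run.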

2. Creating a RunPod Account

Go to runpod.io
Click Sign Up → create account with email or Google
Go to Billing → add a payment method
Add credits ($10-25 is enough for testing)
Go to Settings → SSH Keys → add your public SSH key:

# On your local machine:
# If you don't have an SSH key yet:
ssh-keygen -t ed25519 -C "your@email.com"

# Copy your public key:
cat ~/.ssh/id_ed25519.pub
# → Paste this into RunPod's SSH Keys settings

3. Choosing Your GPU

GPU Selection Guide

Your Goal	GPU	VRAM	RunPod Cost	What You Can Train
Testing / Learning	RTX 4090	24 GB	~$0.44/hr	Models up to ~350M params
Serious Training	A100 80GB	80 GB	~$1.64/hr	Models up to ~1.5B params
Large Models	H100 80GB	80 GB	~$3.29/hr	Models up to ~3B params + FP8
Multi-GPU	4× A100	320 GB	~$6.56/hr	Models up to ~7B params
Production	8× H100	640 GB	~$26.32/hr	Models up to ~70B params

Model Size → GPU Requirements

superGPT Preset	Params	Min VRAM	Recommended GPU
`small`	~10M	2 GB	Any (even CPU)
`medium`	~124M	8 GB	RTX 4090 / A100
`large`	~350M	24 GB	A100 40GB
`xl`	~774M	40 GB	A100 80GB
`gpt4`	~1.3B	60 GB	A100 80GB
`deepseek`	~680M	40 GB	A100 80GB

Pro tip: Start with an RTX 4090 ($0.44/hr) for testing, then scale up to A100 for real training.
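
The two tables above can be combined into a quick lookup. The sketch below picks the cheapest single GPU with strictly more VRAM than a preset's minimum, so there is some headroom. The VRAM and price figures are copied from this guide (the A100 40GB price comes from the cost section); the `cheapest_gpu` helper and the "strictly more" rule are assumptions for illustration, not part of superGPT.

```python
# Minimum VRAM per preset in GB, copied from the preset table above.
PRESET_MIN_VRAM_GB = {
    "small": 2, "medium": 8, "large": 24,
    "xl": 40, "gpt4": 60, "deepseek": 40,
}

# (name, VRAM in GB, USD per hour) for single-GPU options in this guide.
GPUS = [
    ("RTX 4090", 24, 0.44),
    ("A100 40GB", 40, 1.10),
    ("A100 80GB", 80, 1.64),
    ("H100 80GB", 80, 3.29),
]

def cheapest_gpu(preset):
    """Return the cheapest GPU with more VRAM than the preset's minimum."""
    need = PRESET_MIN_VRAM_GB[preset]
    candidates = [gpu for gpu in GPUS if gpu[1] > need]
    if not candidates:
        raise ValueError(f"no single GPU in the list fits preset {preset!r}")
    return min(candidates, key=lambda gpu: gpu[2])

for preset in PRESET_MIN_VRAM_GB:
    name, vram, price = cheapest_gpu(preset)
    print(f"{preset:9s} -> {name} ({vram} GB, ${price:.2f}/hr)")
```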

4. Launching a Pod

Step-by-Step

Go to runpod.io/console/pods
Click + Deploy
Choose your GPU (e.g., 1x A100 80GB)
Select template: RunPod PyTorch 2.1 (or latest PyTorch template)
Set Volume Disk: 50-100 GB (for data + checkpoints)
Set Container Disk: 20 GB
Click Deploy On-Demand (or Spot for cheaper)

Connect via SSH

After the pod starts (1-2 minutes):

# Find your SSH command in the RunPod dashboard:
# Pod → Connect → SSH Terminal
# It will look like:
ssh root@<IP_ADDRESS> -p <PORT> -i ~/.ssh/id_ed25519

# Example:
ssh root@194.68.245.39 -p 22193 -i ~/.ssh/id_ed25519

Verify GPU

# On the RunPod machine:
nvidia-smi
# Should show your GPU with full VRAM available

python3 -c "import torch; print(f'CUDA: {torch.cuda.is_available()}, GPU: {torch.cuda.get_device_name(0)}')"
# → CUDA: True, GPU: NVIDIA A100 80GB

5. Setting Up superGPT

Once you're SSH'd into RunPod:

# Navigate to persistent storage (survives restarts)
cd /workspace

# Clone superGPT
git clone https://github.com/viralcode/superGPT.git
cd superGPT

# Install dependencies (PyTorch is already installed on RunPod)
pip install transformers datasets tiktoken numpy

# Optional: for monitoring
pip install wandb

# Verify everything works
python -c "import supergpt; print('superGPT ready!')"
python -c "import torch; print(f'GPU: {torch.cuda.get_device_name(0)}, Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.0f} GB')"

6. Preparing Training Data

Option A: FineWeb-Edu (Recommended for Quality)

# 100M tokens (~15 min to prepare on RunPod)
python -m supergpt.training.data_pipeline \
    --dataset HuggingFaceFW/fineweb-edu \
    --tokenizer Qwen/Qwen2.5-0.5B \
    --max-tokens 100000000 \
    --output-dir data/

# 1B tokens (~2 hours to prepare)
python -m supergpt.training.data_pipeline \
    --dataset HuggingFaceFW/fineweb-edu \
    --tokenizer Qwen/Qwen2.5-0.5B \
    --max-tokens 1000000000 \
    --output-dir data/

Option B: Quick Test with Shakespeare

python -m supergpt.training.data_pipeline \
    --dataset shakespeare \
    --output-dir data/

Option C: Custom Dataset from HuggingFace

# Train on code
python -m supergpt.training.data_pipeline \
    --dataset bigcode/starcoderdata \
    --tokenizer Qwen/Qwen2.5-0.5B \
    --text-field content \
    --max-tokens 500000000 \
    --output-dir data/

# Train on Wikipedia
python -m supergpt.training.data_pipeline \
    --dataset wikipedia \
    --subset 20220301.en \
    --tokenizer Qwen/Qwen2.5-0.5B \
    --max-tokens 500000000 \
    --output-dir data/

Verify Data

ls -lh data/
# train.bin  ~380 MB (100M tokens × 4 bytes)
# val.bin    ~7.8 MB
# meta.pkl   ~1 KB

python -c "
import numpy as np, pickle
train = np.memmap('data/train.bin', dtype=np.uint32, mode='r')
with open('data/meta.pkl','rb') as f: meta = pickle.load(f)
print(f'Tokens: {len(train):,}')
print(f'Vocab:  {meta[\"vocab_size\"]:,}')
print(f'Size:   {train.nbytes / 1e9:.2f} GB')
"

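The file sizes listed above follow directly from the token count, since each token is stored as a 4-byte uint32. A minimal sketch of that arithmetic, using only the standard library, is shown below; the helper names are hypothetical, and the 4-byte width is an assumption taken from the `np.uint32` dtype in the check above.

```python
import os

BYTES_PER_TOKEN = 4  # uint32, matching the dtype used in the check above

def expected_size_bytes(num_tokens):
    """Size a token file should have for a given token count."""
    return num_tokens * BYTES_PER_TOKEN

def token_count(path):
    """Infer the number of tokens in a .bin file from its size on disk."""
    size = os.path.getsize(path)
    if size % BYTES_PER_TOKEN:
        raise ValueError(f"{path}: size {size} is not a multiple of {BYTES_PER_TOKEN}")
    return size // BYTES_PER_TOKEN

print(f"100M tokens -> {expected_size_bytes(100_000_000) / 2**20:.0f} MiB")

# Only inspect the real file if data preparation has already run.
if os.path.exists("data/train.bin"):
    print(f"train.bin holds {token_count('data/train.bin'):,} tokens")
```

A file whose size is not a multiple of 4 bytes usually means the write was interrupted, so it is worth re-running data preparation in that case.
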
7. Running Training

Basic Training (Single GPU)

# Train a small model (10M params, ~1 hour)
python -m supergpt.training.train \
    --preset small \
    --data-dir data/ \
    --max-iters 10000 \
    --batch-size 32 \
    --lr 3e-4 \
    --compile \
    --device cuda

Medium Model (124M params)

# Requires ~8 GB VRAM (fits on an RTX 4090; A100 or H100 also work)
python -m supergpt.training.train \
    --preset medium \
    --data-dir data/ \
    --max-iters 50000 \
    --batch-size 16 \
    --lr 1.5e-4 \
    --compile \
    --device cuda

Background Training (Keeps Running After SSH Disconnect)

# IMPORTANT: Use nohup so training continues even if SSH disconnects
nohup python -u -m supergpt.training.train \
    --preset small \
    --data-dir data/ \
    --max-iters 10000 \
    --batch-size 32 \
    --lr 3e-4 \
    --compile \
    --device cuda \
    > /workspace/training.log 2>&1 &

echo "Training started! PID: $!"
echo "Monitor with: tail -f /workspace/training.log"

Critical: Always use nohup for training runs. SSH connections can drop at any time. Without nohup, your training will be killed when SSH disconnects.

Full Training Script (One Command)

#!/bin/bash
# save as: /workspace/run_training.sh

set -e

echo "============================================"
echo "  superGPT Training Pipeline"
echo "============================================"

cd /workspace/superGPT

# Step 1: Prepare data (skip if already done)
if [ ! -f data/train.bin ]; then
    echo "Step 1: Preparing data..."
    python -m supergpt.training.data_pipeline \
        --dataset HuggingFaceFW/fineweb-edu \
        --tokenizer Qwen/Qwen2.5-0.5B \
        --max-tokens 100000000 \
        --output-dir data/
else
    echo "Step 1: Data already prepared, skipping."
fi

# Step 2: Train
echo "Step 2: Starting training..."
python -u -m supergpt.training.train \
    --preset small \
    --data-dir data/ \
    --max-iters 10000 \
    --batch-size 32 \
    --lr 3e-4 \
    --compile \
    --device cuda

# Step 3: Test generation
echo "Step 3: Testing generation..."
python -m supergpt.inference.generate \
    --checkpoint checkpoints/best.pt \
    --prompt "The most important concepts in machine learning are" \
    --max-tokens 200 \
    --temperature 0.7

echo "============================================"
echo "  Training complete!"
echo "  Checkpoint: checkpoints/best.pt"
echo "============================================"

Run it:

chmod +x /workspace/run_training.sh
nohup /workspace/run_training.sh > /workspace/full_run.log 2>&1 &
tail -f /workspace/full_run.log

8. Monitoring Training

From Your Local Machine

# Check training progress
ssh root@<IP> -p <PORT> -i ~/.ssh/id_ed25519 "tail -20 /workspace/training.log"

# Check GPU utilization
ssh root@<IP> -p <PORT> -i ~/.ssh/id_ed25519 "nvidia-smi --query-gpu=utilization.gpu,memory.used,temperature.gpu --format=csv"

# Check if training is still running
ssh root@<IP> -p <PORT> -i ~/.ssh/id_ed25519 "ps aux | grep train | grep -v grep"

What to Look For

  Step      0 | train loss: 11.9940 | val loss: 11.9936 | lr: 0.00e+00   ← Starting (random)
  iter    100 | loss 8.0898 | 61411 tok/s | lr 3.00e-04                   ← Dropping fast (good!)
  iter    500 | loss 6.1243 | 61372 tok/s | lr 2.99e-04                   ← Still dropping
  Step   1000 | train loss: 5.5000 | val loss: 5.6000 | lr: 2.95e-04     ← Val > Train = slightly overfitting (ok)
  Step   5000 | train loss: 4.8000 | val loss: 5.0000 | lr: 1.50e-04     ← Converging
  Step  10000 | train loss: 4.5000 | val loss: 4.7000 | lr: 3.00e-05     ← Final (cosine decay)
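
The starting loss of about 12 is what a model that guesses uniformly over the vocabulary would score: the cross-entropy of a uniform guess is ln(vocab_size). The sketch below computes that reference value. The vocabulary size used here is an assumed figure for the Qwen2.5 tokenizer; read the real one from `data/meta.pkl`.

```python
import math

def random_init_loss(vocab_size):
    """Cross-entropy (in nats) of a uniform guess over the vocabulary."""
    return math.log(vocab_size)

vocab_size = 151_936  # assumed value; use meta["vocab_size"] from data/meta.pkl
print(f"Expected starting loss: {random_init_loss(vocab_size):.2f}")
```

A step-0 loss far above this value often points to a tokenizer or vocabulary-size mismatch rather than a training problem.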

Good signs:

Loss decreasing consistently
Validation loss close to training loss
GPU utilization > 80%
50K+ tokens/second throughput

Bad signs:

Loss stuck or increasing → lower learning rate
Val loss much higher than train loss → overfitting, need more data
GPU utilization < 50% → increase batch size
NaN in loss → reduce learning rate, check data
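
The checks above can be partly automated. The following is a rough sketch that scans log lines in the two formats shown in the sample output and flags NaN losses, low throughput, and a wide train/val gap. The `check_log` helper is hypothetical, the 50K tok/s threshold comes from the list above, and the 0.5 gap threshold is an arbitrary assumption to tune for your run.

```python
import math
import re

# Matches lines like: "iter    100 | loss 8.0898 | 61411 tok/s | lr 3.00e-04"
ITER_RE = re.compile(r"iter\s+(\d+)\s*\|\s*loss\s+(\S+)\s*\|\s*(\d+)\s*tok/s")
# Matches lines like: "Step   1000 | train loss: 5.5000 | val loss: 5.6000 | ..."
STEP_RE = re.compile(r"Step\s+(\d+)\s*\|\s*train loss:\s*(\S+)\s*\|\s*val loss:\s*(\S+)")

def check_log(lines, min_tok_s=50_000, max_gap=0.5):
    """Return a list of warning strings for the given log lines."""
    warnings = []
    for line in lines:
        m = ITER_RE.search(line)
        if m:
            step, loss, tok_s = int(m[1]), float(m[2]), int(m[3])
            if math.isnan(loss):
                warnings.append(f"iter {step}: loss is NaN")
            if tok_s < min_tok_s:
                warnings.append(f"iter {step}: throughput {tok_s} tok/s is low")
            continue
        m = STEP_RE.search(line)
        if m:
            step, train, val = int(m[1]), float(m[2]), float(m[3])
            if math.isnan(train) or math.isnan(val):
                warnings.append(f"step {step}: loss is NaN")
            elif val - train > max_gap:
                warnings.append(f"step {step}: val loss exceeds train loss by {val - train:.2f}")
    return warnings

sample = [
    "  iter    100 | loss 8.0898 | 61411 tok/s | lr 3.00e-04",
    "  Step   1000 | train loss: 5.5000 | val loss: 5.6000 | lr: 2.95e-04",
]
print(check_log(sample) or "no warnings")
```

To run it against a live log, pass the lines of `/workspace/training.log` instead of `sample`.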

With WandB (Pretty Dashboard)

pip install wandb
wandb login  # Enter your API key

python -m supergpt.training.train \
    --preset small \
    --data-dir data/ \
    --wandb \
    --compile --device cuda

9. Testing Your Model

After training completes:

cd /workspace/superGPT

# Generate text
python -m supergpt.inference.generate \
    --checkpoint checkpoints/best.pt \
    --prompt "In the field of artificial intelligence, transformers are" \
    --max-tokens 300 \
    --temperature 0.7

# Try different prompts
python -m supergpt.inference.generate \
    --checkpoint checkpoints/best.pt \
    --prompt "def fibonacci(n):" \
    --max-tokens 200 \
    --temperature 0.3

# Interactive mode
python -m supergpt.inference.generate \
    --checkpoint checkpoints/best.pt \
    --interactive

10. Downloading Your Model

Option A: SCP (Simplest)

# From your LOCAL machine:
scp -P <PORT> -i ~/.ssh/id_ed25519 \
    root@<IP>:/workspace/superGPT/checkpoints/best.pt \
    ./my_model.pt

# Download everything
scp -r -P <PORT> -i ~/.ssh/id_ed25519 \
    root@<IP>:/workspace/superGPT/checkpoints/ \
    ./checkpoints/

Option B: Push to GitHub

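# On RunPod (requires push access to the remote, e.g. your own fork).
# GitHub rejects files over 100 MB, so this only works for small checkpoints;
# use Git LFS, SCP, or the HuggingFace Hub for larger ones.
cd /workspace/superGPT
git add checkpoints/best.pt
git commit -m "Add trained model checkpoint"
git push

# Then pull on your local machine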

Option C: Upload to HuggingFace Hub

pip install huggingface_hub
huggingface-cli login

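python -c "
from huggingface_hub import HfApi
api = HfApi()
repo_id = 'your-username/my-supergpt-model'
# Create the repo first; upload_file fails if it does not exist yet
api.create_repo(repo_id, repo_type='model', exist_ok=True)
api.upload_file(
    path_or_fileobj='checkpoints/best.pt',
    path_in_repo='model.pt',
    repo_id=repo_id,
    repo_type='model',
)
print('Uploaded to HuggingFace!')
"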

11. Multi-GPU Training

RunPod Multi-GPU Pods

In RunPod, select a multi-GPU option (e.g., 4× A100 80GB)
SSH in and verify:

nvidia-smi
# Should show 4 GPUs

python -c "import torch; print(f'GPUs: {torch.cuda.device_count()}')"
# → GPUs: 4

Run with FSDP:

cd /workspace/superGPT

torchrun --nproc_per_node=4 \
    -m supergpt.training.train \
    --preset large \
    --data-dir data/ \
    --max-iters 50000 \
    --distributed \
    --compile

Multi-GPU Performance

GPUs	Preset	Params	Throughput	Time for 10K iters
1× A100	small	10M	~60K tok/s	~45 min
1× A100	medium	124M	~30K tok/s	~2 hrs
4× A100	large	350M	~80K tok/s	~3 hrs
8× A100	xl	774M	~100K tok/s	~5 hrs
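
Multiplying the times in this table by the hourly rate gives a rough cost per run. The sketch below assumes every GPU is an A100 80GB at the ~$1.64 per GPU-hour on-demand rate quoted earlier in this guide; the `run_cost` helper is illustrative only.

```python
A100_RATE = 1.64  # USD per GPU-hour, from the pricing tables in this guide

# (gpu_count, preset, hours for 10K iterations), copied from the table above
runs = [(1, "small", 0.75), (1, "medium", 2.0), (4, "large", 3.0), (8, "xl", 5.0)]

def run_cost(gpu_count, hours, rate=A100_RATE):
    """Total cost of a run billed per GPU-hour."""
    return gpu_count * hours * rate

for gpus, preset, hours in runs:
    gpu_hours = gpus * hours
    print(f"{gpus}x A100 {preset:7s} {gpu_hours:5.1f} GPU-hours  ${run_cost(gpus, hours):.2f}")
```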

12. Automation Script

Save this on your local machine to automate the entire process:

#!/bin/bash
# File: train_on_runpod.sh
# Usage: ./train_on_runpod.sh <runpod-ip> <runpod-port>

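# Stop early if the IP or port is missing
if [ $# -ne 2 ]; then
    echo "Usage: $0 <runpod-ip> <runpod-port>" >&2
    exit 1
fi

IP=$1
PORT=$2
KEY=~/.ssh/id_ed25519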

SSH="ssh -o StrictHostKeyChecking=no root@$IP -p $PORT -i $KEY"

echo "🚀 Setting up superGPT on RunPod..."

# 1. Clone and setup
$SSH "cd /workspace && \
    git clone https://github.com/viralcode/superGPT.git 2>/dev/null; \
    cd superGPT && git pull && \
    pip install -q transformers datasets tiktoken numpy"

# 2. Prepare data (if not already done)
$SSH "cd /workspace/superGPT && \
    if [ ! -f data/train.bin ]; then \
        python -m supergpt.training.data_pipeline \
            --dataset HuggingFaceFW/fineweb-edu \
            --tokenizer Qwen/Qwen2.5-0.5B \
            --max-tokens 100000000 \
            --output-dir data/; \
    fi"

# 3. Start training in background
$SSH "cd /workspace/superGPT && \
    nohup python -u -m supergpt.training.train \
        --preset small \
        --data-dir data/ \
        --max-iters 10000 \
        --batch-size 32 \
        --lr 3e-4 \
        --compile \
        --device cuda \
        > /workspace/training.log 2>&1 &"

echo "✅ Training started!"
echo "📊 Monitor: $SSH 'tail -f /workspace/training.log'"
echo "🖥️  GPU:     $SSH 'nvidia-smi'"

13. Cost Optimization Tips

1. Use Spot Instances (40-70% Off)

Spot instances can be preempted but are much cheaper:

On-Demand A100: ~$1.64/hr
Spot A100:      ~$0.99/hr  (40% savings!)

Use for: Quick experiments, data preparation, anything you can restart.
Don't use for: Long training runs without checkpointing. For long runs on spot, use nohup plus checkpoint resuming so a preempted job can pick up where it stopped.
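
Whether spot is worth it depends on how much work preemptions force you to redo. A back-of-the-envelope sketch, using the two A100 prices above, is shown here; the helper names and the 20% redo figure are assumptions for illustration.

```python
def spot_break_even_overhead(on_demand_rate, spot_rate):
    """Fraction of redone compute at which spot costs the same as on-demand."""
    return on_demand_rate / spot_rate - 1

def spot_cost(spot_rate, useful_hours, redo_fraction):
    """Spot cost when a fraction of the useful work must be repeated."""
    return spot_rate * useful_hours * (1 + redo_fraction)

on_demand, spot = 1.64, 0.99  # USD per hour, from the prices above

limit = spot_break_even_overhead(on_demand, spot)
print(f"Spot stays cheaper while redone work is below {limit:.0%}")

# Hypothetical 10-hour job where preemptions cost 20% extra compute
print(f"Spot:      ${spot_cost(spot, 10, 0.20):.2f}")
print(f"On-demand: ${on_demand * 10:.2f}")
```

Frequent checkpoints keep the redo fraction small, which is why checkpointing matters more on spot than on on-demand pods.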

2. Use the Smallest GPU That Works

Task	Cheapest GPU	Cost
Data prep	Any (CPU works)	$0.20/hr
Train ≤124M params	RTX 4090	$0.44/hr
Train ≤350M params	A100 40GB	$1.10/hr
Train ≤1B params	A100 80GB	$1.64/hr

3. Use `torch.compile()` (Free 2× Speedup)

Always add --compile. If it doubles throughput, the same run needs half the GPU time and so costs about half as much.

4. Stop Your Pod When Not Training

RunPod charges per second. Stop your pod when you're not using it. Your data persists on the volume disk.

5. Estimate Cost Before Training

# Quick cost estimator
tokens_per_second = 60000   # With --compile on A100
cost_per_hour = 1.64        # A100 80GB

total_tokens = 100_000_000  # 100M tokens
max_iters = 10000
batch_size = 32
block_size = 256

tokens_per_iter = batch_size * block_size  # 8,192
total_time_seconds = max_iters * tokens_per_iter / tokens_per_second
total_hours = total_time_seconds / 3600
total_cost = total_hours * cost_per_hour

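epochs = max_iters * tokens_per_iter / total_tokens  # passes over the dataset

print(f"Estimated time: {total_hours:.1f} hours")
print(f"Estimated cost: ${total_cost:.2f}")
print(f"Data coverage:  {epochs:.2f} epochs")
# → Estimated time: 0.4 hours
# → Estimated cost: $0.62
# → Data coverage:  0.82 epochs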

14. Troubleshooting

Pod Won't Start

"No available machines" → Try a different GPU type or region
"Insufficient funds" → Add more credits in Billing

SSH Connection Refused

# Make sure pod is running (green status in dashboard)
# Try the web terminal in RunPod dashboard first
# Check that you're using the right port (not 22)
ssh root@<IP> -p <PORT> -i ~/.ssh/id_ed25519

Training Killed When SSH Disconnects

# ALWAYS use nohup or tmux:
nohup python -u -m supergpt.training.train ... > /workspace/log.log 2>&1 &

# OR use tmux:
tmux new -s train
python -m supergpt.training.train ...
# Press Ctrl+B, then D to detach
# Reconnect later: tmux attach -t train

CUDA Out of Memory

# Reduce batch size:
--batch-size 16  # instead of 32 or 64

# Use gradient accumulation:
--batch-size 8 --grad-accum 4  # effective batch = 32

# Use a smaller preset:
--preset small  # instead of medium or large

# Enable gradient checkpointing:
--gradient-checkpointing
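
Gradient accumulation keeps the effective batch size constant while shrinking what has to fit in memory at once: effective batch = batch size × accumulation steps. The sketch below picks values for `--batch-size` and `--grad-accum` given the largest micro-batch that fits on your GPU; `plan_accumulation` is a hypothetical helper, not part of superGPT.

```python
def plan_accumulation(target_batch, max_micro_batch):
    """Return (micro_batch, accum_steps) whose product equals target_batch.

    Picks the largest micro-batch that fits in memory and divides the target.
    """
    if target_batch < 1 or max_micro_batch < 1:
        raise ValueError("batch sizes must be positive")
    # micro == 1 always divides the target, so the loop always returns
    for micro in range(min(max_micro_batch, target_batch), 0, -1):
        if target_batch % micro == 0:
            return micro, target_batch // micro

micro, accum = plan_accumulation(target_batch=32, max_micro_batch=8)
print(f"--batch-size {micro} --grad-accum {accum}")
```

With a target of 32 and room for 8 samples per step, this reproduces the `--batch-size 8 --grad-accum 4` example above.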

Slow Training (< 20K tok/s)

# Add --compile (most impactful)
--compile

# Increase batch size to saturate the GPU
--batch-size 64

# Check GPU utilization
nvidia-smi  # Should be > 80%

Data Preparation Hangs

# HuggingFace datasets download can be slow
# Try with a subset first:
--max-tokens 10000000  # 10M tokens (quick test)

# Or use the sample subset:
--subset sample-10BT

Checkpoint Too Large to Download

# Compress before downloading:
# On RunPod:
tar czf /workspace/model.tar.gz -C /workspace/superGPT/checkpoints best.pt

# On local machine:
scp -P <PORT> root@<IP>:/workspace/model.tar.gz ./
tar xzf model.tar.gz

Quick Reference Card

# ──────────────────────────────────────────────────
# superGPT RunPod Cheat Sheet
# ──────────────────────────────────────────────────

# SSH into RunPod
ssh root@<IP> -p <PORT> -i ~/.ssh/id_ed25519

# Setup (first time only)
cd /workspace && git clone https://github.com/viralcode/superGPT.git
cd superGPT && pip install transformers datasets tiktoken numpy

# Prepare data
python -m supergpt.training.data_pipeline \
    --dataset HuggingFaceFW/fineweb-edu \
    --tokenizer Qwen/Qwen2.5-0.5B \
    --max-tokens 100000000 --output-dir data/

# Train (background, survives SSH disconnect)
nohup python -u -m supergpt.training.train \
    --preset small --data-dir data/ --max-iters 10000 \
    --batch-size 32 --lr 3e-4 --compile --device cuda \
    > /workspace/training.log 2>&1 &

# Monitor
tail -f /workspace/training.log
nvidia-smi

# Generate text
python -m supergpt.inference.generate \
    --checkpoint checkpoints/best.pt \
    --prompt "Hello world" --max-tokens 200

# Download model (from local machine)
scp -P <PORT> root@<IP>:/workspace/superGPT/checkpoints/best.pt ./
