Training Guide
```bash
# Start training
./run.sh

# Interactive commands:
# - Type 'stop' to save and exit
# - Type 'status' for current progress
```

Edit `src/train.py` to customize training:
```python
# Hardware Settings
BATCH_SIZE = 512        # Adjust for your VRAM (256-1024)
NUM_WORKERS = 20        # CPU threads for data loading
PREFETCH_FACTOR = 4     # Prefetch batches per worker

# Model Architecture
D_MODEL = 128           # Model dimension (128, 256, 512)
NUM_LAYERS = 4          # Transformer layers (4, 6, 8)
NHEAD = 4               # Attention heads (4, 8, 16)

# Training Parameters
LEARNING_RATE = 1e-4    # Initial learning rate
STEPS = 1000000         # Total training steps
SAVE_EVERY = 1000       # Checkpoint frequency
```

Data is generated in two modes:
- Normal Mode (50%): Sequential numbers
- Hard Mode (50%): Numbers > 2^68, special patterns
Generation happens in parallel across 20 CPU workers; a rough sketch of what one worker produces is shown below.
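As an illustration only, here is a minimal sketch of building a single `(parity_vector, stopping_time)` pair, assuming the sequences are Collatz (3n+1) trajectories as the 2^68 threshold and stopping-time target suggest; the function name and value ranges are not taken from the project code:

```python
import random

def collatz_sample(n: int):
    """Hypothetical sketch: parity vector and stopping time for a start value n."""
    parities, steps = [], 0
    while n != 1:
        parities.append(n % 2)                    # parity of the current term
        n = n // 2 if n % 2 == 0 else 3 * n + 1   # Collatz step
        steps += 1
    return parities, steps

# Normal Mode: sequential start values; Hard Mode: huge values above 2^68
normal_example = collatz_sample(27)
hard_example = collatz_sample(random.randrange(2**68, 2**72))
```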
The model takes a parity vector as input and predicts a stopping time and next-step logits:

```python
# Input: Parity vector [0, 1, 0, 1, ...]
# Output: (stopping_time_pred, next_step_logits)
with torch.amp.autocast('cuda'):
    stopping_pred, next_step_logits = model(src, src_key_padding_mask)
```
The loss has two components:

```python
# Stopping time (log-space Huber loss)
log_stopping_times = torch.log1p(stopping_times)
loss_stopping = criterion_stopping(stopping_pred, log_stopping_times)

# Sequence (cross-entropy)
loss_next_step = criterion_next_step(next_step_logits, target_seq)

# Total
loss = loss_stopping + loss_next_step
```
The backward pass scales the loss, clips gradients, and steps the optimizer and scheduler:

```python
# Scale loss for mixed precision
scaler.scale(loss).backward()

# Gradient clipping
scaler.unscale_(optimizer)
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

# Optimizer step
scaler.step(optimizer)
scaler.update()
scheduler.step()
```

Mixed precision (AMP) benefits:
- 40% VRAM reduction
- 30% speed increase
- No accuracy loss
Implementation:

```python
scaler = torch.amp.GradScaler('cuda')

with torch.amp.autocast('cuda'):
    # Forward pass (model call and loss computation, as shown above)
    ...

scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
```

Gradient clipping prevents exploding gradients:

```python
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```

The learning rate follows a Cosine Annealing schedule:
```python
scheduler = optim.lr_scheduler.CosineAnnealingLR(
    optimizer,
    T_max=STEPS,
    eta_min=1e-6
)
```

Example training log output:

```
Step 100000 | Loss: 0.3698 (Stop: 0.0003, Seq: 0.3695) | LR: 4.13e-05 | Time: 26.83s
```
For Discord notifications, configure the webhook in `src/discord_bot.py`:

```python
DISCORD_WEBHOOK_URL = "your_webhook_url_here"
```

Alerts are sent for:
- Training start/stop
- Anomalies detected
- Non-trivial cycles found
- Every 500 steps
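For reference, a minimal sketch of how an alert could be posted to that webhook (the `send_alert` helper is hypothetical and may not match the actual `src/discord_bot.py`):

```python
import requests

DISCORD_WEBHOOK_URL = "your_webhook_url_here"

def send_alert(message: str) -> None:
    """Hypothetical helper: post a plain-text alert to the Discord webhook."""
    requests.post(DISCORD_WEBHOOK_URL, json={"content": message}, timeout=10)

# Example: a periodic progress alert every 500 steps
send_alert("Step 100000 | Loss: 0.3698 | LR: 4.13e-05")
```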
Checkpoints are saved every 1000 steps:

```
checkpoints/model_step_100000.pth
```

Each checkpoint contains:
- Model weights
- Optimizer state
- Scheduler state
- Scaler state (AMP)
- Current step
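A minimal sketch of writing such a checkpoint (the key names and the `step` variable are assumptions for illustration, not necessarily the layout used by `src/train.py`):

```python
import torch

# Hypothetical checkpoint layout matching the list above
checkpoint = {
    "model_state_dict": model.state_dict(),
    "optimizer_state_dict": optimizer.state_dict(),
    "scheduler_state_dict": scheduler.state_dict(),
    "scaler_state_dict": scaler.state_dict(),   # AMP GradScaler state
    "step": step,                               # current training step
}
torch.save(checkpoint, f"checkpoints/model_step_{step}.pth")
```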
Training automatically resumes from the latest checkpoint:

```bash
./run.sh
# Output: "Loading checkpoint: checkpoints/model_step_100000.pth"
# Output: "Resumed from step 100000"
```
If training runs out of GPU memory:

Solution 1: Reduce batch size

```python
BATCH_SIZE = 256  # or 384
```

Solution 2: Reduce workers

```python
NUM_WORKERS = 12  # instead of 20
```

Solution 3: Enable memory optimization

```bash
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
```

If data loading is slow, check CPU utilization:

```bash
htop  # Should see ~85% usage
```

Increase workers if CPU usage is below 80%:

```python
NUM_WORKERS = 24  # if you have more cores
```

Possible causes if the loss is not decreasing:
- Learning rate too high → Reduce to 5e-5
- Gradient explosion → Check gradient norms (see the sketch below)
- Data quality → Verify Hard Mode generation
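One way to check gradient norms is to compute the global L2 norm after `backward()` (and after `scaler.unscale_(optimizer)`, so the norm is not distorted by AMP loss scaling). This is a sketch; the warning threshold is an arbitrary example:

```python
import torch

def grad_norm(model: torch.nn.Module) -> float:
    """Global L2 norm of all parameter gradients."""
    total = 0.0
    for p in model.parameters():
        if p.grad is not None:
            total += p.grad.detach().norm(2).item() ** 2
    return total ** 0.5

norm = grad_norm(model)
if norm > 10.0:  # threshold chosen arbitrarily for illustration
    print(f"Warning: large gradient norm {norm:.2f}")
```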
Multi-GPU training is coming soon and will support DDP across multiple GPUs.

```python
# Adjust importance of each loss component
loss = 2.0 * loss_stopping + 1.0 * loss_next_step
```

```python
# Add to training loop
if loss < 0.30:
    print("Target loss reached!")
    break
```

Next: Loop Searcher