Training Guide

Quick Start

# Start training
./run.sh

# Interactive commands:
# - Type 'stop' to save and exit
# - Type 'status' for current progress
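
The interactive commands are read from standard input while training runs. A minimal sketch of how such a listener could work, assuming a background thread; the flag and variable names here are illustrative, not the script's actual implementation:

import threading

stop_requested = threading.Event()
status = {"step": 0, "loss": float("nan")}  # updated by the training loop

def command_listener():
    # Poll stdin in a background thread so the training loop is never blocked
    while not stop_requested.is_set():
        cmd = input().strip().lower()
        if cmd == "stop":
            stop_requested.set()  # the training loop saves a checkpoint and exits
        elif cmd == "status":
            print(f"Step {status['step']} | Loss: {status['loss']:.4f}")

threading.Thread(target=command_listener, daemon=True).start()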

Configuration

Edit src/train.py to customize training:

# Hardware Settings
BATCH_SIZE = 512          # Adjust for your VRAM (256-1024)
NUM_WORKERS = 20          # CPU threads for data loading
PREFETCH_FACTOR = 4       # Prefetch batches per worker

# Model Architecture
D_MODEL = 128             # Model dimension (128, 256, 512)
NUM_LAYERS = 4            # Transformer layers (4, 6, 8)
NHEAD = 4                 # Attention heads (4, 8, 16)

# Training Parameters
LEARNING_RATE = 1e-4      # Initial learning rate
STEPS = 1000000           # Total training steps
SAVE_EVERY = 1000         # Checkpoint frequency

Training Process

1. Data Generation

Two modes:

  • Normal Mode (50%): Sequential numbers
  • Hard Mode (50%): Numbers > 2^68, special patterns

Generation runs in parallel across the NUM_WORKERS (20 by default) CPU workers.
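
A minimal sketch of what a single generation step might look like, assuming the input is the Collatz parity vector of the starting number and the target is its total stopping time; the function names and the Hard Mode range used here are illustrative, not the actual src/train.py code:

import random

def collatz_sample(n):
    # Return (parity_vector, stopping_time) for a starting number n
    parities, steps = [], 0
    while n != 1:
        parities.append(n % 2)                  # 0 = halving step, 1 = 3n+1 step
        n = n // 2 if n % 2 == 0 else 3 * n + 1
        steps += 1
    return parities, steps

def generate_sample(counter):
    if random.random() < 0.5:                   # Normal Mode: sequential numbers
        n = counter + 2
    else:                                       # Hard Mode: numbers beyond 2^68
        n = random.randrange(2**68, 2**72)
    return collatz_sample(n)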

2. Forward Pass

# Input: Parity vector [0, 1, 0, 1, ...]
# Output: (stopping_time_pred, next_step_logits)

with torch.amp.autocast('cuda'):
    stopping_pred, next_step_logits = model(src, src_key_padding_mask)

3. Loss Calculation

# Stopping time (log-space Huber loss)
log_stopping_times = torch.log1p(stopping_times)
loss_stopping = criterion_stopping(stopping_pred, log_stopping_times)

# Sequence (cross-entropy)
loss_next_step = criterion_next_step(next_step_logits, target_seq)

# Total
loss = loss_stopping + loss_next_step
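
The two criteria are not shown above; a plausible setup, assuming PyTorch's built-in Huber and cross-entropy losses (an assumption, not a quote of the actual code):

import torch.nn as nn

# Huber loss on log1p(stopping_time) is robust to the heavy-tailed stopping times
criterion_stopping = nn.HuberLoss()

# Cross-entropy over next-step classes; the ignore_index for padded positions is assumed
criterion_next_step = nn.CrossEntropyLoss(ignore_index=-100)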

4. Backward Pass (AMP)

# Scale loss for mixed precision
scaler.scale(loss).backward()

# Gradient clipping
scaler.unscale_(optimizer)
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

# Optimizer step
scaler.step(optimizer)
scaler.update()
scheduler.step()

# Clear gradients for the next step
optimizer.zero_grad(set_to_none=True)

Optimizations

Mixed Precision Training (AMP)

Benefits:

  • 40% VRAM reduction
  • 30% speed increase
  • No accuracy loss

Implementation:

scaler = torch.amp.GradScaler('cuda')

with torch.amp.autocast('cuda'):
    ...  # forward pass and loss computation run under autocast
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()

Gradient Clipping

Clipping the gradient norm prevents exploding gradients:

# With AMP, unscale the gradients first via scaler.unscale_(optimizer) (see step 4 above)
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

Learning Rate Scheduling

Cosine Annealing:

import torch.optim as optim

scheduler = optim.lr_scheduler.CosineAnnealingLR(
    optimizer,
    T_max=STEPS,      # anneal over the full training run
    eta_min=1e-6      # floor learning rate
)

Monitoring

Console Output

Step 100000 | Loss: 0.3698 (Stop: 0.0003, Seq: 0.3695) | LR: 4.13e-05 | Time: 26.83s
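
A line like this can be produced with a plain f-string; a sketch with assumed variable names:

def log_progress(step, loss, loss_stopping, loss_next_step, lr, elapsed):
    # Prints a line in the same format as the console output above
    print(
        f"Step {step} | Loss: {loss:.4f} "
        f"(Stop: {loss_stopping:.4f}, Seq: {loss_next_step:.4f}) "
        f"| LR: {lr:.2e} | Time: {elapsed:.2f}s"
    )

log_progress(100000, 0.3698, 0.0003, 0.3695, 4.13e-05, 26.83)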

Discord Alerts

Configure webhook in src/discord_bot.py:

DISCORD_WEBHOOK_URL = "your_webhook_url_here"

Alerts are sent for:

  • Training start/stop
  • Anomalies detected
  • Non-trivial cycles found
  • Every 500 steps
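
A minimal sketch of how an alert could be posted to the webhook, assuming the requests library; the function name and payload are illustrative, not necessarily what src/discord_bot.py does:

import requests

DISCORD_WEBHOOK_URL = "your_webhook_url_here"

def send_alert(message: str) -> None:
    # Discord webhooks accept a JSON payload with a "content" field
    try:
        requests.post(DISCORD_WEBHOOK_URL, json={"content": message}, timeout=10)
    except requests.RequestException:
        pass  # never let a failed alert interrupt training

send_alert("Training started")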

Checkpoints

Checkpoints are saved every SAVE_EVERY steps (1000 by default):

checkpoints/model_step_100000.pth

Each checkpoint contains:

  • Model weights
  • Optimizer state
  • Scheduler state
  • Scaler state (AMP)
  • Current step
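
A sketch of how a checkpoint with those fields could be written; the dictionary keys are assumptions, and the actual layout is whatever src/train.py defines:

import torch

def save_checkpoint(path, model, optimizer, scheduler, scaler, step):
    # One file per checkpoint, containing everything needed to resume exactly
    torch.save({
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "scheduler": scheduler.state_dict(),
        "scaler": scaler.state_dict(),
        "step": step,
    }, path)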

Resuming Training

Training automatically resumes from the latest checkpoint:

./run.sh
# Output: "Loading checkpoint: checkpoints/model_step_100000.pth"
# Output: "Resumed from step 100000"

Troubleshooting

Out of Memory (OOM)

Solution 1: Reduce batch size

BATCH_SIZE = 256  # or 384

Solution 2: Reduce workers

NUM_WORKERS = 12  # instead of 20

Solution 3: Enable memory optimization

export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

Slow Training

Check CPU utilization:

htop  # Should see ~85% usage

Increase workers if CPU < 80%:

NUM_WORKERS = 24  # if you have more cores

Loss Not Decreasing

Possible causes:

  1. Learning rate too high → Reduce to 5e-5
  2. Gradient explosion → Check gradient norms (see the sketch after this list)
  3. Data quality → Verify Hard Mode generation
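
For cause 2, the gradient norm is cheap to monitor because clip_grad_norm_ returns the total norm measured before clipping; a sketch with an assumed warning threshold:

import torch

def clip_and_check_gradients(model, step, threshold=10.0):
    # clip_grad_norm_ both clips and returns the pre-clipping total norm
    grad_norm = float(torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0))
    if grad_norm > threshold:  # threshold is an assumption; tune it per run
        print(f"Warning: gradient norm {grad_norm:.2f} at step {step}")
    return grad_norm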

Advanced Techniques

Distributed Training (Multi-GPU)

# Coming soon!
# Will support DDP across multiple GPUs

Custom Loss Weights

# Adjust importance of each loss component
loss = 2.0 * loss_stopping + 1.0 * loss_next_step

Early Stopping

# Add to training loop
if loss < 0.30:
    print("Target loss reached!")
    break

Next: Loop Searcher