Practical guide to tuning hyperparameters for optimal training performance.
This guide provides systematic approaches to tuning the most impactful hyperparameters in tensorlogic-train. We focus on practical, empirically-validated strategies rather than exhaustive grid searches.
- Learning Rate
- Batch Size
- Optimizer Selection
- Learning Rate Schedules
- Regularization
- Early Stopping
- Practical Workflow
Impact: ⭐⭐⭐⭐⭐ (Highest) Difficulty: Medium Time to tune: 30-60 minutes
Learning rate is the single most important hyperparameter. Too high causes divergence, too low wastes time.
Use the built-in LR finder to automatically discover optimal learning rate:
use tensorlogic_train::{LearningRateFinder, CallbackList};
let mut callbacks = CallbackList::new();
callbacks.add(Box::new(LearningRateFinder::new(
1e-7, // start_lr
1.0, // end_lr
100, // num_steps
).with_exponential_scaling()));
// After training, inspect the LR vs loss curve
// Optimal LR is typically where loss decreases fastestIf not using LR finder, try this sequence:
1. Start conservatively:
let lr = 1e-3; // Good default for Adam/AdamW2. Test 3-5 values spanning 2 orders of magnitude:
let lrs = vec![1e-4, 3e-4, 1e-3, 3e-3, 1e-2];3. Train for 5-10 epochs each, pick the fastest converging:
| Optimizer | Typical Range | Good Default | Notes |
|---|---|---|---|
| SGD | 1e-2 to 1e-1 | 1e-2 | Needs larger LR than adaptive methods |
| SGD + Momentum | 1e-2 to 1e-1 | 3e-2 | Momentum allows higher LR |
| Adam | 1e-4 to 1e-3 | 1e-3 | Most robust choice |
| AdamW | 1e-4 to 1e-3 | 1e-3 | Same as Adam |
| RMSprop | 1e-4 to 1e-3 | 1e-3 | Similar to Adam |
| Adagrad | 1e-2 to 1e-1 | 1e-2 | Learning rate decays automatically |
Too High:
- Loss increases or oscillates wildly
- NaN/Inf values appear
- Training unstable even with gradient clipping
- Validation loss much worse than training loss
Too Low:
- Training very slow (loss barely decreases)
- Stuck in poor local minimum
- Hours of training without progress
For long training runs, consider cyclic learning rates:
use tensorlogic_train::OneCycleLrScheduler;
let scheduler = Box::new(OneCycleLrScheduler::new(
0.0001, // initial_lr
0.01, // max_lr
100, // total_epochs
));Impact: ⭐⭐⭐⭐ (High) Difficulty: Easy Time to tune: 15-30 minutes
Batch size affects:
- Training speed (larger = faster per epoch)
- Gradient noise (larger = more stable)
- Generalization (smaller often better)
- Memory usage (GPU/RAM constraint)
1. Start with power of 2:
let batch_size = 32; // Good default2. Increase until memory constrained:
// Try: 32, 64, 128, 256, 512
// Stop when out of memory3. Adjust learning rate with batch size:
Linear Scaling Rule:
- Double batch size → double learning rate
- Caveat: Only works up to ~256-512
let base_lr = 0.001;
let base_batch = 32;
let new_batch = 128;
let new_lr = base_lr * (new_batch as f64 / base_batch as f64);
// new_lr = 0.004| Dataset Size | Recommended Batch Size | Reason |
|---|---|---|
| < 1000 | 16-32 | Avoid overfitting |
| 1000-10000 | 32-64 | Balanced |
| 10000-100000 | 64-128 | Stable gradients |
| > 100000 | 128-512 | Efficiency |
If you hit memory limits:
Option 1: Gradient Accumulation (not yet implemented)
// Simulate larger batch by accumulating gradients
// effective_batch_size = batch_size * accumulation_stepsOption 2: Reduce batch size, adjust LR
let batch_size = 16; // Smaller
let lr = 0.0005; // Proportionally smaller LRImpact: ⭐⭐⭐ (Medium-High) Difficulty: Easy Time to tune: 5-15 minutes
1. Do you have enough time to tune?
├─ Yes → Try multiple optimizers
└─ No → Use Adam (safest default)
2. Is your dataset small (<10k samples)?
├─ Yes → Try SGD with momentum
└─ No → Use Adam/AdamW
3. Do you need the best possible performance?
├─ Yes → Try AdamW, then tune others
└─ No → Stick with Adam
4. Is training time critical?
├─ Yes → Use Adam (faster convergence)
└─ No → Try SGD (might generalize better)
| Optimizer | Pros | Cons | Best For |
|---|---|---|---|
| Adam | Fast, robust, good default | Can overfit, less interpretable | Most tasks, quick prototyping |
| AdamW | Better generalization than Adam | Slightly slower | Production models, long training |
| SGD | Simple, generalizes well | Slow, needs tuning | When you have time, final tuning |
| SGD + Momentum | Faster than plain SGD | Still needs tuning | Computer vision, proven architectures |
| RMSprop | Good for RNNs | Less stable than Adam | Recurrent models |
| NAdam | Combines Adam + Nesterov | More complex | Advanced use |
| LAMB | Large batch training | Complex, niche | Very large batches |
1. Adam (Default):
use tensorlogic_train::{AdamOptimizer, OptimizerConfig};
let optimizer = AdamOptimizer::new(OptimizerConfig {
learning_rate: 0.001,
beta1: 0.9,
beta2: 0.999,
epsilon: 1e-8,
..Default::default()
});2. AdamW (Better Regularization):
use tensorlogic_train::{AdamWOptimizer, OptimizerConfig};
let optimizer = AdamWOptimizer::new(OptimizerConfig {
learning_rate: 0.001,
weight_decay: 0.01, // L2 regularization
beta1: 0.9,
beta2: 0.999,
..Default::default()
});3. SGD with Momentum (Classical):
use tensorlogic_train::{SgdOptimizer, OptimizerConfig};
let optimizer = SgdOptimizer::new(OptimizerConfig {
learning_rate: 0.01, // Higher LR than Adam
momentum: 0.9, // Standard value
..Default::default()
});Most people never change β₁ and β₂. If you do:
-
β₁ (first moment): Usually 0.9
- Increase to 0.95 for smoother training
- Decrease to 0.8 for faster adaptation
-
β₂ (second moment): Usually 0.999
- Use 0.99 for small batches
- Use 0.9999 for very large batches
-
ε (epsilon): Usually 1e-8
- Increase to 1e-7 if encountering NaN
- Rarely needs tuning
Start with 0.01, adjust based on overfitting:
let weight_decay = if overfitting {
0.1 // Stronger regularization
} else if underfitting {
0.001 // Weaker regularization
} else {
0.01 // Default
};Impact: ⭐⭐⭐ (Medium) Difficulty: Medium Time to tune: 30-60 minutes
✅ Use when:
- Training for many epochs (>50)
- Learning rate plateaus
- Want to squeeze out final performance
❌ Skip when:
- Quick experiments (<10 epochs)
- Already achieving good results
- Limited tuning time
1. Step Decay (Simple, Effective)
use tensorlogic_train::StepLrScheduler;
let scheduler = StepLrScheduler::new(
0.001, // initial_lr
0.1, // gamma (multiply by 0.1)
30, // step_size (every 30 epochs)
);
// Drops: 0.001 → 0.0001 → 0.00001Best for: Known epoch counts, proven pipelines
2. Cosine Annealing (Smooth, Popular)
use tensorlogic_train::CosineAnnealingLrScheduler;
let scheduler = CosineAnnealingLrScheduler::new(
0.001, // initial_lr
0.00001, // min_lr
100, // t_max (total epochs)
);Best for: Long training, smooth convergence
3. OneCycle (Fast Training)
use tensorlogic_train::OneCycleLrScheduler;
let scheduler = OneCycleLrScheduler::new(
0.0001, // initial_lr
0.01, // max_lr (10-100x higher!)
50, // total_epochs
);Best for: When training time is limited, proven to work well
4. ReduceOnPlateau (Adaptive)
use tensorlogic_train::ReduceLrOnPlateauCallback;
let callback = ReduceLrOnPlateauCallback::new(
0.5, // factor (reduce by 50%)
10, // patience (wait 10 epochs)
0.0001, // min_delta
1e-6, // min_lr
);Best for: Uncertain optimal LR, automatic tuning
Training Duration?
├─ Short (<20 epochs)
│ └─ No schedule or Warmup only
├─ Medium (20-100 epochs)
│ ├─ Known duration → CosineAnnealing
│ └─ Uncertain → ReduceOnPlateau
└─ Long (>100 epochs)
├─ Want fastest → OneCycle
├─ Want smoothest → CosineAnnealing
└─ Want automatic → ReduceOnPlateau
Always recommended for Adam/AdamW:
use tensorlogic_train::WarmupScheduler;
let scheduler = WarmupScheduler::new(
0.0, // start_lr
0.001, // target_lr
5, // warmup_epochs
);Prevents instability in early training.
Impact: ⭐⭐⭐ (Medium) Difficulty: Easy Time to tune: 15-30 minutes
1. Weight Decay (L2 Regularization)
use tensorlogic_train::{AdamWOptimizer, OptimizerConfig};
let optimizer = AdamWOptimizer::new(OptimizerConfig {
weight_decay: 0.01, // Start here
..Default::default()
});Tuning:
- Too high (>0.1): Underfitting, loss doesn't decrease
- Too low (<0.001): Overfitting, train/val gap large
- Sweet spot: 0.001 to 0.1
2. Gradient Clipping
use tensorlogic_train::{OptimizerConfig, GradClipMode};
let optimizer = AdamOptimizer::new(OptimizerConfig {
grad_clip: Some(1.0), // Clip at norm=1.0
grad_clip_mode: GradClipMode::Norm,
..Default::default()
});When to use:
- Exploding gradients (NaN/Inf)
- Unstable training
- RNNs / transformers
Typical values:
- Conservative: 0.5
- Standard: 1.0
- Aggressive: 5.0
3. Early Stopping
use tensorlogic_train::EarlyStoppingCallback;
let callback = EarlyStoppingCallback::new(
10, // patience
0.001, // min_delta
);Always recommended! Prevents wasting time and overfitting.
1. Start with no regularization
2. Train to convergence
3. Check train/val gap:
├─ Gap < 5% → No regularization needed
├─ Gap 5-20% → Add weight_decay=0.01
└─ Gap > 20% → Add weight_decay=0.1 + early stopping
4. If still overfitting:
└─ Reduce model size or get more data
Impact: ⭐⭐⭐⭐ (High) Difficulty: Easy Time to tune: 5 minutes
use tensorlogic_train::EarlyStoppingCallback;
let callback = EarlyStoppingCallback::new(
patience, // How many epochs to wait
min_delta, // Minimum improvement threshold
);| Dataset Size | Recommended Patience | Reason |
|---|---|---|
| < 1000 | 5-10 | Quick overfitting |
| 1000-10000 | 10-20 | Moderate |
| > 10000 | 20-50 | Slow convergence |
- 0.001: Standard, catches meaningful improvements
- 0.0001: Very sensitive, might stop too late
- 0.01: Aggressive, stops early
// Typical configuration
let patience = 10;
let min_delta = 0.001;
// For long training
let patience = 20;
let min_delta = 0.0001;
// For quick experiments
let patience = 5;
let min_delta = 0.01;Goal: Get something working
// Use defaults, minimal tuning
let lr = 0.001;
let batch_size = 32;
let optimizer = AdamOptimizer::new(OptimizerConfig::default());
let epochs = 50;
let early_stopping = EarlyStoppingCallback::new(10, 0.001);
// Train and see if it works at allGoal: Find good learning rate and batch size
// Step 1: Use LR Finder (15 min)
let lr_finder = LearningRateFinder::new(1e-7, 1.0, 100);
// → Pick LR where loss decreases fastest
// Step 2: Try batch sizes (30 min)
for batch in [32, 64, 128] {
// Train for 10 epochs each
// Pick fastest converging
}
// Step 3: Add early stopping (5 min)
let early_stopping = EarlyStoppingCallback::new(10, 0.001);Goal: Squeeze out performance
// Step 1: Try different optimizers (1 hour)
// Compare Adam vs AdamW vs SGD
// Step 2: Add scheduling (1 hour)
// Try CosineAnnealing or OneCycle
// Step 3: Tune regularization (1 hour)
// Adjust weight_decay if overfitting
// Step 4: Final polish (1 hour)
// Fine-tune patience, min_deltaGoal: Best possible model
// Everything from refinement, plus:
// - Extensive hyperparameter search
// - Multiple random seeds
// - Cross-validation
// - Ensemble methods❌ Bad:
// Changed LR, batch size, optimizer, AND schedule
// Can't tell what helped!✅ Good:
// Change one thing, see effect, then change next❌ Bad:
// Only monitoring training loss
// Might be overfitting severely✅ Good:
// Always split data, monitor validation❌ Bad:
// Running for 1000 epochs without early stopping
// Wasting time and overfitting✅ Good:
// Use early stopping with reasonable patience❌ Bad:
// Not plotting loss over time
// Missing obvious issues✅ Good:
// Plot train/val loss
// Look for divergence, plateaus| If you increase... | Then adjust... | Direction |
|---|---|---|
| Batch size | Learning rate | Increase |
| Learning rate | Gradient clip | Might need to add/increase |
| Model size | Regularization | Increase |
| Dataset size | Everything | Less tuning needed |
| Training time | Patience | Increase |
| Weight decay | Learning rate | Might increase slightly |
┌─────────────────────────────────────┐
│ QUICK START DEFAULTS │
├─────────────────────────────────────┤
│ Optimizer: Adam │
│ Learning Rate: 0.001 │
│ Batch Size: 32 │
│ Early Stopping: 10 epochs │
│ Weight Decay: 0 (start) │
│ Gradient Clip: None (start) │
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│ TUNING PRIORITY │
├─────────────────────────────────────┤
│ 1. Learning Rate (use LR finder) │
│ 2. Batch Size (32→64→128) │
│ 3. Early Stopping (always use!) │
│ 4. Weight Decay (if overfitting) │
│ 5. LR Schedule (if time permits) │
│ 6. Optimizer (try AdamW) │
└─────────────────────────────────────┘
**Last Updated: 2026-01-28 Version: 0.1.0-rc.1