Module 3
The backbone of any deep‑learning system is data plus architecture. Abundant, diverse data lets a network discover robust patterns; without it, even the best model underfits. When raw data are scarce, data augmentation steps in: synthetic transformations of existing samples that expose the model to new viewpoints, lighting conditions, and geometries, improving generalisation across supervised, self‑supervised, and reinforcement‑learning scenarios.
| Strategy | Workflow | Pros | Cons |
|---|---|---|---|
| Online (on‑the‑fly) | Random transforms are applied in memory, per batch, during training. | Infinite variety, no extra storage, easy to randomise probabilities. | Slight CPU/GPU overhead each iteration. |
| Offline (on‑disk) | Augmented images are pre‑generated and stored (e.g. ×5 the original set). | Zero runtime cost; deterministic runs; useful for slow I/O pipelines. | Requires extra disk space; fixed once written; risk of label drift with some tasks. |
In practice, online augmentation dominates modern pipelines because it delivers a broader distribution for free and stays synchronised with labels.
| Augmentation | What it does | Why it helps |
|---|---|---|
| Horizontal Flip | Mirrors the image across the vertical axis. | Makes the model invariant to left–right orientation (natural scenes, faces, street signs). |
| Vertical Flip | Mirrors across the horizontal axis. | Useful in aerial, medical, and microscopic imagery where “up” has no semantic meaning. |
| Rotation | Rotates by a random angle (e.g. ±30 °). | Teaches rotational invariance; be sure to mask black borders in the loss term. |
| Random Crop / Resized Crop | Extracts a random window and resizes back to target size. | Promotes object‑centred robustness; simulates zoom and translations. |
| Color Jitter | Perturbs brightness, contrast, saturation, and hue. | Forces the network to focus on shape and texture rather than absolute colour values. |
| Gaussian Blur | Convolves with a Gaussian kernel. | Models defocus or motion blur; improves resilience to real‑world camera shake. |
| Grayscale | Converts RGB to single‑channel intensity. | Encourages reliance on luminance patterns; handy when colour may vary widely. |
| Solarize / Posterize / Histogram Equalisation | Non‑linear tone‑mapping operations. | Expands dynamic range; highlights edges and low‑contrast details. |
Task‑aware rule of thumb: If colour carries semantic meaning (e.g. fine‑grained species recognition), avoid aggressive colour transforms. If geometry is rigid (e.g. printed digits), large rotations may break label integrity.
Popular libraries:
- TorchVision: `torchvision.transforms` ships battle‑tested building blocks; compose them with explicit probabilities for online pipelines (see the sketch after this list). Docs: https://pytorch.org/vision/main/transforms.html
- Albumentations: a high‑performance, task‑agnostic library with GPU support and rich image‑specific transforms (elastic deformations, CLAHE, cutout, mixup, etc.). GitHub: https://github.com/albumentations-team/albumentations
- Custom Python / CUDA kernels: when latency is critical (real‑time inference) or exotic domain‑specific ops are needed.
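A minimal sketch of an online pipeline built with `torchvision.transforms` (the transform choices, probabilities, and magnitudes below are illustrative, not prescriptive):

```python
from torchvision import transforms

# Illustrative probabilities and magnitudes; tune them per task
train_tf = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomApply([transforms.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8),
    transforms.RandomGrayscale(p=0.2),
    transforms.RandomApply([transforms.GaussianBlur(kernel_size=5)], p=0.3),
    transforms.ToTensor(),
])

# Typically passed to a dataset so transforms run per sample, per epoch (online)
# dataset = torchvision.datasets.ImageFolder("path/to/train", transform=train_tf)
```

Because the transforms are re-sampled every time an image is loaded, each epoch sees a different variant of every sample at no extra storage cost.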
Some final comments:
- Verify that synthetic pixels introduced by rotations or perspective warps are ignored in the loss (mask them or fill with realistic context); see the masking sketch after this list.
- Combine augmentations probabilistically rather than deterministically to avoid overfitting to a fixed pattern.
- Keep a clean validation set—never augment it—so metrics reflect true generalisation.
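As a rough illustration of the masking idea for dense (per‑pixel) losses, one can rotate an all‑ones tensor with the same angle to recover which pixels are real and which are synthetic border fill. Everything here (tensor shapes, the MSE objective, using the rotated input as a stand‑in prediction) is a simplifying assumption:

```python
import torch
import torch.nn.functional as F
import torchvision.transforms.functional as TF

img = torch.rand(1, 3, 64, 64)       # toy input batch
target = torch.rand(1, 3, 64, 64)    # toy dense target (e.g. a reconstruction)

angle = 25.0
rot_img = TF.rotate(img, angle)                  # rotated input; borders filled with 0
rot_target = TF.rotate(target, angle)
valid = TF.rotate(torch.ones_like(img), angle)   # 1 on real pixels, 0 on synthetic borders

pred = rot_img                                   # stand-in for a model's prediction
per_pixel = F.mse_loss(pred, rot_target, reduction="none")
loss = (per_pixel * valid).sum() / valid.sum().clamp(min=1)  # ignore synthetic pixels
```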
Batch Normalization (BN) is a critical innovation introduced to stabilize and accelerate the training of deep neural networks. Proposed by Sergey Ioffe and Christian Szegedy (2015), it normalizes intermediate activations within each mini-batch during training, ensuring stable gradients and facilitating faster convergence.
At a high level, BN performs two main operations:
- Normalization: Computes the mean $\mu_B$ and variance $\sigma_B^2$ of each feature dimension across the batch, then normalizes each feature:

  $$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$$

  where $\epsilon$ is a small constant added for numerical stability.
- Scaling and Shifting: Introduces learnable parameters $\gamma$ (scale) and $\beta$ (shift) to restore the representational power lost by normalization:

  $$y_i = \gamma \hat{x}_i + \beta$$
These operations are applied independently to each feature channel, allowing the model to learn the optimal normalization for each dimension.
- Mitigates Internal Covariate Shift: By reducing distributional changes in intermediate layers, BN stabilizes training dynamics.
- Allows Higher Learning Rates: BN's stabilizing effect permits larger learning rates, accelerating convergence.
- Regularization Effect: Slight stochasticity introduced by batch-level statistics provides mild regularization, potentially reducing overfitting.
- Reduces Dependence on Initialization: Networks become less sensitive to weight initialization, simplifying model training.
Batch Normalization layers are widely available in frameworks like:
- PyTorch: [torch.nn.BatchNorm2d](https://pytorch.org/docs/stable/generated/torch.nn.BatchNorm2d.html)
Typically, BN is placed after convolutional layers and before activation functions.
```python
# PyTorch example: Conv -> BatchNorm -> ReLU block
import torch.nn as nn

in_channels, out_channels = 3, 16  # example values

layer = nn.Sequential(
    nn.Conv2d(in_channels, out_channels, kernel_size=3),
    nn.BatchNorm2d(out_channels),
    nn.ReLU()
)
```

Despite its popularity, recent architectures increasingly prefer alternatives like Layer Normalization, Group Normalization, and Instance Normalization, primarily due to:
- Dependence on Batch Size: BN performance degrades significantly with small batch sizes or inconsistent batch statistics (e.g., online learning or reinforcement learning scenarios).
- Complexity in Distributed Training: Calculating batch statistics across multiple GPUs adds complexity and can degrade computational efficiency.
- Instability in Transfer Learning: When fine-tuning pretrained models, different distributions can conflict with learned BN parameters, requiring recalibration or freezing of layers.
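Given the batch-size dependence noted above, a batch-independent normalizer such as Group Normalization is a common drop-in replacement. A minimal sketch (the channel counts and group number are arbitrary example values):

```python
import torch.nn as nn

in_channels, out_channels = 3, 16  # example values

# GroupNorm statistics are computed per sample, so they do not depend on batch size
block = nn.Sequential(
    nn.Conv2d(in_channels, out_channels, kernel_size=3),
    nn.GroupNorm(num_groups=8, num_channels=out_channels),
    nn.ReLU()
)
```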
Dropout, introduced by Geoffrey Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever (this guy is amazing), and Ruslan Salakhutdinov in 2014, is a powerful regularization technique designed to prevent overfitting in deep neural networks. It is conceptually simple yet highly effective, fundamentally altering the training dynamics and generalization capabilities of neural models.
At each training step, Dropout randomly sets a fraction $p$ of the activations to zero, so each neuron is temporarily removed from the network with probability $p$.

Formally, Dropout can be expressed as:

$$\tilde{h} = m \odot h, \qquad m_i \sim \mathrm{Bernoulli}(1 - p)$$

Here, $h$ is the vector of activations, $m$ is a binary mask sampled independently for each neuron at each step, and $\odot$ denotes element-wise multiplication. With the common "inverted dropout" formulation, the retained activations are additionally scaled by $\tfrac{1}{1-p}$ during training so that no rescaling is needed at inference.
Dropout acts as implicit ensemble learning. Training with dropout can be interpreted as training exponentially many sub-networks that share parameters. At inference, averaging these sub-networks effectively simulates an ensemble, reducing variance and enhancing robustness.
PyTorch simplifies Dropout implementation with built-in modules such as torch.nn.Dropout. During training, neurons are randomly masked and the retained neurons are scaled by $\tfrac{1}{1-p}$ (inverted dropout); in evaluation mode the module passes inputs through unchanged.
```python
import torch
import torch.nn as nn

in_features, out_features = 128, 64  # example values

# Example Dropout layer
layer = nn.Sequential(
    nn.Linear(in_features, out_features),
    nn.ReLU(),
    nn.Dropout(p=0.5)
)
```

Illustrating Dropout behavior with a random tensor:
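A minimal sketch of such an illustration (the seed and tensor size are arbitrary):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

drop = nn.Dropout(p=0.5)
x = torch.rand(8)

drop.train()                 # training mode: random masking + rescaling by 1/(1-p)
y = drop(x)
print(x)
print(y)                     # surviving entries equal 2 * x; dropped entries are 0

drop.eval()                  # evaluation mode: Dropout is the identity
print(drop(x))
```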
The retained neuron values precisely match explicit scaling by $\tfrac{1}{1-p}$ (a factor of $2$ for $p=0.5$), while the dropped entries are zeroed out; in evaluation mode the input passes through unchanged.
- Typical values for the dropout probability $p$ range between 0.2 and 0.5; excessively high values may degrade model capacity.
- Dropout should not be applied to the output layer.
- Alternatives such as DropPath (stochastic depth) are more common today; see the sketch below.
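A compact DropPath (stochastic depth) sketch, similar in spirit to common implementations such as the one in `timm`; unlike Dropout, it zeroes out entire residual branches per sample rather than individual neurons:

```python
import torch
import torch.nn as nn

class DropPath(nn.Module):
    """Stochastic depth: randomly skip a residual branch per sample (sketch)."""
    def __init__(self, p: float = 0.1):
        super().__init__()
        self.p = p

    def forward(self, x):
        if not self.training or self.p == 0.0:
            return x
        keep = 1.0 - self.p
        # One Bernoulli draw per sample, broadcast over the remaining dimensions
        mask = x.new_empty((x.shape[0],) + (1,) * (x.dim() - 1)).bernoulli_(keep)
        return x * mask / keep
```

It is typically applied to the output of a residual branch before it is added back to the skip connection.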
Imagine a dartboard:
- High Bias, Low Variance: All darts land far from the bullseye but close together.
- Low Bias, High Variance: Darts spread randomly around the board.
- Low Bias, Low Variance: Tight cluster near bullseye (ideal scenario).
- High Bias, High Variance: Chaotic and unreliable performance.
Regularization intentionally introduces bias (simplifies the model) to substantially reduce variance (stabilizes performance). The net result is improved generalization.
In explicit regularization, the loss function is augmented with a penalty on model complexity:

$$\mathcal{L}(\theta) = \underbrace{\mathcal{L}_{\text{data}}(\theta)}_{\text{fidelity}} + \lambda \, \underbrace{R(\theta)}_{\text{regularization}}$$
Here:
- Fidelity ensures predictions closely match ground-truth labels.
- Regularization penalizes model complexity, discouraging overfitting.
- $\lambda$ controls the trade-off: larger values promote simpler models.
| Regularizer | Mathematical Form | Impact on Weights | Use Cases & Effects |
|---|---|---|---|
| L2 Regularization | $R(\theta) = \sum_i w_i^2$ | Shrinks weights uniformly; does not induce exact sparsity. | Stabilizes learning; reduces parameter magnitude (common default). |
| L1 Regularization | $R(\theta) = \sum_i \lvert w_i \rvert$ | Encourages sparse solutions; drives many weights to zero. | Feature selection, compressing models, promoting interpretability. |
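A small sketch of how both penalties are typically wired up in PyTorch (the model, data, and coefficient values are placeholders): L2 is most often handled through the optimizer's `weight_decay` argument, while L1 is usually added to the loss by hand.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
criterion = nn.MSELoss()

# L2: applied implicitly through the optimizer's weight_decay argument
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)

x, y = torch.randn(32, 10), torch.randn(32, 1)
loss = criterion(model(x), y)

# L1: added explicitly to the loss (coefficient chosen arbitrarily)
l1_lambda = 1e-4
loss = loss + l1_lambda * sum(p.abs().sum() for p in model.parameters())

optimizer.zero_grad()
loss.backward()
optimizer.step()
```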
In optimization, particularly with optimizers like SGD, weight decay is mathematically equivalent to adding an L2 penalty. But in Adam, this equivalence breaks due to the adaptive learning rate.
When you apply L2 regularization by adding a penalty $\frac{\lambda}{2}\lVert w \rVert^2$ to the loss (with $\lambda$ playing the role of `wd` in the snippet below), the gradient of the loss gains an extra $\lambda w$ term.
This gradient is then passed into Adam, which rescales it using the moving averages of first and second moments (adaptive behavior). Thus, L2 interacts with the moment estimates, which can lead to unintuitive effects.
In Adam, weight decay is usually implemented by adding `wd * w` (where `wd` is the weight-decay coefficient) to the gradients (the first case in the snippet below), rather than by actually subtracting `lr * wd * w` from the weights (the second case). AdamW decouples the two: the penalty is removed from the gradient, and the decay is applied directly to the weights as a separate step after the adaptive update.
This final decay step is not influenced by adaptive statistics; it enforces a direct, clean shrinkage on the weights, truly mimicking L2 regularization in a principled way.
- Adam + L2 penalty: mixes gradient with penalty → interacts with momentum.
- AdamW: separates gradient and penalty → pure, unbiased decay.
```python
# Illustrative pseudocode: loss, all_weights, w, lr, and wd are assumed to exist

# Case 1: Adam-style weight decay, expressed as an L2 penalty added to the loss
final_loss = loss + wd * all_weights.pow(2).sum() / 2

# Case 2: the equivalent direct weight decay in SGD
w = w - lr * w.grad - lr * wd * w
```

A network's learning rate (LR) is the master knob that controls how fast it learns.
Set it too high → divergence; too low → stagnation. Scheduling is the art of changing LR over training so we move fast early and fine‑tune late.
Start bold, finish precise.
Large steps explore the loss landscape; tiny steps carve the final minima.
| Schedule | Intuition | Typical Use‑Cases |
|---|---|---|
| Step | Manual “plateaus” every T steps. | Old but reliable; classification baselines. |
| Exponential | Continuous decay. | Long training jobs where fine control is unnecessary. |
| Polynomial | Smoothly drops to 0. | Semantic segmentation, detection (e.g. DeepLab). |
| Cosine Annealing | Fast start, slow finish. | Modern default for vision (ResNet, ViT). |
| Cyclic / One‑Cycle | Briefly increase LR to escape basins, then anneal. | Fast convergence in ≤100 epochs; works well with Adam/AdamW. |
Small batches, large models, or heavy regularization can make the first updates unstable.
Linear warm‑up ramps LR from 0 to $\eta_0$ over the first 3‑5 epochs, preventing gradient explosions. More info: https://arxiv.org/abs/2406.09405
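As a sketch of the warm-up-then-cosine recipe using PyTorch's built-in schedulers (the epoch counts, learning rate, and model below are placeholder choices):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# Linear warm-up for the first 5 epochs, then cosine annealing for the rest
warmup = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=0.01, total_iters=5)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=95)
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer, schedulers=[warmup, cosine], milestones=[5]
)

for epoch in range(100):
    # ... train over batches, calling loss.backward() and optimizer.step() ...
    optimizer.step()          # placeholder step for this sketch
    scheduler.step()          # advance the schedule once per epoch
    # print(epoch, scheduler.get_last_lr())
```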
- Debug with a fixed LR first. If loss doesn’t drop, a fancy schedule won’t rescue you.
- Match batch size. Double the batch → double the LR and every LR on the schedule.
- Combine with weight decay thoughtfully. Aggressive decay late in training can stall progress if LR is already near zero.
- Checkpoint at LR drops. Most performance jumps happen right after decay events.
- Hyper‑parameter search: tune initial LR; the schedule shape usually transfers.
Choosing the right optimizer profoundly affects training speed, stability, and final performance. Optimizers navigate the high-dimensional loss landscape to efficiently minimize error, adapting gradients and update magnitudes intelligently.
SGD is the simplest and most classical optimization method. At each step, parameters $\theta$ are nudged in the direction of the negative gradient of the loss, scaled by the learning rate $\eta$:

$$\theta_{t+1} = \theta_t - \eta \, \nabla_\theta \mathcal{L}(\theta_t)$$
- Pros: Simple, reliable, excellent generalization if tuned correctly.
- Cons: Sensitive to hyperparameters, slow convergence especially in ill-conditioned problems, and requires careful tuning of learning rate schedules.
Adam combines momentum (moving average of gradients) and RMSProp (moving average of squared gradients) to adaptively set learning rates per parameter:

$$m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2$$

$$\theta_{t+1} = \theta_t - \eta \, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}, \qquad \hat{m}_t = \frac{m_t}{1-\beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1-\beta_2^t}$$
- Pros: Less sensitive to learning rate, robust default, suitable for large-scale problems.
- Cons: May generalize slightly worse than finely-tuned SGD; can exhibit instability in convergence due to adaptive behaviors.
Torch implementation: torch.optim.Adam
AdamW improves Adam by separating weight decay from the adaptive gradient calculation. The weight decay directly penalizes parameters, independently of the gradient moments, providing clearer and more effective regularization:

$$\theta_{t+1} = \theta_t - \eta \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda \, \theta_t \right)$$
- Pros: Significantly improves regularization effectiveness over standard Adam; easier to tune.
- Cons: Still inherits some adaptive instability from Adam; careful tuning of hyperparameters remains essential.
Torch implementation: torch.optim.AdamW
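A minimal usage sketch contrasting the two (hyperparameter values are illustrative): with `torch.optim.Adam`, `weight_decay` is folded into the gradient and therefore rescaled by the adaptive moments, whereas `torch.optim.AdamW` applies the decay directly to the weights.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)

# Adam: weight_decay is added to the gradient, so it interacts with the adaptive moments
adam = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-2)

# AdamW: weight decay is applied directly to the weights, decoupled from the moments
adamw = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
```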
Muon applies the Newton‑Schulz iteration to (approximately) orthogonalize the momentum-based update matrix, injecting curvature-aware structure without explicitly forming the Hessian or Fisher information matrix. It aims to combine the benefits of adaptive methods with second-order optimization principles:
- Principle:
  - Approximates second-order information (curvature) without explicitly computing the Hessian.
  - Accelerates convergence, especially on ill-conditioned or complex loss surfaces.
- Pros: Faster convergence compared to first-order methods; robust to complex optimization landscapes.
- Cons: Higher computational cost per iteration due to the second-order approximation.
Reference and implementation details: Muon Optimizer
Natural gradients move parameters while accounting for the geometry of the parameter space defined by the Fisher information matrix $F$:

$$\theta_{t+1} = \theta_t - \eta \, F^{-1} \nabla_\theta \mathcal{L}(\theta_t)$$
- Pros: Achieves faster convergence by respecting the intrinsic geometry of the probability distribution space; effective in highly complex spaces such as reinforcement learning.
- Cons: Computationally expensive; approximations typically required for large-scale problems.
Torch implementation example: torch.optim.LBFGS can approximate second-order behavior; specialized libraries are required for a pure natural-gradient method.
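A minimal `torch.optim.LBFGS` sketch (toy model and data; LBFGS requires a closure because it may re-evaluate the loss several times per step):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
criterion = nn.MSELoss()
x, y = torch.randn(64, 10), torch.randn(64, 1)

optimizer = torch.optim.LBFGS(model.parameters(), lr=0.1, max_iter=20)

def closure():
    # Re-evaluate the loss and gradients each time LBFGS asks for them
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    return loss

optimizer.step(closure)
```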
- Start with AdamW for general-purpose training.
- Transition to Muon or natural gradients if faster convergence is critical and computational resources permit.
Weights & Biases (Wandb) is the modern, go-to tool for tracking, visualizing, and managing machine learning experiments. Surpassing traditional solutions like TensorBoard, Wandb streamlines deep-learning workflows with an intuitive, cloud-based platform.
- Rich Visualization: Automatically log metrics, visualize training curves, and interactively compare model runs.
- Collaborative Logging: Share live updates and insights with your team, facilitating collaboration and transparency.
- Media Support: Effortlessly upload images, videos, model predictions, and custom visualizations directly during training.
- Generous Free Tier: Provides 100 GB storage per project, supporting extensive experimentation at zero cost.
- Superior ease of use, simpler setup, and richer visualizations than TensorBoard.
- Cloud-based storage and management eliminate local storage headaches.
- Built-in collaboration tools simplify team workflows and reproducibility.
Official documentation: https://docs.wandb.ai/
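A minimal logging sketch (the project name, config values, and logged metric are hypothetical placeholders):

```python
import wandb

# Hypothetical project and config; replace with your own experiment settings
run = wandb.init(project="module-3-demo", config={"lr": 1e-3, "batch_size": 64})

for epoch in range(10):
    train_loss = 1.0 / (epoch + 1)              # placeholder metric from a training loop
    wandb.log({"epoch": epoch, "train/loss": train_loss})

run.finish()
```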