
Module 3


3.1 Data Augmentation

The backbone of any deep‑learning system is data plus architecture. Abundant, diverse data lets a network discover robust patterns; without it, even the best model underfits. When raw data are scarce, data augmentation steps in: synthetic transformations of existing samples that expose the model to new viewpoints, lighting conditions, and geometries, improving generalisation across supervised, self‑supervised, and reinforcement‑learning scenarios.

Online vs Offline Augmentation

| Strategy | Workflow | Pros | Cons |
|---|---|---|---|
| Online (on-the-fly) | Random transforms are applied in memory, per batch, during training. | Infinite variety, no extra storage, easy to randomise probabilities. | Slight CPU/GPU overhead each iteration. |
| Offline (on-disk) | Augmented images are pre-generated and stored (e.g. ×5 the original set). | Zero runtime cost; deterministic runs; useful for slow I/O pipelines. | Requires extra disk space; fixed once written; risk of label drift with some tasks. |

In practice, online augmentation dominates modern pipelines because it delivers a broader distribution for free and stays synchronised with labels.

Canonical Augmentations for Images

| Augmentation | What it does | Why it helps |
|---|---|---|
| Horizontal Flip | Mirrors the image across the vertical axis. | Makes the model invariant to left–right orientation (natural scenes, faces, street signs). |
| Vertical Flip | Mirrors across the horizontal axis. | Useful in aerial, medical, and microscopic imagery where “up” has no semantic meaning. |
| Rotation | Rotates by a random angle (e.g. ±30°). | Teaches rotational invariance; be sure to mask black borders in the loss term. |
| Random Crop / Resized Crop | Extracts a random window and resizes back to target size. | Promotes object-centred robustness; simulates zoom and translations. |
| Color Jitter | Perturbs brightness, contrast, saturation, and hue. | Forces the network to focus on shape and texture rather than absolute colour values. |
| Gaussian Blur | Convolves with a Gaussian kernel. | Models defocus or motion blur; improves resilience to real-world camera shake. |
| Grayscale | Converts RGB to single-channel intensity. | Encourages reliance on luminance patterns; handy when colour may vary widely. |
| Solarize / Posterize / Histogram Equalisation | Non-linear tone-mapping operations. | Expands dynamic range; highlights edges and low-contrast details. |

Task‑aware rule of thumb: If colour carries semantic meaning (e.g. fine‑grained species recognition), avoid aggressive colour transforms. If geometry is rigid (e.g. printed digits), large rotations may break label integrity.

[Figure: examples of the augmentations listed above]

Implementation Code

  • TorchVision: torchvision.transforms ships battle-tested building blocks; compose them with explicit probabilities for online pipelines (see the sketch after this list). Docs: https://pytorch.org/vision/main/transforms.html
  • Albumentations: a high-performance, task-agnostic library with rich image-specific transforms (elastic deformations, CLAHE, cutout-style coarse dropout, and more). GitHub: https://github.com/albumentations-team/albumentations
  • Custom Python / CUDA kernels: when latency is critical (real‑time inference) or exotic domain‑specific ops are needed.
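For reference, a minimal online pipeline built from the torchvision transforms listed above might look like the following; the probabilities and magnitudes are illustrative choices, not recommendations:

import torchvision.transforms as T

# Online augmentation: applied per sample, every time it is loaded.
train_transform = T.Compose([
    T.RandomResizedCrop(224, scale=(0.8, 1.0)),                  # random crop, resized back to 224×224
    T.RandomHorizontalFlip(p=0.5),                               # mirror across the vertical axis
    T.RandomApply([T.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8),   # brightness/contrast/saturation/hue jitter
    T.RandomGrayscale(p=0.2),
    T.RandomApply([T.GaussianBlur(kernel_size=5)], p=0.1),
    T.ToTensor(),
])

# The validation set only gets deterministic preprocessing (never augment it).
val_transform = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
])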

Some final comments:

  • Verify that synthetic pixels introduced by rotations or perspective warps are ignored in the loss (mask them or fill with realistic context).
  • Combine augmentations probabilistically rather than deterministically to avoid overfitting to a fixed pattern.
  • Keep a clean validation set—never augment it—so metrics reflect true generalisation.

3.2 Batch Normalization

Batch Normalization (BN) is a critical innovation introduced to stabilize and accelerate the training of deep neural networks. Proposed by Sergey Ioffe and Christian Szegedy (2015), it normalizes intermediate activations within each mini-batch during training, ensuring stable gradients and facilitating faster convergence.

At a high level, BN performs two main operations:

  1. Normalization: Computes the mean (μ) and variance (σ²) of each feature dimension across the batch, then normalizes each feature:

$$ \hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}} $$

where $\epsilon$ is a small constant added for numerical stability.

  2. Scaling and Shifting: Introduces learnable parameters $\gamma$ (scale) and $\beta$ (shift) to restore the representational power lost by normalization:

$$ y_i = \gamma \hat{x}_i + \beta $$

These operations are applied independently to each feature channel, allowing the model to learn the optimal normalization for each dimension.
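As a quick sanity check of the two equations above, the following sketch normalizes a random batch by hand and compares the result with PyTorch's nn.BatchNorm1d, whose parameters are initialized to $\gamma = 1$, $\beta = 0$:

import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(32, 8)                      # batch of 32 samples, 8 features

# Manual batch normalization (gamma = 1, beta = 0)
eps = 1e-5
mu = x.mean(dim=0)
var = x.var(dim=0, unbiased=False)          # BN uses the biased variance for normalization
x_hat = (x - mu) / torch.sqrt(var + eps)

bn = nn.BatchNorm1d(8, eps=eps)             # freshly initialized, in training mode
print(torch.allclose(bn(x), x_hat, atol=1e-6))   # True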


Benefits of Batch Normalization

  • Mitigates Internal Covariate Shift: By reducing distributional changes in intermediate layers, BN stabilizes training dynamics.
  • Allows Higher Learning Rates: BN's stabilizing effect permits larger learning rates, accelerating convergence.
  • Regularization Effect: Slight stochasticity introduced by batch-level statistics provides mild regularization, potentially reducing overfitting.
  • Reduces Dependence on Initialization: Networks become less sensitive to weight initialization, simplifying model training.

Practical Implementation

Batch Normalization layers are widely available in deep-learning frameworks; in PyTorch they are exposed as nn.BatchNorm1d, nn.BatchNorm2d, and nn.BatchNorm3d.

Typically, BN is placed after convolutional layers and before activation functions.

# PyTorch Example
import torch.nn as nn

in_channels, out_channels = 3, 16   # example channel counts

layer = nn.Sequential(
    nn.Conv2d(in_channels, out_channels, kernel_size=3),
    nn.BatchNorm2d(out_channels),   # normalizes each of the out_channels feature maps
    nn.ReLU()
)

Why Batch Normalization is Less Prevalent Today

Despite its popularity, recent architectures increasingly prefer alternatives like Layer Normalization, Group Normalization, and Instance Normalization, primarily due to:

  • Dependence on Batch Size: BN performance degrades significantly with small batch sizes or inconsistent batch statistics (e.g., online learning or reinforcement learning scenarios).
  • Complexity in Distributed Training: Calculating batch statistics across multiple GPUs adds complexity and can degrade computational efficiency.
  • Instability in Transfer Learning: When fine-tuning pretrained models, a different data distribution can conflict with the learned BN statistics and parameters, requiring recalibration or freezing of the BN layers (see the sketch below).
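A minimal sketch of the freezing strategy mentioned in the last point; the tiny nn.Sequential model here is a stand-in for a pretrained backbone:

import torch.nn as nn

model = nn.Sequential(                       # placeholder for a pretrained network
    nn.Conv2d(3, 16, kernel_size=3),
    nn.BatchNorm2d(16),
    nn.ReLU(),
)

def freeze_batchnorm(model: nn.Module) -> None:
    for module in model.modules():
        if isinstance(module, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            module.eval()                      # keep using the stored running statistics
            for param in module.parameters():  # gamma and beta are no longer updated
                param.requires_grad = False

freeze_batchnorm(model)
# Caveat: calling model.train() re-enables batch statistics, so re-apply after switching modes.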

3.3 Dropout

Dropout, introduced by Geoffrey Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever (this guy is amazing), and Ruslan Salakhutdinov in 2014, is a powerful regularization technique designed to prevent overfitting in deep neural networks. It is conceptually simple yet highly effective, fundamentally altering the training dynamics and generalization capabilities of neural models.

Core Concept

At each training step, Dropout randomly sets each input unit to zero with probability $p$, effectively creating a thinned network for every iteration. Consequently, each neuron must learn robust features rather than rely on co-adaptations with specific other neurons, which reduces overfitting and improves generalization. During inference, Dropout is disabled; in the inverted-dropout formulation used by modern frameworks, activations are scaled by $\frac{1}{1-p}$ during training so that no rescaling is needed at test time.


Mathematical Perspective

Formally, Dropout can be expressed as:

$$ y_i^{(l)} = m_i^{(l)} \cdot x_i^{(l)}, \quad \text{where} \quad m_i^{(l)} \sim \text{Bernoulli}(1 - p) $$

Here, $m_i^{(l)}$ is a binary mask sampled from a Bernoulli distribution. To maintain expected activations consistent during inference, the activations are typically scaled by $\frac{1}{1 - p}$.

Dropout as Regularization and Ensemble

Dropout acts as implicit ensemble learning. Training with dropout can be interpreted as training exponentially many sub-networks that share parameters. At inference, averaging these sub-networks effectively simulates an ensemble, reducing variance and enhancing robustness.

Implementation in PyTorch

PyTorch simplifies Dropout implementation with built-in modules such as torch.nn.Dropout. During training, neurons are randomly masked, and retained neurons are scaled by $\frac{1}{1-p}$ to maintain activation magnitudes consistent with inference time.

import torch
import torch.nn as nn

in_features, out_features = 128, 64   # example layer sizes

# Example Dropout layer
layer = nn.Sequential(
    nn.Linear(in_features, out_features),
    nn.ReLU(),
    nn.Dropout(p=0.5)   # each activation is zeroed with probability 0.5 during training
)

Illustrating Dropout behavior with a random tensor:

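A minimal sketch of that illustration (exact values depend on the random seed):

import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(8)

p = 0.5
drop = nn.Dropout(p=p)                  # training mode by default
y = drop(x)
print(y)                                 # zeros where units were dropped

# Retained entries equal the original values scaled by 1/(1-p)
mask = y != 0
print(torch.allclose(y[mask], x[mask] / (1 - p)))   # True

drop.eval()
print(torch.allclose(drop(x), x))        # identity at inference time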

As the check above shows, the retained neuron values precisely match explicit scaling by $\frac{1}{1 - p}$.


Best Practices

  • Typical values for dropout probability $p$ range between 0.2 to 0.5; excessively high values may degrade model capacity.
  • Dropout should not be applied to the output layer.
  • Alternatives such as DropPath (stochastic depth) are more common in modern architectures.

3.4 Explicit Regularization (L2 vs Weight Decay, L1)

Visual Intuition: The Bias-Variance Tradeoff

Imagine a dartboard:

  • High Bias, Low Variance: All darts land far from the bullseye but close together.
  • Low Bias, High Variance: Darts are centred on the bullseye on average but scatter widely across the board.
  • Low Bias, Low Variance: Tight cluster near bullseye (ideal scenario).
  • High Bias, High Variance: Chaotic and unreliable performance.

Regularization intentionally introduces bias (simplifies the model) to substantially reduce variance (stabilizes performance). The net result is improved generalization.

[Figure: dartboard illustration of the bias-variance tradeoff]

Mathematical Formulation

In explicit regularization, the loss function $\mathcal{L}$ combines fidelity (data fit) with a penalty $R(\theta)$ on parameters $\theta$, scaled by a regularization hyperparameter $\lambda$:

$$ \mathcal{L}(\theta) = \mathcal{F}(\theta) + \lambda\, R(\theta) $$

Here:

  • Fidelity ensures predictions closely match ground-truth labels.
  • Regularization penalizes model complexity, discouraging overfitting.
  • $\lambda$ controls the trade-off: larger values promote simpler models.

Common Regularizers

| Regularizer | Mathematical Form | Impact on Weights | Use Cases & Effects |
|---|---|---|---|
| L2 Regularization | $R(\theta) = \sum_{i} \theta_i^2$ | Shrinks weights uniformly; does not induce exact sparsity. | Stabilizes learning; reduces parameter magnitude (common default). |
| L1 Regularization | $R(\theta) = \sum_{i} \lvert\theta_i\rvert$ | Encourages sparse solutions; drives many weights to zero. | Feature selection, compressing models, promoting interpretability. |
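As a sketch of how these penalties enter the loss in practice (the linear model and data here are placeholders):

import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Linear(10, 1)                               # placeholder model
x, y = torch.randn(32, 10), torch.randn(32, 1)
data_loss = F.mse_loss(model(x), y)                    # fidelity term

lam = 1e-4                                             # regularization strength (lambda)
l2_penalty = sum(p.pow(2).sum() for p in model.parameters())
l1_penalty = sum(p.abs().sum() for p in model.parameters())

loss_l2 = data_loss + lam * l2_penalty                 # shrinks weights uniformly
loss_l1 = data_loss + lam * l1_penalty                 # encourages sparsity
loss_l2.backward()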

Equivalence: Weight Decay vs. L2 Regularization

In optimization, particularly with optimizers like SGD, weight decay is mathematically equivalent to adding an L2 penalty. But in Adam, this equivalence breaks due to the adaptive learning rate.

Naive L2 in Adam:

When you apply L2 regularization by adding $\lambda \theta$ to the gradients:

$$ \nabla_{\theta}\mathcal{L}(\theta) = \nabla_{\theta}\mathcal{F}(\theta) + \lambda \theta $$

This gradient is then passed into Adam, which rescales it using the moving averages of first and second moments (adaptive behavior). Thus, L2 interacts with the moment estimates, which can lead to unintuitive effects.

AdamW: Decoupled Weight Decay

In classic Adam, weight decay is usually implemented by adding wd * w (where wd is the weight-decay coefficient) to the gradients, i.e. as an L2 penalty on the loss (first case below), rather than by subtracting wd * w directly from the weights (second case). AdamW applies the second form.

This decoupled decay step is not influenced by the adaptive statistics: it enforces a direct, clean shrinkage on the weights, recovering true weight decay in a principled way.

  • Adam + L2 penalty: mixes gradient with penalty → interacts with momentum.
  • AdamW: separates gradient and penalty → pure, unbiased decay.
# First case: L2 penalty folded into the loss (what "weight decay" usually means in classic Adam)
final_loss = loss + wd * all_weights.pow(2).sum() / 2
# Second case: decoupled weight decay applied directly to the weights
# (shown here for SGD, where the two formulations coincide)
w = w - lr * w.grad - lr * wd * w
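In PyTorch the decoupled version is available directly as torch.optim.AdamW; a minimal sketch with illustrative hyperparameters:

import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Linear(10, 1)                               # placeholder model
# weight_decay here is the decoupled decay coefficient, not an L2 term in the loss.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)

x, y = torch.randn(32, 10), torch.randn(32, 1)
loss = F.mse_loss(model(x), y)
loss.backward()
optimizer.step()                                       # adaptive step + direct weight shrinkage
optimizer.zero_grad()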

3.5 Learning‑Rate Scheduling

A network’s learning rate (LR) is the master knob that controls how fast it learns.
Set it too high → divergence; too low → stagnation. Scheduling is the art of changing LR over training so we move fast early and fine‑tune late.

Start bold, finish precise.
Large steps explore the loss landscape; tiny steps carve the final minima.


Canonical Schedules

| Schedule | Intuition | Typical Use-Cases |
|---|---|---|
| Step | Manual “plateaus” every T steps. | Old but reliable; classification baselines. |
| Exponential | Continuous decay. | Long training jobs where fine control is unnecessary. |
| Polynomial | Smoothly drops to 0. | Semantic segmentation, detection (e.g. DeepLab). |
| Cosine Annealing | Fast start, slow finish. | Modern default for vision (ResNet, ViT). |
| Cyclic / One-Cycle | Briefly increase LR to escape basins, then anneal. | Fast convergence in ≤100 epochs; works well with Adam/AdamW. |

Warm‑Up

Small batches, large models, or heavy regularization can make the first updates unstable.
Linear warm‑up ramps LR from 0 to $\eta_0$ over the first 3‑5 epochs, preventing gradient explosions. More info: https://arxiv.org/abs/2406.09405
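A sketch of linear warm-up followed by cosine annealing, using PyTorch's built-in schedulers (available in reasonably recent versions); the model, epoch counts, and learning rates are illustrative:

import torch
import torch.nn as nn

model = nn.Linear(10, 1)                               # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

warmup_epochs, total_epochs = 5, 100
warmup = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=0.01, total_iters=warmup_epochs)        # ramp LR up from ~0
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=total_epochs - warmup_epochs)                   # then anneal towards 0
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer, schedulers=[warmup, cosine], milestones=[warmup_epochs])

for epoch in range(total_epochs):
    # ... one epoch of training ...
    scheduler.step()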


Some tips

  • Debug with a fixed LR first. If loss doesn’t drop, a fancy schedule won’t rescue you.
  • Match batch size. Double the batch → double the LR and every LR on the schedule.
  • Combine with weight decay thoughtfully. Aggressive decay late in training can stall progress if LR is already near zero.
  • Checkpoint at LR drops. Most performance jumps happen right after decay events.
  • Hyper‑parameter search: tune initial LR; the schedule shape usually transfers.

3.6 Optimizers

Choosing the right optimizer profoundly affects training speed, stability, and final performance. Optimizers navigate the high-dimensional loss landscape to efficiently minimize error, adapting gradients and update magnitudes intelligently.

Stochastic Gradient Descent (SGD)

SGD is the simplest and most classical optimization method. At each step, parameters $\theta$ are updated by subtracting a small fraction (learning rate $\eta$) of the computed gradient:

$$ \theta \leftarrow \theta - \eta \nabla_{\theta}\mathcal{L}(\theta) $$

  • Pros: Simple, reliable, excellent generalization if tuned correctly.
  • Cons: Sensitive to hyperparameters, slow convergence especially in ill-conditioned problems, and requires careful tuning of learning rate schedules.
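To tie the update rule above to code, a hand-rolled SGD step on a single parameter tensor (a toy example, not how you would train a real model):

import torch

torch.manual_seed(0)
theta = torch.randn(5, requires_grad=True)
lr = 0.1

loss = ((theta - 1.0) ** 2).mean()       # toy quadratic loss
loss.backward()

with torch.no_grad():
    theta -= lr * theta.grad             # theta <- theta - eta * gradient
theta.grad.zero_()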

Adaptive Moment Estimation (Adam)

Adam combines momentum (moving average of gradients) and RMSProp (moving average of squared gradients) to adaptively set learning rates per parameter:

$$ m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 $$

$$ \hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}, \qquad \theta_t = \theta_{t-1} - \eta\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} $$

  • Pros: Less sensitive to learning rate, robust default, suitable for large-scale problems.
  • Cons: May generalize slightly worse than finely-tuned SGD; can exhibit instability in convergence due to adaptive behaviors.

Torch implementation: torch.optim.Adam

AdamW (Decoupled Weight Decay)

AdamW improves Adam by separating weight decay from adaptive gradient calculations. The weight decay directly penalizes parameters independently of gradient momentum, providing clearer and more effective regularization:

$$ \theta_t = \theta_{t-1} - \eta \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda\, \theta_{t-1} \right) $$

  • Pros: Significantly improves regularization effectiveness over standard Adam; easier to tune.
  • Cons: Still inherits some adaptive instability from Adam; careful tuning of hyperparameters remains essential.

Torch implementation: torch.optim.AdamW

Muon (Second-order Approximation via Newton-Schulz)

Muon maintains a momentum average of the gradients and applies a Newton–Schulz iteration to approximately orthogonalize that momentum matrix before using it as the update; this cheap matrix preconditioning plays a role similar to second-order (curvature) information without ever forming a Hessian or Fisher matrix. It aims to combine the efficiency of first-order methods with some of the benefits of second-order optimization principles:

$$ M_t = \mu M_{t-1} + g_t, \qquad \theta_t = \theta_{t-1} - \eta\, \mathrm{NewtonSchulz}(M_t) $$

  • Principle:

    • Approximates second-order information (curvature) without explicitly computing the Hessian.
    • Accelerates convergence especially on ill-conditioned or complex loss surfaces.
  • Pros: Faster convergence compared to first-order methods; robust to complex optimization landscapes.

  • Cons: Higher computational cost per iteration due to second-order approximation.

Reference and implementation details: Muon Optimizer

Natural Gradient Optimization

Natural gradients move parameters considering the geometry of the parameter space defined by the Fisher information matrix $F$. Updates follow the direction of steepest descent in a metric space determined by $F^{-1}$:

$$ \theta \leftarrow \theta - \eta F^{-1}\nabla_{\theta}\mathcal{L}(\theta) $$

  • Pros: Achieves faster convergence by respecting the intrinsic geometry of the probability distribution space; effective in highly complex spaces such as reinforcement learning.
  • Cons: Computationally expensive; approximations typically required for large-scale problems.

Torch implementation example: torch.optim.LBFGS can approximate second-order (quasi-Newton) behavior; specialized libraries are required for pure natural gradient.
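As a rough illustration of the formula above only (a toy sketch, not a faithful natural-gradient method): a diagonal empirical Fisher estimate built from per-sample squared gradients can be used to precondition the update, with a damping term for stability:

import torch

torch.manual_seed(0)
theta = torch.randn(5, requires_grad=True)
lr, damping = 0.1, 1e-3

# Per-sample gradients on a toy quadratic loss.
grads = []
for _ in range(32):
    sample = torch.randn(5)
    loss = ((theta - sample) ** 2).sum()
    g, = torch.autograd.grad(loss, theta)
    grads.append(g)
grads = torch.stack(grads)

fisher_diag = grads.pow(2).mean(dim=0)                  # diagonal approximation: F_ii ~ E[g_i^2]
mean_grad = grads.mean(dim=0)
with torch.no_grad():
    theta -= lr * mean_grad / (fisher_diag + damping)   # theta <- theta - eta * F^{-1} grad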

Recommendations

  • Start with AdamW for general-purpose training.
  • Transition to Muon or natural gradients if faster convergence is critical and computational resources permit.

3.7 Logger (Wandb)

Weights & Biases (Wandb) is the modern, go-to tool for tracking, visualizing, and managing machine learning experiments. Surpassing traditional solutions like TensorBoard, Wandb streamlines deep-learning workflows with an intuitive, cloud-based platform.

Core Advantages of Wandb:

  • Rich Visualization: Automatically log metrics, visualize training curves, and interactively compare model runs.
  • Collaborative Logging: Share live updates and insights with your team, facilitating collaboration and transparency.
  • Media Support: Effortlessly upload images, videos, model predictions, and custom visualizations directly during training.
  • Generous Free Tier: Provides around 100 GB of free storage, supporting extensive experimentation at zero cost.

Why Prefer Wandb over TensorBoard?

  • Superior ease of use, setup simplicity, and richer visualizations.
  • Cloud-based storage and management eliminate local storage headaches.
  • Built-in collaboration tools simplify team workflows and reproducibility.
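A minimal logging sketch with the wandb Python API (the project name, config values, and metric names are illustrative):

import wandb

run = wandb.init(project="module-3-demo", config={"lr": 1e-3, "epochs": 10})

for epoch in range(run.config.epochs):
    train_loss = 1.0 / (epoch + 1)             # stand-in for a real metric
    wandb.log({"epoch": epoch, "train/loss": train_loss})

run.finish()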

Example interface

[Figure: example of the Wandb web interface]

Official documentation: https://docs.wandb.ai/