Alright, let's break down Unit-3: Machine Learning for Vision. This unit is a cornerstone of modern computer vision, focusing heavily on deep learning techniques, especially Convolutional Neural Networks (CNNs), and essential practices that enable their effective application.

Unit-3: Machine Learning for Vision [12 Hrs.]

3.1 Convolutional Neural Networks (CNNs)

Convolutional Neural Networks (CNNs) are a specialized type of artificial neural network primarily used for analyzing visual imagery. They are designed to automatically and adaptively learn spatial hierarchies of features from input data, making them highly effective for tasks like image classification, object detection, and segmentation.

What is CNN?

As the provided material states, a Convolutional Neural Network (CNN) is a type of artificial neural network used for image and signal processing. It is designed to recognize patterns in data, whether it’s in images (like objects, textures, or edges) or signals (like speech, music, or sensor data). CNNs are widely used in applications such as image classification, object detection, speech recognition, and even medical signal analysis.

How is CNN Derived?

CNNs originate from the field of signal processing, where the operation called convolution is fundamental for analyzing and processing signals.

  • In image processing, convolution helps detect edges, textures, and shapes.
  • In audio or other signals, convolution can extract features like frequencies, pitch, or speech patterns. CNN applies this concept in a structured way to automatically learn meaningful patterns from data.

Weighted Average in 1D Convolution (Audio Signal Processing): In 1D convolution for audio, a filter (kernel) is applied as a weighted average over a small segment (window) of the signal. This helps extract important features like smoothness, sharp changes, or specific frequencies.
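As a minimal sketch of this weighted-average idea (using NumPy and a made-up toy signal, not real audio), a uniform 3-tap kernel computes a moving average over the signal:

```python
import numpy as np

# Toy 1-D "audio" signal and a 3-tap uniform kernel (a moving average).
signal = np.array([1.0, 3.0, 2.0, 5.0, 4.0, 6.0])
kernel = np.array([1/3, 1/3, 1/3])

# 'valid' mode slides the window only over fully overlapping positions,
# so a length-6 signal and length-3 kernel give 6 - 3 + 1 = 4 outputs.
smoothed = np.convolve(signal, kernel, mode='valid')
print(smoothed)  # each value is the mean of one 3-sample window
```

With non-uniform weights (e.g. an edge-like kernel such as [1, 0, -1]) the same sliding operation responds to sharp changes instead of smoothing; learned CNN filters pick such weights automatically during training.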

How Does Sliding Window Relate to CNN? The sliding window method is a key idea in both traditional signal processing and CNNs. It refers to moving a small window across data (image or signal) to analyze local patterns.

  • Example 1: Sliding Window in Image Processing (Before CNN): Traditionally, fixed sliding windows scanned images for objects or features. This approach was slow and required manual feature selection. CNNs replace this with trainable convolutional filters that scan the image efficiently, learning features automatically.
  • Example 2: Sliding Window in Audio Processing: In speech recognition, a small window moves over an audio signal (waveform) to capture short sound patterns (like phonemes or syllables). CNNs process these windows using convolution filters to learn important speech features.

A CNN consists of multiple layers, including convolutional layers, activation functions, pooling layers, and fully connected layers. The core operation in CNN is the convolution operation.

3.1.1 Convolution, Pooling, Stride, Padding

These are the fundamental building blocks of a CNN, responsible for its ability to learn hierarchical features and reduce computational complexity.

Convolution Operation

The convolution operation is the core of a CNN. It applies a learnable kernel (filter) across the input data (e.g., an image) to produce a feature map (or activation map). This process extracts local patterns such as edges, textures, or more complex features.

How it works (2D Convolution for Images): A small filter slides over the input image, performing element-wise multiplications with the underlying pixels and summing the results to produce a single pixel in the output feature map. This operation is repeated across the entire image.

  • Input Image: $I$ (e.g., $H \times W \times C$ for height, width, channels)
  • Kernel/Filter: $K$ (e.g., $k_h \times k_w \times C_{\text{in}} \times C_{\text{out}}$ for kernel height, width, input channels, output channels/filters)
  • Output Feature Map: $F$ (e.g., $F_h \times F_w \times C_{\text{out}}$)

The value of a pixel at position $(i, j)$ in output channel $c$ of the feature map $F$ is obtained by summing over the kernel window and all input channels:

$F_{i,j,c} = \sum_{c_{\text{in}}=0}^{C_{\text{in}}-1} \sum_{x=0}^{k_h-1} \sum_{y=0}^{k_w-1} I_{i+x,j+y,c_{\text{in}}} \cdot K_{x,y,c_{\text{in}},c} + b_c$

Where:

  • $K_{x,y,c_{\text{in}},c}$ is the weight at position $(x, y)$ of the filter for input channel $c_{\text{in}}$ and output channel $c$.
  • $b_c$ is the bias term for output channel $c$.
  • The filter weights ($K$) and biases ($b$) are learned during training.
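The formula above can be checked directly with a naive NumPy implementation (a sketch using a toy input and a fixed averaging filter; real frameworks use far faster routines):

```python
import numpy as np

def conv_pixel(I, K, b, i, j):
    """One output pixel: element-wise multiply the k_h x k_w x C_in window
    of I at (i, j) with the filter K, sum everything, and add the bias."""
    kh, kw, cin = K.shape
    return float(np.sum(I[i:i+kh, j:j+kw, :] * K) + b)

I = np.arange(16, dtype=float).reshape(4, 4, 1)  # 4x4 single-channel input
K = np.ones((3, 3, 1)) / 9.0                     # 3x3 averaging filter
b = 0.0

# Valid convolution: output is (4-3+1) x (4-3+1) = 2x2.
F = np.array([[conv_pixel(I, K, b, i, j) for j in range(2)]
              for i in range(2)])
print(F)  # each entry is the mean of one 3x3 window of I
```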

Stride

Stride defines how many pixels the filter moves across the input image at each step.

  • Stride = 1: The filter moves one pixel at a time. This results in an output feature map size that is close to the input size.
  • Stride > 1: The filter jumps multiple pixels. This reduces the spatial dimensions of the output feature map, serving as a form of downsampling.

Formula for Output Feature Map Dimension (without padding, stride = s): $F_h = \lfloor \frac{I_h - k_h}{s} \rfloor + 1$ $F_w = \lfloor \frac{I_w - k_w}{s} \rfloor + 1$ Where $I_h, I_w$ are input height/width, and $k_h, k_w$ are kernel height/width.

Padding

Padding involves adding extra pixels (typically zeros) around the border of the input image.

  • Purpose:
    • Preserve spatial dimensions: "Same" padding adds just enough zeros so that the output feature map has the same spatial dimensions as the input (when stride is 1). ("Full" padding is a different scheme that pads even more, producing an output larger than the input.)
    • Prevent information loss: Pixels at the edges of the input image would otherwise be covered by the filter less frequently, potentially losing information. Padding ensures these pixels are equally considered.
    • Control output size: Allows for more flexible control over the output dimensions.

Formula for Output Feature Map Dimension (with padding = p, stride = s): $F_h = \lfloor \frac{I_h + 2p - k_h}{s} \rfloor + 1$ $F_w = \lfloor \frac{I_w + 2p - k_w}{s} \rfloor + 1$
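Both size formulas reduce to one expression that is easy to evaluate in code; a small helper (hypothetical name) makes the effect of padding and stride concrete:

```python
def conv_output_size(i, k, s=1, p=0):
    """Output spatial size: floor((i + 2p - k) / s) + 1."""
    return (i + 2 * p - k) // s + 1

print(conv_output_size(28, 3))            # no padding, stride 1 -> 26
print(conv_output_size(28, 3, p=1))       # p=1 preserves size for k=3 -> 28
print(conv_output_size(28, 3, s=2, p=1))  # stride 2 roughly halves it -> 14
```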

Pooling (or Downsampling)

Pooling layers reduce the spatial dimensions (height and width) of the feature maps, thus reducing the number of parameters and computational cost. This helps control overfitting and makes the network more robust to small translations or distortions in the input.

  • Types of Pooling:
    • Max Pooling: Selects the maximum value from the portion of the feature map covered by the pooling window. This captures the most prominent feature within that region.
    • Average Pooling: Calculates the average value from the portion of the feature map covered by the pooling window.
  • Mechanism: A pooling window (e.g., $2 \times 2$) slides over the feature map with a specified stride, outputting a single value for each window.
  • Output Size for Pooling (pool size = $p_h \times p_w$, stride = s): $F_h = \lfloor \frac{I_h - p_h}{s} \rfloor + 1$ $F_w = \lfloor \frac{I_w - p_w}{s} \rfloor + 1$
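A minimal max-pooling routine (a NumPy sketch, not how frameworks actually implement it) shows how a 2x2 window with stride 2 halves each spatial dimension:

```python
import numpy as np

def max_pool2d(x, pool=2, stride=2):
    """Slide a pool x pool window over a 2-D feature map, keeping the max."""
    out_h = (x.shape[0] - pool) // stride + 1
    out_w = (x.shape[1] - pool) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = x[i*stride:i*stride+pool,
                          j*stride:j*stride+pool].max()
    return out

fmap = np.array([[1., 3., 2., 4.],
                 [5., 6., 7., 8.],
                 [3., 2., 1., 0.],
                 [1., 2., 3., 4.]])
print(max_pool2d(fmap))  # 4x4 -> 2x2, keeping the max of each window
```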

Example of Practical Implementation of CNN using MNIST Dataset:

The provided material illustrates a basic CNN implementation for handwritten digit recognition using the MNIST dataset in Keras/TensorFlow.

  1. Import Libraries:

    import tensorflow as tf
    from tensorflow import keras
    from keras.layers import Conv2D, MaxPooling2D, Dense, Flatten
    from keras import Sequential
    from keras.datasets import mnist
    from keras.utils import to_categorical  # needed for one-hot encoding below
    import numpy as np
    import matplotlib.pyplot as plt
  2. Load and Preprocess Dataset:

    • The MNIST dataset consists of grayscale images of handwritten digits (0-9).
    • mnist.load_data() loads the training and testing sets.
    • Images are 28x28 pixels. For CNN, an explicit channel dimension is needed (1 for grayscale). Pixel values are normalized to a 0-1 range by dividing by 255.0.
    • Labels (0-9) are one-hot encoded for categorical cross-entropy loss.
    (X_train, y_train), (X_test, y_test) = mnist.load_data()
    
    # Reshape and normalize input images
    X_train = X_train.reshape(-1, 28, 28, 1).astype("float32") / 255.0
    X_test = X_test.reshape(-1, 28, 28, 1).astype("float32") / 255.0
    
    # One-hot encode target labels
    y_train = to_categorical(y_train, 10)
    y_test = to_categorical(y_test, 10)
    
    print("X_train shape:", X_train.shape) # (60000, 28, 28, 1)
    print("y_train shape:", y_train.shape) # (60000, 10)
    
    # Display a sample image
    plt.imshow(X_train[0].reshape(28, 28), cmap='gray')
    plt.title(f"Sample Digit (Label: {np.argmax(y_train[0])})")
    plt.show()
  3. Define Model Architecture (without Pooling first, then with Stride to control feature map size):

    • This example showcases how Conv2D layers (and stride) affect the output shape.
    # Model without explicit MaxPooling (using stride in Conv2D for downsampling)
    model_no_pooling = Sequential([
        Conv2D(32, kernel_size=(3,3), strides=(2,2), padding='same', activation='relu', input_shape=(28,28,1)),
        Conv2D(32, kernel_size=(3,3), strides=(2,2), padding='same', activation='relu'),
        Conv2D(32, kernel_size=(3,3), strides=(2,2), padding='same', activation='relu'),
        Flatten(),
        Dense(128, activation='relu'),
        Dense(10, activation='softmax')
    ])
    model_no_pooling.summary()

    Output Analysis (model_no_pooling.summary()):

    • Conv2D (32, kernel=(3,3), strides=(2,2), padding='same'):
      • Input: (None, 28, 28, 1)
      • Output: (None, 14, 14, 32)
        • Calculation: with padding='same', Keras pads so that the output size is $\lceil I_h / s \rceil = \lceil 28/2 \rceil = 14$.
        • Params: $(k_h \times k_w \times C_{\text{in}} + 1) \times C_{\text{out}} = (3 \times 3 \times 1 + 1) \times 32 = (9+1) \times 32 = 10 \times 32 = 320$.
    • Conv2D_1:
      • Input: (None, 14, 14, 32) (output of previous layer)
      • Output: (None, 7, 7, 32)
        • Calculation: with padding='same' and stride 2, the output size is $\lceil 14/2 \rceil = 7$; a valid-padding calculation, $\lfloor \frac{14 - 3}{2} \rfloor + 1 = 6$, would undershoot because 'same' pads the input before convolving.
        • Params: $(3 \times 3 \times 32 + 1) \times 32 = (288+1) \times 32 = 289 \times 32 = 9248$.
    • Conv2D_2:
      • Input: (None, 7, 7, 32)
      • Output: (None, 4, 4, 32)
        • Calculation: $F_h = \lceil 7/2 \rceil = 4$.
        • Params: $(3 \times 3 \times 32 + 1) \times 32 = 9248$.
    • Flatten: Converts the 4x4x32 feature map into a 1D vector: $4 \times 4 \times 32 = 512$.
      • Output: (None, 512). Params: 0.
    • Dense (128 units): Fully connected layer.
      • Output: (None, 128). Params: $(512 \times 128) + 128 = 65536 + 128 = 65664$.
    • Dense (10 units, softmax): Output layer for 10 classes.
      • Output: (None, 10). Params: $(128 \times 10) + 10 = 1280 + 10 = 1290$.
    • Total parameters: $320 + 9248 + 9248 + 0 + 65664 + 1290 = 85{,}770$.
      • Each conv layer follows the rule $(k_h \times k_w \times C_{\text{in}} + 1) \times C_{\text{out}}$; e.g. layer 1: $(3 \times 3 \times 1 + 1) \times 32 = 320$, layer 2: $(3 \times 3 \times 32 + 1) \times 32 = 9248$.
      • The provided text states "Total params: 2,002,698", which does not match this architecture; summing the per-layer counts above, the Keras summary for the model as defined reports 85,770.

    Total Multiplication Operations (Layer 1 example from text):

    • Input size: 28x28, kernel size: 3x3. With no padding and stride 1 the output feature map is 26x26 (with padding='same' and stride 1 it would stay 28x28; with stride 2 it is 14x14, as seen above).
    • The multiplication counts below assume the 26x26 (no padding, stride 1) case.
    • One kernel performs $3 \times 3 = 9$ multiplications per output pixel.
    • For a $26 \times 26$ output feature map, one kernel performs $9 \times 26 \times 26$ multiplications.
    • If 32 kernels are used, total computation is $9 \times 26 \times 26 \times 32 = 194688$.
    • This is a calculation for operations, not trainable parameters.

    Total Multiplication Operations (Layer 2 example from text):

    • Input image dimension (from Layer 1 output): 26 x 26 x 32 channels.
    • Kernel dimension in layer 2: 3 x 3 x 32 (input channels)
    • Output dimension: 24 x 24 x 32 (assuming stride 1, no padding).
    • To generate 1 pixel in output image for one filter: $3 \times 3 \times 32 = 288$ operations (per pixel across all input channels).
    • To generate whole picture (24 pixel single channel for one filter): $288 \times 24 \times 24 = 165888$.
    • Total operations for 32 filters: $165888 \times 32 = 5308416$.
    • These are floating point operations (FLOPs), not trainable parameters.
  4. Define Model Architecture (with MaxPooling layer):

    • This shows explicit MaxPooling2D layers for downsampling.
    # Model with MaxPooling layers
    model_pooling = Sequential([
        Conv2D(32, kernel_size=(3,3), strides=(1,1), padding='same', activation='relu', input_shape=(28,28,1)),
        MaxPooling2D(pool_size=(2,2)), # Reduces 28x28 to 14x14
        Conv2D(32, kernel_size=(3,3), strides=(1,1), padding='same', activation='relu'),
        MaxPooling2D(pool_size=(2,2)), # Reduces 14x14 to 7x7
        Conv2D(32, kernel_size=(3,3), strides=(1,1), padding='same', activation='relu'),
        MaxPooling2D(pool_size=(2,2)), # Reduces 7x7 to 3x3 (for 7, max_pooling (2,2) with default padding `valid` will result in `floor((7-2)/2)+1=3`)
        Flatten(),
        Dense(128, activation='relu'),
        Dense(10, activation='softmax')
    ])
    model_pooling.summary()

    Output Analysis (model_pooling.summary()):

    • Conv2D: Input (28,28,1), Output (28,28,32), Params 320.
    • MaxPooling2D: Input (28,28,32), Output (14,14,32), Params 0.
    • Conv2D_1: Input (14,14,32), Output (14,14,32), Params 9248.
    • MaxPooling2D_1: Input (14,14,32), Output (7,7,32), Params 0.
    • Conv2D_2: Input (7,7,32), Output (7,7,32), Params 9248.
    • MaxPooling2D_2: Input (7,7,32), Output (3,3,32), Params 0.
    • Flatten: Input (3,3,32) which is $9 \times 32 = 288$.
      • Output: (None, 288). Params 0.
    • Dense (128 units): Output (None, 128). Params: $(288 \times 128) + 128 = 36864 + 128 = 36992$.
    • Dense (10 units, softmax): Output (None, 10). Params: $(128 \times 10) + 10 = 1290$.
    • Total parameters: $320 + 9248 + 9248 + 36992 + 1290 = 57098$. (This matches the provided text.)
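The operation counts worked out in step 3 above all follow one rule: each output pixel of each filter costs $k_h \times k_w \times C_{\text{in}}$ multiplications. A few lines of Python (hypothetical helper name) reproduce those figures:

```python
def conv_multiplications(out_h, out_w, k_h, k_w, c_in, n_filters):
    """Multiplications in one conv layer: each of the out_h*out_w output
    pixels of each filter needs k_h*k_w*c_in multiplications."""
    return out_h * out_w * k_h * k_w * c_in * n_filters

# Layer 1: 28x28x1 input, 3x3 kernels, 32 filters, stride 1, no padding -> 26x26
print(conv_multiplications(26, 26, 3, 3, 1, 32))   # 194688
# Layer 2: 26x26x32 input, 3x3x32 kernels, 32 filters -> 24x24
print(conv_multiplications(24, 24, 3, 3, 32, 32))  # 5308416
```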

Drawbacks of Standard Convolution:

  1. Translation Invariance Issue:

    • Problem: Standard convolution is not strictly translation invariant. Because filters act on local regions, when an object shifts in the image the feature responses shift with it (translation equivariance), and even small shifts can change downstream activations. CNNs learn approximate invariance through pooling and through deeper layers recognizing features regardless of precise position, but the convolution itself is only equivariant, not invariant.
    • Impact: This can reduce the network’s ability to generalize well when objects or features are shifted in the input data.
    • Solution:
      • Pooling (like max pooling): Helps by making the representation slightly invariant to small translations within the pooling window, but causes a loss of spatial information.
      • Transposed convolutions (or deconvolutional layers): Used in generative models for upsampling.
      • Dilated convolutions: Allow filters to have a wider receptive field without increasing parameters, helping capture context.
  2. Computational Complexity:

    • Problem: Convolution operations can be computationally expensive, especially as the number of filters and the depth of the network increase. The number of multiplications needed increases with the size of the image, filter size, and depth.
    • Impact: This leads to high memory consumption and slow training and inference times.
    • Solution:
      • Depthwise separable convolutions (e.g., used in MobileNets): Significantly reduce the number of parameters and operations.
      • Grouped convolutions: Divide channels into groups, reducing computation.
  3. Large Number of Parameters:

    • Problem: In standard convolutions, each filter operates on the entire input depth. As the network deepens and the number of filters increases, the total number of parameters grows quickly.
    • Impact: This can result in overfitting (especially with limited data) and high memory/computational requirements.
    • Solution:
      • Weight sharing: Intrinsic to CNNs, where the same filter weights are used across different spatial locations.
      • Depthwise separable convolutions and grouped convolutions: Reduce parameter count.
      • Transfer learning: Reusing pre-trained models on large datasets.
      • Dropout: A regularization technique that randomly sets a fraction of input units to 0 at each update during training, preventing overfitting.
      • Batch Normalization: Helps by stabilizing activations, allowing higher learning rates and reducing dependence on initialization, indirectly helping with parameter efficiency.
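To make the parameter saving of depthwise separable convolutions concrete, the counts can be compared using the same counting rule as the summaries above (a sketch; the exact bias convention for the depthwise stage varies slightly between frameworks):

```python
def standard_conv_params(k, c_in, c_out):
    """Standard conv: (k*k*c_in + 1) * c_out weights and biases."""
    return (k * k * c_in + 1) * c_out

def depthwise_separable_params(k, c_in, c_out):
    """Depthwise conv (one k x k filter per input channel, with biases)
    followed by a 1x1 pointwise conv that mixes the channels."""
    depthwise = (k * k + 1) * c_in
    pointwise = (1 * 1 * c_in + 1) * c_out
    return depthwise + pointwise

print(standard_conv_params(3, 32, 64))        # 18496
print(depthwise_separable_params(3, 32, 64))  # 2432, roughly 7.6x fewer
```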

5. Receptive Field

The receptive field of a unit (pixel) in a convolutional neural network's output feature map is the region in the input image that influences that unit's value. It's the size of the input area that a particular filter "sees".

Formula: The receptive field grows layer by layer as $RF_{\text{out}} = RF_{\text{in}} + (K - 1) \times J$, where $K$ is the kernel size of the current layer and $J$ (the "jump") is the product of the strides of all layers before it; for a stack of stride-1 layers, $J = 1$.

Let's illustrate with the provided example:

  • Input Image (I): 28x28x1

  • First Convolution: Filter size = 3x3, Stride = 1 (default if not specified).

    • Output image (I1) size: $28 - 3 + 1 = 26 \times 26 \times \text{Num_filters}$.
    • Receptive field for each pixel in I1 ($RF_{\text{I1}}$): $RF_{\text{I}} + (K_1 - 1) = 1 + (3 - 1) = 3$. So, a 3x3 patch in the input image (I).
  • Second Convolution: Applied to I1, Filter size = 3x3, Stride = 1.

    • Output image (I2) size: $26 - 3 + 1 = 24 \times 24 \times \text{Num_filters}$.
    • Receptive field for each pixel in I2 ($RF_{\text{I2}}$): $RF_{\text{I1}} + (K_2 - 1) \times \text{Stride}_{\text{I1}}$. Assuming stride for the first layer was 1 (implicit).
    • $RF_{\text{I2}} = 3 + (3 - 1) \times 1 = 3 + 2 = 5$. So, a 5x5 patch in the original input image.

General Rule for Receptive Field: To increase the receptive field, you can:

  • Increase the size of the kernel.
  • Increase the stride of the convolutional layers or pooling layers.
  • Add more convolutional layers (deeper network). Increasing the receptive field allows the network to capture wider, more global features in the image. However, a very large kernel size can lead to more parameters and costly computation. To overcome this, Dilated Convolution Technique can be used, which increases the receptive field without increasing the number of parameters by skipping pixels in the input.
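The growth rule above can be automated; the sketch below tracks the receptive field and the "jump" (product of strides so far, i.e. the spacing between adjacent outputs measured in input pixels) layer by layer:

```python
def receptive_field(layers):
    """Receptive field of the last layer's units, in input pixels.
    `layers` is a list of (kernel_size, stride) pairs, first layer first."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump  # each layer widens the view by (k-1)*jump
        jump *= s             # stride compounds for later layers
    return rf

print(receptive_field([(3, 1)]))          # one 3x3 conv -> 3
print(receptive_field([(3, 1), (3, 1)]))  # two stacked 3x3 convs -> 5
print(receptive_field([(3, 2), (3, 2)]))  # stride 2 grows it faster -> 7
```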

3.1.1 Convolution, Pooling, Stride, Padding (Revisit)

This section re-emphasizes the role of these operations in CNNs and how they affect the feature map dimensions. The example code for MNIST dataset already demonstrated these concepts, especially how strides in Conv2D and MaxPooling2D effectively reduce the spatial dimensions.

3.1.2 Fully Connected Layers

  • Role: After several convolutional and pooling layers, the high-level features learned by the network are typically flattened into a single vector. This vector is then fed into one or more fully connected (FC) layers (or Dense layers).
  • Mechanism: In a fully connected layer, every neuron in the layer is connected to every neuron in the previous layer. This is where the network performs high-level reasoning and classification based on the extracted features.
  • Output Layer: The final fully connected layer typically has neurons equal to the number of classes, with an activation function like softmax for multi-class classification (as seen in the MNIST example) or sigmoid for binary classification.
  • Parameters: Fully connected layers usually have a large number of parameters (weights and biases), especially if the input to them is a large flattened vector. This can be a source of overfitting if not properly regularized.
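A fully connected layer is just a matrix-vector product plus a bias. The NumPy sketch below (with made-up random weights, standing in for learned ones) illustrates a classification head with a numerically stable softmax:

```python
import numpy as np

def dense(x, W, b, softmax=False):
    """Fully connected layer: every output unit is a weighted sum of
    every input unit, plus a bias."""
    z = W @ x + b
    if softmax:
        e = np.exp(z - z.max())  # subtract max for numerical stability
        return e / e.sum()
    return z

rng = np.random.default_rng(42)
x = rng.standard_normal(512)               # flattened feature vector
W = 0.01 * rng.standard_normal((10, 512))  # 10 classes -> 512*10 + 10 params
b = np.zeros(10)

probs = dense(x, W, b, softmax=True)
print(probs.sum())  # softmax probabilities sum to 1
```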

3.1.3 Batch Normalization and Dropout

These are crucial regularization and optimization techniques commonly used in deep neural networks to improve training stability, speed, and generalization performance.

1. Introduction to Batch Normalization (BatchNorm)

Batch Normalization (BatchNorm) is a technique introduced by Sergey Ioffe and Christian Szegedy in their 2015 paper titled "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift". BatchNorm became a standard in modern neural network architectures due to its ability to speed up training and improve model accuracy.

2. Why Was Batch Normalization Invented?

During the training of deep neural networks, a phenomenon called internal covariate shift occurs. This refers to the changing distributions of the inputs to layers as the parameters of the network are updated. This makes training unstable and can lead to issues like:

  • Slow convergence: Due to continually shifting input distributions, the network has to constantly adapt to new input ranges for each layer, slowing down training.
  • Vanishing/exploding gradients: Especially problematic in deep networks, where gradients can become extremely small (vanishing) or extremely large (exploding), hindering effective weight updates.
  • Difficulty with large learning rates: Large learning rates can cause divergence or oscillations because the input distributions are unstable.

BatchNorm was invented to solve these issues by normalizing the activations of each layer during training, ensuring that their distributions remain stable (zero mean and unit variance). This leads to:

  • Faster convergence: Training becomes more stable, allowing for higher learning rates.
  • Improved gradient flow: Helps mitigate vanishing/exploding gradients.
  • Less dependence on initialization: Reduces the sensitivity to initial weight configurations.

3. What Problem Does Batch Normalization Tackle?

The core problem addressed by BatchNorm is internal covariate shift. As mentioned, during training, the inputs to each layer keep changing due to weight updates, leading to:

  • Slower convergence.
  • Difficulty in training deep networks.
  • Problems with large learning rates. BatchNorm standardizes the activations of each layer, which helps to stabilize the learning process.

4. Key Concepts of Batch Normalization

The BatchNorm operation consists of the following steps, applied independently for each feature/channel in a mini-batch:

a. Compute Mean and Variance: Given a mini-batch of data (e.g., a batch of images), BatchNorm computes the mean ($\mu_B$) and variance ($\sigma^2_B$) of the activations along each feature/channel. For an input $x^{(i)}$ in a mini-batch of size $m$: $\mu_B = \frac{1}{m} \sum_{i=1}^{m} x^{(i)}$ $\sigma^2_B = \frac{1}{m} \sum_{i=1}^{m} (x^{(i)} - \mu_B)^2$

b. Normalize the Data: Once the mean and variance are calculated, each input activation $x^{(i)}$ is normalized to have a mean of 0 and variance of 1: $\hat{x}^{(i)} = \frac{x^{(i)} - \mu_B}{\sqrt{\sigma^2_B + \epsilon}}$ Where $\epsilon$ is a small constant (e.g., $10^{-5}$) added for numerical stability to prevent division by zero.

c. Scale and Shift Using Learnable Parameters: After normalization, BatchNorm introduces two learnable parameters: $\gamma$ (scale) and $\beta$ (shift). These parameters allow the network to learn how to restore the distribution of the activations if needed, enabling it to undo normalization if the optimal distribution for a feature is not zero-mean and unit-variance. $y^{(i)} = \gamma \hat{x}^{(i)} + \beta$

Why Are Shift ($\beta$) and Scale ($\gamma$) Necessary in Batch Normalization?

After normalizing inputs in a mini-batch to have zero mean and unit variance, applying an affine transformation $y = \gamma \hat{x} + \beta$ is essential for several reasons:

  1. Limited Model Capacity Without Shift and Scale: Without $\gamma$ and $\beta$, the normalized activations are forced to have mean 0 and variance 1. This severely constrains every neuron to produce outputs from a standard normal distribution, reducing the network’s ability to represent diverse patterns and learn flexible transformations tailored to specific tasks.
  2. Non-optimal Activation Range for Nonlinearities: For activation functions like ReLU, Sigmoid, or Tanh, the input range significantly affects behavior. For instance:
    • ReLU: Outputs zero for all negative values. Forcing inputs around zero might limit the range of active neurons.
    • Sigmoid/Tanh: Saturate (output values flatten) and produce gradients close to zero for inputs far from zero. Without $\gamma$ and $\beta$, inputs to these nonlinearities would be constrained near zero, potentially leading to inactive or "dying" neurons and hindering learning (vanishing gradients). $\gamma$ and $\beta$ allow the network to learn the optimal scale and shift for each feature, enabling it to move the normalized values to the optimal range for the subsequent non-linear activation functions. This flexibility is crucial for the network's expressive power and learning capacity.

Step-by-Step Calculation of Batch Normalization:

5.1 Single Image (Single Channel - Grayscale): Let's consider a 2x2 image with a single channel (grayscale) with pixel values: Image = [1 2; 3 4] (represented as a 1D vector for calculation: $\{1, 2, 3, 4\}$)

Step 1: Compute the Mean ($\mu_B$) and Variance ($\sigma^2_B$): $\mu_B = \frac{1+2+3+4}{4} = \frac{10}{4} = 2.5$ $\sigma^2_B = \frac{1}{4} \left((1-2.5)^2 + (2-2.5)^2 + (3-2.5)^2 + (4-2.5)^2\right)$ $\sigma^2_B = \frac{1}{4} ((-1.5)^2 + (-0.5)^2 + (0.5)^2 + (1.5)^2)$ $\sigma^2_B = \frac{1}{4} (2.25 + 0.25 + 0.25 + 2.25) = \frac{1}{4} (5.00) = 1.25$

Step 2: Normalize the Data ($\hat{x}^{(i)}$): For each pixel in the image, subtract the mean and divide by the standard deviation (using $\epsilon = 10^{-5}$): $\sqrt{\sigma^2_B + \epsilon} = \sqrt{1.25 + 10^{-5}} \approx \sqrt{1.25} \approx 1.118$ $\hat{x}^{(1)} = \frac{1 - 2.5}{1.118} = \frac{-1.5}{1.118} \approx -1.34$ $\hat{x}^{(2)} = \frac{2 - 2.5}{1.118} = \frac{-0.5}{1.118} \approx -0.45$ $\hat{x}^{(3)} = \frac{3 - 2.5}{1.118} = \frac{0.5}{1.118} \approx 0.45$ $\hat{x}^{(4)} = \frac{4 - 2.5}{1.118} = \frac{1.5}{1.118} \approx 1.34$ Normalized values: $\hat{x} = \{-1.34, -0.45, 0.45, 1.34\}$

Step 3: Apply Scaling ($\gamma$) and Shifting ($\beta$): Assuming learnable parameters $\gamma = 2$ and $\beta = 3$: $y^{(i)} = \gamma \hat{x}^{(i)} + \beta = 2 \cdot \hat{x}^{(i)} + 3$ $y^{(1)} = 2 \cdot (-1.34) + 3 = -2.68 + 3 = 0.32$ $y^{(2)} = 2 \cdot (-0.45) + 3 = -0.90 + 3 = 2.10$ $y^{(3)} = 2 \cdot (0.45) + 3 = 0.90 + 3 = 3.90$ $y^{(4)} = 2 \cdot (1.34) + 3 = 2.68 + 3 = 5.68$ Resulting activations: $y = \{0.32, 2.10, 3.90, 5.68\}$. Interpretation: The network can now adjust the mean and variance of activations as needed for the task, restoring the ability to learn expressive and flexible features.
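The three steps above can be verified with a few lines of NumPy (the tiny differences from the hand calculation come from rounding $\hat{x}$ to two decimals there):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize to zero mean / unit variance, then scale and shift."""
    x_hat = (x - x.mean()) / np.sqrt(x.var() + eps)
    return gamma * x_hat + beta

x = np.array([1.0, 2.0, 3.0, 4.0])
y = batch_norm(x, gamma=2.0, beta=3.0)
print(np.round(y, 2))  # close to the hand-computed {0.32, 2.10, 3.90, 5.68}
```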

5.2 Single Image (Three Channels - RGB): For a 2x2 image with 3 channels (RGB), each of the four pixels carries a triple $(R, G, B)$; e.g. the pixel at position (1,1) has values $R_{1,1}, G_{1,1}, B_{1,1}$.

The process is repeated for each channel independently:

  1. Compute the mean and variance for each channel separately (e.g., $\mu_R, \sigma^2_R$ for Red channel, etc.).
  2. Normalize the values for each channel independently.
  3. Apply scaling ($\gamma$) and shifting ($\beta$) for each channel independently. This ensures that each channel is normalized independently.

5.3 Multiple Images (Single Channel - Mini-batch): Consider a mini-batch of 2 images, each of size 2x2 with a single channel: Image 1 = [1 2; 3 4], Image 2 = [5 6; 7 8]. Flatten the mini-batch into a single vector (to compute the batch mean/variance): $B = \{1, 2, 3, 4, 5, 6, 7, 8\}$

Step 1: Compute the Batch Mean ($\mu_B$): $\mu_B = \frac{1+2+3+4+5+6+7+8}{8} = \frac{36}{8} = 4.5$

Step 2: Compute the Batch Variance ($\sigma^2_B$): $\sigma^2_B = \frac{1}{8} \sum_{i=1}^{8} (x^{(i)} - 4.5)^2$ $\sigma^2_B = \frac{1}{8} \left((1-4.5)^2 + (2-4.5)^2 + \cdots + (8-4.5)^2\right)$ $\sigma^2_B = \frac{1}{8} ((-3.5)^2 + (-2.5)^2 + (-1.5)^2 + (-0.5)^2 + (0.5)^2 + (1.5)^2 + (2.5)^2 + (3.5)^2)$ $\sigma^2_B = \frac{1}{8} (12.25 + 6.25 + 2.25 + 0.25 + 0.25 + 2.25 + 6.25 + 12.25) = \frac{1}{8} (42.0) = 5.25$

Step 3: Normalize the Data for All Pixels in the Batch: $\hat{x}^{(i)} = \frac{x^{(i)} - 4.5}{\sqrt{5.25 + 10^{-5}}} \approx \frac{x^{(i)} - 4.5}{2.291}$ For each pixel, subtract 4.5 and divide by 2.291.

Step 4: Apply Scaling and Shifting: $y^{(i)} = \gamma \hat{x}^{(i)} + \beta$, where $\gamma$ and $\beta$ are learnable parameters.

5.4 Multiple Images (Three Channels - RGB): For a mini-batch of 2 images, each with 3 channels (RGB) (e.g., Image 1 (RGB) = $\{R_{1}, G_{1}, B_{1}\}$, Image 2 (RGB) = $\{R_{2}, G_{2}, B_{2}\}$):

  • Flatten each channel across all images in the batch (e.g., all Red pixels from all images form one vector).
  • Compute the mean and variance for each channel across the batch.
  • Normalize each channel separately.
  • Apply the scaling and shifting for each channel.

6. Conclusion (Batch Normalization)

Batch Normalization is a powerful technique that stabilizes training, improves convergence, and allows the use of higher learning rates. By normalizing activations per mini-batch, it mitigates the issues of internal covariate shift and helps with gradient flow, thus allowing the training of deeper networks.

Dropout

Dropout is a regularization technique that prevents overfitting in neural networks.

  • Mechanism: During training, a certain percentage of neurons in a layer are randomly "dropped out" (i.e., their outputs are temporarily set to zero) along with their connections. This happens in each training iteration.
  • Effect:
    • Prevents Co-adaptation: It forces the network to learn more robust features because no single neuron can rely too much on the presence of any other specific neuron.
    • Ensemble Effect: It can be thought of as training an ensemble of many "thinned" networks (sub-networks) sharing weights. During inference (testing), all neurons are active, and their outputs are scaled by the keep probability $(1 - \text{dropout rate})$ to compensate; equivalently, modern frameworks use "inverted dropout", scaling the surviving activations up during training so that inference needs no rescaling.
  • Application: Applied to fully connected layers, and sometimes to convolutional layers, but typically with lower dropout rates for convolutional layers.
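A minimal sketch of (inverted) dropout in NumPy, as modern frameworks implement it: each unit is zeroed with probability `rate` during training, and survivors are scaled by 1/(1 - rate) so the expected activation is unchanged:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x, rate, training=True):
    """Inverted dropout: zero each unit with probability `rate` during
    training and scale survivors by 1/(1-rate); at inference the layer
    is the identity."""
    if not training or rate == 0.0:
        return x
    mask = rng.random(x.shape) >= rate  # True = keep this unit
    return x * mask / (1.0 - rate)

a = np.ones(10)
print(dropout(a, rate=0.5))                  # some zeros, survivors become 2.0
print(dropout(a, rate=0.5, training=False))  # unchanged at inference
```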

3.1.4 Hyper-parameter Tuning

Hyper-parameter tuning is the process of finding the optimal set of hyper-parameters for a machine learning model. Hyper-parameters are external configurations for a model whose values cannot be estimated from data (unlike model parameters, like weights and biases, which are learned during training). Examples include learning rate, number of layers, number of neurons per layer, kernel size, batch size, activation functions, and dropout rate.

Why Tune Hyper-parameters?

  • Model Performance: Different hyper-parameter combinations can significantly impact model accuracy, generalization, and training speed.
  • Generalization: Poorly chosen hyper-parameters can lead to overfitting (model performs well on training data but poorly on unseen data) or underfitting (model is too simple to capture patterns).
  • Computational Efficiency: Optimal hyper-parameters can reduce training time and resource consumption.

Techniques for Hyper-parameter Tuning:

The provided material demonstrates Random Search using Keras Tuner.

  1. Install Keras Tuner (if not already installed):

    pip install keras-tuner
  2. Import Libraries:

    import tensorflow as tf
    from tensorflow import keras
    from keras.layers import Conv2D, MaxPooling2D, Dense, Flatten, Dropout # Added Dropout
    from keras import Sequential
    from keras.datasets import mnist
    from keras.utils import to_categorical # For one-hot encoding
    from keras_tuner.tuners import RandomSearch # Package installs as "keras-tuner"; older releases used the import name "kerastuner"
    import numpy as np # Not explicitly in user material, but useful for np.argmax
    import matplotlib.pyplot as plt # Not explicitly in user material, but useful for visualization
  3. Load and Preprocess Dataset: (Same as explained in 3.1.1)

    (x_train, y_train), (x_val, y_val) = mnist.load_data()
    
    # Reshape and normalize input images
    x_train = x_train.reshape(-1, 28, 28, 1).astype("float32") / 255.0
    x_val = x_val.reshape(-1, 28, 28, 1).astype("float32") / 255.0 # Using x_val for validation
    
    # One-hot encode target labels
    y_train = to_categorical(y_train, 10)
    y_val = to_categorical(y_val, 10)
  4. Model Builder Function (build_model(hp)):

    • This function defines the model architecture and incorporates hp.Choice or hp.Int to specify ranges for hyper-parameters.
    • hp.Choice('filters', [32, 64]): Allows the tuner to choose between 32 or 64 filters for the Conv2D layer.
    • hp.Choice('kernel_size', [3, 5]): Allows choice between 3x3 or 5x5 kernel size.
    • hp.Choice('dense_units', [64, 128, 256]): Allows choice for the number of units in the Dense layer.
    • hp.Choice('dropout', [0.2, 0.3, 0.5]): Allows choice for the dropout rate.
    • hp.Choice('learning_rate', [1e-2, 1e-3, 1e-4]): Allows choice for the Adam optimizer's learning rate.
    def build_model(hp):
        model = tf.keras.Sequential()
    
        # First Convolutional Layer with tunable filters and kernel size
        model.add(Conv2D(
            filters=hp.Choice('filters', [32, 64]),
            kernel_size=hp.Choice('kernel_size', [3, 5]),
            activation='relu',
            padding='same',
            input_shape=(28, 28, 1)
        ))
        # Adding MaxPooling
        model.add(MaxPooling2D(pool_size=(2, 2)))
    
        # Adding more Conv2D layers (can be made tunable too)
        model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', padding='same'))
        model.add(MaxPooling2D(pool_size=(2, 2)))
        model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', padding='same'))
        model.add(MaxPooling2D(pool_size=(2, 2))) # Output will be 3x3x32 as seen in summary
    
        # Flatten layer before Dense layers
        model.add(Flatten())
    
        # Dense layer with tunable units
        model.add(Dense(
            units=hp.Choice('dense_units', [64, 128, 256]),
            activation='relu'
        ))
    
        # Dropout layer with tunable rate
        model.add(Dropout(
            rate=hp.Choice('dropout', [0.2, 0.3, 0.5])
        ))
    
        # Output layer
        model.add(Dense(10, activation='softmax'))
    
        # Compile the model with a tunable learning rate for the Adam optimizer
        model.compile(
            optimizer=tf.keras.optimizers.Adam(
                learning_rate=hp.Choice('learning_rate', [1e-2, 1e-3, 1e-4])
            ),
            loss='categorical_crossentropy',
            metrics=['accuracy']
        )
        return model
  5. Random Search Tuner Setup:

    • RandomSearch is chosen for tuning.
    • objective='val_accuracy': The tuner aims to maximize validation accuracy.
    • max_trials=20: The total number of different hyper-parameter combinations to try.
    • executions_per_trial=1: How many models to train for each trial.
    • directory='cnn_random_tuning': Directory to store logs and checkpoints.
    • project_name='mnist_random': Name of the tuning project.
    tuner = RandomSearch(
        build_model,
        objective='val_accuracy',
        max_trials=20, # You can increase this for deeper search
        executions_per_trial=1,
        directory='cnn_random_tuning',
        project_name='mnist_random'
    )
  6. Start Tuning:

    • tuner.search() starts the hyper-parameter search process.
    • epochs=5: Number of training epochs for each trial.
    • validation_data=(x_val, y_val): Data used for validation during tuning.
    tuner.search(x_train, y_train,
                 epochs=5,
                 validation_data=(x_val, y_val))
  7. Get Best Model and Hyperparameters:

    • After the search, the best model and its hyper-parameters can be retrieved.
    best_model = tuner.get_best_models(1)[0]
    best_hps = tuner.get_best_hyperparameters(1)[0]
    
    print("\nBest hyperparameters found:")
    print(f"Filters: {best_hps.get('filters')}")
    print(f"Kernel Size: {best_hps.get('kernel_size')}")
    print(f"Dense Units: {best_hps.get('dense_units')}")
    print(f"Dropout: {best_hps.get('dropout')}")
    print(f"Learning Rate: {best_hps.get('learning_rate')}")
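Stripped of the Keras machinery, random search itself is a very small loop: sample a combination from the search space, evaluate it, keep the best. A minimal sketch, where the toy `evaluate` lambda merely stands in for training a model and returning its validation accuracy:

```python
import random

# Hypothetical search space mirroring the Keras Tuner example above
search_space = {
    'filters': [32, 64],
    'kernel_size': [3, 5],
    'dense_units': [64, 128, 256],
    'dropout': [0.2, 0.3, 0.5],
    'learning_rate': [1e-2, 1e-3, 1e-4],
}

def random_search(space, evaluate, max_trials=20, seed=0):
    """Sample `max_trials` random combinations and keep the best-scoring one."""
    rng = random.Random(seed)
    best_hps, best_score = None, float('-inf')
    for _ in range(max_trials):
        hps = {name: rng.choice(values) for name, values in space.items()}
        score = evaluate(hps)  # stands in for: build model, train, return val_accuracy
        if score > best_score:
            best_hps, best_score = hps, score
    return best_hps, best_score

# Toy objective: rewards a learning rate near 1e-3 and larger dense layers
toy = lambda hps: hps['dense_units'] / 256 - abs(hps['learning_rate'] - 1e-3)
best, score = random_search(search_space, toy, max_trials=50)
```

Keras Tuner adds checkpointing, logging, and parallel execution on top of this basic loop.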

3.2 Image Augmentation and Normalization

Image Augmentation and Image Normalization are two crucial preprocessing techniques used in computer vision and deep learning tasks, especially for training Convolutional Neural Networks (CNNs). They help improve model performance, generalization, and robustness.

1. Image Augmentation

Image augmentation is a technique used to artificially increase the size and diversity of a training dataset by creating modified versions of existing images. These transformations simulate real-world variability while preserving the original label of the image.

Why it is used:

  1. To prevent overfitting: By increasing the variety of training data, the model sees more diverse examples, making it less likely to memorize the specific training samples and more likely to learn generalizable features.
  2. To make models more robust to real-world variations: Models become more resilient to changes such as orientation, lighting, scale, and occlusion that might occur in unseen data.
  3. To train with less real data: Augmentation reduces the need for collecting vast amounts of new, distinct real-world data, saving time and data collection costs.

1.1 Common Image Augmentation Techniques

Image augmentation introduces controlled transformations to training images to increase dataset diversity, improve model generalization, and reduce overfitting.

1.1.1 Rotation

Rotation involves turning an image by a specific angle $\theta$.

  • If the rotation is around the origin $(0, 0)$, the coordinates $(x, y)$ of a pixel transform to new coordinates $(x', y')$ as: $\begin{bmatrix} x' \\ y' \end{bmatrix} = \begin{bmatrix} \cos \theta & -\sin \theta \\ \sin \theta & \cos \theta \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix}$ So: $x' = x \cos \theta - y \sin \theta$ $y' = x \sin \theta + y \cos \theta$
  • If you rotate around the origin $(0,0)$, parts of the image can move out of view.
  • For rotation around the center of the image $(c_x, c_y)$, the formula becomes: $x' = (x - c_x) \cos \theta - (y - c_y) \sin \theta + c_x$ $y' = (x - c_x) \sin \theta + (y - c_y) \cos \theta + c_y$ This is how rotation is typically done in practical applications (e.g., OpenCV, TensorFlow), ensuring the image stays centered and rotates in place.

Example: Rotation of a 3x3 Image Around Its Center by 90° Counterclockwise

Step 1: Original Image Coordinate Grid Assume top-left is (0,0) and bottom-right is (2,2). Coordinates (y $\downarrow$, x $\rightarrow$): y=0: (0,0) (1,0) (2,0) y=1: (0,1) (1,1) (2,1) y=2: (0,2) (1,2) (2,2)

Step 2: Compute Image Center Image has width $W=3$ and height $H=3$. Center: $c_x = \frac{W-1}{2} = \frac{3-1}{2} = 1.0$ $c_y = \frac{H-1}{2} = \frac{3-1}{2} = 1.0$ So, the center of rotation is $(1, 1)$.

Step 3: Apply Rotation Around the Center To rotate a point $(x, y)$ by $\theta = 90^\circ$ counterclockwise around $(c_x, c_y)$ (i.e., $(1, 1)$), we use the transformation formulas: $x' = (x - 1) \cos 90^\circ - (y - 1) \sin 90^\circ + 1$ $y' = (x - 1) \sin 90^\circ + (y - 1) \cos 90^\circ + 1$ Since $\cos 90^\circ = 0$ and $\sin 90^\circ = 1$: $x' = -(y - 1) + 1 = -y + 2$ $y' = (x - 1) + 1 = x$

Applying this to each pixel in the image:

| (x, y) | $\Delta x = x - c_x$ | $\Delta y = y - c_y$ | $x'$ | $y'$ | New Position $(x', y')$ |
|---|---|---|---|---|---|
| (0, 0) | -1 | -1 | $-(-1)+1=2$ | 0 | (2, 0) |
| (1, 0) | 0 | -1 | $-(-1)+1=2$ | 1 | (2, 1) |
| (2, 0) | 1 | -1 | $-(-1)+1=2$ | 2 | (2, 2) |
| (0, 1) | -1 | 0 | $-(0)+1=1$ | 0 | (1, 0) |
| (1, 1) | 0 | 0 | $-(0)+1=1$ | 1 | (1, 1) |
| (2, 1) | 1 | 0 | $-(0)+1=1$ | 2 | (1, 2) |
| (0, 2) | -1 | 1 | $-(1)+1=0$ | 0 | (0, 0) |
| (1, 2) | 0 | 1 | $-(1)+1=0$ | 1 | (0, 1) |
| (2, 2) | 1 | 1 | $-(1)+1=0$ | 2 | (0, 2) |

Step 4: Rotated Image Grid The resulting grid after the $90^\circ$ rotation, showing which original pixel occupies each new position (y $\downarrow$, x $\rightarrow$): y=0: (0,2) (0,1) (0,0) y=1: (1,2) (1,1) (1,0) y=2: (2,2) (2,1) (2,0)

This confirms that the top row becomes the right column, and all pixels rotate correctly around the image center.
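The worked example can be verified numerically with the center-rotation formulas. A small sketch (rounding absorbs the tiny floating-point error in $\cos 90^\circ$):

```python
import numpy as np

def rotate_point(x, y, theta_deg, cx, cy):
    """Rotate (x, y) by theta degrees counterclockwise about the center (cx, cy)."""
    t = np.deg2rad(theta_deg)
    xr = (x - cx) * np.cos(t) - (y - cy) * np.sin(t) + cx
    yr = (x - cx) * np.sin(t) + (y - cy) * np.cos(t) + cy
    # Round to absorb floating-point error (cos 90 degrees is ~6e-17, not exactly 0)
    return round(xr, 6), round(yr, 6)

# Rows from the table: 90 degrees CCW rotation of a 3x3 image about (1, 1)
corners = {(0, 0): (2.0, 0.0), (2, 0): (2.0, 2.0), (2, 2): (0.0, 2.0)}
```

Each original corner maps exactly to the position derived in the table above.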

1.1.2 Flipping

Flipping is an image augmentation technique that mirrors the image either horizontally or vertically, increasing training data diversity by showing different orientations of the same image.

  • Horizontal Flip (Right-to-Left Flip):

    • Operation: Reverses the image along the vertical axis (flips from right to left or left to right).
    • Usefulness: Useful when the left/right orientation of objects does not matter (e.g., animals, vehicles).
    • Mathematical Formula: $x' = W - x - 1$ $y' = y$ where $W$ is the image width, and $(x, y)$ are the original pixel coordinates.
    • Example for a 3x3 image: Original: $\begin{bmatrix} A & B & C \\ D & E & F \\ G & H & I \end{bmatrix}$ $\Rightarrow$ Flipped: $\begin{bmatrix} C & B & A \\ F & E & D \\ I & H & G \end{bmatrix}$
  • Vertical Flip (Up-to-Down Flip):

    • Operation: Reverses the image along the horizontal axis (flips from top to bottom or bottom to top).
    • Usefulness: Useful when object orientation could appear inverted.
    • Mathematical Formula: $x' = x$ $y' = H - y - 1$ where $H$ is the image height, and $(x, y)$ are the original pixel coordinates.
    • Example for the same 3x3 image: Original: $\begin{bmatrix} A & B & C \\ D & E & F \\ G & H & I \end{bmatrix}$ $\Rightarrow$ Flipped: $\begin{bmatrix} G & H & I \\ D & E & F \\ A & B & C \end{bmatrix}$

Summary (Flipping): The terms horizontal/vertical flip and right-to-left/up-to-down flip are interchangeable and describe the same transformations.
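Both flip formulas reduce to one-line NumPy slicing. A minimal sketch using the same 3x3 letter grid:

```python
import numpy as np

img = np.array([['A', 'B', 'C'],
                ['D', 'E', 'F'],
                ['G', 'H', 'I']])

# Horizontal flip: x' = W - x - 1 (reverse each row, i.e. the x axis)
hflip = img[:, ::-1]
# Vertical flip: y' = H - y - 1 (reverse the row order, i.e. the y axis)
vflip = img[::-1, :]
```

These are the same operations performed by library helpers such as `np.fliplr` and `np.flipud`.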

1.1.3 Cropping

Cropping is a data augmentation technique that extracts a smaller rectangular region from the original image to simulate zooming or focusing on specific areas.

  • Mechanism: Given an image of size $(H, W)$, a crop of size $(h, w)$ is selected with the top-left corner coordinates $(x_0, y_0)$ such that: $0 \le x_0 \le W - w$ $0 \le y_0 \le H - h$
  • Effect: This crop represents a zoomed-in region of the original image, which helps the model learn better features from different perspectives.

Example: Cropping a 3x3 patch from a 5x5 image starting at $(x_0, y_0) = (1, 1)$ Original 5x5 image (represented by letters in a grid): $\begin{bmatrix} A & B & C & D & E \\ F & G & H & I & J \\ K & L & M & N & O \\ P & Q & R & S & T \\ U & V & W & X & Y \end{bmatrix}$

If we want to crop a 3x3 patch starting at $(1, 1)$ (where $(0,0)$ is top-left), the resulting crop is: $\begin{bmatrix} G & H & I \\ L & M & N \\ Q & R & S \end{bmatrix}$
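Cropping is plain array slicing once the corner constraints are checked. A sketch using a numeric 5x5 stand-in for the letter grid:

```python
import numpy as np

# A 5x5 image with values 0..24; rows index y, columns index x
img = np.arange(25).reshape(5, 5)

H, W = img.shape
h, w = 3, 3        # crop size
x0, y0 = 1, 1      # top-left corner of the crop

# Validity check from the constraints 0 <= x0 <= W - w and 0 <= y0 <= H - h
assert 0 <= x0 <= W - w and 0 <= y0 <= H - h

crop = img[y0:y0 + h, x0:x0 + w]
```

For random cropping during training, `x0` and `y0` would be drawn uniformly from the valid ranges above.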

1.1.4 Scaling

Scaling is a geometric transformation that resizes the image by a scaling factor $s$.

  • Mechanism: Given an original image size $(H, W)$, after scaling, the new image size becomes $(sH, sW)$, where $s > 0$.
  • Interpretation:
    • If $s > 1$, the image is enlarged (zoom in).
    • If $s < 1$, the image is shrunk (zoom out).
  • Pixel Coordinate Transformation: Each pixel coordinate $(x, y)$ transforms as: $x' = s \times x$ $y' = s \times y$
  • Interpolation methods (e.g., bilinear, bicubic) are used to fill pixel values at non-integer coordinates that arise after scaling.
  • Typical values of $s$: Range between 0.8 and 1.2.

Example: Scaling a 3x3 image by $s=2$ (results in a 6x6 image) Original 3x3 image: $\begin{bmatrix} A & B & C \\ D & E & F \\ G & H & I \end{bmatrix}$

Scaling by $s=2$ effectively "zooms in" the image by doubling its size and replicating pixels: $\begin{bmatrix} A & A & B & B & C & C \\ A & A & B & B & C & C \\ D & D & E & E & F & F \\ D & D & E & E & F & F \\ G & G & H & H & I & I \\ G & G & H & H & I & I \end{bmatrix}$
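The pixel-replication example corresponds to nearest-neighbour upscaling, which `np.repeat` expresses directly. A sketch with numbers in place of letters (practical pipelines would usually use bilinear or bicubic interpolation instead):

```python
import numpy as np

img = np.array([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9]])
s = 2  # scaling factor

# Repeat each row s times, then each column s times: a (3, 3) -> (6, 6) zoom-in
scaled = np.repeat(np.repeat(img, s, axis=0), s, axis=1)
```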

1.1.5 Translation

Translation shifts the image along the X (horizontal) and Y (vertical) axes.

  • Mechanism: Given a horizontal shift $\Delta x$ and vertical shift $\Delta y$, the pixel coordinates $(x, y)$ transform as: $x' = x + \Delta x$ $y' = y + \Delta y$
  • In the example below, zeros indicate empty pixels introduced by the shift (where original pixels moved out of bounds, or new regions are exposed).

Example: Applying translation with $\Delta x = 1$ and $\Delta y = 0$ (shift right by 1 pixel) to a 3x3 image Original 3x3 image: $\begin{bmatrix} A & B & C \\ D & E & F \\ G & H & I \end{bmatrix}$

Applying translation: $\Rightarrow \begin{bmatrix} 0 & A & B \\ 0 & D & E \\ 0 & G & H \end{bmatrix}$ (assuming the image bounds are preserved, and new space is filled with 0s for pixel column 0).
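A small sketch of translation with zero fill, reproducing the shift-right-by-one example (the helper function and its slicing logic are ours):

```python
import numpy as np

def translate(img, dx, dy, fill=0):
    """Shift an image by (dx, dy); regions exposed by the shift get `fill`."""
    H, W = img.shape
    out = np.full_like(img, fill)
    # Source and destination windows, clipped so both stay inside the image
    xs_src = slice(max(0, -dx), min(W, W - dx))
    ys_src = slice(max(0, -dy), min(H, H - dy))
    xs_dst = slice(max(0, dx), min(W, W + dx))
    ys_dst = slice(max(0, dy), min(H, H + dy))
    out[ys_dst, xs_dst] = img[ys_src, xs_src]
    return out

img = np.array([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9]])
shifted = translate(img, dx=1, dy=0)  # shift right by one column
```

Pixels pushed past the right edge are discarded, and the newly exposed left column is filled with zeros, matching the matrix above.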

1.2 Color Transformations

Color transformations modify the color properties of an image, which can serve as a form of augmentation or enhancement.

1.2.1 Brightness Adjustment

Brightness adjustment modifies the overall intensity of the image by adding a constant $\beta$ to all pixels.

  • Formula: $I'(x, y) = \text{clip}(I(x, y) + \beta, 0, 255)$ where $I(x, y)$ is the original pixel value, $\beta$ is the brightness adjustment parameter, and $\text{clip}(\cdot)$ limits values to the valid pixel range (e.g., 0 to 255 for 8-bit images).

  • Example: Original image pixels and brightness adjustments Original image: $\begin{bmatrix} 100 & 120 & 130 \\ 140 & 150 & 160 \\ 170 & 180 & 190 \end{bmatrix}$ (grayscale values)

    With $\beta = 20$ (brighter): $\begin{bmatrix} 120 & 140 & 150 \\ 160 & 170 & 180 \\ 190 & 200 & 210 \end{bmatrix}$

    With $\beta = -30$ (darker): $\begin{bmatrix} 70 & 90 & 100 \\ 110 & 120 & 130 \\ 140 & 150 & 160 \end{bmatrix}$
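The brightness formula, including the clip to the 8-bit range, takes a few lines of NumPy. A sketch reproducing the $\beta = 20$ and $\beta = -30$ cases:

```python
import numpy as np

def adjust_brightness(img, beta):
    """I'(x, y) = clip(I(x, y) + beta, 0, 255) for an 8-bit image.
    Widen to int16 first so the addition cannot wrap around in uint8."""
    return np.clip(img.astype(np.int16) + beta, 0, 255).astype(np.uint8)

img = np.array([[100, 120, 130],
                [140, 150, 160],
                [170, 180, 190]], dtype=np.uint8)
brighter = adjust_brightness(img, 20)
darker = adjust_brightness(img, -30)
```

The intermediate cast matters: adding directly in `uint8` would silently overflow instead of clipping.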

1.2.2 Contrast Modification

Contrast modification changes the difference between light and dark regions by scaling pixel intensity deviations from the mean.

  • Formula: $I'(x, y) = \alpha \cdot (I(x, y) - \mu) + \mu$ where:

    • $I(x, y)$ is the original pixel intensity.
    • $\mu$ is the mean intensity of the image.
    • $\alpha$ is the contrast adjustment parameter.
    • If $\alpha > 1$, contrast increases (stretches intensities).
    • If $0 < \alpha < 1$, contrast decreases (compresses intensities).
  • Example: 3x3 grayscale image, calculating mean intensity and new pixel at (0,0) Assuming a 3x3 grayscale image: $\begin{bmatrix} 100 & 160 & 170 \\ 120 & 180 & 190 \\ 130 & 140 & 150 \end{bmatrix}$

    Calculate the mean intensity ($\mu$): $\mu = \frac{100+160+170+120+180+190+130+140+150}{9} = \frac{1340}{9} \approx 148.89$ (The provided material rounds this to 140 for simplicity; we follow that for the demonstration.) So, let $\mu = 140$.

    For $\alpha = 1.2$, new pixel at $(0,0)$: $I'(0, 0) = 1.2 \times (100 - 140) + 140 = 1.2 \times (-40) + 140 = -48 + 140 = 92$. (This process would be repeated for other pixels similarly.)
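A sketch of the contrast formula in NumPy, using $\mu = 140$ as in the worked example (the clip to $[0, 255]$ is added for safety with larger $\alpha$):

```python
import numpy as np

def adjust_contrast(img, alpha, mu=None):
    """I'(x, y) = alpha * (I(x, y) - mu) + mu, clipped to [0, 255].
    If mu is not given, use the image's own mean intensity."""
    mu = img.mean() if mu is None else mu
    return np.clip(alpha * (img.astype(np.float64) - mu) + mu, 0, 255)

img = np.array([[100, 160, 170],
                [120, 180, 190],
                [130, 140, 150]])
out = adjust_contrast(img, alpha=1.2, mu=140)
# out[0, 0] = 1.2 * (100 - 140) + 140, i.e. approximately 92
```

With $\alpha = 1$ the image is returned unchanged, since the deviations from the mean are neither stretched nor compressed.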

1.2.3 Saturation and Hue Shifting

These operations are performed in the HSV (Hue, Saturation, Value) color space, which separates color components. Conversion between RGB and HSV is needed to perform these operations.

  • Saturation: Adjusts the intensity or purity of color. Increasing saturation makes colors more vivid, while decreasing saturation makes them more washed out.
  • Hue: Changes the color tone by shifting the hue angle around the color wheel (e.g., red can become orange or purple).

1.3 Noise Injections

Noise injection involves adding artificial noise to training images. This helps models generalize better by exposing them to variations and imperfections commonly found in real-world images. Models trained on noisy data become more robust and less prone to overfitting.

1.3.1 Gaussian Noise

Gaussian noise is statistical noise with a probability density function equal to that of the normal (Gaussian) distribution. It simulates natural sensor noise or grainy images taken in low-light conditions.

  • Mathematical Description: $I'(x, y) = I(x, y) + N(0, \sigma^2)$ where:

    • $I(x, y)$ is the original pixel value at coordinates $(x, y)$.
    • $N(0, \sigma^2)$ is a random value sampled from a Gaussian distribution with mean 0 and variance $\sigma^2$.
    • $I'(x, y)$ is the new pixel value after adding noise.
  • Role: Adds subtle, natural-looking noise to images, helping models handle real noisy images better.

  • Example: If the original pixel value is 150 and the noise sampled from the Gaussian distribution is 5, then the new pixel value becomes: $I'(x, y) = 150 + 5 = 155$. This slight change simulates a noisy sensor reading.
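A minimal sketch of Gaussian noise injection (the seed and image size here are arbitrary; clipping keeps results in the valid 8-bit range):

```python
import numpy as np

def add_gaussian_noise(img, sigma, seed=None):
    """I'(x, y) = I(x, y) + N(0, sigma^2), clipped to the valid 8-bit range."""
    rng = np.random.default_rng(seed)
    noise = rng.normal(0.0, sigma, size=img.shape)
    return np.clip(img.astype(np.float64) + noise, 0, 255)

flat = np.full((100, 100), 150.0)
noisy = add_gaussian_noise(flat, sigma=5, seed=0)
# The mean stays near 150 while individual pixels wobble around it
```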

1.3.2 Salt-and-Pepper Noise

Salt-and-pepper noise is a type of impulse noise where pixels are randomly set to black or white. It simulates sudden disturbances like dead pixels, transmission errors, or dust on a camera sensor.

  • Mechanism:

    • Each pixel has a probability $p$ of being corrupted.
    • For corrupted pixels:
      • With probability $p/2$, pixel value becomes 0 (pepper, black).
      • With probability $p/2$, pixel value becomes 255 (salt, white).
    • Otherwise, the pixel remains unchanged.
  • Role: Models sudden random corruptions, helping models learn to ignore or handle missing or corrupted pixel information.

  • Example: If a pixel’s original value is 120 but it is randomly corrupted to 0 (pepper), the model must learn to recognize the image even if some pixels are completely black.
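A sketch of the mechanism above: each pixel is corrupted with probability $p$, split evenly between pepper and salt (the function name and seed are ours):

```python
import numpy as np

def salt_and_pepper(img, p, seed=None):
    """With probability p a pixel is corrupted: half of those become 0 (pepper),
    the other half 255 (salt); all remaining pixels are left unchanged."""
    rng = np.random.default_rng(seed)
    out = img.copy()
    r = rng.random(img.shape)
    out[r < p / 2] = 0                   # pepper
    out[(r >= p / 2) & (r < p)] = 255    # salt
    return out

img = np.full((200, 200), 120, dtype=np.uint8)
noisy = salt_and_pepper(img, p=0.1, seed=0)
```

On a 200x200 image with $p = 0.1$, roughly 10% of pixels end up as pure black or pure white.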

1.3.3 Random Erasing (also known as Cutout)

Random erasing involves randomly selecting a rectangular region of the image and erasing its pixels (usually replaced by zeros or random values).

  • Role: Simulates occlusion where part of the object in the image is hidden or missing, helping models be more robust to partial visibility of objects.

  • Example: If a photo of a face has part of it covered (e.g., by a hand or glasses), random erasing simulates such occlusions during training.

Summary (Noise Injection): Noise injection techniques help models generalize better by exposing them to variations and imperfections. Real-world images are rarely perfect—noise simulates those real-world imperfections. Models trained on noisy data become more robust and less prone to overfitting.

1.4 Cutout and Erasing (Revisit - overlaps with Random Erasing)

The syllabus specifically lists "Cutout and Erasing" as a separate topic, which strongly overlaps with "Random Erasing" mentioned under Noise Injections. Often, "Cutout" refers to specifically setting a rectangular region to zero, while "Random Erasing" can replace with random values or image mean.

1.4.1 Random Erasing (Detailed)

Randomly blacks out a rectangular region in the image to simulate occlusion or missing parts.

  • If the erased region is defined by top-left corner $(x_0, y_0)$ and width $w$, height $h$, then: $I'(x, y) = 0$ for $x_0 \le x < x_0 + w, y_0 \le y < y_0 + h$
  • Pixels outside this region remain unchanged.

Example: Erasing a 2x2 block in a 5x5 image starting at $(1, 1)$ (numbers indicate pixel values): Original image: $\begin{bmatrix} 10 & 20 & 30 & 40 & 50 \\ 60 & 70 & 80 & 90 & 100 \\ 110 & 120 & 130 & 140 & 150 \\ 160 & 170 & 180 & 190 & 200 \\ 210 & 220 & 230 & 240 & 250 \end{bmatrix}$

After erasing (the 2x2 block from index (1,1) to (2,2) - using 0-based indexing): $\begin{bmatrix} 10 & 20 & 30 & 40 & 50 \\ 60 & \mathbf{0} & \mathbf{0} & 90 & 100 \\ 110 & \mathbf{0} & \mathbf{0} & 140 & 150 \\ 160 & 170 & 180 & 190 & 200 \\ 210 & 220 & 230 & 240 & 250 \end{bmatrix}$
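Erasing is a rectangular assignment. A sketch reproducing the 2x2 erase at $(1, 1)$ on the 10..250 grid (a full random-erasing augmentation would also sample the rectangle's position and size):

```python
import numpy as np

def erase(img, x0, y0, w, h, value=0):
    """Set the rectangle with top-left (x0, y0), width w, height h to `value`."""
    out = img.copy()
    out[y0:y0 + h, x0:x0 + w] = value
    return out

# The 5x5 grid of values 10..250 from the example
img = 10 * np.arange(1, 26).reshape(5, 5)
erased = erase(img, x0=1, y0=1, w=2, h=2)
```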

1.5 Advanced Techniques (for Image Augmentation)

1.5.1 Mixup

Mixup generates new training samples by creating convex combinations of pairs of images and their labels.

  • Given two images $x_i$ and $x_j$ with their corresponding labels $y_i$ and $y_j$, the mixed image $\tilde{x}$ and mixed label $\tilde{y}$ are: $\tilde{x} = \lambda x_i + (1 - \lambda) x_j$ $\tilde{y} = \lambda y_i + (1 - \lambda) y_j$
  • Where $\lambda \in [0, 1]$ is sampled from a Beta distribution $\text{Beta}(\alpha, \alpha)$ with parameter $\alpha > 0$.
  • Effect: This encourages the model to behave linearly between training examples, improving generalization and robustness to adversarial examples.
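A minimal sketch of Mixup on a toy pair of images with one-hot labels (the shapes and $\alpha$ value are illustrative):

```python
import numpy as np

def mixup(x_i, y_i, x_j, y_j, alpha=0.2, seed=0):
    """Mix a pair: x~ = lam*x_i + (1-lam)*x_j, y~ = lam*y_i + (1-lam)*y_j,
    with lam drawn from Beta(alpha, alpha)."""
    rng = np.random.default_rng(seed)
    lam = rng.beta(alpha, alpha)
    return lam * x_i + (1 - lam) * x_j, lam * y_i + (1 - lam) * y_j

x_i, x_j = np.zeros((4, 4)), np.ones((4, 4))          # two toy "images"
y_i, y_j = np.array([1.0, 0.0]), np.array([0.0, 1.0])  # one-hot labels
x_mix, y_mix = mixup(x_i, y_i, x_j, y_j)
# Every mixed pixel equals 1 - lam, and the mixed label still sums to 1
```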
1.5.2 CutMix

CutMix enhances data diversity by replacing a random rectangular region of one image with a patch from another image.

  • Formally, given images $x_i$, $x_j$ with labels $y_i$, $y_j$, and a binary mask $M$ indicating the rectangular region (1s in the cut region, 0s elsewhere), the augmented image $\tilde{x}$ and label $\tilde{y}$ are: $\tilde{x} = x_i \odot (1 - M) + x_j \odot M$; that is, $x_i \odot (1 - M)$ is the original image with a hole, and $x_j \odot M$ is the patch inserted into it. (The provided document writes $x_i \odot M + x_j \odot (1-M)$, which is the same operation with the roles of the mask and its complement swapped.) $\tilde{y} = \lambda y_i + (1 - \lambda) y_j$, where $\odot$ denotes element-wise multiplication and $\lambda$ is the fraction of $x_i$ that survives, i.e., $\lambda = 1 - \text{Area}(M) / \text{Area}(\text{Image})$.
  • Effect: This forces the model to learn from partial views of objects and context, improving robustness and generalization.
  • For further reading on practical data augmentation techniques, refer to the mentioned Medium article: Image Data Augmentation Techniques.
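A sketch of CutMix with an explicitly chosen patch (real implementations sample the patch location and size randomly; here they are fixed so the label mixing is easy to check):

```python
import numpy as np

def cutmix(x_i, y_i, x_j, y_j, x0, y0, w, h):
    """Paste the (w x h) patch of x_j at (x0, y0) into x_i; mix labels by area."""
    x = x_i.copy()
    x[y0:y0 + h, x0:x0 + w] = x_j[y0:y0 + h, x0:x0 + w]
    lam = 1 - (w * h) / (x_i.shape[0] * x_i.shape[1])  # fraction of x_i kept
    return x, lam * y_i + (1 - lam) * y_j

x_i, x_j = np.zeros((8, 8)), np.ones((8, 8))           # two toy "images"
y_i, y_j = np.array([1.0, 0.0]), np.array([0.0, 1.0])  # one-hot labels
x_mix, y_mix = cutmix(x_i, y_i, x_j, y_j, x0=0, y0=0, w=4, h=4)
# 16 of 64 pixels come from x_j, so lam = 0.75 and y_mix = [0.75, 0.25]
```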

2 Pixel Values in Digital Images

2.1 What Are Pixel Values?

  • A pixel (picture element) is the smallest unit of a digital image, representing a single point of color or intensity.
  • Each pixel has a numeric value that encodes information about the color or brightness at that point.
  • The range of possible pixel values depends on the image's bit-depth, which determines how many distinct values each pixel can have.

Common Pixel Value Ranges:

  • For 8-bit images (most common format), each pixel value ranges from 0 to 255.
    • 0 represents the darkest intensity (black).
    • 255 represents the brightest (white) in grayscale images.
  • For color images (e.g., RGB), each pixel consists of multiple channels (Red, Green, Blue), and each channel value ranges independently from 0 to 255.
  • Higher bit-depth images (e.g., 16-bit or 32-bit) allow pixel values in a wider range (e.g., 0 to 65535 for 16-bit), but 8-bit is most commonly used due to its balance of quality and storage.

Variability Across Images and Datasets: Different images or datasets can have significant variations in:

  • Lighting conditions: Some images might be very bright, while others are very dark.
  • Contrast levels: Some images might have strong contrast, while others appear washed out.
  • Color distributions: Colors and their intensity distributions may vary widely across different datasets. This variability means raw pixel values from different images or datasets may not be directly comparable or suitable as input features for machine learning models without preprocessing.

Why Normalize Pixel Values?

Since raw pixel values can vary widely, normalization rescales these values into a common scale or distribution. This is essential for:

  • Reducing the effect of varying lighting and contrast: Ensures that differences in illumination don't mislead the model.
  • Preventing features with larger numeric ranges from dominating the learning process: If pixel values are very large, they might disproportionately influence gradient calculations and weight updates.
  • Improving numerical stability and training convergence speed: Neural networks often train faster and more stably when inputs are within a consistent, small range.
  • Making model training more robust across different datasets: Ensures that a model trained on one dataset can perform well on another with different lighting or capture conditions.

Numeric Example: Two grayscale images with different lighting conditions

  • Bright Image: A pixel value is $I_1 = 200$ (on a scale of 0 to 255).
  • Dark Image: The corresponding pixel value is $I_2 = 50$. Both pixels represent the same point in the scene but differ due to lighting. Without Normalization: $I_1 = 200, I_2 = 50$. The model sees a large difference (150) which might mislead it to think the content differs significantly.

2.2 Common Normalization Techniques

  • Min-Max Scaling:

    • How: Scales pixel values to a fixed range, usually $[0, 1]$ or $[-1, 1]$, by dividing by the maximum possible value (or subtracting min and dividing by range).
    • Formula (to $[0, 1]$): $I' = \frac{I - I_{\min}}{I_{\max} - I_{\min}}$
    • For 8-bit images ($I_{\min}=0, I_{\max}=255$): $I' = \frac{I}{255}$.
  • Example (Min-Max Scaling): Original pixel values: $I_1 = 200, I_2 = 50$. $I_{\max}=255, I_{\min}=0$. $I'_1 = \frac{200}{255} \approx 0.784$ $I'_2 = \frac{50}{255} \approx 0.196$ While a difference remains, normalization reduces its scale, allowing the model to learn invariant features more easily.

  • Mean-Variance Normalization (Standardization):

    • How: Centers pixel values around zero mean and scales them to unit variance. This is done by subtracting the mean ($\mu$) and dividing by the standard deviation ($\sigma$).
    • Formula: $I' = \frac{I - \mu}{\sigma}$
    • Where $\mu$ is the mean pixel value and $\sigma$ is the standard deviation of all pixel values in the image or dataset.
  • Example (Mean-Variance Normalization): Assume global mean $\mu = 127.5$ and standard deviation $\sigma = 50$. $I''_1 = \frac{200 - 127.5}{50} = \frac{72.5}{50} = 1.45$ $I''_2 = \frac{50 - 127.5}{50} = \frac{-77.5}{50} = -1.55$ This centers pixel values around zero and scales differences relative to data variation, helping the model generalize better across lighting changes.


Example: Single-Channel (Grayscale) Image Normalization

Consider a grayscale image with pixel values in the range 0 to 255. Suppose the image has the following pixel values: $\begin{bmatrix} 100 & 150 & 200 \\ 50 & 125 & 175 \\ 0 & 75 & 225 \end{bmatrix}$

Min-Max Scaling to $[0, 1]$: $I_{\min} = 0, I_{\max} = 225$. Normalized pixel values ($I' = \frac{I - I_{\min}}{I_{\max} - I_{\min}} = \frac{I}{225}$): $\begin{bmatrix} 100/225 & 150/225 & 200/225 \\ 50/225 & 125/225 & 175/225 \\ 0/225 & 75/225 & 225/225 \end{bmatrix} \approx \begin{bmatrix} 0.444 & 0.667 & 0.889 \\ 0.222 & 0.556 & 0.778 \\ 0.000 & 0.333 & 1.000 \end{bmatrix}$

Mean-Variance Normalization: Calculate the mean $\mu$ and standard deviation $\sigma$ of all pixel values: $\mu = \frac{100+150+200+50+125+175+0+75+225}{9} = \frac{1100}{9} \approx 122.22$ The population standard deviation is $\sigma = \sqrt{\frac{1}{N} \sum_{i,j} (I_{ij} - \mu)^2}$; computing it for this matrix gives $\sigma \approx 69.17$. (The sample standard deviation, which divides by $N-1$, would be slightly larger.)

For pixel $I_{11} = 100$, with $\mu \approx 122.22$ and $\sigma \approx 69.17$: $I'_{11} = \frac{100 - 122.22}{69.17} \approx -0.32$. Similarly, the other pixels get normalized values centered around zero with unit variance.

Example: Multi-Channel (Color) Image Normalization

For RGB images, normalization is usually done per channel because each color channel can have different mean and variance. Suppose we have an RGB image with three channels: Red, Green, and Blue. Each channel’s pixel values are normalized separately: $I'_R = \frac{I_R - \mu_R}{\sigma_R}$ $I'_G = \frac{I_G - \mu_G}{\sigma_G}$ $I'_B = \frac{I_B - \mu_B}{\sigma_B}$ where $\mu_R, \mu_G, \mu_B$ and $\sigma_R, \sigma_G, \sigma_B$ are the mean and standard deviation of pixel values in the respective channels, often computed over the entire training dataset.

Practical example: If the mean and standard deviation of the Red channel are 123.68 and 58.39 respectively, then a Red pixel value 150 will be normalized as: $I'_R = \frac{150 - 123.68}{58.39} = \frac{26.32}{58.39} \approx 0.45$. Similarly for Green and Blue channels.
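Both techniques, applied per channel to a toy RGB array (the image here is random; in practice $\mu$ and $\sigma$ are usually computed over the whole training set rather than a single image):

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(32, 32, 3)).astype(np.float64)  # toy RGB image

# Min-max scaling to [0, 1] for an 8-bit image
minmax = img / 255.0

# Per-channel standardization: subtract each channel's mean, divide by its std
mu = img.mean(axis=(0, 1))      # shape (3,): one mean per channel
sigma = img.std(axis=(0, 1))    # shape (3,): one std per channel
standardized = (img - mu) / sigma
```

Broadcasting applies each channel's `mu` and `sigma` across all of that channel's pixels, exactly as in the per-channel formulas above.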

Summary (Normalization):

  • Normalization scales pixel values to standard ranges or distributions.
  • For grayscale images, normalization operates on a single channel.
  • For color images, normalization is done independently per color channel.
  • Proper normalization improves model training speed and performance.

3.3 Image Classification, Object Detection and Segmentation

These are three fundamental tasks in computer vision, forming the backbone of many applications. Deep learning models, particularly CNNs, have achieved state-of-the-art results in all of them.

Image Classification

Image classification is the task of assigning a single class label to an entire input image. The goal is to determine what is depicted in the image.

  • Input: An image.
  • Output: A single category label (e.g., "cat", "dog", "car").
  • Mechanism (with CNNs): CNNs extract features through convolutional and pooling layers, which are then fed into fully connected layers that perform the final classification using a softmax activation function to output probabilities for each class.
  • Applications in Industrial Inspection:
    • Classifying products as "good" or "defective."
    • Categorizing different types of defects (e.g., "scratch," "dent," "color error").
    • Identifying product types on an assembly line.
    • Example: Grouping sample images of good parts and various types of defective parts into individual folders, where the directory name serves as the label. Popular CNNs include AlexNet, VGG, GoogLeNet, and ResNet.

Object Detection

Object detection is a more complex task that involves both localizing objects within an image and classifying them.

  • Input: An image.
  • Output: Bounding boxes around detected objects, along with a class label and a confidence score for each detected object.
  • Mechanism (with CNNs): Modern object detectors often use a two-stage approach (e.g., Faster R-CNN) or a single-stage approach (e.g., YOLO, SSD).
    • Two-stage: First, propose regions where objects might be (region proposals), then classify and refine bounding boxes for these regions.
    • Single-stage: Directly predict bounding boxes and class probabilities from the image in a single pass.
  • Applications in Industrial Inspection:
    • Locating defects on product surfaces (e.g., identifying the exact area of a scratch).
    • Counting specific components in an assembly.
    • Detecting missing parts in a package.
    • Challenges: The annotation process for object detection (drawing bounding boxes) is more time-consuming than simple image classification.
    • Popular Networks: R-CNN, Fast R-CNN, Faster R-CNN, Mask R-CNN, Single Shot Detector (SSD), YOLO.
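Two computations shared by essentially all of the detectors listed above are Intersection-over-Union (IoU) between bounding boxes and non-maximum suppression (NMS), which discards duplicate detections of the same object. A minimal pure-Python sketch (the 0.5 IoU threshold is a common default, not a requirement):

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ix1 = max(box_a[0], box_b[0]); iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2]); iy2 = min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression: keep highest-scoring boxes, drop heavy overlaps."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep
```

For instance, given two heavily overlapping boxes and one distant box, NMS keeps the higher-scoring box of the overlapping pair plus the distant one.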

Image Segmentation

Image segmentation takes the task of understanding an image to the pixel level. It involves dividing an image into meaningful regions or objects, assigning a label to each pixel.

  • Semantic Segmentation:

    • Input: An image.
    • Output: A segmentation map where every pixel is assigned a class label. Pixels belonging to the same semantic class (e.g., "road," "sky," "person") are grouped together. Background and non-interest regions often share a single "background" class label.
    • Mechanism (with CNNs): Typically uses encoder-decoder architectures (e.g., U-Net, DeepLab). The encoder captures high-level features, and the decoder reconstructs the spatial detail to assign a class to each pixel.
    • Mathematical Intuition (Semantic Segmentation):
      • Overview: Assigns a class label to each pixel. Given an input image $\mathbf{x} \in \mathbb{R}^{H \times W \times 3}$ (height, width, 3 channels for RGB), the output is a segmentation map $\hat{\mathbf{y}} \in \mathbb{R}^{H \times W \times C}$ where $C$ is the number of semantic classes.
      • Each pixel $(i, j)$ in the output map has a class probability vector: $\hat{y}_{i,j} = [P_{i,j}(1), P_{i,j}(2), \ldots, P_{i,j}(C)]$.
      • Feature Extraction (Encoder): A CNN encoder extracts spatial features, producing a lower-resolution feature map $\mathbf{f}(\mathbf{x}) = \text{CNN}(\mathbf{x}) \in \mathbb{R}^{h \times w \times d}$ (reduced height $h$, width $w$, depth $d$). This reduces computational complexity, increases receptive field, and extracts abstract features.
      • Pixel-wise Classification: A $1 \times 1$ convolution is applied to the feature map to generate class scores (logits) for each pixel: $\mathbf{z} = \text{Conv}_{1 \times 1}(\mathbf{f}(\mathbf{x})) \in \mathbb{R}^{h \times w \times C}$.
      • Upsampling / Decoder: The class score map $\mathbf{z}$ is upsampled to the original image resolution: $\mathbf{z}' = u(\mathbf{z}) \in \mathbb{R}^{H \times W \times C}$. Methods include transposed convolution (learnable) or bilinear interpolation (non-learnable), often combined with skip connections (U-Net).
      • Softmax Classification: Softmax is applied to $\mathbf{z}'$ at each pixel $(i, j)$ to get class probabilities: $P_{i,j}(c) = \frac{e^{z'_{i,j}(c)}}{\sum_{k=1}^{C} e^{z'_{i,j}(k)}}$.
      • Loss Function: Categorical cross-entropy is typically used for training, minimizing $\mathcal{L}_{i,j} = -\log P_{i,j}(y_{i,j})$ per pixel. The total loss is summed over all pixels.
      • Inference: The final class prediction at each pixel is the class with the highest probability: $\hat{y}_{i,j} = \arg\max_c P_{i,j}(c)$.
      • Architectures: Popular CNN architectures used as encoders (VGGNet, ResNet, DenseNet, MobileNet) combined with decoders (Transposed convolution layers, Upsampling). End-to-end architectures like U-Net, SegNet, FCN, DeepLab are specialized for semantic segmentation. Transformers (ViTs) are also emerging for long-range dependency capture.
  • Instance Segmentation:

    • Input: An image.
    • Output: A segmentation mask for each individual object instance in the image, along with its class label and bounding box. Unlike semantic segmentation, it differentiates between multiple objects of the same class (e.g., "car 1," "car 2").
    • Mechanism (with CNNs): Often combines object detection and semantic segmentation principles (e.g., Mask R-CNN). It first detects objects with bounding boxes, then for each detected object, it predicts a binary mask at the pixel level.
    • Mathematical Intuition (Instance Segmentation):
      • Definition: Classifies each pixel and differentiates between individual instances of the same class. Given input $\mathbf{x} \in \mathbb{R}^{H \times W \times C}$, outputs a set of binary masks with class labels: $f: \mathbb{R}^{H \times W \times C} \to \{(M_1, c_1), \ldots, (M_N, c_N)\}$.

      • $M_n \in \{0, 1\}^{H \times W}$ is the binary mask for the $n$-th instance, $c_n \in \{1, \ldots, K\}$ is the class label, and $N$ is the number of detected instances.

      • Mask Concept: A binary image (same size as input) where 1 indicates pixel belongs to specific object instance, 0 indicates background.

      • Formal Representation: Given $\mathbf{x} \in \mathbb{R}^{H \times W \times C}$, the model outputs $\{(M_1, c_1), \ldots, (M_N, c_N)\}$. Each mask has dimensions $H \times W$; the channel dimension is not used in masks.

      • Output Tensor: If $N$ instances are detected, the complete mask output is $\text{Masks} \in \{0, 1\}^{N \times H \times W}$, alongside class labels $\in \{1, \ldots, K\}^N$ and (optionally) bounding boxes $\in \mathbb{R}^{N \times 4}$.

      • Visualization: Binary masks overlaid on original image with colors/transparency to show object, location, and instance.

      • Illustrative Example: Mask Generation: Consider a 5×5 labeled image containing two objects, where pixel value 1 marks "person" pixels, 2 marks "dog" pixels, and 0 is background:

        Input Image: $\begin{bmatrix} 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 1 & 0 & 0 \\ 0 & 1 & 1 & 0 & 2 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 1 \end{bmatrix}$ Binary Mask for "Person" (pixels with value 1 become 1, all others 0): $M_{\text{person}} = \begin{bmatrix} 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 1 & 0 & 0 \\ 0 & 1 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 1 \end{bmatrix}$ Binary Mask for "Dog" (pixels with value 2 become 1, all others 0): $M_{\text{dog}} = \begin{bmatrix} 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \end{bmatrix}$

        The model returns a list of instance masks with corresponding labels: [(M_person, "person"), (M_dog, "dog")]

      • Comparison with Semantic Segmentation:

        • Semantic Segmentation: Assigns a class label to each pixel (e.g., “car”), but does not distinguish between multiple objects of the same class (e.g., all cars are just "car" pixels).
        • Instance Segmentation: Identifies each distinct object instance with a unique mask (e.g., “car 1”, “car 2”).
      • Mathematical Components (Model Output):

        • Object detection: Locates bounding boxes and predicts class labels (as in Faster R-CNN).
        • Mask prediction: For each detected box, a small CNN predicts a binary mask at pixel level.
        • Each object instance is represented by a binary segmentation mask: $\text{Mask}_n(i, j) = \begin{cases} 1 & \text{if pixel } (i, j) \text{ belongs to instance } n \\ 0 & \text{otherwise} \end{cases}$
        • Each instance is also associated with:
          • A bounding box $B_n = (x_{\min}, y_{\min}, x_{\max}, y_{\max})$
          • A class label $c_n$
          • A confidence score $s_n \in [0, 1]$
      • Loss Function: Trained using a composite loss that combines multiple objectives: $L = L_{cls} + L_{box} + L_{mask}$ where:

        • $L_{cls}$: classification loss (e.g., cross-entropy) for the object class.
        • $L_{box}$: bounding box regression loss (e.g., Smooth L1) for location prediction.
        • $L_{mask}$: binary cross-entropy loss for the pixel-wise mask of each instance.
      • Use Cases: Object counting and tracking in crowded scenes (people, vehicles), robotic manipulation, augmented reality (AR), medical imaging (segmenting individual cells, organs, lesions).

      • Summary: Instance segmentation bridges the gap between object detection and semantic segmentation by identifying and segmenting each object instance, crucial for advanced visual perception in AI systems.
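The pixel-wise softmax→argmax pipeline from semantic segmentation and the binary-mask extraction from the 5×5 illustration can both be reproduced in a short NumPy sketch (function names are illustrative; the label values 1 = person, 2 = dog follow the example above):

```python
import numpy as np

def segment(logits):
    """Pixel-wise softmax over class logits (H, W, C), then argmax to a label map (H, W)."""
    z = logits - logits.max(axis=-1, keepdims=True)          # subtract max for numerical stability
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    return probs.argmax(axis=-1)

def instance_masks(label_map, class_values):
    """Extract one binary mask per labeled object value, as in the 5x5 example."""
    return {v: (label_map == v).astype(np.uint8) for v in class_values}

# The 5x5 labeled image from the illustration (1 = person, 2 = dog, 0 = background).
label_map = np.array([[0, 0, 0, 0, 0],
                      [0, 1, 1, 0, 0],
                      [0, 1, 1, 0, 2],
                      [0, 0, 0, 0, 0],
                      [0, 0, 1, 0, 1]])
masks = instance_masks(label_map, [1, 2])   # masks[1] is M_person, masks[2] is M_dog
```

Note that thresholding on label values, as done here, yields per-class masks; a real instance segmenter like Mask R-CNN would additionally separate disconnected objects of the same class into distinct masks.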

Read also: The provided resource https://learnopencv.com/selective-search-for-object-detection-cpp-python/ gives insights into Selective Search, an algorithm often used in older R-CNN pipelines for generating region proposals. It's an important step for understanding the evolution of object detection.

Anomaly Detection (within classification/detection/segmentation)

The syllabus also mentions anomaly detection as a potential fourth type of method using deep learning.

  • Anomaly detection: Identifies instances that deviate significantly from the majority of the data.
  • Realization: Can be implemented as an image classification (good vs. anomalous), segmentation (highlighting anomalous regions), or detection problem.
  • Key Advantage: Useful when there are many "good" products available for training but very few "bad" (defective/anomalous) parts, making supervised learning challenging due to class imbalance. In such cases, models can be trained primarily on "good" data, and high reconstruction error or deviation from learned patterns can flag anomalies.
  • Data Requirement: Often requires high-quality images that clearly show differences between normal and anomalous regions. In some cases, as few as 20 images of good parts might suffice for quick prototyping.

3.4 Image Synthesis and Style Transfer

This topic refers to the generation of new images or the modification of existing ones based on learned patterns.

Image Synthesis

Image synthesis involves creating new images from scratch or from a given input, often based on learned data distributions.

  • Generative Models: Techniques like Generative Adversarial Networks (GANs) are particularly powerful for image synthesis. GANs consist of a generator network (creates synthetic images) and a discriminator network (tries to distinguish real from fake images). Through adversarial training, the generator learns to produce increasingly realistic images.
  • Applications:
    • Medical Imaging: Generating synthetic medical images for data augmentation, anonymization, or cross-modality translation (e.g., converting MRI to CT, as shown in the medical imaging material).
      • Nie et al. used GANs to estimate CT images from MR images. They added a gradient difference loss function in the training process to maintain the intensity gradient between pixels in MRI, resulting in sharper and more realistic CT images compared to conventional methods.
    • Data Augmentation: Creating diverse training examples without collecting more real data.
    • Filling Missing Regions: Inpainting or reconstructing corrupted image parts.
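The adversarial training objective can be made concrete with logits and a logistic discriminator output. This toy sketch shows only the two loss functions (using the common non-saturating generator loss), not a trainable model:

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def discriminator_loss(d_real_logit, d_fake_logit):
    """D wants D(x) -> 1 for real and D(G(z)) -> 0 for fake; this is the combined BCE."""
    return -math.log(sigmoid(d_real_logit)) - math.log(1.0 - sigmoid(d_fake_logit))

def generator_loss(d_fake_logit):
    """Non-saturating G loss: minimize -log D(G(z)) so G pushes D's output toward 'real'."""
    return -math.log(sigmoid(d_fake_logit))
```

When the discriminator is confident a fake is fake (a large negative logit), the generator's loss is large, giving it a strong learning signal; that tension is what drives the generator toward realistic images.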

Style Transfer

Style transfer is a technique that applies the artistic "style" of one image (e.g., a painting by Van Gogh) to the "content" of another image (e.g., a photograph), while preserving the content structure.

  • Mechanism: Typically uses deep learning models (often CNNs) to separate and recombine the content features of one image with the style features of another.
  • Applications:
    • Artistic Filters: Applying various artistic styles to photographs.
    • Image Manipulation: Creating new visual effects for graphics and media.
    • Domain Adaptation: Potentially transferring "style" (e.g., lighting conditions, sensor noise) between different image datasets to improve model performance across domains.
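In the standard neural style transfer formulation (Gatys et al., not spelled out in the text above), the "style" of a CNN feature map is summarized by its Gram matrix of channel-wise correlations, and the style loss compares Gram matrices of the generated and style images. A minimal NumPy sketch:

```python
import numpy as np

def gram_matrix(features):
    """Channel-channel correlations of a (H, W, C) feature map: the 'style' summary."""
    h, w, c = features.shape
    flat = features.reshape(h * w, c)       # one row per spatial location
    return flat.T @ flat / (h * w)          # (C, C), normalized by map size

def style_loss(feat_generated, feat_style):
    """Mean squared difference between Gram matrices of two feature maps."""
    g1, g2 = gram_matrix(feat_generated), gram_matrix(feat_style)
    return float(((g1 - g2) ** 2).mean())
```

Because the Gram matrix discards spatial arrangement and keeps only which features co-occur, minimizing this loss transfers texture and color statistics (style) while a separate content loss preserves structure.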

3.5 Deep Neural Networks (DNNs)

The syllabus lists "Deep Neural Networks" as a separate topic, which serves as an overarching concept encompassing CNNs and other deep learning architectures. DNNs are neural networks with multiple hidden layers, enabling them to learn hierarchical representations of data.

  • Definition: A Deep Neural Network (DNN) is an artificial neural network with multiple hidden layers between the input and output layers. The "deep" aspect refers to the depth of these hidden layers.

  • Hierarchy of Features: DNNs are capable of learning complex, abstract, and hierarchical features from raw input data automatically, moving from low-level features (e.g., edges) in earlier layers to high-level semantic features (e.g., object parts, object identities) in deeper layers.

  • Architecture: DNNs can take various forms, including:

    • Feedforward Neural Networks (MLPs): Where connections flow only in one direction from input to output, forming layers.
    • Convolutional Neural Networks (CNNs): Specialized for grid-like data (images, video) with convolutional and pooling layers.
    • Recurrent Neural Networks (RNNs) and LSTMs: Designed for sequential data (text, audio, time-series) to handle temporal dependencies.
    • Autoencoders: Used for unsupervised learning, dimensionality reduction, and feature learning.
    • Transformers: More recent architectures (e.g., BERT, GPT) excelling in sequence-to-sequence tasks and capturing long-range dependencies using attention mechanisms.
  • Learning Capacity: The "depth" of DNNs allows them to model highly complex, non-linear relationships within data that shallower networks cannot.

  • Backpropagation: The primary algorithm used to train DNNs by adjusting weights based on the gradient of a loss function.

  • Challenges (addressed by BN, Dropout, etc.):

    • Overfitting: DNNs with many parameters are prone to overfitting, which is mitigated by techniques like Dropout and Batch Normalization.
    • Vanishing/Exploding Gradients: Deep architectures can suffer from unstable gradients, addressed by better activation functions (ReLU), Batch Normalization, and specialized architectures (ResNets, LSTMs).
    • Computational Cost: Training and inference can be computationally intensive, requiring specialized hardware (GPUs, TPUs) and optimized software frameworks.

The provided material on "Fundamentals of Machine Learning" also discusses DNNs. It describes a DNN with $n$ hidden layers as a nested composition of affine pre-activations and activation functions: $f(\mathbf{x}) = f_{\text{activation}}^{(n+1)}(h^{(n+1)}(f_{\text{activation}}^{(n)}(\ldots f_{\text{activation}}^{(1)}(h^{(1)}(\mathbf{x}))\ldots)))$, where each pre-activation $h^{(l)}(\mathbf{x}) = W^{(l)}\mathbf{x} + b^{(l)}$ is a linear transformation with weight matrix $W^{(l)}$ and bias $b^{(l)}$, and $f_{\text{activation}}^{(l)}$ is the (typically non-linear) activation function of layer $l$.
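The layered composition above corresponds to a forward pass that alternates affine pre-activations with element-wise activations. A NumPy sketch with ReLU hidden layers and a linear output (the layer sizes are chosen arbitrarily for illustration):

```python
import numpy as np

def forward(x, weights, biases):
    """Forward pass of a feedforward DNN: ReLU on hidden layers, linear output layer."""
    a = x
    for l, (W, b) in enumerate(zip(weights, biases)):
        z = W @ a + b                                             # pre-activation h^(l)
        a = np.maximum(0.0, z) if l < len(weights) - 1 else z     # activation (ReLU), linear at output
    return a

# Two hidden layers: 4 -> 5 -> 3 -> 2
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(5, 4)), rng.normal(size=(3, 5)), rng.normal(size=(2, 3))]
bs = [np.zeros(5), np.zeros(3), np.zeros(2)]
y = forward(rng.normal(size=4), Ws, bs)
```

Backpropagation then computes the gradient of a loss with respect to each $W^{(l)}$ and $b^{(l)}$ by applying the chain rule backwards through this same composition.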

The material also highlights that DNNs are widely used for automatic speech recognition (modeling acoustics) and for computer vision via CNNs, both of which benefit from DNNs' shared-weight architecture and translation-invariant feature extraction.

3.6 Domain Adaptation and Transfer Learning

These techniques address the challenge of applying machine learning models trained on one dataset or domain to a different but related dataset or domain.

Domain Adaptation

Domain adaptation aims to enable a model trained on a source domain (with abundant labeled data) to perform well on a target domain (with scarce or unlabeled data) where the data distributions are different but the underlying task is the same.

  • Problem: When a model trained on a source domain is directly applied to a target domain, performance often drops due to "domain shift" (differences in data distribution).
  • Goal: To minimize the performance degradation when the test data distribution is different from the training data distribution.
  • Approaches:
    • Feature-based Domain Adaptation: Transforms features from either domain (or both) into a common, domain-invariant feature space.
    • Model-based Domain Adaptation: Modifies the model's parameters to adapt to the target domain.
    • Adversarial Domain Adaptation: Uses adversarial networks (similar to GANs) to learn domain-invariant features.
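A very crude feature-based adaptation is to standardize target-domain features so their statistics match the source domain. This is only a sketch of the idea (far simpler than learned domain-invariant features); the function name is illustrative:

```python
import numpy as np

def align_to_source(target_feats, source_feats):
    """Shift and scale target features so per-dimension mean/std match the source domain."""
    t_mu, t_sd = target_feats.mean(axis=0), target_feats.std(axis=0) + 1e-8
    s_mu, s_sd = source_feats.mean(axis=0), source_feats.std(axis=0) + 1e-8
    return (target_feats - t_mu) / t_sd * s_sd + s_mu
```

After alignment, a classifier trained on source features sees target features whose first- and second-order statistics no longer shift, which mitigates (but does not eliminate) domain shift.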

Transfer Learning

Transfer learning is a broader concept where knowledge gained from solving one problem (source task) is applied to a different but related problem (target task).

  • Mechanism: Instead of training a model from scratch for the target task, a pre-trained model (often a deep neural network trained on a very large and diverse dataset like ImageNet) is used as a starting point.
  • Common Transfer Learning Strategies for Vision:
    1. Feature Extraction: The pre-trained CNN (without its final classification layers) is used as a fixed feature extractor. New data is passed through it, and the extracted features are then fed into a new, smaller classifier (e.g., SVM, shallow neural network). This is effective when the new dataset is small and similar to the pre-training data.
    2. Fine-tuning: The pre-trained CNN's weights are used as an initialization, and then the entire network (or a portion of its later layers) is re-trained (fine-tuned) on the new dataset. This is effective when the new dataset is large or significantly different from the pre-training data. The learning rate during fine-tuning is often set to a very small value to avoid drastically altering the learned features too quickly.
  • Benefits:
    • Reduced Training Time: Much faster than training from scratch.
    • Improved Performance: Especially beneficial when the target dataset is small, as it leverages vast knowledge from large datasets.
    • Less Data Required: Can achieve good performance with less labeled data for the target task.
  • Applications:
    • Medical Image Analysis: Pre-training CNNs on natural images (ImageNet) and then fine-tuning them for medical tasks like lesion detection or classification (as mentioned in the medical imaging material by Shin et al. and Cheng et al.).
      • Cheng et al. pre-trained a stacked denoising autoencoder (SDAE) to determine malignancy from chest CT images, achieving better performance than traditional CADx systems [Browse 3].
      • Shin et al. tested various pre-trained CNN architectures, demonstrating performance improvement in CT patch-based thoraco-abdominal lymph node detection and ILD classification through transfer learning [Browse 3].
    • Industrial Inspection: Adapting models trained on general images to specific defect detection tasks on industrial product images.
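Strategy 1 (feature extraction) can be sketched end-to-end in NumPy: a frozen, randomly initialized projection stands in for a pretrained CNN backbone, and only a small logistic-regression head on top is trained. Everything here (dimensions, learning rate, step count) is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen "pretrained" feature extractor (a stand-in for a real CNN backbone).
W_frozen = rng.normal(size=(16, 64))

def extract_features(x):
    return np.maximum(0.0, W_frozen @ x)   # fixed ReLU features, never updated

def train_head(X, y, lr=0.5, steps=300):
    """Train only a logistic-regression head on top of the frozen features."""
    feats = np.stack([extract_features(x) for x in X])
    w, b = np.zeros(feats.shape[1]), 0.0
    for _ in range(steps):
        logits = np.clip(feats @ w + b, -30, 30)   # clip to keep exp() stable
        p = 1.0 / (1.0 + np.exp(-logits))          # sigmoid
        grad = p - y                               # d(BCE)/d(logit)
        w -= lr * feats.T @ grad / len(y)
        b -= lr * grad.mean()
    return w, b
```

Fine-tuning (strategy 2) would differ only in that `W_frozen` would also receive gradient updates, usually with a much smaller learning rate than the head.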

Example from Medical Imaging (Revisit): The Hanyang Medical Reviews paper on "Deep Learning for Medical Image Analysis" explicitly mentions transfer learning. For instance, Hosseini-Asl et al. proposed a 3D CNN model for Alzheimer's disease progression detection from brain MR images, where they used a transfer learning method. They pre-trained a Convolutional Autoencoder (CAE) on a small number of source domain images to learn features, then fine-tuned these features on target domain data to train the actual classification model. This approach achieved better classification performance than direct supervised training on the target domain [Browse 3].

This concludes Unit 3. Let me know if you'd like to proceed to Unit 4!


References for Browsed Content: [Browse 3] Hanyang Med Rev 2017;37:61-70 - Deep Learning for Medical Image Analysis