This unit covers feature representation: how raw data, especially images and audio, is transformed into a form that machine learning algorithms can process effectively for pattern recognition.

Unit-2: Feature Representation [4 Hrs.]

2.1 Fundamental of Image Acquisition and Preprocessing

Feature representation is the process of extracting meaningful information from raw data to make it suitable for pattern recognition algorithms. This often involves converting raw, analog signals into digital data and then enhancing or simplifying it.

Image Acquisition

Image acquisition is the initial step in digital image processing, where a digital image is obtained from a real-world scene. This process involves converting physical light energy into a discrete set of numerical values that a computer can process.

As defined in Fundamentals of Digital Image Processing by Anil K. Jain, image acquisition is "the process of obtaining a digital image from a real-world scene through the use of an imaging system, such as a camera or a scanner, which converts the spatial distribution of light intensity into a discrete set of numerical values."

The principal energy source for images used today is the electromagnetic energy spectrum. Different parts of this spectrum are used for various imaging modalities, each capturing different types of information and having distinct properties (frequency, wavelength, applications).

The Electromagnetic Spectrum (Overview):

| Region | Typical Frequency Range (Hz) | Typical Wavelength Range (m) | Key Characteristics & Applications |
| --- | --- | --- | --- |
| Gamma Rays | $10^{19} - 10^{22}$ | $10^{-12} - 10^{-15}$ | Characteristics: Highest energy, shortest wavelength. Generated by radioactive decay and nuclear processes. Uses: Nuclear medicine (e.g., PET scans), astronomical observations (studying high-energy events like supernovae and black holes). |
| X-Rays | $10^{16} - 10^{19}$ | $10^{-8} - 10^{-12}$ | Characteristics: High energy, penetrate soft tissues but are absorbed by denser materials like bone. Uses: Medical diagnostics (e.g., X-ray radiography, CT scans), industrial inspection (detecting flaws in materials), airport security, and astronomical observations (studying hot gases in galaxy clusters). |
| Ultraviolet (UV) | $10^{14} - 10^{16}$ | $10^{-7} - 10^{-8}$ | Characteristics: More energetic than visible light, can cause sunburn. Uses: Lithography, industrial inspection (detecting invisible flaws), microscopy (enhancing resolution), lasers, biological applications (sterilization, DNA analysis), and astronomical observations (studying hot young stars and distant galaxies). |
| Visible Light | $4.3 \times 10^{14} - 7.5 \times 10^{14}$ | $400 \text{ nm} - 700 \text{ nm}$ | Characteristics: The portion of the EM spectrum detectable by the human eye, including all colors of the rainbow (violet to red). Lies between infrared and ultraviolet in terms of frequency/wavelength. Uses: Everyday photography, optical microscopes, human vision, and conventional cameras for image acquisition. This is the most common range for everyday visual pattern recognition. |
| Infrared (IR) | $10^{11} - 4 \times 10^{14}$ | $700 \text{ nm} - 1 \text{ mm}$ | Characteristics: Lower frequency than visible light, carries heat, not visible to the human eye. Uses: Thermal imaging (night vision, detecting heat signatures, industrial temperature monitoring), remote controls, fiber optics, medical diagnostics (detecting inflammation), and astronomy (observing cooler objects and dust clouds). |
| Microwaves | $10^8 - 10^{11}$ | $1 \text{ mm} - 1 \text{ m}$ | Characteristics: Longer wavelength than IR. Some radar waves can penetrate clouds, vegetation, ice, and dry sand. Uses: Radar (imaging across weather conditions, exploring inaccessible regions of Earth's surface), microwave ovens, telecommunications, and remote sensing (e.g., Synthetic Aperture Radar (SAR) for Earth observation). |
| Radio Waves | $10^4 - 10^8$ | $> 1 \text{ m}$ | Characteristics: Longest wavelength, lowest energy. Uses: Radio communication, broadcasting, MRI (radiofrequency pulses interact with the body's hydrogen atoms to create images), astronomy (radio telescopes observing distant galaxies and phenomena). |
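The frequency and wavelength columns are two views of the same quantity, linked by $c = \lambda \nu$. A minimal sketch of the conversion follows; the function names are illustrative, and because the band boundaries in the table are rounded, computed values agree with it only approximately.

```python
# Speed of light in vacuum (m/s); exact by SI definition.
C = 299_792_458.0


def wavelength_to_frequency(wavelength_m: float) -> float:
    """Return the frequency in Hz for a wavelength given in metres."""
    return C / wavelength_m


def frequency_to_wavelength(frequency_hz: float) -> float:
    """Return the wavelength in metres for a frequency given in Hz."""
    return C / frequency_hz


# Visible band edges quoted in the table: 400 nm (violet) and 700 nm (red).
violet_hz = wavelength_to_frequency(400e-9)
red_hz = wavelength_to_frequency(700e-9)
print(f"400 nm -> {violet_hz:.2e} Hz")
print(f"700 nm -> {red_hz:.2e} Hz")
```

Shorter wavelengths map to higher frequencies, which is why the table's frequency column rises as its wavelength column falls.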

Discussion Questions on Imaging Across the Electromagnetic Spectrum:

  • How is imaging performed using different types of electromagnetic waves (e.g., infrared, visible, ultraviolet, X-ray, microwave)?
    • Infrared: Thermal cameras detect IR radiation emitted by objects (heat) and convert it into visible images.
    • Visible: Standard cameras capture reflected visible light.
    • Ultraviolet: UV cameras capture reflected or emitted UV light, often used for detailed surface inspection or forensic analysis.
    • X-ray: X-ray machines pass X-rays through an object; denser parts absorb more X-rays, creating a shadow image on a detector.
    • Microwave: Radar systems emit microwaves and detect the reflected signals, using the time and intensity of reflection to create an image, often used for mapping and weather.
  • What are the principles behind capturing images in each case?
    • Absorption/Reflection: Different materials absorb, reflect, or emit EM radiation differently across the spectrum based on their physical and chemical properties. Imaging systems exploit these differences.
    • Energy Detection: Sensors are designed to detect specific wavelengths/frequencies and convert the detected energy into electrical signals, which are then processed into an image.
    • Interaction with Matter: For X-rays, the principle is differential absorption. For thermal imaging, it's blackbody radiation. For radar, it's reflection time and intensity.
  • What types of cameras or sensors are used?
    • Gamma/X-ray: Scintillation detectors, CCDs (Charge-Coupled Devices), flat-panel detectors.
    • UV/Visible: CMOS (Complementary Metal-Oxide-Semiconductor) and CCD sensors.
    • Infrared: Bolometers, thermopiles, focal plane arrays made of specialized materials like InGaAs, MCT.
    • Microwave/Radio: Antennas, radar receivers.
  • How are images acquired and stored from each type of electromagnetic wave-based imaging system?
    • Acquisition: The EM waves interact with the scene, and specialized sensors detect the altered waves. These sensors convert the energy into electrical signals.
    • Digitization: These analog electrical signals are then digitized by an Analog-to-Digital Converter (ADC) into discrete numerical values (pixels).
    • Storage: The digital pixel values are arranged into matrices (for 2D images) or tensors (for color or 3D images) and stored in various file formats (e.g., JPEG, PNG, DICOM, TIFF, RAW, WebP, GIF, BMP). The format depends on the type of data, application, and encoding method (e.g., lossy vs. lossless compression).
  • Compare the file formats, resolutions, and processing requirements across systems (e.g., thermal images, satellite images, X-ray scans).
    • File Formats: Vary widely (e.g., DICOM for medical, JPEG/PNG for visible, GeoTIFF for satellite).
    • Resolution: Ranges from micrometers (medical/industrial X-rays) to meters/kilometers (satellite).
    • Processing Requirements: High-resolution and multi-spectral data (like satellite or medical images) require significant computational power for storage, processing, and analysis (e.g., specialized algorithms for noise reduction, enhancement, feature extraction).
  • What types of waves are emitted or detected by sensors used in various data acquisition tasks?
    • Active Sensors: Emit their own energy and detect the reflection (e.g., LiDAR emits pulsed laser light in the near-infrared or green spectrum and detects reflected light; SAR emits microwaves).
    • Passive Sensors: Detect naturally occurring energy (e.g., optical cameras detect reflected visible light; thermal cameras detect emitted infrared radiation).
  • For example, what does a LiDAR sensor emit? How does an infrared sensor detect temperature?
    • A LiDAR sensor typically emits pulsed laser light in the near-infrared (around 900-1064 nm) or green (around 532 nm for bathymetry) spectrum.
    • An infrared sensor (thermal camera) detects temperature by measuring the intensity of infrared radiation (heat) emitted by an object. All objects with a temperature above absolute zero emit IR radiation; the intensity and wavelength distribution of this radiation are directly related to the object's temperature.
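The link between temperature and emitted infrared radiation can be made concrete with two standard blackbody relations: Wien's displacement law for the peak wavelength and the Stefan-Boltzmann law for total emitted power. The sketch below assumes an ideal blackbody by default; real surfaces have emissivity below 1, and the function names are illustrative only.

```python
WIEN_B = 2.897771955e-3            # Wien displacement constant, m*K
STEFAN_BOLTZMANN = 5.670374419e-8  # Stefan-Boltzmann constant, W m^-2 K^-4


def peak_wavelength_m(temperature_k: float) -> float:
    """Wavelength (m) at which a blackbody at this temperature emits most strongly."""
    return WIEN_B / temperature_k


def radiant_exitance(temperature_k: float, emissivity: float = 1.0) -> float:
    """Total power emitted per unit area (W/m^2); emissivity=1.0 is an ideal blackbody."""
    return emissivity * STEFAN_BOLTZMANN * temperature_k ** 4


body_peak = peak_wavelength_m(310.0)  # roughly human body temperature
print(f"Peak emission at 310 K: {body_peak * 1e6:.2f} micrometres")
print(f"Exitance at 310 K: {radiant_exitance(310.0):.0f} W/m^2")
```

The peak for objects near room or body temperature falls in the long-wave infrared, which is the band thermal cameras are built to detect.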

Image Preprocessing

Image preprocessing is a crucial step in digital image processing, aimed at improving image quality and making images more amenable to further analysis.

1. What is Digital Image Processing (DIP)? Digital Image Processing (DIP) is a rapidly evolving field involving the manipulation and analysis of images using computer algorithms. It plays a crucial role in various industries such as medical imaging, satellite imaging, industrial inspection, and multimedia applications. With advancements in artificial intelligence (AI) and machine learning (ML), DIP has become more sophisticated, enabling tasks like object detection, facial recognition, and automated quality control.

DIP can be divided into three main stages:

  1. Preprocessing: Enhancing raw images for better analysis.
  2. Processing: Applying algorithms for feature extraction and transformation.
  3. Post-processing: Refining results for visualization or further analysis.

1. Preprocessing (Detailed Techniques): Digital image preprocessing improves image quality and prepares images for subsequent analysis. Key techniques include:

  • Grayscale Conversion:

    • Converts a colored image (e.g., RGB) to a single channel (grayscale) image.
    • Reduces image complexity and makes processing more efficient by combining the R, G, and B channels of each pixel into a single intensity value.
  • Image Resizing:

    • Adjusts image dimensions to a specific size, either to fit a model's input requirements or to reduce computational load.
    • Involves scaling pixel values based on the target size, often through interpolation methods like nearest neighbor, bilinear, or bicubic.
  • Image Smoothing (Blurring):

    • Reduces noise or fine details to make larger structures or features easier to detect.
    • Gaussian blur: Applies a weighted average of neighboring pixels.
    • Median filter: Replaces each pixel with the median value of its neighborhood, effective for removing "salt-and-pepper noise".
  • Image Thresholding:

    • Converts a grayscale image into a binary image (pixels are either black or white).
    • Commonly used in object detection or segmentation. A threshold value is set: pixels above it become white, and below it become black.
    • Otsu’s method: Automatically determines an optimal global threshold value.
  • Edge Detection:

    • Detects boundaries or transitions between different regions, crucial for object recognition and segmentation.
    • Sobel operator: Detects edges by calculating intensity gradients in horizontal and vertical directions.
    • Canny edge detector: A multi-step algorithm that detects edges by finding areas with significant intensity change.
  • Noise Reduction:

    • Removes unwanted random variations (noise) that distort the image, improving clarity.
    • Filters such as Gaussian, median, and Wiener filters are used to suppress noise while preserving as much important detail as possible; stronger smoothing removes more noise but also blurs fine structure.
  • Histogram Equalization:

    • Enhances image contrast by spreading out intensity levels across the entire range.
    • Redistributes pixel intensity values to maximize contrast and improve detail visibility.
  • Morphological Operations:

    • Modifies the shape or structure of objects in an image, used in tasks like image segmentation.
    • Dilation: Expands object boundaries.
    • Erosion: Shrinks object boundaries.
    • Opening and Closing: Combinations of dilation and erosion used to remove small objects or fill small holes.
  • Contrast Adjustment:

    • Enhances the image by increasing the contrast between different regions, making features more distinguishable.
    • Methods include linear contrast adjustment and gamma correction (adjusts midtones while preserving shadows and highlights).
  • Geometric Transformations:

    • Changes the position, orientation, or size of the image, often used in data augmentation or correcting for distortions.
    • Rotation: Rotating the image by a specific angle.
    • Translation: Shifting the image along the x or y axis.
    • Affine transformations: Scaling, rotating, or translating an image using matrix operations.
  • Image Inversion (Negative):

    • Reverses the color values of an image, useful for specific types of analysis.
    • Each pixel's color value is subtracted from the maximum possible value (e.g., 255 for an 8-bit image).
  • Color Space Conversion:

    • Transforms the image from one color space to another (e.g., from RGB to HSV or YCbCr).
    • Helps isolate certain color channels or enhance features for specific image analysis tasks.
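Three of the techniques listed above (grayscale conversion, image inversion, and Otsu thresholding) can be sketched on a tiny image stored as nested lists. This is a minimal illustration, not a production implementation: the ITU-R BT.601 luma weights are an assumption (the notes do not name specific weights), and all function names are hypothetical.

```python
def rgb_to_gray(rgb):
    """Combine R, G, B into one intensity per pixel (BT.601 weights, an assumption)."""
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
            for row in rgb]


def invert(gray, max_value=255):
    """Negative image: subtract every pixel from the maximum value."""
    return [[max_value - p for p in row] for row in gray]


def otsu_threshold(gray):
    """Return the 8-bit threshold that maximises between-class variance."""
    hist = [0] * 256
    for row in gray:
        for p in row:
            hist[p] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w_bg, sum_bg = 0, 0
    for t in range(256):
        w_bg += hist[t]          # pixels with value <= t (background)
        sum_bg += t * hist[t]
        if w_bg == 0:
            continue
        w_fg = total - w_bg      # pixels with value > t (foreground)
        if w_fg == 0:
            break
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(gray, t):
    """Pixels above the threshold become white (255), the rest black (0)."""
    return [[255 if p > t else 0 for p in row] for row in gray]


rgb = [[(255, 255, 255), (0, 0, 0)],
       [(200, 200, 200), (30, 30, 30)]]
gray = rgb_to_gray(rgb)
t = otsu_threshold(gray)
binary = binarize(gray, t)
negative = invert(gray)
```

On this toy image the two dark pixels and the two bright pixels form clearly separated groups, so the chosen threshold falls between them.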

Image Processing Techniques (General Categories):

  • Image Enhancement: Aims to improve the visual quality of an image for easier interpretation by humans or processing by machines. Includes contrast adjustment, sharpening (enhancing edges), and noise removal.
  • Image Restoration: The process of recovering an image that has been degraded by noise, blur, or other distortions. Unlike enhancement, restoration tries to reverse the damage. Common techniques include inverse filtering (undoes blurring), Wiener filtering (statistical noise/blur reduction), and deconvolution (reconstructs original image by reversing blur effects).
  • Image Segmentation: Divides an image into meaningful regions for analysis (e.g., objects, regions, boundaries). Common methods include thresholding (divides based on pixel intensity) and edge detection (identifies object boundaries). Region-based segmentation groups pixels with similar characteristics (color, texture).
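As one concrete enhancement example, histogram equalization (described earlier under preprocessing) remaps intensities through the cumulative histogram so that they spread over the full range. The sketch below is a simplified version using integer arithmetic on a nested-list image; the function name and the round-half-up choice are assumptions for illustration.

```python
def equalize_histogram(gray, levels=256):
    """Spread intensities over [0, levels-1] using the cumulative histogram."""
    hist = [0] * levels
    for row in gray:
        for p in row:
            hist[p] += 1

    # Cumulative distribution: number of pixels with value <= each level.
    cdf, running = [], 0
    for h in hist:
        running += h
        cdf.append(running)

    total = cdf[-1]
    cdf_min = next(c for c in cdf if c > 0)
    denom = total - cdf_min
    if denom == 0:
        # Constant image: nothing to stretch.
        return [row[:] for row in gray]

    # Lookup table with round-half-up integer division.
    lut = [(max(c - cdf_min, 0) * (levels - 1) + denom // 2) // denom for c in cdf]
    return [[lut[p] for p in row] for row in gray]


low_contrast = [[100, 101],
                [102, 103]]
stretched = equalize_histogram(low_contrast)
print(stretched)
```

The four nearly identical input values end up evenly spaced across the 8-bit range, which is the contrast-stretching effect the technique aims for.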

2.2 Spatial Filters: Laplacian and Sobel

Spatial filters are fundamental image processing techniques that operate directly on pixel values, typically by convolving the image with a small kernel (or mask). They are used for various purposes, including edge detection, smoothing, and sharpening.

Laplacian Filter: Edge Detection Using Second Order Derivatives

  1. Introduction: The Laplacian filter is a popular image processing technique primarily used for edge detection. Unlike first-order derivative filters (like Sobel or Prewitt), which detect edges by looking at the rate of change (gradient) of intensity, the Laplacian uses second-order derivatives to find regions where the intensity changes rapidly—i.e., edges.

  2. Mathematical Background: The Laplacian operator $\nabla^2$ is a scalar operator defined as the divergence of the gradient of a function. For a 2D image $I(x, y)$, the Laplacian is given by: $\nabla^2 I = \frac{\partial^2 I}{\partial x^2} + \frac{\partial^2 I}{\partial y^2}$ This measures the rate at which the gradient changes, effectively highlighting areas where intensity transitions sharply.

  3. Why Use Laplacian?

    • Edges correspond to places where the intensity changes abruptly.
    • First derivatives detect the rate of change.
    • The second derivative zero-crossings indicate precise edge locations.
    • The Laplacian filter highlights rapid intensity changes, producing a high response at edges.
  4. Discrete Laplacian Kernels: The Laplacian operator is implemented using convolution with discrete kernels. Common 3x3 Laplacian kernels include: $\begin{bmatrix} 0 & 1 & 0 \\ 1 & -4 & 1 \\ 0 & 1 & 0 \end{bmatrix}$ or $\begin{bmatrix} 1 & 1 & 1 \\ 1 & -8 & 1 \\ 1 & 1 & 1 \end{bmatrix}$ Each kernel sums the neighbors and subtracts the center pixel weighted by the number of neighbors, so the response is zero in flat regions and large where the intensity changes abruptly. The second kernel also includes the diagonal neighbors, which makes it more sensitive but also amplifies noise more.

  5. Applying the Laplacian Filter:

    • Convolve the image with the Laplacian kernel.
    • The output highlights edges by detecting zero-crossings—points where the Laplacian changes sign.
    • Since second derivatives amplify noise, pre-smoothing the image (e.g., Gaussian filter) is often performed before applying the Laplacian. This leads to the Laplacian of Gaussian (LoG) method.
  6. Laplacian of Gaussian (LoG): The LoG combines smoothing and edge detection. The LoG kernel is given by: $LoG(x, y) = -\frac{1}{\pi\sigma^4} \left(1 - \frac{x^2 + y^2}{2\sigma^2}\right) e^{-\frac{x^2 + y^2}{2\sigma^2}}$ Steps:

    1. Smooth the image with a Gaussian filter (to reduce noise).
    2. Apply the Laplacian operator.
    3. Detect zero-crossings in the result as edges.
  7. Properties of the Laplacian Filter:

    | Property | Description |
    | --- | --- |
    | Isotropic | Response is independent of edge orientation. |
    | Second Derivative | Sensitive to rapid intensity changes and noise. |
    | Zero-crossings | Edges occur where the Laplacian output crosses zero. Requires smoothing to reduce false edges. |
    | Noise Sensitivity | Highly sensitive to noise due to second-order derivatives. |

    Summary: The Laplacian filter is a second-order derivative operator for edge detection. It highlights regions of rapid intensity change. It is sensitive to noise, so it is often combined with Gaussian smoothing (LoG). It detects edges by finding zero-crossings in the filtered image and is isotropic and simple to implement.
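The steps above can be sketched on a small synthetic step edge: apply the 4-neighbor Laplacian kernel, then look for sign changes between horizontally adjacent responses. The helper names are hypothetical, only the valid interior is filtered (no padding), and `log_value` simply evaluates the LoG formula given in the text at a single point.

```python
import math

# 4-neighbour Laplacian kernel from the text.
LAPLACIAN_4 = [[0, 1, 0],
               [1, -4, 1],
               [0, 1, 0]]


def filter3x3(image, kernel):
    """Apply a 3x3 kernel over the valid interior (output shrinks by 2 in each axis)."""
    h, w = len(image), len(image[0])
    out = []
    for y in range(1, h - 1):
        row = []
        for x in range(1, w - 1):
            acc = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    acc += kernel[dy + 1][dx + 1] * image[y + dy][x + dx]
            row.append(acc)
        out.append(row)
    return out


def horizontal_zero_crossings(response):
    """Return (row, col) pairs where the response changes sign between col and col+1."""
    hits = []
    for y, row in enumerate(response):
        for x in range(len(row) - 1):
            if row[x] * row[x + 1] < 0:
                hits.append((y, x))
    return hits


def log_value(x, y, sigma):
    """Evaluate the Laplacian-of-Gaussian formula from the text at (x, y)."""
    r2 = x * x + y * y
    s2 = sigma * sigma
    return -1.0 / (math.pi * s2 * s2) * (1.0 - r2 / (2.0 * s2)) * math.exp(-r2 / (2.0 * s2))


# Vertical step edge: dark (10) on the left, bright (50) on the right.
step = [[10, 10, 10, 50, 50, 50] for _ in range(5)]
response = filter3x3(step, LAPLACIAN_4)
crossings = horizontal_zero_crossings(response)
print(response[0])
print(crossings)
```

The response is zero in the flat regions and swings from positive to negative across the step, so the zero-crossing sits exactly at the edge. Because the kernel is symmetric, correlation and convolution give the same result here.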

Sobel Filter: Mathematical Intuition, Construction, and Application

  1. Introduction to Gradient and Edge Detection: In image processing, detecting edges is critical for understanding structure, boundaries, and objects. One common method is to use the Sobel Filter, which approximates the first-order derivative (i.e., gradient) of image intensity. Gradient Definition: In 2D space, the gradient of an image $I$ is a vector: $\nabla I = \begin{bmatrix} \frac{\partial I}{\partial x} \\ \frac{\partial I}{\partial y} \end{bmatrix} = \begin{bmatrix} G_x \\ G_y \end{bmatrix}$ This vector indicates the direction and rate of fastest intensity change, which helps identify edges. $G_x$ is the gradient in the horizontal (x) direction, and $G_y$ is the gradient in the vertical (y) direction.

  2. Approximating Derivatives with Finite Differences: To compute gradients in digital images (discrete grids), finite difference operators are used. For example, the simplest approximation of the derivative in the x-direction is: [−1, 0, +1]

    • What it does: It subtracts the pixel on the left, ignores the center pixel, and adds the pixel on the right. This computes the change in pixel intensity along the horizontal direction (left to right).
    • Why this is useful: Edges in images typically occur where there is a sudden change in intensity. This filter detects such changes by computing the gradient:
      • If intensity increases from left to right $\Rightarrow$ the output is a positive value.
      • If intensity decreases from left to right $\Rightarrow$ the output is a negative value.
      • If there is no change $\Rightarrow$ the output is zero.
    • Thus, this filter acts like a slope detector in the horizontal direction, highlighting vertical edges in the image.
  3. Prewitt Filter (Extended to 2D): The Prewitt kernels for the x-gradient ($G_x$) and y-gradient ($G_y$) are: $G_x = \begin{bmatrix} -1 & 0 & +1 \\ -1 & 0 & +1 \\ -1 & 0 & +1 \end{bmatrix}$ and $G_y = \begin{bmatrix} -1 & -1 & -1 \\ 0 & 0 & 0 \\ +1 & +1 & +1 \end{bmatrix}$

  4. The Sobel Filter: Motivation and Construction: The Sobel filter improves upon the Prewitt filter by giving more weight to the center row/column to reduce noise sensitivity.

    • Sobel $G_x$ (Vertical Edge Detection): $G_x = \begin{bmatrix} -1 & 0 & +1 \\ -2 & 0 & +2 \\ -1 & 0 & +1 \end{bmatrix}$
    • Sobel $G_y$ (Horizontal Edge Detection): $G_y = \begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ +1 & +2 & +1 \end{bmatrix}$
    • Why These Values?:
      1. The Derivative Component: The row vector [−1 0 +1] is a basic first-order difference operator approximating the gradient in the x-direction.
      2. The Smoothing Component: The column vector [1 2 1]^T is a 1D approximation of a Gaussian function (derived from Pascal’s Triangle). It reduces noise by assigning more weight to the central pixel.
    • Combining these makes Sobel filters more robust to noise than simple difference operators.
  5. Example Use Case: Detecting Edges: Suppose we apply $G_x$ and $G_y$ to a grayscale image. For each pixel:

    1. Apply $G_x$ to get the horizontal gradient $I_x$.
    2. Apply $G_y$ to get the vertical gradient $I_y$.
    3. Compute gradient magnitude: $M = \sqrt{I_x^2 + I_y^2}$
    4. Edges correspond to pixels where $M$ exceeds a certain threshold.
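    5. Optionally, compute direction: $\theta = \tan^{-1}\left(\frac{I_y}{I_x}\right)$. In practice the two-argument form $\text{atan2}(I_y, I_x)$ is used so that $I_x = 0$ and the quadrant of the angle are handled correctly.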
      • This angle tells you in which direction the intensity is increasing (measured counter-clockwise from the x-axis in degrees or radians).

    Example: Gradient Direction Calculation Using the Sobel Operator

    Consider a 3 × 3 grayscale image patch centered at pixel P: $I = \begin{bmatrix} 100 & 100 & 100 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}$

    Apply the Sobel operators in the x and y directions: $G_x = \begin{bmatrix} -1 & 0 & +1 \\ -2 & 0 & +2 \\ -1 & 0 & +1 \end{bmatrix}$, $G_y = \begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ +1 & +2 & +1 \end{bmatrix}$

    To compute the gradients at the center pixel $P = (2, 2)$, multiply each kernel entry by the pixel beneath it and sum (rows listed top to bottom):

    $G_x(P) = (-1)(100) + (0)(100) + (+1)(100) + (-2)(0) + (0)(0) + (+2)(0) + (-1)(0) + (0)(0) + (+1)(0) = 0$

    $G_y(P) = (-1)(100) + (-2)(100) + (-1)(100) + (0)(0) + (0)(0) + (0)(0) + (+1)(0) + (+2)(0) + (+1)(0) = -400$

    So, the gradient components are: $G_x = 0, G_y = -400$. The sign is negative because $G_y$ measures change along increasing row index (downward), and here the intensity falls from the top row to the rows below it.

    Gradient Magnitude: $\parallel \nabla I \parallel = \sqrt{G_x^2 + G_y^2} = \sqrt{0^2 + (-400)^2} = 400$.

    Gradient Direction: since $G_x = 0$, the ratio $G_y / G_x$ is undefined, so the two-argument arctangent is used: $\theta = \text{atan2}(G_y, G_x) = \text{atan2}(-400, 0) = -90^\circ$ in image coordinates (y pointing down). This corresponds to $90^\circ$ when the y-axis is drawn pointing up.

    • The gradient magnitude is large: strong edge detected.
    • The gradient points toward the top of the patch: intensity increases upward.
    • This means the edge is horizontal, separating a dark bottom from a bright top.

    Why Is Direction Important?

    • In edge detection (e.g., Canny edge detector), the gradient direction helps in tracing the edge path more accurately.
    • It enables non-maximum suppression—a technique to retain only the strongest edge pixels aligned with the direction of the gradient.
    • Gradient direction is essential in higher-level tasks such as shape analysis, object detection, optical flow estimation, and feature matching and tracking.
  6. Use in Convolutional Neural Networks (CNNs): In classical Computer Vision (CV), Sobel filters were explicitly applied to detect edges. In modern CNNs:

    • CNNs learn filters (including edge detectors) automatically via training.
    • The first convolutional layer often learns filters resembling Sobel/Prewitt because edges are fundamental features.
    • Unlike fixed Sobel filters, CNN filters are optimized for the specific task (e.g., face detection, object classification).

    Summary Table of Sobel Filter:

    | Step | Explanation |
    | --- | --- |
    | 1 | Use finite difference to approximate derivatives. |
    | 2 | Add Gaussian-like smoothing [1 2 1]^T to reduce noise. |
    | 3 | Combine smoothing $\times$ derivative to get the Sobel kernel. |
    | 4 | Apply $G_x, G_y$ to the image and compute the gradient magnitude. |
    | 5 | Threshold to detect edges. |
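The construction summarized above can be sketched directly: build each Sobel kernel as the outer product of the smoothing vector and the difference vector, then evaluate both on the 3 × 3 patch with a bright top row. This sketch assumes plain correlation (kernel entry times the pixel beneath it) with rows indexed top to bottom. Under that convention the vertical response for this patch is negative; flipping the kernel (true convolution) or drawing the y-axis upward flips the sign. Helper names are illustrative.

```python
import math

SMOOTH = [1, 2, 1]   # Gaussian-like smoothing weights
DERIV = [-1, 0, 1]   # first-order central difference


def outer(col, row):
    """Outer product of a column vector and a row vector as a nested list."""
    return [[c * r for r in row] for c in col]


SOBEL_X = outer(SMOOTH, DERIV)  # smooth vertically, differentiate horizontally
SOBEL_Y = outer(DERIV, SMOOTH)  # differentiate vertically, smooth horizontally


def correlate_at_center(patch, kernel):
    """Sum of element-wise products of a 3x3 patch and a 3x3 kernel."""
    return sum(kernel[i][j] * patch[i][j] for i in range(3) for j in range(3))


patch = [[100, 100, 100],
         [0, 0, 0],
         [0, 0, 0]]

gx = correlate_at_center(patch, SOBEL_X)
gy = correlate_at_center(patch, SOBEL_Y)
magnitude = math.hypot(gx, gy)
theta_deg = math.degrees(math.atan2(gy, gx))  # atan2 handles gx == 0 safely

print(SOBEL_X)
print(SOBEL_Y)
print(gx, gy, magnitude, theta_deg)
```

Using `atan2` rather than a plain arctangent of the ratio avoids the division by zero that occurs whenever the horizontal gradient vanishes, as it does for a purely horizontal edge.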

2.3 Histogram of Oriented Gradients (HOG)

The Histogram of Oriented Gradients (HOG) is a feature descriptor used in computer vision and image processing for object detection. It works by capturing edge or gradient orientation information within localized regions of an image. The main idea behind HOG is that local object appearance and shape can be characterized by the distribution of local intensity gradients or edge directions.

Step-by-Step Explanation:

  1. Input Image:

    • Start with an input image. Typically, images are converted to grayscale to simplify gradient computation. This reduces complexity and focuses on intensity changes.
  2. Gamma and Color Normalization:

    • To reduce the effect of illumination variations, gamma correction is applied: $I_{\text{normalized}} = I^\gamma$, commonly with $\gamma = 0.5$.
    • This normalization adjusts the pixel intensity distribution and enhances gradient features, making the system less sensitive to lighting changes.
  3. Compute Gradients:

    • Use filters (e.g., Sobel operator) to compute gradients in the x and y directions for each pixel. These filters highlight changes in intensity.
    • For a pixel at $(x, y)$, the gradients $G_x(x, y)$ and $G_y(x, y)$ are calculated.
    • The gradient magnitude $M(x, y)$ and orientation $\theta(x, y)$ at each pixel are then calculated as: $M(x, y) = \sqrt{G_x(x, y)^2 + G_y(x, y)^2}$ and $\theta(x, y) = \arctan\left(\frac{G_y(x, y)}{G_x(x, y)}\right)$
    • The orientation is typically in degrees (0° to 180° for unsigned gradients or 0° to 360° for signed gradients).
  4. Weighted Vote into Spatial and Orientation Cells:

    • The image is divided into small, non-overlapping "cells" (typically $8 \times 8$ pixels).
    • For each pixel within a cell, its gradient magnitude and orientation are considered.
    • Each pixel "votes" into an "orientation histogram" for its cell. The vote is weighted by the gradient magnitude, meaning stronger edges contribute more.
    • The orientations are divided into a fixed number of "bins" (e.g., 9 bins for 0°–180°, with each bin covering 20°).
    • Example: Assume the following orientation and magnitude values for a cell:
      • Orientations (in degrees): 10, 20, 40, 45, 50, 80, 100, 130, 150, 170
      • Magnitudes: 1, 2, 1.5, 2.5, 1, 2, 1.5, 0.5, 2.5, 1
      • Divide orientations into 9 bins (0°–180°, each 20° wide). Each value votes for the bin whose range contains it (lower bound inclusive, upper bound exclusive), weighted by its magnitude.
      • Table: Explanation of Bins in HOG Histogram (Bin size = 20°)

        | Bin Range (°) | Orientations That Fall In (°) | Magnitude Sum (Vote) |
        | --- | --- | --- |
        | 0–20 | 10 | 1 |
        | 20–40 | 20 | 2 |
        | 40–60 | 40, 45, 50 | $1.5 + 2.5 + 1 = 5$ |
        | 60–80 | None | 0 |
        | 80–100 | 80 | 2 |
        | 100–120 | 100 | 1.5 |
        | 120–140 | 130 | 0.5 |
        | 140–160 | 150 | 2.5 |
        | 160–180 | 170 | 1 |

      • The resulting histogram (e.g., a bar chart showing the magnitude sum for each bin) represents the local gradient distribution within that cell.
  5. Contrast Normalize Over Overlapping Blocks:

    • To ensure invariance to local changes in brightness and contrast, cells are grouped into larger, overlapping "blocks" (e.g., $2 \times 2$ cells).
    • The concatenated histogram vector from all cells within a block is then normalized. A common normalization technique is L2-normalization: $v_{\text{norm}} = \frac{v}{\sqrt{\parallel v \parallel^2_2 + \epsilon^2}}$, where $\epsilon$ is a small constant (e.g., $10^{-5}$) to prevent division by zero.
    • This normalization makes the descriptor more robust to variations in illumination.
  6. Collect HOG Features Over Detection Window:

    • A detection window (e.g., $64 \times 128$ pixels) slides across the entire image.
    • At each location of the detection window, the normalized block histograms from all blocks within that window are concatenated to form a single, long HOG descriptor (feature vector) for that window.
    • This descriptor represents the image content within that specific detection window.
  7. Linear SVM (for Classification):

    • A linear Support Vector Machine (SVM) is then trained using these HOG descriptors.
    • The training involves labeled samples (e.g., pedestrian vs. non-pedestrian).
    • Each detection window is classified as containing a person or not, based on the SVM's decision function: $f(x) = w^T x + b$.
    • If $f(x) > \text{threshold}$, the window is classified as containing a person; otherwise, it's a non-person.
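The cell voting in step 4 and the block normalization in step 5 can be sketched with the example orientations and magnitudes from the text. This is a simplified version: each value votes only for the bin whose range contains it, whereas practical HOG implementations usually interpolate votes between neighboring bins. Function names are illustrative.

```python
import math


def cell_histogram(orientations_deg, magnitudes, n_bins=9, max_angle=180.0):
    """Magnitude-weighted orientation histogram for one cell (unsigned gradients)."""
    bin_width = max_angle / n_bins
    hist = [0.0] * n_bins
    for theta, mag in zip(orientations_deg, magnitudes):
        index = int((theta % max_angle) // bin_width)  # 180 deg wraps to bin 0
        hist[index] += mag
    return hist


def l2_normalize(vector, eps=1e-5):
    """L2 block normalization: v / sqrt(||v||^2 + eps^2)."""
    norm = math.sqrt(sum(v * v for v in vector) + eps * eps)
    return [v / norm for v in vector]


orientations = [10, 20, 40, 45, 50, 80, 100, 130, 150, 170]
magnitudes = [1, 2, 1.5, 2.5, 1, 2, 1.5, 0.5, 2.5, 1]

hist = cell_histogram(orientations, magnitudes)
normalized = l2_normalize(hist)
print(hist)
```

In a full descriptor the vector passed to `l2_normalize` would be the concatenation of all cell histograms in a block, not a single cell as in this toy example.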

Applications of HOG: HOG features are widely used in pedestrian detection, object recognition, and other computer vision tasks where robust shape and appearance descriptors are needed.
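For a sense of scale, the descriptor length and the linear SVM decision rule can be sketched as below. The block stride of one cell (50% overlap between 2 × 2 blocks) is an assumption, since the text only says blocks overlap; the weights and features in any real use come from training, and the names here are hypothetical.

```python
def hog_descriptor_length(window_w=64, window_h=128, cell=8,
                          block_cells=2, stride_cells=1, n_bins=9):
    """Number of values in the HOG vector for one detection window."""
    cells_x = window_w // cell
    cells_y = window_h // cell
    # Block positions when sliding by stride_cells (assumed: one cell).
    blocks_x = (cells_x - block_cells) // stride_cells + 1
    blocks_y = (cells_y - block_cells) // stride_cells + 1
    values_per_block = block_cells * block_cells * n_bins
    return blocks_x * blocks_y * values_per_block


def svm_decision(weights, features, bias):
    """Linear SVM decision function f(x) = w^T x + b."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def is_person(weights, features, bias, threshold=0.0):
    """Classify a window as 'person' when f(x) exceeds the threshold."""
    return svm_decision(weights, features, bias) > threshold


length = hog_descriptor_length()
print(length)
```

With these settings a 64 × 128 window holds 7 × 15 block positions of 36 values each, so the classifier operates on a few thousand features per window.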

2.4 Scale-Invariant Feature Transform (SIFT)

The Scale-Invariant Feature Transform (SIFT) is a powerful algorithm in computer vision for detecting and describing local features in images. SIFT features are designed to be invariant to image scale and rotation, and partially invariant to illumination changes and affine distortion. This makes them highly robust for matching features across different views of an object or scene.

Key Concepts and Steps:

  1. Scale-Space Extrema Detection:

    • The first step involves convolving the image with Gaussian filters at different scales to create a scale space. This simulates looking at the image at different levels of blur.
    • Difference of Gaussians (DoG): To efficiently detect potential interest points that are scale-invariant, the difference of successive Gaussian-blurred images (DoG) is computed.
    • Candidate Keypoints: Local maxima and minima in the DoG images across different scales and spatial locations are identified as candidate keypoints. This aims to find points that are stable across scale changes.
  2. Keypoint Localization:

    • Once candidate keypoints are found, they are refined to sub-pixel accuracy to improve their localization.
    • Discarding Low-Contrast and Edge Points: Points with low contrast (unreliable) or those lying on edges (prone to noise and unstable localization) are discarded. This improves robustness.
  3. Orientation Assignment:

    • For each retained keypoint, one or more orientations are assigned based on the local image gradient directions around the keypoint.
    • An orientation histogram is created from the gradient directions of pixels within a neighborhood around the keypoint.
    • The highest peak(s) in this histogram define the characteristic orientation(s) for the keypoint. Assigning an orientation makes the descriptor rotation-invariant as all subsequent operations are performed relative to this orientation.
  4. Keypoint Descriptor Generation:

    • A local image descriptor is generated for each keypoint based on the gradient magnitudes and orientations within a region around the keypoint.
    • This region is divided into a grid of sub-regions (e.g., $4 \times 4$ sub-regions).
    • For each sub-region, a new orientation histogram (e.g., 8 bins) is computed. The gradients within the sub-region are weighted by their magnitude and a Gaussian window centered on the keypoint.
    • These histograms are then concatenated into a single, high-dimensional vector (e.g., $4 \times 4 \times 8 = 128$ elements for each keypoint).
    • This descriptor is then normalized to make it robust to illumination changes.
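The Difference of Gaussians idea from step 1 can be illustrated in one dimension: blur a signal at two nearby scales, subtract, and look for the extremum. This is a deliberately reduced sketch; real SIFT works on 2D images, builds several octaves, and compares each sample with its neighbors in both space and scale. The scale ratio $k = \sqrt{2}$ and the helper names are assumptions for illustration.

```python
import math


def gaussian_kernel(sigma):
    """Normalized 1D Gaussian kernel truncated at about 3 sigma."""
    radius = int(math.ceil(3 * sigma))
    weights = [math.exp(-(i * i) / (2 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur(signal, sigma):
    """Convolve the signal with a Gaussian, clamping indices at the borders."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)
            acc += w * signal[j]
        out.append(acc)
    return out


def difference_of_gaussians(signal, sigma, k=math.sqrt(2)):
    """DoG response: blur at scale k*sigma minus blur at scale sigma."""
    narrow = blur(signal, sigma)
    wide = blur(signal, k * sigma)
    return [w - n for w, n in zip(wide, narrow)]


# A single bright point in the middle of a dark 1D "image".
signal = [0.0] * 41
signal[20] = 1.0

dog = difference_of_gaussians(signal, sigma=1.5)
extremum_index = min(range(len(dog)), key=dog.__getitem__)
print(extremum_index, dog[extremum_index])
```

The response has its strongest (here negative) extremum at the location of the bright point, which is the kind of candidate keypoint the full algorithm then refines and filters.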

Invariance Properties of SIFT Features:

  • Scale Invariance: Achieved by detecting keypoints as scale-space extrema across a range of scales, so each keypoint carries its own characteristic scale.
  • Rotation Invariance: Achieved by orienting the descriptor based on the dominant gradient direction at each keypoint.
  • Partial Illumination Invariance: Achieved by normalizing the descriptor vector.
  • Partial Affine Invariance: Achieved by the local nature of the descriptor and its robustness to minor geometric distortions.
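The partial illumination invariance can be checked numerically: multiplying every gradient magnitude by a constant (a contrast change) leaves the normalized descriptor unchanged. The sketch below normalizes to unit length, clamps large entries, and renormalizes; the clamp value 0.2 follows Lowe's original SIFT paper and is not stated in these notes, so treat it as an assumption. The descriptor values are made up for illustration.

```python
import math


def normalize_descriptor(vec, clamp=0.2):
    """Unit-normalize, clamp large entries, then renormalize."""
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    unit = [v / norm for v in vec]
    # Clamping limits the influence of a few very strong gradients.
    clipped = [min(v, clamp) for v in unit]
    norm2 = math.sqrt(sum(v * v for v in clipped)) or 1.0
    return [v / norm2 for v in clipped]


# Hypothetical 4 x 4 x 8 = 128-element descriptor before normalization.
descriptor = [float((i % 8) + 1) for i in range(128)]
# Same patch under a contrast change: all gradient magnitudes tripled.
brighter = [3.0 * v for v in descriptor]

a = normalize_descriptor(descriptor)
b = normalize_descriptor(brighter)
max_difference = max(abs(x - y) for x, y in zip(a, b))
print(max_difference)
```

A constant brightness offset does not affect the descriptor either, because it is built from gradients (pixel differences) rather than raw intensities.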

Applications of SIFT: SIFT features are widely used in various computer vision tasks due to their robustness and distinctiveness:

  • Object Recognition and Detection: Identifying objects in images regardless of their size or orientation.
  • Image Matching and Stitching: Aligning and combining multiple images (e.g., creating panoramas).
  • 3D Reconstruction: Building 3D models from 2D images.
  • Robotics and Navigation: Localizing robots and mapping environments.
  • Content-Based Image Retrieval: Searching for images based on their visual content.

2.5 Audio Feature Representations

Audio signals, like images, are raw data that need to be transformed into meaningful features for pattern recognition tasks such as speech recognition, music analysis, and environmental sound detection. Raw audio (waveform) is a 1D time-series signal, and directly processing it can be computationally intensive and may not capture perceptually relevant information efficiently.

Fundamental Concepts:

  1. Sampling:

    • Analog audio signals are continuous in time and amplitude. Sampling converts this continuous signal into a discrete sequence of values by taking amplitude measurements at regular intervals.
    • The sampling rate (e.g., 44.1 kHz for CD quality, 16 kHz for speech) determines how many samples are taken per second. By the Nyquist theorem, a signal sampled at rate $f_s$ can represent frequencies only up to $f_s/2$, so a higher sampling rate allows higher frequencies to be captured (e.g., up to 22.05 kHz at 44.1 kHz).
    • Quantization: Each sample's amplitude is converted into a discrete numerical value (e.g., 8-bit or 16-bit resolution, giving 256 or 65,536 possible levels).
  2. Framing (Windowing):

    • Audio signals are typically non-stationary over long durations but can be considered approximately stationary over short periods.
    • Therefore, the audio signal is divided into short, overlapping frames (e.g., 20-30 ms long, with 50% overlap).
    • A window function (e.g., Hamming or Hann) is applied to each frame to reduce the spectral leakage caused by abruptly cutting the signal at the frame boundaries.
  3. Frequency Domain Transformation (e.g., FFT):

    • After framing, each frame is transformed from the time domain to the frequency domain using the Fast Fourier Transform (FFT), an efficient algorithm for computing the Discrete Fourier Transform.
    • The magnitude (or power) spectrum obtained from the FFT represents the distribution of frequencies present in that frame.
    • Spectrogram: A common visualization of audio features, formed by stacking the spectra of successive frames. It shows how the frequency content of a signal changes over time (frequency on the y-axis, time on the x-axis, color intensity representing magnitude or power).
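
The framing, windowing, and FFT steps above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: the helper names `frame_signal` and `power_spectrogram`, the 25 ms frame length, and the 50% overlap are choices made for the example, and the signal is assumed to be at least one frame long.

```python
import numpy as np

def frame_signal(signal, frame_length, hop_length):
    """Split a 1D signal into overlapping frames (one frame per row).

    Assumes len(signal) >= frame_length; trailing samples that do not
    fill a whole frame are dropped.
    """
    signal = np.asarray(signal, dtype=np.float64)
    n_frames = 1 + (len(signal) - frame_length) // hop_length
    starts = hop_length * np.arange(n_frames)
    idx = starts[:, None] + np.arange(frame_length)[None, :]
    return signal[idx]

def power_spectrogram(signal, sample_rate, frame_ms=25.0, overlap=0.5):
    """Return (power, freqs): power has shape (n_frames, n_bins)."""
    frame_length = int(round(sample_rate * frame_ms / 1000.0))
    hop_length = int(round(frame_length * (1.0 - overlap)))
    frames = frame_signal(signal, frame_length, hop_length)

    # Taper each frame to reduce spectral leakage, then take the FFT.
    window = np.hamming(frame_length)
    spectrum = np.fft.rfft(frames * window, axis=1)
    power = np.abs(spectrum) ** 2
    freqs = np.fft.rfftfreq(frame_length, d=1.0 / sample_rate)
    return power, freqs

# Usage: one second of a 1 kHz tone sampled at 16 kHz.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 1000.0 * t)
power, freqs = power_spectrogram(tone, sample_rate)
peak_hz = freqs[np.argmax(power.mean(axis=0))]  # close to 1000 Hz
```

Each row of `power` is the spectrum of one frame, so plotting it with time on one axis and `freqs` on the other gives a spectrogram.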

Common Audio Feature Representations for Pattern Recognition:

  1. Zero-Crossing Rate (ZCR):

    • Concept: The rate at which the audio signal changes sign (from positive to negative or vice versa) within a frame.
    • What it represents: Indicates the noisiness or periodicity of a signal. Higher ZCR often corresponds to noisy or unvoiced speech, while lower ZCR corresponds to voiced speech or steady tones.
    • Application: Speech endpoint detection, unvoiced/voiced speech discrimination.
  2. Energy/Amplitude (Short-Time Energy):

    • Concept: The sum of the squares of the signal's amplitude within a frame.
    • What it represents: A measure of the loudness or intensity of the audio signal.
    • Application: Speech activity detection (identifying presence of speech), audio segmentation, speaker verification.
  3. Mel-Frequency Cepstral Coefficients (MFCCs):

    • Concept: These are widely used features that attempt to mimic the human auditory system's non-linear perception of pitch. They represent the short-term power spectrum of a sound, based on a linear cosine transform of a log power spectrum on a non-linear Mel scale of frequency.
    • Steps:
      1. Take the FFT of a windowed audio frame.
      2. Map the power spectrum onto the Mel scale (a perceptual scale of pitches judged by listeners to be equal in distance from one another), typically by summing the power under a bank of overlapping triangular filters.
      3. Take the logarithm of the powers at each of the Mel frequencies.
      4. Take the Discrete Cosine Transform (DCT) of the log Mel power spectrum to decorrelate the Mel-filterbank outputs. The MFCCs are the amplitudes of the resulting spectrum.
    • What it represents: Captures the timbre or spectral envelope of a sound, and is relatively insensitive to variations in pitch.
    • Application: Speech recognition, speaker identification, music genre classification.
  4. Chroma Features:

    • Concept: Represents the twelve different pitch classes (chroma) of Western music (C, C#, D, etc.) regardless of octave.
    • What it represents: Provides information about the harmonic content of a musical piece, relatively insensitive to timbre changes.
    • Application: Music information retrieval (genre classification, key detection, chord recognition), audio fingerprinting.
  5. Spectral Centroid:

    • Concept: The "center of mass" of the spectrum, i.e., the magnitude-weighted mean frequency $\sum_k f_k |X_k| / \sum_k |X_k|$, indicating the spectral brightness of a sound.
    • What it represents: High spectral centroid often means "bright" or "sharp" sounds, while low spectral centroid means "dark" or "dull" sounds.
    • Application: Music timbre classification, instrument recognition.
  6. Spectral Roll-off:

    • Concept: The frequency below which a specified percentage (e.g., 85%) of the total spectral energy is contained.
    • What it represents: Indicates the shape of the spectrum and can differentiate between voiced and unvoiced sounds, or between different timbres.
    • Application: Speech/music classification, environmental sound recognition.
  7. Spectral Bandwidth:

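    • Concept: The width of the frequency band containing a certain proportion of the total spectral energy. In practice it is often computed as the magnitude-weighted standard deviation of frequency around the spectral centroid.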
    • What it represents: Can indicate the spread of frequencies in a sound, useful for distinguishing between narrow-band and broad-band sounds.
    • Application: Noise detection, audio quality assessment.
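
Several of the simpler features listed above can be computed directly from a single frame. The sketch below is illustrative only: the function names are hypothetical, zero-valued samples are treated as positive when counting sign changes, and spectral bandwidth is computed as the magnitude-weighted standard deviation around the centroid, which is one common convention and an assumption of this example.

```python
import numpy as np

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    signs = np.sign(frame)
    signs[signs == 0] = 1  # assumption: treat exact zeros as positive
    return np.mean(signs[1:] != signs[:-1])

def short_time_energy(frame):
    """Sum of squared amplitudes within the frame."""
    return np.sum(np.square(frame))

def spectral_features(frame, sample_rate, rolloff_percent=0.85):
    """Return (centroid, bandwidth, rolloff) in Hz for one frame."""
    window = np.hamming(len(frame))
    mag = np.abs(np.fft.rfft(frame * window))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    total = mag.sum() + 1e-12

    # Centroid: magnitude-weighted mean frequency.
    centroid = np.sum(freqs * mag) / total
    # Bandwidth: magnitude-weighted spread around the centroid.
    bandwidth = np.sqrt(np.sum(mag * (freqs - centroid) ** 2) / total)
    # Roll-off: lowest frequency below which the given share of energy lies.
    cumulative = np.cumsum(mag ** 2)
    idx = np.searchsorted(cumulative, rolloff_percent * cumulative[-1])
    rolloff = freqs[idx]
    return centroid, bandwidth, rolloff
```

For example, a frame of white noise yields a much higher zero-crossing rate than a low-frequency tone, and a 4 kHz tone yields a higher centroid and roll-off than a 200 Hz tone.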

These audio feature representations convert the complex raw waveform into a compact and informative set of numerical features that machine learning models can use for various pattern recognition tasks. They aim to capture different perceptual or physical characteristics of sound.
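
As one concrete instance of such a compact representation, the four MFCC steps listed earlier can be sketched in NumPy. This is a minimal sketch under stated assumptions, not a production implementation: the function names are hypothetical, the Mel conversion uses the common formula $m = 2595 \log_{10}(1 + f/700)$, and the choices of 26 filters, 13 coefficients, a Hamming window, and an unnormalized DCT-II are illustrative defaults.

```python
import numpy as np

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + np.asarray(hz, dtype=np.float64) / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=np.float64) / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters whose edges are equally spaced on the Mel scale."""
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    mel_edges = np.linspace(0.0, hz_to_mel(sample_rate / 2.0), n_filters + 2)
    edges_hz = mel_to_hz(mel_edges)
    bank = np.zeros((n_filters, len(freqs)))
    for m in range(n_filters):
        left, centre, right = edges_hz[m], edges_hz[m + 1], edges_hz[m + 2]
        rising = (freqs - left) / (centre - left)
        falling = (right - freqs) / (right - centre)
        bank[m] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

def dct_ii(x):
    """Unnormalised DCT-II of a 1D array."""
    n = x.shape[-1]
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    basis = np.cos(np.pi * k * (2 * i + 1) / (2 * n))
    return x @ basis.T

def mfcc(frame, sample_rate, n_filters=26, n_coeffs=13):
    frame = np.asarray(frame, dtype=np.float64)
    # Step 1: FFT of the windowed frame, giving the power spectrum.
    power = np.abs(np.fft.rfft(frame * np.hamming(len(frame)))) ** 2
    # Step 2: map the power spectrum onto the Mel scale.
    mel_energies = mel_filterbank(n_filters, len(frame), sample_rate) @ power
    # Step 3: logarithm of the Mel-band energies (small floor avoids log(0)).
    log_mel = np.log(mel_energies + 1e-10)
    # Step 4: DCT to decorrelate; keep the first n_coeffs coefficients.
    return dct_ii(log_mel)[:n_coeffs]
```

Because the logarithm turns an overall gain change into an additive constant, doubling the amplitude of a frame shifts only the first coefficient in this sketch and leaves the remaining ones essentially unchanged.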