[Phase 3] Feature Engineering #4

@Sakeeb91

Phase 3: Feature Engineering

Parent: #1
Depends on: #3

Objectives

Extract clinically meaningful features from ECG beats for classical ML models.

Tasks

  • Implement time-domain feature extraction (10+ features)
  • Implement frequency-domain feature extraction (8+ features)
  • Implement wavelet-based feature extraction (8+ features)
  • Create unified feature extraction pipeline
  • Document clinical meaning of each feature
  • Validate features against published literature
  • Handle edge cases (NaN, division by zero)
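
One way the edge-case task could be approached is sketched below. The helper names `safe_ratio` and `sanitize_features` are hypothetical and not part of the planned module; the epsilon matches the value suggested in the Technical Notes, and the fill value of 0.0 is an assumption.

```python
import numpy as np

EPS = 1e-10  # small constant, same value as suggested in this issue


def safe_ratio(numerator: float, denominator: float, eps: float = EPS) -> float:
    """Divide while guarding against a zero denominator (hypothetical helper).

    Assumes a non-negative denominator such as a power or an interval length.
    """
    return float(numerator) / (float(denominator) + eps)


def sanitize_features(features: dict, fill_value: float = 0.0) -> dict:
    """Replace NaN and +/-inf feature values with fill_value (hypothetical helper)."""
    clean = {}
    for name, value in features.items():
        value = float(value)
        clean[name] = value if np.isfinite(value) else fill_value
    return clean
```

Running `sanitize_features` on the merged feature dict before converting it to an array would keep non-finite values out of the output. Whether 0.0 is the right fill value depends on the downstream model.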

Files to Create/Modify

| File | Action | Description |
|------|--------|-------------|
| src/feature_extraction.py | Create | Feature extraction module |
| tests/test_features.py | Create | Unit tests |
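
A minimal sketch of what a known-value test in tests/test_features.py might look like. Because the module does not exist yet, the test recomputes the statistics with the same NumPy expressions used in the Code Reference; the helper `make_sine` and the chosen signal parameters are assumptions for illustration.

```python
import numpy as np


def make_sine(amplitude=2.0, freq_hz=10.0, fs=360, n_samples=360):
    """Synthetic test signal: a whole number of sine periods (hypothetical helper)."""
    t = np.arange(n_samples) / fs
    return amplitude * np.sin(2 * np.pi * freq_hz * t)


def test_time_domain_known_values():
    beat = make_sine(amplitude=2.0)
    rms = np.sqrt(np.mean(beat ** 2))
    # Over whole periods a sine has zero mean and RMS = amplitude / sqrt(2).
    assert np.isclose(np.mean(beat), 0.0, atol=1e-9)
    assert np.isclose(rms, 2.0 / np.sqrt(2))
    # Peak-to-peak of a sine is twice its amplitude.
    assert np.isclose(np.ptp(beat), 4.0)
    # Variance and standard deviation must agree.
    assert np.isclose(np.var(beat), np.std(beat) ** 2)
```

Once `FeatureExtractor` exists, the inline expressions would be replaced by a call to `time_domain_features` and the same expected values checked against the returned dict.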

Features to Extract

Time Domain (10 features):

  • Mean, std, variance
  • Skewness, kurtosis
  • RMS (root mean square)
  • Peak amplitude, peak-to-peak
  • QRS duration estimate
  • RR interval ratio
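
The last two items are not covered by the Code Reference. The sketch below shows one possible way to estimate them; it is an assumption, not a method specified in this issue. The QRS estimate is a simple amplitude-threshold heuristic (not a clinical-grade delineator), and the RR ratio assumes an array of R-peak sample indices is available from the earlier phase. The function names and the `frac` parameter are hypothetical.

```python
import numpy as np


def qrs_duration_estimate(beat: np.ndarray, fs: int = 360, frac: float = 0.5) -> float:
    """Rough QRS width in seconds (assumed heuristic).

    Measures the contiguous span around the largest deflection where the
    baseline-removed absolute amplitude stays at or above frac * peak.
    """
    centered = np.abs(beat - np.median(beat))
    peak_idx = int(np.argmax(centered))
    threshold = frac * centered[peak_idx]
    left = peak_idx
    while left > 0 and centered[left - 1] >= threshold:
        left -= 1
    right = peak_idx
    while right < len(centered) - 1 and centered[right + 1] >= threshold:
        right += 1
    return (right - left + 1) / fs


def rr_interval_ratio(r_peaks: np.ndarray, i: int, eps: float = 1e-10) -> float:
    """Ratio of the RR interval before beat i to the one after it.

    Requires 1 <= i <= len(r_peaks) - 2 so both neighbours exist.
    """
    pre_rr = r_peaks[i] - r_peaks[i - 1]
    post_rr = r_peaks[i + 1] - r_peaks[i]
    return float(pre_rr) / (float(post_rr) + eps)
```

Note that the RR ratio needs context from neighbouring beats, so it cannot be computed from a single `beat` array alone; the pipeline would have to pass the R-peak positions alongside each beat.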

Frequency Domain (8 features):

  • Spectral centroid, spectral spread
  • Spectral entropy
  • Band powers: VLF, LF, HF
  • LF/HF ratio
  • Dominant frequency
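
A possible sketch of the remaining spectral features, built on the same Welch estimate as the Code Reference. The function name `spectral_features` and the `bands` parameter are hypothetical. The issue does not give band edges for VLF, LF, and HF, so they are passed in by the caller rather than hard-coded. Band power is taken as the sum of PSD bins, matching how `total_power` is computed in the Code Reference.

```python
import numpy as np
from scipy.signal import welch

EPS = 1e-10  # guards divisions and log(0)


def spectral_features(beat, fs=360, bands=None):
    """Spectral summary of one beat (illustrative sketch).

    bands: optional dict mapping a name to (low_hz, high_hz); assumed input.
    """
    freqs, psd = welch(beat, fs=fs, nperseg=min(256, len(beat)))
    total_power = np.sum(psd)
    p = psd / (total_power + EPS)  # normalized spectrum, sums to about 1
    centroid = np.sum(freqs * p)
    spread = np.sqrt(np.sum(((freqs - centroid) ** 2) * p))
    entropy = -np.sum(p * np.log2(p + EPS))
    features = {
        'spectral_centroid': centroid,
        'spectral_spread': spread,
        'spectral_entropy': entropy,
        'dominant_frequency': freqs[np.argmax(psd)],
    }
    for name, (low_hz, high_hz) in (bands or {}).items():
        mask = (freqs >= low_hz) & (freqs < high_hz)
        features[f'{name}_power'] = np.sum(psd[mask])
    return features
```

The frequency resolution of this estimate is `fs / nperseg`, about 1.4 Hz for `fs=360` and `nperseg=256`. Any band narrower than that may contain no bins and will report zero power, which is worth keeping in mind when choosing the band edges.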

Wavelet Features (8+ features):

  • Energy at scales 4, 8, 16, 32 (db4 wavelet)
  • Approximation coefficient statistics
  • Detail coefficient statistics
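
The coefficient statistics could be gathered as in the sketch below. The function name `wavelet_stats` and the choice of mean, standard deviation, and energy as the statistics are assumptions. The decomposition level is capped with `pywt.dwt_max_level` so that short beats do not request more levels than the signal length supports.

```python
import numpy as np
import pywt


def wavelet_stats(beat, wavelet='db4', max_level=4):
    """Energy, mean, and std per wavelet sub-band (illustrative sketch)."""
    filter_len = pywt.Wavelet(wavelet).dec_len
    level = min(max_level, pywt.dwt_max_level(len(beat), filter_len))
    coeffs = pywt.wavedec(beat, wavelet, level=level)
    # wavedec returns [cA_level, cD_level, ..., cD_1]:
    # the approximation first, then details from coarsest to finest.
    names = [f'cA{level}'] + [f'cD{k}' for k in range(level, 0, -1)]
    features = {}
    for name, c in zip(names, coeffs):
        features[f'{name}_energy'] = float(np.sum(c ** 2))
        features[f'{name}_mean'] = float(np.mean(c))
        features[f'{name}_std'] = float(np.std(c))
    return features
```

Naming features after the sub-band (`cA4`, `cD4`, ...) instead of a list index makes it easier to document what each one represents. One caveat: if the level is capped for a short beat, the number of features changes, so the pipeline would need fixed-length beats or padding to keep the feature vector length constant.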

Code Reference

from scipy.stats import skew, kurtosis
from scipy.signal import welch
import pywt
import numpy as np

class FeatureExtractor:
    def __init__(self, fs: int = 360):
        self.fs = fs

    def time_domain_features(self, beat: np.ndarray) -> dict:
        return {
            'mean': np.mean(beat),
            'std': np.std(beat),
            'variance': np.var(beat),
            'rms': np.sqrt(np.mean(beat**2)),
            'peak': np.max(np.abs(beat)),
            'peak_to_peak': np.ptp(beat),
            'skewness': skew(beat),
            'kurtosis': kurtosis(beat),
        }

    def frequency_domain_features(self, beat: np.ndarray) -> dict:
        freqs, psd = welch(beat, fs=self.fs, nperseg=min(256, len(beat)))
        total_power = np.sum(psd)
        spectral_centroid = np.sum(freqs * psd) / (total_power + 1e-10)
        return {
            'spectral_centroid': spectral_centroid,
            'total_power': total_power,
            # ... more features
        }

    def wavelet_features(self, beat: np.ndarray, wavelet: str = 'db4') -> dict:
        # wavedec returns [cA4, cD4, cD3, cD2, cD1]: index 0 is the
        # approximation; higher indices are progressively finer details.
        coeffs = pywt.wavedec(beat, wavelet, level=4)
        features = {}
        for i, c in enumerate(coeffs):
            features[f'wavelet_energy_{i}'] = np.sum(c**2)
            features[f'wavelet_std_{i}'] = np.std(c)
        return features

    def extract_all(self, beat: np.ndarray) -> np.ndarray:
        """Extract all features and return as array."""
        all_features = {}
        all_features.update(self.time_domain_features(beat))
        all_features.update(self.frequency_domain_features(beat))
        all_features.update(self.wavelet_features(beat))
        return np.array(list(all_features.values()))

Definition of Done

  • 30-40 features extracted per beat
  • All feature values are finite (no NaN or inf)
  • Feature names documented with clinical interpretation
  • Unit tests verify calculations against known values
  • Feature extraction runs in under 10 ms per beat
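
The count, finiteness, and runtime criteria can be checked mechanically. The sketch below is one way to do that; `check_feature_vector` and `mean_runtime_ms` are hypothetical helpers, and `extract` stands for any callable such as `FeatureExtractor().extract_all`.

```python
import time

import numpy as np


def check_feature_vector(vec: np.ndarray, min_len: int = 30, max_len: int = 40) -> None:
    """Raise ValueError if the vector violates the count or finiteness criteria."""
    if vec.ndim != 1:
        raise ValueError(f'expected a 1-D feature vector, got {vec.ndim} dimensions')
    if not (min_len <= len(vec) <= max_len):
        raise ValueError(f'expected {min_len}-{max_len} features, got {len(vec)}')
    if not np.all(np.isfinite(vec)):
        raise ValueError('feature vector contains NaN or inf')


def mean_runtime_ms(extract, beat: np.ndarray, repeats: int = 100) -> float:
    """Average wall-clock time of extract(beat) in milliseconds."""
    start = time.perf_counter()
    for _ in range(repeats):
        extract(beat)
    return (time.perf_counter() - start) * 1000 / repeats
```

Averaging over many repeats smooths out timer noise. The measured time still depends on the machine, so the 10 ms threshold is best checked on the target hardware.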

Technical Notes

For junior developers:

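  • Kurtosis is high for impulsive, sharply peaked signals (such as some arrhythmic beats). Note that scipy.stats.kurtosis returns excess kurtosis by default, so a normal distribution scores 0.
  • LF/HF ratio relates to autonomic nervous system balance. In heart rate variability analysis it is typically computed from the RR-interval series, since a single beat is too short to resolve sub-1 Hz bands.
  • Wavelet decomposition captures multi-scale information: coarse levels describe slow morphology, fine levels describe sharp transitions.
  • Add a small epsilon (e.g. 1e-10) to denominators to avoid division by zero.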
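
To make the kurtosis note concrete, here is a small comparison on two synthetic signals. Both signals are made up for illustration and are not ECG data.

```python
import numpy as np
from scipy.stats import kurtosis

n = 360
# Smooth signal: five whole periods of a sine wave.
sine = np.sin(2 * np.pi * 5 * np.arange(n) / n)
# Impulsive signal: flat baseline with a single spike.
impulse = np.zeros(n)
impulse[n // 2] = 1.0

k_sine = kurtosis(sine)        # excess kurtosis of a sine is -1.5
k_impulse = kurtosis(impulse)  # large and positive for a lone spike

print(f'sine: {k_sine:.2f}, impulse: {k_impulse:.2f}')
```

The sine scores below zero because its values spread evenly between the extremes, while the spike concentrates almost all samples at the baseline and produces a heavy tail.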
