Implement Feature Selection via Correlation Matrix #59

@AreTaj

Description

Problem

The forecasting model currently ingests all available features without filtering for multicollinearity. This is particularly problematic for weather data, where features like maximum temperature (tmax) and sun duration (tsun) are often highly correlated (> 0.9).

While the Gradient Boosting model handles this gracefully for prediction accuracy, it harms interpretability. The model may arbitrarily split importance between two redundant features, so the user cannot tell which factor is the true trigger.

Proposed Solution

Implement an automated, efficient feature selection step using a Correlation Matrix Filter. This was chosen over Recursive Feature Elimination (RFE) for its speed and deterministic nature, which is critical for a desktop application user experience.

Implementation Details

  1. Algorithm: Pearson Correlation.
  2. Threshold: absolute correlation > 0.90 (the value should be defined as a named constant).
  3. Process:
    • Calculate the absolute correlation matrix with df.corr().abs().
    • Iterate through the upper triangle of the matrix.
    • Identify columns with correlation coefficient higher than the threshold.
    • Drop one column from each correlated pair.
    • Tie-breaking: If feature A and B are correlated, prefer keeping the one that is more "raw" or has fewer missing values if possible. Otherwise, simply drop the second one encountered to ensure deterministic behavior.
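
The process above could be sketched roughly as follows. This is only an illustration, not the final implementation: the function name drop_correlated_features and the constant name CORRELATION_THRESHOLD are assumptions, and only the "fewer missing values" tie-break is covered, since "more raw" has no definition in this issue yet.

```python
from typing import List, Tuple

import pandas as pd

# Hypothetical constant name; the issue only asks that the value be a constant.
CORRELATION_THRESHOLD = 0.90


def drop_correlated_features(
    df: pd.DataFrame, threshold: float = CORRELATION_THRESHOLD
) -> Tuple[pd.DataFrame, List[str]]:
    """Drop one column from each pair whose absolute Pearson correlation
    exceeds the threshold. Returns the filtered frame and the dropped names."""
    numeric = df.select_dtypes(include="number")
    corr = numeric.corr().abs()  # DataFrame.corr() uses Pearson by default
    missing = numeric.isna().sum()
    columns = list(corr.columns)
    dropped: List[str] = []

    # Walk the upper triangle: each pair (first, second) is visited once.
    for i, first in enumerate(columns):
        if first in dropped:
            continue
        for second in columns[i + 1:]:
            if second in dropped:
                continue
            value = corr.loc[first, second]
            if pd.isna(value) or value <= threshold:
                continue
            # Tie-break: keep the column with fewer missing values.
            # On a tie, drop the second one encountered (deterministic).
            if missing[first] > missing[second]:
                dropped.append(first)
                break  # 'first' is gone, stop comparing against it
            dropped.append(second)

    return df.drop(columns=dropped), dropped


example = pd.DataFrame(
    {
        "tmax": [20.0, 25.0, 30.0, 35.0],
        "tsun": [40.0, 50.0, 60.0, 70.0],  # exactly 2 * tmax
        "prcp": [1.0, 0.0, 3.0, 0.5],
    }
)
filtered, dropped_columns = drop_correlated_features(example)
print(dropped_columns)
```

Returning the dropped names alongside the frame is a design suggestion: it lets train_model.py log which features were removed and apply the same column set at prediction time.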

Location

  • Implement the filtering logic in forecasting/feature_engine.py or a new utility module.
  • Call this filter in forecasting/train_model.py immediately after data processing and before model training.

Acceptance Criteria

  • Unit test proving that perfectly correlated features result in one being dropped.
  • Pipeline runs without errors.
  • Feature importance outputs (if exposed) no longer show diluted signals across redundant variables.
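
A possible shape for the unit test in the first criterion is sketched below. To keep the snippet self-contained it defines a small stand-in filter based on the common np.triu upper-triangle idiom, which always drops the second column of a pair; in the real test the function would be imported from wherever the filter ends up living. The names drop_correlated_features and CorrelationFilterTest are hypothetical.

```python
import unittest

import numpy as np
import pandas as pd


def drop_correlated_features(df: pd.DataFrame, threshold: float = 0.90) -> pd.DataFrame:
    """Stand-in for the real filter (hypothetical name and signature)."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is considered once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)


class CorrelationFilterTest(unittest.TestCase):
    def test_perfectly_correlated_pair_drops_one(self):
        df = pd.DataFrame(
            {
                "tmax": [20.0, 25.0, 30.0, 35.0],
                "tmax_copy": [20.0, 25.0, 30.0, 35.0],
                "prcp": [1.0, 0.0, 3.0, 0.5],
            }
        )
        result = drop_correlated_features(df)
        # The first column of the pair survives, the duplicate is removed.
        self.assertEqual(list(result.columns), ["tmax", "prcp"])

    def test_weakly_correlated_features_are_kept(self):
        df = pd.DataFrame(
            {
                "tmax": [20.0, 25.0, 30.0, 35.0],
                "prcp": [1.0, 0.0, 3.0, 0.5],
            }
        )
        result = drop_correlated_features(df)
        self.assertEqual(list(result.columns), ["tmax", "prcp"])
```

Saved as a test module, this can be run with python -m unittest.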
