Description
Problem
The forecasting model currently ingests all available features without filtering for multicollinearity. This is particularly problematic for weather data, where features like maximum temperature (tmax) and sun duration (tsun) are often highly correlated (> 0.9).
While the Gradient Boosting model handles this gracefully for prediction accuracy, it harms interpretability. The model may arbitrarily split importance between two redundant features, making it impossible for the user to know which factor is the true trigger.
Proposed Solution
Implement an automated, efficient feature selection step using a Correlation Matrix Filter. This was chosen over Recursive Feature Elimination (RFE) for its speed and deterministic nature, which is critical for a desktop application user experience.
Implementation Details
- Algorithm: Pearson Correlation.
- Threshold: > 0.90 (Value should be defined as a constant).
- Process:
  - Calculate the correlation matrix: `df.corr().abs()`.
  - Iterate through the upper triangle of the matrix.
  - Identify columns with a correlation coefficient higher than the threshold.
  - Drop one column from each correlated pair.
- Tie-breaking: If features A and B are correlated, prefer keeping the one that is more "raw" or has fewer missing values if possible. Otherwise, simply drop the second one encountered to ensure deterministic behavior.
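The steps above could be sketched as follows. This is a minimal illustration, not the final implementation; the function and constant names are placeholders, not taken from the codebase:

```python
import numpy as np
import pandas as pd

# Threshold defined as a named constant, per the issue; name is illustrative.
CORRELATION_THRESHOLD = 0.90

def drop_correlated_features(df: pd.DataFrame,
                             threshold: float = CORRELATION_THRESHOLD) -> pd.DataFrame:
    """Drop one feature from each pair whose absolute Pearson correlation
    exceeds `threshold`. Columns are scanned in order, so the second
    (later) column of each correlated pair is dropped deterministically."""
    corr = df.corr().abs()
    # Keep only the upper triangle (k=1 excludes the diagonal) so each
    # pair is inspected exactly once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)
```

Scanning only the upper triangle and dropping the later column of each pair gives the deterministic behavior the tie-breaking rule asks for.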
Location
- Implement the filtering logic in `forecasting/feature_engine.py` or a new utility module.
- Call this filter in `forecasting/train_model.py` immediately after data processing and before model training.
Acceptance Criteria
- Unit test proving that perfectly correlated features result in one being dropped.
- Pipeline runs without errors.
- Feature importance outputs (if exposed) no longer show diluted signals spread across redundant variables.
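A unit test for the first criterion could look like this. The filter is inlined here as a stand-in; in the real test it would be imported from wherever the logic lands (e.g. `forecasting/feature_engine.py`), and all names are illustrative:

```python
import numpy as np
import pandas as pd

# Stand-in for the proposed filter; the real test would import it
# from the module where it is implemented.
def drop_correlated_features(df: pd.DataFrame, threshold: float = 0.90) -> pd.DataFrame:
    corr = df.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return df.drop(columns=[c for c in upper.columns if (upper[c] > threshold).any()])

def test_perfectly_correlated_feature_is_dropped():
    df = pd.DataFrame({
        "tmax": [1.0, 2.0, 3.0, 4.0],
        "tsun": [2.0, 4.0, 6.0, 8.0],   # tsun = 2 * tmax, correlation == 1.0
        "prcp": [5.0, 1.0, 4.0, 2.0],   # weakly correlated with the others
    })
    filtered = drop_correlated_features(df)
    assert "tmax" in filtered.columns      # first of the pair is kept
    assert "tsun" not in filtered.columns  # second of the pair is dropped
    assert "prcp" in filtered.columns      # uncorrelated feature survives
```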