[Phase 9] External Regressors Integration #10
Status: Open
Summary
Integrate external regressors (promotions, holidays, weather, events) into the feature pipeline. These external factors significantly impact demand and improve forecast accuracy when properly incorporated.
System Context
```
demand-forecasting-engine/
├── src/
│   ├── features/
│   │   ├── pipeline.py            # EXISTS - MODIFY
│   │   ├── calendar_features.py   # EXISTS
│   │   └── external_features.py   # <- CREATE
│   └── data/
│       ├── loader.py              # EXISTS
│       └── external_loader.py     # <- CREATE
├── configs/
│   └── external_sources.yaml      # <- CREATE
└── scripts/
    └── train_with_external.py     # <- CREATE
```
Dependencies
- Requires: Phase 1 (Feature Pipeline)
- Blocks: Phase 10 (Production Pipeline)
Proposed Solution
1. External Data Loader (src/data/external_loader.py)
```python
import pandas as pd
import numpy as np
from pathlib import Path
from typing import Optional, Dict, Any
from datetime import datetime, timedelta
import yaml


class ExternalDataLoader:
    """
    Load and align external data sources with the main time series.

    Handles:
    - Promotions calendar
    - Holiday schedules
    - Weather data
    - Custom events
    """

    def __init__(self, config_path: Optional[str] = None):
        self.config = {}
        if config_path:
            with open(config_path, 'r') as f:
                self.config = yaml.safe_load(f)

    def load_promotions(
        self,
        filepath: str,
        date_column: str = 'date',
        sku_column: str = 'sku_id',
    ) -> pd.DataFrame:
        """
        Load the promotion calendar.

        Expected columns:
        - date: Promotion date
        - sku_id: Product identifier
        - promo_type: Type of promotion (e.g., 'discount', 'bogo', 'bundle')
        - discount_pct: Discount percentage (0-100)
        - promo_name: Optional name of the promotion
        """
        df = pd.read_csv(filepath, parse_dates=[date_column])
        # Validate required columns
        required = [date_column, sku_column, 'promo_type']
        missing = set(required) - set(df.columns)
        if missing:
            raise ValueError(f"Missing columns in promotions: {missing}")
        return df

    def load_weather(
        self,
        filepath: str,
        date_column: str = 'date',
        location_column: str = 'location',
    ) -> pd.DataFrame:
        """
        Load weather data.

        Expected columns:
        - date: Date of weather observation
        - location: Store/region location
        - temp_high: Daily high temperature
        - temp_low: Daily low temperature
        - precipitation: Precipitation amount
        - weather_condition: Sunny, rainy, snowy, etc.
        """
        df = pd.read_csv(filepath, parse_dates=[date_column])
        return df

    def load_events(
        self,
        filepath: str,
        date_column: str = 'date',
    ) -> pd.DataFrame:
        """
        Load a custom event calendar.

        Events like: back to school, Black Friday, local festivals, etc.
        """
        df = pd.read_csv(filepath, parse_dates=[date_column])
        return df

    def align_to_sales(
        self,
        sales_df: pd.DataFrame,
        external_df: pd.DataFrame,
        on: list,
        how: str = 'left',
    ) -> pd.DataFrame:
        """
        Align external data to sales data.

        Parameters
        ----------
        sales_df : pd.DataFrame
            Main sales data
        external_df : pd.DataFrame
            External data to align
        on : list
            Columns to merge on (e.g., ['date', 'sku_id'])
        how : str
            Merge type

        Returns
        -------
        pd.DataFrame
            Sales data with external features added
        """
        return pd.merge(sales_df, external_df, on=on, how=how)
```

2. External Feature Transformers (src/features/external_features.py)
```python
import pandas as pd
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from typing import List, Optional, Dict
import holidays


class PromotionTransformer(BaseEstimator, TransformerMixin):
    """
    Transform promotion data into ML features.

    Creates features:
    - is_on_promotion: Binary flag
    - promo_type_encoded: One-hot or label encoded
    - discount_pct: Normalized discount
    - days_since_promo: Days since last promotion
    - days_to_promo: Days until next promotion (if known)
    """

    PROMO_TYPES = ['discount', 'bogo', 'bundle', 'flash', 'clearance', 'none']

    def __init__(
        self,
        promo_column: str = 'promo_type',
        discount_column: str = 'discount_pct',
        encode_type: str = 'onehot',
    ):
        self.promo_column = promo_column
        self.discount_column = discount_column
        self.encode_type = encode_type

    def fit(self, X: pd.DataFrame, y=None):
        """Fit transformer (learn promotion types)."""
        if self.promo_column in X.columns:
            self.promo_types_ = X[self.promo_column].dropna().unique().tolist()
        else:
            self.promo_types_ = []
        return self

    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        """Create promotion features."""
        X = X.copy()
        # Binary promotion flag
        if self.promo_column in X.columns:
            X['is_on_promotion'] = (~X[self.promo_column].isna()).astype(int)
            # One-hot encode promotion types
            if self.encode_type == 'onehot':
                for promo_type in self.PROMO_TYPES:
                    X[f'promo_{promo_type}'] = (X[self.promo_column] == promo_type).astype(int)
        # Discount percentage (fill NaN with 0)
        if self.discount_column in X.columns:
            X['discount_pct_feature'] = X[self.discount_column].fillna(0) / 100.0
        else:
            X['discount_pct_feature'] = 0.0
        return X

    def get_feature_names_out(self, input_features=None) -> List[str]:
        names = ['is_on_promotion', 'discount_pct_feature']
        for promo_type in self.PROMO_TYPES:
            names.append(f'promo_{promo_type}')
        return names


class WeatherTransformer(BaseEstimator, TransformerMixin):
    """
    Transform weather data into ML features.

    Creates features:
    - temp_normalized: Temperature relative to seasonal norm
    - is_extreme_weather: Hot/cold/stormy conditions
    - precipitation_level: Binned precipitation
    - weather_impact_score: Combined weather impact metric
    """

    def __init__(
        self,
        temp_high_col: str = 'temp_high',
        temp_low_col: str = 'temp_low',
        precip_col: str = 'precipitation',
        condition_col: str = 'weather_condition',
    ):
        self.temp_high_col = temp_high_col
        self.temp_low_col = temp_low_col
        self.precip_col = precip_col
        self.condition_col = condition_col

    def fit(self, X: pd.DataFrame, y=None):
        """Calculate seasonal temperature norms."""
        if self.temp_high_col in X.columns:
            # Calculate monthly averages for normalization
            if 'month' in X.columns:
                self.monthly_temp_avg_ = X.groupby('month')[self.temp_high_col].mean().to_dict()
            else:
                self.monthly_temp_avg_ = {i: X[self.temp_high_col].mean() for i in range(1, 13)}
        else:
            self.monthly_temp_avg_ = {}
        return self

    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        """Create weather features."""
        X = X.copy()
        # Temperature features
        if self.temp_high_col in X.columns:
            X['temp_avg'] = (X[self.temp_high_col] + X.get(self.temp_low_col, X[self.temp_high_col])) / 2
            # Temperature deviation from norm
            if 'month' in X.columns:
                X['temp_deviation'] = X.apply(
                    lambda row: row['temp_avg'] - self.monthly_temp_avg_.get(row['month'], 0),
                    axis=1,
                )
            else:
                X['temp_deviation'] = X['temp_avg'] - X['temp_avg'].mean()
            # Extreme temperature flags
            X['is_hot'] = (X['temp_avg'] > 30).astype(int)
            X['is_cold'] = (X['temp_avg'] < 5).astype(int)
        # Precipitation features
        if self.precip_col in X.columns:
            X['has_precipitation'] = (X[self.precip_col] > 0).astype(int)
            X['heavy_precipitation'] = (X[self.precip_col] > X[self.precip_col].quantile(0.9)).astype(int)
        # Weather condition encoding
        if self.condition_col in X.columns:
            conditions = ['sunny', 'cloudy', 'rainy', 'snowy', 'stormy']
            for condition in conditions:
                X[f'weather_{condition}'] = (
                    X[self.condition_col].str.lower().str.contains(condition, na=False).astype(int)
                )
        return X

    def get_feature_names_out(self, input_features=None) -> List[str]:
        return [
            'temp_avg', 'temp_deviation', 'is_hot', 'is_cold',
            'has_precipitation', 'heavy_precipitation',
            'weather_sunny', 'weather_cloudy', 'weather_rainy',
            'weather_snowy', 'weather_stormy',
        ]


class EventTransformer(BaseEstimator, TransformerMixin):
    """
    Transform the event calendar into ML features.

    Creates features:
    - is_event: Binary event flag
    - event_type_encoded: Type of event
    - days_to_event: Days until next major event
    - days_since_event: Days since last major event
    """

    def __init__(
        self,
        event_column: str = 'event_name',
        event_type_column: str = 'event_type',
    ):
        self.event_column = event_column
        self.event_type_column = event_type_column

    def fit(self, X: pd.DataFrame, y=None):
        if self.event_type_column in X.columns:
            self.event_types_ = X[self.event_type_column].dropna().unique().tolist()
        else:
            self.event_types_ = []
        return self

    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        X = X.copy()
        # Binary event flag
        if self.event_column in X.columns:
            X['is_event'] = (~X[self.event_column].isna()).astype(int)
        # One-hot encode event types
        if self.event_type_column in X.columns:
            for event_type in self.event_types_:
                X[f'event_{event_type}'] = (X[self.event_type_column] == event_type).astype(int)
        return X

    def get_feature_names_out(self, input_features=None) -> List[str]:
        names = ['is_event']
        for et in self.event_types_:
            names.append(f'event_{et}')
        return names
```

3. External Sources Config (configs/external_sources.yaml)
```yaml
# External data source configuration
promotions:
  enabled: true
  filepath: data/external/promotions.csv
  date_column: date
  sku_column: sku_id
  columns:
    - promo_type
    - discount_pct
    - promo_name

weather:
  enabled: false  # Enable when weather data is available
  filepath: data/external/weather.csv
  date_column: date
  location_column: store_location
  columns:
    - temp_high
    - temp_low
    - precipitation
    - weather_condition

events:
  enabled: true
  filepath: data/external/events.csv
  date_column: date
  columns:
    - event_name
    - event_type
    - event_scope  # national, regional, local

# Feature generation settings
feature_settings:
  promotions:
    encode_type: onehot
    include_discount: true
  weather:
    normalize_temperature: true
    bin_precipitation: true
  events:
    lookahead_days: 7
    lookback_days: 7
```

4. Training Script (scripts/train_with_external.py)
```python
#!/usr/bin/env python
"""Train model with external regressors."""
import argparse

import pandas as pd
import yaml

from src.data.loader import load_sales_data
from src.data.external_loader import ExternalDataLoader
from src.features.pipeline import build_feature_pipeline
from src.features.external_features import (
    PromotionTransformer,
    WeatherTransformer,
    EventTransformer,
)
from src.models.ensemble import ForecastingEnsemble
from src.evaluation.splits import temporal_train_test_split
from src.evaluation.metrics import calculate_metrics, print_metrics


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--data", required=True)
    parser.add_argument("--config", default="configs/external_sources.yaml")
    args = parser.parse_args()

    # Load config
    with open(args.config, 'r') as f:
        config = yaml.safe_load(f)

    # Load sales data
    print("Loading sales data...")
    df = load_sales_data(args.data)

    # Load and merge external data
    external_loader = ExternalDataLoader()
    if config.get('promotions', {}).get('enabled'):
        print("Loading promotions...")
        promos = external_loader.load_promotions(config['promotions']['filepath'])
        df = external_loader.align_to_sales(df, promos, on=['date', 'sku_id'])
    if config.get('events', {}).get('enabled'):
        print("Loading events...")
        events = external_loader.load_events(config['events']['filepath'])
        df = external_loader.align_to_sales(df, events, on=['date'])

    # Build feature pipeline with external features
    print("Building features...")
    pipeline = build_feature_pipeline()
    df_features = pipeline.fit_transform(df)

    # Add external feature transformers
    if 'promo_type' in df_features.columns:
        promo_transformer = PromotionTransformer()
        df_features = promo_transformer.fit_transform(df_features)
    if 'event_name' in df_features.columns:
        event_transformer = EventTransformer()
        df_features = event_transformer.fit_transform(df_features)

    # Prepare training data
    feature_cols = [
        c for c in df_features.columns
        if c not in ['date', 'sku_id', 'sales', 'promo_type', 'event_name']
    ]
    X = df_features[feature_cols].dropna()
    y = df_features.loc[X.index, 'sales']
    print(f"Features: {len(feature_cols)} (including external)")

    # Train and evaluate
    X_train, X_test, y_train, y_test = temporal_train_test_split(X, y)
    model = ForecastingEnsemble()
    model.fit(X_train, y_train, eval_set=(X_test, y_test))
    y_pred = model.predict(X_test)
    metrics = calculate_metrics(y_test.values, y_pred)
    print_metrics(metrics)

    # Show external feature importance
    importance = model.get_feature_importance()
    external_features = importance[importance['feature'].str.contains('promo|weather|event')]
    if len(external_features) > 0:
        print("\nExternal Feature Importance:")
        print(external_features.head(10).to_string(index=False))


if __name__ == "__main__":
    main()
```

Implementation Checklist
- Create `ExternalDataLoader` for loading external sources
- Implement `PromotionTransformer` with encoding
- Implement `WeatherTransformer` with normalization
- Implement `EventTransformer` for custom events
- Create external sources configuration file
- Update feature pipeline to include external transformers
- Create training script with external data
- Test with sample external data
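The last checklist item could start from a fixture like the following. This is a minimal sketch of what `tests/test_external.py` might assert about the promotions merge and the `is_on_promotion` flag; the sample dates, SKUs, and values are invented for illustration.

```python
import pandas as pd

# Tiny fixtures mirroring the expected sales and promotions schemas.
sales = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-02"]),
    "sku_id": ["A", "A", "B"],
    "sales": [10, 12, 7],
})
promos = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-02"]),
    "sku_id": ["A"],
    "promo_type": ["discount"],
    "discount_pct": [20.0],
})

# Left merge keeps every sales row; unmatched rows get NaN promo columns.
merged = pd.merge(sales, promos, on=["date", "sku_id"], how="left")

# The binary flag PromotionTransformer derives:
merged["is_on_promotion"] = (~merged["promo_type"].isna()).astype(int)

assert len(merged) == len(sales)  # no row explosion from the merge
assert merged["is_on_promotion"].sum() == 1
```

A real test would additionally round-trip the fixtures through CSV files so `load_promotions` and its column validation are exercised.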
Files to Modify
| File | Lines | Action | Description |
|---|---|---|---|
| `src/data/external_loader.py` | 0-100 | Create | External data loading |
| `src/features/external_features.py` | 0-200 | Create | External feature transformers |
| `configs/external_sources.yaml` | 0-50 | Create | Source configuration |
| `scripts/train_with_external.py` | 0-80 | Create | Training script |
Technical Challenges
Aligning External Data: External data may arrive at a different granularity than the sales data.

```python
# Promotions: per SKU per day - direct merge
df = pd.merge(sales, promos, on=['date', 'sku_id'], how='left')

# Weather: per location per day - needs a store-to-location mapping
df = pd.merge(sales, weather, on=['date', 'store_location'], how='left')

# Events: date only - broadcast to all SKUs
df = pd.merge(sales, events, on=['date'], how='left')
```

Forward-Looking Features: External features may be "future known" if they are scheduled in advance.
```python
# These are ALLOWED because they are known at prediction time:
# - Planned promotions
# - Holidays
# - Scheduled events

# These require forecasts themselves:
# - Future weather (use weather forecasts)
# - Unscheduled events (cannot be predicted)
```

Definition of Done
- External data loads and aligns correctly
- Promotion features created (`is_on_promotion`, `discount_pct_feature`)
- Holiday features work with the calendar transformer
- External features appear in feature importance
- Tests pass: `pytest tests/test_external.py -v`
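The known-at-forecast-time rule above can be enforced mechanically before building features. A minimal sketch, assuming a frame with one weather actual (`temp_high`) and one scheduled regressor (`is_on_promotion`); the column names and cutoff date are illustrative, not part of the existing codebase:

```python
import pandas as pd
import numpy as np

# Hypothetical frame spanning history and the forecast horizon.
df = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=6, freq="D"),
    "temp_high": [10.0, 12.0, 9.0, 11.0, 13.0, 8.0],
    "is_on_promotion": [0, 1, 0, 0, 1, 0],
})

cutoff = pd.Timestamp("2024-01-04")  # last date with observed actuals
future = df["date"] > cutoff

# Weather actuals past the cutoff are not known at forecast time: mask them
# (in practice they would be replaced by weather-forecast values).
df.loc[future, "temp_high"] = np.nan

# Scheduled promotions ARE known ahead of time, so they stay untouched.
assert df.loc[future, "temp_high"].isna().all()
assert df.loc[future, "is_on_promotion"].notna().all()
```

Running a check like this in the test suite makes accidental leakage of future actuals fail loudly instead of silently inflating backtest accuracy.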
Risk Assessment
| Risk | Impact | Mitigation |
|---|---|---|
| Missing external data for dates | 🟡 Medium | Fill with sensible defaults |
| Data alignment issues | 🟡 Medium | Validate merge results |
| External data leakage | 🔴 High | Only use "known at forecast time" data |
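The first two mitigations can be combined in one step using pandas' merge indicator. A sketch with invented sample values; the `discount_pct` default of 0 follows the fill-with-sensible-defaults mitigation above:

```python
import pandas as pd

sales = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-01", "2024-01-02"]),
    "sku_id": ["A", "A"],
})
promos = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-02"]),
    "sku_id": ["A"],
    "discount_pct": [15.0],
})

# indicator=True adds a _merge column recording which side each row came
# from, which makes alignment problems visible immediately after the join.
merged = sales.merge(promos, on=["date", "sku_id"], how="left", indicator=True)
match_rate = (merged["_merge"] == "both").mean()
print(f"promo match rate: {match_rate:.0%}")

# Sensible default for dates with no promotion record: no discount.
merged["discount_pct"] = merged["discount_pct"].fillna(0.0)
assert merged["discount_pct"].isna().sum() == 0
```

Logging the match rate per external source at load time would catch both silently empty merges and unexpected row duplication.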
Time Estimate
Estimated Hours: 8-10 hours
Priority: Medium (improves accuracy)