
[Phase 9] External Regressors Integration #10

@Sakeeb91

Description

Summary

Integrate external regressors (promotions, holidays, weather, events) into the feature pipeline. These factors drive large swings in demand, and incorporating them properly improves forecast accuracy.

System Context

demand-forecasting-engine/
├── src/
│   ├── features/
│   │   ├── pipeline.py              # EXISTS - MODIFY
│   │   ├── calendar_features.py     # EXISTS
│   │   └── external_features.py     # <- CREATE
│   └── data/
│       ├── loader.py                # EXISTS
│       └── external_loader.py       # <- CREATE
├── configs/
│   └── external_sources.yaml        # <- CREATE
└── scripts/
    └── train_with_external.py       # <- CREATE

Dependencies

  • Requires: Phase 1 (Feature Pipeline)
  • Blocks: Phase 10 (Production Pipeline)

Proposed Solution

1. External Data Loader (src/data/external_loader.py)

import pandas as pd
import yaml
from typing import Optional


class ExternalDataLoader:
    """
    Load and align external data sources with main time series.

    Handles:
    - Promotions calendar
    - Holiday schedules
    - Weather data
    - Custom events
    """

    def __init__(self, config_path: Optional[str] = None):
        self.config = {}
        if config_path:
            with open(config_path, 'r') as f:
                self.config = yaml.safe_load(f)

    def load_promotions(
        self,
        filepath: str,
        date_column: str = 'date',
        sku_column: str = 'sku_id',
    ) -> pd.DataFrame:
        """
        Load promotion calendar.

        Expected columns:
        - date: Promotion date
        - sku_id: Product identifier
        - promo_type: Type of promotion (e.g., 'discount', 'bogo', 'bundle')
        - discount_pct: Discount percentage (0-100)
        - promo_name: Optional name of promotion
        """
        df = pd.read_csv(filepath, parse_dates=[date_column])

        # Validate required columns
        required = [date_column, sku_column, 'promo_type']
        missing = set(required) - set(df.columns)
        if missing:
            raise ValueError(f"Missing columns in promotions: {missing}")

        return df

    def load_weather(
        self,
        filepath: str,
        date_column: str = 'date',
        location_column: str = 'location',
    ) -> pd.DataFrame:
        """
        Load weather data.

        Expected columns:
        - date: Date of weather observation
        - location: Store/region location
        - temp_high: Daily high temperature
        - temp_low: Daily low temperature
        - precipitation: Precipitation amount
        - weather_condition: Sunny, rainy, snowy, etc.
        """
        df = pd.read_csv(filepath, parse_dates=[date_column])
        return df

    def load_events(
        self,
        filepath: str,
        date_column: str = 'date',
    ) -> pd.DataFrame:
        """
        Load custom event calendar.

        Events like: Back to school, Black Friday, local festivals, etc.
        """
        df = pd.read_csv(filepath, parse_dates=[date_column])
        return df

    def align_to_sales(
        self,
        sales_df: pd.DataFrame,
        external_df: pd.DataFrame,
        on: list,
        how: str = 'left',
    ) -> pd.DataFrame:
        """
        Align external data to sales data.

        Parameters
        ----------
        sales_df : pd.DataFrame
            Main sales data
        external_df : pd.DataFrame
            External data to align
        on : list
            Columns to merge on (e.g., ['date', 'sku_id'])
        how : str
            Merge type

        Returns
        -------
        pd.DataFrame
            Sales data with external features added
        """
        return pd.merge(sales_df, external_df, on=on, how=how)
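For reference, `align_to_sales` is a thin wrapper over `pd.merge`; a minimal self-contained sketch of the alignment behavior (toy data, hypothetical SKU ids):

```python
import pandas as pd

sales = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-01", "2024-01-02"]),
    "sku_id": ["SKU1", "SKU1"],
    "sales": [100, 90],
})
promos = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-01"]),
    "sku_id": ["SKU1"],
    "promo_type": ["discount"],
    "discount_pct": [20.0],
})

# align_to_sales defaults to a left merge: every sales row is kept,
# and days without a promotion get NaN in the promo columns.
merged = pd.merge(sales, promos, on=["date", "sku_id"], how="left")
print(merged["promo_type"].isna().tolist())  # [False, True]
```

The left merge is important: an inner merge would silently drop all non-promotion days from the training data.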

2. External Feature Transformers (src/features/external_features.py)

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from typing import List


class PromotionTransformer(BaseEstimator, TransformerMixin):
    """
    Transform promotion data into ML features.

    Creates features:
    - is_on_promotion: Binary flag
    - promo_type_encoded: One-hot or label encoded
    - discount_pct: Normalized discount
    - days_since_promo: Days since last promotion
    - days_to_promo: Days until next promotion (if known)
    """

    PROMO_TYPES = ['discount', 'bogo', 'bundle', 'flash', 'clearance', 'none']

    def __init__(
        self,
        promo_column: str = 'promo_type',
        discount_column: str = 'discount_pct',
        encode_type: str = 'onehot',
    ):
        self.promo_column = promo_column
        self.discount_column = discount_column
        self.encode_type = encode_type

    def fit(self, X: pd.DataFrame, y=None):
        """Fit transformer (learn promotion types)."""
        if self.promo_column in X.columns:
            self.promo_types_ = X[self.promo_column].dropna().unique().tolist()
        else:
            self.promo_types_ = []
        return self

    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        """Create promotion features."""
        X = X.copy()

        # Binary promotion flag
        if self.promo_column in X.columns:
            X['is_on_promotion'] = (~X[self.promo_column].isna()).astype(int)

            # One-hot encode promotion types
            if self.encode_type == 'onehot':
                for promo_type in self.PROMO_TYPES:
                    X[f'promo_{promo_type}'] = (X[self.promo_column] == promo_type).astype(int)

        # Discount percentage (fill NaN with 0)
        if self.discount_column in X.columns:
            X['discount_pct_feature'] = X[self.discount_column].fillna(0) / 100.0
        else:
            X['discount_pct_feature'] = 0.0

        return X

    def get_feature_names_out(self, input_features=None) -> List[str]:
        names = ['is_on_promotion', 'discount_pct_feature']
        for promo_type in self.PROMO_TYPES:
            names.append(f'promo_{promo_type}')
        return names
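The feature logic in `transform` can be reproduced in a few lines of plain pandas; a stand-alone sketch (toy data) of what `PromotionTransformer` emits:

```python
import pandas as pd

df = pd.DataFrame({
    "promo_type": ["discount", None, "bogo"],
    "discount_pct": [20.0, None, None],
})

# Binary flag: a row is "on promotion" whenever promo_type is present
df["is_on_promotion"] = (~df["promo_type"].isna()).astype(int)

# One-hot over the fixed PROMO_TYPES vocabulary
for promo_type in ["discount", "bogo", "bundle", "flash", "clearance", "none"]:
    df[f"promo_{promo_type}"] = (df["promo_type"] == promo_type).astype(int)

# Discount as a 0-1 fraction; missing means no discount
df["discount_pct_feature"] = df["discount_pct"].fillna(0) / 100.0

print(df["is_on_promotion"].tolist())       # [1, 0, 1]
print(df["discount_pct_feature"].tolist())  # [0.2, 0.0, 0.0]
```

One-hotting over the fixed `PROMO_TYPES` constant (rather than the types learned in `fit`) guarantees identical columns at train and serve time, even when a batch contains only some promotion types.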


class WeatherTransformer(BaseEstimator, TransformerMixin):
    """
    Transform weather data into ML features.

    Creates features:
    - temp_normalized: Temperature relative to seasonal norm
    - is_extreme_weather: Hot/cold/stormy conditions
    - precipitation_level: Binned precipitation
    - weather_impact_score: Combined weather impact metric
    """

    def __init__(
        self,
        temp_high_col: str = 'temp_high',
        temp_low_col: str = 'temp_low',
        precip_col: str = 'precipitation',
        condition_col: str = 'weather_condition',
    ):
        self.temp_high_col = temp_high_col
        self.temp_low_col = temp_low_col
        self.precip_col = precip_col
        self.condition_col = condition_col

    def fit(self, X: pd.DataFrame, y=None):
        """Calculate seasonal temperature norms."""
        if self.temp_high_col in X.columns:
            # Calculate monthly averages for normalization
            if 'month' in X.columns:
                self.monthly_temp_avg_ = X.groupby('month')[self.temp_high_col].mean().to_dict()
            else:
                self.monthly_temp_avg_ = {i: X[self.temp_high_col].mean() for i in range(1, 13)}
        else:
            self.monthly_temp_avg_ = {}
        return self

    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        """Create weather features."""
        X = X.copy()

        # Temperature features
        if self.temp_high_col in X.columns:
            X['temp_avg'] = (X[self.temp_high_col] + X.get(self.temp_low_col, X[self.temp_high_col])) / 2

            # Temperature deviation from the seasonal (monthly) norm
            if 'month' in X.columns:
                X['temp_deviation'] = X['temp_avg'] - X['month'].map(self.monthly_temp_avg_).fillna(0)
            else:
                X['temp_deviation'] = X['temp_avg'] - X['temp_avg'].mean()
            else:
                X['temp_deviation'] = X['temp_avg'] - X['temp_avg'].mean()

            # Extreme temperature flags
            X['is_hot'] = (X['temp_avg'] > 30).astype(int)
            X['is_cold'] = (X['temp_avg'] < 5).astype(int)

        # Precipitation features
        if self.precip_col in X.columns:
            X['has_precipitation'] = (X[self.precip_col] > 0).astype(int)
            # NOTE: the 90th-percentile threshold is computed on the batch
            # passed to transform; for strict train/serve consistency it
            # should be learned in fit() and stored as an attribute.
            X['heavy_precipitation'] = (X[self.precip_col] > X[self.precip_col].quantile(0.9)).astype(int)

        # Weather condition encoding
        if self.condition_col in X.columns:
            conditions = ['sunny', 'cloudy', 'rainy', 'snowy', 'stormy']
            for condition in conditions:
                X[f'weather_{condition}'] = X[self.condition_col].str.lower().str.contains(condition, na=False).astype(int)

        return X

    def get_feature_names_out(self, input_features=None) -> List[str]:
        return [
            'temp_avg', 'temp_deviation', 'is_hot', 'is_cold',
            'has_precipitation', 'heavy_precipitation',
            'weather_sunny', 'weather_cloudy', 'weather_rainy',
            'weather_snowy', 'weather_stormy',
        ]
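A stand-alone sketch (toy data) of the temperature features, with the monthly norm that `fit` would learn hard-coded for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "month": [1, 7],
    "temp_high": [2.0, 34.0],
    "temp_low": [-4.0, 22.0],
})

# Daily average temperature
df["temp_avg"] = (df["temp_high"] + df["temp_low"]) / 2

# Seasonal norm per month; normally learned in fit(), hard-coded here
monthly_norm = {1: 0.0, 7: 27.0}
df["temp_deviation"] = df["temp_avg"] - df["month"].map(monthly_norm)

# Extreme-temperature flags with the same thresholds as the transformer
df["is_hot"] = (df["temp_avg"] > 30).astype(int)
df["is_cold"] = (df["temp_avg"] < 5).astype(int)

print(df["temp_avg"].tolist())  # [-1.0, 28.0]
print(df["is_cold"].tolist())   # [1, 0]
```

The deviation feature is what lets the model distinguish "unseasonably warm January" from "ordinary July": both may have the same raw temperature but very different demand effects.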


class EventTransformer(BaseEstimator, TransformerMixin):
    """
    Transform event calendar into ML features.

    Creates features:
    - is_event: Binary event flag
    - event_type_encoded: Type of event
    - days_to_event: Days until next major event
    - days_since_event: Days since last major event
    """

    def __init__(
        self,
        event_column: str = 'event_name',
        event_type_column: str = 'event_type',
    ):
        self.event_column = event_column
        self.event_type_column = event_type_column

    def fit(self, X: pd.DataFrame, y=None):
        if self.event_type_column in X.columns:
            self.event_types_ = X[self.event_type_column].dropna().unique().tolist()
        else:
            self.event_types_ = []
        return self

    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        X = X.copy()

        # Binary event flag
        if self.event_column in X.columns:
            X['is_event'] = (~X[self.event_column].isna()).astype(int)

            # One-hot encode event types
            if self.event_type_column in X.columns:
                for event_type in self.event_types_:
                    X[f'event_{event_type}'] = (X[self.event_type_column] == event_type).astype(int)

        return X

    def get_feature_names_out(self, input_features=None) -> List[str]:
        names = ['is_event']
        for et in self.event_types_:
            names.append(f'event_{et}')
        return names
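The `days_to_event` / `days_since_event` features mentioned in the docstring are not produced by `transform` yet; a minimal sketch of how they could be computed (hypothetical event dates; -1 encodes "no event in that direction"):

```python
import pandas as pd

# Hypothetical daily index with two event days
dates = pd.date_range("2024-01-01", periods=6, freq="D")
df = pd.DataFrame({"date": dates})
event_dates = pd.to_datetime(["2024-01-02", "2024-01-05"])

def days_to_next_event(d):
    """Days until the nearest event on or after d; -1 if none known."""
    future = [(e - d).days for e in event_dates if e >= d]
    return min(future) if future else -1

def days_since_last_event(d):
    """Days since the nearest event on or before d; -1 if none known."""
    past = [(d - e).days for e in event_dates if e <= d]
    return min(past) if past else -1

df["days_to_event"] = df["date"].apply(days_to_next_event)
df["days_since_event"] = df["date"].apply(days_since_last_event)

print(df["days_to_event"].tolist())  # [1, 0, 2, 1, 0, -1]
```

Both helpers only look at a known event calendar, so they stay safe to compute for future dates as long as the calendar is planned in advance.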

3. External Sources Config (configs/external_sources.yaml)

# External data source configuration

promotions:
  enabled: true
  filepath: data/external/promotions.csv
  date_column: date
  sku_column: sku_id
  columns:
    - promo_type
    - discount_pct
    - promo_name

weather:
  enabled: false  # Enable when weather data available
  filepath: data/external/weather.csv
  date_column: date
  location_column: store_location
  columns:
    - temp_high
    - temp_low
    - precipitation
    - weather_condition

events:
  enabled: true
  filepath: data/external/events.csv
  date_column: date
  columns:
    - event_name
    - event_type
    - event_scope  # national, regional, local

# Feature generation settings
feature_settings:
  promotions:
    encode_type: onehot
    include_discount: true
  weather:
    normalize_temperature: true
    bin_precipitation: true
  events:
    lookahead_days: 7
    lookback_days: 7
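The training script consumes this file with `yaml.safe_load`; a small sketch of how the `enabled` flags gate each source (inline YAML stands in for the config file):

```python
import yaml

config_text = """
promotions:
  enabled: true
  filepath: data/external/promotions.csv
weather:
  enabled: false
  filepath: data/external/weather.csv
"""

config = yaml.safe_load(config_text)

# Gate each source on its `enabled` flag, as the training script does
for source, settings in config.items():
    if settings.get("enabled"):
        print(f"would load {source} from {settings['filepath']}")
# prints only the promotions line
```

Using `.get("enabled")` rather than `["enabled"]` means a source section with the flag omitted is treated as disabled instead of raising a `KeyError`.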

4. Training Script (scripts/train_with_external.py)

#!/usr/bin/env python
"""Train model with external regressors."""
import argparse
import pandas as pd
import yaml

from src.data.loader import load_sales_data
from src.data.external_loader import ExternalDataLoader
from src.features.pipeline import build_feature_pipeline
from src.features.external_features import PromotionTransformer, WeatherTransformer, EventTransformer
from src.models.ensemble import ForecastingEnsemble
from src.evaluation.splits import temporal_train_test_split
from src.evaluation.metrics import calculate_metrics, print_metrics


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--data", required=True)
    parser.add_argument("--config", default="configs/external_sources.yaml")
    args = parser.parse_args()

    # Load config
    with open(args.config, 'r') as f:
        config = yaml.safe_load(f)

    # Load sales data
    print("Loading sales data...")
    df = load_sales_data(args.data)

    # Load and merge external data
    external_loader = ExternalDataLoader()

    if config.get('promotions', {}).get('enabled'):
        print("Loading promotions...")
        promos = external_loader.load_promotions(config['promotions']['filepath'])
        df = external_loader.align_to_sales(df, promos, on=['date', 'sku_id'])

    if config.get('events', {}).get('enabled'):
        print("Loading events...")
        events = external_loader.load_events(config['events']['filepath'])
        df = external_loader.align_to_sales(df, events, on=['date'])

    # Build feature pipeline with external features
    print("Building features...")
    pipeline = build_feature_pipeline()
    df_features = pipeline.fit_transform(df)

    # Add external feature transformers
    if 'promo_type' in df_features.columns:
        promo_transformer = PromotionTransformer()
        df_features = promo_transformer.fit_transform(df_features)

    if 'event_name' in df_features.columns:
        event_transformer = EventTransformer()
        df_features = event_transformer.fit_transform(df_features)

    # Prepare training data
    # Drop identifiers, the target, and raw external columns whose
    # encoded versions are produced by the transformers above
    drop_cols = ['date', 'sku_id', 'sales', 'promo_type', 'promo_name',
                 'discount_pct', 'event_name', 'event_type']
    feature_cols = [c for c in df_features.columns if c not in drop_cols]
    X = df_features[feature_cols].dropna()
    y = df_features.loc[X.index, 'sales']

    print(f"Features: {len(feature_cols)} (including external)")

    # Train and evaluate
    X_train, X_test, y_train, y_test = temporal_train_test_split(X, y)

    model = ForecastingEnsemble()
    model.fit(X_train, y_train, eval_set=(X_test, y_test))

    y_pred = model.predict(X_test)
    metrics = calculate_metrics(y_test.values, y_pred)
    print_metrics(metrics)

    # Show external feature importance
    importance = model.get_feature_importance()
    external_features = importance[importance['feature'].str.contains('promo|weather|event')]
    if len(external_features) > 0:
        print("\nExternal Feature Importance:")
        print(external_features.head(10).to_string(index=False))


if __name__ == "__main__":
    main()

Implementation Checklist

  • Create ExternalDataLoader for loading external sources
  • Implement PromotionTransformer with encoding
  • Implement WeatherTransformer with normalization
  • Implement EventTransformer for custom events
  • Create external sources configuration file
  • Update feature pipeline to include external transformers
  • Create training script with external data
  • Test with sample external data

Files to Modify

| File | Lines | Action | Description |
|------|-------|--------|-------------|
| src/data/external_loader.py | 0-100 | Create | External data loading |
| src/features/external_features.py | 0-200 | Create | External feature transformers |
| configs/external_sources.yaml | 0-50 | Create | Source configuration |
| scripts/train_with_external.py | 0-80 | Create | Training script |

Technical Challenges

Aligning External Data: External data may have different granularity than sales data.

# Promotions: Per SKU per day - direct merge
df = pd.merge(sales, promos, on=['date', 'sku_id'], how='left')

# Weather: Per location per day - need location mapping
df = pd.merge(sales, weather, on=['date', 'store_location'], how='left')

# Events: Date only - broadcast to all SKUs
df = pd.merge(sales, events, on=['date'], how='left')
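Merge results should be validated after alignment; `pd.merge`'s `indicator` flag makes this cheap. A sketch with toy frames:

```python
import pandas as pd

sales = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-01", "2024-01-02"]),
    "sku_id": ["SKU1", "SKU2"],
})
events = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-01"]),
    "event_name": ["New Year"],
})

# indicator=True adds a _merge column recording where each row came from
merged = pd.merge(sales, events, on=["date"], how="left", indicator=True)

match_rate = (merged["_merge"] == "both").mean()
print(f"{match_rate:.0%} of sales rows matched an event")  # 50%

# A left merge should never change the row count; duplicate keys in the
# right-hand table would silently fan out rows
assert len(merged) == len(sales)
```

An unexpectedly low match rate usually points to mismatched date dtypes or key formatting (e.g. string vs. datetime dates), and a row-count increase points to duplicate keys in the external table.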

Forward-Looking Features: External features can be "future known" if scheduled.

# These are ALLOWED because they're known at prediction time:
# - Planned promotions
# - Holidays
# - Scheduled events

# These require forecasts themselves:
# - Future weather (use weather forecasts)
# - Unscheduled events (cannot predict)
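One way to enforce this distinction is an explicit allowlist of future-known columns, applied before building the future feature frame. A sketch with illustrative column names (`future_safe_columns` is a hypothetical helper, not part of the codebase):

```python
# Columns known at forecast time (planned promotions, fixed calendars,
# scheduled events) vs. columns that would themselves need forecasting.
KNOWN_IN_ADVANCE = {
    "is_on_promotion", "discount_pct_feature",  # planned promotions
    "is_holiday",                               # fixed calendar
    "is_event", "days_to_event",                # scheduled events
}
REQUIRES_FORECAST = {
    "temp_avg", "temp_deviation", "has_precipitation",  # future weather
}

def future_safe_columns(columns):
    """Keep only columns usable for future dates without a forecast."""
    return [c for c in columns if c in KNOWN_IN_ADVANCE]

cols = ["is_on_promotion", "temp_avg", "is_event", "sales_lag_7"]
print(future_safe_columns(cols))  # ['is_on_promotion', 'is_event']
```

An allowlist fails closed: a new external feature is excluded from future frames until someone explicitly decides it is leakage-safe.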

Definition of Done

  • External data loads and aligns correctly
  • Promotion features created (is_on_promotion, discount)
  • Holiday features work with calendar transformer
  • External features show in feature importance
  • Tests pass: pytest tests/test_external.py -v

Risk Assessment

| Risk | Impact | Mitigation |
|------|--------|------------|
| Missing external data for dates | 🟡 Medium | Fill with sensible defaults |
| Data alignment issues | 🟡 Medium | Validate merge results |
| External data leakage | 🔴 High | Only use "known at forecast time" data |

Time Estimate

Estimated effort: 8-10 hours
Priority: Medium (improves accuracy)
