Conversation

@fkiraly fkiraly commented Aug 24, 2025

This PR adds a SktimeForecastingTask, which defines a full benchmarking run for a forecaster that is passed later, at evaluation time, via the _evaluate params.

This object could be used as a "task" in the sktime ForecastingBenchmark.
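
A minimal usage sketch, assuming the score(params) entry point of BaseExperiment returns a (value, metadata) pair; the import path below is only a placeholder, not necessarily where the class will live:

# minimal usage sketch -- the import path is a placeholder, not the PR's actual module
from sktime.datasets import load_airline
from sktime.forecasting.naive import NaiveForecaster
from sktime.split import ExpandingWindowSplitter

from hyperactive.experiment.integrations import SktimeForecastingTask  # placeholder path

y = load_airline()
cv = ExpandingWindowSplitter(initial_window=36, step_length=12, fh=12)

task = SktimeForecastingTask(cv=cv, y=y)

# the forecaster is supplied at evaluation time, not at construction
score, metadata = task.score({"forecaster": NaiveForecaster(strategy="last")})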

Draft for discussion and design review:

  • it is quite similar to, and partially duplicative of, SktimeForecastingExperiment, which is used in tuning. How should we deal with the similarity and intersection?
    • we could merge the two into a single class, branching on whether a forecaster gets passed at construction or not (see the sketch after this list). Not sure where that leads, though.
  • is this a possible 1:1 drop-in (or almost) for the task object in sktime?
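
One possible shape for the merge, as a discussion aid only (class name and structure below are placeholders, not a proposal of the final API): forecaster becomes an optional __init__ argument, and _evaluate falls back to the score params when it is not set.

# discussion sketch only -- name and structure are placeholders
from hyperactive.base import BaseExperiment


class SktimeForecastingExperimentOrTask(BaseExperiment):
    """Single class covering both the tuning and the benchmarking use case."""

    def __init__(self, cv, y, X=None, forecaster=None, scoring=None):
        self.cv = cv
        self.y = y
        self.X = X
        self.forecaster = forecaster  # None means the forecaster comes via score params
        self.scoring = scoring
        super().__init__()

    def _paramnames(self):
        # tuning mode exposes the forecaster's hyperparameters,
        # task mode exposes the forecaster itself as the single search parameter
        if self.forecaster is not None:
            return list(self.forecaster.get_params().keys())
        return ["forecaster"]

    def _evaluate(self, params):
        if self.forecaster is not None:
            forecaster = self.forecaster.clone().set_params(**params)
        else:
            forecaster = params["forecaster"]
        ...  # run sktime's evaluate on forecaster, as in SktimeForecastingTask below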

@fkiraly fkiraly added the enhancement New feature or request label Aug 24, 2025

@arnavk23 arnavk23 left a comment

I made some corrections to your file here:

# copyright: hyperactive developers, MIT License (see LICENSE file)

import numpy as np

from hyperactive.base import BaseExperiment


class SktimeForecastingTask(BaseExperiment):
    """Experiment adapter for forecast backtesting benchmark run.

    This class is used to perform backtesting experiments using a given
    sktime forecaster. It allows for hyperparameter tuning and evaluation of
    the model's performance.

    The score returned is the summary backtesting score,
    of applying ``sktime`` ``evaluate`` to an estimator passed as ``forecaster``
    in the ``score`` ``params``.

    The backtesting performed is specified by the ``cv`` parameter,
    and the scoring metric is specified by the ``scoring`` parameter.
    The ``X`` and ``y`` parameters are the input data and target values,
    which are used in fit/predict cross-validation.

    Differs from ``SktimeForecastingExperiment`` in that ``forecaster``
    is passed as a parameter directly to ``score`` and not to ``__init__``.
    """

    _tags = {
        "authors": "fkiraly",
        "maintainers": "fkiraly",
        "python_dependencies": "sktime",  # python dependencies
    }

    def __init__(
        self,
        cv,
        y,
        X=None,
        strategy="refit",
        scoring=None,
        error_score=np.nan,
        cv_X=None,
        backend=None,
        backend_params=None,
    ):
        self.X = X
        self.y = y
        self.strategy = strategy
        self.scoring = scoring
        self.cv = cv
        self.error_score = error_score
        self.cv_X = cv_X
        self.backend = backend
        self.backend_params = backend_params

        super().__init__()

        if scoring is None:
            from sktime.performance_metrics.forecasting import (
                MeanAbsolutePercentageError,
            )

            self._scoring = MeanAbsolutePercentageError(symmetric=True)
        else:
            self._scoring = scoring

        # Set a boolean tag indicating whether higher scores are better.
        # The default sMAPE metric is lower-is-better; otherwise the metric's
        # own "lower_is_better" tag is queried.
        try:
            lower_is_better = (
                True
                if scoring is None
                else bool(self._scoring.get_tag("lower_is_better", False))
            )
        except Exception:
            # If the metric does not expose get_tag, assume higher is better
            # for custom metrics; the default metric remains lower-is-better.
            lower_is_better = True if scoring is None else False

        higher_is_better = not lower_is_better
        # Use a conventional boolean tag for the rest of the codebase
        try:
            self.set_tags(**{"higher_is_better": higher_is_better})
        except Exception:
            # If set_tags is not available or fails, ignore tagging but continue.
            pass

    def _paramnames(self):
        """Return the parameter names of the search."""
        return ["forecaster"]

    def _evaluate(self, params):
        """Evaluate the parameters.

        Parameters
        ----------
        params : dict with string keys
            Parameters to evaluate.

        Returns
        -------
        float
            The value of the parameters as per evaluation.
        dict
            Additional metadata about the search.
        """
        from sktime.forecasting.model_evaluation import evaluate

        forecaster = params.get("forecaster", None)
        if forecaster is None:
            raise ValueError("SktimeForecastingTask._evaluate requires params to include a 'forecaster' entry")

        try:
            results = evaluate(
                forecaster,
                cv=self.cv,
                y=self.y,
                X=self.X,
                strategy=self.strategy,
                scoring=self._scoring,
                error_score=self.error_score,
                cv_X=self.cv_X,
                backend=self.backend,
                backend_params=self.backend_params,
            )
        except Exception as e:
            # If user explicitly wants exceptions to propagate:
            if self.error_score == "raise":
                raise
            # Otherwise return error_score and capture the exception message
            return self.error_score, {"error": str(e)}

        # Determine scoring column name robustly
        scoring_name = (
            getattr(self._scoring, "name", None) or self._scoring.__class__.__name__
        )
        result_name = f"test_{scoring_name}"

        add_info = {"results": results}

        # Results handling robust to DataFrame-like or dict-like outputs
        try:
            # If results is a pandas DataFrame-like object:
            if hasattr(results, "columns"):
                if result_name in results.columns:
                    res_values = results[result_name]
                else:
                    # find a test_* column as fallback
                    test_cols = [c for c in results.columns if str(c).startswith("test_")]
                    if test_cols:
                        res_values = results[test_cols[0]]
                        add_info["warning"] = (
                            f"expected column '{result_name}' not found; using '{test_cols[0]}' instead"
                        )
                    else:
                        raise ValueError(f"No 'test_*' column found in evaluate results; expected '{result_name}'")
            else:
                # dict-like fallback
                if result_name in results:
                    res_values = results[result_name]
                else:
                    test_keys = [k for k in results.keys() if str(k).startswith("test_")]
                    if test_keys:
                        res_values = results[test_keys[0]]
                        add_info["warning"] = (
                            f"expected key '{result_name}' not found; using '{test_keys[0]}' instead"
                        )
                    else:
                        raise ValueError(f"No 'test_*' key found in evaluate results; expected '{result_name}'")
        except Exception as e:
            # Preserve original exception info
            if self.error_score == "raise":
                raise
            return self.error_score, {"error": str(e), **add_info}

        # Compute scalar summary result
        try:
            res_float = float(np.nanmean(res_values))
        except Exception:
            # Last-resort attempt: convert to numpy array and take mean
            try:
                res_float = float(np.nanmean(np.asarray(res_values)))
            except Exception as e:
                if self.error_score == "raise":
                    raise
                return self.error_score, {
                    "error": f"Could not compute mean of results: {e}",
                    **add_info,
                }

        return res_float, add_info

    @classmethod
    def get_test_params(cls, parameter_set="default"):
        """Return testing parameter settings for the skbase object."""
        from sktime.datasets import load_airline, load_longley
        from sktime.split import ExpandingWindowSplitter

        y = load_airline()
        params0 = {
            "cv": ExpandingWindowSplitter(initial_window=36, step_length=12, fh=12),
            "y": y,
        }

        from sktime.performance_metrics.forecasting import MeanAbsolutePercentageError

        y, X = load_longley()
        params1 = {
            "cv": ExpandingWindowSplitter(initial_window=3, step_length=3, fh=1),
            "y": y,
            "X": X,
            "scoring": MeanAbsolutePercentageError(symmetric=False),
        }

        return [params0, params1]

    @classmethod
    def _get_score_params(cls):
        """Return settings for testing score/evaluate functions. Used in tests only."""
        from sktime.forecasting.naive import NaiveForecaster

        val0 = {"forecaster": NaiveForecaster(strategy="last")}
        val1 = {"forecaster": NaiveForecaster(strategy="last")}
        return [val0, val1]

fkiraly commented Nov 22, 2025

@arnavk23, can you kindly explain what you corrected and why?

arnavk23 commented Nov 22, 2025

@arnavk23, can you kindly explain what you corrected and why?

  1. Added validation for forecaster in params
    The original version assumed params["forecaster"] always existed.
    I added an explicit check and a clear error message because missing/incorrect parameters otherwise raise cryptic errors deeper inside sktime.evaluate.

  2. Made scoring metric handling more robust
    The previous code assumed that any scoring object implements get_tag("lower_is_better").
    I wrapped this in a try/except and added correct defaults for both cases (scoring=None or custom metrics).

  3. Safely applied the higher_is_better tag
    Previously, set_tags() was called without handling the case where it fails or is not supported; now such failures are caught and tagging is simply skipped.

  4. Improved parsing of the output from sktime.evaluate()
    The previous implementation assumed that the result is always a DataFrame and that the scoring column is always named exactly "test_<scoring.name>".
    I added support for both DataFrame-like and dict-like outputs, a fallback to the first available test_* column if the expected name isn't present, and a warning in the returned metadata whenever the fallback is used (a toy illustration follows this list).

  5. Better error handling during evaluate
    Previously, any exception inside evaluate() could crash the run or create inconsistent behavior.
    Now error_score="raise" preserves the expected raising behavior; otherwise the method returns (error_score, {"error": str(e)}) with the exception message captured.

  6. Robust conversion of results to a scalar
    The earlier implementation assumed you can always do float(results.mean()).
    I added the use of np.nanmean, a fallback to np.asarray if needed, and structured error reporting if even that fails.
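
To make point 4 concrete, a toy illustration of the column fallback on a hand-made results frame (not actual sktime.evaluate output):

import pandas as pd

# toy results frame: the expected metric column is missing,
# but another "test_*" column is present
results = pd.DataFrame(
    {"fit_time": [0.1, 0.2], "test_MeanAbsoluteError": [0.25, 0.75]}
)

result_name = "test_MeanAbsolutePercentageError"  # expected name, not present here
test_cols = [c for c in results.columns if str(c).startswith("test_")]
column = result_name if result_name in results.columns else test_cols[0]

print(column)                         # test_MeanAbsoluteError (fallback column)
print(float(results[column].mean()))  # 0.5 -- the scalar summary that _evaluate returns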
