24 commits
4d653a9
add group by variables to base forecast transformer
Ezzaldin97 Feb 23, 2024
4e9d849
add group by variables to lag_features
Ezzaldin97 Feb 23, 2024
7f40391
add group by window features
Ezzaldin97 Feb 25, 2024
b476748
add group by expanding window features
Ezzaldin97 Feb 25, 2024
02c59bd
add test cases of groupby timeseries features
Ezzaldin97 Feb 25, 2024
0dd92cc
ensure code style tests
Ezzaldin97 Feb 25, 2024
47de2d6
fixing typehint errors
Ezzaldin97 Feb 25, 2024
dd43c27
fixing docs indentation issue
Ezzaldin97 Feb 25, 2024
7459811
fixing docs indentation issue in lag_features
Ezzaldin97 Feb 25, 2024
12aa825
adjust formatting and code style in tests
Ezzaldin97 Feb 29, 2024
c3bee66
refactoring timeseries & reformatting the code
Ezzaldin97 Feb 29, 2024
67725dc
adjust code formatting & style in tests
Ezzaldin97 Mar 2, 2024
9cb01ea
fix create lag features using groupby & freq parameters
Ezzaldin97 Mar 2, 2024
72ce43c
adjust code style
Ezzaldin97 Mar 2, 2024
9d999b0
add test cases to ensure code coverage
Ezzaldin97 Mar 2, 2024
b7b8bc9
add group_by docstring to _docstring
Ezzaldin97 Apr 1, 2024
ba375a4
remove check input of group_by
Ezzaldin97 Apr 1, 2024
90f08f4
enhance performance of group_by window features operations
Ezzaldin97 Apr 1, 2024
66baa75
enhance performance of group_by expanding window features operations
Ezzaldin97 Apr 1, 2024
92f996d
fix reindexing to original index after grouping bug
Ezzaldin97 Apr 1, 2024
152c037
fix reindexing to original index after grouping operation bug
Ezzaldin97 Apr 1, 2024
5343e50
replacing group_by docstring with group_by_docstring
Ezzaldin97 Apr 1, 2024
ef1eaa8
adjust code-style and formatting
Ezzaldin97 Apr 1, 2024
09db782
remove white spaces
Ezzaldin97 Apr 2, 2024
4 changes: 2 additions & 2 deletions feature_engine/selection/drop_psi_features.py
@@ -1,5 +1,5 @@
import datetime
-from typing import List, Union
+from typing import List, Union, Dict

import numpy as np
import pandas as pd
@@ -475,7 +475,7 @@ def fit(self, X: pd.DataFrame, y: pd.Series = None):
threshold_cat = self.threshold

# Compute the PSI by looping over the features
-self.psi_values_ = {}
+self.psi_values_: Dict = {}
Collaborator

We resolved this in a different PR. Could we remove this change from here please?

self.features_to_drop_ = []

# Compute PSI for numerical features
62 changes: 43 additions & 19 deletions feature_engine/timeseries/forecasting/base_forecast_transformers.py
@@ -5,30 +5,21 @@
from sklearn.utils.validation import check_is_fitted

from feature_engine._base_transformers.mixins import GetFeatureNamesOutMixin
from feature_engine._check_init_parameters.check_variables import (
_check_variables_input_value,
)
from feature_engine._check_init_parameters.check_variables import \
_check_variables_input_value
from feature_engine._docstrings.fit_attributes import (
_feature_names_in_docstring,
_n_features_in_docstring,
)
_feature_names_in_docstring, _n_features_in_docstring)
from feature_engine._docstrings.init_parameters.all_trasnformers import (
_drop_original_docstring,
_missing_values_docstring,
)
_drop_original_docstring, _missing_values_docstring)
from feature_engine._docstrings.methods import _fit_not_learn_docstring
from feature_engine._docstrings.substitute import Substitution
from feature_engine.dataframe_checks import (
_check_contains_inf,
_check_contains_na,
_check_X_matches_training_df,
check_X,
)
from feature_engine.dataframe_checks import (_check_contains_inf,
Contributor

formatting

Author

I thought that black, isort and flake8 would automatically adjust the formatting and code style, because no problems appeared during development or in CI.

Contributor

I'm not sure that this is the case...

Collaborator

On CircleCI we have automatic checks for formatting, but they do not fix it automatically. You need to run isort and black on your files before pushing them for the tests to pass.

Author

I am going to resolve it, and push again.

_check_contains_na,
_check_X_matches_training_df,
check_X)
from feature_engine.tags import _return_tags
from feature_engine.variable_handling import (
check_numerical_variables,
find_numerical_variables,
)
from feature_engine.variable_handling import (check_numerical_variables,
find_numerical_variables)


@Substitution(
@@ -51,6 +42,9 @@ class BaseForecastTransformer(BaseEstimator, TransformerMixin, GetFeatureNamesOu

{drop_original}

group_by_variables: str, list of str, default=None
Collaborator

I'd call this parameter group_by to keep it similar to the pandas method.

Collaborator

Also, we do allow variable names to be integers, so in theory it could also take int and list of ints.

Author

Yes, sure. I am going to rename this parameter, allow it to be passed as an integer or a list of integers, and use the check_variables methods to check that it exists in the dataframe. Thank you.

variable or list of variables to create lag features based on.

Attributes
----------
{feature_names_in_}
@@ -64,6 +58,7 @@ def __init__(
variables: Union[None, int, str, List[Union[str, int]]] = None,
missing_values: str = "raise",
drop_original: bool = False,
group_by_variables: Optional[Union[str, List[str]]] = None,
) -> None:

if missing_values not in ["raise", "ignore"]:
@@ -78,9 +73,26 @@ def __init__(
f"Got {drop_original} instead."
)

# check validity if group by variables passed
if group_by_variables:
# check group by variables data-types
if not (
isinstance(group_by_variables, str)
or isinstance(group_by_variables, list)
):
raise ValueError(
"group_by_variables must be a string or a list of strings. "
f"Got {group_by_variables} instead."
)
# check if passed list has duplicates.
if isinstance(group_by_variables, list):
if len(set(group_by_variables)) != len(group_by_variables):
raise ValueError("group_by_variables contains duplicate values")
Collaborator

Looks like tests for these two error checks are missing? I haven't gotten to the end of the PR, so forgive me if I am wrong.

Author

I am going to add some test cases for them.


Contributor

To improve the code, I would do the following: after the checks on group_by_variables, put this variable into a list if it is just a string. That way, from then on you know that you are working with a list of strings and you don't need any further checks (for example, the one at line 180).

You could do something like:

if group_by_variables:
    if isinstance(group_by_variables, str):
        self.group_by_variables = [group_by_variables]
    elif not (
        isinstance(group_by_variables, list)
        and all(isinstance(element, str) for element in group_by_variables)
    ):
        raise ValueError(
            "group_by_variables must be a string or a list of strings. "
            f"Got {group_by_variables} instead."
        )
    else:
        # note that if you are here, then group_by_variables is a list
        if len(set(group_by_variables)) != len(group_by_variables):
            raise ValueError("group_by_variables contains duplicate values")

Collaborator

@solegalli, Feb 28, 2024

Unfortunately, for consistency with the scikit-learn API, we cannot modify the input parameters. We need to leave them as they are.
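
The two suggestions in this thread (accept strings, integers, or lists of them, and leave the init parameter untouched) can be combined by validating in `__init__` and normalising to a list only at fit time, in a separate attribute. The following is a minimal sketch under those assumptions; `_validate_group_by`, `_as_list`, `SketchTransformer` and the `group_by_` attribute are hypothetical names for illustration, not part of feature-engine.

```python
from typing import List, Optional, Union

GroupBy = Optional[Union[int, str, List[Union[int, str]]]]


def _validate_group_by(group_by: GroupBy) -> None:
    """Hypothetical check: allow None, a str/int, or a list of unique str/int."""
    if group_by is None or isinstance(group_by, (str, int)):
        return
    if not isinstance(group_by, list) or not all(
        isinstance(value, (str, int)) for value in group_by
    ):
        raise ValueError(
            "group_by must be a string, an integer, or a list of strings or "
            f"integers. Got {group_by} instead."
        )
    if len(set(group_by)) != len(group_by):
        raise ValueError("group_by contains duplicate values.")


def _as_list(group_by: GroupBy) -> List[Union[int, str]]:
    """Hypothetical helper: return a new list without touching the input."""
    if group_by is None:
        return []
    return [group_by] if isinstance(group_by, (str, int)) else list(group_by)


class SketchTransformer:
    def __init__(self, group_by: GroupBy = None) -> None:
        _validate_group_by(group_by)
        # Stored exactly as passed, so get_params/clone see the original value.
        self.group_by = group_by

    def fit(self) -> "SketchTransformer":
        # The normalised copy lives in a fitted attribute (trailing underscore).
        self.group_by_ = _as_list(self.group_by)
        return self
```

Testing against `None` rather than truthiness also keeps an integer column name such as `0` from being silently skipped.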

self.variables = _check_variables_input_value(variables)
self.missing_values = missing_values
self.drop_original = drop_original
self.group_by_variables = group_by_variables

def _check_index(self, X: pd.DataFrame):
"""
@@ -165,6 +177,18 @@ def fit(self, X: pd.DataFrame, y: Optional[pd.Series] = None):
if self.missing_values == "raise":
self._check_na_and_inf(X)

if self.group_by_variables:
Collaborator

Do we need to add this functionality to feature-engine? Pandas will fail if the variables are not in the dataframe.

Author

@solegalli Yes, that's true, pandas will fail in that case. I am going to remove it and push again.
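
For reference, the pandas behaviour being relied on here can be seen with a small standalone example. This is only an illustration with made-up column names, not code from the PR: grouping by a column that is not in the dataframe raises a `KeyError` as soon as `groupby` is called.

```python
import pandas as pd

X = pd.DataFrame({"x1": [1, 2, 3], "x3": ["a", "a", "b"]})

try:
    # "not_a_column" is deliberately absent from X.
    X.groupby("not_a_column")
except KeyError as err:
    message = f"pandas raised KeyError: {err}"
else:
    message = "no error"

print(message)
```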

Collaborator

Another thing that comes before this: in line 168 we are explicitly forbidding duplicate values in the index, but if we allow groupby, we need to think about what to do with this check.

Author

@solegalli I think there will be no duplicates in the resulting dataframe, because we are creating the features for every group, as in the example below:

    >>> import pandas as pd
    >>> from feature_engine.timeseries.forecasting import ExpandingWindowFeatures
    >>> X = pd.DataFrame(dict(date = ["2022-09-18",
    ...                               "2022-09-19",
    ...                               "2022-09-20",
    ...                               "2022-09-21",
    ...                               "2022-09-22",
    ...                               "2022-09-18",
    ...                               "2022-09-19",
    ...                               "2022-09-20",
    ...                               "2022-09-21",
    ...                               "2022-09-22"],
    ...                       x1 = [1,2,3,4,5, 3,5,6,8,11],
    ...                       x2 = [6,7,8,9,10, 2,9,10,15,2],
    ...                       x3 = ['a','a','a','a','a', 'b','b','b','b','b']
    ...                  ))
    >>> ewf = ExpandingWindowFeatures(group_by_variables='x3')
    >>> ewf.fit_transform(X)

The result is:

             date  x1  x2 x3  x1_expanding_mean  x2_expanding_mean
    0  2022-09-18   1   6  a                NaN                NaN
    1  2022-09-19   2   7  a           1.000000                6.0
    2  2022-09-20   3   8  a           1.500000                6.5
    3  2022-09-21   4   9  a           2.000000                7.0
    4  2022-09-22   5  10  a           2.500000                7.5
    5  2022-09-18   3   2  b                NaN                NaN
    6  2022-09-19   5   9  b           3.000000                2.0
    7  2022-09-20   6  10  b           4.000000                5.5
    8  2022-09-21   8  15  b           4.666667                7.0
    9  2022-09-22  11   2  b           5.500000                9.0
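
The example above keeps the dates in a column, so its default integer index has no duplicates. If the dates were used as the index instead, the same timestamp would appear once per group, and a global uniqueness check would reject the data. One possible way to adapt the check, sketched here with a hypothetical helper `index_is_unique_within_groups` (not a feature-engine function), is to test uniqueness of the index combined with the grouping columns.

```python
import pandas as pd


def index_is_unique_within_groups(X: pd.DataFrame, group_by) -> bool:
    """Return True if no index value repeats inside any single group."""
    keys = [group_by] if isinstance(group_by, (str, int)) else list(group_by)
    # Append the grouping columns to the index and test the combined keys.
    return bool(X.set_index(keys, append=True).index.is_unique)


dates = pd.to_datetime(["2022-09-18", "2022-09-19", "2022-09-20"] * 2)
X = pd.DataFrame(
    {"x1": [1, 2, 3, 3, 5, 6], "x3": ["a", "a", "a", "b", "b", "b"]},
    index=dates,
)

print(X.index.is_unique)                       # dates repeat across groups
print(index_is_unique_within_groups(X, "x3"))  # but not within a group
```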

# check if input group by variables is in input dataframe variables.
if isinstance(self.group_by_variables, list):
diff = set(self.group_by_variables).difference(X.columns.tolist())
if len(diff) != 0:
raise ValueError(f"{list(diff)} not exist in dataframe")
else:
if self.group_by_variables not in X.columns.tolist():
raise ValueError(
f"{self.group_by_variables} not exists in dataframe"
)

self._get_feature_names_in(X)

return self
100 changes: 78 additions & 22 deletions feature_engine/timeseries/forecasting/expanding_window_features.py
@@ -3,27 +3,20 @@

from __future__ import annotations

-from typing import List
+from typing import List, Optional, Union

import pandas as pd

from feature_engine._docstrings.fit_attributes import (
_feature_names_in_docstring,
_n_features_in_docstring,
)
_feature_names_in_docstring, _n_features_in_docstring)
from feature_engine._docstrings.init_parameters.all_trasnformers import (
_drop_original_docstring,
_missing_values_docstring,
_variables_numerical_docstring,
)
from feature_engine._docstrings.methods import (
_fit_not_learn_docstring,
_fit_transform_docstring,
)
_drop_original_docstring, _missing_values_docstring,
_variables_numerical_docstring)
from feature_engine._docstrings.methods import (_fit_not_learn_docstring,
_fit_transform_docstring)
from feature_engine._docstrings.substitute import Substitution
from feature_engine.timeseries.forecasting.base_forecast_transformers import (
BaseForecastTransformer,
)
from feature_engine.timeseries.forecasting.base_forecast_transformers import \
BaseForecastTransformer


@Substitution(
@@ -139,6 +132,36 @@ class ExpandingWindowFeatures(BaseForecastTransformer):
2 2022-09-20 3 8 1.5 6.5
3 2022-09-21 4 9 2.0 7.0
4 2022-09-22 5 10 2.5 7.5
create expanding window features based on other variables.
Collaborator

The example in the class docstring is just meant for the user to "copy and paste" a simple example, not a full-blown demo. For that we have the user guide. Could we please keep the original example?

>>> import pandas as pd
>>> from feature_engine.timeseries.forecasting import ExpandingWindowFeatures
>>> X = pd.DataFrame(dict(date = ["2022-09-18",
>>> "2022-09-19",
>>> "2022-09-20",
>>> "2022-09-21",
>>> "2022-09-22",
>>> "2022-09-18",
>>> "2022-09-19",
>>> "2022-09-20",
>>> "2022-09-21",
>>> "2022-09-22"],
>>> x1 = [1,2,3,4,5, 3,5,6,8,11],
>>> x2 = [6,7,8,9,10, 2,9,10,15,2],
>>> x3=['a','a','a','a','a', 'b','b','b','b','b']
>>> ))
>>> ewf = ExpandingWindowFeatures(group_by_variables='x3')
>>> ewf.fit_transform(X)
date x1 x2 x3 x1_expanding_mean x2_expanding_mean
0 2022-09-18 1 6 a NaN NaN
1 2022-09-19 2 7 a 1.000000 6.0
2 2022-09-20 3 8 a 1.500000 6.5
3 2022-09-21 4 9 a 2.000000 7.0
4 2022-09-22 5 10 a 2.500000 7.5
5 2022-09-18 3 2 b NaN NaN
6 2022-09-19 5 9 b 3.000000 2.0
7 2022-09-20 6 10 b 4.000000 5.5
8 2022-09-21 8 15 b 4.666667 7.0
9 2022-09-22 11 2 b 5.500000 9.0
"""

def __init__(
@@ -151,6 +174,7 @@ def __init__(
sort_index: bool = True,
missing_values: str = "raise",
drop_original: bool = False,
group_by_variables: Optional[Union[str, List[str]]] = None,
) -> None:

if not isinstance(functions, (str, list)) or not all(
@@ -168,7 +192,7 @@ def __init__(
f"periods must be a non-negative integer. Got {periods} instead."
)

-super().__init__(variables, missing_values, drop_original)
+super().__init__(variables, missing_values, drop_original, group_by_variables)

self.min_periods = min_periods
self.functions = functions
@@ -193,12 +217,17 @@ def transform(self, X: pd.DataFrame) -> pd.DataFrame:
# Common dataframe checks and setting up.
X = self._check_transform_input_and_state(X)

tmp = (
X[self.variables_]
.expanding(min_periods=self.min_periods)
.agg(self.functions)
.shift(periods=self.periods, freq=self.freq)
)
if self.group_by_variables:
tmp = self._agg_expanding_window_features(
grouped_df=X.groupby(self.group_by_variables)
)
else:
tmp = (
X[self.variables_]
.expanding(min_periods=self.min_periods)
.agg(self.functions)
.shift(periods=self.periods, freq=self.freq)
)

tmp.columns = self._get_new_features_name()

@@ -224,3 +253,30 @@ def _get_new_features_name(self) -> List:
]

return feature_names

def _agg_expanding_window_features(
self,
grouped_df: pd.core.groupby.generic.DataFrameGroupBy,
) -> Union[pd.Series, pd.DataFrame]:
"""generate expanding window features based on groups
Parameters
----------
grouped_df : pd.core.groupby.generic.DataFrameGroupBy
dataframe of groups

Returns
-------
Union[pd.Series, pd.DataFrame]
returned expanding window features
"""
tmp_data = []
for _, group in grouped_df:
Collaborator

Why do we need to loop?

Are we creating a grouped df for every variable passed to group_by_variables?

And is this the desired functionality? For time series forecasting, would we not have all the time series in one column, and then group by one or more variables that identify each time series, without creating many groups?

When would we need to create many groups?

Author

Let me explain what I am trying to do here. The reason for adding group_by_variables to the time series transformers is issue #668: we need to create lag, rolling window, or expanding window features for each of a set of groups.
The code above loops over the groups, creates the features for each group, concatenates the results, and sorts by index to return the dataframe to its original order.
The following code illustrates this:

import pandas as pd

X = pd.DataFrame(dict(date = ["2022-09-18",
                             "2022-09-19",
                             "2022-09-20",
                             "2022-09-21",
                             "2022-09-22",
                             "2022-09-18",
                             "2022-09-19",
                             "2022-09-20",
                             "2022-09-21",
                             "2022-09-22"],
                     x1 = [1,2,3,4,5, 3,5,6,8,11],
                     x2 = [6,7,8,9,10, 2,9,10,15,2],
                     x3=['a','a','a','a','a', 'b','b','b','b','b'],
                     x4=['c','c','c','w','w','c','c','w','w','w']
))

X_grouped = X.groupby(['x3', 'x4'])
for _, group in X_grouped:
    print(group)

The result is one dataframe per ('x3', 'x4') group:

         date  x1  x2 x3 x4
0  2022-09-18   1   6  a  c
1  2022-09-19   2   7  a  c
2  2022-09-20   3   8  a  c
         date  x1  x2 x3 x4
3  2022-09-21   4   9  a  w
4  2022-09-22   5  10  a  w
         date  x1  x2 x3 x4
5  2022-09-18   3   2  b  c
6  2022-09-19   5   9  b  c
         date  x1  x2 x3 x4
7  2022-09-20   6  10  b  w
8  2022-09-21   8  15  b  w
9  2022-09-22  11   2  b  w

Collaborator

I see. Thank you for the explanation. Pandas should apply shift, rolling and expanding to the groups out of the box; there is no need to loop, as far as I understand. See, for example, this resource: https://www.statology.org/pandas-lag-by-group/
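
As a small illustration of the point about lags, the sketch below uses plain pandas with made-up data and an illustrative `_lag_1` suffix (an assumption, not the PR's code). `groupby(...).shift()` shifts within each group and returns a result aligned to the original index, so no explicit loop or re-sorting is needed.

```python
import pandas as pd

X = pd.DataFrame(
    {
        "x1": [1, 2, 3, 3, 5, 6],
        "x3": ["a", "a", "a", "b", "b", "b"],
    }
)

# Lag x1 by one row within each x3 group; the first row of every group is NaN.
lags = X.groupby("x3")[["x1"]].shift(periods=1).add_suffix("_lag_1")

print(pd.concat([X, lags], axis=1))
```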

tmp = (
Collaborator

I don't think we need to loop over each group. Pandas does that under the hood if I recall correctly. So we'd just add groupby before .expanding. Check these resources:

https://www.statology.org/pandas-lag-by-group/
https://stackoverflow.com/questions/37231844/pandas-creating-a-lagged-column-with-grouped-data
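
A loop-free version of the grouped expanding computation might look like the sketch below. It uses plain pandas on the example data from this thread, with `mean` standing in for `self.functions` and a one-row shift standing in for `self.periods`; it is an illustration of the idea, not the code that was merged. The grouped expanding result is indexed by (group key, original index), so the shift is applied per group on level 0 before the original index is restored.

```python
import pandas as pd

X = pd.DataFrame(
    {
        "x1": [1, 2, 3, 4, 5, 3, 5, 6, 8, 11],
        "x2": [6, 7, 8, 9, 10, 2, 9, 10, 15, 2],
        "x3": ["a"] * 5 + ["b"] * 5,
    }
)

expanding_mean = (
    X.groupby("x3")[["x1", "x2"]]
    .expanding(min_periods=1)
    .mean()                 # index is now (x3, original index)
    .groupby(level=0)       # shift within each group, not across groups
    .shift(periods=1)
    .droplevel(0)           # drop the group key level
    .reindex(X.index)       # restore the original row order
    .add_suffix("_expanding_mean")
)

print(pd.concat([X, expanding_mean], axis=1))
```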

Author

I found a simple way to perform the group_by operation to calculate expanding window features using the .apply() method in pandas.
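
The exact `.apply()` call is not shown in this part of the diff. A guess at what such an approach could look like, in plain pandas on the thread's example data, is sketched below; the helper name `expanding_mean_lagged` is hypothetical, and `mean` with a one-row shift stands in for the transformer's `functions` and `periods` parameters.

```python
import pandas as pd

X = pd.DataFrame(
    {
        "x1": [1, 2, 3, 4, 5, 3, 5, 6, 8, 11],
        "x2": [6, 7, 8, 9, 10, 2, 9, 10, 15, 2],
        "x3": ["a"] * 5 + ["b"] * 5,
    }
)


def expanding_mean_lagged(group: pd.DataFrame) -> pd.DataFrame:
    """Expanding mean of one group, lagged by one row."""
    return group.expanding(min_periods=1).mean().shift(periods=1)


features = (
    # group_keys=False keeps the original index instead of adding the group key.
    X.groupby("x3", group_keys=False)[["x1", "x2"]]
    .apply(expanding_mean_lagged)
    .reindex(X.index)
)

print(features)
```

Compared with chaining `groupby(...).expanding()`, `.apply()` runs a Python function once per group, which is simpler to read but can be slower when there are many groups.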

group[self.variables_]
.expanding(min_periods=self.min_periods)
.agg(self.functions)
.shift(periods=self.periods, freq=self.freq)
)
tmp_data.append(tmp)
tmp = pd.concat(tmp_data).sort_index()
return tmp