Commit a1d87da

solegalli and ttungl authored
Target mean selection closes #170 (#181)
* created target mean encoding
* created test target mean encoding
* update target mean encoding feature selector
* update target mean encoding feature selector (2)
* update target mean encoding feature selector (2)
* update target mean encoding feature selector (3)
* update target mean encoding feature selector (3)
* update target mean encoding feature selector (3)
* refactor select by target mean
* rename feature shuffling selector
* fix docstrings mean_target_selection
* add cv, fixes bugs, adds 1 test
* finish select with target mean
* docs for select with target mean encoding
* add contributor to changelog

Co-authored-by: ttungl <[email protected]>
1 parent b1f4659 commit a1d87da

14 files changed: 635 additions and 33 deletions

README.md

Lines changed: 4 additions & 2 deletions
@@ -99,6 +99,8 @@ More resources will be added as they appear online!
 * DropDuplicateFeatures
 * DropCorrelatedFeatures
 * ShuffleFeaturesSelector
+* SelectBySingleFeaturePerformance
+* SelectByTargetMeanPerformance
 * RecursiveFeatureElimination


@@ -125,7 +127,7 @@ git clone https://github.com/solegalli/feature_engine.git
 ### Usage

 ```python
->>> from feature_engine.categorical_encoders import RareLabelCategoricalEncoder
+>>> from feature_engine.encoding import RareLabelEncoder
 >>> import pandas as pd

 >>> data = {'var_A': ['A'] * 10 + ['B'] * 10 + ['C'] * 2 + ['D'] * 1}
@@ -143,7 +145,7 @@ Name: var_A, dtype: int64
 ```

 ```python
->>> rare_encoder = RareLabelCategoricalEncoder(tol=0.10, n_categories=3)
+>>> rare_encoder = RareLabelEncoder(tol=0.10, n_categories=3)
 >>> data_encoded = rare_encoder.fit_transform(data)
 >>> data_encoded['var_A'].value_counts()
 ```
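
For readers unfamiliar with the encoder touched by this README change, the following is a minimal pure-Python sketch of rare-label grouping on the same toy data. It is not the feature_engine implementation. The helper `group_rare_labels` and its `replace_with` argument are hypothetical. The sketch assumes that categories with a relative frequency below `tol` are grouped, and only when the variable has more than `n_categories` distinct values.

```python
from collections import Counter


def group_rare_labels(values, tol=0.10, n_categories=3, replace_with="Rare"):
    """Hypothetical helper: group infrequent categories under one label."""
    counts = Counter(values)
    # Assumption: variables with few distinct categories are left untouched.
    if len(counts) <= n_categories:
        return list(values)
    total = len(values)
    # Keep every category whose relative frequency reaches the tolerance.
    frequent = {cat for cat, n in counts.items() if n / total >= tol}
    return [v if v in frequent else replace_with for v in values]


var_a = ["A"] * 10 + ["B"] * 10 + ["C"] * 2 + ["D"] * 1
encoded = group_rare_labels(var_a)
print(Counter(encoded))  # A: 10, B: 10, Rare: 3
```

With 23 observations, `C` (2/23) and `D` (1/23) both fall below the 0.10 tolerance, so under these assumptions they are merged into a single label.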

docs/changelog.rst

Lines changed: 4 additions & 1 deletion
@@ -12,6 +12,7 @@ Contributors:
 - Sana Ben Driss
 - Nicolas Galli
 - Tejash Shah
+- Tung Lee
 - Soledad Galli


@@ -52,13 +53,15 @@ We renamed a few parameters to unify the nomenclature across the Package.
 - **DropDuplicateFeatures**: DropDuplicateFeatures finds and removes duplicated features from a dataset (**by Tejash Shah and Soledad Galli**)
 - **DropCorrelatedFeatures**: DropCorrelatedFeatures finds and removes features that are correlated (**by Nicolas Galli**)
 - **ShuffleFeaturesSelector**: ShuffleFeaturesSelector selects features by determining the drop in machine learning model performance when each feature's values are randomly shuffled from a dataframe (**by Sana Ben Driss**)
+- **SelectBySingleFeaturePerformance**: SelectBySingleFeaturePerformance trains a model based of each individual features, and derives performance (**by Nicolas Galli**)
+- **SelectByTargetMeanPerformance**: SelectByTargetMeanPerformance selects features encoding the categories with the target mean and using that as proxy for performance (**by Tung Lee and Soledad Galli**)
 - **RecursiveFeatureElimination**: RecursiveFeatureElimination selects features recursively, evaluating the drop in ML performance, from the least to the important feature (**by Sana Ben Driss**)

 **Code Architecture - Important for Contributors and Developers**:
 - **Submodules**: transformers have been grouped within relevant submodules and modules.
 - **Individual tests**: testing classes have been subdivided into individual tests
 - **Code Style**: we adopted the use of flake8 for linting and PEP8 style checks, and black for automatic re-styling of code.
-- **Type hint**: we are slowly rolling out the use of type hint throughout Feature-engine classes and functions (**by Nodar Okroshiashvili**)
+- **Type hint**: rolled out the use of type hint throughout Feature-engine classes and functions (**by Nodar Okroshiashvili, Soledad Galli and Chris Samiullah**)

 **Other Changes**:
 - **Updated documentation**: documentation reflects the current use of Feature-engine transformers

docs/index.rst

Lines changed: 1 addition & 0 deletions
@@ -177,6 +177,7 @@ Feature Selection:
 - :doc:`selection/DropCorrelatedFeatures`: drops correlated variables from a dataframe
 - :doc:`selection/ShuffleFeaturesSelector`: selects features by evaluating model performance after feature shuffling
 - :doc:`selection/SelectBySingleFeaturePerformance`: selects features based on their performance on univariate estimators
+- :doc:`selection/SelectByTargetMeanPerformance`: selects features based on target mean encoding performance
 - :doc:`selection/RecursiveFeatureElimination`: selects features recursively, by evaluating model performance

 Getting Help
Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
+SelectByTargetMeanPerformance
+=============================
+
+The SelectByTargetMeanPerformance()selects features based on the performance of
+machine learning models trained using individual features. In other words, selects
+features based on their individual performance, returned by estimators trained on
+only that particular feature.
+
+API Reference
+-------------
+
+.. autoclass:: feature_engine.selection.SelectByTargetMeanPerformance
+:members:
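
The changelog entry in this commit describes the selector as encoding categories with the target mean and using that as a proxy for performance. The sketch below illustrates that idea in pure Python and is not the code added by this commit. All function names are hypothetical. It assumes a binary target, a single hold-out split instead of cross-validation, and ROC-AUC as the metric.

```python
from collections import defaultdict


def target_mean_map(categories, target):
    """Mean of the target for each category seen in the training data."""
    sums, counts = defaultdict(float), defaultdict(int)
    for cat, t in zip(categories, target):
        sums[cat] += t
        counts[cat] += 1
    return {cat: sums[cat] / counts[cat] for cat in sums}


def roc_auc(y_true, scores):
    """Pairwise ROC-AUC: share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def select_by_target_mean(train, y_train, test, y_test, threshold=0.6):
    """Keep features whose target-mean encoding scores at least `threshold`."""
    prior = sum(y_train) / len(y_train)  # fallback for unseen categories
    selected = []
    for name, values in train.items():
        mapping = target_mean_map(values, y_train)
        # The encoded value itself is used as the prediction for each row.
        scores = [mapping.get(v, prior) for v in test[name]]
        if roc_auc(y_test, scores) >= threshold:
            selected.append(name)
    return selected


train = {"colour": ["r", "r", "r", "b", "b", "b"],
         "noise": ["x", "y", "x", "y", "x", "y"]}
y_train = [1, 1, 1, 0, 0, 0]
test = {"colour": ["r", "b", "r", "b"], "noise": ["x", "x", "y", "y"]}
y_test = [1, 0, 1, 0]

print(select_by_target_mean(train, y_train, test, y_test))  # ['colour']
```

No estimator is fitted here: the encoded value is the prediction. On the hold-out rows `colour` reaches an AUC of 1.0 while `noise` scores 0.5, so only `colour` passes the assumed 0.6 threshold.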

docs/selection/ShuffleFeaturesSelector.rst

Lines changed: 5 additions & 5 deletions
@@ -1,7 +1,7 @@
-ShuffleFeaturesSelector
-=======================
+SelectByShuffling
+=================

-The ShuffleFeaturesSelector() selects important features if permutation their values
+The SelectByShuffling() selects important features if permutation their values
 at random produces a decrease in the initial model performance. See API below for
 more details into its functionality.

@@ -21,7 +21,7 @@ more details into its functionality.
 linear_model = LinearRegression()

 # initialize feature selector
-tr = ShuffleFeaturesSelector(estimator=linear_model, scoring="r2", cv=3)
+tr = SelectByShuffling(estimator=linear_model, scoring="r2", cv=3)

 # fit transformer
 Xt = tr.fit_transform(X, y)
@@ -75,5 +75,5 @@ more details into its functionality.
 API Reference
 -------------

-.. autoclass:: feature_engine.selection.ShuffleFeaturesSelector
+.. autoclass:: feature_engine.selection.SelectByShuffling
 :members:

docs/selection/index.rst

Lines changed: 1 addition & 0 deletions
@@ -15,4 +15,5 @@ Or in other words to select subsets of variables.
 DropCorrelatedFeatures
 ShuffleFeaturesSelector
 SelectBySingleFeaturePerformance
+SelectByTargetMeanPerformance
 RecursiveFeatureElimination

feature_engine/selection/__init__.py

Lines changed: 5 additions & 3 deletions
@@ -5,16 +5,18 @@
 from .drop_constant_features import DropConstantFeatures
 from .drop_duplicate_features import DropDuplicateFeatures
 from .drop_correlated_features import DropCorrelatedFeatures
-from .shuffle_features import ShuffleFeaturesSelector
-from .single_feature_performance_selection import SelectBySingleFeaturePerformance
+from .shuffle_features import SelectByShuffling
+from .single_feature_performance import SelectBySingleFeaturePerformance
 from .recursive_feature_elimination import RecursiveFeatureElimination
+from .target_mean_selection import SelectByTargetMeanPerformance

 __all__ = [
     "DropFeatures",
     "DropConstantFeatures",
     "DropDuplicateFeatures",
     "DropCorrelatedFeatures",
-    "ShuffleFeaturesSelector",
+    "SelectByShuffling",
     "SelectBySingleFeaturePerformance",
     "RecursiveFeatureElimination",
+    "SelectByTargetMeanPerformance",
 ]

feature_engine/selection/shuffle_features.py

Lines changed: 4 additions & 3 deletions
@@ -20,18 +20,18 @@
 Variables = Union[None, int, str, List[Union[str, int]]]


-class ShuffleFeaturesSelector(BaseEstimator, TransformerMixin):
+class SelectByShuffling(BaseEstimator, TransformerMixin):
     """

-    ShuffleFeaturesSelector selects features by determining the drop in machine learning
+    SelectByShuffling selects features by determining the drop in machine learning
     model performance when each feature's values are randomly shuffled.

     If the variables are important, a random permutation of their values will
     decrease dramatically the machine learning model performance. Contrarily, the
     permutation of the values should have little to no effect on the model performance
     metric we are assessing.

-    The ShuffleFeaturesSelector first trains a machine learning model utilising all
+    The SelectByShuffling first trains a machine learning model utilising all
     features. Next, it shuffles the values of 1 feature, obtains a prediction with the
     pre-trained model, and determines the performance drop (if any). If the drop in
     performance is bigger than a threshold then the feature is retained, otherwise
@@ -116,6 +116,7 @@ def __init__(

     def fit(self, X: pd.DataFrame, y: pd.Series):
         """
+        Finds the important features

         Args
         ----
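
The class docstring in this file describes the procedure in words: score a pre-trained model, shuffle one feature at a time, and keep the feature if the score drops by more than a threshold. The sketch below walks through those steps in pure Python and is not the body of `fit`. The helper names, the `predict` callable standing in for a fitted estimator, the R² metric and the 0.01 threshold are all assumptions made for illustration.

```python
import random


def r2_score(y_true, y_pred):
    """Coefficient of determination, written out by hand."""
    mean = sum(y_true) / len(y_true)
    ss_tot = sum((y - mean) ** 2 for y in y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    return 1 - ss_res / ss_tot


def select_by_shuffling(predict, X, y, threshold=0.01, seed=0):
    """Keep features whose shuffling lowers the score by more than `threshold`.

    `predict` is a stand-in for an already fitted model: it maps one row
    (a dict of feature name to value) to a prediction.
    """
    rng = random.Random(seed)
    names = list(X)

    def score(data):
        rows = zip(*(data[n] for n in names))
        return r2_score(y, [predict(dict(zip(names, row))) for row in rows])

    baseline = score(X)
    selected = []
    for name in names:
        shuffled = dict(X)        # shallow copy; other columns stay untouched
        column = list(X[name])
        rng.shuffle(column)       # permute this one feature only
        shuffled[name] = column
        if baseline - score(shuffled) > threshold:
            selected.append(name)
    return selected


X = {"x1": [0, 1, 2, 3, 4, 5, 6, 7], "x2": [5, 3, 8, 1, 9, 2, 7, 4]}
y = [2 * v for v in X["x1"]]


def model(row):
    """Toy pre-trained model that ignores x2 entirely."""
    return 2 * row["x1"]


selected = select_by_shuffling(model, X, y)
print(selected)
```

Because the toy model never reads `x2`, shuffling that column leaves the score unchanged and the feature is dropped. Any reordering of `x1` degrades the predictions, so it is retained.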

feature_engine/selection/single_feature_performance_selection.py renamed to feature_engine/selection/single_feature_performance.py

Lines changed: 6 additions & 6 deletions
@@ -93,9 +93,11 @@ def __init__(
         if not isinstance(threshold, (int, float)):
             raise ValueError("threshold can only be integer or float")

-        if scoring == 'roc_auc' and (threshold < 0.5 or threshold > 1):
-            raise ValueError("roc-auc score should vary between 0.5 and 1. Pick a "
-                             "threshold within this interval.")
+        if scoring == "roc_auc" and (threshold < 0.5 or threshold > 1):
+            raise ValueError(
+                "roc-auc score should vary between 0.5 and 1. Pick a "
+                "threshold within this interval."
+            )

         self.variables = _check_input_parameter_variables(variables)
         self.estimator = estimator
@@ -151,9 +153,7 @@ def fit(self, X: pd.DataFrame, y: pd.Series):

         # check we are not dropping all the columns in the df
         if len(self.selected_features_) == 0:
-            raise ValueError(
-                "No features were selected, try changing the threshold."
-            )
+            raise ValueError("No features were selected, try changing the threshold.")

         self.input_shape_ = X.shape
