
Commit f74cec8

Add match variables closes #132 (#308)
* Added SimilarColumns (Automatic handle of columns difference in test and train)
* Update feature_selection.py Added typing and input check
* Update feature_selection.py Added pandas import
* Refactoring
* Replaced BaseSelector with TransformerMixin and BaseEstimator
* Style improvement
* Improved Typing
* code review refactoring
* style
* Added drop_if_more_columns and add_if_less_columns attributes
* style
* style
* style
* style
* new way to add if less columns
* Refactoring and error handling
* style
* styler
* refactoring of tests
* Removed optional parameters
* Added skip of check_transformer_general
* style
* style
* style
* Improved typing
* handling of code review comments
* renamed sanity_check as preprocessing
* Renamed similar_columns as match_columns
* Renamed SimilarColumns as MatchColumnsToTrainSet
* Replaced Any with Union[np.nan, int] for fill_value
* Added check for fill_value Type
* Code review change
* modifies match_columns_transformer and tests
* fix style
* updates docs
* renamed attr as per PR #298
* Added drop and addition for multiple columns
* Added test for simultaneous addition and removal of columns
* Style correction
* Improved unit test when verbose is True
* Added transformer to index.rst
* Added example for MatchColumnsToTrainSet
* Line too long correction
* Added documentation for MatchColumnsToTrainSet
* Corrected docs
* corrected docs
* expands docs
* changed transformer name in docs
* changed transformer name in jupyter
* adds class to readme
* changes class name and fixes typos
* Added preprocessing/index to toctree
* minor code move-around
* Added Preprocessing to Readme.md
* add preprocessing to tree index
* fixes link to new repo for notebooks
* Fixed docs
* Removed examples folder
* Deleted sanity_check

Co-authored-by: Soledad Galli <[email protected]>
1 parent f732631 commit f74cec8


9 files changed: +593 -0 lines changed


README.md

Lines changed: 3 additions & 0 deletions
@@ -56,6 +56,7 @@ More resources will be added as they appear online!
 * Variable Creation
 * Variable Selection
 * Scikit-learn Wrappers
+* Preprocessing

 ### Imputing Methods
 * MeanMedianImputer
@@ -115,6 +116,8 @@ More resources will be added as they appear online!
 * RecursiveFeatureElimination
 * RecursiveFeatureAddition

+### Preprocessing
+* MatchVariables

 ## Installing

docs/index.rst

Lines changed: 6 additions & 0 deletions
@@ -23,6 +23,7 @@ Feature-engine includes transformers for:
 - Outlier capping or removal
 - Variable combination
 - Variable selection
+- Preprocessing

 Feature-engine allows you to select the variables you want to transform within each
 transformer. This way, different engineering procedures can be easily applied to
@@ -166,6 +167,10 @@ Feature Selection:
 - :doc:`selection/RecursiveFeatureElimination`: selects features recursively, by evaluating model performance
 - :doc:`selection/RecursiveFeatureAddition`: selects features recursively, by evaluating model performance

+Preprocessing:
+~~~~~~~~~~~~~~
+
+- :doc:`preprocessing/MatchVariables`: ensures that columns in test set match those in train set

 Getting Help
 ------------
@@ -237,6 +242,7 @@ The `issues <https://github.com/feature-engine/feature_engine/issues/>`_ and
    outliers/index
    creation/index
    selection/index
+   preprocessing/index
    wrappers/index

 .. toctree::
Lines changed: 168 additions & 0 deletions
@@ -0,0 +1,168 @@
+MatchVariables
+==============
+
+API Reference
+-------------
+
+.. autoclass:: feature_engine.preprocessing.MatchVariables
+    :members:
+
+
+Example
+-------
+
+MatchVariables() ensures that the columns in the test set are identical to those
+in the train set.
+
+If the test set contains additional columns, they are dropped. Alternatively, if the
+test set lacks columns that were present in the train set, they will be added with a
+value determined by the user, for example np.nan.
+
+
+.. code:: python
+
+    import numpy as np
+    import pandas as pd
+
+    from feature_engine.preprocessing import MatchVariables
+
+
+    # Load dataset
+    def load_titanic():
+        data = pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl')
+        data = data.replace('?', np.nan)
+        data['cabin'] = data['cabin'].astype(str).str[0]
+        data['pclass'] = data['pclass'].astype('O')
+        data['age'] = data['age'].astype('float')
+        data['fare'] = data['fare'].astype('float')
+        data['embarked'] = data['embarked'].fillna('C')
+        data.drop(
+            labels=['name', 'ticket', 'boat', 'body', 'home.dest'],
+            axis=1, inplace=True,
+        )
+        return data
+
+    # load data as pandas dataframe
+    data = load_titanic()
+
+    # Split test and train
+    train = data.iloc[0:1000, :]
+    test = data.iloc[1000:, :]
+
+    # set up the transformer
+    match_cols = MatchVariables(missing_values="ignore")
+
+    # learn the variables in the train set
+    match_cols.fit(train)
+
+    # the transformer stores the input variables
+    match_cols.input_features_
+
+
+.. code:: python
+
+    ['pclass',
+     'survived',
+     'sex',
+     'age',
+     'sibsp',
+     'parch',
+     'fare',
+     'cabin',
+     'embarked']
+
+
+.. code:: python
+
+    # Let's drop some columns in the test set for the demo
+    test_t = test.drop(["sex", "age"], axis=1)
+
+    test_t.head()
+
+.. code:: python
+
+          pclass  survived  sibsp  parch     fare cabin embarked
+    1000       3         1      0      0   7.7500     n        Q
+    1001       3         1      2      0  23.2500     n        Q
+    1002       3         1      2      0  23.2500     n        Q
+    1003       3         1      2      0  23.2500     n        Q
+    1004       3         1      0      0   7.7875     n        Q
+
+
+.. code:: python
+
+    # the transformer adds the columns back
+    test_tt = match_cols.transform(test_t)
+
+    test_tt.head()
+
+.. code:: python
+
+    The following variables are added to the DataFrame: ['sex', 'age']
+
+          pclass  survived  sex  age  sibsp  parch     fare cabin embarked
+    1000       3         1  NaN  NaN      0      0   7.7500     n        Q
+    1001       3         1  NaN  NaN      2      0  23.2500     n        Q
+    1002       3         1  NaN  NaN      2      0  23.2500     n        Q
+    1003       3         1  NaN  NaN      2      0  23.2500     n        Q
+    1004       3         1  NaN  NaN      0      0   7.7875     n        Q
+
+
+
+Note how the missing columns were added back to the transformed test set, with
+missing values, in the position (i.e., order) in which they were in the train set.
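
The ordering and fill behaviour described in this note can be approximated in plain pandas with `DataFrame.reindex`. The sketch below is only an illustration of the idea, not feature-engine's implementation; the toy frames and their values are assumptions made up for the demo.

```python
import numpy as np
import pandas as pd

# Toy "train" and "test" frames (hypothetical data, not the Titanic set).
train = pd.DataFrame(
    {"pclass": [1, 3], "sex": ["f", "m"], "age": [29.0, 40.0], "fare": [211.3, 7.9]}
)
# The test frame lacks 'sex' and 'age' and has its columns in a different order.
test = pd.DataFrame({"fare": [7.75, 23.25], "pclass": [3, 3]})

# Reindexing on the train columns adds the missing columns with the chosen
# fill value and returns every column in the order seen in the train set.
matched = test.reindex(columns=train.columns, fill_value=np.nan)
zero_filled = test.reindex(columns=train.columns, fill_value=0)

print(list(matched.columns))  # ['pclass', 'sex', 'age', 'fare']
```

Here `fill_value` plays the role of the user-determined value mentioned earlier: `np.nan` reproduces the output shown above, while any other constant such as `0` could be used instead.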
+
+Similarly, if the test set contained additional columns, those would be removed:
+
+.. code:: python
+
+    # let's add some columns for the demo
+    test_t[['var_a', 'var_b']] = 0
+
+    test_t.head()
+
+.. code:: python
+
+          pclass  survived  sibsp  parch     fare cabin embarked  var_a  var_b
+    1000       3         1      0      0   7.7500     n        Q      0      0
+    1001       3         1      2      0  23.2500     n        Q      0      0
+    1002       3         1      2      0  23.2500     n        Q      0      0
+    1003       3         1      2      0  23.2500     n        Q      0      0
+    1004       3         1      0      0   7.7875     n        Q      0      0
+
+
+.. code:: python
+
+    test_tt = match_cols.transform(test_t)
+
+    test_tt.head()
+
+.. code:: python
+
+    The following variables are added to the DataFrame: ['age', 'sex']
+    The following variables are dropped from the DataFrame: ['var_a', 'var_b']
+
+          pclass  survived  sex  age  sibsp  parch     fare cabin embarked
+    1000       3         1  NaN  NaN      0      0   7.7500     n        Q
+    1001       3         1  NaN  NaN      2      0  23.2500     n        Q
+    1002       3         1  NaN  NaN      2      0  23.2500     n        Q
+    1003       3         1  NaN  NaN      2      0  23.2500     n        Q
+    1004       3         1  NaN  NaN      0      0   7.7875     n        Q
+
+
+Now, the transformer simultaneously added the missing columns with NA as values and
+removed the additional columns from the resulting dataset.
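
To make the simultaneous add-and-drop step concrete, here is a minimal pandas sketch that also reports which columns were added and which were dropped, similar in spirit to the messages printed above. The helper `match_to_reference` is hypothetical and written for this illustration only; it is not part of feature-engine and makes no claim about how `MatchVariables` is implemented.

```python
import numpy as np
import pandas as pd


def match_to_reference(df, reference_columns, fill_value=np.nan):
    """Hypothetical helper: align df to reference_columns and report the changes."""
    # Columns expected by the reference but absent from df.
    added = [c for c in reference_columns if c not in df.columns]
    # Columns present in df that the reference does not know about.
    dropped = [c for c in df.columns if c not in reference_columns]
    # reindex drops the extras, adds the missing ones and restores the order.
    aligned = df.reindex(columns=list(reference_columns), fill_value=fill_value)
    return aligned, added, dropped


reference = ["pclass", "sex", "age", "fare"]
scenario = pd.DataFrame({"pclass": [3], "fare": [7.75], "var_a": [0], "var_b": [0]})

aligned, added, dropped = match_to_reference(scenario, reference)
print(added)    # ['sex', 'age']
print(dropped)  # ['var_a', 'var_b']
```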
+
+This transformer is useful in "predict then optimize" problems. In such cases,
+a machine learning model is trained on a certain dataset, with certain input features.
+Then, test sets are "post-processed" according to the scenarios to be modelled.
+For example, "what would have happened if the customer had received an email campaign?",
+where the variable "receive_campaign" would be changed from 0 to 1.
+
+While creating these modelling datasets, a lot of metadata, e.g., "scenario number",
+"time scenario was generated", etc., could be added to the data. Then we need to pass
+these data over to the model to obtain the modelled prediction.
+
+MatchVariables() provides an easy and elegant way to remove the additional metadata,
+while returning datasets with the input features in the correct order, allowing the
+different scenarios to be modelled directly inside a machine learning pipeline.
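
The scenario workflow described above might look roughly like the following. This is a sketch under stated assumptions: `toy_model_predict` is a stand-in for a fitted model, the feature names and metadata columns are invented for the example, and plain `DataFrame.reindex` is used in place of `MatchVariables` so that the snippet runs with pandas alone.

```python
import pandas as pd

# Features the (hypothetical) model was trained on, in training order.
model_features = ["age", "receive_campaign"]


def toy_model_predict(X):
    """Stand-in for a fitted model: expects exactly model_features, in order."""
    assert list(X.columns) == model_features
    return (0.1 + 0.2 * X["receive_campaign"]).tolist()


customers = pd.DataFrame({"age": [34, 51], "receive_campaign": [0, 0]})

# Build a "what if everyone received the campaign" scenario and tag it
# with bookkeeping metadata that the model must never see.
scenario = customers.assign(
    receive_campaign=1, scenario_number=7, generated_at="2021-01-01"
)
# Shuffle the column order to show that it gets restored as well.
scenario = scenario[["scenario_number", "generated_at", "receive_campaign", "age"]]

# Strip the metadata and restore the training column order before predicting.
X = scenario.reindex(columns=model_features)
predictions = toy_model_predict(X)
```

Inside a pipeline, the `reindex` step is the part a column-matching transformer would take over, so the scenario frames can be passed in with their metadata still attached.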

docs/preprocessing/index.rst

Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
+.. -*- mode: rst -*-
+
+Preprocessing
+=============
+
+Feature-engine's preprocessing transformers apply general data pre-processing
+and transformation procedures.
+
+.. toctree::
+   :maxdepth: 2
+
+   MatchVariables
Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
+"""
+The module preprocessing includes classes and functions for general data pre-processing
+and transformation.
+"""
+
+from .match_columns import MatchVariables
+
+__all__ = [
+    "MatchVariables",
+]
