TargetMeanDiscretiser: sorts variables in bins and replaces bins by target mean value #419
Open
Morgan-Sell wants to merge 29 commits into feature-engine:main from Morgan-Sell:target_mean_discretiser
29 commits
a2b0c9c initial commit (Morgan-Sell)
e517953 create fit() (Morgan-Sell)
c823f7c update init() (Morgan-Sell)
20b902a expand init() and fit() functionality (Morgan-Sell)
6403cf8 add functionality to fit() (Morgan-Sell)
9a0d662 create _make_discretiser() (Morgan-Sell)
de4ae94 create _make_pipeline (Morgan-Sell)
b6fac50 expand fit() (Morgan-Sell)
a2360a6 remove ArbitraryDiscretiser and correspdoning attributes (Morgan-Sell)
bf2fc62 update fit() (Morgan-Sell)
23baacb update fit() (Morgan-Sell)
8250646 update transform() and _encode_X() (Morgan-Sell)
0ac284c add TargetMeanDiscretiser to test_check_estimator_discretisers.py (Morgan-Sell)
265fd08 create test_target_mean_discretiser.py includes initial test (Morgan-Sell)
f576e3d update unit tests (Morgan-Sell)
86cbbf5 edit docstring (Morgan-Sell)
20317ee add tests (Morgan-Sell)
f676127 update fit() (Morgan-Sell)
c6372ba (1) add _make_pipeline(); and (2) update fit() and transform() (Morgan-Sell)
d843d0e fix style error (Morgan-Sell)
138b201 create unit test and fix bugs (Morgan-Sell)
5a229d4 create test_equal_width_strategy (Morgan-Sell)
82f5acc fix errors (Morgan-Sell)
ddd56e5 create rst file (Morgan-Sell)
0a923a6 start user guide w/ demo (Morgan-Sell)
d278203 fix style error (Morgan-Sell)
1a83491 update docs/index.rst (Morgan-Sell)
8d7de98 update api_doc/discretisation/index.rst (Morgan-Sell)
cddf873 fix errors (Morgan-Sell)
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
TargetMeanDiscretiser | ||
===================== | ||
|
||
.. autoclass:: feature_engine.discretisation.TargetMeanDiscretiser | ||
:members: |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,84 @@ | ||
.. _target_mean_discretiser: | ||
|
||
.. currentmodule:: feature_engine.discretisation | ||
|
||
TargetMeanDiscretiser | ||
===================== | ||
|
||
The :class:`TargetMeanDiscretiser()` sorts numerical variables and organizes the values into bins | ||
using either :class:`EqualFrequencyDiscretiser()` or :class:`EqualWidthDiscretiser()`. Once the numerical | ||
variables are separated into bins, :class:`MeanEncoder()` replaces categories with the mean of the | ||
target per bin interval. The number of bins is determined by the user. | ||
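As a rough illustration of the two steps described above, the following sketch reproduces the idea with plain pandas rather than with the feature-engine classes. The toy data, the column name `x`, and the choice of 4 bins are assumptions made for this example; `pd.qcut` stands in for equal-frequency binning and `pd.cut` for equal-width binning.

```python
import numpy as np
import pandas as pd

# Toy data (assumed for illustration): one numerical feature and a target.
rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.uniform(0, 100, size=200)})
y = pd.Series(0.05 * df["x"] + rng.normal(0, 1, size=200), name="target")

n_bins = 4

# Step 1: sort the values into bins.
equal_freq_bins = pd.qcut(df["x"], q=n_bins)     # similar number of rows per bin
equal_width_bins = pd.cut(df["x"], bins=n_bins)  # bins of the same width

# Step 2: compute the mean of the target per bin.
means_per_bin = y.groupby(equal_freq_bins, observed=True).mean()

# Step 3: give every row the mean target of the bin it falls into.
x_encoded = y.groupby(equal_freq_bins, observed=True).transform("mean")

print(means_per_bin)
print(equal_width_bins.value_counts(sort=False))
print(x_encoded.head())
```

After encoding, the column holds at most `n_bins` distinct values, one per bin.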
|
||
Let's look at an example using the California Housing Dataset. | ||
|
||
First, let's load the data and separate it into train and test: | ||
|
||
.. code:: python | ||
|
||
import numpy as np | ||
import pandas as pd | ||
import matplotlib.pyplot as plt | ||
from sklearn.datasets import fetch_california_housing | ||
from sklearn.model_selection import train_test_split | ||
|
||
from feature_engine.discretisation import TargetMeanDiscretiser | ||
|
||
# Load dataset | ||
california_dataset = fetch_california_housing() | ||
data = pd.DataFrame(california_dataset.data, columns=california_dataset.feature_names) | ||
|
||
# Separate into train and test sets | ||
X_train, X_test, y_train, y_test = train_test_split( | ||
data, california_dataset["target"], test_size=0.3, | ||
random_state=0) | ||
|
||
Now, we set up the :class:`TargetMeanDiscretiser()` to discretise only the 3 indicated variables | ||
with the :class:`EqualFrequencyDiscretiser()` and to replace each bin with the mean of the target: | ||
|
||
.. code:: python | ||
|
||
# set up the discretisation transformer | ||
disc = TargetMeanDiscretiser(variables=["HouseAge", "AveRooms", "Population"], | ||
strategy="equal_frequency", | ||
bins=5) | ||
|
||
# fit the transformer | ||
disc.fit(X_train, y_train) | ||
|
||
With `fit()`, the transformer learns the boundaries of each interval and sorts the training | ||
values into those intervals. It then learns the mean value of the target for each interval, | ||
which is stored in the `encoder_dict_` attribute: | ||
|
||
.. code:: python | ||
|
||
disc._pipeline["encoder"].encoder_dict_ | ||
|
||
The `encoder_dict_` contains the mean value of the target per bin interval, per variable. | ||
So we can easily use this dictionary to map each discretised bin to its target mean. | ||
|
||
.. code:: python | ||
|
||
{'HouseAge': {Interval(-inf, 17.0, closed='right'): 2.0806529160739684, | ||
Interval(17.0, 25.0, closed='right'): 2.097539197771588, | ||
Interval(25.0, 33.0, closed='right'): 2.0686614742967993, | ||
Interval(33.0, 40.0, closed='right'): 2.1031412685185185, | ||
Interval(40.0, inf, closed='right'): 2.0266248845381525}, | ||
'AveRooms': {Interval(-inf, 4.281, closed='right'): 2.0751556984478934, | ||
Interval(4.281, 4.94, closed='right'): 2.0353196247563354, | ||
Interval(4.94, 5.524, closed='right'): 2.122038111675127, | ||
Interval(5.524, 6.258, closed='right'): 2.0422810965372507, | ||
Interval(6.258, inf, closed='right'): 2.103166361757106}, | ||
'Population': {Interval(-inf, 709.0, closed='right'): 2.0853869883779685, | ||
Interval(709.0, 1004.0, closed='right'): 2.0658340239808153, | ||
Interval(1004.0, 1346.0, closed='right'): 2.0712619255907487, | ||
Interval(1346.0, 1905.0, closed='right'): 2.0454417591204397, | ||
Interval(1905.0, inf, closed='right'): 2.108366283914729}} | ||
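The keys of this dictionary are `pandas.Interval` objects, so a single value can be looked up by testing which interval contains it. The snippet below is a sketch using the `HouseAge` entries shown above; the helper `lookup_bin_mean` is a hypothetical name introduced for this example and is not part of feature-engine.

```python
import numpy as np
import pandas as pd

# Mapping copied from the `HouseAge` entry of the output above.
house_age_means = {
    pd.Interval(-np.inf, 17.0, closed="right"): 2.0806529160739684,
    pd.Interval(17.0, 25.0, closed="right"): 2.097539197771588,
    pd.Interval(25.0, 33.0, closed="right"): 2.0686614742967993,
    pd.Interval(33.0, 40.0, closed="right"): 2.1031412685185185,
    pd.Interval(40.0, np.inf, closed="right"): 2.0266248845381525,
}


def lookup_bin_mean(value, mapping):
    """Return the mean target of the interval that contains `value`."""
    for interval, mean in mapping.items():
        if value in interval:
            return mean
    raise ValueError(f"{value} does not fall in any interval.")


# Intervals are closed on the right, so 17.0 belongs to the first bin.
print(lookup_bin_mean(17.0, house_age_means))
# Values above 40 fall in the last, open-ended bin.
print(lookup_bin_mean(52.0, house_age_means))
```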
|
||
We can now replace the original values with the target mean of their bin: | ||
|
||
.. code:: python | ||
|
||
# transform the data | ||
train_t = disc.transform(X_train) | ||
test_t = disc.transform(X_test) |
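Because the first and last intervals extend to minus and plus infinity, values in the test set that lie outside the range seen during training still fall into a bin. The following pandas-only sketch illustrates that behaviour; it does not use the feature-engine classes, and the toy series and variable names are assumptions made for this example.

```python
import numpy as np
import pandas as pd

# Toy training data (assumed for illustration).
x_train = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y_train = pd.Series([10.0, 12.0, 20.0, 22.0, 30.0, 32.0, 40.0, 42.0])

# Learn equal-frequency boundaries on the training set, then open up the
# outer edges to mirror the (-inf, ...] and (..., inf] intervals above.
_, learned_edges = pd.qcut(x_train, q=4, retbins=True)
edges = np.concatenate(([-np.inf], learned_edges[1:-1], [np.inf]))

# Learn the mean of the target per bin index (0, 1, 2, 3) on the training set.
train_codes = pd.cut(x_train, bins=edges, labels=False)
bin_means = y_train.groupby(train_codes).mean()

# Apply the learned boundaries and means to new data, including values
# below and above the training range.
x_new = pd.Series([-5.0, 3.5, 100.0])
new_codes = pd.cut(x_new, bins=edges, labels=False)
x_new_encoded = new_codes.map(bin_means)

print(x_new_encoded.tolist())  # [11.0, 21.0, 41.0]
```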
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,225 @@ | ||
from typing import List, Union | ||
|
||
import pandas as pd | ||
from sklearn.pipeline import Pipeline | ||
from sklearn.utils.validation import check_is_fitted | ||
|
||
from feature_engine._docstrings.class_inputs import _variables_numerical_docstring | ||
from feature_engine._docstrings.fit_attributes import ( | ||
_feature_names_in_docstring, | ||
_n_features_in_docstring, | ||
_variables_attribute_docstring, | ||
) | ||
from feature_engine._docstrings.methods import ( | ||
_fit_not_learn_docstring, | ||
_fit_transform_docstring, | ||
) | ||
from feature_engine._docstrings.substitute import Substitution | ||
from feature_engine.dataframe_checks import ( | ||
_check_contains_inf, | ||
_check_contains_na, | ||
_check_X_matches_training_df, | ||
check_X, | ||
check_X_y, | ||
) | ||
from feature_engine.discretisation import ( | ||
EqualFrequencyDiscretiser, | ||
EqualWidthDiscretiser, | ||
) | ||
from feature_engine.discretisation.base_discretiser import BaseDiscretiser | ||
from feature_engine.encoding import MeanEncoder | ||
from feature_engine.variable_manipulation import ( | ||
_check_input_parameter_variables, | ||
_find_or_check_numerical_variables, | ||
) | ||
|
||
|
||
@Substitution( | ||
return_objects=BaseDiscretiser._return_object_docstring, | ||
return_boundaries=BaseDiscretiser._return_boundaries_docstring, | ||
binner_dict_=BaseDiscretiser._binner_dict_docstring, | ||
transform=BaseDiscretiser._transform_docstring, | ||
variables=_variables_numerical_docstring, | ||
variables_=_variables_attribute_docstring, | ||
feature_names_in_=_feature_names_in_docstring, | ||
n_features_in_=_n_features_in_docstring, | ||
fit=_fit_not_learn_docstring, | ||
fit_transform=_fit_transform_docstring, | ||
) | ||
class TargetMeanDiscretiser(BaseDiscretiser): | ||
""" | ||
The TargetMeanDiscretiser() sorts numerical variables into equal-frequency or | ||
equal-width bins and replaces each bin with the mean value of the target. | ||
|
||
Parameters | ||
---------- | ||
strategy: str, default='equal_frequency' | ||
Whether the bins should be of equal width ('equal_width') or equal frequency | ||
('equal_frequency'). | ||
|
||
{variables} | ||
|
||
bins: int, default=10 | ||
Desired number of equal-width or equal-frequency intervals / bins. | ||
|
||
errors: string, default='ignore' | ||
Indicates what to do when a value is outside the limits indicated in the | ||
'binning_dict'. If 'raise', the transformation will raise an error. | ||
If 'ignore', values outside the limits are returned as NaN | ||
and a warning will be raised instead. | ||
|
||
Attributes | ||
---------- | ||
{variables_} | ||
|
||
{binner_dict_} | ||
|
||
{feature_names_in_} | ||
|
||
{n_features_in_} | ||
|
||
Methods | ||
------- | ||
{fit} | ||
|
||
{fit_transform} | ||
|
||
{transform} | ||
|
||
See Also | ||
-------- | ||
pandas.cut | ||
""" | ||
|
||
def __init__( | ||
self, | ||
variables: Union[None, int, str, List[Union[str, int]]] = None, | ||
bins: int = 10, | ||
strategy: str = "equal_frequency", | ||
errors: str = "ignore", | ||
) -> None: | ||
|
||
if not isinstance(bins, int): | ||
raise ValueError( | ||
f"bins must be an integer. Got {bins} instead." | ||
) | ||
if strategy not in ("equal_frequency", "equal_width"): | ||
raise ValueError( | ||
"strategy must equal 'equal_frequency' or 'equal_width'. " | ||
f"Got {strategy} instead." | ||
) | ||
|
||
if errors not in ("ignore", "raise"): | ||
raise ValueError( | ||
"errors only takes values 'ignore' and 'raise'. " | ||
f"Got {errors} instead." | ||
) | ||
|
||
self.variables = _check_input_parameter_variables(variables) | ||
self.bins = bins | ||
self.strategy = strategy | ||
self.errors = errors | ||
|
||
def fit(self, X: pd.DataFrame, y: pd.Series): | ||
""" | ||
Learn the boundaries of the selected discretiser's intervals / bins | ||
for the chosen numerical variables. | ||
|
||
Parameters | ||
---------- | ||
X: pandas dataframe of shape = [n_samples, n_features] | ||
The training dataset. Can be the entire dataframe, not just the | ||
variables to be transformed. | ||
|
||
y : pandas series of shape = [n_samples,] | ||
The target variable. Required to learn the mean target value per bin. | ||
""" | ||
# check if 'X' is a dataframe | ||
X, y = check_X_y(X, y) | ||
|
||
# identify numerical variables | ||
self.variables_ = _find_or_check_numerical_variables( | ||
X, self.variables | ||
) | ||
|
||
# check for missing values | ||
_check_contains_na(X, self.variables_) | ||
|
||
# check for inf | ||
_check_contains_inf(X, self.variables_) | ||
|
||
# instantiate pipeline | ||
self._pipeline = self._make_pipeline() | ||
self._pipeline.fit(X, y) | ||
|
||
# store input features | ||
self.n_features_in_ = X.shape[1] | ||
self.feature_names_in_ = list(X.columns) | ||
|
||
return self | ||
|
||
def transform(self, X: pd.DataFrame) -> pd.DataFrame: | ||
""" | ||
Replace the original values with the mean value of the target per bin | ||
for each of the variables. | ||
|
||
Parameters | ||
---------- | ||
X: pandas dataframe of shape = [n_samples, n_features] | ||
The data to transform. | ||
|
||
Returns | ||
------- | ||
X_enc: pandas dataframe of shape = [n_samples, n_features] | ||
The transformed data with the means of the selected numerical variables. | ||
|
||
""" | ||
# check that fit method has been called | ||
check_is_fitted(self) | ||
|
||
# check that input is a dataframe | ||
X = check_X(X) | ||
|
||
# check that input data contains the same number of columns as the fitted df | ||
_check_X_matches_training_df(X, self.n_features_in_) | ||
|
||
# check for missing values | ||
_check_contains_na(X, self.variables_) | ||
|
||
# check for infinite values | ||
_check_contains_inf(X, self.variables_) | ||
|
||
# discretise and encode | ||
X_tr = self._pipeline.transform(X) | ||
|
||
return X_tr | ||
|
||
def _make_discretiser(self): | ||
""" | ||
Instantiate the EqualFrequencyDiscretiser or EqualWidthDiscretiser. | ||
""" | ||
if self.strategy == "equal_frequency": | ||
discretiser = EqualFrequencyDiscretiser( | ||
q=self.bins, | ||
variables=self.variables_, | ||
return_boundaries=True, | ||
return_object=True, | ||
) | ||
else: | ||
discretiser = EqualWidthDiscretiser( | ||
bins=self.bins, | ||
variables=self.variables_, | ||
return_boundaries=True, | ||
return_object=True, | ||
) | ||
|
||
return discretiser | ||
|
||
def _make_pipeline(self): | ||
""" | ||
Instantiate pipeline comprised of discretiser and encoder. | ||
""" | ||
pipe = Pipeline([ | ||
("discretiser", self._make_discretiser()), | ||
("encoder", MeanEncoder(variables=self.variables_)) | ||
]) | ||
|
||
return pipe |