
Commit 31f6215

glevv and solegalli authored
Add new transformer: GeometricWidthDiscretiser (#591)
* new experimental encoder * added increasing width discretiser * Delete beta_encoder.py * added tests * small doc fix * small doc fix * lint * test fix * bug fix * format * format * added warnings * added transformer to the index * fix test * Update test_increasing_width_discretiser.py * Update base_discretiser.py * Update base_discretiser.py * Update IncreasingWidthDiscretiser.rst * Update index.rst * Update index.rst * Update index.rst * Update IncreasingWidthDiscretiser.rst * Update increasing_width.py * Update arbitrary.py * Update test_increasing_width_discretiser.py * flake * Update arbitrary.py * small refactor * tests refactor * flake * Update test_increasing_width_discretiser.py * Update test_increasing_width_discretiser.py * Update increasing_width.py * Update __init__.py * Update index.rst * Update and rename IncreasingWidthDiscretiser.rst to GeometricWidthDiscretiser.rst * Update and rename IncreasingWidthDiscretiser.rst to GeometricWidthDiscretiser.rst * Update index.rst * Update index.rst * Update test_increasing_width_discretiser.py * Update test_increasing_width_discretiser.py * Update test_increasing_width_discretiser.py * Update test_increasing_width_discretiser.py * Update test_increasing_width_discretiser.py * Update GeometricWidthDiscretiser.rst * Update GeometricWidthDiscretiser.rst * add discretiser to readme * re-words first part of user guide * final edits to user guide * rename script * isort script * tidy imports * final docstring update * sort and parametrization * Update base_discretiser.py * Update gemoetric_width.py * Update test_increasing_width_discretiser.py * Update arbitrary.py * Update base_discretiser.py * Update gemoetric_width.py * Update base_discretiser.py * Update base_discretiser.py * Update arbitrary.py * Update equal_frequency.py * Update equal_width.py * Update test_check_estimator_discretisers.py * Update base_discretiser.py * Update base_discretiser.py * typos * bugfix * flake * conflict * reformat base discretiser 
and add tests * propagate precision to all discretisers * reformat tests * fix indent * Update base_discretiser.py * Update GeometricWidthDiscretiser.rst * Update geometric_width.py * flake * expand test for precision * expand precision test * doc fix * flake * Update geometric_width.py * Update geometric_width.py * Update GeometricWidthDiscretiser.rst * Update base_discretiser.py * Update test_base_discretizer.py * Update test_geometric_width_discretiser.py * Update test_base_discretizer.py * revert * Update test_base_discretizer.py * Update test_base_discretizer.py * Update test_geometric_width_discretiser.py * Update test_geometric_width_discretiser.py --------- Co-authored-by: Soledad Galli <[email protected]>
1 parent 4dea90c commit 31f6215

File tree

17 files changed

+561
-16
lines changed

README.md

Lines changed: 1 addition & 0 deletions
@@ -89,6 +89,7 @@ transforming parameters from the data and then transform it.
8989
### Discretisation methods
9090
* EqualFrequencyDiscretiser
9191
* EqualWidthDiscretiser
92+
* GeometricWidthDiscretiser
9293
* DecisionTreeDiscretiser
9394
* ArbitraryDiscretiser
9495

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
1+
GeometricWidthDiscretiser
2+
=========================
3+
4+
.. autoclass:: feature_engine.discretisation.GeometricWidthDiscretiser
5+
:members:

docs/api_doc/discretisation/index.rst

Lines changed: 4 additions & 2 deletions
@@ -16,7 +16,8 @@ into continuous intervals.
1616
:class:`EqualFrequencyDiscretiser()` Sorts values into intervals with similar number of observations.
1717
:class:`EqualWidthDiscretiser()` Sorts values into intervals of equal size.
1818
:class:`ArbitraryDiscretiser()` Sorts values into intervals predefined by the user.
19-
:class:`DecisionTreeDiscretiser()` Replaces values by predictions of a decision tree, which are discrete
19+
:class:`DecisionTreeDiscretiser()` Replaces values by predictions of a decision tree, which are discrete.
20+
:class:`GeometricWidthDiscretiser()` Sorts variable into geometrical intervals.
2021
===================================== ========================================================================
2122

2223

@@ -28,9 +29,10 @@ into continuous intervals.
2829
EqualWidthDiscretiser
2930
ArbitraryDiscretiser
3031
DecisionTreeDiscretiser
32+
GeometricWidthDiscretiser
3133

3234
Additional transformers for discretisation
3335
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
3436

3537
For discretisation using K-means, check Scikit-learn's
36-
`KBinsDiscretizer <https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.KBinsDiscretizer.html>`_.
38+
`KBinsDiscretizer <https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.KBinsDiscretizer.html>`_.

docs/index.rst

Lines changed: 1 addition & 0 deletions
@@ -137,6 +137,7 @@ Variable Discretisation: Discretisers
137137
- :doc:`api_doc/discretisation/EqualFrequencyDiscretiser`: sorts variable into equal frequency intervals
138138
- :doc:`api_doc/discretisation/EqualWidthDiscretiser`: sorts variable into equal width intervals
139139
- :doc:`api_doc/discretisation/DecisionTreeDiscretiser`: uses decision trees to create finite variables
140+
- :doc:`api_doc/discretisation/GeometricWidthDiscretiser`: sorts variable into geometrical intervals
140141

141142
Outlier Capping or Removal
142143
--------------------------
Lines changed: 157 additions & 0 deletions
@@ -0,0 +1,157 @@
1+
.. _increasing_width_discretiser:
2+
3+
.. currentmodule:: feature_engine.discretisation
4+
5+
GeometricWidthDiscretiser
6+
=========================
7+
8+
The :class:`GeometricWidthDiscretiser()` divides continuous numerical variables into
9+
intervals of increasing width. The width of each succeeding interval is larger than the
10+
previous interval by a constant factor (cw).
11+
12+
The constant factor is calculated as:
13+
14+
.. math::
15+
cw = (Max - Min)^{1/n}
16+
17+
where Max and Min are the variable's maximum and minimum value, and n is the number of
18+
intervals.
19+
20+
The sizes of the intervals themselves are calculated with a geometric progression:
21+
22+
.. math::
23+
a_{i+1} = a_i cw
24+
25+
Thus, the first interval ends at Min + cw, the second interval ends at Min + cw^2,
26+
and so on.
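
The progression can be sketched with NumPy. This is only an illustration, not the transformer's actual code: it assumes the interval limits sit at Min + cw^i, which matches the `binner_dict_` output shown later in this guide. The helper name `geometric_limits` and the toy values are hypothetical.

```python
import numpy as np


def geometric_limits(values, n_bins):
    """Return cw and the interval limits placed at Min + cw**i (illustrative helper)."""
    values = np.asarray(values, dtype=float)
    min_, max_ = values.min(), values.max()
    # Constant factor: cw = (Max - Min)^(1/n)
    cw = (max_ - min_) ** (1 / n_bins)
    # Offsets from the minimum follow a geometric progression: cw, cw**2, ...
    inner = min_ + cw ** np.arange(1, n_bins)
    # Open-ended outer limits so that unseen values still fall into a bin
    return cw, np.r_[-np.inf, inner, np.inf]


cw, limits = geometric_limits([0.0, 3.0, 10.0, 81.0], n_bins=4)
print(cw)      # approximately 3.0
print(limits)  # approximately [-inf, 3, 9, 27, inf]
```

With a range of 81 and 4 bins, cw is the fourth root of 81, so each limit lies three times further from the minimum than the previous one.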
27+
28+
Note that the proportion of observations per interval may vary.
29+
30+
This discretisation technique works well when the distribution of the variable is right skewed.
31+
32+
Note: The width of some bins might be very small. Thus, to allow this transformer
33+
to work properly, it might help to increase the precision value, that is,
34+
the number of decimal places allowed to define each bin. If the variable has a
35+
narrow range or you are sorting into several bins, allow greater precision
36+
(i.e., if precision = 3, then 0.001; if precision = 7, then 0.0000001).
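
To see why precision matters, here is a rough sketch. It assumes that precision acts like rounding each limit to that many decimal places, which may not be exactly how the transformer applies it. The variable range and the names used are hypothetical.

```python
import numpy as np

# Hypothetical narrow-range variable sorted into 10 bins.
min_, max_, n_bins = 0.0, 1.001, 10
cw = (max_ - min_) ** (1 / n_bins)
limits = min_ + cw ** np.arange(1, n_bins)  # 9 inner limits, roughly 0.0001 apart

# Assumption: precision behaves like rounding each limit to that many decimals.
distinct_at_3 = np.unique(np.round(limits, 3)).size
distinct_at_7 = np.unique(np.round(limits, 7)).size

# Fewer distinct limits survive at precision 3 than at precision 7.
print(distinct_at_3, distinct_at_7)
```

When several limits collapse to the same rounded value, the bins between them can no longer be told apart, which is why a larger precision helps for narrow ranges or many bins.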
37+
38+
The :class:`GeometricWidthDiscretiser()` works only with numerical variables. A list of
39+
variables to discretise can be indicated, or the discretiser will automatically select
40+
all numerical variables in the train set.
41+
42+
**Example**
43+
44+
Let's look at an example using the house prices dataset (more details about the
45+
dataset :ref:`here <datasets>`).
46+
47+
Let's load the house prices dataset and separate it into train and test sets:
48+
49+
.. code:: python
50+
51+
import numpy as np
52+
import pandas as pd
53+
import matplotlib.pyplot as plt
54+
from sklearn.model_selection import train_test_split
55+
56+
from feature_engine.discretisation import GeometricWidthDiscretiser
57+
58+
# Load dataset
59+
data = pd.read_csv('houseprice.csv')
60+
61+
# Separate into train and test sets
62+
X_train, X_test, y_train, y_test = train_test_split(
63+
data.drop(['Id', 'SalePrice'], axis=1),
64+
data['SalePrice'], test_size=0.3, random_state=0)
65+
66+
67+
Now, we want to discretise the 2 variables indicated below into 10 intervals of increasing
68+
width:
69+
70+
.. code:: python
71+
72+
# set up the discretisation transformer
73+
disc = GeometricWidthDiscretiser(bins=10, variables=['LotArea', 'GrLivArea'])
74+
75+
# fit the transformer
76+
disc.fit(X_train)
77+
78+
With `fit()` the transformer learns the boundaries of each interval. Then, we can go
79+
ahead and sort the values into the intervals:
80+
81+
.. code:: python
82+
83+
# transform the data
84+
train_t = disc.transform(X_train)
85+
test_t = disc.transform(X_test)
86+
87+
The `binner_dict_` stores the interval limits identified for each variable.
88+
89+
.. code:: python
90+
91+
disc.binner_dict_
92+
93+
.. code:: python
94+
95+
{'LotArea': [-inf,
96+
1303.412,
97+
1311.643,
98+
1339.727,
99+
1435.557,
100+
1762.542,
101+
2878.27,
102+
6685.32,
103+
19675.608,
104+
64000.633,
105+
inf],
106+
'GrLivArea': [-inf,
107+
336.311,
108+
339.34,
109+
346.34,
110+
362.515,
111+
399.894,
112+
486.27,
113+
685.871,
114+
1147.115,
115+
2212.974,
116+
inf]}
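
As a quick sanity check on these numbers, the sketch below copies the finite `LotArea` limits from the output above and compares the widths of consecutive intervals. It uses only NumPy and the printed values; the variable names are illustrative.

```python
import numpy as np

# Finite LotArea limits copied from the binner_dict_ output above.
lot_area_limits = np.array([
    1303.412, 1311.643, 1339.727, 1435.557, 1762.542,
    2878.27, 6685.32, 19675.608, 64000.633,
])

# Width of each interval between consecutive finite limits.
widths = np.diff(lot_area_limits)
# Ratio between each width and the previous one.
ratios = widths[1:] / widths[:-1]

print(ratios.round(2))  # every ratio is approximately 3.41
```

Each interval between these limits is about 3.41 times wider than the one before it, which is the constant factor of the geometric progression.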
117+
118+
With increasing width discretisation, each bin does not necessarily contain the same number
119+
of observations. This transformer is suitable for variables with right skewed distributions.
120+
121+
Let's compare the variable distribution before and after the discretisation:
122+
123+
.. code:: python
124+
125+
fig, ax = plt.subplots(1, 2)
126+
X_train['LotArea'].hist(ax=ax[0], bins=10);
127+
train_t['LotArea'].hist(ax=ax[1], bins=10);
128+
129+
We can see below that the intervals contain different numbers of observations. We can also
130+
see that the shape of the distribution changed from skewed to a more "bell shaped"
131+
distribution.
132+
133+
.. image:: ../../images/increasingwidthdisc.png
134+
135+
|
136+
137+
**Discretisation plus encoding**
138+
139+
If we return the interval values as integers, the discretiser has the option to return
140+
the transformed variable as integer or as object. Why would we want the transformed
141+
variables as object?
142+
143+
Categorical encoders in Feature-engine are designed to work with variables of type
144+
object by default. Thus, if you wish to encode the returned bins further, say to try and
145+
obtain monotonic relationships between the variable and the target, you can do so
146+
seamlessly by setting `return_object` to True. You can find an example of how to use
147+
this functionality `here <https://nbviewer.org/github/feature-engine/feature-engine-examples/blob/main/discretisation/GeometricWidthDiscretiser_plus_MeanEncoder.ipynb>`_.
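
The difference between the two output types can be illustrated with pandas alone. This sketch does not call the discretiser; it only mimics what integer output and object output look like, using hypothetical values and interval limits.

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, 4.0, 12.0, 40.0])
limits = [-np.inf, 3.0, 9.0, 27.0, np.inf]  # hypothetical interval limits

# labels=False makes pd.cut return the integer position of each interval.
as_int = pd.cut(values, bins=limits, labels=False)
# Casting to object gives the dtype that object-based encoders work with by default.
as_obj = as_int.astype("O")

print(as_int.tolist())  # [0, 1, 2, 3]
print(as_obj.dtype)     # object
```

The bin numbers are the same in both cases; only the dtype changes, and that is what lets a categorical encoder treat the bins as categories.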
148+
149+
More details
150+
^^^^^^^^^^^^
151+
152+
For more details on how to use this transformer, see also:
153+
154+
- `Jupyter notebook - Geometric Discretiser <https://nbviewer.org/github/feature-engine/feature-engine-examples/blob/main/discretisation/GeometricWidthDiscretiser.ipynb>`_
155+
- `Jupyter notebook - Geometric Discretiser plus Mean encoding <https://nbviewer.org/github/feature-engine/feature-engine-examples/blob/main/discretisation/GeometricWidthDiscretiser_plus_MeanEncoder.ipynb>`_
156+
157+
All notebooks can be found in a `dedicated repository <https://github.com/feature-engine/feature-engine-examples>`_.

docs/user_guide/discretisation/index.rst

Lines changed: 1 addition & 0 deletions
@@ -34,3 +34,4 @@ Throughout the user guide, we point to jupyter notebooks that showcase this func
3434
EqualWidthDiscretiser
3535
ArbitraryDiscretiser
3636
DecisionTreeDiscretiser
37+
GeometricWidthDiscretiser

feature_engine/_docstrings/init_parameters/discretisers.py

Lines changed: 4 additions & 0 deletions
@@ -8,3 +8,7 @@
88
Whether the output should be the interval boundaries. If True, it returns
99
the interval boundaries. If False, it returns integers.
1010
""".rstrip()
11+
12+
_precision_docstring = """precision: int, default=3
13+
The precision at which to store and display the bin labels.
14+
""".rstrip()

feature_engine/discretisation/__init__.py

Lines changed: 2 additions & 0 deletions
@@ -7,10 +7,12 @@
77
from .decision_tree import DecisionTreeDiscretiser
88
from .equal_frequency import EqualFrequencyDiscretiser
99
from .equal_width import EqualWidthDiscretiser
10+
from .geometric_width import GeometricWidthDiscretiser
1011

1112
__all__ = [
1213
"DecisionTreeDiscretiser",
1314
"EqualFrequencyDiscretiser",
1415
"EqualWidthDiscretiser",
1516
"ArbitraryDiscretiser",
17+
"GeometricWidthDiscretiser",
1618
]

feature_engine/discretisation/arbitrary.py

Lines changed: 12 additions & 5 deletions
@@ -8,13 +8,15 @@
88

99
from feature_engine._base_transformers.mixins import FitFromDictMixin
1010
from feature_engine._docstrings.fit_attributes import (
11+
_binner_dict_docstring,
1112
_feature_names_in_docstring,
1213
_n_features_in_docstring,
13-
_variables_attribute_docstring, _binner_dict_docstring,
14+
_variables_attribute_docstring,
1415
)
1516
from feature_engine._docstrings.init_parameters.discretisers import (
16-
_return_object_docstring,
17+
_precision_docstring,
1718
_return_boundaries_docstring,
19+
_return_object_docstring,
1820
)
1921
from feature_engine._docstrings.methods import (
2022
_fit_not_learn_docstring,
@@ -29,6 +31,7 @@
2931
@Substitution(
3032
return_object=_return_object_docstring,
3133
return_boundaries=_return_boundaries_docstring,
34+
precision=_precision_docstring,
3235
binner_dict_=_binner_dict_docstring,
3336
transform=_transform_discretiser_docstring,
3437
variables_=_variables_attribute_docstring,
@@ -59,6 +62,8 @@ class ArbitraryDiscretiser(BaseDiscretiser, FitFromDictMixin):
5962
6063
{return_boundaries}
6164
65+
{precision}
66+
6267
errors: string, default='ignore'
6368
Indicates what to do when a value is outside the limits indicated in the
6469
'binning_dict'. If 'raise', the transformation will raise an error.
@@ -111,6 +116,7 @@ def __init__(
111116
binning_dict: Dict[Union[str, int], List[Union[str, int]]],
112117
return_object: bool = False,
113118
return_boundaries: bool = False,
119+
precision: int = 3,
114120
errors: str = "ignore",
115121
) -> None:
116122

@@ -126,7 +132,7 @@ def __init__(
126132
f"Got {errors} instead."
127133
)
128134

129-
super().__init__(return_object, return_boundaries)
135+
super().__init__(return_object, return_boundaries, precision)
130136

131137
self.binning_dict = binning_dict
132138
self.errors = errors
@@ -153,12 +159,13 @@ def fit(self, X: pd.DataFrame, y: Optional[pd.Series] = None):
153159
return self
154160

155161
def transform(self, X: pd.DataFrame) -> pd.DataFrame:
156-
"""Sort the variable values into the intervals.
162+
"""
163+
Sort the variable values into the intervals.
157164
158165
Parameters
159166
----------
160167
X: pandas dataframe of shape = [n_samples, n_features]
161-
The dataframe to be transformed.
168+
The data to transform.
162169
163170
Returns
164171
-------
