|
| 1 | +.. _increasing_width_discretiser: |
| 2 | + |
| 3 | +.. currentmodule:: feature_engine.discretisation |
| 4 | + |
| 5 | +GeometricWidthDiscretiser |
| 6 | +========================= |
| 7 | + |
| 8 | +The :class:`GeometricWidthDiscretiser()` divides continuous numerical variables into |
| 9 | +intervals of increasing width. Each succeeding interval is wider than the |
| 10 | +previous one by a constant factor (cw). |
| 11 | + |
| 12 | +The constant factor is calculated as: |
| 13 | + |
| 14 | + .. math:: |
| 15 | + cw = (Max - Min)^{1/n} |
| 16 | +
|
| 17 | +where Max and Min are the variable's maximum and minimum values, and n is the number of |
| 18 | +intervals. |
| 19 | + |
| 20 | +The sizes of the intervals themselves are calculated with a geometric progression: |
| 21 | + |
| 22 | + .. math:: |
| 23 | + a_{i+1} = a_i cw |
| 24 | +
|
| 25 | +Thus, the sizes grow by a constant factor rather than by a constant step: the first size |
| 26 | +equals cw, the second equals cw^2, the third equals cw^3, and so on. |
| 27 | + |
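As a rough illustration of the two formulas above, the sketch below computes cw and a set of interval limits for a toy variable. The function name ``geometric_limits`` and the toy values are hypothetical. Placing the limits at Min + cw^i is an assumption chosen to agree with the limits shown in the example further down; it is not a copy of the library's code.

.. code:: python

    import numpy as np

    def geometric_limits(min_value, max_value, n_bins):
        """Return cw and the limits min + cw**i for i = 1 .. n_bins - 1."""
        # cw = (Max - Min)^(1/n)
        cw = (max_value - min_value) ** (1 / n_bins)
        # each offset from the minimum is cw times the previous one
        inner = min_value + cw ** np.arange(1, n_bins)
        # open-ended outer limits, as in the binner_dict_ output
        return cw, np.r_[-np.inf, inner, np.inf]

    # toy variable spanning 0 to 1024, sorted into 10 intervals
    cw, limits = geometric_limits(0, 1024, 10)
    print(round(cw, 3))         # close to 2.0
    print(np.round(limits, 3))  # inner limits double at each step

With these toy values, the inner limits double from 2 up to 512, so the intervals near the minimum are narrow and those near the maximum are wide.
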
| 28 | +Note that the proportion of observations per interval may vary. |
| 29 | + |
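| 30 | +This discretisation technique is well suited to right-skewed variables: the narrow intervals at the low end separate the densely packed small values, and the wide intervals at the high end absorb the long tail. |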
| 31 | + |
| 32 | +Note: The width of some bins might be very small. Thus, to allow this transformer |
| 33 | +to work properly, it might help to increase the precision value, that is, |
| 34 | +the number of decimal values allowed to define each bin. If the variable has a |
| 35 | +narrow range or you are sorting into several bins, allow greater precision |
| 36 | +(i.e., if precision = 3, then 0.001; if precision = 7, then 0.0000001). |
| 37 | + |
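To see why precision matters, the following sketch rounds the limits of a hypothetical narrow-range variable at two precisions and counts how many distinct limits survive. The values are made up, and applying ``np.round`` to the limits is an assumption about how the rounding takes place.

.. code:: python

    import numpy as np

    # hypothetical variable with a narrow range, sorted into 10 intervals
    min_value, max_value, n_bins = 0.0, 1.2, 10
    cw = (max_value - min_value) ** (1 / n_bins)
    inner = min_value + cw ** np.arange(1, n_bins)

    # count the distinct limits that remain after rounding
    distinct = {}
    for precision in (1, 3):
        rounded = np.round(inner, precision)
        distinct[precision] = len(np.unique(rounded))
        print(precision, distinct[precision], "distinct limits out of", len(inner))

With one decimal, several limits collapse onto the same value, whereas three decimals keep all nine apart.
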
| 38 | +The :class:`GeometricWidthDiscretiser()` works only with numerical variables. A list of |
| 39 | +variables to discretise can be indicated, or the discretiser will automatically select |
| 40 | +all numerical variables in the train set. |
| 41 | + |
| 42 | +**Example** |
| 43 | + |
| 44 | +Let's look at an example using the house prices dataset (more details about the |
| 45 | +dataset :ref:`here <datasets>`). |
| 46 | + |
| 47 | +Let's load the house prices dataset and separate it into train and test sets: |
| 48 | + |
| 49 | +.. code:: python |
| 50 | +
|
| 51 | + import numpy as np |
| 52 | + import pandas as pd |
| 53 | + import matplotlib.pyplot as plt |
| 54 | + from sklearn.model_selection import train_test_split |
| 55 | +
|
| 56 | + from feature_engine.discretisation import GeometricWidthDiscretiser |
| 57 | +
|
| 58 | + # Load dataset |
| 59 | + data = pd.read_csv('houseprice.csv') |
| 60 | +
|
| 61 | + # Separate into train and test sets |
| 62 | + X_train, X_test, y_train, y_test = train_test_split( |
| 63 | + data.drop(['Id', 'SalePrice'], axis=1), |
| 64 | + data['SalePrice'], test_size=0.3, random_state=0) |
| 65 | +
|
| 66 | +
|
| 67 | +Now, we want to discretise the 2 variables indicated below into 10 intervals of increasing |
| 68 | +width: |
| 69 | + |
| 70 | +.. code:: python |
| 71 | +
|
| 72 | + # set up the discretisation transformer |
| 73 | + disc = GeometricWidthDiscretiser(bins=10, variables=['LotArea', 'GrLivArea']) |
| 74 | +
|
| 75 | + # fit the transformer |
| 76 | + disc.fit(X_train) |
| 77 | +
|
| 78 | +With `fit()` the transformer learns the boundaries of each interval. Then, we can go |
| 79 | +ahead and sort the values into the intervals: |
| 80 | + |
| 81 | +.. code:: python |
| 82 | +
|
| 83 | + # transform the data |
| 84 | + train_t = disc.transform(X_train) |
| 85 | + test_t = disc.transform(X_test) |
| 86 | +
|
| 87 | +The `binner_dict_` attribute stores the interval limits identified for each variable: |
| 88 | + |
| 89 | +.. code:: python |
| 90 | +
|
| 91 | + disc.binner_dict_ |
| 92 | +
|
| 93 | +.. code:: python |
| 94 | +
|
| 95 | + {'LotArea': [-inf, |
| 96 | + 1303.412, |
| 97 | + 1311.643, |
| 98 | + 1339.727, |
| 99 | + 1435.557, |
| 100 | + 1762.542, |
| 101 | + 2878.27, |
| 102 | + 6685.32, |
| 103 | + 19675.608, |
| 104 | + 64000.633, |
| 105 | + inf], |
| 106 | + 'GrLivArea': [-inf, |
| 107 | + 336.311, |
| 108 | + 339.34, |
| 109 | + 346.34, |
| 110 | + 362.515, |
| 111 | + 399.894, |
| 112 | + 486.27, |
| 113 | + 685.871, |
| 114 | + 1147.115, |
| 115 | + 2212.974, |
| 116 | + inf]} |
| 117 | +
|
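To make the sorting step concrete, here is a pandas-only sketch that assigns a few hypothetical lot areas to the LotArea limits listed above. Using ``pd.cut`` is an assumption made for illustration; it is not taken from the transformer's source.

.. code:: python

    import numpy as np
    import pandas as pd

    # LotArea limits copied from the binner_dict_ output above
    limits = [-np.inf, 1303.412, 1311.643, 1339.727, 1435.557, 1762.542,
              2878.27, 6685.32, 19675.608, 64000.633, np.inf]

    # hypothetical lot areas, not taken from the dataset
    lot_area = pd.Series([1300, 5000, 10000, 100000])

    # labels=False returns the integer position of each value's interval
    codes = pd.cut(lot_area, bins=limits, labels=False)
    print(codes.tolist())  # [0, 6, 7, 9]

Most of the ten intervals lie below 3000, so typical lot areas of a few thousand fall into the upper intervals.
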
| 118 | +With increasing width discretisation, each bin does not necessarily contain the same number |
| 119 | +of observations. This transformer is suitable for variables with right skewed distributions. |
| 120 | + |
| 121 | +Let's compare the variable distribution before and after the discretisation: |
| 122 | + |
| 123 | +.. code:: python |
| 124 | +
|
| 125 | + fig, ax = plt.subplots(1, 2) |
| 126 | + X_train['LotArea'].hist(ax=ax[0], bins=10); |
| 127 | + train_t['LotArea'].hist(ax=ax[1], bins=10); |
| 128 | +
|
| 129 | +We can see below that the intervals contain different numbers of observations. We can also |
| 130 | +see that the shape of the distribution changed from skewed to more "bell shaped". |
| 131 | + |
| 132 | + |
| 133 | +.. image:: ../../images/increasingwidthdisc.png |
| 134 | + |
| 135 | +| |
| 136 | +
|
| 137 | +**Discretisation plus encoding** |
| 138 | + |
| 139 | +When the intervals are returned as integers, the discretiser can return the transformed |
| 140 | +variable with either an integer or an object data type. Why would we want the transformed |
| 141 | +variables as object? |
| 142 | + |
| 143 | +Categorical encoders in Feature-engine are designed to work with variables of type |
| 144 | +object by default. Thus, if you wish to encode the returned bins further, say to try and |
| 145 | +obtain monotonic relationships between the variable and the target, you can do so |
| 146 | +seamlessly by setting `return_object` to True. You can find an example of how to use |
| 147 | +this functionality `here <https://nbviewer.org/github/feature-engine/feature-engine-examples/blob/main/discretisation/GeometricWidthDiscretiser_plus_MeanEncoder.ipynb>`_. |
| 148 | + |
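The idea can be sketched with pandas alone. The block below is a stand-in for a mean encoder, not Feature-engine's implementation, and the bin codes and target values are hypothetical.

.. code:: python

    import pandas as pd

    # hypothetical bin codes, as a discretiser might return them, and a target
    bins = pd.Series([0, 0, 1, 1, 2, 2], name="LotArea")
    target = pd.Series([100.0, 120.0, 150.0, 170.0, 200.0, 240.0], name="SalePrice")

    # cast the codes to object so they are treated as categories
    bins_as_object = bins.astype(object)

    # mean target value per bin, then replace each code by that mean
    bin_means = target.groupby(bins_as_object).mean()
    encoded = bins_as_object.map(bin_means)
    print(encoded.tolist())  # [110.0, 110.0, 160.0, 160.0, 220.0, 220.0]

In this toy case, the encoded values rise with the bin order, which is the kind of monotonic relationship the encoding step aims for.
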
| 149 | +More details |
| 150 | +^^^^^^^^^^^^ |
| 151 | + |
| 152 | +For more details on how to use this transformer, see: |
| 153 | + |
| 154 | +- `Jupyter notebook - Geometric Discretiser <https://nbviewer.org/github/feature-engine/feature-engine-examples/blob/main/discretisation/GeometricWidthDiscretiser.ipynb>`_ |
| 155 | +- `Jupyter notebook - Geometric Discretiser plus Mean encoding <https://nbviewer.org/github/feature-engine/feature-engine-examples/blob/main/discretisation/GeometricWidthDiscretiser_plus_MeanEncoder.ipynb>`_ |
| 156 | + |
| 157 | +All notebooks can be found in a `dedicated repository <https://github.com/feature-engine/feature-engine-examples>`_. |