|
11 | 11 | class CatBoostEncoder(util.BaseEncoder, util.SupervisedTransformerMixin): |
12 | 12 | """CatBoost Encoding for categorical features. |
13 | 13 |
|
14 | | - Supported targets: binomial and continuous. For polynomial target support, see PolynomialWrapper. |
15 | | -
|
16 | | - This is very similar to leave-one-out encoding, but calculates the |
17 | | - values "on-the-fly". Consequently, the values naturally vary |
18 | | - during the training phase and it is not necessary to add random noise. |
19 | | -
|
20 | | - Beware, the training data have to be randomly permutated. E.g.: |
21 | | -
|
22 | | - # Random permutation |
23 | | - perm = np.random.permutation(len(X)) |
24 | | - X = X.iloc[perm].reset_index(drop=True) |
25 | | - y = y.iloc[perm].reset_index(drop=True) |
26 | | -
|
27 | | - This is necessary because some data sets are sorted based on the target |
28 | | - value and this coder encodes the features on-the-fly in a single pass. |
| 14 | + Supported targets: binomial and continuous. For polynomial target support, see PolynomialWrapper. |
| 15 | +
|
| 16 | + CatBoostEncoder is a variation of target encoding. It supports |
| 17 | + time-aware encoding, regularization, and online learning. |
| 18 | +
|
| 19 | + This implementation is time-aware (similar to CatBoost's parameter 'has_time=True'), |
| 20 | + so no random permutations are used. This makes the encoder sensitive to the |
| 21 | + ordering of the data and suitable for time series problems. If your data |
| 22 | + has no time dependency, it should still work, provided that the |
| 23 | + ordering of the data does not leak information from outside the training scope |
| 24 | + (i.e., no data leakage). When data leakage is a possibility, it is wise to |
| 25 | + eliminate it first (for example, by shuffling or resampling the data). |
| 26 | +
|
| 27 | + NOTE: the behavior of the transformer differs between the transform and fit_transform |
| 28 | + methods depending on whether y values are passed. If no target is passed, the |
| 29 | + encoder maps the last value of the running mean to each category. If y is passed, |
| 30 | + it maps each successive value of the running mean to the corresponding occurrence of the category. |
29 | 31 |
|
30 | 32 | Parameters |
31 | 33 | ---------- |
|
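The difference described in the new NOTE (running mean per occurrence when y is passed, last running-mean value per category when it is not) can be illustrated with a small pandas sketch. This is not the encoder's actual code: the function name `ordered_target_encode`, the smoothing weight `a`, and the use of the global target mean as a prior are assumptions made for illustration only.

```python
import pandas as pd


def ordered_target_encode(categories, y, a=1.0):
    """Illustrative ordered (time-aware) target encoding for one column.

    Hypothetical helper, not part of any library. `a` is an assumed
    smoothing weight applied to the global target mean (the prior).
    Returns (encoding_with_target, encoding_without_target).
    """
    categories = pd.Series(categories).reset_index(drop=True)
    y = pd.Series(y, dtype="float64").reset_index(drop=True)
    prior = y.mean()

    grouped = y.groupby(categories)
    # Sum and count of targets seen *before* each row within its category,
    # in the order the rows appear (no random permutation).
    prev_sum = grouped.cumsum() - y
    prev_count = grouped.cumcount()
    # Target available: each occurrence gets its own running-mean value.
    with_target = (prev_sum + prior * a) / (prev_count + a)

    # No target available: every occurrence of a category gets the final
    # value of the running mean, computed from all rows of that category.
    totals = grouped.agg(["sum", "count"])
    final = (totals["sum"] + prior * a) / (totals["count"] + a)
    without_target = categories.map(final)
    return with_target, without_target


# Row order matters: reordering the rows changes `with_y`.
with_y, without_y = ordered_target_encode(
    ["a", "b", "a", "a", "b"], [1, 0, 0, 1, 1]
)
```

Because each row only sees targets from earlier rows, the first occurrence of every category falls back to the prior, which is why the ordering of the data matters.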