Commit 7f50004

Merge remote-tracking branch 'origin/master'
2 parents baacec7 + b37a212 commit 7f50004

8 files changed

+97
-83
lines changed

README.md

Lines changed: 62 additions & 56 deletions
Original file line number | Diff line number | Diff line change
@@ -22,102 +22,108 @@ Encoding Methods
2222
* Binary [5]
2323
* Hashing [1]
2424
* Helmert Contrast [2][3]
25+
* James-Stein Estimator [9]
2526
* LeaveOneOut [4]
27+
* M-estimator [7]
2628
* Ordinal [2][3]
2729
* One-Hot [2][3]
2830
* Polynomial Contrast [2][3]
2931
* Sum Contrast [2][3]
3032
* Target Encoding [7]
3133
* Weight of Evidence [8]
3234

33-
Usage
35+
Installation
3436
-----
3537

36-
The package by itself comes with a single module and an estimator. Before
37-
installing the module you will need `numpy`, `statsmodels`, and `scipy`.
38+
The package requires: `numpy`, `statsmodels`, and `scipy`.
3839

39-
To install the module execute:
40+
To install the package, execute:
4041

4142
```shell
4243
$ python setup.py install
4344
```
4445

4546
or
4647

47-
```
48+
```shell
4849
pip install category_encoders
4950
```
5051

5152
or
5253

53-
```
54+
```shell
5455
conda install -c conda-forge category_encoders
5556
```
56-
57-
To use:
58-
59-
import category_encoders as ce
60-
61-
encoder = ce.BackwardDifferenceEncoder(cols=[...])
62-
encoder = ce.BaseNEncoder(cols=[...])
63-
encoder = ce.BinaryEncoder(cols=[...])
64-
encoder = ce.HashingEncoder(cols=[...])
65-
encoder = ce.HelmertEncoder(cols=[...])
66-
encoder = ce.LeaveOneOutEncoder(cols=[...])
67-
encoder = ce.OneHotEncoder(cols=[...])
68-
encoder = ce.OrdinalEncoder(cols=[...])
69-
encoder = ce.PolynomialEncoder(cols=[...])
70-
encoder = ce.SumEncoder(cols=[...])
71-
encoder = ce.TargetEncoder(cols=[...])
72-
encoder = ce.WOEEncoder(cols=[...])
73-
74-
All of these are fully compatible sklearn transformers, so they can be used in pipelines or in your existing scripts. Supported input formats include numpy arrays and pandas dataframes. If the cols parameter isn't passed, all columns with object or pandas categorical data type will be encoded. Please see the docs for transformer-specific configuration options.
7557

76-
Examples
77-
--------
58+
To install the development version, you may use:
7859

79-
from category_encoders import *
80-
import pandas as pd
81-
from sklearn.datasets import load_boston
82-
83-
# prepare some data
84-
bunch = load_boston()
85-
y = bunch.target
86-
X = pd.DataFrame(bunch.data, columns=bunch.feature_names)
60+
```shell
61+
pip install --upgrade git+https://github.com/scikit-learn-contrib/categorical-encoding
62+
```
8763

88-
# use binary encoding to encode two categorical features
89-
enc = BinaryEncoder(cols=['CHAS', 'RAD']).fit(X, y)
64+
Usage
65+
-----
9066

91-
# transform the dataset
92-
numeric_dataset = enc.transform(X)
67+
All of the encoders are fully compatible sklearn transformers, so they can be used in pipelines or in your existing scripts. Supported input formats include numpy arrays and pandas dataframes. If the cols parameter isn't passed, all columns with object or pandas categorical data type will be encoded. Please see the docs for transformer-specific configuration options.
9368

94-
In the examples directory, there is an example script used to benchmark
95-
different encoding techniques on various datasets.
69+
Examples
70+
--------
71+
There are two types of encoders: unsupervised and supervised. An unsupervised example:
72+
```python
73+
from category_encoders import *
74+
import pandas as pd
75+
from sklearn.datasets import load_boston
76+
77+
# prepare some data
78+
bunch = load_boston()
79+
y = bunch.target
80+
X = pd.DataFrame(bunch.data, columns=bunch.feature_names)
81+
82+
# use binary encoding to encode two categorical features
83+
enc = BinaryEncoder(cols=['CHAS', 'RAD']).fit(X)
84+
85+
# transform the dataset
86+
numeric_dataset = enc.transform(X)
87+
```
9688

97-
The datasets used in the examples are car, mushroom, and splice datasets
98-
from the UCI dataset repository, found here:
89+
And a supervised example:
90+
```python
91+
from category_encoders import *
92+
import pandas as pd
93+
from sklearn.datasets import load_boston
94+
95+
# prepare some data
96+
bunch = load_boston()
97+
y_train = bunch.target[0:250]
98+
y_test = bunch.target[250:506]
99+
X_train = pd.DataFrame(bunch.data[0:250], columns=bunch.feature_names)
100+
X_test = pd.DataFrame(bunch.data[250:506], columns=bunch.feature_names)
101+
102+
# use target encoding to encode two categorical features
103+
enc = TargetEncoder(cols=['CHAS', 'RAD']).fit(X_train, y_train)
104+
105+
# transform the datasets
106+
training_numeric_dataset = enc.transform(X_train, y_train)
107+
testing_numeric_dataset = enc.transform(X_test)
108+
```
99109

100-
[datasets](https://archive.ics.uci.edu/ml/datasets)
110+
Additional examples and benchmarks can be found in the `examples` directory.
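
As a rough illustration of why the supervised example passes `y_train` to `fit`, here is a plain-pandas sketch of mean target encoding. It is a simplification, not the library's `TargetEncoder`: it uses made-up data, omits the smoothing described in reference [7], and falls back to the global mean for unseen categories as an assumption of this sketch.

```python
import pandas as pd

# Hypothetical toy data; the column name only mirrors the README example.
X_train = pd.DataFrame({"RAD": ["1", "2", "1", "2"]})
y_train = pd.Series([10.0, 20.0, 30.0, 40.0])
X_test = pd.DataFrame({"RAD": ["1", "3"]})  # "3" never appears in training

# "fit": learn the mean target per category, plus a global fallback.
category_means = y_train.groupby(X_train["RAD"]).mean()
global_mean = y_train.mean()

# "transform": replace each category by its learned mean;
# categories unseen during fit get the global mean (sketch assumption).
encoded_test = X_test["RAD"].map(category_means).fillna(global_mean)

print(encoded_test.tolist())
```

Because the statistics come from `y_train` only, the test set can be transformed without its target, which matches the `enc.transform(X_test)` call in the supervised example.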
101111

102112
Contributing
103113
------------
104114

105115
Category encoders is under active development, if you'd like to be involved, we'd love to have you. Check out the CONTRIBUTING.md file
106116
or open an issue on the github project to get started.
107117

108-
License
109-
-------
110-
111-
BSD 3-Clause
112-
113118
References:
114119
-----------
115120

116121
1. Kilian Weinberger; Anirban Dasgupta; John Langford; Alex Smola; Josh Attenberg (2009). Feature Hashing for Large Scale Multitask Learning. Proc. ICML.
117-
2. Contrast Coding Systems for categorical variables. UCLA: Statistical Consulting Group. from https://stats.idre.ucla.edu/r/library/r-library-contrast-coding-systems-for-categorical-variables/.
118-
3. Gregory Carey (2003). Coding Categorical Variables. from http://psych.colorado.edu/~carey/Courses/PSYC5741/handouts/Coding%20Categorical%20Variables%202006-03-03.pdf
119-
4. Strategies to encode categorical variables with many categories. from https://www.kaggle.com/c/caterpillar-tube-pricing/discussion/15748#143154.
120-
5. Beyond One-Hot: an exploration of categorical variables. from http://www.willmcginnis.com/2015/11/29/beyond-one-hot-an-exploration-of-categorical-variables/
121-
6. BaseN Encoding and Grid Search in categorical variables. from http://www.willmcginnis.com/2016/12/18/basen-encoding-grid-search-category_encoders/
122-
7. Daniele Miccii-Barreca (2001). A Preprocessing Scheme for High-Cardinality Categorical Attributes in Classification and Prediction Problems. SIGKDD Explor. Newsl. 3, 1. from http://dx.doi.org/10.1145/507533.507538
123-
8. Weight of Evidence (WOE) and Information Value Explained. from https://www.listendata.com/2015/03/weight-of-evidence-woe-and-information.html
122+
2. Contrast Coding Systems for categorical variables. UCLA: Statistical Consulting Group. From https://stats.idre.ucla.edu/r/library/r-library-contrast-coding-systems-for-categorical-variables/.
123+
3. Gregory Carey (2003). Coding Categorical Variables. From http://psych.colorado.edu/~carey/Courses/PSYC5741/handouts/Coding%20Categorical%20Variables%202006-03-03.pdf
124+
4. Strategies to encode categorical variables with many categories. From https://www.kaggle.com/c/caterpillar-tube-pricing/discussion/15748#143154.
125+
5. Beyond One-Hot: an exploration of categorical variables. From http://www.willmcginnis.com/2015/11/29/beyond-one-hot-an-exploration-of-categorical-variables/
126+
6. BaseN Encoding and Grid Search in categorical variables. From http://www.willmcginnis.com/2016/12/18/basen-encoding-grid-search-category_encoders/
127+
7. Daniele Miccii-Barreca (2001). A Preprocessing Scheme for High-Cardinality Categorical Attributes in Classification and Prediction Problems. SIGKDD Explor. Newsl. 3, 1. From http://dx.doi.org/10.1145/507533.507538
128+
8. Weight of Evidence (WOE) and Information Value Explained. From https://www.listendata.com/2015/03/weight-of-evidence-woe-and-information.html
129+
9. Empirical Bayes for multiple sample sizes. From http://chris-said.io/2017/05/03/empirical-bayes-for-multiple-sample-sizes/

category_encoders/backward_difference.py

Lines changed: 4 additions & 4 deletions
@@ -25,12 +25,12 @@ class BackwardDifferenceEncoder(BaseEstimator, TransformerMixin):
2525
return_df: bool
2626
boolean for whether to return a pandas DataFrame from transform (otherwise it will be a numpy array).
2727
handle_unknown: str
28-
options are 'error', 'return_nan' and 'value', defaults to 'value'. Warning: if value is used,
28+
options are 'error', 'return_nan', 'value', and 'indicator'. The default is 'value'. Warning: if indicator is used,
2929
an extra column will be added in if the transform matrix has unknown categories. This can cause
3030
unexpected changes in dimension in some cases.
3131
handle_missing: str
32-
options are 'error', 'return_nan', 'value', and 'indicator', defaults to 'indicator'. Warning: if indicator is used,
33-
an extra column will be added in if the transform matrix has unknown categories. This can cause
32+
options are 'error', 'return_nan', 'value', and 'indicator'. The default is 'value'. Warning: if indicator is used,
33+
an extra column will be added in if the transform matrix has nan values. This can cause
3434
unexpected changes in dimension in some cases.
3535
3636
Example
@@ -267,7 +267,7 @@ def backward_difference_coding(X_in, mapping):
267267
col = switch.get('col')
268268
mod = switch.get('mapping')
269269

270-
base_df = mod.loc[X[col]]
270+
base_df = mod.reindex(X[col])
271271
base_df.set_index(X.index, inplace=True)
272272
X = pd.concat([base_df, X], axis=1)
273273
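
The switch from `mod.loc[X[col]]` to `mod.reindex(X[col])` in the hunk above changes what happens when a looked-up label is absent from the mapping. The sketch below uses a made-up mapping and column (both are assumptions, not taken from the library) to show that `reindex` yields a NaN row for a missing label, whereas label-based `.loc` raises `KeyError` on recent pandas versions.

```python
import pandas as pd

# Hypothetical mapping: one row of contrast codes per ordinal category code.
mapping = pd.DataFrame(
    {"col_0": [1.0, 0.0], "col_1": [0.0, 1.0]},
    index=[1, 2],
)

# Hypothetical encoded column; -1 stands for a category absent from the mapping.
codes = pd.Series([1, 2, -1], name="col")

# reindex returns one row per requested label and fills absent labels with NaN.
looked_up = mapping.reindex(codes)
print(looked_up)

# Label-based .loc with a missing label raises KeyError on recent pandas;
# older releases only emitted a warning, so the outcome is version-dependent.
try:
    mapping.loc[codes]
    loc_raised = False
except KeyError:
    loc_raised = True
print("loc raised KeyError:", loc_raised)
```

The result of `reindex` is indexed by the looked-up labels, which is why the surrounding code resets it with `base_df.set_index(X.index, inplace=True)` before concatenating.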

category_encoders/basen.py

Lines changed: 6 additions & 2 deletions
@@ -30,9 +30,13 @@ class BaseNEncoder(BaseEstimator, TransformerMixin):
3030
base: int
3131
when the downstream model copes well with nonlinearities (like decision tree), use higher base.
3232
handle_unknown: str
33-
options are 'error', 'return_nan' and 'value', defaults to 'value'. Warning: if value is used,
33+
options are 'error', 'return_nan', 'value', and 'indicator'. The default is 'value'. Warning: if indicator is used,
3434
an extra column will be added in if the transform matrix has unknown categories. This can cause
3535
unexpected changes in dimension in some cases.
36+
handle_missing: str
37+
options are 'error', 'return_nan', 'value', and 'indicator'. The default is 'value'. Warning: if indicator is used,
38+
an extra column will be added in if the transform matrix has nan values. This can cause
39+
unexpected changes in dimension in some cases.
3640
3741
Example
3842
-------
@@ -319,7 +323,7 @@ def basen_encode(self, X_in, cols=None):
319323
col = switch.get('col')
320324
mod = switch.get('mapping')
321325

322-
base_df = mod.loc[X[col]]
326+
base_df = mod.reindex(X[col])
323327
base_df.set_index(X.index, inplace=True)
324328
X = pd.concat([base_df, X], axis=1)
325329

category_encoders/binary.py

Lines changed: 3 additions & 3 deletions
@@ -23,12 +23,12 @@ class BinaryEncoder(BaseEstimator, TransformerMixin):
2323
return_df: bool
2424
boolean for whether to return a pandas DataFrame from transform (otherwise it will be a numpy array).
2525
handle_unknown: str
26-
options are 'error', 'return_nan' and 'value', defaults to 'value'. Warning: if value is used,
26+
options are 'error', 'return_nan', 'value', and 'indicator'. The default is 'value'. Warning: if indicator is used,
2727
an extra column will be added in if the transform matrix has unknown categories. This can cause
2828
unexpected changes in dimension in some cases.
2929
handle_missing: str
30-
options are 'error', 'return_nan', 'value', and 'indicator', defaults to 'indicator'. Warning: if indicator is used,
31-
an extra column will be added in if the transform matrix has unknown categories. This can cause
30+
options are 'error', 'return_nan', 'value', and 'indicator'. The default is 'value'. Warning: if indicator is used,
31+
an extra column will be added in if the transform matrix has nan values. This can cause
3232
unexpected changes in dimension in some cases.
3333
3434
Example

category_encoders/helmert.py

Lines changed: 4 additions & 4 deletions
@@ -26,12 +26,12 @@ class HelmertEncoder(BaseEstimator, TransformerMixin):
2626
return_df: bool
2727
boolean for whether to return a pandas DataFrame from transform (otherwise it will be a numpy array).
2828
handle_unknown: str
29-
options are 'error', 'return_nan', 'value', and 'indicator', defaults to 'indicator'. Warning: if indicator is used,
29+
options are 'error', 'return_nan', 'value', and 'indicator'. The default is 'value'. Warning: if indicator is used,
3030
an extra column will be added in if the transform matrix has unknown categories. This can cause
3131
unexpected changes in dimension in some cases.
3232
handle_missing: str
33-
options are 'error', 'return_nan', 'value', and 'indicator', defaults to 'indicator'. Warning: if indicator is used,
34-
an extra column will be added in if the transform matrix has unknown categories. This can cause
33+
options are 'error', 'return_nan', 'value', and 'indicator'. The default is 'value'. Warning: if indicator is used,
34+
an extra column will be added in if the transform matrix has nan values. This can cause
3535
unexpected changes in dimension in some cases.
3636
3737
Example
@@ -264,7 +264,7 @@ def helmert_coding(X_in, mapping):
264264
col = switch.get('col')
265265
mod = switch.get('mapping')
266266

267-
base_df = mod.loc[X[col]]
267+
base_df = mod.reindex(X[col])
268268
base_df.set_index(X.index, inplace=True)
269269
X = pd.concat([base_df, X], axis=1)
270270

category_encoders/one_hot.py

Lines changed: 8 additions & 4 deletions
@@ -23,13 +23,17 @@ class OneHotEncoder(BaseEstimator, TransformerMixin):
2323
boolean for whether or not to drop columns with 0 variance.
2424
return_df: bool
2525
boolean for whether to return a pandas DataFrame from transform (otherwise it will be a numpy array).
26-
handle_unknown: str
27-
options are 'error', 'return_nan' and 'value', defaults to 'value'. Warning: if value is used,
28-
an extra column will be added in if the transform matrix has unknown categories. This can cause
29-
unexpected changes in the dimension in some cases.
3026
use_cat_names: bool
3127
if True, category values will be included in the encoded column names. Since this can result into duplicate column names, duplicates are suffixed with '#' symbol until a unique name is generated.
3228
If False, category indices will be used instead of the category values.
29+
handle_unknown: str
30+
options are 'error', 'return_nan', 'value', and 'indicator'. The default is 'value'. Warning: if indicator is used,
31+
an extra column will be added in if the transform matrix has unknown categories. This can cause
32+
unexpected changes in dimension in some cases.
33+
handle_missing: str
34+
options are 'error', 'return_nan', 'value', and 'indicator'. The default is 'value'. Warning: if indicator is used,
35+
an extra column will be added in if the transform matrix has nan values. This can cause
36+
unexpected changes in dimension in some cases.
3337
3438
Example
3539
-------
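
The dimension warning attached to the `indicator` option can be pictured with a plain-pandas sketch. This is not the library's implementation: the helper `one_hot_with_indicator` and the `_unknown` column suffix are hypothetical names chosen for illustration.

```python
import pandas as pd

def one_hot_with_indicator(train, test, indicator_name="unknown"):
    """One-hot encode `test` with the categories seen in `train`,
    plus one extra indicator column for categories not seen in `train`."""
    known = list(pd.unique(train))
    out = pd.DataFrame(
        {f"{train.name}_{cat}": (test == cat).astype(int) for cat in known},
        index=test.index,
    )
    # The extra column is what changes the output dimension.
    out[f"{train.name}_{indicator_name}"] = (~test.isin(known)).astype(int)
    return out

train = pd.Series(["a", "b", "a"], name="color")
test = pd.Series(["a", "c"], name="color")  # "c" was never seen in train

encoded = one_hot_with_indicator(train, test)
print(encoded)
```

Two known categories produce three output columns here, so a downstream model expecting two features would see a different shape. That is the kind of change the docstring warns about.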

category_encoders/polynomial.py

Lines changed: 5 additions & 5 deletions
@@ -25,12 +25,12 @@ class PolynomialEncoder(BaseEstimator, TransformerMixin):
2525
return_df: bool
2626
boolean for whether to return a pandas DataFrame from transform (otherwise it will be a numpy array).
2727
handle_unknown: str
28-
options are 'error', 'return_nan' and 'value', defaults to 'value'. Warning: if value is used,
28+
options are 'error', 'return_nan', 'value', and 'indicator'. The default is 'value'. Warning: if indicator is used,
2929
an extra column will be added in if the transform matrix has unknown categories. This can cause
30-
unexpected changes in the dimension in some cases.
30+
unexpected changes in dimension in some cases.
3131
handle_missing: str
32-
options are 'error', 'return_nan', 'value', and 'indicator', defaults to 'indicator'. Warning: if indicator is used,
33-
an extra column will be added in if the transform matrix has unknown categories. This can cause
32+
options are 'error', 'return_nan', 'value', and 'indicator'. The default is 'value'. Warning: if indicator is used,
33+
an extra column will be added in if the transform matrix has nan values. This can cause
3434
unexpected changes in dimension in some cases.
3535
3636
Example
@@ -264,7 +264,7 @@ def polynomial_coding(X_in, mapping):
264264
col = switch.get('col')
265265
mod = switch.get('mapping')
266266

267-
base_df = mod.loc[X[col]]
267+
base_df = mod.reindex(X[col])
268268
base_df.set_index(X.index, inplace=True)
269269
X = pd.concat([base_df, X], axis=1)
270270

category_encoders/sum_coding.py

Lines changed: 5 additions & 5 deletions
@@ -25,12 +25,12 @@ class SumEncoder(BaseEstimator, TransformerMixin):
2525
return_df: bool
2626
boolean for whether to return a pandas DataFrame from transform (otherwise it will be a numpy array).
2727
handle_unknown: str
28-
options are 'error', 'return_nan' and 'value', defaults to 'value'. Warning: if value is used,
28+
options are 'error', 'return_nan', 'value', and 'indicator'. The default is 'value'. Warning: if indicator is used,
2929
an extra column will be added in if the transform matrix has unknown categories. This can cause
30-
unexpected changes in the dimension in some cases.
30+
unexpected changes in dimension in some cases.
3131
handle_missing: str
32-
options are 'error', 'return_nan', 'value', and 'indicator', defaults to 'indicator'. Warning: if indicator is used,
33-
an extra column will be added in if the transform matrix has unknown categories. This can cause
32+
options are 'error', 'return_nan', 'value', and 'indicator'. The default is 'value'. Warning: if indicator is used,
33+
an extra column will be added in if the transform matrix has nan values. This can cause
3434
unexpected changes in dimension in some cases.
3535
3636
Example
@@ -265,7 +265,7 @@ def sum_coding(X_in, mapping):
265265
col = switch.get('col')
266266
mod = switch.get('mapping')
267267

268-
base_df = mod.loc[X[col]]
268+
base_df = mod.reindex(X[col])
269269
base_df.set_index(X.index, inplace=True)
270270
X = pd.concat([base_df, X], axis=1)
271271
