All of these are fully compatible sklearn transformers, so they can be used in pipelines or in your existing scripts. Supported input formats include numpy arrays and pandas dataframes. If the cols parameter isn't passed, all columns with object or pandas categorical data type will be encoded. Please see the docs for transformer-specific configuration options.
-Examples
---------
+To install the development version, you may use:

-from category_encoders import *
-import pandas as pd
-from sklearn.datasets import load_boston
-
-# prepare some data
-bunch = load_boston()
-y = bunch.target
-X = pd.DataFrame(bunch.data, columns=bunch.feature_names)
All of the encoders are fully compatible sklearn transformers, so they can be used in pipelines or in your existing scripts. Supported input formats include numpy arrays and pandas dataframes. If the cols parameter isn't passed, all columns with object or pandas categorical data type will be encoded. Please see the docs for transformer-specific configuration options.
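The default column selection described above can be approximated with plain pandas. This is only a sketch of the rule as stated (object or categorical dtype), not the library's own code; the helper name `default_encoding_columns` and the sample frame are made up for illustration.

```python
import pandas as pd


def default_encoding_columns(frame):
    """Return the columns whose dtype is object or pandas categorical.

    Hypothetical helper: it mirrors the stated rule for what gets encoded
    when no `cols` argument is given, using pandas' select_dtypes.
    """
    return list(frame.select_dtypes(include=["object", "category"]).columns)


# Small illustrative frame: one object column, one categorical, one numeric.
X = pd.DataFrame(
    {
        "color": pd.Series(["red", "blue", "red"], dtype=object),
        "size": pd.Series(["S", "M", "L"]).astype("category"),
        "price": [1.5, 2.0, 3.25],
    }
)

cols = default_encoding_columns(X)
```

Here `cols` holds `color` and `size`, while the numeric `price` column is left alone. Integer-coded categoricals (such as `CHAS` and `RAD` in the README example) are not picked up by this rule, which is why the example passes `cols` explicitly.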

-In the examples directory, there is an example script used to benchmark
-different encoding techniques on various datasets.
+Examples
+--------
+There are two types of encoders: unsupervised and supervised. An unsupervised example:
+```python
+from category_encoders import*
+import pandas as pd
+from sklearn.datasets import load_boston
+
+# prepare some data
+bunch = load_boston()
+y = bunch.target
+X = pd.DataFrame(bunch.data, columns=bunch.feature_names)
+
+# use binary encoding to encode two categorical features
+enc = BinaryEncoder(cols=['CHAS', 'RAD']).fit(X)
+
+# transform the dataset
+numeric_dataset = enc.transform(X)
+```

-The datasets used in the examples are car, mushroom, and splice datasets
Additional examples and benchmarks can be found in the `examples` directory.
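As a rough illustration of what binary encoding does to a column such as `RAD`, here is a minimal pure-Python sketch. It is not the library's implementation: the helper names `fit_binary_encoding` and `transform_binary_encoding` are hypothetical, and numbering categories from 1 in first-seen order and mapping unseen values to all zeros are assumptions made for the example.

```python
def fit_binary_encoding(values):
    """Assign each distinct category an ordinal, starting at 1, in first-seen order."""
    ordinals = {}
    for value in values:
        if value not in ordinals:
            ordinals[value] = len(ordinals) + 1
    # Number of bit columns needed to write the largest ordinal in binary.
    n_bits = max(len(ordinals).bit_length(), 1)
    return ordinals, n_bits


def transform_binary_encoding(values, ordinals, n_bits):
    """Turn each value into a list of bits, most significant bit first."""
    rows = []
    for value in values:
        code = ordinals.get(value, 0)  # assumption: unseen categories become all zeros
        rows.append([(code >> shift) & 1 for shift in range(n_bits - 1, -1, -1)])
    return rows


# Illustrative values only; not taken from any real dataset.
rad = [1, 2, 3, 4, 5, 24, 2, 1]
ordinals, n_bits = fit_binary_encoding(rad)
encoded = transform_binary_encoding(rad, ordinals, n_bits)
```

Six distinct categories fit in three bit columns, whereas one-hot encoding would need six. That compactness is the usual reason to choose binary encoding for higher-cardinality features.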

 Contributing
 ------------

 Category encoders is under active development, if you'd like to be involved, we'd love to have you. Check out the CONTRIBUTING.md file
 or open an issue on the github project to get started.

-License
--------
-
-BSD 3-Clause
-
 References:
 -----------

 1. Kilian Weinberger; Anirban Dasgupta; John Langford; Alex Smola; Josh Attenberg (2009). Feature Hashing for Large Scale Multitask Learning. Proc. ICML.
-2. Contrast Coding Systems for categorical variables. UCLA: Statistical Consulting Group. from https://stats.idre.ucla.edu/r/library/r-library-contrast-coding-systems-for-categorical-variables/.
-3. Gregory Carey (2003). Coding Categorical Variables. from http://psych.colorado.edu/~carey/Courses/PSYC5741/handouts/Coding%20Categorical%20Variables%202006-03-03.pdf
-4. Strategies to encode categorical variables with many categories. from https://www.kaggle.com/c/caterpillar-tube-pricing/discussion/15748#143154.
-5. Beyond One-Hot: an exploration of categorical variables. from http://www.willmcginnis.com/2015/11/29/beyond-one-hot-an-exploration-of-categorical-variables/
-6. BaseN Encoding and Grid Search in categorical variables. from http://www.willmcginnis.com/2016/12/18/basen-encoding-grid-search-category_encoders/
-7. Daniele Miccii-Barreca (2001). A Preprocessing Scheme for High-Cardinality Categorical Attributes in Classification and Prediction Problems. SIGKDD Explor. Newsl. 3, 1. from http://dx.doi.org/10.1145/507533.507538
-8. Weight of Evidence (WOE) and Information Value Explained. from https://www.listendata.com/2015/03/weight-of-evidence-woe-and-information.html
+2. Contrast Coding Systems for categorical variables. UCLA: Statistical Consulting Group. From https://stats.idre.ucla.edu/r/library/r-library-contrast-coding-systems-for-categorical-variables/.
+3. Gregory Carey (2003). Coding Categorical Variables. From http://psych.colorado.edu/~carey/Courses/PSYC5741/handouts/Coding%20Categorical%20Variables%202006-03-03.pdf
+4. Strategies to encode categorical variables with many categories. From https://www.kaggle.com/c/caterpillar-tube-pricing/discussion/15748#143154.
+5. Beyond One-Hot: an exploration of categorical variables. From http://www.willmcginnis.com/2015/11/29/beyond-one-hot-an-exploration-of-categorical-variables/
+6. BaseN Encoding and Grid Search in categorical variables. From http://www.willmcginnis.com/2016/12/18/basen-encoding-grid-search-category_encoders/
+7. Daniele Miccii-Barreca (2001). A Preprocessing Scheme for High-Cardinality Categorical Attributes in Classification and Prediction Problems. SIGKDD Explor. Newsl. 3, 1. From http://dx.doi.org/10.1145/507533.507538
+8. Weight of Evidence (WOE) and Information Value Explained. From https://www.listendata.com/2015/03/weight-of-evidence-woe-and-information.html
+9. Empirical Bayes for multiple sample sizes. From http://chris-said.io/2017/05/03/empirical-bayes-for-multiple-sample-sizes/
category_encoders/one_hot.py: 8 additions & 4 deletions
@@ -23,13 +23,17 @@ class OneHotEncoder(BaseEstimator, TransformerMixin):
         boolean for whether or not to drop columns with 0 variance.
     return_df: bool
         boolean for whether to return a pandas DataFrame from transform (otherwise it will be a numpy array).
-    handle_unknown: str
-        options are 'error', 'return_nan' and 'value', defaults to 'value'. Warning: if value is used,
-        an extra column will be added in if the transform matrix has unknown categories. This can cause
-        unexpected changes in the dimension in some cases.
     use_cat_names: bool
         if True, category values will be included in the encoded column names. Since this can result into duplicate column names, duplicates are suffixed with '#' symbol until a unique name is generated.
         If False, category indices will be used instead of the category values.
+    handle_unknown: str
+        options are 'error', 'return_nan', 'value', and 'indicator'. The default is 'value'. Warning: if indicator is used,
+        an extra column will be added in if the transform matrix has unknown categories. This can cause
+        unexpected changes in dimension in some cases.
+    handle_missing: str
+        options are 'error', 'return_nan', 'value', and 'indicator'. The default is 'value'. Warning: if indicator is used,
+        an extra column will be added in if the transform matrix has nan values. This can cause
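To make the four `handle_unknown` options in the docstring above more concrete, here is a small pure-Python sketch of how they could differ for a single column. It is a guess at the semantics based only on the option names and the docstring's warning, not the code in `one_hot.py`: the function `one_hot_rows` is hypothetical, and for simplicity the sketch always reserves the indicator column when `'indicator'` is selected.

```python
import math


def one_hot_rows(train_values, new_values, handle_unknown="value"):
    """One-hot encode new_values against the categories seen in train_values.

    Hypothetical sketch of the four handle_unknown modes:
      'error'      -> raise on a category not seen during fitting
      'return_nan' -> emit a row of NaN for an unseen category
      'value'      -> emit an all-zero row for an unseen category
      'indicator'  -> add one extra column that flags unseen categories
    """
    if handle_unknown not in ("error", "return_nan", "value", "indicator"):
        raise ValueError(f"unsupported handle_unknown option: {handle_unknown!r}")

    categories = list(dict.fromkeys(train_values))  # distinct, first-seen order
    index = {category: i for i, category in enumerate(categories)}
    width = len(categories) + (1 if handle_unknown == "indicator" else 0)

    rows = []
    for value in new_values:
        if value in index:
            row = [0.0] * width
            row[index[value]] = 1.0
        elif handle_unknown == "error":
            raise ValueError(f"unknown category: {value!r}")
        elif handle_unknown == "return_nan":
            row = [math.nan] * width
        else:
            row = [0.0] * width
            if handle_unknown == "indicator":
                row[-1] = 1.0  # flag the unseen category in the extra column
        rows.append(row)
    return rows


value_rows = one_hot_rows(["a", "b"], ["b", "z"], handle_unknown="value")
indicator_rows = one_hot_rows(["a", "b"], ["b", "z"], handle_unknown="indicator")
nan_rows = one_hot_rows(["a", "b"], ["b", "z"], handle_unknown="return_nan")
```

The `'indicator'` output is one column wider than the `'value'` output, which is the dimension change the docstring warns about. `handle_missing` is described with the same four options, so a similar sketch would apply to NaN inputs.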