feature-engineering-az/categorical-binary.qmd at main · EmilHvitfeldt/feature-engineering-az · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
---
pagetitle: "Feature Engineering A-Z | Binary Encoding"
image: social-cards/categorical-binary.png
---

# Binary Encoding {#sec-categorical-binary}

::: {style="visibility: hidden; height: 0px;"}
## Binary Encoding
:::

Binary encoding encodes each category by encoding it as its binary representation.
From the categorical variables, you assign an integer value to each level, in the same way as in [Label Encoding](categorical-label).
That value will then be converted to its binary representation, and that value will be returned.

Suppose we have the following variable, and the values they take are (cat = 11, dog = 3, horse = 20).
We are using a subset to gain a better understanding of what is happening.

```{r}
#| label: cat-and-dog-vector
#| echo: false
c("dog", "cat", "horse", "dog", "cat") |>
  glue::double_quote() |>
  cat(sep = ", ")
```

The first thing we need to do is to calculate the binary representation of these numbers.
And we should do it to 5 digits since it is the highest we need in this hypothetical example.
11 = 01011, 3 = 00011, 20 = 10100.
We can then encode this in the following matrix

```{r}
#| label: dummy-example
#| echo: false
dummy <- matrix(0L, nrow = 5, ncol = 5)
colnames(dummy) <- c(16, 8, 4, 2, 1)
dummy[1, 5] <- 1L
dummy[1, 4] <- 1L
dummy[2, 5] <- 1L
dummy[2, 4] <- 1L
dummy[2, 2] <- 1L
dummy[3, 1] <- 1L
dummy[3, 3] <- 1L
dummy[4, 5] <- 1L
dummy[4, 4] <- 1L
dummy[5, 5] <- 1L
dummy[5, 4] <- 1L
dummy[5, 2] <- 1L

knitr::kable(dummy)
```

Each we would be able to uniquely encode `2^5=32` different values with just 5 columns compared to the 32 it would take if you used dummy encoding from @sec-categorical-dummy.
In general, you will be able to encode `n` variables in `ceiling(log2(n))` columns.

::: callout-note
This style of encoding is generalized to other bases.
Binary encoding is a base-2 encoder.
You could just as well have a base 3, or base 10 encoding.
We will not cover these methods further than this mentioned as they are similar in function to binary encoding.
:::

This method isn't widely used.
It does a good job of showing the midpoint between dummy encoding and label encoding in terms of how sparse we want to store our data.
Its limitations come in terms of how interpretable the final model ends up being.
Further, if you want to encode your data more compactly than dummy encoding, you will find better luck using some of the later described methods.

::: callout-caution
# TODO

link to actual methods
:::

::: callout-caution
# TODO

talk about grey encoding
:::

## Pros and Cons

### Pros

-   Uses fewer variables to store the same information as dummy encoding

### Cons

-   Less interpretability compared to dummy variables

## R Examples

We will be using the `ames` data set for these examples.
The `step_encoding_binary()` function from the [extrasteps](https://github.com/emilhvitfeldt/extrasteps) package allows us to perform binary encoding.

```{r}
#| label: set-seed
#| echo: false
set.seed(1234)
# To avoid changing recipe ID columns
```

```{r}
#| label: show-data
#| message: false
library(recipes)
library(extrasteps)
library(modeldata)
data("ames")

ames |>
  select(Sale_Price, MS_SubClass, MS_Zoning)
```

We can take a quick look at the possible values `MS_SubClass` takes

```{r}
#| label: count-MS_SubClass
ames |>
  count(MS_SubClass, sort = TRUE)
```

We can then apply binary encoding using `step_encoding_binary()`.
Notice how we only get 1 numeric variable for each categorical variable

```{r}
#| label: step_encoding_binary
dummy_rec <- recipe(Sale_Price ~ ., data = ames) |>
  step_encoding_binary(all_nominal_predictors()) |>
  prep()

dummy_rec |>
  bake(new_data = NULL, starts_with("MS_SubClass"), starts_with("MS_Zoning")) |>
  glimpse()
```

We can pull the number of distinct levels of each variable by using `tidy()`.

```{r}
#| label: tidy
dummy_rec |>
  tidy(1)
```

## Python Examples

```{python}
#| label: python-setup
#| echo: false
import pandas as pd
from sklearn import set_config

set_config(transform_output="pandas")
pd.set_option('display.precision', 3)
```

We are using the `ames` data set for examples.
{category_encoders} provided the `BinaryEncoder()` method we can use.

```{python}
#| label: binaryencoder
from feazdata import ames
from sklearn.compose import ColumnTransformer
from category_encoders.binary import BinaryEncoder

ct = ColumnTransformer(
    [('binary', BinaryEncoder(), ['MS_Zoning'])],
    remainder="passthrough")

ct.fit(ames)
ct.transform(ames).filter(regex="binary.*")
```