---
title: "README"
author: "Me"
date: "April 30, 2017"
output: md_document
---
[Travis CI build status](https://travis-ci.org/Zelazny7/onehot)
```{r, echo=FALSE}
options(width=250)
```
## Onehot package
### Installation
```{r, eval=FALSE}
devtools::install_github("Zelazny7/onehot")
```
### Usage
```{r}
set.seed(100)
test <- data.frame(
  factor = factor(sample(c(NA, letters[1:3]), 100, replace = TRUE)),
  integer = as.integer(runif(100) * 10),
  real = rnorm(100),
  logical = sample(c(TRUE, FALSE), 100, replace = TRUE),
  character = sample(letters, 100, replace = TRUE),
  stringsAsFactors = FALSE)
head(test)
```
### Create a onehot object
A onehot object stores information about the data.frame, which is used to
transform a data.frame into a onehot encoded matrix. Save the object so that
future datasets can be transformed into exactly the same layout.
```{r}
library(onehot)
encoder <- onehot(test)
## print a summary
encoder
```
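To illustrate why the saved object matters, here is a minimal base R sketch of the underlying idea. It is not the package's implementation: the helper `encode_levels` and the `color_` column naming are hypothetical, and only the factor levels are persisted. Recording the layout once and reusing it keeps the columns identical, even when new data contains unseen values.

```r
# Hypothetical helper (not part of onehot): encode a vector against a
# fixed set of levels so the output columns never change.
encode_levels <- function(x, lvls, prefix) {
  m <- outer(as.character(x), lvls, "==")  # logical matrix, one column per level
  m[is.na(m)] <- FALSE                     # NA inputs match no level
  storage.mode(m) <- "integer"
  colnames(m) <- paste(prefix, lvls, sep = "_")
  m
}

train <- data.frame(color = factor(c("red", "green", "blue", "green")))
new   <- data.frame(color = c("green", "purple"), stringsAsFactors = FALSE)

# Record the layout once from the training data and persist it.
path <- tempfile(fileext = ".rds")
saveRDS(levels(train$color), path)

# Later: reload the layout and apply it to new data.
lvls <- readRDS(path)
train_mat <- encode_levels(train$color, lvls, "color")
new_mat   <- encode_levels(new$color, lvls, "color")

# Both matrices share the same columns; the unseen value "purple"
# becomes a row of zeros instead of a new column.
identical(colnames(train_mat), colnames(new_mat))
new_mat
```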
### Transforming data.frames
The onehot object has a predict method that transforms a data.frame. Factors
are onehot encoded and character variables are skipped. However, calling
predict with `stringsAsFactors=TRUE` converts character vectors to factors
first.
```{r}
train_data <- predict(encoder, test)
head(train_data)
```
### NA indicator columns
`add_NA_factors=TRUE` (the default) creates an NA indicator column for every
factor column. If NA is already a factor level, an indicator column is created
even without this option.
```{r}
encoder <- onehot(test, add_NA_factors=TRUE)
train_data <- predict(encoder, test)
head(train_data)
```
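The two routes to an NA indicator can be sketched in base R. This only illustrates the concept and assumes nothing about how onehot builds its columns; the variable names are made up for the example.

```r
x <- factor(c("a", NA, "b", "a"))

# Route 1: a separate indicator column marking missing entries.
na_indicator <- as.integer(is.na(x))
na_indicator

# Route 2: make NA an explicit factor level with base R's addNA().
# Encoding every level then yields an NA column without a separate option.
x_na_level <- addNA(x)
levels(x)           # NA is not a level here
levels(x_na_level)  # NA now appears as the last level
```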
### Sentinel values for numeric columns
The `sentinel=VALUE` argument replaces all numeric NAs with the provided value.
Some ML algorithms, such as `randomForest` and `xgboost`, do not handle NA
values. With a sentinel value, however, such algorithms can usually separate
the missing observations given enough decision-tree splits. The default value
is `-999`.
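As a rough base R illustration of what sentinel replacement means, consider the sketch below. The helper `replace_na_with_sentinel` is hypothetical and is not a function from the package.

```r
# Hypothetical helper: swap missing numeric values for a sentinel.
replace_na_with_sentinel <- function(x, sentinel = -999) {
  x[is.na(x)] <- sentinel
  x
}

real <- c(0.5, NA, -1.2, NA)
filled <- replace_na_with_sentinel(real)
filled

# A tree split such as `filled < -100` now isolates the formerly
# missing observations from the real values.
filled < -100
```

The sentinel should lie well outside the observed range of the column. Otherwise a split cannot cleanly separate missing values from genuine ones.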
### Sparse Matrices
`onehot` can also return sparse, column-compressed matrices from the `Matrix`
package:
```{r}
encoder <- onehot(test)
train_data <- predict(encoder, test, sparse=TRUE)
head(train_data)
```
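For intuition about why a sparse representation suits onehot data, the sketch below builds a column-compressed indicator matrix by hand with `Matrix::sparseMatrix`. It assumes the `Matrix` package is installed and is not how the package itself constructs its output.

```r
library(Matrix)

x <- factor(c("a", "c", "b", "a", "c"))

# One nonzero per row: row i has a 1 in the column of its level.
sp <- sparseMatrix(
  i = seq_along(x),
  j = as.integer(x),
  x = 1,
  dims = c(length(x), nlevels(x)),
  dimnames = list(NULL, levels(x))
)

class(sp)      # column-compressed storage class
as.matrix(sp)  # dense view, for inspection only
```

Only the positions of the ones are stored, so memory grows with the number of rows rather than with rows times levels.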