    return hcat([create_helmert_vector(i, k) for i in 1:k-1]...)
end

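For orientation, the `hcat` pattern above can be sketched in isolation. The diff does not show the body of `create_helmert_vector`, so `helmert_column` below is a hypothetical stand-in that follows the textbook Helmert layout (the first `i` levels get weight -1, level `i + 1` gets weight `i`). It is an assumption, not the package's own code.

```julia
# Hypothetical stand-in for `create_helmert_vector` (its body is not shown in the diff).
# Builds column `i` of a textbook Helmert contrast matrix for `k` levels.
function helmert_column(i::Integer, k::Integer)
    1 <= i < k || throw(ArgumentError("need 1 <= i < k"))
    col = zeros(Float64, k)
    col[1:i] .= -1.0   # levels 1..i receive weight -1
    col[i + 1] = i     # level i+1 receives weight i, so the column sums to zero
    return col
end

# Concatenate the k-1 columns horizontally into a k-by-(k-1) matrix,
# mirroring the `hcat([... for i in 1:k-1]...)` call in the diff.
helmert_matrix(k::Integer) = hcat([helmert_column(i, k) for i in 1:k-1]...)

H = helmert_matrix(4)
```

Under these assumptions each column of `H` sums to zero, and `H` has one row per level and one column per contrast.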
"""
**Private method.**

Fit a contrast encoding scheme on given data in `X`.

# Arguments

- `X`: A table where the elements of the categorical features have [scitypes](https://juliaai.github.io/ScientificTypes.jl/dev/) `Multiclass` or `OrderedFactor`
- `features=[]`: A list of names of categorical features, given as symbols, to exclude from or include in encoding
- `mode=:dummy`: The type of encoding to use. Can be one of `:contrast`, `:dummy`, `:sum`, `:backward_diff`, `:forward_diff`, `:helmert` or `:hypothesis`.
  If `ignore=false` (features to be encoded are listed explicitly in `features`), then this can be a vector of the same length as `features` to specify a different
  contrast encoding scheme for each feature
- `buildmatrix=nothing`: A function that takes a vector of levels and the number of levels as input and returns a contrast or hypothesis matrix.
  Only relevant if `mode` is `:contrast` or `:hypothesis`.
- `ignore=true`: Whether to exclude (`true`) or include (`false`) the features given in `features`
- `ordered_factor=false`: Whether to encode `OrderedFactor` features or ignore them

# Returns (in a dict)

- `vec_given_feat_level`: Maps each level for each column in the selected categorical features to a vector
- `encoded_features`: The subset of the categorical features of X that were encoded

Use a fitted contrast encoder to encode the levels of selected categorical variables with contrast encoding.

# Arguments

- `X`: A table where the elements of the categorical features have [scitypes](https://juliaai.github.io/ScientificTypes.jl/dev/) `Multiclass` or `OrderedFactor`
- `cache`: The output of `contrast_encoder_fit`

# Returns

- `X_tr`: The table after the selected features have been encoded by contrast encoding.

MATRIX_SIZE_ERROR(k, matrix_size, feat_name) = "In ContrastEncoder, a categorical variable with $k levels should have a contrast matrix of size ($k, $(k-1)). However, the contrast matrix returned by `buildmatrix` has size $matrix_size for feature $feat_name."
MATRIX_SIZE_ERROR_HYP(k, matrix_size, feat_name) = "In ContrastEncoder, a categorical variable with $k levels should have a hypothesis matrix of size ($(k-1), $k). However, the hypothesis matrix returned by `buildmatrix` has size $matrix_size for feature $feat_name."
IGNORE_MUST_FALSE_VEC_MODE = "In ContrastEncoder with mode given as a vector of symbols, the ignore argument must be set to false and features must be explicitly specified in features."
BUILDFUNC_MUST_BE_SPECIFIED = "In ContrastEncoder with mode=:contrast or mode=:hypothesis, the `buildmatrix` argument must be specified."
LENGTH_MISMATCH_VEC_MODE(len_mode, len_feat) = "In ContrastEncoder with mode given as a vector of symbols, the length of the features argument must match the number of specified modes. However, the method received $(len_mode) modes and $(len_feat) features."
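
The size constraints stated in these messages (a contrast matrix is `k` by `k-1`, a hypothesis matrix is `k-1` by `k`) can be illustrated with a small check. The function name `check_matrix_size`, its `hypothesis` keyword, and the message wording are assumptions made for this sketch; the diff does not show how the encoder actually performs the check.

```julia
# Hypothetical validation helper: the name, keyword and error text are
# illustrative only and are not taken from the package.
function check_matrix_size(M::AbstractMatrix, k::Integer, feat_name; hypothesis::Bool = false)
    # Contrast matrices are k-by-(k-1); hypothesis matrices are (k-1)-by-k.
    expected = hypothesis ? (k - 1, k) : (k, k - 1)
    if size(M) != expected
        kind = hypothesis ? "hypothesis" : "contrast"
        throw(ArgumentError(
            "Feature $feat_name with $k levels needs a $kind matrix of size $expected, got $(size(M))."
        ))
    end
    return M
end

# A 3-level feature accepts a 3-by-2 contrast matrix or a 2-by-3 hypothesis matrix.
ok_contrast = check_matrix_size(zeros(3, 2), 3, :name)
ok_hypothesis = check_matrix_size(zeros(2, 3), 3, :name; hypothesis = true)
```

Passing a square `3`-by-`3` matrix for a 3-level feature would throw an `ArgumentError` in this sketch, since neither expected shape is square.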

`ContrastEncoder` implements various contrast encoding methods, including dummy, sum, backward/forward difference, and Helmert coding, and
supports more generic coding methods by specifying a function that returns a contrast or hypothesis matrix.

# Training data

In MLJ (or MLJBase) bind an unsupervised `model` instance to data with

    mach = machine(model, X)

Here:

- `X` is any table of input features (e.g., a `DataFrame`). Features to be transformed must
  have element scitype `Multiclass` or `OrderedFactor`. Use `schema(X)` to
  check scitypes.

Train the machine using `fit!(mach, rows=...)`.

# Hyper-parameters

- `features=[]`: A list of names of categorical features, given as symbols, to exclude from or include in encoding
- `mode=:dummy`: The type of encoding to use. Can be one of `:contrast`, `:dummy`, `:sum`, `:backward_diff`, `:forward_diff`, `:helmert` or `:hypothesis`.
  If `ignore=false` (features to be encoded are listed explicitly in `features`), then this can be a vector of the same length as `features` to specify a different
  contrast encoding scheme for each feature
- `buildmatrix=nothing`: A function that takes a vector of levels and the number of levels as input and returns a contrast or hypothesis matrix.
  Only relevant if `mode` is `:contrast` or `:hypothesis`.
- `ignore=true`: Whether to exclude (`true`) or include (`false`) the features given in `features`
- `ordered_factor=false`: Whether to encode `OrderedFactor` features or ignore them

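As a rough illustration of the `buildmatrix` hyper-parameter, the sketch below defines a function with the documented shape of input (a vector of levels and the number of levels) that returns a `k`-by-`(k-1)` contrast matrix. The name `sum_contrast` and the choice of the last level as the reference are assumptions for this example; the built-in `:sum` mode may order things differently.

```julia
# Hypothetical `buildmatrix` candidate: deviation (sum) coding written by hand.
# `levels` is accepted to match the documented signature, but only `k` is used here.
function sum_contrast(levels, k)
    M = zeros(Float64, k, k - 1)
    for j in 1:k-1
        M[j, j] = 1.0      # level j is indicated by column j
    end
    M[k, :] .= -1.0        # the last level (assumed reference) is -1 in every column
    return M
end

levels = ["Ben", "John", "Mary"]
C = sum_contrast(levels, length(levels))
```

Given the descriptions above, a function like this would presumably be supplied as `buildmatrix=sum_contrast` together with `mode=:contrast`; the diff itself does not show such a call.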
# Operations

- `transform(mach, Xnew)`: Apply contrast encoding to the selected `Multiclass` or `OrderedFactor` features of `Xnew` specified by the hyper-parameters, and
  return the new table. Features that are neither `Multiclass` nor `OrderedFactor`
  are always left unchanged.

# Fitted parameters

The fields of `fitted_params(mach)` are:

- `vec_given_feat_val`: A dictionary that maps each level for each column in a subset of the categorical features of X to its contrast-encoding vector.

# Report

The fields of `report(mach)` are:

- `encoded_features`: The subset of the categorical features of X that were encoded

# Examples

```julia
using MLJ

# Define categorical dataset
X = (name = categorical(["Ben", "John", "Mary", "John"]),