    return hcat([create_helmert_vector(i, k) for i in 1:k-1]...)
end

"""
**Private method.**

Fit a contrast encoding scheme on given data in `X`.

# Arguments

- `X`: A table where the elements of the categorical features have [scitypes](https://juliaai.github.io/ScientificTypes.jl/dev/) `Multiclass` or `OrderedFactor`
- `features=[]`: Names of the categorical features, given as symbols, to exclude from or include in encoding (see `ignore`)
- `mode=:dummy`: The type of encoding to use. Can be one of `:contrast`, `:dummy`, `:sum`, `:backward_diff`, `:forward_diff`, `:helmert` or `:hypothesis`.
  If `ignore=false` (features to be encoded are listed explicitly in `features`), then this can be a vector of the same length as `features` to specify a different
  contrast encoding scheme for each feature
- `buildmatrix=nothing`: A function or other callable with signature `buildmatrix(colname, k)`,
  where `colname` is the name of the feature and `k` is the number of its levels, and which returns a contrast or
  hypothesis matrix whose row/column ordering is consistent with the ordering of `levels(col)`. Only relevant if `mode` is `:contrast` or `:hypothesis`.
- `ignore=true`: Whether to exclude (`true`) or include (`false`) the features given in `features`
- `ordered_factor=false`: Whether to encode `OrderedFactor` features or ignore them
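
As an illustration of the `buildmatrix(colname, k)` signature for `mode=:hypothesis`, the sketch below builds a backward-difference hypothesis matrix of size `(k-1, k)`. The function name `backward_diff_hypothesis` and the feature name `:grade` are hypothetical. The final step, taking a generalized inverse to obtain a `(k, k-1)` contrast matrix, is one standard way to relate the two matrix types and is only an assumption about what happens internally.

```julia
using LinearAlgebra

# Hypothetical `buildmatrix` for `mode=:hypothesis`: row i encodes the
# hypothesis "level i+1 minus level i", giving a (k-1) × k matrix.
# `colname` is ignored here; a real callable could branch on it.
function backward_diff_hypothesis(colname, k)
    H = zeros(Float64, k - 1, k)
    for i in 1:k-1
        H[i, i] = -1.0
        H[i, i+1] = 1.0
    end
    return H
end

H = backward_diff_hypothesis(:grade, 4)   # hypothesis matrix, size (3, 4)

# Assumption: a contrast matrix can be recovered as the generalized inverse.
C = pinv(H)                               # contrast matrix, size (4, 3)
```

The two sizes match the shapes the encoder's error messages ask for: `(k-1, k)` for a hypothesis matrix and `(k, k-1)` for a contrast matrix.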

# Returns (in a dict)

- `vec_given_feat_level`: Maps each level for each column in the selected categorical features to a vector
- `encoded_features`: The subset of the categorical features of X that were encoded
Use a fitted contrast encoder to encode the levels of selected categorical variables with contrast encoding.

# Arguments

- `X`: A table where the elements of the categorical features have [scitypes](https://juliaai.github.io/ScientificTypes.jl/dev/) `Multiclass` or `OrderedFactor`
- `cache`: The output of `contrast_encoder_fit`

# Returns

- `X_tr`: The table with the selected features replaced by their contrast-encoded columns.
MATRIX_SIZE_ERROR(k, matrix_size, feat_name) = "In ContrastEncoder, a categorical variable with $k levels should have a contrast matrix of size ($k, $k-1). However, the contrast matrix returned by `buildmatrix` is $matrix_size for feature $feat_name."
MATRIX_SIZE_ERROR_HYP(k, matrix_size, feat_name) = "In ContrastEncoder, a categorical variable with $k levels should have a hypothesis matrix of size ($k-1, $k). However, the given hypothesis matrix returned by `buildmatrix` is $matrix_size for feature $feat_name."
IGNORE_MUST_FALSE_VEC_MODE = "In ContrastEncoder with mode given as a vector of symbols, the ignore argument must be set to false and features must be explicitly specified in features."
BUILDFUNC_MUST_BE_SPECIFIED = "In ContrastEncoder with mode=:contrast or mode=:hypothesis, the `buildmatrix` argument must be specified."
LENGTH_MISMATCH_VEC_MODE(len_mode, len_feat) = "In ContrastEncoder with mode given as a vector of symbols, the length of the features argument must match the number of specified modes. However, the method received $(len_mode) modes and $(len_feat) features."
`ContrastEncoder` implements the following contrast encoding methods for
categorical features: dummy, sum, backward/forward difference, and Helmert coding.
More generally, users can specify a custom contrast or hypothesis matrix, and each feature
can be encoded using a different method.

# Training data

In MLJ (or MLJBase) bind an unsupervised model instance `model` to data with

    mach = machine(model, X)

Here:

- `X` is any table of input features (e.g., a `DataFrame`). Features to be transformed must
  have element scitype `Multiclass` or `OrderedFactor`. Use `schema(X)` to
  check scitypes.

Train the machine using `fit!(mach, rows=...)`.

# Hyper-parameters

- `features=[]`: Names of the categorical features, given as symbols, to exclude from or include in encoding (see `ignore`)
- `mode=:dummy`: The type of encoding to use. Can be one of `:contrast`, `:dummy`, `:sum`, `:backward_diff`, `:forward_diff`, `:helmert` or `:hypothesis`.
  If `ignore=false` (features to be encoded are listed explicitly in `features`), then this can be a vector of the same length as `features` to specify a different
  contrast encoding scheme for each feature
- `buildmatrix=nothing`: A function or other callable with signature `buildmatrix(colname, k)`,
  where `colname` is the name of the feature and `k` is the number of its levels, and which returns a contrast or
  hypothesis matrix whose row/column ordering is consistent with the ordering of `levels(col)`. Only relevant if `mode` is `:contrast` or `:hypothesis`.
- `ignore=true`: Whether to exclude (`true`) or include (`false`) the features given in `features`
- `ordered_factor=false`: Whether to encode `OrderedFactor` features or ignore them
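
A minimal sketch of a callable matching the `buildmatrix(colname, k)` signature for `mode=:contrast` is given below. The name `example_buildmatrix` and the feature name `:grade` are hypothetical, and the particular coding choices are for illustration only; the one requirement taken from this docstring is that the returned contrast matrix has size `(k, k-1)`.

```julia
# Hypothetical `buildmatrix` for `mode=:contrast`.
# Returns a k × (k-1) matrix whose rows follow the ordering of the levels.
function example_buildmatrix(colname, k)
    C = zeros(Float64, k, k - 1)
    for j in 1:k-1
        C[j + 1, j] = 1.0        # treatment-style: first level is the reference
    end
    if colname == :grade
        C[1, :] .= -1.0          # sum-style for this feature: columns sum to zero
    end
    return C
end

example_buildmatrix(:name, 3)    # 3 × 2, first row all zeros
example_buildmatrix(:grade, 4)   # 4 × 3, first row all -1
```

Such a function would presumably be supplied as `buildmatrix=example_buildmatrix` together with `mode=:contrast`. Branching on `colname` is one way to give different features different matrices from a single callable.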

# Operations

- `transform(mach, Xnew)`: Apply contrast encoding to the `Multiclass` or `OrderedFactor` features of `Xnew` selected by the hyper-parameters, and
  return the new table. Features that are neither `Multiclass` nor `OrderedFactor`
  are always left unchanged.
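
To make the effect of `transform` concrete, the following plain-Julia sketch shows the core idea of contrast encoding: each level of a feature is replaced by the corresponding row of a `(k, k-1)` contrast matrix. It does not call the package; the matrix `C` and the variable names are assumptions chosen for illustration, and how the output columns are named in the returned table is not shown.

```julia
# Illustrative only: encode a categorical column by row lookup in a contrast matrix.
lvls = ["Ben", "John", "Mary"]            # level ordering, as `levels(col)` would give
C = [0.0 0.0; 1.0 0.0; 0.0 1.0]           # assumed 3 × 2 treatment-style contrast matrix

# Map each level to its row of the contrast matrix.
row_given_level = Dict(l => C[i, :] for (i, l) in enumerate(lvls))

col = ["Ben", "John", "Mary", "John"]

# Stack the looked-up rows: one row per observation, k-1 columns.
encoded = permutedims(reduce(hcat, [row_given_level[v] for v in col]))
```

A single feature with three levels thus becomes two numeric columns, and repeated levels (here `"John"`) receive identical rows.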

# Fitted parameters

The fields of `fitted_params(mach)` are:

- `vector_given_value_given_feature`: A dictionary that maps each level of each encoded categorical feature of `X` to its contrast vector.

# Report

The fields of `report(mach)` are:

- `encoded_features`: The subset of the categorical features of `X` that were encoded

# Examples

```julia
using MLJ

# Define categorical dataset
X = (
    name = categorical(["Ben", "John", "Mary", "John"]),