|
1 | 1 | # generic functions go here; such functions can be used across multiple methods |
2 | 2 |
|
3 | 3 | """ |
4 | | -**Private method.** |
| 4 | +```julia |
| 5 | +generic_fit(X, |
| 6 | + features = Symbol[], |
| 7 | + args...; |
| 8 | + ignore::Bool = true, |
| 9 | + ordered_factor::Bool = false, |
| 10 | + feature_mapper, |
| 11 | + kwargs..., |
| 12 | +) |
| 13 | +``` |
| 14 | +
|
| 15 | +Given a `feature_mapper` (see definition below), this method applies |
| 16 | + `feature_mapper` across a specified subset of categorical columns in `X` and returns a dictionary |
| 17 | + whose keys are the feature names and whose values are the corresponding |
| 18 | + level‑to‑value mappings produced by `feature_mapper`. |
| 19 | +
|
| 20 | +In essence, it spares the effort of looping over each column and applying `feature_mapper` manually, and it handles the feature selection logic. |
5 | 21 |
|
6 | | -A generic function to fit a class of transformers where its convenient to define a single `feature_mapper` function that |
7 | | -takes the column as a vector and potentially other arguments (as passed in ...args and ...kwargs) and returns |
8 | | -a dictionary that maps each level of the categorical feature to a scalar or vector |
9 | | -according to the transformation logic. In other words, the `feature_mapper` simply answers the question "For level n of |
10 | | -the current categorical feature c, what should the new value or vector (multiple features) be as defined by the transformation |
11 | | -logic?" |
12 | 22 |
|
13 | 23 | # Arguments |
14 | 24 |
|
15 | | - $X_doc |
16 | | - $features_doc |
17 | | - $ignore_doc |
18 | | - $ordered_factor_doc |
19 | | - - feature_mapper: Defined above. |
| 25 | +$X_doc |
| 26 | +$features_doc |
| 27 | +$ignore_doc |
| 28 | +$ordered_factor_doc |
| 29 | +- `feature_mapper`: a function that, for a given vector (e.g., a categorical column from the dataset `X`), |
| 30 | + produces a mapping from each category level in this vector to a scalar or vector according to the specified transformation logic. |
20 | 31 |
|
21 | | -# Returns |
| 32 | +# Note |
| 33 | +
|
| 34 | +- Any additional arguments (positional or keyword) provided to this function are passed on to `feature_mapper`, which |
| 35 | + is helpful when `feature_mapper` requires extra inputs to compute the mapping (e.g., hyperparameters). |
22 | 36 |
|
23 | | - - mapping_per_feat_level: Maps each level for each feature in a subset of the categorical features of |
24 | | - X into a scalar or a vector. |
25 | | - $encoded_features_doc |
| 37 | +# Returns |
| 38 | +- `mapping_per_feat_level`: Maps each level for each feature in a subset of the categorical features of |
| 39 | + X into a scalar or a vector. |
| 40 | +$encoded_features_doc |
26 | 41 | """ |
27 | 42 | function generic_fit(X, |
28 | 43 | features = Symbol[], |
@@ -116,25 +131,47 @@ end |
116 | 131 |
|
117 | 132 |
|
118 | 133 | """ |
119 | | -**Private method.** |
| 134 | +```julia |
| 135 | +generic_transform( |
| 136 | + X, |
| 137 | + mapping_per_feat_level; |
| 138 | + single_feat::Bool = true, |
| 139 | + ignore_unknown::Bool = false, |
| 140 | + use_levelnames::Bool = false, |
| 141 | + custom_levels = nothing, |
| 142 | + ensure_categorical::Bool = false, |
| 143 | +) |
| 144 | +``` |
| 145 | +
|
| 146 | +
|
| 147 | +Apply a per‐level feature mapping to selected categorical columns in `X`, returning a new table of the same type. |
| 148 | +
|
| 149 | +# Arguments |
| 150 | +
|
| 151 | +$X_doc |
| 152 | +- `mapping_per_feat_level::Dict{Symbol,Dict}`: |
| 153 | + A dict whose keys are feature names (`Symbol`) and values are themselves dictionaries |
| 154 | + mapping each observed level to either a scalar (if `single_feat=true`) or a fixed‐length vector |
| 155 | + (if `single_feat=false`). Only columns whose names appear in `mapping_per_feat_level` are |
| 156 | + transformed; others pass through unchanged. |
| 157 | +- `single_feat::Bool=true`: |
| 158 | + If `true`, each input level is mapped to a single scalar feature; if `false`, |
| 159 | + each input level is mapped to a length‑`k` vector, producing `k` output columns. |
| 160 | +- `ignore_unknown::Bool=false`: |
| 161 | + If `false`, novel levels in `X` (not seen during fit) will raise an error; |
| 162 | + if `true`, novel levels will be left unchanged (identity mapping). |
| 163 | +- `use_levelnames::Bool=false`: |
| 164 | + When `single_feat=false`, controls the naming of the expanded columns: if `true`, use the actual level names (e.g. `:color_red`, `:color_blue`); |
| 165 | + if `false`, use numeric indices (e.g. `:color_1`, `:color_2`). |
| 166 | +- `custom_levels::Union{Nothing,Vector}`: |
| 167 | + If not `nothing`, overrides the names of levels used to generate feature names when `single_feat=false`. |
| 168 | +- `ensure_categorical::Bool=false`: |
| 169 | + Applies only when `single_feat=true`: if `true`, preserves the categorical type of the column after |
| 170 | + recoding (e.g., the feature should still be recognized as `Multiclass` after transformation). |
| 171 | +
|
| 172 | +# Returns |
120 | 173 |
|
121 | | -Given a table `X` and a dictionary `mapping_per_feat_level` which maps each level for each column in |
122 | | -a subset of categorical features of X into a scalar or a vector (as specified in `single_feat`) |
123 | | -
|
124 | | - - transforms each value (some level) in each column in `X` using the function in `mapping_per_feat_level` |
125 | | - into a scalar (`single_feat=true`) |
126 | | -
|
127 | | - - transforms each value (some level) in each column in `X` using the function in `mapping_per_feat_level` |
128 | | - into a set of `k` features where `k` is the length of the vector (`single_feat=false`) |
129 | | - - In both cases it attempts to preserve the type of the table. |
130 | | - - In the latter case, it assumes that all levels under the same category are mapped to vectors of the same length. Such |
131 | | - assumption is necessary because any column in X must correspond to a constant number of features |
132 | | - in the output table (which is equal to k). |
133 | | - - Features not in the dictionary are mapped to themselves (i.e., not changed). |
134 | | - - Levels not in the nested dictionary are mapped to themselves if `identity_map_unknown` is true else raise an error. |
135 | | - - use_levelnames: if true, the new feature names are generated using the level names when the transform generates multiple features; |
136 | | - else they are generated using the indices of the levels. |
137 | | - - custom_levels: if not `nothing`, then the levels of the categorical features are replaced by the custom_levels |
| 174 | +A new table, of the same type as `X` where possible, with categorical columns transformed according to `mapping_per_feat_level`. |
138 | 175 | """ |
139 | 176 | function generic_transform( |
140 | 177 | X, |
|
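The two docstrings above describe a fit/transform pair built around a `feature_mapper`. A minimal, self-contained sketch of that pattern may help when reviewing the wording. Everything in it is a hypothetical stand-in rather than the package's code: `frequency_mapper`, `sketch_fit`, and `sketch_transform` are illustrative names, tables are plain `NamedTuple`s of vectors, "categorical" is approximated as "string-valued", and only the `single_feat=true` case is covered.

```julia
# Hypothetical stand-ins for illustration only; not the package's implementation.

# A feature_mapper: takes one column and returns a level => value dictionary.
# Here each level is mapped to its relative frequency, scaled by `scale`.
function frequency_mapper(col; scale = 1.0)
    counts = Dict{eltype(col), Int}()
    for level in col
        counts[level] = get(counts, level, 0) + 1
    end
    return Dict(level => scale * n / length(col) for (level, n) in counts)
end

# Simplified analogue of `generic_fit`: select columns, apply `feature_mapper`
# to each one, and forward any extra keyword arguments to it.
function sketch_fit(X::NamedTuple, features = Symbol[]; ignore = true,
                    feature_mapper, kwargs...)
    # Assumption: string-valued columns play the role of categorical columns.
    colnames = filter(n -> eltype(X[n]) <: AbstractString, collect(keys(X)))
    # ignore = true: `features` lists columns to skip; otherwise columns to use.
    selected = ignore ? filter(n -> !(n in features), colnames) :
                        filter(n -> n in features, colnames)
    return Dict(name => feature_mapper(X[name]; kwargs...) for name in selected)
end

# Simplified analogue of `generic_transform` for the scalar case.
function sketch_transform(X::NamedTuple, mapping; ignore_unknown = false)
    new_cols = map(collect(keys(X))) do name
        col = X[name]
        haskey(mapping, name) || return col      # column not in mapping: unchanged
        level_map = mapping[name]
        return map(col) do level
            haskey(level_map, level) && return level_map[level]
            ignore_unknown || error("unknown level $level in column $name")
            return level                         # identity mapping for novel levels
        end
    end
    return NamedTuple{keys(X)}(Tuple(new_cols))
end

X = (color = ["red", "blue", "red", "red"], size = [1, 2, 3, 4])
mapping = sketch_fit(X, [:color]; ignore = false, feature_mapper = frequency_mapper)
Xt = sketch_transform(X, mapping)
```

In this sketch, `mapping` holds one inner dictionary per selected column and `Xt` keeps the `size` column untouched, mirroring the "others pass through unchanged" wording. Passing an extra keyword such as `scale = 2.0` to `sketch_fit` reaches `frequency_mapper`, which is the behavior the `# Note` section describes.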