Commit bc3b1df

✨ Improving internal documentation
1 parent 02c8688 commit bc3b1df

7 files changed

+81
-44
lines changed

Project.toml

Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
 name = "MLJTransforms"
 uuid = "23777cdb-d90c-4eb0-a694-7c2b83d5c1d6"
 authors = ["Essam <[email protected]> and contributors"]
-version = "0.1.1"
+version = "0.1.4"
 
 [deps]
 BitBasis = "50ba71b6-fa0f-514d-ae9a-0916efc90dcf"

src/common_docs.jl

Lines changed: 5 additions & 5 deletions
@@ -1,5 +1,5 @@
 const X_doc = """
-- X: A table where the elements of the categorical features have [scitypes](https://juliaai.github.io/ScientificTypes.jl/dev/)
+- `X`: A table where the elements of the categorical features have [scitypes](https://juliaai.github.io/ScientificTypes.jl/dev/)
 `Multiclass` or `OrderedFactor`
 """
 const X_doc_mlj = """
@@ -8,18 +8,18 @@ const X_doc_mlj = """
 check scitypes.
 """
 const features_doc = """
-- features=[]: A list of names of categorical features given as symbols to exclude or include from encoding,
+- `features=[]`: A list of names of categorical features given as symbols to exclude or include from encoding,
 according to the value of `ignore`, or a single symbol (which is treated as a vector with one symbol),
 or a callable that returns true for features to be included/excluded
 """
 const ignore_doc = """
-- ignore=true: Whether to exclude or include the features given in `features`
+- `ignore=true`: Whether to exclude or include the features given in `features`
 """
 const ordered_factor_doc = """
-- ordered_factor=false: Whether to encode `OrderedFactor` or ignore them
+- `ordered_factor=false`: Whether to encode `OrderedFactor` or ignore them
 """
 const encoded_features_doc = """
-- encoded_features: The subset of the categorical features of `X` that were encoded
+- `encoded_features`: The subset of the categorical features of `X` that were encoded
 """
 const cache_doc = """
 - `cache`: The output of `contrast_encoder_fit`
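
The constants in this file are spliced into docstrings elsewhere in the package through string interpolation (for example `$ordered_factor_doc` in the interface files below). A minimal sketch of that pattern follows; `toy_encoder` is a hypothetical function used only for illustration and is not part of MLJTransforms.

```julia
# Shared docstring fragment, written in the same style as `ignore_doc` above.
const ignore_doc = """
- `ignore=true`: Whether to exclude or include the features given in `features`
"""

# `toy_encoder` is hypothetical; it only exists to carry a docstring.
"""
    toy_encoder(x; ignore = true)

# Arguments

$ignore_doc
"""
toy_encoder(x; ignore = true) = x
```

Because the fragment is interpolated when the docstring is defined, editing the constant once (as this commit does by adding backticks) updates every docstring that includes it.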

src/encoders/contrast_encoder/interface_mlj.jl

Lines changed: 1 addition & 1 deletion
@@ -104,7 +104,7 @@ $ordered_factor_doc
 
 # Operations
 
-- `transform(mach, Xnew)`: Apply contrast encoding to selected `Multiclass` or `OrderedFactor features of `Xnew` specified by hyper-parameters, and
+- `transform(mach, Xnew)`: Apply contrast encoding to selected `Multiclass` or `OrderedFactor` features of `Xnew` specified by hyper-parameters, and
 return the new table. Features that are neither `Multiclass` nor `OrderedFactor`
 are always left unchanged.

src/encoders/frequency_encoding/interface_mlj.jl

Lines changed: 1 addition & 1 deletion
@@ -100,7 +100,7 @@ $ordered_factor_doc
 
 # Operations
 
-- `transform(mach, Xnew)`: Apply frequency encoding to selected `Multiclass` or `OrderedFactor features of `Xnew` specified by hyper-parameters, and
+- `transform(mach, Xnew)`: Apply frequency encoding to selected `Multiclass` or `OrderedFactor` features of `Xnew` specified by hyper-parameters, and
 return the new table. Features that are neither `Multiclass` nor `OrderedFactor`
 are always left unchanged.

src/encoders/ordinal_encoding/interface_mlj.jl

Lines changed: 1 addition & 1 deletion
@@ -95,7 +95,7 @@ $ordered_factor_doc
 
 # Operations
 
-- `transform(mach, Xnew)`: Apply ordinal encoding to selected `Multiclass` or `OrderedFactor features of `Xnew` specified by hyper-parameters, and
+- `transform(mach, Xnew)`: Apply ordinal encoding to selected `Multiclass` or `OrderedFactor` features of `Xnew` specified by hyper-parameters, and
 return the new table. Features that are neither `Multiclass` nor `OrderedFactor`
 are always left unchanged.

src/encoders/target_encoding/interface_mlj.jl

Lines changed: 1 addition & 1 deletion
@@ -150,7 +150,7 @@ $ordered_factor_doc
 
 # Operations
 
-- `transform(mach, Xnew)`: Apply target encoding to selected `Multiclass` or `OrderedFactor features of `Xnew` specified by hyper-parameters, and
+- `transform(mach, Xnew)`: Apply target encoding to selected `Multiclass` or `OrderedFactor` features of `Xnew` specified by hyper-parameters, and
 return the new table. Features that are neither `Multiclass` nor `OrderedFactor`
 are always left unchanged.

src/generic.jl

Lines changed: 71 additions & 34 deletions
@@ -1,28 +1,43 @@
 # generic functions go here; such function can be used throughout multiple methods
 
 """
-**Private method.**
+```julia
+generic_fit(X,
+features = Symbol[],
+args...;
+ignore::Bool = true,
+ordered_factor::Bool = false,
+feature_mapper,
+kwargs...,
+)
+```
+
+Given a `feature_mapper` (see definition below), this method applies
+`feature_mapper` across a specified subset of categorical columns in X and returns a dictionary
+whose keys are the feature names, and each value is the corresponding
+level-to-value mapping produced by `feature_mapper`.
+
+In essence, it spares the effort of looping over each column and applying `feature_mapper` manually, as well as handling the feature selection logic.
 
-A generic function to fit a class of transformers where its convenient to define a single `feature_mapper` function that
-takes the column as a vector and potentially other arguments (as passed in ...args and ...kwargs) and returns
-a dictionary that maps each level of the categorical feature to a scalar or vector
-according to the transformation logic. In other words, the `feature_mapper` simply answers the question "For level n of
-the current categorical feature c, what should the new value or vector (multiple features) be as defined by the transformation
-logic?"
 
 # Arguments
 
-$X_doc
-$features_doc
-$ignore_doc
-$ordered_factor_doc
-- feature_mapper: Defined above.
+$X_doc
+$features_doc
+$ignore_doc
+$ordered_factor_doc
+- `feature_mapper`: A function that, for a given vector (e.g., one corresponding to a categorical column of the dataset `X`),
+produces a mapping from each category level name in this vector to a scalar or vector according to specified transformation logic.
 
-# Returns
+# Note
+
+- Any additional arguments (whether keyword or not) provided to this function are passed to the `feature_mapper` function, which
+is helpful when `feature_mapper` requires additional arguments to compute the mapping (e.g., hyperparameters).
 
-- mapping_per_feat_level: Maps each level for each feature in a subset of the categorical features of
-X into a scalar or a vector.
-$encoded_features_doc
+# Returns
+- `mapping_per_feat_level`: Maps each level for each feature in a subset of the categorical features of
+X into a scalar or a vector.
+$encoded_features_doc
 """
 function generic_fit(X,
 features = Symbol[],
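
The `generic_fit` docstring above describes a `feature_mapper` that turns one column into a level-to-value dictionary, plus a loop that applies it to the selected features. The sketch below illustrates that idea in plain Julia. It is not the package's implementation: `frequency_mapper` and `toy_fit` are hypothetical names, the table is assumed to be a `NamedTuple` of vectors, every column is treated as categorical (the real method checks scitypes), and only keyword arguments are forwarded.

```julia
# Hypothetical feature mapper: maps each level of a column to its count,
# or to its relative frequency when `normalize = true`.
function frequency_mapper(col; normalize = true)
    counts = Dict{eltype(col), Float64}()
    for level in col
        counts[level] = get(counts, level, 0.0) + 1.0
    end
    n = length(col)
    return normalize ? Dict(level => c / n for (level, c) in counts) : counts
end

# Hypothetical, simplified stand-in for the selection-and-loop logic.
# With `ignore = true` the listed features are excluded; with
# `ignore = false` only the listed features are encoded.
function toy_fit(X::NamedTuple, features = Symbol[];
                 ignore::Bool = true, feature_mapper, kwargs...)
    selected = [name for name in keys(X) if (name in features) != ignore]
    # Extra keyword arguments are forwarded to the mapper (e.g. hyperparameters).
    return Dict(name => feature_mapper(X[name]; kwargs...) for name in selected)
end

X = (color = ["red", "blue", "red", "red"], size = ["S", "M", "S", "L"])

# Exclude :size, so only :color is mapped.
mapping = toy_fit(X, [:size]; ignore = true, feature_mapper = frequency_mapper)

# Include only :size and forward `normalize = false` to the mapper.
counts_only = toy_fit(X, [:size]; ignore = false,
                      feature_mapper = frequency_mapper, normalize = false)
```

The outer dictionary is keyed by feature name and each inner dictionary by level, which matches the `mapping_per_feat_level` shape named in the Returns section.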
@@ -116,25 +131,47 @@ end
 
 
 """
-**Private method.**
+```julia
+generic_transform(
+X,
+mapping_per_feat_level;
+single_feat::Bool = true,
+ignore_unknown::Bool = false,
+use_levelnames::Bool = false,
+custom_levels = nothing,
+ensure_categorical::Bool = false,
+)
+```
+
+
+Apply a per-level feature mapping to selected categorical columns in `X`, returning a new table of the same type.
+
+# Arguments
+
+$X_doc
+- `mapping_per_feat_level::Dict{Symbol,Dict}`:
+A dict whose keys are feature names (`Symbol`) and values are themselves dictionaries
+mapping each observed level to either a scalar (if `single_feat=true`) or a fixed-length vector
+(if `single_feat=false`). Only columns whose names appear in `mapping_per_feat_level` are
+transformed; others pass through unchanged.
+- `single_feat::Bool=true`:
+If `true`, each input level is mapped to a single scalar feature; if `false`,
+each input level is mapped to a length-`k` vector, producing `k` output columns.
+- `ignore_unknown::Bool=false`:
+If `false`, novel levels in `X` (not seen during fit) will raise an error;
+if `true`, novel levels will be left unchanged (identity mapping).
+- `use_levelnames::Bool=false`:
+When `single_feat=false`, controls naming of the expanded columns: `true`: use actual level names (e.g. `:color_red`, `:color_blue`),
+`false`: use numeric indices (e.g. `:color_1`, `:color_2`).
+- `custom_levels::Union{Nothing,Vector}`:
+If not `nothing`, overrides the names of levels used to generate feature names when `single_feat=false`.
+- `ensure_categorical::Bool=false`:
+Applies only when `single_feat=true`: if `true`, preserves the categorical type of the column after
+recoding (e.g., the feature is still recognized as `Multiclass` after transformation).
+
+# Returns
 
-Given a table `X` and a dictionary `mapping_per_feat_level` which maps each level for each column in
-a subset of categorical features of X into a scalar or a vector (as specified in `single_feat`)
-
-- transforms each value (some level) in each column in `X` using the function in `mapping_per_feat_level`
-into a scalar (`single_feat=true`)
-
-- transforms each value (some level) in each column in `X` using the function in `mapping_per_feat_level`
-into a set of `k` features where `k` is the length of the vector (`single_feat=false`)
-- In both cases it attempts to preserve the type of the table.
-- In the latter case, it assumes that all levels under the same category are mapped to vectors of the same length. Such
-assumption is necessary because any column in X must correspond to a constant number of features
-in the output table (which is equal to k).
-- Features not in the dictionary are mapped to themselves (i.e., not changed).
-- Levels not in the nested dictionary are mapped to themselves if `identity_map_unknown` is true else raise an error.
-- use_levelnames: if true, the new feature names are generated using the level names when the transform generates multiple features;
-else they are generated using the indices of the levels.
-- custom_levels: if not `nothing`, then the levels of the categorical features are replaced by the custom_levels
+A new table, of the same type as `X` where possible, with categorical columns transformed according to `mapping_per_feat_level`.
 """
 function generic_transform(
 X,
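
The `generic_transform` docstring distinguishes scalar output (`single_feat=true`) from vector output expanded into `k` columns (`single_feat=false`). The sketch below illustrates those two cases under stated assumptions. `toy_transform` is a hypothetical helper, not the package's code: it takes a `NamedTuple` of vectors, always names expanded columns by numeric index (the `use_levelnames=false` convention), and omits `custom_levels` and `ensure_categorical`.

```julia
# Hypothetical, simplified stand-in for the behaviour described in the docstring.
# `mapping` has the shape feature name => (level => scalar or vector).
function toy_transform(X::NamedTuple, mapping::Dict;
                       single_feat::Bool = true, ignore_unknown::Bool = false)
    out_names = Symbol[]
    out_cols = Any[]
    for name in keys(X)
        col = X[name]
        if !haskey(mapping, name)
            # Columns without a mapping pass through unchanged.
            push!(out_names, name)
            push!(out_cols, col)
            continue
        end
        level_map = mapping[name]
        encoded = map(col) do v
            haskey(level_map, v) && return level_map[v]
            # Unseen level: error unless asked to keep it as-is.
            ignore_unknown || error("unseen level $v in column $name")
            return v
        end
        if single_feat
            push!(out_names, name)
            push!(out_cols, encoded)
        else
            # Every level maps to a length-k vector, giving k output columns.
            k = length(first(values(level_map)))
            for j in 1:k
                push!(out_names, Symbol(name, :_, j))
                push!(out_cols, [e[j] for e in encoded])
            end
        end
    end
    return NamedTuple{Tuple(out_names)}(Tuple(out_cols))
end

X = (color = ["red", "blue", "red"], price = [1.0, 2.0, 3.0])

# Scalar mapping: :color is replaced in place, :price is untouched.
scalar_map = Dict(:color => Dict("red" => 0.75, "blue" => 0.25))
Xs = toy_transform(X, scalar_map)

# Vector mapping: :color expands into :color_1 and :color_2.
vector_map = Dict(:color => Dict("red" => [1, 0], "blue" => [0, 1]))
Xv = toy_transform(X, vector_map; single_feat = false)
```

The sketch relies on every level of a feature mapping to a vector of the same length, the assumption the previous docstring stated explicitly.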
