
Commit ca37c76

✅ Change column and columns to feature and features
1 parent 3e65218 commit ca37c76

13 files changed (+90, -90 lines)

src/encoders/frequency_encoding/frequency_encoding.jl

Lines changed: 9 additions & 9 deletions
@@ -3,20 +3,20 @@
 **Private method.**
 
 Fit an encoder that encodes the categorical values in the specified
-categorical columns with their (normalized or raw) frequencies of occurrence in the dataset.
+categorical features with their (normalized or raw) frequencies of occurrence in the dataset.
 
 # Arguments
 
-- `X`: A table where the elements of the categorical columns have [scitypes](https://juliaai.github.io/ScientificTypes.jl/dev/) `Multiclass` or `OrderedFactor`
-- `features=[]`: A list of names of categorical columns given as symbols to exclude or include from encoding
-- `ignore=true`: Whether to exclude or includes the columns given in `features`
+- `X`: A table where the elements of the categorical features have [scitypes](https://juliaai.github.io/ScientificTypes.jl/dev/) `Multiclass` or `OrderedFactor`
+- `features=[]`: A list of names of categorical features given as symbols to exclude or include from encoding
+- `ignore=true`: Whether to exclude or includes the features given in `features`
 - `ordered_factor=false`: Whether to encode `OrderedFactor` or ignore them
 - `normalize=false`: Whether to use normalized frequencies that sum to 1 over category values or to use raw counts.
 
 # Returns (in a dict)
 
-- `statistic_given_feat_val`: The frequency of each level of each selected categorical column
-- `encoded_features`: The subset of the categorical columns of X that were encoded
+- `statistic_given_feat_val`: The frequency of each level of each selected categorical feature
+- `encoded_features`: The subset of the categorical features of X that were encoded
 """
 function frequency_encoder_fit(
     X,
@@ -25,7 +25,7 @@ function frequency_encoder_fit(
     ordered_factor::Bool = false,
     normalize::Bool = false,
 )
-    # 1. Define column mapper
+    # 1. Define feature mapper
     function feature_mapper(col, name)
         frequency_map = (!normalize) ? countmap(col) : proportionmap(col)
         statistic_given_feat_val = Dict{Any, Real}(level=>frequency_map[level] for level in levels(col))
@@ -51,12 +51,12 @@ Encode the levels of a categorical variable in a given table with their (normali
 
 # Arguments
 
-- `X`: A table where the elements of the categorical columns have [scitypes](https://juliaai.github.io/ScientificTypes.jl/dev/) `Multiclass` or `OrderedFactor`
+- `X`: A table where the elements of the categorical features have [scitypes](https://juliaai.github.io/ScientificTypes.jl/dev/) `Multiclass` or `OrderedFactor`
 - `cache`: The output of `frequency_encoder_fit`
 
 # Returns
 
-- `X_tr`: The table with selected columns after the selected columns are encoded by frequency encoding.
+- `X_tr`: The table with selected features after the selected features are encoded by frequency encoding.
 """
 function frequency_encoder_transform(X, cache::Dict)
     statistic_given_feat_val = cache[:statistic_given_feat_val]
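
The two private functions touched in this file form a fit/transform pair: `frequency_encoder_fit` builds a cache of per-level frequencies and `frequency_encoder_transform` applies it to a table. A minimal sketch of that round trip, assuming the defining module is in scope and that the fit signature mirrors `ordinal_encoder_fit` shown further down; the toy table and the values in the comments are illustrative, not from this commit:

```julia
using CategoricalArrays  # provides `categorical` and `levels`

# Hypothetical toy table; any Tables.jl table whose categorical columns carry
# the Multiclass/OrderedFactor scitypes would do.
X = (grade = categorical(["A", "B", "A", "A"]), height = [1.8, 1.7, 1.6, 1.9])

# Fit returns a cache dict; `normalize = true` stores proportions rather than raw counts.
cache = frequency_encoder_fit(X; ignore = true, ordered_factor = false, normalize = true)

cache[:statistic_given_feat_val]  # per encoded feature: level => frequency ("A" => 0.75, "B" => 0.25 here)
cache[:encoded_features]          # the features that were actually encoded, e.g. [:grade]

# Transform replaces each level of the encoded features by its stored frequency.
X_tr = frequency_encoder_transform(X, cache)
```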

src/encoders/frequency_encoding/interface_mlj.jl

Lines changed: 7 additions & 7 deletions
@@ -35,7 +35,7 @@ function MMI.fit(transformer::FrequencyEncoder, verbosity::Int, X)
     )
     fitresult = generic_cache[:statistic_given_feat_val]
 
-    report = (encoded_features = generic_cache[:encoded_features],) # report only has list of encoded columns
+    report = (encoded_features = generic_cache[:encoded_features],) # report only has list of encoded features
     cache = nothing
     return fitresult, cache, report
 end;
@@ -74,7 +74,7 @@ MMI.metadata_model(
 $(MMI.doc_header(FrequencyEncoder))
 
 `FrequencyEncoder` implements frequency encoding which replaces the categorical values in the specified
-categorical columns with their (normalized or raw) frequencies of occurrence in the dataset.
+categorical features with their (normalized or raw) frequencies of occurrence in the dataset.
 
 # Training data
 
@@ -92,8 +92,8 @@ Train the machine using `fit!(mach, rows=...)`.
 
 # Hyper-parameters
 
-- `features=[]`: A list of names of categorical columns given as symbols to exclude or include from encoding
-- `ignore=true`: Whether to exclude or include the columns given in `features`
+- `features=[]`: A list of names of categorical features given as symbols to exclude or include from encoding
+- `ignore=true`: Whether to exclude or include the features given in `features`
 - `ordered_factor=false`: Whether to encode `OrderedFactor` or ignore them
 - `normalize=false`: Whether to use normalized frequencies that sum to 1 over category values or to use raw counts.
 
@@ -107,20 +107,20 @@ Train the machine using `fit!(mach, rows=...)`.
 
 The fields of `fitted_params(mach)` are:
 
-- `statistic_given_feat_val`: A dictionary that maps each level for each column in a subset of the categorical columns of X into its frequency.
+- `statistic_given_feat_val`: A dictionary that maps each level for each column in a subset of the categorical features of X into its frequency.
 
 # Report
 
 The fields of `report(mach)` are:
 
-- `encoded_features`: The subset of the categorical columns of X that were encoded
+- `encoded_features`: The subset of the categorical features of X that were encoded
 
 # Examples
 
 ```julia
 using MLJ
 
-# Define categorical columns
+# Define categorical features
 A = ["g", "b", "g", "r", "r",]
 B = [1.0, 2.0, 3.0, 4.0, 5.0,]
 C = ["f", "f", "f", "m", "f",]
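
The `# Examples` block in this docstring is cut off by the diff context above. As a hedged sketch of the full workflow it points at, using only the hyper-parameters documented here and the standard MLJ machine API, and assuming the package that defines `FrequencyEncoder` (e.g. MLJTransforms) is loaded:

```julia
using MLJ
# assumes the package defining `FrequencyEncoder` (e.g. MLJTransforms) is also loaded

# Same toy data as the docstring; coerce the text columns to Multiclass so the encoder selects them.
A = ["g", "b", "g", "r", "r"]
B = [1.0, 2.0, 3.0, 4.0, 5.0]
C = ["f", "f", "f", "m", "f"]
X = coerce((A = A, B = B, C = C), :A => Multiclass, :C => Multiclass)

encoder = FrequencyEncoder(ignore = true, ordered_factor = false, normalize = true)
mach = machine(encoder, X) |> fit!

Xnew = transform(mach, X)                      # :A and :C replaced by level frequencies; :B left alone
report(mach).encoded_features                  # which features were encoded
fitted_params(mach).statistic_given_feat_val   # the fitted level => frequency maps
```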

src/encoders/ordinal_encoding/interface_mlj.jl

Lines changed: 7 additions & 7 deletions
@@ -32,7 +32,7 @@ function MMI.fit(transformer::OrdinalEncoder, verbosity::Int, X)
     )
     fitresult =
         generic_cache[:index_given_feat_level]
-    report = (encoded_features = generic_cache[:encoded_features],) # report only has list of encoded columns
+    report = (encoded_features = generic_cache[:encoded_features],) # report only has list of encoded features
     cache = nothing
     return fitresult, cache, report
 end;
@@ -70,7 +70,7 @@ MMI.metadata_model(
 $(MMI.doc_header(OrdinalEncoder))
 
 `OrdinalEncoder` implements ordinal encoding which replaces the categorical values in the specified
-categorical columns with integers (ordered arbitrarily). This will create an implicit ordering between
+categorical features with integers (ordered arbitrarily). This will create an implicit ordering between
 categories which may not be a proper modelling assumption.
 
 # Training data
@@ -89,8 +89,8 @@ Train the machine using `fit!(mach, rows=...)`.
 
 # Hyper-parameters
 
-- `features=[]`: A list of names of categorical columns given as symbols to exclude or include from encoding
-- `ignore=true`: Whether to exclude or includes the columns given in `features`
+- `features=[]`: A list of names of categorical features given as symbols to exclude or include from encoding
+- `ignore=true`: Whether to exclude or includes the features given in `features`
 - `ordered_factor=false`: Whether to encode `OrderedFactor` or ignore them
 
 # Operations
@@ -103,20 +103,20 @@ Train the machine using `fit!(mach, rows=...)`.
 
 The fields of `fitted_params(mach)` are:
 
-- `index_given_feat_level`: A dictionary that maps each level for each column in a subset of the categorical columns of X into an integer.
+- `index_given_feat_level`: A dictionary that maps each level for each column in a subset of the categorical features of X into an integer.
 
 # Report
 
 The fields of `report(mach)` are:
 
-- `encoded_features`: The subset of the categorical columns of X that were encoded
+- `encoded_features`: The subset of the categorical features of X that were encoded
 
 # Examples
 
 ```julia
 using MLJ
 
-# Define categorical columns
+# Define categorical features
 A = ["g", "b", "g", "r", "r",]
 B = [1.0, 2.0, 3.0, 4.0, 5.0,]
 C = ["f", "f", "f", "m", "f",]

src/encoders/ordinal_encoding/ordinal_encoding.jl

Lines changed: 8 additions & 8 deletions
@@ -6,23 +6,23 @@ Fit an encoder to encode the levels of categorical variables in a given table as
 
 # Arguments
 
-- `X`: A table where the elements of the categorical columns have [scitypes](https://juliaai.github.io/ScientificTypes.jl/dev/) `Multiclass` or `OrderedFactor`
-- `features=[]`: A list of names of categorical columns given as symbols to exclude or include from encoding
-- `ignore=true`: Whether to exclude or includes the columns given in `features`
+- `X`: A table where the elements of the categorical features have [scitypes](https://juliaai.github.io/ScientificTypes.jl/dev/) `Multiclass` or `OrderedFactor`
+- `features=[]`: A list of names of categorical features given as symbols to exclude or include from encoding
+- `ignore=true`: Whether to exclude or includes the features given in `features`
 - `ordered_factor=false`: Whether to encode `OrderedFactor` or ignore them
 
 # Returns (in a dict)
 
-- `index_given_feat_level`: Maps each level for each column in a subset of the categorical columns of X into an integer.
-- `encoded_features`: The subset of the categorical columns of X that were encoded
+- `index_given_feat_level`: Maps each level for each column in a subset of the categorical features of X into an integer.
+- `encoded_features`: The subset of the categorical features of X that were encoded
 """
 function ordinal_encoder_fit(
     X,
     features::AbstractVector{Symbol} = Symbol[];
     ignore::Bool = true,
     ordered_factor::Bool = false,
 )
-    # 1. Define column mapper
+    # 1. Define feature mapper
     function feature_mapper(col, name)
         feat_levels = levels(col)
         index_given_feat_val =
@@ -50,12 +50,12 @@ Encode the levels of a categorical variable in a given table as integers.
 
 # Arguments
 
-- `X`: A table where the elements of the categorical columns have [scitypes](https://juliaai.github.io/ScientificTypes.jl/dev/) `Multiclass` or `OrderedFactor`
+- `X`: A table where the elements of the categorical features have [scitypes](https://juliaai.github.io/ScientificTypes.jl/dev/) `Multiclass` or `OrderedFactor`
 - `cache`: The output of `ordinal_encoder_fit`
 
 # Returns
 
-- `X_tr`: The table with selected columns after the selected columns are encoded by ordinal encoding.
+- `X_tr`: The table with selected features after the selected features are encoded by ordinal encoding.
 """
 function ordinal_encoder_transform(X, cache::Dict)
     index_given_feat_level = cache[:index_given_feat_level]
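
The full fit signature is visible in this hunk, so the `features`/`ignore` selection logic can be sketched directly. The toy table and expected outputs below are illustrative, and in practice these private functions would be qualified with the defining module's name:

```julia
using CategoricalArrays

X = (grade = categorical(["A", "B", "A", "C"]),
     group = categorical(["x", "x", "y", "y"]),
     score = [1.0, 2.0, 3.0, 4.0])

# `features` plus `ignore` pick which categorical features are encoded:
#   ignore = true  (default) -> encode every categorical feature except those listed
#   ignore = false           -> encode only those listed
cache_all  = ordinal_encoder_fit(X)                            # encodes :grade and :group
cache_some = ordinal_encoder_fit(X, [:group]; ignore = false)  # encodes only :group

cache_some[:index_given_feat_level]   # integer index per level of :group (ordered arbitrarily)
cache_some[:encoded_features]         # [:group]; :score is Continuous, so it is never selected

X_tr = ordinal_encoder_transform(X, cache_some)
```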

src/encoders/target_encoding/interface_mlj.jl

Lines changed: 8 additions & 8 deletions
@@ -47,7 +47,7 @@ struct TargetEncoderResult{
     S <: AbstractString,
     A <: Any # Useless but likely can't do much better
 } <: MMI.MLJType
-    # target statistic for each level of each categorical column
+    # target statistic for each level of each categorical feature
     y_stat_given_feat_level::Dict{A, A}
     task::S # "Regression", "Classification"
     num_classes::I # num_classes in case of classification
@@ -77,7 +77,7 @@ function MMI.fit(transformer::TargetEncoder, verbosity::Int, X, y)
         generic_cache[:task],
         generic_cache[:num_classes],
     )
-    report = (encoded_features = generic_cache[:encoded_features],) # report only has list of encoded columns
+    report = (encoded_features = generic_cache[:encoded_features],) # report only has list of encoded features
     cache = nothing
     return fitresult, cache, report
 end;
@@ -140,8 +140,8 @@ Train the machine using `fit!(mach, rows=...)`.
 
 # Hyper-parameters
 
-- `features=[]`: A list of names of categorical columns given as symbols to exclude or include from encoding
-- `ignore=true`: Whether to exclude or includes the columns given in `features`
+- `features=[]`: A list of names of categorical features given as symbols to exclude or include from encoding
+- `ignore=true`: Whether to exclude or includes the features given in `features`
 - `ordered_factor=false`: Whether to encode `OrderedFactor` or ignore them
 - `λ`: Shrinkage hyperparameter used to mix between posterior and prior statistics as described in [1]
 - `m`: An integer hyperparameter to compute shrinkage as described in [1]. If `m=:auto` then m will be computed using
@@ -158,21 +158,21 @@ Train the machine using `fit!(mach, rows=...)`.
 The fields of `fitted_params(mach)` are:
 
 - `task`: Whether the task is `Classification` or `Regression`
-- `y_statistic_given_feat_level`: A dictionary with the necessary statistics to encode each categorical column. It maps each
-level in each categorical column to a statistic computed over the target.
+- `y_statistic_given_feat_level`: A dictionary with the necessary statistics to encode each categorical feature. It maps each
+level in each categorical feature to a statistic computed over the target.
 
 # Report
 
 The fields of `report(mach)` are:
 
-- `encoded_features`: The subset of the categorical columns of X that were encoded
+- `encoded_features`: The subset of the categorical features of X that were encoded
 
 # Examples
 
 ```julia
 using MLJ
 
-# Define categorical columns
+# Define categorical features
 A = ["g", "b", "g", "r", "r",]
 B = [1.0, 2.0, 3.0, 4.0, 5.0,]
 C = ["f", "f", "f", "m", "f",]
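
`TargetEncoder` differs from the other encoders in this commit in that its `MMI.fit` (first hunk above) also receives the target `y`, so the machine is bound to both `X` and `y`. A hedged sketch with illustrative data, again assuming the package that defines `TargetEncoder` (e.g. MLJTransforms) is loaded:

```julia
using MLJ
# assumes the package defining `TargetEncoder` (e.g. MLJTransforms) is loaded

A = ["g", "b", "g", "r", "r"]
B = [1.0, 2.0, 3.0, 4.0, 5.0]
X = coerce((A = A, B = B), :A => Multiclass)
y = coerce(["yes", "no", "yes", "yes", "no"], Multiclass)  # classification target

# λ and m are the shrinkage hyper-parameters documented above; defaults appear in target_encoding.jl.
encoder = TargetEncoder(ignore = true, ordered_factor = false, λ = 1.0, m = 0)
mach = machine(encoder, X, y) |> fit!   # supervised: the target is required at fit time

Xnew = transform(mach, X)               # levels of :A replaced by statistics of y per level
fitted_params(mach).task                # "Classification" for this target
report(mach).encoded_features
```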

src/encoders/target_encoding/target_encoding.jl

Lines changed: 8 additions & 8 deletions
@@ -110,15 +110,15 @@ end
 
     target_encoder_fit(X, y, features=[]; ignore=true, ordered_factor=false, λ = 1.0, m=0)
 
-Fit a target encoder on table X with target y by computing the necessary statistics for every categorical column.
+Fit a target encoder on table X with target y by computing the necessary statistics for every categorical feature.
 
 # Arguments
 
-- `X`: A table where the elements of the categorical columns have [scitypes](https://juliaai.github.io/ScientificTypes.jl/dev/)
+- `X`: A table where the elements of the categorical features have [scitypes](https://juliaai.github.io/ScientificTypes.jl/dev/)
 `Multiclass` or `OrderedFactor`
 - `y`: An abstract vector of labels (e.g., strings) that correspond to the observations in X
-- `features=[]`: A list of names of categorical columns given as symbols to exclude or include from encoding
-- `ignore=true`: Whether to exclude or includes the columns given in `features`
+- `features=[]`: A list of names of categorical features given as symbols to exclude or include from encoding
+- `ignore=true`: Whether to exclude or includes the features given in `features`
 - `ordered_factor=false`: Whether to encode `OrderedFactor` or ignore them
 - `λ`: Shrinkage hyperparameter used to mix between posterior and prior statistics as described in [1]
 - `m`: An integer hyperparameter to compute shrinkage as described in [1]. If `m=:auto` then m will be computed using
@@ -127,7 +127,7 @@ Fit a target encoder on table X with target y by computing the necessary statist
 # Returns
 
 - `cache`: A dictionary containing a dictionary `y_stat_given_feat_level` with the necessary statistics needed to transform
-every categorical column as well as other metadata needed for transform.
+every categorical feature as well as other metadata needed for transform.
 """
 function target_encoder_fit(
     X,
@@ -229,13 +229,13 @@ end
 Transform given data with fitted target encoder cache.
 
 # Arguments
-- `X`: A table where the elements of the categorical columns have [scitypes](https://juliaai.github.io/ScientificTypes.jl/dev/)
+- `X`: A table where the elements of the categorical features have [scitypes](https://juliaai.github.io/ScientificTypes.jl/dev/)
 `Multiclass` or `OrderedFactor`
 - `cache`: A dictionary containing a dictionary `y_stat_given_feat_level` with the necessary statistics for
-every categorical column as well as other metadata needed for transform
+every categorical feature as well as other metadata needed for transform
 
 # Returns
-- `X`: A table where the categorical columns as specified during fitting are transformed by target encoding. Other columns will remain
+- `X`: A table where the categorical features as specified during fitting are transformed by target encoding. Other features will remain
 the same. This will attempt to preserve the type of the table but may not succeed.
 """
 

src/generic.jl

Lines changed: 13 additions & 13 deletions
@@ -6,25 +6,25 @@
 
 A generic function to fit a class of transformers where its convenient to define a single `feature_mapper` function that
 takes the column as a vector and potentially other arguments (as passed in ...args and ...kwargs) and returns
-a dictionary that maps each level of the categorical column to a scalar or vector
+a dictionary that maps each level of the categorical feature to a scalar or vector
 according to the transformation logic. In other words, the `feature_mapper` simply answers the question "For level n of
-the current categorical column c, what should the new value or vector (multiple columns) be as defined by the transformation
+the current categorical feature c, what should the new value or vector (multiple features) be as defined by the transformation
 logic?"
 
 # Arguments
 
-- `X`: A table where the elements of the categorical columns have [scitypes](https://juliaai.github.io/ScientificTypes.jl/dev/)
+- `X`: A table where the elements of the categorical features have [scitypes](https://juliaai.github.io/ScientificTypes.jl/dev/)
 `Multiclass` or `OrderedFactor`
-- `features=[]`: A list of names of categorical columns given as symbols to exclude or include from encoding
-- `ignore=true`: Whether to exclude or includes the columns given in `features`
+- `features=[]`: A list of names of categorical features given as symbols to exclude or include from encoding
+- `ignore=true`: Whether to exclude or includes the features given in `features`
 - `ordered_factor=false`: Whether to encode `OrderedFactor` or ignore them
 - `feature_mapper`: Defined above.
 
 # Returns
 
-- `mapping_per_feat_level`: Maps each level for each column in a subset of the categorical columns of
+- `mapping_per_feat_level`: Maps each level for each feature in a subset of the categorical features of
 X into a scalar or a vector.
-- `encoded_features`: The subset of the categorical columns of X that were encoded
+- `encoded_features`: The subset of the categorical features of X that were encoded
 """
 function generic_fit(X,
     features::AbstractVector{Symbol} = Symbol[],
@@ -43,7 +43,7 @@ function generic_fit(X,
     # 3. Define mapping per column per level dictionary
     mapping_per_feat_level = Dict()
 
-    # 4. Use column mapper to compute the mapping of each level in each column
+    # 4. Use feature mapper to compute the mapping of each level in each column
     encoded_features = Symbol[]# to store column that were actually encoded
     for feat_name in feat_names
         feat_col = Tables.getcolumn(X, feat_name)
@@ -64,7 +64,7 @@ end
 """
 **Private method.**
 
-Function to generate new column names: feat_name_0, feat_name_1,..., feat_name_n
+Function to generate new feature names: feat_name_0, feat_name_1,..., feat_name_n
 """
 function generate_new_feat_names(feat_name, num_inds, existing_names)
     conflict = true # will be kept true as long as there is a conflict
@@ -86,18 +86,18 @@ end
 **Private method.**
 
 Given a table `X` and a dictionary `mapping_per_feat_level` which maps each level for each column in
-a subset of categorical columns of X into a scalar or a vector (as specified in single_feat)
+a subset of categorical features of X into a scalar or a vector (as specified in single_feat)
 
 - transforms each value (some level) in each column in `X` using the function in `mapping_per_feat_level`
 into a scalar (single_feat=true)
 
 - transforms each value (some level) in each column in `X` using the function in `mapping_per_feat_level`
-into a set of k columns where k is the length of the vector (single_feat=false)
+into a set of k features where k is the length of the vector (single_feat=false)
 - In both cases it attempts to preserve the type of the table.
 - In the latter case, it assumes that all levels under the same category are mapped to vectors of the same length. Such
-assumption is necessary because any column in X must correspond to a constant number of columns
+assumption is necessary because any column in X must correspond to a constant number of features
 in the output table (which is equal to k).
-- Columns not in the dictionary are mapped to themselves (i.e., not changed).
+- Features not in the dictionary are mapped to themselves (i.e., not changed).
 - Levels not in the nested dictionary are mapped to themselves if `identity_map_unknown` is true else raise an error.
 """
 function generic_transform(X, mapping_per_feat_level; single_feat = true, ignore_unknown = false)
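
The docstrings in this file describe the shared machinery behind all of the encoders above: a `feature_mapper(col, name)` returns a level-to-value dict for one column, `generic_fit` applies it to each selected categorical feature, and `generic_transform` rewrites the table. A hedged illustration of that contract, reusing the frequency logic visible in `frequency_encoding.jl`; the `normalize` keyword and the example column are additions for illustration, and how `generic_fit` forwards extra arguments is not shown in this diff:

```julia
using CategoricalArrays   # `categorical`, `levels`
using StatsBase           # `countmap`, `proportionmap`, as used in frequency_encoding.jl

# The contract: given one categorical column (and its name), return a dict mapping every
# level of that column to its new value (a scalar here, i.e. the single_feat = true case
# described for `generic_transform`).
function frequency_feature_mapper(col, name; normalize = false)
    frequency_map = normalize ? proportionmap(col) : countmap(col)
    return Dict{Any, Real}(level => frequency_map[level] for level in levels(col))
end

col = categorical(["g", "b", "g", "r", "r"])
frequency_feature_mapper(col, :A; normalize = true)
# => Dict mapping "g" => 0.4, "b" => 0.2, "r" => 0.4
```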
