Open
41 commits
f51274d
🌟 Initialize docs
EssamWisam Sep 15, 2024
1e8a421
🔥 Delete useless logo
EssamWisam Sep 15, 2024
924df68
⭐️ Better documentation
EssamWisam Sep 29, 2024
28c1cb9
Merge pull request #29 from JuliaAI/dev
EssamWisam Jun 18, 2025
027f620
✨ Better structure and definitions
EssamWisam Jun 18, 2025
03892b7
✨ Link full list above
EssamWisam Jun 20, 2025
0263009
Create readme-reflect.yml
EssamWisam Jun 24, 2025
c0bb79d
Update readme-reflect.yml
EssamWisam Jun 24, 2025
27da502
chore: update README from docs index
invalid-email-address Jun 24, 2025
9953cef
✨ Better org
EssamWisam Jun 24, 2025
b92eefb
chore: update README from docs index
invalid-email-address Jun 24, 2025
24e7e0b
✨ Better style
EssamWisam Jun 25, 2025
cae2e9c
Merge branch 'docs' of https://github.com/JuliaAI/MLJTransforms.jl in…
EssamWisam Jun 25, 2025
50e74d1
Update docs/src/index.md
EssamWisam Jun 29, 2025
e85e221
chore: update README from docs index
invalid-email-address Jun 29, 2025
4bbc4e7
✨ Improve README, Transformers page and Entity Embedder page
EssamWisam Jun 29, 2025
02c8688
Update README.md
EssamWisam Jun 29, 2025
bc3b1df
✨ Improving internal documentation
EssamWisam Jun 29, 2025
d5af39a
💫 Add contributing and about pages
EssamWisam Jun 29, 2025
4270d25
🤩 Add four tutorials
EssamWisam Aug 18, 2025
a0244a1
Merge branch 'dev' into docs
EssamWisam Aug 18, 2025
d11d892
Update docs/src/index.md
EssamWisam Aug 18, 2025
14a8205
✨ Contributions change
EssamWisam Aug 24, 2025
7fff9bf
Update make.jl
EssamWisam Aug 24, 2025
2832695
Update make.jl
EssamWisam Aug 24, 2025
72d6ab2
Enable documentation deployment for docs branch
EssamWisam Aug 24, 2025
489dc42
Update documenter.yml
EssamWisam Aug 24, 2025
6cfd374
Update MLJFlux to v0.6.6 to include EntityEmbedder support
EssamWisam Aug 24, 2025
9f9d28e
fix mljflux
EssamWisam Aug 24, 2025
09de647
✨ Add CV analysis
EssamWisam Aug 25, 2025
f7ac80e
✨ improve entity embeddings tutorial
EssamWisam Aug 27, 2025
8bd0dc9
✨ Improve standardization
EssamWisam Aug 27, 2025
2630f08
✨ Fix high cardinality dataset
EssamWisam Aug 28, 2025
6dca39a
fix docs
EssamWisam Aug 28, 2025
94f562b
Update docs/src/tutorials/adult_example/notebook.md
EssamWisam Aug 28, 2025
c89a00d
Update docs/src/tutorials/adult_example/notebook.jl
EssamWisam Aug 28, 2025
1e78e89
Update docs/src/transformers/all_transformers.md
EssamWisam Sep 1, 2025
3d72ff6
✨ Fix links for contrast encoder
EssamWisam Sep 1, 2025
82c1631
👨‍🔧 More doc fixes
EssamWisam Sep 1, 2025
0fc26c0
Merge branch 'dev' into docs
EssamWisam Sep 1, 2025
c283429
Update all_transformers.md
EssamWisam Sep 6, 2025
1 change: 1 addition & 0 deletions docs/make.jl
@@ -31,6 +31,7 @@ makedocs(
"Contrast Encoders"=>"transformers/contrast.md",
"Utility Encoders"=>"transformers/utility.md",
"Other Transformers"=>"transformers/others.md",
+"API Index" => "transformers/all_transformers.md",
],
"Extended Examples" => Any[
"Tutorial A" => "tutorials/T1.md",
27 changes: 15 additions & 12 deletions docs/src/index.md
@@ -1,6 +1,6 @@
# MLJTransforms.jl

-A Julia package providing a wide range of categorical encoders and transformers to be used with the [MLJ](https://juliaai.github.io/MLJ.jl/dev/) package.
+A Julia package providing a wide range of categorical encoders and transformers to be used with the [MLJ](https://juliaai.github.io/MLJ.jl/dev/) package. Transformers help convert raw features into a representation that's better suited for downstream models. Meanwhile, categorical encoders are a type of transformer that specifically encodes categorical features into numerical forms.

## Installation

@@ -24,9 +24,11 @@ X = RDatasets.dataset("HSAUR", "Forbes2000");
# 2. Load the model
FrequencyEncoder = @load FrequencyEncoder pkg="MLJTransforms"
encoder = FrequencyEncoder(
-features=[:Country, :Category],
-ignore=false, ordered_factor = false,
-normalize=true)
+features=[:Country, :Category], # The categorical columns to select
+ignore=false, # Whether to exclude or include selected columns
+ordered_factor = false, # Whether to also encode columns of ordered factor elements
+normalize=true # Whether to normalize the frequencies used for encoding
+)


# 3. Wrap it in a machine and fit
@@ -35,15 +37,16 @@ Xnew = transform(mach, X)
```

## Available Transformers
-In `MLJTransforms` we denote transformers that operate on columns with `Continuous` and/or `Count` [scientific types](https://juliaai.github.io/ScientificTypes.jl/dev/) as numerical transformers. Meanwhile, categorical transformers operate on `Multiclass` and/or `OrderedFactor` [scientific types](https://juliaai.github.io/ScientificTypes.jl/dev/). Most categorical transformers in this package operate by converting categorical values into numerical values or vectors, and are therefore considered categorical encoders.
+In `MLJTransforms` we denote transformers that can operate on columns with `Continuous` and/or `Count` [scientific types](https://juliaai.github.io/ScientificTypes.jl/dev/) as numerical transformers. Meanwhile, categorical transformers operate on `Multiclass` and/or `OrderedFactor` [scientific types](https://juliaai.github.io/ScientificTypes.jl/dev/). Most categorical transformers in this package operate by converting categorical values into numerical values or vectors, and are therefore considered categorical encoders.

-Based on this, we categorize the methods as follows, with further distinctions for categorical encoders:
+Based on this, we categorize the methods in this package as follows, with further distinctions for categorical encoders:

| **Category** | **Description** |
|:---------------------------:|:-------------------------------------------------------------------------------:|
-| **Numerical Transformers** | Transformers that operate on `Continuous` or `Count` columns in a given dataset.|
-| **Classical Encoders** | Widely recognized and frequently utilized categorical encoders. |
-| **Neural-based Encoders** | Categorical encoders based on neural networks. |
-| **Contrast Encoders** | Categorical encoders modeled via a contrast matrix. |
-| **Utility Encoders** | Categorical encoders meant to be used as preprocessors for other encoders or models.|
-| **Other Transformers** | Transformers that fall into other categories. |
+| [Numerical Transformers](transformers/numerical) | Transformers that operate on `Continuous` or `Count` columns in a given dataset.|
+| [Classical Encoders](transformers/classical.md) | Traditional categorical encoding algorithms and techniques. |
+| [Neural-based Encoders](transformers/neural) | Categorical encoders based on neural networks. |
+| [Contrast Encoders](transformers/contrast.md) | Categorical encoders that could be modeled via a contrast matrix. |
+| [Utility Encoders](transformers/utility.md) | Categorical encoders meant to be used as preprocessors for other transformers or models.|
+| [Other Transformers](transformers/others.md) | Transformers that operate on scientific types that are neither `Finite` nor `Infinite` |

80 changes: 80 additions & 0 deletions docs/src/transformers/all_transformers.md
@@ -0,0 +1,80 @@
| Transformer | Brief Description |
|:----------:|:----------:|
| [Standardizer](@ref) | Transforming columns of numerical features by standardization |
| [BoxCoxTransformer](@ref) | Transforming columns of numerical features by BoxCox transformation |
| [UnivariateBoxCoxTransformer](@ref) | Apply BoxCox transformation given a single vector |
| [InteractionTransformer](@ref) | Transforming columns of numerical features to create new interaction features |
| [UnivariateDiscretizer](@ref) | Discretize a continuous vector into an ordered factor |
| [FillImputer](@ref) | Fill missing values of features belonging to any scientific type |
| [UnivariateTimeTypeToContinuous](@ref) | Transform a vector of time type into continuous type |
| [OneHotEncoder](@ref) | Encode categorical variables into one-hot vectors |
| [ContinuousEncoder](@ref) | Adds type casting functionality to OneHotEncoder |
| [OrdinalEncoder](@ref) | Encode categorical variables into ordered integers |
| [FrequencyEncoder](@ref) | Encode categorical variables into their normalized or unnormalized frequencies |
| [TargetEncoder](@ref) | Encode categorical variables into relevant target statistics |
| [DummyEncoder](@ref) | Encodes by comparing each level to the reference level, intercept being the cell mean of the reference group |
| [SumEncoder](@ref) | Encodes by comparing each level to the reference level, intercept being the grand mean |
| [HelmertEncoder](@ref) | Encodes by comparing levels of a variable with the mean of the subsequent levels of the variable |
| [ForwardDifferenceEncoder](@ref) | Encodes by comparing adjacent levels of a variable (each level minus the next level) |
| [ContrastEncoder](@ref) | Allows defining a custom contrast encoder via a contrast matrix |
| [HypothesisEncoder](@ref) | Allows defining a custom contrast encoder via a hypothesis matrix |
| [EntityEmbedder](@ref) | Encode categorical variables into dense embedding vectors |
| [CardinalityReducer](@ref) | Reduce cardinality of high cardinality categorical features by grouping infrequent categories |
| [MissingnessEncoder](@ref) | Encode missing values of categorical features into new values |
Member

Yes! Thank you.



```@docs; canonical = false
MLJTransforms.Standardizer
```

```@docs; canonical = false
MLJTransforms.InteractionTransformer
```

```@docs; canonical = false
MLJTransforms.BoxCoxTransformer
```

```@docs; canonical = false
MLJTransforms.UnivariateDiscretizer
```

```@docs; canonical = false
MLJTransforms.FillImputer
```

```@docs; canonical = false
MLJTransforms.UnivariateTimeTypeToContinuous
```

```@docs; canonical = false
MLJTransforms.OneHotEncoder
```

```@docs; canonical = false
MLJTransforms.ContinuousEncoder
```

```@docs; canonical = false
MLJTransforms.OrdinalEncoder
```

```@docs; canonical = false
MLJTransforms.FrequencyEncoder
```

```@docs; canonical = false
MLJTransforms.TargetEncoder
```

```@docs; canonical = false
MLJTransforms.ContrastEncoder
```

```@docs; canonical = false
MLJTransforms.CardinalityReducer
```

```@docs; canonical = false
MLJTransforms.MissingnessEncoder
```
2 changes: 1 addition & 1 deletion docs/src/transformers/neural.md
@@ -2,7 +2,7 @@ Neural-based Encoders include categorical encoders based on neural networks:

| Transformer | Brief Description |
|:----------:|:----------:|
-| [EntityEmbedders](@ref) | Encode categorical variables into dense embedding vectors |
+| [EntityEmbedder](@ref) | Encode categorical variables into dense embedding vectors |


Entity Embedder docstring will go here.
5 changes: 5 additions & 0 deletions docs/src/transformers/numerical.md
@@ -7,6 +7,7 @@ Other Transformers include more generic transformers that go beyond categorical
| [UnivariateBoxCoxTransformer](@ref) | Apply BoxCox transformation given a single vector |
| [InteractionTransformer](@ref) | Transforming columns of numerical features to create new interaction features |
| [UnivariateDiscretizer](@ref) | Discretize a continuous vector into an ordered factor |
+| [FillImputer](@ref) | Fill missing values of features belonging to any finite or infinite scientific type |
Member

FillImputer can be used to impute into categorical (e.g. with mode) or numerical (e.g. with median) features. So how does that fit into your scheme?
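
The point about mixed feature types can be sketched as follows. This is an illustration only and not part of the PR: it assumes `FillImputer` is reachable through `using MLJ` and relies on its default fill rules (median for `Continuous`, rounded median for `Count`, mode for `Finite`). The toy table `X` is made up.

```julia
using MLJ  # assumption: MLJ re-exports FillImputer, machine, fit!, transform, coerce

# Hypothetical toy table: one column per kind of feature, each with missing values.
X = (
    a = [1.0, 2.0, missing, 3.0, missing],                  # Continuous (numerical)
    b = coerce(["y", "n", "y", missing, "y"], Multiclass),  # Multiclass (categorical)
    c = [1, 1, 2, missing, 3],                              # Count (numerical)
)

imputer = FillImputer()          # default fill rule for each supported scitype
mach = machine(imputer, X)
fit!(mach, verbosity=0)
Xfilled = transform(mach, X)

# The same transformer has imputed numerical and categorical columns alike.
schema(Xfilled)
```

Because one model fills all three columns, a taxonomy that sorts transformers strictly by "numerical input" versus "categorical input" has no single obvious slot for it.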

Collaborator Author

> In MLJTransforms we denote transformers that can operate on columns with Continuous and/or Count scientific types as numerical transformers.

So it suffices that it can operate on infinite types.

Collaborator Author

Now I do see that the taxonomy makes much more sense for categorical encoders (as opposed to transformers that aren't simply encoders), especially since entity embeddings, contrast encoders and utility encoders are all atypical encoders and deserve better exposure (aside from helping with organization).

What do you think about the following, if I can do it by next Monday:

  • Split this package into two packages, MLJEncoding and MLJTransforms, with the former carrying the encoder methods and the latter the broader category of transformers?

It doesn't seem like a lot of effort to me, and it's intuitive in the sense that encoding packages do tend to be standalone in other languages (e.g., Python), as they constitute a specific type of transformer that is widely needed.

Member

@ablaom Jun 20, 2025

> So it suffices that it can operate on infinite types.

Not sure what you are getting at. My point is your taxonomy, as I understand it, splits algorithms according to whether they operate on numerical or categorical features, no? But FillImputer operates on both kinds. So where do you put it? You put it under numerical.md, but that's not right, is it? Or is that page name misleading ... does "numerical.md" just mean "not a categorical encoder with continuous output"?

> What do you think about the following, if I can do it by next Monday:
>
> Split this package into two packages MLJEncoding and MLJTransforms. The former to carry the encoder methods and the latter for the broader category of transformers?

I still think we should be careful about usurping "encoder" as a word used exclusively in the context of categorical input data. Autoencoders and variational encoders are two very important examples where the input is not necessarily categorical (typically, it's just the output that is categorical, or a categorical pdf). Maybe we should say "categorical encoding" (and MLJCategoricalEncoding is a more informative pkg name). (And who is to say what a "transformer" is, especially now that LLMs have usurped this already ubiquitous term for a rather specialised use case?)

If you want to split the package, believe this will help users, and have the time to do it quickly, I don't object. I don't think there is any maintenance benefit in doing so; in fact, it is probably more of a maintenance burden: extra code fragmentation that doesn't seem justified from a dev point of view. You could alternatively achieve the separation you are after in the way documentation is organised. For example, you could have separate doc pages both living at MLJ.jl (which is where the current MLJTransforms.jl MLJModels.jl docs, and most other documentation, lives).

But, I'll support whatever option you're happy to work out.

Collaborator Author

@EssamWisam Jun 20, 2025

> Not sure what you are getting at.

Sorry if I wasn't clear. I meant that the definition "we denote transformers that can operate on columns with Continuous and/or Count scientific types" implies that any transformer that "can" take these types will be considered numerical, which is why I put FillImputer there. So it's as if the taxonomy distinguishes those transformers that "can" take numerical types from those that only take categorical ones.

I don't claim this is the best approach and I am open to recommendations. What do you think about adding another category, Multi-type Transformers? With this, we would have numerical, categorical, multi-type and other transformers, which covers all possible scitypes.

Collaborator Author

> in fact probably more of a maintenance burden: extra code fragmentation that doesn't seem justified from a dev point of view.

Okay, then I am no longer motivated to do that and think improving the taxonomy could be sufficient.

Collaborator Author

Never mind my recommendation, as numerical transformers already operate on multiple scientific types.
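
The scientific-type distinctions this thread relies on can be checked directly with ScientificTypes.jl. The sketch below is illustrative only: `taxonomy_bucket` is a hypothetical helper (it is not part of MLJTransforms) that mirrors the grouping described in the docs, where numerical means `Infinite`, categorical means `Finite`, and everything else is "other".

```julia
using ScientificTypes   # scitype plus the scientific type hierarchy
using Dates

# Hypothetical helper, not part of any package: map a scitype to a docs category.
function taxonomy_bucket(T::Type)
    T <: Infinite && return :numerical     # Continuous and Count
    T <: Finite   && return :categorical   # Multiclass and OrderedFactor
    return :other                          # e.g. time types
end

# Both numerical scitypes sit under `Infinite`.
@assert Continuous <: Infinite
@assert Count <: Infinite

# Both categorical scitypes sit under `Finite`.
@assert Multiclass <: Finite
@assert OrderedFactor <: Finite

# A date is neither `Finite` nor `Infinite`, so it lands in the "other" bucket.
date_type = scitype(Date(2025, 6, 20))     # ScientificDate

buckets = (
    float = taxonomy_bucket(scitype(1.5)),
    int   = taxonomy_bucket(scitype(3)),
    date  = taxonomy_bucket(date_type),
)
```

A transformer such as FillImputer accepts columns from both the `Finite` and `Infinite` branches, which is why a per-scitype bucket like this one cannot classify it on its own.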


```@docs
MLJTransforms.Standardizer
@@ -23,3 +24,7 @@ MLJTransforms.BoxCoxTransformer
```@docs
MLJTransforms.UnivariateDiscretizer
```
+
+```@docs
+MLJTransforms.FillImputer
+```
7 changes: 1 addition & 6 deletions docs/src/transformers/others.md
@@ -1,14 +1,9 @@
-Transformers that operate on columns with general or specialized scientific types.
+Transformers that operate on scientific types that are neither `Finite` nor `Infinite`.

| Transformer | Brief Description |
|:----------:|:----------:|
-| [FillImputer](@ref) | Fill missing values of features belonging to any scientific type |
| [UnivariateTimeTypeToContinuous](@ref) | Transform a vector of time type into continuous type |

-```@docs
-MLJTransforms.FillImputer
-```
-

```@docs
MLJTransforms.UnivariateTimeTypeToContinuous