Conversation

@EssamWisam (Collaborator) commented Sep 15, 2024

This is a draft for the documentation page of MLJTransforms.

@ablaom addition:

Some further plan of action:

@EssamWisam self-assigned this Sep 15, 2024

@EssamWisam (Collaborator, Author)

@ablaom this is a draft, so no need to review yet. That said, you can take a look at what I did so far to get an idea of what's on my mind. The current taxonomy (along with the descriptions) can very likely be improved, but I am a strong fan of having some sort of taxonomy anyway; this is just an initial draft.


## Available Transformers
In `MLJTransforms` we define "encoders" to encompass models that specifically operate by encoding categorical variables; meanwhile, "transformers" refers to models that apply more generic transformations to columns that are not necessarily categorical. We define the following taxonomy for the different models found in `MLJTransforms`:
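Below, a minimal sketch of the distinction, using the MLJ-exported OneHotEncoder and Standardizer as illustrative stand-ins (the toy table is hypothetical, not from the draft): the encoder acts only on the categorical column, while the generic transformer acts on the continuous one.

```julia
using MLJ

X = (grade = categorical(["low", "high", "low"]),  # a categorical column
     score = [1.0, 2.0, 3.0])                      # a continuous column

# An encoder: maps the categorical column to numerical indicator columns
enc_mach = machine(OneHotEncoder(), X) |> fit!
transform(enc_mach, X)  # columns grade__high, grade__low, score

# A generic transformer: standardizes the continuous column, leaves grade alone
std_mach = machine(Standardizer(), X) |> fit!
transform(std_mach, X)
```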

ablaom (Member):

As you know, I'm not a big fan of enforcing taxonomy on models. That said, this is docs and not API, and your suggestion looks reasonable. Here are the things I don't like about it:

  • Many will object to your definition of "encoder", i.e., that to be an encoder the input must be categorical. One can immediately think of counterexamples, and there is a real danger that we later add a model whose provider insists it is an "encoder" but which breaks your definition. I suggest you avoid making this formal definition. By the way, it is different from the one given here.

  • It feels like categorical encoders are given a very heavy weight compared to the other transformers. If you are skimming the documentation quickly, you might decide that something like a Standardizer is not provided here. Perhaps the suggestion below can counter this.

  • Related to the above: while it may have advantages as an organising principle, structuring the docs this way actually introduces an extra layer, making it harder to find the transformer you want quickly. In the first place, I think software docs are not for educating but for facilitating: when I use software docs, I mostly want to "find" something, not "learn" something. I'm not suggesting you abandon the taxonomy, just giving you some feedback to take into account.

@ablaom (Member) commented Sep 15, 2024

Thanks @EssamWisam for this substantial effort. I have had a brief look and made some comments.

@EssamWisam (Collaborator, Author)

@ablaom I tried to refine what I did last time. This time you can review the text content before I proceed with adding tutorials and other additions.

@ablaom (Member) commented Sep 30, 2024

Thanks for these further improvements.

Let me re-iterate my unhappiness about enforcing a classification in this document. To me it's just an extra layer in the way of me finding what I'm looking for. Once you have more than about 3 categories, any classification becomes more a cognitive burden than a help, in my opinion, and now you have 6.

And I worry about new transformers breaking whatever classification you come up with. This is based on my own personal experience, and I've been doing this a while now.

@EssamWisam (Collaborator, Author)

> Let me re-iterate my unhappiness about enforcing a classification in this document. To me it's just an extra layer in the way of me finding what I'm looking for. Once you have more than about 3 categories, any classification becomes more a cognitive burden than a help, in my opinion, and now you have 6.
>
> And I worry about new transformers breaking whatever classification you come up with. This is based on my own personal experience, and I've been doing this a while now.

I don't know why, but my impression is that such a classification makes it easier to find methods and understand the package. I always highly value it when I read some documentation and can immediately understand everything the package does or offers. I do agree that people should go to manuals to look for the existence of methods and how to use them, but I think that, specifically in machine learning, people appreciate it when there is some explanation. E.g., scikit-learn often has a one-line explanation for each of its methods, because otherwise one kind of assumes that everyone visiting the package is an ML expert.

Looking into the taxonomy here:

Numerical Transformers, Classical Encoders, Neural-based Encoders, Contrast Encoders, Utility Encoders, Other Transformers

  • The terms "classical" and "utility" are used in their widely accepted meanings (e.g., as in classical machine learning, utility function, etc.).
  • The neural-based and contrast encoder categories are, in my opinion, helpful because some people will be looking specifically for these types, while many others will be interested only in other types (e.g., just the classical ones).
  • The numerical and other transformer categories also make it possible to find the transformers that are not encoders.

I think the taxonomy here is mostly total, in that any new categorical encoder or transformer will map to something (almost certainly so for transformers). There is only a slight chance that another type of encoder would need to be added if one comes up, which shouldn't break anything and is highly unlikely.

Also, NB: in class imbalance we had oversampling, undersampling, hybrid, and ensemble.

@ablaom (Member) commented Sep 30, 2024

Thanks @EssamWisam for considering my arguments. I'm afraid I remain unconvinced by yours.

As a compromise, can we simply add a combined list of all the transformers to the index.html page, at the bottom if you prefer? Your organization of the docstrings can remain unaltered.

It might be helpful to change your entries in the first column of the table on index.html into actual links to the relevant sections.

@EssamWisam (Collaborator, Author)

> Thanks @EssamWisam for considering my arguments. I'm afraid I remain unconvinced by yours.
>
> As a compromise, can we simply add a combined list of all the transformers to the index.html page, at the bottom if you prefer? Your organization of the docstrings can remain unaltered.
>
> It might be helpful to change your entries in the first column of the table on index.html into actual links to the relevant sections.

This middle ground highly resonates with me, and I will be happy to apply it. As for your argument, I do agree that the structure can make it harder for someone to see all the methods at once and find something (if they don't use search), but to me an intuitive structure carries a lot of value, and your middle ground is a perfect solution. I don't know why, but my impression is that a lot of people like it when documentation is highly structured and conveys some helpful insight. I also agree that overdoing it is not good, but hopefully that's not the case here.

Thank you so much.

@EssamWisam (Collaborator, Author) commented Sep 30, 2024

@ablaom I will likely proceed as follows for the tutorials; let me hear your thoughts:

  • A tutorial on standardization and how it can make a huge difference in performance (see the sketch below)
  • A tutorial comparing the classical/widely known encoders with each other
  • Potentially, a tutorial analyzing overfitting in target encoding
  • A tutorial on entity embeddings (e.g., with some plots to showcase what they do)
  • A tutorial comparing the different contrast encoders (if they do show different performance), or else just demonstrating their use
  • A tutorial comparing a one-hot encoder alone with a one-hot encoder preceded by a cardinality reducer (to see an improvement in runtime and potentially in performance)

Let me know if you have any other thoughts. I may give higher priority to making an interface for entity embeddings first, now that MLJFlux exposes it in the new release, and I will likely get back to this next weekend.
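For the first bullet, a rough sketch of the kind of comparison I have in mind (the data set and KNN model here are illustrative choices, not fixed decisions): distance-based learners are sensitive to feature scales, so standardizing first can change performance noticeably.

```julia
using MLJ

KNN = @load KNNClassifier pkg=NearestNeighborModels verbosity=0
X, y = @load_iris  # stand-in data set for illustration

bare   = KNN()                    # no scaling
scaled = Standardizer() |> KNN()  # standardize, then KNN

for model in (bare, scaled)
    e = evaluate(model, X, y,
                 resampling = CV(nfolds=5, shuffle=true, rng=123),
                 measure    = accuracy,
                 operation  = predict_mode)  # point predictions for accuracy
    println(e.measurement)
end
```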

@ablaom (Member) commented Sep 30, 2024

These all sound like useful tutorials. Thanks for offering to write them!

@ablaom (Member) commented May 12, 2025

Any progress here?

@EssamWisam (Collaborator, Author)

Yes, working on it.

@EssamWisam (Collaborator, Author)

@ablaom, can you have a look here before I get back to this next Sunday to integrate the entity embedder and potentially a couple of tutorials?

Let me know if you would like me to remove the table in the API index; I did view it as a little overwhelming.

Note that the definitions I used for transformers and categorical encoders are mostly inspired by scikit-learn, and I also improved the definition in the taxonomy. (Honestly, I thought for a while about just dropping the taxonomy, but I really felt that there is a degree of obvious clustering among the methods and that it could be overwhelming otherwise.)

Let me know if you have other structural (or non-structural) recommendations.

@@ -7,6 +7,7 @@ Other Transformers include more generic transformers that go beyond categorical
| [UnivariateBoxCoxTransformer](@ref) | Apply BoxCox transformation given a single vector |
| [InteractionTransformer](@ref) | Transforming columns of numerical features to create new interaction features |
| [UnivariateDiscretizer](@ref) | Discretize a continuous vector into an ordered factor |
| [FillImputer](@ref) | Fill missing values of features belonging to any finite or infinite scientific type |
ablaom (Member):

FillImputer can be used to impute into categorical (e.g., with the mode) or numerical (e.g., with the median) features. So how does that fit into your scheme?

EssamWisam (Collaborator, Author):

> In MLJTransforms we denote transformers that can operate on columns with Continuous and/or Count scientific types as numerical transformers.

So it suffices that it can operate on infinite types.
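For what it's worth, here is a minimal sketch of the behaviour in question (the toy table is hypothetical): FillImputer handles both kinds of columns in one pass, which is exactly why it sits awkwardly in a numerical/categorical split.

```julia
using MLJ

X = (age  = [23.0, missing, 41.0, 35.0],            # Continuous, with a missing
     city = categorical(["A", "B", missing, "A"]))  # Multiclass, with a missing

mach = machine(FillImputer(), X) |> fit!
transform(mach, X)  # median imputed into :age, mode imputed into :city
```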

EssamWisam (Collaborator, Author):

Now I do perceive that the taxonomy makes a lot more sense for categorical encoders (as opposed to transformers that aren't simply encoders), especially since entity embeddings, contrast encoders, and utility encoders are all atypical encoders and deserve better exposure (aside from helping with organization).

What do you think about the following, if I can do it by next Monday:

  • Split this package into two packages, MLJEncoding and MLJTransforms: the former to carry the encoder methods and the latter the broader category of transformers?

It doesn't seem like a lot of effort to me, and it's intuitive in the sense that encoding packages do indeed tend to be standalone in other languages (e.g., Python), as they constitute a specific type of transformer that is widely needed.

@ablaom (Member) commented Jun 20, 2025

> So it suffices that it can operate on infinite types.

Not sure what you are getting at. My point is that your taxonomy, as I understand it, splits algorithms according to whether they operate on numerical or categorical features, no? But FillImputer operates on both kinds. So where do you put it? You put it under numerical.md, but that's not right, is it? Or is that page name misleading? Does "numerical.md" just mean "not a categorical encoder with continuous output"?

> What do you think about the following, if I can do it by next Monday:

> Split this package into two packages, MLJEncoding and MLJTransforms? The former to carry the encoder methods and the latter the broader category of transformers?

I still think we should be careful not to usurp "encoder" as a word to be used exclusively in the context of categorical input data. Autoencoders and variational autoencoders are two very important examples where the input is not necessarily categorical (typically it's just the output that is categorical, or a categorical pdf). Maybe we should say "categorical encoding" (and MLJCategoricalEncoding is a more informative package name). (And who is to say what a "transformer" is, especially now that LLMs have usurped this already ubiquitous term for a rather specialised use case?)

If you want to split the package, believe this will help users, and have the time to do it quickly, I don't object. I don't think there is any maintenance benefit in doing so; in fact, probably more of a maintenance burden: extra code fragmentation that doesn't seem justified from a dev point of view. You could alternatively achieve the separation you are after in the way the documentation is organised. For example, you could have separate doc pages, both living at MLJ.jl (which is where the current MLJTransforms.jl and MLJModels.jl docs, and most other documentation, live).

But, I'll support whatever option you're happy to work out.

@EssamWisam (Collaborator, Author) commented Jun 20, 2025

> Not sure what you are getting at.

Sorry if I wasn't clear. I meant that the definition "we denote transformers that can operate on columns with Continuous and/or Count scientific types as numerical transformers" implies that any transformer that can take these types will be considered numerical, which is why I put FillImputer there. So it's as if the taxonomy distinguishes the transformers that can take numerical types from those that only take categorical ones.

I don't claim this is the best approach, and I am open to recommendations. What do you think about adding another category, "multi-type transformers"? With that, we would have numerical, categorical, multi-type, and other transformers, which covers all possible scitypes.

EssamWisam (Collaborator, Author):

> in fact, probably more of a maintenance burden: extra code fragmentation that doesn't seem justified from a dev point of view.

Okay, then I am no longer motivated to do that, and I think improving the taxonomy could be sufficient.

EssamWisam (Collaborator, Author):

Never mind my recommendation, as the numerical transformers already operate on multiple scientific types.

@ablaom (Member) left a comment:

I'm happy so long as my two new comments are addressed. Thanks for creating the table listing all the transformers.

@EssamWisam (Collaborator, Author)

@ablaom will use the docstring from here (cross-package) after it's merged: FluxML/MLJFlux.jl#306

rmse_value = MLJ.root_mean_squared_error(y[test], predictions)  # test-set RMSE for this encoder
push!(results, (name, rmse_value, training_time))               # record name, error, and fit time
end

ablaom (Member):

Again, and in the first tutorial as well, you could make use of `evaluate(pipe, X, y, resampling=[(train, test),], measure=rmse)` for a shorter workflow. Or at least mention that this is available to do the same thing.
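For concreteness, a self-contained sketch of this suggestion (the data, model, and split below are stand-ins; in the tutorial, `pipe`, `train`, and `test` already exist):

```julia
using MLJ

X, y = make_regression(100, 3)  # synthetic stand-in data
train, test = partition(eachindex(y), 0.7, shuffle=true, rng=123)  # one 70/30 split
Ridge = @load RidgeRegressor pkg=MLJLinearModels verbosity=0
pipe  = Standardizer() |> Ridge()

# One call replaces the manual fit!/predict/rmse bookkeeping:
e = evaluate(pipe, X, y,
             resampling = [(train, test)],  # one explicit (train, test) pair
             measure    = rmse)
e.measurement  # the test-set RMSE
```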

EssamWisam (Collaborator, Author):

Done.

@ablaom (Member) left a comment:

I've reviewed the text (the .jl files) for the new tutorials but haven't yet checked the rendering locally. Let me do that after we rebase later, as per the "plan" in this comment, now linked in the opening post for this PR.

Thanks again for your valuable contributions, @EssamWisam

@EssamWisam (Collaborator, Author)

@ablaom I believe I have addressed all your comments, excluding those about changing the imports, which require registration first.
