Conversation

@EssamWisam (Collaborator) commented Sep 15, 2024

This is a draft for the documentation page of MLJTransforms.

@ablaom addition:

Some further plan of action:

@EssamWisam self-assigned this Sep 15, 2024

@EssamWisam (Collaborator, Author)

@ablaom this is a draft, so no need to review yet. That said, you can take a look at what I did so far to get an idea of what's on my mind. The current taxonomy (along with the descriptions) can very likely be improved, but I am a strong fan of having some sort of taxonomy anyway; this is just an initial draft.


## Available Transformers
In `MLJTransforms` we define "encoders" to encompass models that specifically operate by encoding categorical variables; meanwhile, "transformers" refers to models that apply more generic transformations to columns that are not necessarily categorical. We define the following taxonomy for the different models found in `MLJTransforms`:
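Below, a minimal sketch of the distinction, using the MLJ-exported OneHotEncoder and Standardizer as illustrative stand-ins (the toy table is hypothetical, not from the draft): the encoder acts only on the categorical column, while the generic transformer acts on the continuous one.

```julia
using MLJ

X = (grade = categorical(["low", "high", "low"]),  # a categorical column
     score = [1.0, 2.0, 3.0])                      # a continuous column

# An encoder: maps the categorical column to numerical indicator columns
enc_mach = machine(OneHotEncoder(), X) |> fit!
transform(enc_mach, X)  # columns grade__high, grade__low, score

# A generic transformer: standardizes the continuous column, leaves grade alone
std_mach = machine(Standardizer(), X) |> fit!
transform(std_mach, X)
```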

ablaom (Member):

As you know, I'm not a big fan of enforcing taxonomy on models. That said, this is docs and not API, and your suggestion looks reasonable. Here are the things I don't like about it:

  • Many will object to your definition of "encoder", i.e., that to be an encoder the input must be categorical. One can immediately think of counterexamples, and there is a real danger that we later add a model whose provider insists it is an "encoder" but which breaks your definition. I suggest you avoid making this formal definition. By the way, it is different from the one given here.

  • It feels like categorical encoders are given a very heavy weight compared to the other transformers. If you are skimming the documentation quickly, you might decide that something like a Standardizer is not provided here. Perhaps the suggestion below can counter this.

  • Related to the above: while it may have advantages as an organising principle, structuring the docs this way actually introduces an extra layer, making it harder to find the transformer you want quickly. In the first place, I think software docs are not for educating but for facilitating: when I use software docs, I mostly want to "find" something, not "learn" something. I'm not suggesting you abandon the taxonomy, just giving you some feedback to take into account.

@ablaom (Member) commented Sep 15, 2024

Thanks @EssamWisam for this substantial effort. I have had a brief look and made some comments.

@EssamWisam (Collaborator, Author)

@ablaom I tried to refine what I did last time. This time you can review the text content before I proceed with adding tutorials and other additions.

@ablaom (Member) commented Sep 30, 2024

Thanks for these further improvements.

Let me re-iterate my unhappiness about enforcing a classification in this document. To me it's just an extra layer in the way of me finding what I'm looking for. Once you have more than about 3 categories, any classification becomes more a cognitive burden than a help, in my opinion, and now you have 6.

And I worry about new transformers breaking whatever classification you come up with. This is based on my own personal experience, and I've been doing this a while now.

@EssamWisam (Collaborator, Author)

> Let me re-iterate my unhappiness about enforcing a classification in this document. To me it's just an extra layer in the way of me finding what I'm looking for. Once you have more than about 3 categories, any classification becomes more a cognitive burden than a help, in my opinion, and now you have 6.
>
> And I worry about new transformers breaking whatever classification you come up with. This is based on my own personal experience, and I've been doing this a while now.

I don't know why, but my impression is that such a classification makes it easier to find methods and understand the package. I always highly value it when I read some documentation and can immediately understand everything the package does or offers. I do agree that people should go to manuals to look for the existence of methods and how to use them, but I think that, specifically in machine learning, people appreciate it when there is some explanation. E.g., scikit-learn often has a one-line explanation for each of its methods, because otherwise one kind of assumes that everyone visiting the package is an ML expert.

Looking into the taxonomy here:

Numerical Transformers, Classical Encoders, Neural-based Encoders, Contrast Encoders, Utility Encoders, Other Transformers

  • The terms "classical" and "utility" are used in their widely accepted meanings (e.g., as in classical machine learning, utility function, etc.).
  • The neural-based and contrast encoder categories are, in my opinion, helpful because some people will be looking specifically for these types, while many others will be interested only in other types (e.g., just the classical ones).
  • The numerical and other transformer categories also make it possible to find the transformers that are not encoders.

I think the taxonomy here is mostly total, in that any new categorical encoder or transformer will map to something (almost certainly so for transformers). There is only a slight chance that another type of encoder would need to be added if one comes up, which shouldn't break anything and is highly unlikely.

Also, NB: in class imbalance we had oversampling, undersampling, hybrid, and ensemble.

@ablaom (Member) commented Sep 30, 2024

Thanks @EssamWisam for considering my arguments. I'm afraid I remain unconvinced by yours.

As a compromise, can we simply add a combined list of all the transformers to the index.html page, at the bottom if you prefer? Your organization of the docstrings can remain unaltered.

It might be helpful to change your entries in the first column of the table on index.html into actual links to the relevant sections.

@EssamWisam (Collaborator, Author)

> Thanks @EssamWisam for considering my arguments. I'm afraid I remain unconvinced by yours.
>
> As a compromise, can we simply add a combined list of all the transformers to the index.html page, at the bottom if you prefer? Your organization of the docstrings can remain unaltered.
>
> It might be helpful to change your entries in the first column of the table on index.html into actual links to the relevant sections.

This middle ground highly resonates with me, and I will be happy to apply it. As for your argument, I do agree that the structure can make it harder for someone to see all the methods at once and find something (if they don't use search), but to me an intuitive structure carries a lot of value, and your middle ground is a perfect solution. I don't know why, but my impression is that a lot of people like it when documentation is highly structured and conveys some helpful insight. I also agree that overdoing it is not good, but hopefully that's not the case here.

Thank you so much.

@EssamWisam (Collaborator, Author) commented Sep 30, 2024

@ablaom I will likely proceed as follows for the tutorials; let me hear your thoughts:

  • A tutorial on standardization and how it can make a huge difference in performance (see the sketch below)
  • A tutorial comparing the classical/widely known encoders with each other
  • Potentially, a tutorial analyzing overfitting in target encoding
  • A tutorial on entity embeddings (e.g., with some plots to showcase what they do)
  • A tutorial comparing the different contrast encoders (if they do show different performance), or else just demonstrating their use
  • A tutorial comparing a one-hot encoder alone with a one-hot encoder preceded by a cardinality reducer (to see an improvement in runtime and potentially in performance)

Let me know if you have any other thoughts. I may give higher priority to making an interface for entity embeddings first, now that MLJFlux exposes it in the new release, and I will likely get back to this next weekend.
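For the first bullet, a rough sketch of the kind of comparison I have in mind (the data set and KNN model here are illustrative choices, not fixed decisions): distance-based learners are sensitive to feature scales, so standardizing first can change performance noticeably.

```julia
using MLJ

KNN = @load KNNClassifier pkg=NearestNeighborModels verbosity=0
X, y = @load_iris  # stand-in data set for illustration

bare   = KNN()                    # no scaling
scaled = Standardizer() |> KNN()  # standardize, then KNN

for model in (bare, scaled)
    e = evaluate(model, X, y,
                 resampling = CV(nfolds=5, shuffle=true, rng=123),
                 measure    = accuracy,
                 operation  = predict_mode)  # point predictions for accuracy
    println(e.measurement)
end
```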

@ablaom (Member) commented Sep 30, 2024

These all sound like useful tutorials. Thanks for offering to write them!

@ablaom (Member) commented May 12, 2025

Any progress here?

@EssamWisam (Collaborator, Author)

Yes, working on it.

@EssamWisam (Collaborator, Author)

@ablaom, can you have a look here before I get back to this next Sunday to integrate the entity embedder and potentially a couple of tutorials?

Let me know if you would like me to remove the table in the API index; I did view it as a little overwhelming.

Note that the definitions I used for transformers and categorical encoders are mostly inspired by scikit-learn, and I also improved the definition in the taxonomy. (Honestly, I thought for a while about just dropping the taxonomy, but I really felt that there is a degree of obvious clustering among the methods and that it could be overwhelming otherwise.)

Let me know if you have other structural (or non-structural) recommendations.

@@ -7,6 +7,7 @@ Other Transformers include more generic transformers that go beyond categorical
| [UnivariateBoxCoxTransformer](@ref) | Apply BoxCox transformation given a single vector |
| [InteractionTransformer](@ref) | Transforming columns of numerical features to create new interaction features |
| [UnivariateDiscretizer](@ref) | Discretize a continuous vector into an ordered factor |
| [FillImputer](@ref) | Fill missing values of features belonging to any finite or infinite scientific type |
ablaom (Member):

FillImputer can be used to impute into categorical (e.g., with the mode) or numerical (e.g., with the median) features. So how does that fit into your scheme?

EssamWisam (Collaborator, Author):

> In MLJTransforms we denote transformers that can operate on columns with Continuous and/or Count scientific types as numerical transformers.

So it suffices that it can operate on infinite types.
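For what it's worth, here is a minimal sketch of the behaviour in question (the toy table is hypothetical): FillImputer handles both kinds of columns in one pass, which is exactly why it sits awkwardly in a numerical/categorical split.

```julia
using MLJ

X = (age  = [23.0, missing, 41.0, 35.0],            # Continuous, with a missing
     city = categorical(["A", "B", missing, "A"]))  # Multiclass, with a missing

mach = machine(FillImputer(), X) |> fit!
transform(mach, X)  # median imputed into :age, mode imputed into :city
```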

EssamWisam (Collaborator, Author):

Now I do perceive that the taxonomy makes a lot more sense for categorical encoders (as opposed to transformers that aren't simply encoders), especially since entity embeddings, contrast encoders, and utility encoders are all atypical encoders and deserve better exposure (aside from helping with organization).

What do you think about the following, if I can do it by next Monday:

  • Split this package into two packages, MLJEncoding and MLJTransforms: the former to carry the encoder methods and the latter the broader category of transformers?

It doesn't seem like a lot of effort to me, and it's intuitive in the sense that encoding packages do indeed tend to be standalone in other languages (e.g., Python), as they constitute a specific type of transformer that is widely needed.

@ablaom (Member) commented Jun 20, 2025

> So it suffices that it can operate on infinite types.

Not sure what you are getting at. My point is that your taxonomy, as I understand it, splits algorithms according to whether they operate on numerical or categorical features, no? But FillImputer operates on both kinds. So where do you put it? You put it under numerical.md, but that's not right, is it? Or is that page name misleading? Does "numerical.md" just mean "not a categorical encoder with continuous output"?

> What do you think about the following, if I can do it by next Monday:

> Split this package into two packages, MLJEncoding and MLJTransforms? The former to carry the encoder methods and the latter the broader category of transformers?

I still think we should be careful not to usurp "encoder" as a word to be used exclusively in the context of categorical input data. Autoencoders and variational autoencoders are two very important examples where the input is not necessarily categorical (typically it's just the output that is categorical, or a categorical pdf). Maybe we should say "categorical encoding" (and MLJCategoricalEncoding is a more informative package name). (And who is to say what a "transformer" is, especially now that LLMs have usurped this already ubiquitous term for a rather specialised use case?)

If you want to split the package, believe this will help users, and have the time to do it quickly, I don't object. I don't think there is any maintenance benefit in doing so; in fact, probably more of a maintenance burden: extra code fragmentation that doesn't seem justified from a dev point of view. You could alternatively achieve the separation you are after in the way the documentation is organised. For example, you could have separate doc pages, both living at MLJ.jl (which is where the current MLJTransforms.jl and MLJModels.jl docs, and most other documentation, live).

But, I'll support whatever option you're happy to work out.

@EssamWisam (Collaborator, Author) commented Jun 20, 2025

> Not sure what you are getting at.

Sorry if I wasn't clear. I meant that the definition "we denote transformers that can operate on columns with Continuous and/or Count scientific types as numerical transformers" implies that any transformer that can take these types will be considered numerical, which is why I put FillImputer there. So it's as if the taxonomy distinguishes the transformers that can take numerical types from those that only take categorical ones.

I don't claim this is the best approach, and I am open to recommendations. What do you think about adding another category, "multi-type transformers"? With that, we would have numerical, categorical, multi-type, and other transformers, which covers all possible scitypes.

EssamWisam (Collaborator, Author):

> in fact, probably more of a maintenance burden: extra code fragmentation that doesn't seem justified from a dev point of view.

Okay, then I am no longer motivated to do that, and I think improving the taxonomy could be sufficient.

EssamWisam (Collaborator, Author):

Never mind my recommendation, as the numerical transformers already operate on multiple scientific types.

@ablaom (Member) left a comment:

I'm happy so long as my two new comments are addressed. Thanks for creating the table listing all the transformers.

@EssamWisam (Collaborator, Author)

@ablaom will use the docstring from here (cross-package) after it's merged: FluxML/MLJFlux.jl#306

rmse_value = MLJ.root_mean_squared_error(y[test], predictions)  # test-set RMSE for this encoder
push!(results, (name, rmse_value, training_time))               # record name, error, and fit time
end

ablaom (Member):

Again, and in the first tutorial as well, you could make use of `evaluate(pipe, X, y, resampling=[(train, test),], measure=rmse)` for a shorter workflow. Or at least mention that this is available to do the same thing.
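For concreteness, a self-contained sketch of this suggestion (the data, model, and split below are stand-ins; in the tutorial, `pipe`, `train`, and `test` already exist):

```julia
using MLJ

X, y = make_regression(100, 3)  # synthetic stand-in data
train, test = partition(eachindex(y), 0.7, shuffle=true, rng=123)  # one 70/30 split
Ridge = @load RidgeRegressor pkg=MLJLinearModels verbosity=0
pipe  = Standardizer() |> Ridge()

# One call replaces the manual fit!/predict/rmse bookkeeping:
e = evaluate(pipe, X, y,
             resampling = [(train, test)],  # one explicit (train, test) pair
             measure    = rmse)
e.measurement  # the test-set RMSE
```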

EssamWisam (Collaborator, Author):

Done.

@ablaom (Member) left a comment:

I've reviewed the text (the .jl files) for the new tutorials but haven't yet checked the rendering locally. Let me do that after we rebase later, as per the "plan" in this comment, now linked in the opening post for this PR.

Thanks again for your valuable contributions, @EssamWisam

@EssamWisam (Collaborator, Author)

@ablaom I believe I have addressed all your comments, excluding those about changing the imports, which require registration first.
