🌟 Initialize documentation #19
base: dev
Conversation
@ablaom this is a draft, no need to review yet. That said, you can take a look at what I did so far to get an idea of what's on my mind. The current taxonomy can very likely be improved (along with the descriptions), but I am a strong fan of having some sort of taxonomy anyway; this is just an initial draft.
## Available Transformers

In `MLJTransforms` we define "encoders" to encompass models that specifically operate by encoding categorical variables; meanwhile, "transformers" refers to models that apply more generic transformations on columns that are not necessarily categorical. We define the following taxonomy for the different models found in `MLJTransforms`:
As you know, I'm not a big fan of enforcing a taxonomy on models. That said, this is docs and not API, and your suggestion looks reasonable. Here are things I don't like about it:
- Many will object to your definition of "encoder", i.e., that to be an encoder the input must be categorical. One can immediately think of counterexamples, and there is a real danger that we later add a model whose provider insists is an "encoder" but which breaks your definition. I suggest you avoid making this formal definition. By the way, it is different from the one given here.
- It feels like categorical encoders have a very heavy weight compared to the other transformers. If you are skimming the documentation quickly, you might decide something like a Standardizer is not provided here. Perhaps the suggestion below can counter this.
- Related to the above: while it may have advantages as an organising principle, structuring the docs this way actually introduces an extra layer, making it harder to find the transformer you want quickly. I think software docs are in the first place not for educating, but for facilitating. When I use software docs, I mostly want to "find" something, not "learn" something. I'm not suggesting you abandon the taxonomy, just giving you some feedback to take into account.
Thanks @EssamWisam for this substantial effort. I have had a brief look and made some comments.
@ablaom I tried to refine what I did last time. This time you can review the text content before I proceed with adding tutorials and other additions.
Thanks for these further improvements. Let me reiterate my unhappiness about enforcing a classification in this document. To me it's just an extra layer in the way of me finding what I'm looking for. Once you have more than about 3 categories, any classification becomes more a cognitive burden than a help, in my opinion, and now you have 6. And I worry about new transformers breaking whatever classification you come up with. This is based on my own personal experience, and I've been doing this a while now.
I don't know why, but my impression is that such a classification makes it easier to find methods and "understand the package". I always highly value it when I read some documentation and can immediately understand everything the package does or offers. I do agree that people should go to manuals to look up the existence of methods and how to use them, but I think specifically in machine learning, people will appreciate it when there is some explanation. E.g., scikit-learn often has a one-line explanation for its methods; otherwise one kind of assumes everyone visiting the package is an ML expert. Looking at the taxonomy here: Numerical Transformers, Classical Encoders, Neural-based Encoders, Contrast Encoders, Utility Encoders, Other Transformers.
I think the taxonomy here is mostly total, in that any new categorical encoder or transformer will map to something (almost certainly for transformers). There is only a slight chance of adding another type of encoder if that comes up, which shouldn't break anything and should be highly unlikely. Also NB, in class imbalance we had: oversampling, undersampling, hybrid and ensemble.
Thanks @EssamWisam for considering my arguments. I'm afraid I remain unconvinced by yours. As a compromise, can we simply add a combined list of all the transformers to the index.html page, at the bottom if you prefer? Your organization of the docstrings can remain unaltered. It might be helpful to change your entries in the first column of the table on index.html into actual links to the relevant sections.
This middle ground highly resonates with me and I will be happy to apply it. As for your argument, I do agree that it can make it harder for someone to see all the methods at once and find something (if they don't use search), but to me having an intuitive structure carries a lot of value, and your middle ground is a perfect solution. I don't know why, but my initial impression is that a lot of people like it when documentation is highly structured and shows some helpful thought. I also agree that overdoing it is not good, but hopefully that's not the case here. Thank you so much.
@ablaom I will likely proceed as follows for the tutorials; let me hear your thoughts:
Let me know if you have any other thoughts. I may give higher priority to making an interface for entity embeddings first, now that MLJFlux exposes it in the new release, and I will likely get back to this next weekend.
These all sound like useful tutorials. Thanks for offering to write them!
Any progress here?
Yes, working on it. |
Sync docs with dev
@ablaom can you have a look here before I get back to this next Sunday to integrate the entity embedder and potentially two tutorials or so. Check whether you would like me to remove the table in the API index; I did view it as a little overwhelming. Note the definitions I used. Let me know if you have other structural (or non-structural) recommendations.
docs/src/transformers/numerical.md
Outdated
@@ -7,6 +7,7 @@ Other Transformers include more generic transformers that go beyond categorical
| [UnivariateBoxCoxTransformer](@ref) | Apply BoxCox transformation given a single vector |
| [InteractionTransformer](@ref) | Transforming columns of numerical features to create new interaction features |
| [UnivariateDiscretizer](@ref) | Discretize a continuous vector into an ordered factor |
| [FillImputer](@ref) | Fill missing values of features belonging to any finite or infinite scientific type |
`FillImputer` can be used to impute into categorical (e.g. with mode) or numerical (e.g. with median) features. So how does that fit into your scheme?
In `MLJTransforms` we denote transformers that can operate on columns with `Continuous` and/or `Count` scientific types as numerical transformers.
So it suffices that it can operate on infinite types.
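To make the criterion concrete, here is a pure-Julia sketch of the "can operate on numerical columns" rule being discussed. This is not MLJTransforms code: `SketchTransformer`, `accepted_eltypes` and `is_numerical` are illustrative inventions, and plain Julia element types stand in for scientific types.

```julia
# Hypothetical sketch: classify a transformer as "numerical" if any of the
# element types it accepts is numeric (standing in for Continuous/Count
# scientific types). `accepted_eltypes` is an illustrative field, not part
# of the MLJTransforms API.
struct SketchTransformer
    name::String
    accepted_eltypes::Vector{DataType}
end

# A transformer counts as numerical if it *can* take numeric columns,
# even when it also accepts categorical ones -- which is why FillImputer
# landed under "numerical" in the discussion above.
is_numerical(t::SketchTransformer) =
    any(T -> T <: Real, t.accepted_eltypes)

fill_imputer   = SketchTransformer("FillImputer", [Float64, Int, String])
onehot_encoder = SketchTransformer("OneHotEncoder", [String])

is_numerical(fill_imputer)    # true: it can operate on numeric columns
is_numerical(onehot_encoder)  # false: categorical input only
```

Under this reading, the split is "can take numerical input" versus "categorical input only", rather than a partition by the single kind of input a transformer handles.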
Now I do perceive that the taxonomy makes a lot more sense for categorical encoders (as opposed to transformers that aren't simply encoders), especially since entity embeddings, contrast encoders and utility encoders are all nontypical encoders and deserve better exposure (aside from helping with organization).
What do you think about the following, if I can do it by next Monday:
- Split this package into two packages, `MLJEncoding` and `MLJTransforms`: the former to carry the encoder methods and the latter for the broader category of transformers?
It doesn't seem like a lot of effort to me, and it's intuitive in the sense that encoding packages do indeed tend to be standalone in other languages (e.g., Python), as they constitute a specific type of transformer that is widely needed.
| So it suffices that it can operate on infinite types.
Not sure what you are getting at. My point is that your taxonomy, as I understand it, splits algorithms according to whether they operate on numerical or categorical features, no? But `FillImputer` operates on both kinds. So where do you put it? You put it under `numerical.md`, but that's not right, is it? Or is that page name misleading ... does "numerical.md" just mean "not a categorical encoder with continuous output"?
| What do you think about the following, if I can do it by next Monday:
| Split this package into two packages MLJEncoding and MLJTransforms. The former to carry the encoder methods and the latter for the broader category of transformers?
I still think we should be careful not to usurp "encoder" as a word used exclusively in the context of categorical input data. Autoencoders and variational encoders are two very important examples where the input is not necessarily categorical (typically, it's just the output that is categorical, or a categorical pdf). Maybe we should say "categorical encoding" (and `MLJCategoricalEncoding` is a more informative pkg name). (And who is to say what a "transformer" is, especially now that LLMs have usurped this already ubiquitous term for a rather specialised use case?)
If you want to split the package, and believe this will help users, and you have the time to do it quickly, I don't object. I don't think there is any maintenance benefit in doing so; in fact it's probably more of a maintenance burden: extra code fragmentation that doesn't seem justified from a dev point of view. You could alternatively achieve the separation you are after in the way the documentation is organised. For example, you could have separate doc pages, both living at MLJ.jl (which is where the current MLJTransforms.jl and MLJModels.jl docs, and most other documentation, live).
But, I'll support whatever option you're happy to work out.
| Not sure what you are getting at.
Sorry if I wasn't clear. I meant that the definition "we denote transformers that can operate on columns with Continuous and/or Count scientific types" implies that any transformer that "can" take these types will be considered numerical, which is why I put `FillImputer` there. So it's as if the taxonomy distinguishes those transformers that "can" take numerical types from those that only take categorical ones.
I don't claim this is the best approach and I am open to recommendations. What do you think about adding another category, `Multi-type Transformers`? By this, we have numerical, categorical, multi-type and other transformers, which covers all possible scitypes.
| in fact probably more of a maintenance burden: extra code fragmentation that doesn't seem justified from a dev point of view.
Okay, then I am no longer motivated to do that and think improving the taxonomy should be sufficient.
Never mind my recommendation, as numerical transformers already operate on multiple scientific types.
I'm happy so long as my two new comments are addressed. Thanks for creating the table listing all the transformers.
@ablaom will use the docstring from here (cross-package), after it's merged: FluxML/MLJFlux.jl#306
rmse_value = MLJ.root_mean_squared_error(y[test], predictions)
push!(results, (name, rmse_value, training_time))
end
Again, and in the first tutorial as well, you could make use of `evaluate(pipe, X, y, resampling=[(train, test),], measure=rmse)` for a shorter workflow. Or at least mention that this is available to do the same thing.
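For readers following along without MLJ at hand, here is a minimal pure-Julia sketch of what a single train/test split evaluation boils down to. The `fit_predict_mean` function is a hypothetical stand-in for fitting a model on the training rows and predicting on the test rows; only the RMSE arithmetic itself is standard.

```julia
# RMSE over one explicit (train, test) split: the arithmetic performed when
# evaluating with a single holdout pair of row indices.
rmse(y, ŷ) = sqrt(sum((y .- ŷ) .^ 2) / length(y))

# Hypothetical stand-in for "fit on train rows, predict on test rows":
# this toy "model" just predicts the training-set mean.
function fit_predict_mean(y, train, test)
    μ = sum(y[train]) / length(train)
    return fill(μ, length(test))
end

y     = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
train = 1:4
test  = 5:6

predictions = fit_predict_mean(y, train, test)   # both entries are 2.5
rmse_value  = rmse(y[test], predictions)         # sqrt(9.25) ≈ 3.04
```

A resampling-aware `evaluate` call folds the split, the prediction step, and this measure computation into one line, which is why it shortens the tutorial's manual loop.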
Done.
I've reviewed the text (the .jl files) for the new tutorials but haven't yet checked the rendering locally. Let me do that after we rebase later, as per "plan" in this comment, now linked in the opening post for this PR.
Thanks again for your valuable contributions, @EssamWisam
@ablaom I believe I have addressed all your comments, excluding those about changing the imports, which require registration first.
Co-authored-by: Anthony Blaom, PhD <[email protected]>
This is a draft for the documentation page of `MLJTransforms`.
@ablaom addition:
Some further plan of action: