-
Notifications
You must be signed in to change notification settings - Fork 200
Expose "vocabulary" parameter to "StringEncoder" #1819
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Expose "vocabulary" parameter to "StringEncoder" #1819
Conversation
|
hello @emassoulie, thank you for your contribution 🎉 ! I think it's also worth to add an entry in the changelog :). |
| Used during randomized svd. Pass an int for reproducible results across | ||
| multiple function calls. | ||
|
|
||
| vocabulary : Mapping or iterable, default=None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| vocabulary : Mapping or iterable, default=None | |
| vocabulary_ : Mapping or iterable, default=None |
The scikit-learn convention requires to have an underscore at the end of attributes that are derived from the data
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here vocabulary is a user-provided value though, it's not derived from the data. In fact, we never define a vocabulary_ in fit_transform.
|
|
CHANGES.rst
Outdated
| - Computing the associations in :class:`TableReport` is now deterministic and can | ||
| be controlled by the new parameter ``subsampling_seed`` of the global configuration. | ||
| :pr:`1775` by :user:`Thomas S. <thomass-dev>`. | ||
| - The :class: `StringEncoder` now exposes the ``vocabulary`` parameter from the parent |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The changelog entry was added in the wrong place, it should be at the top of the file in the proper section
skrub/_string_encoder.py
Outdated
| ngram_range=self.ngram_range, | ||
| analyzer=self.analyzer, | ||
| stop_words=self.stop_words, | ||
| if self.vocabulary is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that having something like
if self.vocabulary is not None:
raise ValueError(...)and then continuing with self.vectorizer_ would be more readable
rcap107
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, thanks a lot @emassoulie
Response to issue 1792