@emassoulie
Contributor

Response to issue 1792

@MarieSacksick
Contributor

Hello @emassoulie, thank you for your contribution 🎉!
Could you please add a test for this new option?

I think it's also worth adding an entry to the changelog :).

Used during randomized svd. Pass an int for reproducible results across
multiple function calls.

vocabulary : Mapping or iterable, default=None
Member

Suggested change
vocabulary : Mapping or iterable, default=None
vocabulary_ : Mapping or iterable, default=None

The scikit-learn convention requires a trailing underscore on attributes that are derived from the data.

Member

Here vocabulary is a user-provided value though; it's not derived from the data. In fact, we never define a vocabulary_ in fit_transform.
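
For readers less familiar with the convention being discussed, here is a minimal sketch of the difference between a constructor parameter and a fitted attribute. `ToyVectorizer` is a hypothetical stand-in written only for illustration; it is not skrub's `StringEncoder`, which, as noted above, does not define a `vocabulary_`.

```python
class ToyVectorizer:
    """Hypothetical scikit-learn-style estimator, for illustration only."""

    def __init__(self, vocabulary=None):
        # Constructor parameter: stored exactly as given, no trailing underscore.
        self.vocabulary = vocabulary

    def fit(self, documents):
        if self.vocabulary is not None:
            # A user-supplied mapping is copied, leaving the parameter untouched.
            learned = dict(self.vocabulary)
        else:
            # Otherwise, learn a token -> column index mapping from the data.
            tokens = sorted({tok for doc in documents for tok in doc.split()})
            learned = {tok: idx for idx, tok in enumerate(tokens)}
        # Fitted attribute: only exists after fit, hence the trailing underscore.
        self.vocabulary_ = learned
        return self


docs = ["red apple", "green apple"]
learned = ToyVectorizer().fit(docs)
fixed = ToyVectorizer(vocabulary={"apple": 0, "pear": 1}).fit(docs)
```

In this sketch `vocabulary` is whatever the user passed (possibly `None`), while `vocabulary_` only appears once `fit` has run.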

@emassoulie
Contributor Author

vocabulary name reverted following a discussion with @rcap107!
We were also wondering whether it is necessary to include an example in the docstring for this specific case. It feels like any use case for "vocabulary" would just be passing another model's vocabulary attribute on to the StringEncoder, which doesn't make for particularly compelling code.

CHANGES.rst Outdated
- Computing the associations in :class:`TableReport` is now deterministic and can
be controlled by the new parameter ``subsampling_seed`` of the global configuration.
:pr:`1775` by :user:`Thomas S. <thomass-dev>`.
- The :class: `StringEncoder` now exposes the ``vocabulary`` parameter from the parent
Member

The changelog entry was added in the wrong place; it should be at the top of the file, in the proper section.

ngram_range=self.ngram_range,
analyzer=self.analyzer,
stop_words=self.stop_words,
if self.vocabulary is None:
Member

I think that having something like

            if self.vocabulary is not None:
                raise ValueError(...)

and then continuing with self.vectorizer_ would be more readable
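
A rough sketch of the guard-clause shape suggested here, under stated assumptions: `build_branch_settings`, the returned dictionary, and the error message are all hypothetical and only stand in for the branch under review, whose actual contents and message are not shown in this thread.

```python
def build_branch_settings(vocabulary=None):
    """Hypothetical stand-in for the code path under review."""
    # Guard clause: reject the unsupported input first and exit early.
    if vocabulary is not None:
        raise ValueError("illustrative message: vocabulary is not supported here")
    # Main path continues at the same indentation level, with no else branch.
    return {"vocabulary": None}


settings = build_branch_settings()
```

The point of this shape is that the error case is handled in two lines at the top, so the construction of the vectorizer does not have to sit inside an `if ... else` block.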

Member

@rcap107 rcap107 left a comment

Looks good to me, thanks a lot @emassoulie

@rcap107 rcap107 merged commit 85c1c06 into skrub-data:main Jan 14, 2026
29 checks passed
@emassoulie emassoulie deleted the issue_1792-expose_vocabulary_attribute_in_StringEncoder branch January 14, 2026 15:10

Development

Successfully merging this pull request may close these issues.

Exposing the TfidfVectorizer's vocabulary attribute in StringEncoder

4 participants