Expose "vocabulary" parameter to "StringEncoder" #1819

emassoulie · 2025-12-19T16:22:43Z

Response to issue 1792

…string

MarieSacksick · 2025-12-25T14:03:24Z

Hello @emassoulie, thank you for your contribution 🎉!
Could you please add a test for this new option?

I think it's also worth adding an entry to the changelog :).

GaelVaroquaux · 2025-12-29T13:17:55Z

skrub/_string_encoder.py

        Used during randomized svd. Pass an int for reproducible results across
        multiple function calls.

+    vocabulary : Mapping or iterable, default=None


Suggested change

-    vocabulary : Mapping or iterable, default=None
+    vocabulary_ : Mapping or iterable, default=None

The scikit-learn convention requires an underscore at the end of attributes that are derived from the data.

Here, though, vocabulary is a user-provided value; it is not derived from the data. In fact, we never define a vocabulary_ in fit_transform.
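To make the naming convention discussed above concrete, here is a minimal sketch in plain Python. `ToyVectorizer` is a hypothetical class written only for illustration; it is not the `StringEncoder` implementation and it does not use scikit-learn. The constructor argument keeps its plain name, and only the attribute computed in `fit` gets the trailing underscore.

```python
class ToyVectorizer:
    """Hypothetical estimator, used only to illustrate the naming convention."""

    def __init__(self, vocabulary=None):
        # User-provided parameter: stored unchanged, no trailing underscore.
        self.vocabulary = vocabulary

    def fit(self, documents):
        if self.vocabulary is not None:
            # A vocabulary was passed in: use its terms in the given order.
            terms = list(self.vocabulary)
        else:
            # No vocabulary given: derive one from the data.
            terms = sorted({word for doc in documents for word in doc.split()})
        # Attribute set during fit: trailing underscore.
        self.vocabulary_ = {term: index for index, term in enumerate(terms)}
        return self


learned = ToyVectorizer().fit(["red apple", "green apple"])
fixed = ToyVectorizer(vocabulary=["apple", "pear"]).fit(["red apple"])
```

In this toy, `vocabulary` is always what the caller passed (possibly `None`), while `vocabulary_` only exists after `fit`.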

emassoulie · 2026-01-12T16:42:25Z

The vocabulary parameter name has been reverted following a discussion with @rcap107!
We were also wondering whether it is necessary to include an example in the docstring for this specific case. It feels like any use case for "vocabulary" would just be passing another model's vocabulary attribute on to the StringEncoder, which doesn't make for particularly compelling code.
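For readers unfamiliar with the use case mentioned above, the idea of reusing a vocabulary can be sketched with the standard library alone. This is not the skrub API: the helper names `char_ngrams`, `learn_vocabulary` and `count_with_vocabulary` are hypothetical, and the sketch only counts character n-grams. It shows why a shared vocabulary is useful: the columns produced for new text line up with those of the model that learned the vocabulary.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Return the character n-grams of ``text`` as a list."""
    return [text[i : i + n] for i in range(len(text) - n + 1)]


def learn_vocabulary(texts, n=3):
    """Map each n-gram seen in ``texts`` to a column index."""
    grams = sorted({gram for text in texts for gram in char_ngrams(text, n)})
    return {gram: index for index, gram in enumerate(grams)}


def count_with_vocabulary(texts, vocabulary, n=3):
    """Count n-grams per text, keeping only those present in ``vocabulary``."""
    rows = []
    for text in texts:
        counts = Counter(char_ngrams(text, n))
        row = [0] * len(vocabulary)
        for gram, index in vocabulary.items():
            row[index] = counts[gram]
        rows.append(row)
    return rows


# Vocabulary learned on one corpus, then reused unchanged on another.
vocabulary = learn_vocabulary(["paris", "parma"])
rows = count_with_vocabulary(["parisian", "rome"], vocabulary)
```

N-grams that are absent from the provided vocabulary (all of "rome" here) are simply ignored, so every row has the same width regardless of the new text.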

rcap107 · 2026-01-12T17:04:07Z

CHANGES.rst

 - Computing the associations in :class:`TableReport` is now deterministic and can
  be controlled by the new parameter ``subsampling_seed`` of the global configuration.
  :pr:`1775` by :user:`Thomas S. <thomass-dev>`.
+- The :class: `StringEncoder` now exposes the ``vocabulary`` parameter from the parent


The changelog entry was added in the wrong place; it should be at the top of the file, in the proper section.

rcap107 · 2026-01-12T17:06:38Z

skrub/_string_encoder.py

-                            ngram_range=self.ngram_range,
-                            analyzer=self.analyzer,
-                            stop_words=self.stop_words,
+            if self.vocabulary is None:


I think that having something like

if self.vocabulary is not None: raise ValueError(...)

and then continuing with self.vectorizer_ would be more readable.
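A minimal sketch of the guard-clause shape suggested above, in plain Python. The diff excerpt does not show which code path the check protects, so this assumes a "hashing" option that cannot use a fixed vocabulary; the helper `vectorizer_kwargs` and the error message are hypothetical and not taken from the PR.

```python
def vectorizer_kwargs(kind, ngram_range, analyzer, stop_words, vocabulary=None):
    """Hypothetical helper: collect keyword arguments for a text vectorizer.

    Assumption: the "hashing" kind cannot use a fixed vocabulary.
    """
    kwargs = {
        "ngram_range": ngram_range,
        "analyzer": analyzer,
        "stop_words": stop_words,
    }
    if kind == "hashing":
        # Guard clause: reject the unsupported combination up front,
        # then continue with the normal flow at the same indentation level.
        if vocabulary is not None:
            raise ValueError("vocabulary is not supported with kind='hashing'")
        return kwargs
    kwargs["vocabulary"] = vocabulary
    return kwargs


tfidf_kwargs = vectorizer_kwargs("tfidf", (3, 4), "char_wb", None, vocabulary={"abc": 0})
```

Compared with wrapping the main path in `if vocabulary is None:`, the early `raise` keeps the vectorizer construction unnested, which is the readability gain the comment points at.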

rcap107

Looks good to me, thanks a lot @emassoulie

Exposed "vocabulary" parameter to "StringEncoder" and added it to doc…

128ca03

…string

emassoulie self-assigned this Dec 19, 2025

emassoulie mentioned this pull request Dec 22, 2025

Exposing the TfidfVectorizer's vocabulary attribute in StringEncoder #1792

Closed

MarieSacksick linked an issue Dec 25, 2025 that may be closed by this pull request

Exposing the TfidfVectorizer's vocabulary attribute in StringEncoder #1792

Closed

GaelVaroquaux reviewed Dec 29, 2025

emassoulie and others added 3 commits January 12, 2026 16:11

Added test and changed parameter name

86595ea

Adjustments to the parameter name

9cddeff

Merge branch 'main' into issue_1792-expose_vocabulary_attribute_in_St…

39aa02a

…ringEncoder

rcap107 reviewed Jan 12, 2026

emassoulie and others added 4 commits January 13, 2026 11:39

Format adjustments

b10d7d7

Test fix

40fed52

Changelog again

f96389b

fixing formatting in changelog

90540a1

rcap107 approved these changes Jan 14, 2026

rcap107 merged commit 85c1c06 into skrub-data:main Jan 14, 2026
29 checks passed

emassoulie deleted the issue_1792-expose_vocabulary_attribute_in_StringEncoder branch January 14, 2026 15:10

4 participants
