Add support for char and char_wb analyzers in TfidfVectorizer/CountVe… by fkrasnov · Pull Request #1211 · onnx/sklearn-onnx

fkrasnov · 2025-09-03T17:46:24Z

Add support for char and char_wb analyzers in TfidfVectorizer/CountVectorizer

Currently skl2onnx only supports analyzer="word" for CountVectorizer and
TfidfVectorizer. Using "char" or "char_wb" raises NotImplementedError.

This PR extends the converter to handle character-based analyzers by
emitting ONNX Tokenizer + Ngram operators configured for character-level
ngrams. For "char_wb" mode, a regex approximation is used to simulate
boundary-aware ngrams.

- Extended converter to support analyzer in {"char", "char_wb"}
- Added unit tests for char and char_wb vectorizers
- Verified multilingual support with Cyrillic inputs
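
For readers unfamiliar with the two analyzers, the difference between "char" and "char_wb" described above can be sketched in plain Python. This is only an illustration modeled on scikit-learn's documented behavior, not the converter code from this PR; the helper names `char_ngrams` and `char_wb_ngrams` are hypothetical, and lowercasing and accent stripping are left out for brevity.

```python
import re

_WHITESPACE = re.compile(r"\s+")


def char_ngrams(text, min_n, max_n):
    """analyzer="char": slide a window over the whole string, spaces included."""
    text = _WHITESPACE.sub(" ", text)  # collapse whitespace runs to one space
    return [
        text[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(text) - n + 1)
    ]


def char_wb_ngrams(text, min_n, max_n):
    """analyzer="char_wb": build n-grams per word, each word padded with a space."""
    text = _WHITESPACE.sub(" ", text)
    ngrams = []
    for word in text.split():
        padded = " " + word + " "
        for n in range(min_n, max_n + 1):
            offset = 0
            ngrams.append(padded[offset:offset + n])
            while offset + n < len(padded):
                offset += 1
                ngrams.append(padded[offset:offset + n])
            if offset == 0:
                # The padded word is no longer than n: count it once and stop.
                break
    return ngrams


print(char_ngrams("ab cd", 2, 2))     # ['ab', 'b ', ' c', 'cd']
print(char_wb_ngrams("ab cd", 2, 2))  # [' a', 'ab', 'b ', ' c', 'cd', 'd ']
print(char_wb_ngrams("да", 2, 2))     # [' д', 'да', 'а ']
```

Note that "char" produces the n-gram "b " and " c" across the word boundary from the raw text, while "char_wb" never spans two words and instead emits padded edge n-grams such as " a" and "d ". Python slices strings by code point, so the Cyrillic input behaves the same way as the ASCII one.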

Add support for char and char_wb analyzers in TfidfVectorizer/CountVe…

1bd1f96

fkrasnov marked this pull request as draft September 4, 2025 04:18

Merge branch 'main' into add_support_for_char_wb_analyzers_in_TfidfVe…

4e706d0

fkrasnov wants to merge 2 commits into onnx:main from fkrasnov:add_support_for_char_wb_analyzers_in_TfidfVectorizer
