Add support for char and char_wb analyzers in TfidfVectorizer/CountVe…#1211

Draft
fkrasnov wants to merge 2 commits into onnx:main from fkrasnov:add_support_for_char_wb_analyzers_in_TfidfVectorizer


fkrasnov commented Sep 3, 2025

Add support for char and char_wb analyzers in TfidfVectorizer/CountVectorizer

Currently, skl2onnx supports only analyzer="word" for CountVectorizer and
TfidfVectorizer; using "char" or "char_wb" raises NotImplementedError.

This PR extends the converter to handle character-based analyzers by
emitting ONNX Tokenizer + Ngram operators configured for character-level
ngrams. For "char_wb" mode, a regex approximation is used to simulate
boundary-aware ngrams.

  • Extended converter to support analyzer in {"char", "char_wb"}
  • Added unit tests for char and char_wb vectorizers
  • Verified multilingual support with Cyrillic inputs
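
For readers unfamiliar with the two analyzers, here is a minimal pure-Python sketch of how they differ. It is not the converter code from this PR: the helper names `char_ngrams` and `char_wb_ngrams` are hypothetical, it handles a single n-gram size, and it skips lowercasing and accent stripping. It only approximates scikit-learn's documented behaviour, where "char" slides over the whole text and "char_wb" builds n-grams inside each whitespace-delimited word padded with one space on each side.

```python
import re

# Collapse runs of two or more whitespace characters into one space
# (an assumption modelled on scikit-learn's whitespace normalisation).
_WHITESPACE_RUN = re.compile(r"\s\s+")


def char_ngrams(text, n):
    """Character n-grams over the whole text, crossing word boundaries."""
    text = _WHITESPACE_RUN.sub(" ", text)
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def char_wb_ngrams(text, n):
    """Character n-grams built inside each word, padded with spaces."""
    grams = []
    for word in _WHITESPACE_RUN.sub(" ", text).split():
        padded = f" {word} "
        # A padded word shorter than n is emitted once, as a whole.
        count = max(1, len(padded) - n + 1)
        grams.extend(padded[i:i + n] for i in range(count))
    return grams


print(char_ngrams("ab cd", 2))      # ['ab', 'b ', ' c', 'cd']
print(char_wb_ngrams("ab cd", 2))   # [' a', 'ab', 'b ', ' c', 'cd', 'd ']
print(char_wb_ngrams("да нет", 2))  # [' д', 'да', 'а ', ' н', 'не', 'ет', 'т ']
```

The "char_wb" output never contains an n-gram that spans two words, which is the property a boundary-aware conversion has to preserve. Python strings index by code point, so the Cyrillic input needs no special handling in this sketch.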

@fkrasnov fkrasnov marked this pull request as draft September 4, 2025 04:18