Add support for char and char_wb analyzers in TfidfVectorizer/CountVe… #1211

fkrasnov · 2025-09-03T17:46:24Z

Add support for char and char_wb analyzers in TfidfVectorizer/CountVectorizer

Currently skl2onnx only supports analyzer="word" for CountVectorizer and
TfidfVectorizer. Using "char" or "char_wb" raises NotImplementedError.

This PR extends the converter to handle character-based analyzers by
emitting ONNX Tokenizer + Ngram operators configured for character-level
ngrams. For "char_wb" mode, a regex approximation is used to simulate
boundary-aware ngrams.
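For reference, here is a minimal pure-Python sketch of the n-grams the two analyzers are expected to produce, following the behavior described in the scikit-learn documentation. It is not the converter code from this PR. The helper names `char_ngrams` and `char_wb_ngrams` are hypothetical, and lowercasing and other preprocessing are left out.

```python
import re

# Runs of two or more whitespace characters are collapsed to one space.
_WHITE_SPACES = re.compile(r"\s\s+")


def char_ngrams(text, min_n, max_n):
    """Sketch of analyzer="char": slide a window over the whole string.

    Spaces are ordinary characters, so n-grams may span word boundaries.
    """
    text = _WHITE_SPACES.sub(" ", text)
    grams = []
    for n in range(min_n, min(max_n, len(text)) + 1):
        for i in range(len(text) - n + 1):
            grams.append(text[i:i + n])
    return grams


def char_wb_ngrams(text, min_n, max_n):
    """Sketch of analyzer="char_wb": n-grams never cross word boundaries.

    Each whitespace-separated word is padded with one space on both sides.
    """
    text = _WHITE_SPACES.sub(" ", text)
    grams = []
    for word in text.split():
        padded = " " + word + " "
        for n in range(min_n, max_n + 1):
            if n >= len(padded):
                # Padded word is no longer than n: emit it once, then stop.
                grams.append(padded)
                break
            for i in range(len(padded) - n + 1):
                grams.append(padded[i:i + n])
    return grams


print(char_ngrams("ab cd", 2, 2))     # ['ab', 'b ', ' c', 'cd']
print(char_wb_ngrams("ab cd", 2, 2))  # [' a', 'ab', 'b ', ' c', 'cd', 'd ']
print(char_wb_ngrams("да", 3, 3))     # [' да', 'да ']
```

The difference matters for the conversion: "char" yields n-grams such as `'b '` followed by `' c'` that straddle the space between words, while "char_wb" restarts at every word and pads it, which is the behavior the regex approximation has to imitate.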

- Extended converter to support analyzer in {"char", "char_wb"}
- Added unit tests for char and char_wb vectorizers
- Verified multilingual support with Cyrillic inputs

Commit 1bd1f96: Add support for char and char_wb analyzers in TfidfVectorizer/CountVectorizer

fkrasnov marked this pull request as draft September 4, 2025 04:18
