
@fkrasnov fkrasnov commented Sep 3, 2025

Add support for char and char_wb analyzers in TfidfVectorizer/CountVectorizer

skl2onnx currently supports only analyzer="word" for CountVectorizer and
TfidfVectorizer; analyzer="char" or "char_wb" raises NotImplementedError.

This PR extends the converter to handle character-based analyzers by
emitting ONNX Tokenizer + Ngram operators configured for character-level
n-grams. In "char_wb" mode, the converter uses a regex approximation to
simulate boundary-aware n-grams.
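
For reviewers less familiar with the two modes, here is a minimal plain-Python sketch of how the "char" and "char_wb" analyzers differ on the scikit-learn side. It is an illustrative reimplementation, not the converter code in this PR. The helper names `char_ngrams` and `char_wb_ngrams` are hypothetical, and the sketch leaves out preprocessing such as lowercasing and accent stripping.

```python
import re

# Collapse runs of two or more whitespace characters into a single space.
_WHITE_SPACES = re.compile(r"\s\s+")


def char_ngrams(text, min_n, max_n):
    """analyzer="char": slide over the whole string, spaces included."""
    text = _WHITE_SPACES.sub(" ", text)
    return [
        text[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(text) - n + 1)
    ]


def char_wb_ngrams(text, min_n, max_n):
    """analyzer="char_wb": n-grams never cross word boundaries.

    Each whitespace-separated word is padded with one space on both sides,
    and n-grams are taken inside the padded word only.
    """
    text = _WHITE_SPACES.sub(" ", text)
    grams = []
    for word in text.split():
        padded = f" {word} "
        for n in range(min_n, max_n + 1):
            if n >= len(padded):
                # A word shorter than n is emitted once, as the padded word.
                grams.append(padded)
                break
            grams.extend(padded[i:i + n] for i in range(len(padded) - n + 1))
    return grams


print(char_ngrams("ab cd", 2, 2))     # ['ab', 'b ', ' c', 'cd']
print(char_wb_ngrams("ab cd", 2, 2))  # [' a', 'ab', 'b ', ' c', 'cd', 'd ']
```

The "char" output contains n-grams that span the space between words ('b ', ' c' come from the original text), while "char_wb" produces only n-grams inside each padded word. That padding behavior is what the converter has to approximate.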

  • Extended converter to support analyzer in {"char", "char_wb"}
  • Added unit tests for char and char_wb vectorizers
  • Verified multilingual support with Cyrillic inputs

@fkrasnov fkrasnov marked this pull request as draft September 4, 2025 04:18