Normalize text before checking for missing characters in the tokenizer #352

@mmartin9684-sil

SentencePiece normalizes some decomposed character sequences (e.g., "B" followed by a combining underscore, U+0331) into their canonical composed form (e.g., "Ḇ", U+1E06). When we check the source text for characters that are not known to the tokenizer, the check currently runs before this normalization occurs, so the decomposed pieces can be reported as missing even though the tokenizer knows the composed character. It would be better to check for unknown characters using the normalized version of the text.
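
A rough sketch of the proposed behavior, to make the difference concrete. It uses Python's `unicodedata.normalize("NFKC", ...)` as a stand-in for SentencePiece's own normalizer, which is an assumption (SentencePiece's default normalization rule is close to NFKC but not identical). The function name `find_unknown_chars` and the `vocab_chars` set are hypothetical and are not taken from the existing code.

```python
import unicodedata


def find_unknown_chars(text, vocab_chars, normalize=True):
    """Return the set of non-whitespace characters in `text` missing from `vocab_chars`.

    Hypothetical helper: `vocab_chars` stands for the set of characters the
    tokenizer knows. NFKC is only an approximation of the normalization that
    SentencePiece applies internally.
    """
    if normalize:
        text = unicodedata.normalize("NFKC", text)
    return {ch for ch in text if not ch.isspace() and ch not in vocab_chars}


# The tokenizer knows the composed character U+1E06 and "a".
vocab_chars = {"\u1e06", "a"}

# The source text contains the decomposed sequence "B" + U+0331.
decomposed = "B\u0331a"

# Current behavior: checking before normalization flags "B" and U+0331.
print(find_unknown_chars(decomposed, vocab_chars, normalize=False))

# Proposed behavior: checking after normalization flags nothing.
print(find_unknown_chars(decomposed, vocab_chars, normalize=True))
```

In the real pipeline the normalization step would ideally be whatever the trained SentencePiece model itself applies, so that the check sees exactly the text the tokenizer sees.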

Labels

bug (Something isn't working)
pipeline 3: preprocess (Issue related to preprocessing.)
