SentencePiece normalizes some decomposed character sequences (e.g., "B" followed by a combining underscore) into their canonical composed form (e.g., "Ḇ"). We currently check the source text for characters that are not known to the tokenizer before this normalization occurs, so a decomposed sequence can be flagged as unknown even when the tokenizer knows its composed form. It would be better to run the unknown-character check on the normalized version of the text.
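
A minimal sketch of the proposed check, to make the difference concrete. It does not call SentencePiece: it assumes Python's `unicodedata.normalize("NFKC", ...)` is a close enough stand-in for the tokenizer's normalization, and `unknown_chars` and `known_chars` are hypothetical names, not part of this project or of SentencePiece. In practice the text should be normalized with the same rules the tokenizer model was trained with.

```python
import unicodedata


def unknown_chars(text, known_chars, normalize=True):
    """Return the set of non-whitespace characters in `text` missing from `known_chars`.

    `known_chars` is a hypothetical set of characters the tokenizer can represent.
    NFKC is an assumed stand-in for the tokenizer's own normalization rules.
    """
    if normalize:
        text = unicodedata.normalize("NFKC", text)
    return {ch for ch in text if not ch.isspace() and ch not in known_chars}


# Hypothetical vocabulary that contains the composed character U+1E06 ("Ḇ")
# but not the combining mark U+0331 or a bare "B".
known_chars = set("A\u1E06C")

# Decomposed input: "B" followed by U+0331 (combining macron below).
decomposed = "AB\u0331C"

# Checking before normalization reports "B" and the combining mark as unknown.
print(unknown_chars(decomposed, known_chars, normalize=False))

# Checking after normalization reports nothing, since the pair composes to "Ḇ".
print(unknown_chars(decomposed, known_chars, normalize=True))
```

With this ordering, only characters that remain unknown after normalization are reported, which matches what the tokenizer would actually see.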