Normalize text before checking for missing characters in the tokenizer

SentencePiece normalizes some decomposed character sequences (e.g., "B" followed by a combining underline mark) into their canonical composed form (e.g., "Ḇ"). We currently check the source text for characters that are unknown to the tokenizer before this normalization occurs, so a decomposed sequence can be flagged as missing even when the tokenizer knows its composed form. It would be better to run the check against the normalized version of the text.
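
A minimal sketch of the proposed check order, written in Python. It uses the standard library's `unicodedata.normalize("NFKC", ...)` as a stand-in for the tokenizer's normalization; SentencePiece's own normalizer may apply different rules, so this is an assumption for illustration. The function name `missing_characters` and the `known_chars` set are hypothetical and not taken from the existing code.

```python
import unicodedata


def missing_characters(text, known_chars, normalize=True):
    """Return the set of non-whitespace characters in `text` not in `known_chars`.

    When `normalize` is True, the text is NFKC-normalized first. This only
    approximates what the tokenizer does (assumption, see the note above).
    """
    if normalize:
        text = unicodedata.normalize("NFKC", text)
    return {ch for ch in text if not ch.isspace() and ch not in known_chars}


# Hypothetical tokenizer alphabet that contains the composed form only.
known = set("abc") | {"\u1e06"}  # U+1E06 is the precomposed "B with line below"

# Decomposed input: "B" followed by U+0331 (combining macron below).
decomposed = "B\u0331"

# Checking before normalization flags both code points as unknown.
before = missing_characters(decomposed, known, normalize=False)

# Checking after normalization finds nothing missing.
after = missing_characters(decomposed, known, normalize=True)
```

With the check moved after normalization, `before` contains the two decomposed code points while `after` is empty, which is the behavior this change is asking for.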