Hello, the BertTokenizer seems to lose accents when convert_ids_to_tokens() is used:
Example:
- original sentence: "great breakfasts in a nice furnished cafè, slightly bohemian."
- corresponding list of tokens produced: ['great', 'breakfast', '##s', 'in', 'a', 'nice', 'fur', '##nis', '##hed', 'cafe', ',', 'slightly', 'bohemia', '##n', '.']
The problem is with "cafe", which has lost its accent. I'm using BertTokenizer.from_pretrained('Bert-base-multilingual') as the tokenizer; I also tried "Bert-base-uncased" and experienced the same issue.
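
For reference, here is a minimal sketch I used to narrow this down (assuming the Hugging Face transformers BertTokenizer API and the standard checkpoint name "bert-base-uncased"). The accent already appears to be dropped by tokenize() itself, so convert_ids_to_tokens() just gives back the accent-less tokens:

```python
# Minimal reproduction sketch (assumes the `transformers` package and the
# standard "bert-base-uncased" checkpoint name). With an uncased model the
# basic tokenizer lowercases and strips accents before WordPiece runs.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

text = "great breakfasts in a nice furnished cafè, slightly bohemian."
tokens = tokenizer.tokenize(text)
print(tokens)  # 'cafè' already comes out as 'cafe' at this step

ids = tokenizer.convert_tokens_to_ids(tokens)
print(tokenizer.convert_ids_to_tokens(ids))  # the accent-less tokens are returned unchanged
```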
Thanks for this great work!