issues with accents on convert_ids_to_tokens()

Hello, the BertTokenizer seems to lose accents when convert_ids_to_tokens() is used:

Example:
- original sentence: "great breakfasts in a nice furnished cafè, slightly bohemian."
- corresponding list of tokens produced: ['great', 'breakfast', '##s', 'in', 'a', 'nice', 'fur', '##nis', '##hed', 'cafe', ',', 'slightly', 'bohemia', '##n', '.']

The problem is that "cafè" comes back as "cafe", without its accent. I'm using BertTokenizer.from_pretrained('Bert-base-multilingual') as the tokenizer. I also tried "Bert-base-uncased" and saw the same issue.
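For reference, here is a minimal sketch of what may be happening. It assumes the accent is dropped when the text is tokenized (not in convert_ids_to_tokens() itself), because the lowercasing path of BERT's basic tokenizer also strips accents through Unicode normalization. The helper name `strip_accents` is hypothetical and only imitates that step with the standard library; it does not call the tokenizer.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Hypothetical helper: decompose characters (NFD) and drop combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    # Combining accents have the Unicode category "Mn" (nonspacing mark).
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


sentence = "great breakfasts in a nice furnished cafè, slightly bohemian."
# Assumption: lowercasing and accent stripping happen together before WordPiece.
normalized = strip_accents(sentence.lower())
print(normalized)
```

If this is indeed the cause, the ids would already map to the unaccented "cafe", so convert_ids_to_tokens() could not restore the accent. Loading the tokenizer with do_lower_case=False might be worth trying for the multilingual model, though I have not verified that it fixes this case.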

Thanks for this great work!

issues with accents on convert_ids_to_tokens() #35