Labels
Core: Tokenization (Internals of the library; Tokenization.)
Description
System Info
Transformers v4.29.2
Who can help?
Reproduction
For a new model (#23460), I'd like to get equivalent behaviour between the slow and fast LLaMa tokenizers. The code of the slow tokenizer was taken from the original code, and now I'd like to translate this to the fast tokenizer as well.
However, as can be seen below, behaviour is not equivalent:
```python
from transformers import LlamaTokenizer, LlamaTokenizerFast
import torch

tokenizer = LlamaTokenizer.from_pretrained("huggyllama/llama-7b", truncation_side="left")
tokenizer.add_special_tokens({"pad_token": "[PAD]"})
tokenizer.add_special_tokens({"bos_token": "</s>"})
tokenizer.add_special_tokens({"eos_token": "</s>"})
tokenizer.add_special_tokens({"unk_token": "</s>"})

fast_tokenizer = LlamaTokenizerFast.from_pretrained("huggyllama/llama-7b", truncation_side="left")
fast_tokenizer.add_special_tokens({"pad_token": "[PAD]"})
fast_tokenizer.add_special_tokens({"bos_token": "</s>"})
fast_tokenizer.add_special_tokens({"eos_token": "</s>"})
fast_tokenizer.add_special_tokens({"unk_token": "</s>"})

prompt = "What is unusual about this image?"

encoding = tokenizer(prompt, return_tensors="pt")
fast_encoding = fast_tokenizer(prompt, return_tensors="pt")

for k, v in encoding.items():
    assert torch.allclose(fast_encoding[k], v)
```
This assertion fails because the `input_ids` differ in the very first token:

```
tensor([[ 2, 1724, 338, 22910, 1048, 445, 1967, 29973]])
tensor([[ 1, 1724, 338, 22910, 1048, 445, 1967, 29973]])
```
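I suspect (this is an assumption on my part, not something I've verified in the Rust code) that the fast tokenizer's BOS insertion lives in the post-processor serialized in `tokenizer.json`, so overriding `bos_token` via `add_special_tokens` updates the Python-side attribute but not the post-processor that actually prepends the id. A minimal sketch of a possible workaround, rebuilding the post-processor with `tokenizers.processors.TemplateProcessing` so it uses the new BOS id:

```python
# Hypothetical workaround (assumption: the mismatch comes from the serialized
# post-processor, which still prepends the original BOS id).
from tokenizers import processors

bos = fast_tokenizer.bos_token        # "</s>" after the override above
bos_id = fast_tokenizer.bos_token_id  # its id in the vocabulary

# Rebuild the post-processor so sequences are prefixed with the new BOS token.
fast_tokenizer.backend_tokenizer.post_processor = processors.TemplateProcessing(
    single=f"{bos} $A",
    pair=f"{bos} $A {bos} $B",
    special_tokens=[(bos, bos_id)],
)

fast_encoding = fast_tokenizer(prompt, return_tensors="pt")
print(fast_encoding["input_ids"])  # should now start with the overridden BOS id
```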
Expected behavior
I'd expect the assertion above to pass, i.e. the slow and fast tokenizers should produce identical `input_ids`.