Behaviour between slow and fast LLaMa tokenizer not equivalent #23889

@NielsRogge

Description

System Info

Transformers v4.29.2

Who can help?

@ArthurZucker

Reproduction

For a new model (#23460), I'd like the slow and fast LLaMa tokenizers to behave equivalently. The slow tokenizer was taken from the original implementation, and I'd now like to translate it to the fast tokenizer as well.

However, as can be seen below, behaviour is not equivalent:

from transformers import LlamaTokenizer, LlamaTokenizerFast
import torch

tokenizer = LlamaTokenizer.from_pretrained("huggyllama/llama-7b", truncation_side="left")
tokenizer.add_special_tokens({"pad_token": "[PAD]"})
tokenizer.add_special_tokens({"bos_token": "</s>"})
tokenizer.add_special_tokens({"eos_token": "</s>"})
tokenizer.add_special_tokens({"unk_token": "</s>"})

fast_tokenizer = LlamaTokenizerFast.from_pretrained("huggyllama/llama-7b", truncation_side="left")
fast_tokenizer.add_special_tokens({"pad_token": "[PAD]"})
fast_tokenizer.add_special_tokens({"bos_token": "</s>"})
fast_tokenizer.add_special_tokens({"eos_token": "</s>"})
fast_tokenizer.add_special_tokens({"unk_token": "</s>"})

prompt = "What is unusual about this image?"

encoding = tokenizer(prompt, return_tensors="pt")

fast_encoding = fast_tokenizer(prompt, return_tensors="pt")

for k,v in encoding.items():
    assert torch.allclose(fast_encoding[k], v)
=> this assertion fails since the input_ids differ in their first (BOS) token:

tensor([[    2,  1724,   338, 22910,  1048,   445,  1967, 29973]])  # slow tokenizer
tensor([[    1,  1724,   338, 22910,  1048,   445,  1967, 29973]])  # fast tokenizer
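
For reference, a quick check (a minimal sketch using only public attributes, continuing from the snippet above; not part of the original repro) shows what each tokenizer reports as its BOS token after the add_special_tokens calls, and confirms which tokens ids 1 and 2 map to:

# Compare the BOS token each tokenizer reports after add_special_tokens.
print(tokenizer.bos_token, tokenizer.bos_token_id)            # slow tokenizer
print(fast_tokenizer.bos_token, fast_tokenizer.bos_token_id)  # fast tokenizer

# Show which tokens ids 1 and 2 correspond to in the vocabulary.
print(tokenizer.convert_ids_to_tokens([1, 2]))

The slow tokenizer prepends its (updated) bos_token_id in Python, while the fast tokenizer prepends BOS via its Rust post-processor, which was built at load time and is not rebuilt when the special tokens are changed afterwards.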

Expected behavior

I'd expect that the assertion above passes.
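
As a side note, a possible workaround I'm using in the meantime (a sketch, not an official fix; it rebuilds the fast backend's post-processor by hand) is to override the template so the new BOS token is used:

from tokenizers import processors

# Rebuild the fast backend's post-processor so it prepends the new BOS
# token ("</s>", id 2) instead of the original "<s>" (id 1).
bos = fast_tokenizer.bos_token
bos_id = fast_tokenizer.convert_tokens_to_ids(bos)
fast_tokenizer.backend_tokenizer.post_processor = processors.TemplateProcessing(
    single=f"{bos} $A",
    pair=f"{bos} $A {bos} $B",
    special_tokens=[(bos, bos_id)],
)

With this in place, fast_tokenizer(prompt) should also prepend id 2 and match the slow tokenizer, but it would be nicer if the fast tokenizer picked up the new special tokens by itself.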
