System Info
- `transformers` version: 4.33.3
- Platform: Linux-5.15.120+-x86_64-with-glibc2.35
- Python version: 3.10.12
- Huggingface_hub version: 0.17.3
- Safetensors version: 0.3.3
- Accelerate version: not installed
- Accelerate config: not found
- PyTorch version (GPU?): 2.0.1+cu118 (False)
- Tensorflow version (GPU?): 2.13.0 (False)
- Flax version (CPU?/GPU?/TPU?): 0.7.4 (cpu)
- Jax version: 0.4.10
- JaxLib version: 0.4.10
- Using GPU in script?: No
- Using distributed or parallel set-up in script?: No
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
The discussed file is `transformers/src/transformers/models/t5/modeling_flax_t5.py`, at line 408.
Expected behavior
Hi,
During autoregressive decoding, keys and values are computed one token at a time, and a cache is used to recover the keys and values from previous calls.
In the `_concatenate_to_cache` method of the attention module, an attention mask is computed so that the new query only attends to the previously cached key positions and not to the remaining zero elements. This is what the comments inside this function describe.
However, the newly computed mask is never used afterwards, because it is assigned to a variable named `attention_attention_mask` rather than `attention_mask`, which is the name used on every other line.
From my understanding, this is likely a typo, and I am not sure how it changes the behavior of the model, if at all.
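For illustration, here is a minimal, self-contained sketch of the cache-masking logic described above. The shapes and values are made up for the example; only `combine_masks`, `pad_mask`, and `attention_mask` correspond to names that actually appear in the file, and the snippet is not a verbatim copy of the source.

```python
# Sketch of the cache-masking step in `_concatenate_to_cache` (illustrative
# shapes and values; not the exact source code).
import jax.numpy as jnp
from flax.linen import combine_masks

max_length = 8                 # size of the pre-allocated key/value cache
cur_index = 3                  # number of tokens already written to the cache
num_updated_cache_vectors = 1  # one new token is decoded per call
batch_dims = (1,)              # leading batch dimension(s)

# Mask passed into `_concatenate_to_cache`: shape (batch, 1, query_len, key_len).
attention_mask = jnp.ones(
    batch_dims + (1, num_updated_cache_vectors, max_length), dtype=bool
)

# Only the cache slots that have already been filled (plus the token just
# added) should be attended to; the remaining slots are still zero padding.
pad_mask = jnp.broadcast_to(
    jnp.arange(max_length) < cur_index + num_updated_cache_vectors,
    batch_dims + (1, num_updated_cache_vectors, max_length),
)

# Reported issue: in modeling_flax_t5.py the combined mask is assigned to a
# variable called `attention_attention_mask`, so it is never used afterwards.
# The presumably intended assignment is:
attention_mask = combine_masks(pad_mask, attention_mask)

print(attention_mask)  # positions >= cur_index + num_updated_cache_vectors are masked out
```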