
Commit 1646ffb

VLMs: patch_size -> num_image_tokens in processing (#33424)
* use num additional tokens
* fix copies + docs
* another fix copies :)
* add docs
* move order for BC
1 parent 3ee24e2 commit 1646ffb

File tree

17 files changed: +131 −15 lines

docs/source/en/model_doc/blip-2.md

Lines changed: 4 additions & 0 deletions
@@ -40,6 +40,10 @@ The original code can be found [here](https://github.com/salesforce/LAVIS/tree/5

- BLIP-2 can be used for conditional text generation given an image and an optional text prompt. At inference time, it's recommended to use the [`generate`] method.
- One can use [`Blip2Processor`] to prepare images for the model, and to decode the predicted token IDs back to text.

+ > [!NOTE]
+ > BLIP models after release v4.46 will raise warnings about adding `processor.num_query_tokens = {{num_query_tokens}}` and about expanding the model's embedding layer to add the special `<image>` token. It is strongly recommended to add these attributes to the processor if you own the model checkpoint, or to open a PR if it is not owned by you. Adding these attributes means that BLIP will add the number of query tokens required per image and expand the text with as many `<image>` placeholders as there will be query tokens. Usually this is around 500 tokens per image, so make sure the text is not truncated, as otherwise there will be a failure when merging the embeddings.
+ > The attributes can be obtained from the model config as `model.config.num_query_tokens`, and the model embedding expansion can be done by following [this link](https://gist.github.com/zucchini-nlp/e9f20b054fa322f84ac9311d9ab67042).
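For checkpoint owners, the upgrade amounts to copying `num_query_tokens` onto the processor and registering the `<image>` token. A minimal sketch, assuming a BLIP-2 checkpoint you control (the repo id below is hypothetical; the linked gist remains the reference recipe):

```python
from transformers import Blip2ForConditionalGeneration, Blip2Processor

# Hypothetical repo id; replace with a checkpoint you own.
repo_id = "your-username/your-blip2-checkpoint"
model = Blip2ForConditionalGeneration.from_pretrained(repo_id)
processor = Blip2Processor.from_pretrained(repo_id)

# Tell the processor how many query tokens the Q-Former produces per image.
processor.num_query_tokens = model.config.num_query_tokens

# Register the special <image> token and grow the embedding matrix to match.
processor.tokenizer.add_tokens(["<image>"], special_tokens=True)
model.resize_token_embeddings(len(processor.tokenizer))

# Save (or push) both so the warning no longer fires for this checkpoint.
processor.save_pretrained(repo_id)
model.save_pretrained(repo_id)
```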
## Resources

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with BLIP-2.

docs/source/en/model_doc/instructblip.md

Lines changed: 4 additions & 0 deletions
@@ -33,6 +33,10 @@ The original code can be found [here](https://github.com/salesforce/LAVIS/tree/m

InstructBLIP uses the same architecture as [BLIP-2](blip2) with a tiny but important difference: it also feeds the text prompt (instruction) to the Q-Former.

+ > [!NOTE]
+ > BLIP models after release v4.46 will raise warnings about adding `processor.num_query_tokens = {{num_query_tokens}}` and about expanding the model's embedding layer to add the special `<image>` token. It is strongly recommended to add these attributes to the processor if you own the model checkpoint, or to open a PR if it is not owned by you. Adding these attributes means that BLIP will add the number of query tokens required per image and expand the text with as many `<image>` placeholders as there will be query tokens. Usually this is around 500 tokens per image, so make sure the text is not truncated, as otherwise there will be a failure when merging the embeddings.
+ > The attributes can be obtained from the model config as `model.config.num_query_tokens`, and the model embedding expansion can be done by following [this link](https://gist.github.com/zucchini-nlp/e9f20b054fa322f84ac9311d9ab67042).

## InstructBlipConfig

[[autodoc]] InstructBlipConfig

docs/source/en/model_doc/instructblipvideo.md

Lines changed: 4 additions & 0 deletions
@@ -35,6 +35,10 @@ The original code can be found [here](https://github.com/salesforce/LAVIS/tree/m

- The model was trained by sampling 4 frames per video, so it's recommended to sample 4 frames

+ > [!NOTE]
+ > BLIP models after release v4.46 will raise warnings about adding `processor.num_query_tokens = {{num_query_tokens}}` and about expanding the model's embedding layer to add the special `<image>` token. It is strongly recommended to add these attributes to the processor if you own the model checkpoint, or to open a PR if it is not owned by you. Adding these attributes means that BLIP will add the number of query tokens required per image and expand the text with as many `<image>` placeholders as there will be query tokens. Usually this is around 500 tokens per image, so make sure the text is not truncated, as otherwise there will be a failure when merging the embeddings.
+ > The attributes can be obtained from the model config as `model.config.num_query_tokens`, and the model embedding expansion can be done by following [this link](https://gist.github.com/zucchini-nlp/e9f20b054fa322f84ac9311d9ab67042).

## InstructBlipVideoConfig

[[autodoc]] InstructBlipVideoConfig

docs/source/en/model_doc/llava.md

Lines changed: 7 additions & 0 deletions
@@ -40,6 +40,13 @@ The original code can be found [here](https://github.com/haotian-liu/LLaVA/tree/

- Note that the model has not been explicitly trained to process multiple images in the same prompt; although this is technically possible, you may experience inaccurate results.

+ > [!NOTE]
+ > LLaVA models after release v4.46 will raise warnings about adding `processor.patch_size = {{patch_size}}`, `processor.num_additional_image_tokens = {{num_additional_image_tokens}}`, and `processor.vision_feature_select_strategy = {{vision_feature_select_strategy}}`. It is strongly recommended to add these attributes to the processor if you own the model checkpoint, or to open a PR if it is not owned by you.
+ > Adding these attributes means that LLaVA will try to infer the number of image tokens required per image and expand the text with as many `<image>` placeholders as there will be tokens. Usually it is around 500 tokens per image, so make sure the text is not truncated, as otherwise there will be a failure when merging the embeddings.
+ > The attributes can be obtained from the model config as `model.config.vision_config.patch_size` and `model.config.vision_feature_select_strategy`. The `num_additional_image_tokens` should be `1` if the vision backbone adds a CLS token, or `0` if nothing extra is added to the vision patches.
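The back-fill described in this note takes only a few lines. A minimal sketch, assuming a LLaVA checkpoint you own (the repo id below is hypothetical) and a CLIP-style vision backbone that prepends one CLS token:

```python
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Hypothetical repo id; replace with a checkpoint you own.
repo_id = "your-username/your-llava-checkpoint"
model = LlavaForConditionalGeneration.from_pretrained(repo_id)
processor = AutoProcessor.from_pretrained(repo_id)

# Copy the values the processor needs to compute the number of <image> placeholders.
processor.patch_size = model.config.vision_config.patch_size
processor.vision_feature_select_strategy = model.config.vision_feature_select_strategy
# Assumption: the vision backbone adds a single CLS token; use 0 if it adds nothing extra.
processor.num_additional_image_tokens = 1

# Save locally, or push to the Hub, so the warning no longer fires.
processor.save_pretrained(repo_id)
```

The same attributes apply to the other LLaVA-family processors touched by this commit (LLaVA-NeXT, LLaVA-NeXT-Video, Video-LLaVA, VipLLaVA), whose docs carry the same note.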
### Single image inference

For best results, we recommend using the processor's `apply_chat_template()` method to format your prompt correctly. For that, you need to construct a conversation history; passing in a plain string will not format your prompt. Each message in the conversation history for chat templates is a dictionary with keys "role" and "content". The "content" should be a list of dictionaries for the "text" and "image" modalities, as follows:

docs/source/en/model_doc/llava_next.md

Lines changed: 6 additions & 0 deletions
@@ -53,6 +53,12 @@ The original code can be found [here](https://github.com/haotian-liu/LLaVA/tree/

</Tip>

+ > [!NOTE]
+ > LLaVA models after release v4.46 will raise warnings about adding `processor.patch_size = {{patch_size}}`, `processor.num_additional_image_tokens = {{num_additional_image_tokens}}`, and `processor.vision_feature_select_strategy = {{vision_feature_select_strategy}}`. It is strongly recommended to add these attributes to the processor if you own the model checkpoint, or to open a PR if it is not owned by you.
+ > Adding these attributes means that LLaVA will try to infer the number of image tokens required per image and expand the text with as many `<image>` placeholders as there will be tokens. Usually it is around 500 tokens per image, so make sure the text is not truncated, as otherwise there will be a failure when merging the embeddings.
+ > The attributes can be obtained from the model config as `model.config.vision_config.patch_size` and `model.config.vision_feature_select_strategy`. The `num_additional_image_tokens` should be `1` if the vision backbone adds a CLS token, or `0` if nothing extra is added to the vision patches.

- Note that each checkpoint has been trained with a specific prompt format, depending on which large language model (LLM) was used. You can use the processor's `apply_chat_template` to format your prompts correctly. For that, you have to construct a conversation history; passing a plain string will not format your prompt. Each message in the conversation history for chat templates is a dictionary with keys "role" and "content". The "content" should be a list of dictionaries for the "text" and "image" modalities. Below is an example of how to do that and the list of formats accepted by each checkpoint.

We will use [llava-v1.6-mistral-7b-hf](https://huggingface.co/llava-hf/llava-v1.6-mistral-7b-hf) and a conversation history of text and image. Each content field has to be a list of dicts, as follows:

docs/source/en/model_doc/llava_next_video.md

Lines changed: 6 additions & 0 deletions
@@ -50,6 +50,12 @@ The original code can be found [here](https://github.com/LLaVA-VL/LLaVA-NeXT/tre

</Tip>

+ > [!NOTE]
+ > LLaVA models after release v4.46 will raise warnings about adding `processor.patch_size = {{patch_size}}`, `processor.num_additional_image_tokens = {{num_additional_image_tokens}}`, and `processor.vision_feature_select_strategy = {{vision_feature_select_strategy}}`. It is strongly recommended to add these attributes to the processor if you own the model checkpoint, or to open a PR if it is not owned by you.
+ > Adding these attributes means that LLaVA will try to infer the number of image tokens required per image and expand the text with as many `<image>` placeholders as there will be tokens. Usually it is around 500 tokens per image, so make sure the text is not truncated, as otherwise there will be a failure when merging the embeddings.
+ > The attributes can be obtained from the model config as `model.config.vision_config.patch_size` and `model.config.vision_feature_select_strategy`. The `num_additional_image_tokens` should be `1` if the vision backbone adds a CLS token, or `0` if nothing extra is added to the vision patches.

- Note that each checkpoint has been trained with a specific prompt format, depending on which large language model (LLM) was used. You can use the tokenizer's `apply_chat_template` to format your prompts correctly. Below is an example of how to do that.

We will use [LLaVA-NeXT-Video-7B-hf](https://huggingface.co/llava-hf/LLaVA-NeXT-Video-7B-hf) and a conversation history of videos and images. Each content field has to be a list of dicts, as follows:

docs/source/en/model_doc/video_llava.md

Lines changed: 6 additions & 0 deletions
@@ -54,6 +54,12 @@ This model was contributed by [RaushanTurganbay](https://huggingface.co/RaushanT

The original code can be found [here](https://github.com/PKU-YuanGroup/Video-LLaVA).

+ > [!NOTE]
+ > LLaVA models after release v4.46 will raise warnings about adding `processor.patch_size = {{patch_size}}`, `processor.num_additional_image_tokens = {{num_additional_image_tokens}}`, and `processor.vision_feature_select_strategy = {{vision_feature_select_strategy}}`. It is strongly recommended to add these attributes to the processor if you own the model checkpoint, or to open a PR if it is not owned by you.
+ > Adding these attributes means that LLaVA will try to infer the number of image tokens required per image and expand the text with as many `<image>` placeholders as there will be tokens. Usually it is around 500 tokens per image, so make sure the text is not truncated, as otherwise there will be a failure when merging the embeddings.
+ > The attributes can be obtained from the model config as `model.config.vision_config.patch_size` and `model.config.vision_feature_select_strategy`. The `num_additional_image_tokens` should be `1` if the vision backbone adds a CLS token, or `0` if nothing extra is added to the vision patches.

## Usage example

### Single Media Mode

docs/source/en/model_doc/vipllava.md

Lines changed: 6 additions & 0 deletions
@@ -39,6 +39,12 @@ This model was contributed by [Younes Belkada](https://huggingface.co/ybelkada)

- Note that the model has not been explicitly trained to process multiple images in the same prompt; although this is technically possible, you may experience inaccurate results.

+ > [!NOTE]
+ > LLaVA models after release v4.46 will raise warnings about adding `processor.patch_size = {{patch_size}}`, `processor.num_additional_image_tokens = {{num_additional_image_tokens}}`, and `processor.vision_feature_select_strategy = {{vision_feature_select_strategy}}`. It is strongly recommended to add these attributes to the processor if you own the model checkpoint, or to open a PR if it is not owned by you.
+ > Adding these attributes means that LLaVA will try to infer the number of image tokens required per image and expand the text with as many `<image>` placeholders as there will be tokens. Usually it is around 500 tokens per image, so make sure the text is not truncated, as otherwise there will be a failure when merging the embeddings.
+ > The attributes can be obtained from the model config as `model.config.vision_config.patch_size` and `model.config.vision_feature_select_strategy`. The `num_additional_image_tokens` should be `1` if the vision backbone adds a CLS token, or `0` if nothing extra is added to the vision patches.

- For better results, we recommend using the processor's `apply_chat_template()` method to format your prompt correctly. For that, you need to construct a conversation history; passing in a plain string will not format your prompt. Each message in the conversation history for chat templates is a dictionary with keys "role" and "content". The "content" should be a list of dictionaries for the "text" and "image" modalities, as follows:

```python

src/transformers/models/llava/processing_llava.py

Lines changed: 16 additions & 3 deletions
@@ -58,10 +58,19 @@ class LlavaProcessor(ProcessorMixin):
            in a chat into a tokenizable string.
        image_token (`str`, *optional*, defaults to `"<image>"`):
            Special token used to denote image location.
+       num_additional_image_tokens (`int`, *optional*, defaults to 0):
+           Number of additional tokens added to the image embeddings, such as CLS (+1). If the backbone has no CLS or other
+           extra tokens appended, no need to set this arg.
    """

    attributes = ["image_processor", "tokenizer"]
-   valid_kwargs = ["chat_template", "patch_size", "vision_feature_select_strategy", "image_token"]
+   valid_kwargs = [
+       "chat_template",
+       "patch_size",
+       "vision_feature_select_strategy",
+       "image_token",
+       "num_additional_image_tokens",
+   ]
    image_processor_class = "AutoImageProcessor"
    tokenizer_class = "AutoTokenizer"

@@ -73,9 +82,11 @@ def __init__(
        vision_feature_select_strategy=None,
        chat_template=None,
        image_token="<image>",  # set the default and let users change if they have peculiar special tokens in rare cases
+       num_additional_image_tokens=0,
        **kwargs,
    ):
        self.patch_size = patch_size
+       self.num_additional_image_tokens = num_additional_image_tokens
        self.vision_feature_select_strategy = vision_feature_select_strategy
        self.image_token = tokenizer.image_token if hasattr(tokenizer, "image_token") else image_token
        super().__init__(image_processor, tokenizer, chat_template=chat_template)

@@ -147,9 +158,11 @@ def __call__(
            # Replace the image token with the expanded image token sequence
            pixel_values = image_inputs["pixel_values"]
            height, width = get_image_size(to_numpy_array(pixel_values[0]))
-           num_image_tokens = (height // self.patch_size) * (width // self.patch_size) + 1
+           num_image_tokens = (height // self.patch_size) * (
+               width // self.patch_size
+           ) + self.num_additional_image_tokens
            if self.vision_feature_select_strategy == "default":
-               num_image_tokens -= 1
+               num_image_tokens -= self.num_additional_image_tokens

            prompt_strings = []
            for sample in text:
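To make the arithmetic concrete, here is a standalone sketch of the formula above with assumed numbers (a 336×336 input, `patch_size=14`, and a backbone that adds one CLS token); with `num_additional_image_tokens=1` and the `"default"` strategy it reproduces the previous hard-coded `+ 1` / `- 1` behaviour:

```python
# Assumed values for illustration only: 336x336 pixels, 14x14 patches, CLS token present.
height, width = 336, 336
patch_size = 14
num_additional_image_tokens = 1  # 1 because the backbone prepends a CLS token
vision_feature_select_strategy = "default"

num_image_tokens = (height // patch_size) * (width // patch_size) + num_additional_image_tokens
if vision_feature_select_strategy == "default":
    # The "default" strategy drops the CLS feature, so the extra token is subtracted back out.
    num_image_tokens -= num_additional_image_tokens

print(num_image_tokens)  # 24 * 24 = 576 <image> placeholders for this image
```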

src/transformers/models/llava_next/processing_llava_next.py

Lines changed: 14 additions & 3 deletions
@@ -61,10 +61,19 @@ class LlavaNextProcessor(ProcessorMixin):
            in a chat into a tokenizable string.
        image_token (`str`, *optional*, defaults to `"<image>"`):
            Special token used to denote image location.
+       num_additional_image_tokens (`int`, *optional*, defaults to 0):
+           Number of additional tokens added to the image embeddings, such as CLS (+1). If the backbone has no CLS or other
+           extra tokens appended, no need to set this arg.
    """

    attributes = ["image_processor", "tokenizer"]
-   valid_kwargs = ["chat_template", "patch_size", "vision_feature_select_strategy", "image_token"]
+   valid_kwargs = [
+       "chat_template",
+       "patch_size",
+       "vision_feature_select_strategy",
+       "image_token",
+       "num_additional_image_tokens",
+   ]
    image_processor_class = "AutoImageProcessor"
    tokenizer_class = "AutoTokenizer"

@@ -76,9 +85,11 @@ def __init__(
        vision_feature_select_strategy=None,
        chat_template=None,
        image_token="<image>",  # set the default and let users change if they have peculiar special tokens in rare cases
+       num_additional_image_tokens=0,
        **kwargs,
    ):
        self.patch_size = patch_size
+       self.num_additional_image_tokens = num_additional_image_tokens
        self.vision_feature_select_strategy = vision_feature_select_strategy
        self.image_token = tokenizer.image_token if hasattr(tokenizer, "image_token") else image_token
        super().__init__(image_processor, tokenizer, chat_template=chat_template)

@@ -155,7 +166,7 @@ def __call__(
                orig_height, orig_width = image_size
                num_image_tokens = self._get_number_of_features(orig_height, orig_width, height, width)
                if self.vision_feature_select_strategy == "default":
-                   num_image_tokens -= 1
+                   num_image_tokens -= self.num_additional_image_tokens
                sample = sample.replace(self.image_token, "<placeholder>" * num_image_tokens, 1)
            prompt_strings.append(sample)
        prompt_strings = [sample.replace("<placeholder>", self.image_token) for sample in prompt_strings]

@@ -178,7 +189,7 @@ def _get_number_of_features(self, orig_height: int, orig_width: int, height: int
            orig_height, orig_width, patches_height, patches_width, scale_height, scale_width
        )
        # The base patch covers the entire image (+1 for the CLS)
-       base_features = patches_height * patches_width + 1
+       base_features = patches_height * patches_width + self.num_additional_image_tokens
        num_image_tokens = unpadded_features + newline_features + base_features
        return num_image_tokens
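The commit message's "move order for BC" presumably refers to keeping old token counts unchanged: for checkpoints whose backbone adds a CLS token, `num_additional_image_tokens=1` combined with the `"default"` strategy yields exactly what the hard-coded `+ 1` / `- 1` produced before, while backbones with no extra token can now set `0`. A small sketch of that argument, using made-up placeholder feature counts (only the relative arithmetic matters):

```python
# Placeholder feature counts for illustration; they do not correspond to a real checkpoint.
def count_image_tokens(num_additional_image_tokens, strategy,
                       patches_height=24, patches_width=24,
                       unpadded_features=1728, newline_features=48):
    base_features = patches_height * patches_width + num_additional_image_tokens
    num_image_tokens = unpadded_features + newline_features + base_features
    if strategy == "default":
        num_image_tokens -= num_additional_image_tokens
    return num_image_tokens

# With a CLS-adding backbone and the "default" strategy, the extra token cancels out,
# matching the previous hard-coded +1 / -1 behaviour:
assert count_image_tokens(1, "default") == 1728 + 48 + 24 * 24
# The "full" strategy keeps the CLS feature, so the count is one higher:
assert count_image_tokens(1, "full") == count_image_tokens(1, "default") + 1
```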
