
Commit a31fa21

zucchini-nlp and qubvel authored
🔴 Video processors as a separate class (#35206)
* initial design
* update all video processors
* add tests
* need to add qwen2-vl (not tested yet)
* add qwen2-vl in auto map
* fix copies
* isort
* resolve conflicts kinda
* nit:
* qwen2-vl is happy now
* qwen2-5 happy
* other models are happy
* fix copies
* fix tests
* add docs
* CI green now?
* add more tests
* even more changes + tests
* doc builder fail
* nit
* Update src/transformers/models/auto/processing_auto.py
  Co-authored-by: Pavel Iakubovskii <[email protected]>
* small update
* imports correctly
* dump, otherwise this is getting unmanageable T-T
* dump
* update
* another update
* update
* tests
* move
* modular
* docs
* test
* another update
* init
* remove flakiness in tests
* fixup
* clean up and remove commented lines
* docs
* skip this one!
* last fix after rebasing
* run fixup
* delete slow files
* remove unnecessary tests + clean up a bit
* small fixes
* fix tests
* more updates
* docs
* fix tests
* update
* style
* fix qwen2-5-vl
* fixup
* fixup
* unflatten batch when preparing
* dump, come back soon
* add docs and fix some tests
* how to guard this with new dummies?
* chat templates in qwen
* address some comments
* remove `Fast` suffix
* fixup
* oops, should be imported from transforms
* typo in requires dummies
* new model added with video support
* fixup once more
* last fixup I hope
* revert image processor name + comments
* oh, this is why fetch test is failing
* fix tests
* fix more tests
* fixup
* add new models: internvl, smolvlm
* update docs
* import once
* fix failing tests
* do we need to guard it here again, why?
* new model was added, update it
* remove testcase from tester
* fix tests
* make style
* not related CI fail, let's just fix here
* mark flaky for now, fails 15 out of 100
* style
* maybe we can do this way?
* don't download images in setup class

---------

Co-authored-by: Pavel Iakubovskii <[email protected]>
1 parent 716819b commit a31fa21

File tree

83 files changed (+5419 −2005 lines)


docs/source/en/_toctree.yml

Lines changed: 5 additions & 1 deletion
```diff
@@ -39,6 +39,8 @@
     title: Tokenizers
   - local: image_processors
     title: Image processors
+  - local: video_processors
+    title: Video processors
   - local: backbones
     title: Backbones
   - local: feature_extractors
@@ -362,7 +364,9 @@
     title: Feature Extractor
   - local: main_classes/image_processor
     title: Image Processor
-  title: Main classes
+  - local: main_classes/video_processor
+    title: Video Processor
+  title: Main Classes
 - sections:
   - sections:
     - local: model_doc/albert
```

docs/source/en/image_processors.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -16,7 +16,7 @@ rendered properly in your Markdown viewer.
 
 # Image processors
 
-Image processors converts images into pixel values, tensors that represent image colors and size. The pixel values are inputs to a vision or video model. To ensure a pretrained model receives the correct input, an image processor can perform the following operations to make sure an image is exactly like the images a model was pretrained on.
+Image processors converts images into pixel values, tensors that represent image colors and size. The pixel values are inputs to a vision model. To ensure a pretrained model receives the correct input, an image processor can perform the following operations to make sure an image is exactly like the images a model was pretrained on.
 
 - [`~BaseImageProcessor.center_crop`] to resize an image
 - [`~BaseImageProcessor.normalize`] or [`~BaseImageProcessor.rescale`] pixel values
```
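For contrast with the new video path introduced by this commit, the image-processor behavior described in the hunk above can be exercised directly; a small sketch, assuming a hypothetical local `example.jpg` and a checkpoint that ships an image processor:

```python
from PIL import Image
from transformers import AutoImageProcessor

image = Image.open("example.jpg")  # hypothetical local image
processor = AutoImageProcessor.from_pretrained("llava-hf/llava-onevision-qwen2-0.5b-ov-hf")

# Resizing, rescaling, and normalization happen inside the call,
# producing the pixel values a vision model expects.
inputs = processor(images=image, return_tensors="pt")
print(inputs["pixel_values"].shape)
```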
docs/source/en/main_classes/video_processor.md

Lines changed: 55 additions & 0 deletions (new file)

<!--Copyright 2025 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

# Video Processor

A **Video Processor** is a utility responsible for preparing input features for video models, as well as handling the post-processing of their outputs. It provides transformations such as resizing, normalization, and conversion into PyTorch tensors.

The video processor extends the functionality of image processors by allowing Vision Large Language Models (VLMs) to handle videos with a distinct set of arguments compared to images. It serves as the bridge between raw video data and the model, ensuring that input features are optimized for the VLM.

When adding a new VLM or updating an existing one to enable distinct video preprocessing, saving and reloading the processor configuration will store the video-related arguments in a dedicated file named `video_preprocessing_config.json`. Don't worry if you haven't updated your VLM: the processor will try to load video-related configurations from a file named `preprocessing_config.json`.
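That save/reload round trip can be sketched in a few lines; a minimal sketch, assuming a hypothetical local scratch directory and that `save_pretrained`/`from_pretrained` behave as they do for other processors:

```python
from transformers import AutoVideoProcessor

# Load a video processor and write its configuration to a local directory
# ("./my_video_processor" is a hypothetical scratch path).
processor = AutoVideoProcessor.from_pretrained("llava-hf/llava-onevision-qwen2-0.5b-ov-hf")
processor.save_pretrained("./my_video_processor")

# Reloading from the saved directory restores the same video-related arguments.
reloaded = AutoVideoProcessor.from_pretrained("./my_video_processor")
```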
### Usage Example
Here's an example of how to load a video processor with the [`llava-hf/llava-onevision-qwen2-0.5b-ov-hf`](https://huggingface.co/llava-hf/llava-onevision-qwen2-0.5b-ov-hf) model:

```python
from transformers import AutoVideoProcessor

processor = AutoVideoProcessor.from_pretrained("llava-hf/llava-onevision-qwen2-0.5b-ov-hf")
```

Currently, if using the base image processor for videos, video data is processed by treating each frame as an individual image and applying transformations frame-by-frame. While functional, this approach is not highly efficient. Using `AutoVideoProcessor` allows us to take advantage of **fast video processors**, leveraging the [torchvision](https://pytorch.org/vision/stable/index.html) library. Fast processors handle the whole batch of videos at once, without iterating over each video or frame. These updates introduce GPU acceleration and significantly enhance processing speed, especially for tasks requiring high throughput.

Fast video processors are available for all models and are loaded by default when an `AutoVideoProcessor` is initialized. When using a fast video processor, you can also set the `device` argument to specify the device on which the processing should be done. By default, the processing is done on the same device as the inputs if the inputs are tensors, or on the CPU otherwise. For an even greater speed-up, the processor can be compiled when using `"cuda"` as the device.
```python
import torch
from transformers.video_utils import load_video
from transformers import AutoVideoProcessor

video = load_video("video.mp4")
processor = AutoVideoProcessor.from_pretrained("llava-hf/llava-onevision-qwen2-0.5b-ov-hf", device="cuda")
processor = torch.compile(processor)
processed_video = processor(video, return_tensors="pt")
```
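Since fast processors transform the whole batch at once, passing several videos in a single call is a natural extension of the snippet above; a minimal sketch, assuming two hypothetical local clips:

```python
from transformers.video_utils import load_video
from transformers import AutoVideoProcessor

# "clip_a.mp4" and "clip_b.mp4" are hypothetical local files.
videos = [load_video("clip_a.mp4"), load_video("clip_b.mp4")]

processor = AutoVideoProcessor.from_pretrained("llava-hf/llava-onevision-qwen2-0.5b-ov-hf", device="cuda")
# The fast processor handles the whole list in one pass instead of frame-by-frame.
batch = processor(videos, return_tensors="pt")
```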
## BaseVideoProcessor

[[autodoc]] video_processing_utils.BaseVideoProcessor

docs/source/en/model_doc/auto.md

Lines changed: 4 additions & 0 deletions
```diff
@@ -74,6 +74,10 @@ Likewise, if your `NewModel` is a subclass of [`PreTrainedModel`], make sure its
 
 [[autodoc]] AutoImageProcessor
 
+## AutoVideoProcessor
+
+[[autodoc]] AutoVideoProcessor
+
 ## AutoProcessor
 
 [[autodoc]] AutoProcessor
```
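Like the other auto classes documented in this file, `AutoVideoProcessor` resolves the model-specific class from the checkpoint; a quick sketch, reusing the checkpoint from the new video-processor docs:

```python
from transformers import AutoVideoProcessor

video_processor = AutoVideoProcessor.from_pretrained("llava-hf/llava-onevision-qwen2-0.5b-ov-hf")
# Expected to print the model-specific class, e.g. "LlavaOnevisionVideoProcessor".
print(type(video_processor).__name__)
```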

docs/source/en/model_doc/instructblipvideo.md

Lines changed: 6 additions & 0 deletions
```diff
@@ -58,6 +58,12 @@ The attributes can be obtained from model config, as `model.config.num_query_tok
 
 [[autodoc]] InstructBlipVideoProcessor
 
+
+## InstructBlipVideoVideoProcessor
+
+[[autodoc]] InstructBlipVideoVideoProcessor
+    - preprocess
+
 ## InstructBlipVideoImageProcessor
 
 [[autodoc]] InstructBlipVideoImageProcessor
```

docs/source/en/model_doc/internvl.md

Lines changed: 4 additions & 0 deletions
```diff
@@ -353,3 +353,7 @@ This example showcases how to handle a batch of chat conversations with interlea
 ## InternVLProcessor
 
 [[autodoc]] InternVLProcessor
+
+## InternVLVideoProcessor
+
+[[autodoc]] InternVLVideoProcessor
```

docs/source/en/model_doc/llava_next_video.md

Lines changed: 4 additions & 0 deletions
```diff
@@ -262,6 +262,10 @@ model = LlavaNextVideoForConditionalGeneration.from_pretrained(
 
 [[autodoc]] LlavaNextVideoImageProcessor
 
+## LlavaNextVideoVideoProcessor
+
+[[autodoc]] LlavaNextVideoVideoProcessor
+
 ## LlavaNextVideoModel
 
 [[autodoc]] LlavaNextVideoModel
```

docs/source/en/model_doc/llava_onevision.md

Lines changed: 5 additions & 0 deletions
```diff
@@ -303,6 +303,7 @@ model = LlavaOnevisionForConditionalGeneration.from_pretrained(
 ## LlavaOnevisionImageProcessor
 
 [[autodoc]] LlavaOnevisionImageProcessor
+    - preprocess
 
 ## LlavaOnevisionImageProcessorFast
 
@@ -313,6 +314,10 @@ model = LlavaOnevisionForConditionalGeneration.from_pretrained(
 
 [[autodoc]] LlavaOnevisionVideoProcessor
 
+## LlavaOnevisionVideoProcessor
+
+[[autodoc]] LlavaOnevisionVideoProcessor
+
 ## LlavaOnevisionModel
 
 [[autodoc]] LlavaOnevisionModel
```

docs/source/en/model_doc/qwen2_vl.md

Lines changed: 5 additions & 0 deletions
```diff
@@ -287,6 +287,11 @@ model = Qwen2VLForConditionalGeneration.from_pretrained(
 [[autodoc]] Qwen2VLImageProcessor
     - preprocess
 
+## Qwen2VLVideoProcessor
+
+[[autodoc]] Qwen2VLVideoProcessor
+    - preprocess
+
 ## Qwen2VLImageProcessorFast
 
 [[autodoc]] Qwen2VLImageProcessorFast
```

docs/source/en/model_doc/smolvlm.md

Lines changed: 3 additions & 0 deletions
```diff
@@ -197,6 +197,9 @@ print(generated_texts[0])
 [[autodoc]] SmolVLMImageProcessor
     - preprocess
 
+## SmolVLMVideoProcessor
+[[autodoc]] SmolVLMVideoProcessor
+    - preprocess
 
 ## SmolVLMProcessor
 [[autodoc]] SmolVLMProcessor
```
