
Commit a31fa21

zucchini-nlp and qubvel authored
🔴 Video processors as a separate class (#35206)
* initial design
* update all video processors
* add tests
* need to add qwen2-vl (not tested yet)
* add qwen2-vl in auto map
* fix copies
* isort
* resolve conflicts kinda
* nit:
* qwen2-vl is happy now
* qwen2-5 happy
* other models are happy
* fix copies
* fix tests
* add docs
* CI green now?
* add more tests
* even more changes + tests
* doc builder fail
* nit
* Update src/transformers/models/auto/processing_auto.py
  Co-authored-by: Pavel Iakubovskii <[email protected]>
* small update
* imports correctly
* dump, otherwise this is getting unmanageable T-T
* dump
* update
* another update
* update
* tests
* move
* modular
* docs
* test
* another update
* init
* remove flakiness in tests
* fixup
* clean up and remove commented lines
* docs
* skip this one!
* last fix after rebasing
* run fixup
* delete slow files
* remove unnecessary tests + clean up a bit
* small fixes
* fix tests
* more updates
* docs
* fix tests
* update
* style
* fix qwen2-5-vl
* fixup
* fixup
* unflatten batch when preparing
* dump, come back soon
* add docs and fix some tests
* how to guard this with new dummies?
* chat templates in qwen
* address some comments
* remove `Fast` suffix
* fixup
* oops, should be imported from transforms
* typo in requires dummies
* new model added with video support
* fixup once more
* last fixup I hope
* revert image processor name + comments
* oh, this is why fetch test is failing
* fix tests
* fix more tests
* fixup
* add new models: internvl, smolvlm
* update docs
* import once
* fix failing tests
* do we need to guard it here again, why?
* new model was added, update it
* remove testcase from tester
* fix tests
* make style
* not related CI fail, let's just fix here
* mark flaky for now, fails 15 out of 100
* style
* maybe we can do this way?
* don't download images in setup class

---------

Co-authored-by: Pavel Iakubovskii <[email protected]>
1 parent 716819b commit a31fa21

File tree

83 files changed (+5419 −2005 lines)


docs/source/en/_toctree.yml

Lines changed: 5 additions & 1 deletion
```diff
@@ -39,6 +39,8 @@
     title: Tokenizers
   - local: image_processors
     title: Image processors
+  - local: video_processors
+    title: Video processors
   - local: backbones
     title: Backbones
   - local: feature_extractors
@@ -362,7 +364,9 @@
     title: Feature Extractor
   - local: main_classes/image_processor
     title: Image Processor
-  title: Main classes
+  - local: main_classes/video_processor
+    title: Video Processor
+  title: Main Classes
 - sections:
   - sections:
     - local: model_doc/albert
```

docs/source/en/image_processors.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -16,7 +16,7 @@ rendered properly in your Markdown viewer.
 
 # Image processors
 
-Image processors converts images into pixel values, tensors that represent image colors and size. The pixel values are inputs to a vision or video model. To ensure a pretrained model receives the correct input, an image processor can perform the following operations to make sure an image is exactly like the images a model was pretrained on.
+Image processors converts images into pixel values, tensors that represent image colors and size. The pixel values are inputs to a vision model. To ensure a pretrained model receives the correct input, an image processor can perform the following operations to make sure an image is exactly like the images a model was pretrained on.
 
 - [`~BaseImageProcessor.center_crop`] to resize an image
 - [`~BaseImageProcessor.normalize`] or [`~BaseImageProcessor.rescale`] pixel values
```
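For contrast with the new video path introduced by this commit, the image-processor behavior described in the hunk above can be exercised directly; a small sketch, assuming a hypothetical local `example.jpg` and a checkpoint that ships an image processor:

```python
from PIL import Image
from transformers import AutoImageProcessor

image = Image.open("example.jpg")  # hypothetical local image
processor = AutoImageProcessor.from_pretrained("llava-hf/llava-onevision-qwen2-0.5b-ov-hf")

# Resizing, rescaling, and normalization happen inside the call,
# producing the pixel values a vision model expects.
inputs = processor(images=image, return_tensors="pt")
print(inputs["pixel_values"].shape)
```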
docs/source/en/main_classes/video_processor.md

Lines changed: 55 additions & 0 deletions (new file)

<!--Copyright 2025 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

# Video Processor

A **Video Processor** is a utility responsible for preparing input features for video models, as well as handling the post-processing of their outputs. It provides transformations such as resizing, normalization, and conversion into PyTorch tensors.

The video processor extends the functionality of image processors by allowing Vision Large Language Models (VLMs) to handle videos with a distinct set of arguments compared to images. It serves as the bridge between raw video data and the model, ensuring that input features are optimized for the VLM.

When adding a new VLM or updating an existing one to enable distinct video preprocessing, saving and reloading the processor configuration will store the video-related arguments in a dedicated file named `video_preprocessing_config.json`. Don't worry if you haven't updated your VLM: the processor will try to load video-related configurations from a file named `preprocessing_config.json`.
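That save/reload round trip can be sketched in a few lines; a minimal sketch, assuming a hypothetical local scratch directory and that `save_pretrained`/`from_pretrained` behave as they do for other processors:

```python
from transformers import AutoVideoProcessor

# Load a video processor and write its configuration to a local directory
# ("./my_video_processor" is a hypothetical scratch path).
processor = AutoVideoProcessor.from_pretrained("llava-hf/llava-onevision-qwen2-0.5b-ov-hf")
processor.save_pretrained("./my_video_processor")

# Reloading from the saved directory restores the same video-related arguments.
reloaded = AutoVideoProcessor.from_pretrained("./my_video_processor")
```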
### Usage Example
Here's an example of how to load a video processor with the [`llava-hf/llava-onevision-qwen2-0.5b-ov-hf`](https://huggingface.co/llava-hf/llava-onevision-qwen2-0.5b-ov-hf) model:

```python
from transformers import AutoVideoProcessor

processor = AutoVideoProcessor.from_pretrained("llava-hf/llava-onevision-qwen2-0.5b-ov-hf")
```

Currently, if using the base image processor for videos, video data is processed by treating each frame as an individual image and applying transformations frame-by-frame. While functional, this approach is not highly efficient. Using `AutoVideoProcessor` allows us to take advantage of **fast video processors**, leveraging the [torchvision](https://pytorch.org/vision/stable/index.html) library. Fast processors handle the whole batch of videos at once, without iterating over each video or frame. These updates introduce GPU acceleration and significantly enhance processing speed, especially for tasks requiring high throughput.

Fast video processors are available for all models and are loaded by default when an `AutoVideoProcessor` is initialized. When using a fast video processor, you can also set the `device` argument to specify the device on which the processing should be done. By default, the processing is done on the same device as the inputs if the inputs are tensors, or on the CPU otherwise. For an even greater speed-up, the processor can be compiled when using `"cuda"` as the device.
```python
import torch
from transformers.video_utils import load_video
from transformers import AutoVideoProcessor

video = load_video("video.mp4")
processor = AutoVideoProcessor.from_pretrained("llava-hf/llava-onevision-qwen2-0.5b-ov-hf", device="cuda")
processor = torch.compile(processor)
processed_video = processor(video, return_tensors="pt")
```
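Since fast processors transform the whole batch at once, passing several videos in a single call is a natural extension of the snippet above; a minimal sketch, assuming two hypothetical local clips:

```python
from transformers.video_utils import load_video
from transformers import AutoVideoProcessor

# "clip_a.mp4" and "clip_b.mp4" are hypothetical local files.
videos = [load_video("clip_a.mp4"), load_video("clip_b.mp4")]

processor = AutoVideoProcessor.from_pretrained("llava-hf/llava-onevision-qwen2-0.5b-ov-hf", device="cuda")
# The fast processor handles the whole list in one pass instead of frame-by-frame.
batch = processor(videos, return_tensors="pt")
```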
## BaseVideoProcessor

[[autodoc]] video_processing_utils.BaseVideoProcessor

docs/source/en/model_doc/auto.md

Lines changed: 4 additions & 0 deletions
```diff
@@ -74,6 +74,10 @@ Likewise, if your `NewModel` is a subclass of [`PreTrainedModel`], make sure its
 
 [[autodoc]] AutoImageProcessor
 
+## AutoVideoProcessor
+
+[[autodoc]] AutoVideoProcessor
+
 ## AutoProcessor
 
 [[autodoc]] AutoProcessor
```
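Like the other auto classes documented in this file, `AutoVideoProcessor` resolves the model-specific class from the checkpoint; a quick sketch, reusing the checkpoint from the new video-processor docs:

```python
from transformers import AutoVideoProcessor

video_processor = AutoVideoProcessor.from_pretrained("llava-hf/llava-onevision-qwen2-0.5b-ov-hf")
# Expected to print the model-specific class, e.g. "LlavaOnevisionVideoProcessor".
print(type(video_processor).__name__)
```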

docs/source/en/model_doc/instructblipvideo.md

Lines changed: 6 additions & 0 deletions
```diff
@@ -58,6 +58,12 @@ The attributes can be obtained from model config, as `model.config.num_query_tok
 
 [[autodoc]] InstructBlipVideoProcessor
 
+
+## InstructBlipVideoVideoProcessor
+
+[[autodoc]] InstructBlipVideoVideoProcessor
+    - preprocess
+
 ## InstructBlipVideoImageProcessor
 
 [[autodoc]] InstructBlipVideoImageProcessor
```

docs/source/en/model_doc/internvl.md

Lines changed: 4 additions & 0 deletions
```diff
@@ -353,3 +353,7 @@ This example showcases how to handle a batch of chat conversations with interlea
 ## InternVLProcessor
 
 [[autodoc]] InternVLProcessor
+
+## InternVLVideoProcessor
+
+[[autodoc]] InternVLVideoProcessor
```

docs/source/en/model_doc/llava_next_video.md

Lines changed: 4 additions & 0 deletions
```diff
@@ -262,6 +262,10 @@ model = LlavaNextVideoForConditionalGeneration.from_pretrained(
 
 [[autodoc]] LlavaNextVideoImageProcessor
 
+## LlavaNextVideoVideoProcessor
+
+[[autodoc]] LlavaNextVideoVideoProcessor
+
 ## LlavaNextVideoModel
 
 [[autodoc]] LlavaNextVideoModel
```

docs/source/en/model_doc/llava_onevision.md

Lines changed: 5 additions & 0 deletions
```diff
@@ -303,6 +303,7 @@ model = LlavaOnevisionForConditionalGeneration.from_pretrained(
 ## LlavaOnevisionImageProcessor
 
 [[autodoc]] LlavaOnevisionImageProcessor
+    - preprocess
 
 ## LlavaOnevisionImageProcessorFast
 
@@ -313,6 +314,10 @@ model = LlavaOnevisionForConditionalGeneration.from_pretrained(
 
 [[autodoc]] LlavaOnevisionVideoProcessor
 
+## LlavaOnevisionVideoProcessor
+
+[[autodoc]] LlavaOnevisionVideoProcessor
+
 ## LlavaOnevisionModel
 
 [[autodoc]] LlavaOnevisionModel
```

docs/source/en/model_doc/qwen2_vl.md

Lines changed: 5 additions & 0 deletions
```diff
@@ -287,6 +287,11 @@ model = Qwen2VLForConditionalGeneration.from_pretrained(
 [[autodoc]] Qwen2VLImageProcessor
     - preprocess
 
+## Qwen2VLVideoProcessor
+
+[[autodoc]] Qwen2VLVideoProcessor
+    - preprocess
+
 ## Qwen2VLImageProcessorFast
 
 [[autodoc]] Qwen2VLImageProcessorFast
```

docs/source/en/model_doc/smolvlm.md

Lines changed: 3 additions & 0 deletions
```diff
@@ -197,6 +197,9 @@ print(generated_texts[0])
 [[autodoc]] SmolVLMImageProcessor
     - preprocess
 
+## SmolVLMVideoProcessor
+[[autodoc]] SmolVLMVideoProcessor
+    - preprocess
 
 ## SmolVLMProcessor
 [[autodoc]] SmolVLMProcessor
```
