docs/source/models/vlm.rst (+8, -1)
@@ -16,6 +16,13 @@ The following :ref:`engine arguments <engine_args>` are specific to VLMs:
     :prog: -m vllm.entrypoints.openai.api_server
     :nodefaultconst:
 
+.. important::
+    Currently, support for vision language models in vLLM has the following limitations:
+
+    * Only a single image input is supported per text prompt.
+    * Dynamic ``image_input_shape`` is not supported: the input image will be resized to the static ``image_input_shape``. This means the model output might not exactly match the HuggingFace implementation.
+
+    We are continuously improving the user and developer experience for VLMs. Please raise an issue on GitHub if you have any feedback or feature requests.
 
 Offline Batched Inference
 -------------------------
@@ -31,7 +38,7 @@ To initialize a VLM, the aforementioned arguments must be passed to the ``LLM``
         image_feature_size=576,
     )
 
-For now, we only support a single image per text prompt. To pass an image to the model, note the following in :class:`vllm.inputs.PromptStrictInputs`:
+To pass an image to the model, note the following in :class:`vllm.inputs.PromptStrictInputs`:
 
 * ``prompt``: The prompt should have a number of ``<image>`` tokens equal to ``image_feature_size``.
 * ``multi_modal_data``: This should be an instance of :class:`~vllm.multimodal.image.ImagePixelData` or :class:`~vllm.multimodal.image.ImageFeatureData`.
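
Putting these pieces together, a minimal offline-inference sketch might look like the code below. This is an illustration under stated assumptions, not part of the diff: the model checkpoint, image token ID, input shape, prompt template, and image path are placeholder values typical of a LLaVA-1.5 setup.

.. code-block:: python

    from PIL import Image

    from vllm import LLM
    from vllm.multimodal.image import ImagePixelData

    # Engine arguments mirror the VLM-specific flags documented above.
    # The concrete values assume a LLaVA-1.5 model (336x336 input, 576 patches).
    llm = LLM(
        model="llava-hf/llava-1.5-7b-hf",  # assumed model checkpoint
        image_input_type="pixel_values",
        image_token_id=32000,              # assumed <image> token id
        image_input_shape="1,3,336,336",
        image_feature_size=576,
    )

    # Per the first bullet above, the prompt must contain exactly
    # ``image_feature_size`` <image> tokens.
    prompt = "<image>" * 576 + "\nUSER: What is shown in this image?\nASSISTANT:"

    # The image is resized to the static ``image_input_shape`` (see the
    # limitation noted earlier); "example.jpg" is a placeholder path.
    image = Image.open("example.jpg")

    outputs = llm.generate({
        "prompt": prompt,
        "multi_modal_data": ImagePixelData(image),
    })

    for o in outputs:
        print(o.outputs[0].text)

If image features are precomputed outside vLLM, :class:`~vllm.multimodal.image.ImageFeatureData` can be passed as ``multi_modal_data`` instead of :class:`~vllm.multimodal.image.ImagePixelData`, per the second bullet above.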