docs/source/models/vlm.rst (+8, -1)
@@ -16,6 +16,13 @@ The following :ref:`engine arguments <engine_args>` are specific to VLMs:
     :prog: -m vllm.entrypoints.openai.api_server
     :nodefaultconst:
 
+.. important::
+    Currently, support for vision language models in vLLM has the following limitations:
+
+    * Only a single image input is supported per text prompt.
+    * Dynamic ``image_input_shape`` is not supported: the input image will be resized to the static ``image_input_shape``. This means the model output might not exactly match the HuggingFace implementation.
+
+    We are continuously improving the user and developer experience for VLMs. Please raise an issue on GitHub if you have any feedback or feature requests.
 
 Offline Batched Inference
 -------------------------
@@ -31,7 +38,7 @@ To initialize a VLM, the aforementioned arguments must be passed to the ``LLM``
         image_feature_size=576,
     )
 
-For now, we only support a single image per text prompt. To pass an image to the model, note the following in :class:`vllm.inputs.PromptStrictInputs`:
+To pass an image to the model, note the following in :class:`vllm.inputs.PromptStrictInputs`:
 
 * ``prompt``: The prompt should have a number of ``<image>`` tokens equal to ``image_feature_size``.
 * ``multi_modal_data``: This should be an instance of :class:`~vllm.multimodal.image.ImagePixelData` or :class:`~vllm.multimodal.image.ImageFeatureData`.
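
Putting these pieces together, a minimal offline-inference sketch might look like the code below. This is an illustration under stated assumptions, not part of the diff: the model checkpoint, image token ID, input shape, prompt template, and image path are placeholder values typical of a LLaVA-1.5 setup.

.. code-block:: python

    from PIL import Image

    from vllm import LLM
    from vllm.multimodal.image import ImagePixelData

    # Engine arguments mirror the VLM-specific flags documented above.
    # The concrete values assume a LLaVA-1.5 model (336x336 input, 576 patches).
    llm = LLM(
        model="llava-hf/llava-1.5-7b-hf",  # assumed model checkpoint
        image_input_type="pixel_values",
        image_token_id=32000,              # assumed <image> token id
        image_input_shape="1,3,336,336",
        image_feature_size=576,
    )

    # Per the first bullet above, the prompt must contain exactly
    # ``image_feature_size`` <image> tokens.
    prompt = "<image>" * 576 + "\nUSER: What is shown in this image?\nASSISTANT:"

    # The image is resized to the static ``image_input_shape`` (see the
    # limitation noted earlier); "example.jpg" is a placeholder path.
    image = Image.open("example.jpg")

    outputs = llm.generate({
        "prompt": prompt,
        "multi_modal_data": ImagePixelData(image),
    })

    for o in outputs:
        print(o.outputs[0].text)

If image features are precomputed outside vLLM, :class:`~vllm.multimodal.image.ImageFeatureData` can be passed as ``multi_modal_data`` instead of :class:`~vllm.multimodal.image.ImagePixelData`, per the second bullet above.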