Commit 06386a6

[Frontend] Chat-based Embeddings API (#9759)
1 parent d3aa2a8 · commit 06386a6

21 files changed: +853 -415 lines

docs/requirements-docs.txt

Lines changed: 2 additions & 0 deletions
@@ -13,5 +13,7 @@ torch
 py-cpuinfo
 transformers
 mistral_common >= 1.3.4
+aiohttp
+starlette
 openai # Required by docs/source/serving/openai_compatible_server.md's vllm.entrypoints.openai.cli_args
 partial-json-parser # Required by docs/source/serving/openai_compatible_server.md's vllm.entrypoints.openai.cli_args

docs/source/conf.py

Lines changed: 1 addition & 1 deletion
@@ -96,7 +96,6 @@ def setup(app):

 # Mock out external dependencies here, otherwise the autodoc pages may be blank.
 autodoc_mock_imports = [
-    "aiohttp",
     "compressed_tensors",
     "cpuinfo",
     "cv2",
@@ -143,6 +142,7 @@ def add_line(self, line: str, source: str, *lineno: int) -> None:
     "python": ("https://docs.python.org/3", None),
     "typing_extensions":
     ("https://typing-extensions.readthedocs.io/en/latest", None),
+    "aiohttp": ("https://docs.aiohttp.org/en/stable", None),
     "pillow": ("https://pillow.readthedocs.io/en/stable", None),
     "numpy": ("https://numpy.org/doc/stable", None),
     "torch": ("https://pytorch.org/docs/stable", None),

docs/source/dev/pooling_params.rst

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+Pooling Parameters
+==================
+
+.. autoclass:: vllm.PoolingParams
+    :members:
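For orientation, a rough offline sketch of where ``PoolingParams`` fits, assuming an embedding model such as ``intfloat/e5-mistral-7b-instruct`` and the ``LLM.encode`` entry point (both are assumptions for illustration; the class's actual fields are rendered by the ``autoclass`` directive above):

```python
from vllm import LLM, PoolingParams

# Assumption: the model name and the task="embedding" argument mirror the
# `--task embedding` flag used with `vllm serve` elsewhere in this commit.
llm = LLM(model="intfloat/e5-mistral-7b-instruct", task="embedding")

# Default-constructed pooling parameters; see the generated class docs for its fields.
outputs = llm.encode(["Hello, my name is vLLM."], pooling_params=PoolingParams())

# Each output carries the pooled embedding for one prompt.
print(len(outputs[0].outputs.embedding))
```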

docs/source/getting_started/quickstart.rst

Lines changed: 4 additions & 4 deletions
@@ -138,10 +138,10 @@ Since this server is compatible with OpenAI API, you can use it as a drop-in rep

 A more detailed client example can be found `here <https://github.com/vllm-project/vllm/blob/main/examples/openai_completion_client.py>`__.

-OpenAI Chat API with vLLM
-~~~~~~~~~~~~~~~~~~~~~~~~~~
+OpenAI Chat Completions API with vLLM
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-vLLM is designed to also support the OpenAI Chat API. The chat interface is a more dynamic, interactive way to communicate with the model, allowing back-and-forth exchanges that can be stored in the chat history. This is useful for tasks that require context or more detailed explanations.
+vLLM is designed to also support the OpenAI Chat Completions API. The chat interface is a more dynamic, interactive way to communicate with the model, allowing back-and-forth exchanges that can be stored in the chat history. This is useful for tasks that require context or more detailed explanations.

 You can use the `create chat completion <https://platform.openai.com/docs/api-reference/chat/completions/create>`_ endpoint to interact with the model:

@@ -157,7 +157,7 @@ You can use the `create chat completion <https://platform.openai.com/docs/api-re
     $ ]
     $ }'

-Alternatively, you can use the `openai` python package:
+Alternatively, you can use the ``openai`` python package:

 .. code-block:: python

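For reference, a minimal sketch of the ``openai``-client call that the quickstart section above describes, assuming a vLLM server on ``localhost:8000`` and a placeholder model name:

```python
from openai import OpenAI

# Assumes an OpenAI-compatible vLLM server is already running locally,
# e.g. started with `vllm serve <model>` on the default port 8000.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

chat_response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder; use the model you served
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me a joke."},
    ],
)
print(chat_response.choices[0].message.content)
```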
docs/source/index.rst

Lines changed: 1 addition & 0 deletions
@@ -134,6 +134,7 @@ Documentation
    :caption: Developer Documentation

    dev/sampling_params
+   dev/pooling_params
    dev/offline_inference/offline_index
    dev/engine/engine_index
    dev/kernel/paged_attention

docs/source/models/vlm.rst

Lines changed: 51 additions & 3 deletions
@@ -185,7 +185,7 @@ Below is an example on how to launch the same ``microsoft/Phi-3.5-vision-instruc
       --trust-remote-code --max-model-len 4096 --limit-mm-per-prompt image=2

 .. important::
-    Since OpenAI Vision API is based on `Chat Completions <https://platform.openai.com/docs/api-reference/chat>`_ API,
+    Since OpenAI Vision API is based on `Chat Completions API <https://platform.openai.com/docs/api-reference/chat>`_,
     a chat template is **required** to launch the API server.

 Although Phi-3.5-Vision comes with a chat template, for other models you may have to provide one if the model's tokenizer does not come with it.
@@ -243,6 +243,10 @@ To consume the server, you can use the OpenAI client like in the example below:

 A full code example can be found in `examples/openai_api_client_for_multimodal.py <https://github.com/vllm-project/vllm/blob/main/examples/openai_api_client_for_multimodal.py>`_.

+.. tip::
+    There is no need to place image placeholders in the text content of the API request - they are already represented by the image content.
+    In fact, you can place image placeholders in the middle of the text by interleaving text and image content.
+
 .. note::

     By default, the timeout for fetching images through http url is ``5`` seconds. You can override this by setting the environment variable:
@@ -251,5 +255,49 @@ A full code example can be found in `examples/openai_api_client_for_multimodal.p

     $ export VLLM_IMAGE_FETCH_TIMEOUT=<timeout>

-.. note::
-    There is no need to format the prompt in the API request since it will be handled by the server.
+Chat Embeddings API
+^^^^^^^^^^^^^^^^^^^
+
+vLLM's Chat Embeddings API is a superset of OpenAI's `Embeddings API <https://platform.openai.com/docs/api-reference/embeddings>`_,
+where a list of ``messages`` can be passed instead of batched ``inputs``. This enables multi-modal inputs to be passed to embedding models.
+
+.. tip::
+    The schema of ``messages`` is exactly the same as in Chat Completions API.
+
+In this example, we will serve the ``TIGER-Lab/VLM2Vec-Full`` model.
+
+.. code-block:: bash
+
+    vllm serve TIGER-Lab/VLM2Vec-Full --task embedding \
+      --trust-remote-code --max-model-len 4096
+
+.. important::
+
+    Since VLM2Vec has the same model architecture as Phi-3.5-Vision, we have to explicitly pass ``--task embedding``
+    to run this model in embedding mode instead of text generation mode.
+
+Since this schema is not defined by OpenAI client, we post a request to the server using the lower-level ``requests`` library:
+
+.. code-block:: python
+
+    import requests
+
+    image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
+
+    response = requests.post(
+        "http://localhost:8000/v1/embeddings",
+        json={
+            "model": "TIGER-Lab/VLM2Vec-Full",
+            "messages": [{
+                "role": "user",
+                "content": [
+                    {"type": "image_url", "image_url": {"url": image_url}},
+                    {"type": "text", "text": "Represent the given image."},
+                ],
+            }],
+            "encoding_format": "float",
+        },
+    )
+    response.raise_for_status()
+    response_json = response.json()
+    print("Embedding output:", response_json["data"][0]["embedding"])
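For comparison with the chat-based request above, plain text inputs can still be sent to the same ``/v1/embeddings`` endpoint through the official ``openai`` client. A minimal sketch, assuming the VLM2Vec server launched above is reachable on ``localhost:8000`` (how meaningful a text-only prompt is depends on the model):

```python
from openai import OpenAI

# Assumes the `vllm serve TIGER-Lab/VLM2Vec-Full --task embedding ...` command
# from the section above is already running.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

result = client.embeddings.create(
    model="TIGER-Lab/VLM2Vec-Full",
    input=["Represent the given sentence: a boardwalk through a nature reserve."],
    encoding_format="float",
)
print("Embedding dimension:", len(result.data[0].embedding))
```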

docs/source/serving/openai_compatible_server.md

Lines changed: 44 additions & 11 deletions
@@ -26,13 +26,26 @@ print(completion.choices[0].message)
 ```

 ## API Reference
-Please see the [OpenAI API Reference](https://platform.openai.com/docs/api-reference) for more information on the API. We support all parameters except:
-- Chat: `tools`, and `tool_choice`.
-- Completions: `suffix`.

-vLLM also provides experimental support for OpenAI Vision API compatible inference. See more details in [Using VLMs](../models/vlm.rst).
+We currently support the following OpenAI APIs:
+
+- [Completions API](https://platform.openai.com/docs/api-reference/completions)
+  - *Note: `suffix` parameter is not supported.*
+- [Chat Completions API](https://platform.openai.com/docs/api-reference/chat)
+  - [Vision](https://platform.openai.com/docs/guides/vision)-related parameters are supported; see [Using VLMs](../models/vlm.rst).
+    - *Note: `image_url.detail` parameter is not supported.*
+  - We also support `audio_url` content type for audio files.
+    - Refer to [vllm.entrypoints.chat_utils](https://github.com/vllm-project/vllm/tree/main/vllm/entrypoints/chat_utils.py) for the exact schema.
+    - *TODO: Support `input_audio` content type as defined [here](https://github.com/openai/openai-python/blob/v1.52.2/src/openai/types/chat/chat_completion_content_part_input_audio_param.py).*
+  - *Note: `parallel_tool_calls` and `user` parameters are ignored.*
+- [Embeddings API](https://platform.openai.com/docs/api-reference/embeddings)
+  - Instead of `inputs`, you can pass in a list of `messages` (same schema as Chat Completions API),
+    which will be treated as a single prompt to the model according to its chat template.
+  - This enables multi-modal inputs to be passed to embedding models, see [Using VLMs](../models/vlm.rst).
+    - *Note: You should run `vllm serve` with `--task embedding` to ensure that the model is being run in embedding mode.*

 ## Extra Parameters
+
 vLLM supports a set of parameters that are not part of the OpenAI API.
 In order to use them, you can pass them as extra parameters in the OpenAI client.
 Or directly merge them into the JSON payload if you are using HTTP call directly.
@@ -49,7 +62,26 @@ completion = client.chat.completions.create(
 )
 ```

-### Extra Parameters for Chat API
+### Extra Parameters for Completions API
+
+The following [sampling parameters (click through to see documentation)](../dev/sampling_params.rst) are supported.
+
+```{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
+:language: python
+:start-after: begin-completion-sampling-params
+:end-before: end-completion-sampling-params
+```
+
+The following extra parameters are supported:
+
+```{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
+:language: python
+:start-after: begin-completion-extra-params
+:end-before: end-completion-extra-params
+```
+
+### Extra Parameters for Chat Completions API
+
 The following [sampling parameters (click through to see documentation)](../dev/sampling_params.rst) are supported.

 ```{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
@@ -66,21 +98,22 @@ The following extra parameters are supported:
 :end-before: end-chat-completion-extra-params
 ```

-### Extra Parameters for Completions API
-The following [sampling parameters (click through to see documentation)](../dev/sampling_params.rst) are supported.
+### Extra Parameters for Embeddings API
+
+The following [pooling parameters (click through to see documentation)](../dev/pooling_params.rst) are supported.

 ```{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
 :language: python
-:start-after: begin-completion-sampling-params
-:end-before: end-completion-sampling-params
+:start-after: begin-embedding-pooling-params
+:end-before: end-embedding-pooling-params
 ```

 The following extra parameters are supported:

 ```{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
 :language: python
-:start-after: begin-completion-extra-params
-:end-before: end-completion-extra-params
+:start-after: begin-embedding-extra-params
+:end-before: end-embedding-extra-params
 ```

 ## Chat Template
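To illustrate the "Extra Parameters" mechanism described in the hunks above: fields that are not part of the OpenAI spec ride along in the request body, and the ``openai`` client exposes them via ``extra_body``. A minimal sketch using ``guided_choice``, one of vLLM's documented extra chat parameters (the model name is a placeholder):

```python
from openai import OpenAI

# Assumes a vLLM server is running locally; the model name is a placeholder.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"},
    ],
    # vLLM-specific parameters go into extra_body; the server reads them
    # from the merged JSON payload.
    extra_body={"guided_choice": ["positive", "negative"]},
)
print(completion.choices[0].message.content)
```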

tests/entrypoints/openai/test_basic.py

Lines changed: 4 additions & 9 deletions
@@ -1,7 +1,6 @@
 from http import HTTPStatus
 from typing import List

-import openai
 import pytest
 import pytest_asyncio
 import requests
@@ -83,10 +82,8 @@ async def client(server):
     indirect=True,
 )
 @pytest.mark.asyncio
-async def test_show_version(client: openai.AsyncOpenAI):
-    base_url = str(client.base_url)[:-3].strip("/")
-
-    response = requests.get(base_url + "/version")
+async def test_show_version(server: RemoteOpenAIServer):
+    response = requests.get(server.url_for("version"))
     response.raise_for_status()

     assert response.json() == {"version": VLLM_VERSION}
@@ -102,9 +99,7 @@ async def test_show_version(client: openai.AsyncOpenAI):
     indirect=True,
 )
 @pytest.mark.asyncio
-async def test_check_health(client: openai.AsyncOpenAI):
-    base_url = str(client.base_url)[:-3].strip("/")
-
-    response = requests.get(base_url + "/health")
+async def test_check_health(server: RemoteOpenAIServer):
+    response = requests.get(server.url_for("health"))

     assert response.status_code == HTTPStatus.OK
