
Commit 228cfbd

[Doc] Improve quickstart documentation (#9256)
Signed-off-by: Rafael Vasquez <[email protected]>
1 parent ca0d922 commit 228cfbd

File tree

1 file changed (+52, -46 lines)


docs/source/getting_started/quickstart.rst

Lines changed: 52 additions & 46 deletions
@@ -1,38 +1,50 @@
 .. _quickstart:

+==========
 Quickstart
 ==========

-This guide shows how to use vLLM to:
+This guide will help you quickly get started with vLLM to:

-* run offline batched inference on a dataset;
-* build an API server for a large language model;
-* start an OpenAI-compatible API server.
+* :ref:`Run offline batched inference <offline_batched_inference>`
+* :ref:`Run OpenAI-compatible inference <openai_compatible_server>`

-Be sure to complete the :ref:`installation instructions <installation>` before continuing with this guide.
+Prerequisites
+--------------
+- OS: Linux
+- Python: 3.8 - 3.12
+- GPU: compute capability 7.0 or higher (e.g., V100, T4, RTX20xx, A100, L4, H100, etc.)

-.. note::
+Installation
+--------------
+
+You can install vLLM using pip. It's recommended to use `conda <https://docs.conda.io/projects/conda/en/latest/user-guide/getting-started.html>`_ to create and manage Python environments.
+
+.. code-block:: console

-    By default, vLLM downloads model from `HuggingFace <https://huggingface.co/>`_. If you would like to use models from `ModelScope <https://www.modelscope.cn>`_ in the following examples, please set the environment variable:
+    $ conda create -n myenv python=3.10 -y
+    $ conda activate myenv
+    $ pip install vllm

-.. code-block:: shell
+Please refer to the :ref:`installation documentation <installation>` for more details on installing vLLM.

-    export VLLM_USE_MODELSCOPE=True
+.. _offline_batched_inference:

 Offline Batched Inference
 -------------------------

-We first show an example of using vLLM for offline batched inference on a dataset. In other words, we use vLLM to generate texts for a list of input prompts.
+With vLLM installed, you can start generating texts for a list of input prompts (i.e., offline batch inference). The example script for this section can be found `here <https://github.com/vllm-project/vllm/blob/main/examples/offline_inference.py>`__.
+
+The first line of this example imports the classes :class:`~vllm.LLM` and :class:`~vllm.SamplingParams`:

-Import :class:`~vllm.LLM` and :class:`~vllm.SamplingParams` from vLLM.
-The :class:`~vllm.LLM` class is the main class for running offline inference with vLLM engine.
-The :class:`~vllm.SamplingParams` class specifies the parameters for the sampling process.
+- :class:`~vllm.LLM` is the main class for running offline inference with vLLM engine.
+- :class:`~vllm.SamplingParams` specifies the parameters for the sampling process.

 .. code-block:: python

     from vllm import LLM, SamplingParams

-Define the list of input prompts and the sampling parameters for generation. The sampling temperature is set to 0.8 and the nucleus sampling probability is set to 0.95. For more information about the sampling parameters, refer to the `class definition <https://github.com/vllm-project/vllm/blob/main/vllm/sampling_params.py>`_.
+The next section defines a list of input prompts and sampling parameters for text generation. The `sampling temperature <https://arxiv.org/html/2402.05201v1>`_ is set to ``0.8`` and the `nucleus sampling probability <https://en.wikipedia.org/wiki/Top-p_sampling>`_ is set to ``0.95``. You can find more information about the sampling parameters `here <https://docs.vllm.ai/en/stable/dev/sampling_params.html>`__.

 .. code-block:: python

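The prompts list that the example iterates over sits outside this hunk's context lines. A minimal sketch of the complete offline-inference flow described above, using illustrative prompt strings together with the model and sampling values shown in the diff, could look like this:

.. code-block:: python

    from vllm import LLM, SamplingParams

    # Illustrative prompts; the actual example script defines its own list.
    prompts = [
        "Hello, my name is",
        "The capital of France is",
    ]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    llm = LLM(model="facebook/opt-125m")
    outputs = llm.generate(prompts, sampling_params)

    for output in outputs:
        print(f"Prompt: {output.prompt!r}, Generated text: {output.outputs[0].text!r}")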
@@ -44,64 +56,64 @@ Define the list of input prompts and the sampling parameters for generation. The
     ]
     sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

-Initialize vLLM's engine for offline inference with the :class:`~vllm.LLM` class and the `OPT-125M model <https://arxiv.org/abs/2205.01068>`_. The list of supported models can be found at :ref:`supported models <supported_models>`.
+The :class:`~vllm.LLM` class initializes vLLM's engine and the `OPT-125M model <https://arxiv.org/abs/2205.01068>`_ for offline inference. The list of supported models can be found :ref:`here <supported_models>`.

 .. code-block:: python

     llm = LLM(model="facebook/opt-125m")

-Call ``llm.generate`` to generate the outputs. It adds the input prompts to vLLM engine's waiting queue and executes the vLLM engine to generate the outputs with high throughput. The outputs are returned as a list of ``RequestOutput`` objects, which include all the output tokens.
+.. note::
+
+    By default, vLLM downloads models from `HuggingFace <https://huggingface.co/>`_. If you would like to use models from `ModelScope <https://www.modelscope.cn>`_, set the environment variable ``VLLM_USE_MODELSCOPE`` before initializing the engine.
+
+Now, the fun part! The outputs are generated using ``llm.generate``. It adds the input prompts to the vLLM engine's waiting queue and executes the vLLM engine to generate the outputs with high throughput. The outputs are returned as a list of ``RequestOutput`` objects, which include all of the output tokens.

 .. code-block:: python

     outputs = llm.generate(prompts, sampling_params)

-    # Print the outputs.
     for output in outputs:
         prompt = output.prompt
         generated_text = output.outputs[0].text
         print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

-
-The code example can also be found in `examples/offline_inference.py <https://github.com/vllm-project/vllm/blob/main/examples/offline_inference.py>`_.
+.. _openai_compatible_server:

 OpenAI-Compatible Server
 ------------------------

 vLLM can be deployed as a server that implements the OpenAI API protocol. This allows vLLM to be used as a drop-in replacement for applications using OpenAI API.
-By default, it starts the server at ``http://localhost:8000``. You can specify the address with ``--host`` and ``--port`` arguments. The server currently hosts one model at a time (OPT-125M in the command below) and implements `list models <https://platform.openai.com/docs/api-reference/models/list>`_, `create chat completion <https://platform.openai.com/docs/api-reference/chat/completions/create>`_, and `create completion <https://platform.openai.com/docs/api-reference/completions/create>`_ endpoints. We are actively adding support for more endpoints.
+By default, it starts the server at ``http://localhost:8000``. You can specify the address with ``--host`` and ``--port`` arguments. The server currently hosts one model at a time and implements endpoints such as `list models <https://platform.openai.com/docs/api-reference/models/list>`_, `create chat completion <https://platform.openai.com/docs/api-reference/chat/completions/create>`_, and `create completion <https://platform.openai.com/docs/api-reference/completions/create>`_.

-Start the server:
+Run the following command to start the vLLM server with the `Qwen2.5-1.5B-Instruct <https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct>`_ model:

 .. code-block:: console

-    $ vllm serve facebook/opt-125m
+    $ vllm serve Qwen/Qwen2.5-1.5B-Instruct

-By default, the server uses a predefined chat template stored in the tokenizer. You can override this template by using the ``--chat-template`` argument:
-
-.. code-block:: console
+.. note::

-    $ vllm serve facebook/opt-125m --chat-template ./examples/template_chatml.jinja
+    By default, the server uses a predefined chat template stored in the tokenizer. You can learn about overriding it `here <https://github.com/vllm-project/vllm/blob/main/docs/source/serving/openai_compatible_server.md#chat-template>`__.

-This server can be queried in the same format as OpenAI API. For example, list the models:
+This server can be queried in the same format as the OpenAI API. For example, to list the models:

 .. code-block:: console

     $ curl http://localhost:8000/v1/models

 You can pass in the argument ``--api-key`` or environment variable ``VLLM_API_KEY`` to enable the server to check for API key in the header.

-Using OpenAI Completions API with vLLM
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+OpenAI Completions API with vLLM
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-Query the model with input prompts:
+Once your server is started, you can query the model with input prompts:

 .. code-block:: console

     $ curl http://localhost:8000/v1/completions \
     $ -H "Content-Type: application/json" \
     $ -d '{
-    $ "model": "facebook/opt-125m",
+    $ "model": "Qwen/Qwen2.5-1.5B-Instruct",
     $ "prompt": "San Francisco is a",
     $ "max_tokens": 7,
     $ "temperature": 0
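The ``--api-key`` / ``VLLM_API_KEY`` option mentioned above is not demonstrated in the diff. As a hypothetical sketch, assuming the server was started with ``vllm serve Qwen/Qwen2.5-1.5B-Instruct --api-key token-abc123`` (the key value is a placeholder), the client side would pass the same key:

.. code-block:: python

    from openai import OpenAI

    # Placeholder key; it must match the --api-key / VLLM_API_KEY value the server was started with.
    client = OpenAI(api_key="token-abc123", base_url="http://localhost:8000/v1")
    print(client.models.list())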
@@ -120,36 +132,32 @@ Since this server is compatible with OpenAI API, you can use it as a drop-in rep
         api_key=openai_api_key,
         base_url=openai_api_base,
     )
-    completion = client.completions.create(model="facebook/opt-125m",
+    completion = client.completions.create(model="Qwen/Qwen2.5-1.5B-Instruct",
                                            prompt="San Francisco is a")
     print("Completion result:", completion)

-For a more detailed client example, refer to `examples/openai_completion_client.py <https://github.com/vllm-project/vllm/blob/main/examples/openai_completion_client.py>`_.
-
-Using OpenAI Chat API with vLLM
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+A more detailed client example can be found `here <https://github.com/vllm-project/vllm/blob/main/examples/openai_completion_client.py>`__.

-The vLLM server is designed to support the OpenAI Chat API, allowing you to engage in dynamic conversations with the model. The chat interface is a more interactive way to communicate with the model, allowing back-and-forth exchanges that can be stored in the chat history. This is useful for tasks that require context or more detailed explanations.
+OpenAI Chat API with vLLM
+~~~~~~~~~~~~~~~~~~~~~~~~~~

-Querying the model using OpenAI Chat API:
+vLLM is designed to also support the OpenAI Chat API. The chat interface is a more dynamic, interactive way to communicate with the model, allowing back-and-forth exchanges that can be stored in the chat history. This is useful for tasks that require context or more detailed explanations.

-You can use the `create chat completion <https://platform.openai.com/docs/api-reference/chat/completions/create>`_ endpoint to communicate with the model in a chat-like interface:
+You can use the `create chat completion <https://platform.openai.com/docs/api-reference/chat/completions/create>`_ endpoint to interact with the model:

 .. code-block:: console

     $ curl http://localhost:8000/v1/chat/completions \
     $ -H "Content-Type: application/json" \
     $ -d '{
-    $ "model": "facebook/opt-125m",
+    $ "model": "Qwen/Qwen2.5-1.5B-Instruct",
     $ "messages": [
     $ {"role": "system", "content": "You are a helpful assistant."},
     $ {"role": "user", "content": "Who won the world series in 2020?"}
     $ ]
     $ }'

-Python Client Example:
-
-Using the `openai` python package, you can also communicate with the model in a chat-like manner:
+Alternatively, you can use the ``openai`` python package:

 .. code-block:: python

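The setup at the top of this Python example (the ``openai`` import and the ``OpenAI(...)`` constructor whose closing parenthesis opens the next hunk) lies outside the context lines shown here. A sketch of that setup, assuming a local server address and a placeholder API key, could be:

.. code-block:: python

    from openai import OpenAI

    # Placeholder key and local server address (assumed; not part of this hunk's context lines).
    openai_api_key = "EMPTY"
    openai_api_base = "http://localhost:8000/v1"

    client = OpenAI(
        api_key=openai_api_key,
        base_url=openai_api_base,
    )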
@@ -164,12 +172,10 @@ Using the `openai` python package, you can also communicate with the model in a
     )

     chat_response = client.chat.completions.create(
-        model="facebook/opt-125m",
+        model="Qwen/Qwen2.5-1.5B-Instruct",
         messages=[
             {"role": "system", "content": "You are a helpful assistant."},
             {"role": "user", "content": "Tell me a joke."},
         ]
     )
     print("Chat response:", chat_response)
-
-For more in-depth examples and advanced features of the chat API, you can refer to the official OpenAI documentation.