
Commit 9a179ca

Updated vllm tutorial to use vllm based container from NGC (#65)
* Updated vLLM tutorial's README to use vllm container
Co-authored-by: dyastremsky <[email protected]>
1 parent 5283ae5 commit 9a179ca

File tree

7 files changed: +114 -609 lines changed


Quick_Deploy/vLLM/Dockerfile

Lines changed: 0 additions & 28 deletions
This file was deleted.

Quick_Deploy/vLLM/README.md

Lines changed: 114 additions & 39 deletions
@@ -31,38 +31,43 @@

 The following tutorial demonstrates how to deploy a simple
 [facebook/opt-125m](https://huggingface.co/facebook/opt-125m) model on
-Triton Inference Server using Triton's [Python backend](https://github.com/triton-inference-server/python_backend) and the
-[vLLM](https://github.com/vllm-project/vllm) library.
+Triton Inference Server using Triton's
+[Python-based](https://github.com/triton-inference-server/backend/blob/main/docs/python_based_backends.md#python-based-backends)
+[vLLM](https://github.com/triton-inference-server/vllm_backend/tree/main)
+backend.

 *NOTE*: The tutorial is intended to be a reference example only and has [known limitations](#limitations).


-## Step 1: Build a Triton Container Image with vLLM
+## Step 1: Prepare your model repository

-We will build a new container image derived from tritonserver:23.08-py3 with vLLM.
+To use Triton, we need to build a model repository. For this tutorial we will
+use the model repository provided in the [samples](https://github.com/triton-inference-server/vllm_backend/tree/main/samples)
+folder of the [vllm_backend](https://github.com/triton-inference-server/vllm_backend/tree/main)
+repository.

+The following set of commands will create a `model_repository/vllm_model/1`
+directory and copy 2 files:
+[`model.json`](https://github.com/triton-inference-server/vllm_backend/blob/main/samples/model_repository/vllm_model/1/model.json)
+and
+[`config.pbtxt`](https://github.com/triton-inference-server/vllm_backend/blob/main/samples/model_repository/vllm_model/config.pbtxt),
+required to serve the [facebook/opt-125m](https://huggingface.co/facebook/opt-125m) model.
 ```
-docker build -t tritonserver_vllm .
+mkdir -p model_repository/vllm_model/1
+wget -P model_repository/vllm_model/1 https://raw.githubusercontent.com/triton-inference-server/vllm_backend/main/samples/model_repository/vllm_model/1/model.json
+wget -P model_repository/vllm_model/ https://raw.githubusercontent.com/triton-inference-server/vllm_backend/main/samples/model_repository/vllm_model/config.pbtxt
 ```

-The above command should create the tritonserver_vllm image with vLLM and all of its dependencies.
-
-
-## Step 2: Start Triton Inference Server
-
-A sample model repository for deploying `facebook/opt-125m` using vLLM in Triton is
-included with this demo as `model_repository` directory.
 The model repository should look like this:
 ```
 model_repository/
-`-- vllm
-    |-- 1
-    |   `-- model.py
-    |-- config.pbtxt
-    |-- vllm_engine_args.json
+└── vllm_model
+    ├── 1
+    │   └── model.json
+    └── config.pbtxt
 ```

-The content of `vllm_engine_args.json` is:
+The content of `model.json` is:

 ```json
 {
@@ -71,53 +76,116 @@ The content of `vllm_engine_args.json` is:
     "gpu_memory_utilization": 0.5
 }
 ```
+
 This file can be modified to provide further settings to the vLLM engine. See vLLM
 [AsyncEngineArgs](https://github.com/vllm-project/vllm/blob/32b6816e556f69f1672085a6267e8516bcb8e622/vllm/engine/arg_utils.py#L165)
 and
 [EngineArgs](https://github.com/vllm-project/vllm/blob/32b6816e556f69f1672085a6267e8516bcb8e622/vllm/engine/arg_utils.py#L11)
-for supported key-value pairs.
+for supported key-value pairs. Inflight batching and paged attention are handled
+by the vLLM engine.

-For multi-GPU support, EngineArgs like `tensor_parallel_size` can be specified in [`vllm_engine_args.json`](model_repository/vllm/vllm_engine_args.json).
+For multi-GPU support, EngineArgs like `tensor_parallel_size` can be specified
+in [`model.json`](https://github.com/triton-inference-server/vllm_backend/blob/main/samples/model_repository/vllm_model/1/model.json).

 *Note*: vLLM greedily consumes up to 90% of the GPU's memory under default settings.
 This tutorial updates this behavior by setting `gpu_memory_utilization` to 50%.
 You can tweak this behavior using fields like `gpu_memory_utilization` and other settings
-in [`vllm_engine_args.json`](model_repository/vllm/vllm_engine_args.json).
+in [`model.json`](https://github.com/triton-inference-server/vllm_backend/blob/main/samples/model_repository/vllm_model/1/model.json).
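As a hedged illustration of the two notes above, the snippet below rewrites `model.json` to split the model across two GPUs while keeping the 50% memory cap. The field set is a sketch, not the official sample; any other key accepted by vLLM's `AsyncEngineArgs`/`EngineArgs` can be added the same way.
```
# Illustrative only: overwrite the sample model.json with multi-GPU settings
cat > model_repository/vllm_model/1/model.json <<'EOF'
{
    "model": "facebook/opt-125m",
    "gpu_memory_utilization": 0.5,
    "tensor_parallel_size": 2
}
EOF
```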

-Read through the documentation in [`model.py`](model_repository/vllm/1/model.py) to understand how
-to configure this sample for your use-case.
+Read through the documentation in [`model.py`](https://github.com/triton-inference-server/vllm_backend/blob/main/src/model.py)
+to understand how to configure this sample for your use-case.

-Run the following commands to start the server container:
+## Step 2: Launch Triton Inference Server

+Once you have the model repository set up, it is time to launch the Triton server.
+Starting with the 23.10 release, a dedicated container with vLLM pre-installed
+is available on [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver).
+To launch Triton with this container, use the docker command below.
 ```
-docker run --gpus all -it --rm -p 8001:8001 --shm-size=1G --ulimit memlock=-1 --ulimit stack=67108864 -v ${PWD}:/work -w /work tritonserver_vllm tritonserver --model-store ./model_repository
+docker run --gpus all -it --net=host --rm -p 8001:8001 --shm-size=1G --ulimit memlock=-1 --ulimit stack=67108864 -v ${PWD}:/work -w /work nvcr.io/nvidia/tritonserver:<xx.yy>-vllm-python-py3 tritonserver --model-store ./model_repository
 ```
+Throughout the tutorial, \<xx.yy\> is the version of Triton
+that you want to use. Please note that Triton's vLLM
+container was first published with the 23.10 release, so any prior version
+will not work.
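For example, with the 23.10 release (the first one that includes the vLLM container), the command above becomes:
```
docker run --gpus all -it --net=host --rm -p 8001:8001 --shm-size=1G --ulimit memlock=-1 --ulimit stack=67108864 -v ${PWD}:/work -w /work nvcr.io/nvidia/tritonserver:23.10-vllm-python-py3 tritonserver --model-store ./model_repository
```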

-Upon successful start of the server, you should see the following at the end of the output.
+After you start Triton, you will see output on the console showing
+the server starting up and loading the model. When you see output
+like the following, Triton is ready to accept inference requests.

 ```
-I0901 23:39:08.729123 1 grpc_server.cc:2451] Started GRPCInferenceService at 0.0.0.0:8001
-I0901 23:39:08.729640 1 http_server.cc:3558] Started HTTPService at 0.0.0.0:8000
-I0901 23:39:08.772522 1 http_server.cc:187] Started Metrics Service at 0.0.0.0:8002
+I1030 22:33:28.291908 1 grpc_server.cc:2513] Started GRPCInferenceService at 0.0.0.0:8001
+I1030 22:33:28.292879 1 http_server.cc:4497] Started HTTPService at 0.0.0.0:8000
+I1030 22:33:28.335154 1 http_server.cc:270] Started Metrics Service at 0.0.0.0:8002
 ```
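You can also confirm readiness explicitly through Triton's standard HTTP health endpoints; the commands below assume the default HTTP port 8000 and the `vllm_model` name used in this tutorial:
```
# Server-level readiness (returns HTTP 200 when the server is ready)
curl -v localhost:8000/v2/health/ready
# Model-level readiness for the deployed vLLM model
curl -v localhost:8000/v2/models/vllm_model/ready
```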

-## Step 3: Use a Triton Client to Query the Server
+## Step 3: Use a Triton Client to Send Your First Inference Request

-We will run the client within Triton's SDK container to issue multiple async requests using the
+In this tutorial, we will show how to send an inference request to the
+[facebook/opt-125m](https://huggingface.co/facebook/opt-125m) model in 2 ways:
+
+* [Using the generate endpoint](#using-generate-endpoint)
+* [Using the gRPC asyncio client](#using-grpc-asyncio-client)
+
+### Using the Generate Endpoint
+After you start Triton with the sample model_repository,
+you can quickly run your first inference request with the
+[generate](https://github.com/triton-inference-server/server/blob/main/docs/protocol/extension_generate.md)
+endpoint.
+
+Start Triton's SDK container with the following command:
+```
+docker run -it --net=host -v ${PWD}:/workspace/ nvcr.io/nvidia/tritonserver:<xx.yy>-py3-sdk bash
+```
+
+Now, let's send an inference request:
+```
+curl -X POST localhost:8000/v2/models/vllm_model/generate -d '{"text_input": "What is Triton Inference Server?", "parameters": {"stream": false, "temperature": 0}}'
+```
+
+Upon success, you should see a response from the server like this one:
+```
+{"model_name":"vllm_model","model_version":"1","text_output":"What is Triton Inference Server?\n\nTriton Inference Server is a server that is used by many"}
+```
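The generate extension also defines a streaming variant of this endpoint. Assuming the same model, the request below should return the completion incrementally as server-sent events; the only differences from the previous call are the `generate_stream` path and `"stream": true`:
```
curl -X POST localhost:8000/v2/models/vllm_model/generate_stream -d '{"text_input": "What is Triton Inference Server?", "parameters": {"stream": true, "temperature": 0}}'
```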
+
+### Using the gRPC Asyncio Client
+Now, we will see how to run the client within Triton's SDK container
+to issue multiple async requests using the
 [gRPC asyncio client](https://github.com/triton-inference-server/client/blob/main/src/python/library/tritonclient/grpc/aio/__init__.py)
 library.

+This method requires a
+[client.py](https://github.com/triton-inference-server/vllm_backend/blob/main/samples/client.py)
+script and a set of
+[prompts](https://github.com/triton-inference-server/vllm_backend/blob/main/samples/prompts.txt),
+which are provided in the
+[samples](https://github.com/triton-inference-server/vllm_backend/tree/main/samples)
+folder of the
+[vllm_backend](https://github.com/triton-inference-server/vllm_backend/tree/main)
+repository.
+
+Use the following command to download `client.py` and `prompts.txt` to your
+current directory:
 ```
-docker run -it --net=host -v ${PWD}:/workspace/ nvcr.io/nvidia/tritonserver:23.08-py3-sdk bash
+wget https://raw.githubusercontent.com/triton-inference-server/vllm_backend/main/samples/client.py
+wget https://raw.githubusercontent.com/triton-inference-server/vllm_backend/main/samples/prompts.txt
 ```

-Within the container, run [`client.py`](client.py) with:
+Now, we are ready to start Triton's SDK container:
+```
+docker run -it --net=host -v ${PWD}:/workspace/ nvcr.io/nvidia/tritonserver:<xx.yy>-py3-sdk bash
+```

+Within the container, run
+[`client.py`](https://github.com/triton-inference-server/vllm_backend/blob/main/samples/client.py)
+with:
 ```
 python3 client.py
 ```

-The client reads prompts from the [prompts.txt](prompts.txt) file, sends them to Triton server for
+The client reads prompts from the
+[prompts.txt](https://github.com/triton-inference-server/vllm_backend/blob/main/samples/prompts.txt)
+file, sends them to Triton server for
 inference, and stores the results into a file named `results.txt` by default.

 The output of the client should look like below:
@@ -128,15 +196,22 @@ Storing results into `results.txt`...
 PASS: vLLM example
 ```

-You can inspect the contents of the `results.txt` for the response from the server. The `--iterations`
-flag can be used with the client to increase the load on the server by looping through the list of
-provided prompts in [`prompts.txt`](prompts.txt).
+You can inspect the contents of `results.txt` for the response
+from the server. The `--iterations` flag can be used with the client
+to increase the load on the server by looping through the list of
+provided prompts in
+[prompts.txt](https://github.com/triton-inference-server/vllm_backend/blob/main/samples/prompts.txt).

-When you run the client in verbose mode with the `--verbose` flag, the client will print more details
-about the request/response transactions.
+When you run the client in verbose mode with the `--verbose` flag,
+the client will print more details about the request/response transactions.
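Combining the two flags described above, a heavier and more talkative run could look like the sketch below; the iteration count is arbitrary:
```
# Loop over prompts.txt 5 times and print request/response details
python3 client.py --iterations 5 --verbose
```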

 ## Limitations

 - We use decoupled streaming protocol even if there is exactly 1 response for each request.
 - The asyncio implementation is exposed to model.py.
 - Does not support providing specific subset of GPUs to be used.
+- If you are running multiple instances of Triton server with
+  a Python-based vLLM backend, you need to specify a different
+  `shm-region-prefix-name` for each server. See
+  [here](https://github.com/triton-inference-server/python_backend#running-multiple-instances-of-triton-server)
+  for more information.
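Regarding the last limitation, the prefix is passed through the Python backend's `--backend-config` option. The sketch below shows two co-located servers under that assumption; the prefix names and ports are illustrative, so check the linked Python backend documentation for the exact flags:
```
# First Triton instance
tritonserver --model-store=./model_repository --backend-config=python,shm-region-prefix-name=prefix0 --http-port=8000 --grpc-port=8001 --metrics-port=8002
# Second Triton instance on the same host: different prefix and ports
tritonserver --model-store=./model_repository --backend-config=python,shm-region-prefix-name=prefix1 --http-port=9000 --grpc-port=9001 --metrics-port=9002
```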

0 commit comments
