The following tutorial demonstrates how to deploy a simple
[facebook/opt-125m](https://huggingface.co/facebook/opt-125m) model on
Triton Inference Server using Triton's
[Python-based](https://github.com/triton-inference-server/backend/blob/main/docs/python_based_backends.md#python-based-backends)
[vLLM](https://github.com/triton-inference-server/vllm_backend/tree/main)
backend.

*NOTE*: The tutorial is intended to be a reference example only and has [known limitations](#limitations).


## Step 1: Prepare your model repository

To use Triton, we need to build a model repository. For this tutorial we will
use the model repository provided in the [samples](https://github.com/triton-inference-server/vllm_backend/tree/main/samples)
folder of the [vllm_backend](https://github.com/triton-inference-server/vllm_backend/tree/main)
repository.

The following set of commands will create a `model_repository/vllm_model/1`
directory and copy 2 files:
[`model.json`](https://github.com/triton-inference-server/vllm_backend/blob/main/samples/model_repository/vllm_model/1/model.json)
and
[`config.pbtxt`](https://github.com/triton-inference-server/vllm_backend/blob/main/samples/model_repository/vllm_model/config.pbtxt),
required to serve the [facebook/opt-125m](https://huggingface.co/facebook/opt-125m) model.
```
mkdir -p model_repository/vllm_model/1
wget -P model_repository/vllm_model/1 https://raw.githubusercontent.com/triton-inference-server/vllm_backend/main/samples/model_repository/vllm_model/1/model.json
wget -P model_repository/vllm_model/ https://raw.githubusercontent.com/triton-inference-server/vllm_backend/main/samples/model_repository/vllm_model/config.pbtxt
```

The model repository should look like this:
```
model_repository/
└── vllm_model
    ├── 1
    │   └── model.json
    └── config.pbtxt
```

The content of `model.json` is:

```json
{
    "model": "facebook/opt-125m",
    "disable_log_requests": "true",
    "gpu_memory_utilization": 0.5
}
```

This file can be modified to provide further settings to the vLLM engine. See vLLM
[AsyncEngineArgs](https://github.com/vllm-project/vllm/blob/32b6816e556f69f1672085a6267e8516bcb8e622/vllm/engine/arg_utils.py#L165)
and
[EngineArgs](https://github.com/vllm-project/vllm/blob/32b6816e556f69f1672085a6267e8516bcb8e622/vllm/engine/arg_utils.py#L11)
for supported key-value pairs. Inflight batching and paged attention are handled
by the vLLM engine.

For multi-GPU support, EngineArgs like `tensor_parallel_size` can be specified
in [`model.json`](https://github.com/triton-inference-server/vllm_backend/blob/main/samples/model_repository/vllm_model/1/model.json).
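
For example, a hypothetical `model.json` for a machine with two GPUs might extend
the sample settings above with `tensor_parallel_size`. This is only a sketch: the
exact set of keys accepted depends on the vLLM version you are running.

```json
{
    "model": "facebook/opt-125m",
    "disable_log_requests": "true",
    "gpu_memory_utilization": 0.5,
    "tensor_parallel_size": 2
}
```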

*Note*: vLLM greedily consumes up to 90% of the GPU's memory under default settings.
This tutorial updates this behavior by setting `gpu_memory_utilization` to 50%.
You can tweak this behavior using fields like `gpu_memory_utilization` and other settings
in [`model.json`](https://github.com/triton-inference-server/vllm_backend/blob/main/samples/model_repository/vllm_model/1/model.json).

Read through the documentation in [`model.py`](https://github.com/triton-inference-server/vllm_backend/blob/main/src/model.py)
to understand how to configure this sample for your use case.

## Step 2: Launch Triton Inference Server

Once you have the model repository set up, it is time to launch the Triton server.
Starting with the 23.10 release, a dedicated container with vLLM pre-installed
is available on [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver).
To use this container to launch Triton, you can use the docker command below.
```
docker run --gpus all -it --net=host --rm -p 8001:8001 --shm-size=1G --ulimit memlock=-1 --ulimit stack=67108864 -v ${PWD}:/work -w /work nvcr.io/nvidia/tritonserver:<xx.yy>-vllm-python-py3 tritonserver --model-store ./model_repository
```
Throughout the tutorial, \<xx.yy\> is the version of Triton
that you want to use. Please note that Triton's vLLM
container was first published in the 23.10 release, so any prior version
will not work.

After you start Triton you will see output on the console showing
the server starting up and loading the model. When you see output
like the following, Triton is ready to accept inference requests.

```
I1030 22:33:28.291908 1 grpc_server.cc:2513] Started GRPCInferenceService at 0.0.0.0:8001
I1030 22:33:28.292879 1 http_server.cc:4497] Started HTTPService at 0.0.0.0:8000
I1030 22:33:28.335154 1 http_server.cc:270] Started Metrics Service at 0.0.0.0:8002
```
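
Before moving on, you can optionally confirm from another terminal that the server is
up by querying Triton's standard readiness endpoint (this endpoint is part of Triton's
HTTP protocol, not specific to this tutorial); a `200 OK` response means the server is ready:
```
curl -v localhost:8000/v2/health/ready
```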

## Step 3: Use a Triton Client to Send Your First Inference Request

In this tutorial, we will show how to send an inference request to the
[facebook/opt-125m](https://huggingface.co/facebook/opt-125m) model in 2 ways:

* [Using the generate endpoint](#using-the-generate-endpoint)
* [Using the gRPC asyncio client](#using-the-grpc-asyncio-client)

### Using the Generate Endpoint
After you start Triton with the sample model_repository,
you can quickly run your first inference request with the
[generate](https://github.com/triton-inference-server/server/blob/main/docs/protocol/extension_generate.md)
endpoint.

Start Triton's SDK container with the following command:
```
docker run -it --net=host -v ${PWD}:/workspace/ nvcr.io/nvidia/tritonserver:<xx.yy>-py3-sdk bash
```

Now, let's send an inference request:
```
curl -X POST localhost:8000/v2/models/vllm_model/generate -d '{"text_input": "What is Triton Inference Server?", "parameters": {"stream": false, "temperature": 0}}'
```

Upon success, you should see a response from the server like this one:
```
{"model_name":"vllm_model","model_version":"1","text_output":"What is Triton Inference Server?\n\nTriton Inference Server is a server that is used by many"}
```
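
The same request can also be sent from Python. Below is a minimal sketch using the
`requests` library; it assumes Triton is reachable at `localhost:8000` and that
`requests` is installed in your environment:

```python
import requests

# Same payload as the curl command above; streaming is disabled so a single
# JSON response is returned.
payload = {
    "text_input": "What is Triton Inference Server?",
    "parameters": {"stream": False, "temperature": 0},
}

response = requests.post(
    "http://localhost:8000/v2/models/vllm_model/generate",
    json=payload,
)
response.raise_for_status()
print(response.json()["text_output"])
```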

### Using the gRPC Asyncio Client
Now, we will see how to run the client within Triton's SDK container
to issue multiple async requests using the
[gRPC asyncio client](https://github.com/triton-inference-server/client/blob/main/src/python/library/tritonclient/grpc/aio/__init__.py)
library.

This method requires a
[client.py](https://github.com/triton-inference-server/vllm_backend/blob/main/samples/client.py)
script and a set of
[prompts](https://github.com/triton-inference-server/vllm_backend/blob/main/samples/prompts.txt),
which are provided in the
[samples](https://github.com/triton-inference-server/vllm_backend/tree/main/samples)
folder of the
[vllm_backend](https://github.com/triton-inference-server/vllm_backend/tree/main)
repository.

Use the following commands to download `client.py` and `prompts.txt` to your
current directory:
```
wget https://raw.githubusercontent.com/triton-inference-server/vllm_backend/main/samples/client.py
wget https://raw.githubusercontent.com/triton-inference-server/vllm_backend/main/samples/prompts.txt
```

Now, we are ready to start Triton's SDK container:
```
docker run -it --net=host -v ${PWD}:/workspace/ nvcr.io/nvidia/tritonserver:<xx.yy>-py3-sdk bash
```

Within the container, run
[`client.py`](https://github.com/triton-inference-server/vllm_backend/blob/main/samples/client.py)
with:
```
python3 client.py
```

The client reads prompts from the
[prompts.txt](https://github.com/triton-inference-server/vllm_backend/blob/main/samples/prompts.txt)
file, sends them to Triton server for
inference, and stores the results into a file named `results.txt` by default.

The output of the client should look like below:

```
Loading inputs from `prompts.txt`...
Storing results into `results.txt`...
PASS: vLLM example
```

You can inspect the contents of `results.txt` for the response
from the server. The `--iterations` flag can be used with the client
to increase the load on the server by looping through the list of
provided prompts in
[prompts.txt](https://github.com/triton-inference-server/vllm_backend/blob/main/samples/prompts.txt).

When you run the client in verbose mode with the `--verbose` flag,
the client will print more details about the request/response transactions.
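
If you want to adapt the client to your own application, the core of what
`client.py` does can be reduced to roughly the sketch below. This is only a
sketch: it assumes the tensor names from the sample `config.pbtxt`
(`text_input`, `stream`, `text_output`), a single hard-coded prompt, and the
`tritonclient` gRPC asyncio API shipped in the SDK container.

```python
import asyncio

import numpy as np
import tritonclient.grpc.aio as grpcclient


async def main():
    # Connect to the Triton server started in Step 2 (gRPC port 8001).
    client = grpcclient.InferenceServerClient(url="localhost:8001")

    async def request_iterator():
        # Build a single non-streaming request; client.py does this once per prompt.
        text_data = np.array(["What is Triton Inference Server?".encode("utf-8")], dtype=np.object_)
        stream_data = np.array([False], dtype=bool)

        text_input = grpcclient.InferInput("text_input", [1], "BYTES")
        text_input.set_data_from_numpy(text_data)
        stream_input = grpcclient.InferInput("stream", [1], "BOOL")
        stream_input.set_data_from_numpy(stream_data)

        yield {
            "model_name": "vllm_model",
            "inputs": [text_input, stream_input],
            "outputs": [grpcclient.InferRequestedOutput("text_output")],
            "request_id": "1",
        }

    # stream_infer is used because the vLLM backend is decoupled
    # (see the Limitations section below).
    async for result, error in client.stream_infer(inputs_iterator=request_iterator()):
        if error is not None:
            raise error
        print(result.as_numpy("text_output"))

    await client.close()


if __name__ == "__main__":
    asyncio.run(main())
```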

## Limitations

- We use decoupled streaming protocol even if there is exactly 1 response for each request.
- The asyncio implementation is exposed to model.py.
- Does not support providing a specific subset of GPUs to be used.
- If you are running multiple instances of Triton server with
  a Python-based vLLM backend, you need to specify a different
  `shm-region-prefix-name` for each server, as shown in the sketch below. See
  [here](https://github.com/triton-inference-server/python_backend#running-multiple-instances-of-triton-server)
  for more information.
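
A rough sketch of launching two Triton instances on the same host follows. The
`--backend-config` flag syntax comes from the python_backend documentation linked
above; the prefix values are arbitrary, and each server would also need its own
set of service ports to avoid clashes.
```
# First Triton instance
tritonserver --model-store ./model_repository --backend-config=python,shm-region-prefix-name=prefix1
# Second Triton instance
tritonserver --model-store ./model_repository --backend-config=python,shm-region-prefix-name=prefix2
```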