Description
Your current environment
- Hardware: Nvidia DGX-2 - 16x32GB V100 GPUs
- Ubuntu 20.04.6
- Docker version 24.0.7
- Docker Image: vllm/vllm-openai:v0.8.2
- CUDA information:
  - nvidia-smi: CUDA Version: 12.2
  - nvcc --version: Cuda compilation tools, release 12.8
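For reference, the device generation can be double-checked from inside the running container with something like the one-liner below (purely illustrative; the container name is the one from my commands further down). The V100s report compute capability (7, 0), i.e. no native bfloat16, which is why I pass --dtype float16.

docker exec vLLM-Gemma3-27B python3 -c \
  "import torch; print(torch.cuda.get_device_name(0), torch.cuda.get_device_capability(0))"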
How would you like to use vllm
Hello and thank you for this awesome tool!
Background:
- My goal is to get Gemma3-27B running on a completely offline Nvidia DGX-2 GPU cluster (16x32GB V100 GPUs = 512GB VRAM) using vLLM's v0.8.2 Docker Image.
- The smaller Gemma3-1B runs perfectly on just one of the GPUs, with no problems 👍

docker run -d --name vLLM-Gemma3-1B --runtime nvidia \
  --gpus='"device=10"' \
  -v /raid/models/google/:/root/.cache/huggingface \
  -p 8001:8000 \
  --ipc=host \
  --restart=unless-stopped \
  offline-image-repo:8180/vllm-openai:v0.8.2 \
  --model /root/.cache/huggingface/gemma-3-1b-it \
  --dtype float16 \
  --served-model-name google/gemma-3-1b-it
The Problem
- I can get the Gemma3-27B container to start and run with no errors reported in the docker logs; everything seems good.

docker run -d --name vLLM-Gemma3-27B --runtime nvidia \
  --gpus='"device=0,1,2,3"' \
  -v /raid/models/google:/root/.cache/huggingface \
  -p 8000:8000 \
  --ipc=host \
  --restart=unless-stopped \
  offline-image-repo:8180/vllm-openai:v0.8.2 \
  --model /root/.cache/huggingface/gemma-3-27b-it \
  --dtype float16 \
  --served-model-name google/gemma-3-27b-it \
  --max-model-len 5000 \
  --tensor-parallel-size 4
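For reference, watching the logs with something along these lines (the grep pattern is purely illustrative) turns up nothing suspicious:

docker logs -f vLLM-Gemma3-27B 2>&1 | grep -iE "error|warn|traceback"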
When I attempt a simple inference request (a representative example is shown after this list), I see:
- the server receives the request 👍
- GPU utilization spikes as it performs inference 👍
- the "inference" (i.e. the spike in GPU utilization) runs for much longer than expected, around 2 minutes instead of seconds 😕
- the logs seem to show everything went fine
- the inference response I receive looks normal, except that "content" is just an empty string 👎
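A request along these lines is representative of what I'm sending (the exact prompt and max_tokens are just illustrative):

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "google/gemma-3-27b-it",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 100
      }'

The response structure itself looks normal; only the "content" field of the returned message comes back as an empty string.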
I'm stumped as to what is happening here. Do you have any suggestions?
A side note, if it helps: I was able to get vLLM Docker + Gemma3-27B running across 2 GPUs on a different GPU cluster (a Lenovo HGX with 4x80GB H100 GPUs) and it works fantastically. The DGX-2 is obviously a couple of GPU generations older, but I was hoping to get this running on both.
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.