- 
          
- 
                Notifications
    You must be signed in to change notification settings 
- Fork 10.8k
Description
Your current environment
- Hardware: Nvidia DGX-2 - 16x32GB V100 GPUs
- Ubuntu 20.04.6
- Docker version 24.0.7
- Docker Image: vllm/vllm-openai:v0.8.2
- Cuda information:
- nvidia-smi: "CUDA Version: 12.2
- nvcc --version: Cuda compilation tools, release 12.8
 
How would you like to use vllm
Hello and thank you for this awesome tool!
Background:
- 
My goal is to get Gemma3-27B running on a completely offline Nvidia DGX-2 GPU cluster (16x32GB V100 GPUs = 512GB VRAM) using vLLM's v0.8.2 Docker Image. 
- 
The smaller Gemma3-1B on just one of the GPUs runs perfectly with no problems 👍 docker run -d --name vLLM-Gemma3-1B --runtime nvidia \ --gpus='"device=10"' \ -v /raid/models/google/:/root/.cache/huggingface \ -p 8001:8000 \ --ipc=host \ --restart=unless-stopped \ offline-image-repo:8180/vllm-openai:v0.8.2 \ --model /root/.cache/huggingface/gemma-3-1b-it \ --dtype float16 \ --served-model-name google/gemma-3-1b-it
The Problem
- 
I can get the Gemma3-27B container start and run with no errors reported in the docker logs, everything seems good..docker run -d --name vLLM-Gemma3-27B --runtime nvidia \ --gpus='"device=0,1,2,3"' \ -v /raid/models/google:/root/.cache/huggingface \ -p 8000:8000 \ --ipc=host \ --restart=unless-stopped \ offline-image-repo:8180/vllm-openai:v0.8.2 \ --model /root/.cache/huggingface/gemma-3-27b-it \ --dtype float16 \ --served-model-name google/gemma-3-27b-it \ --max-model-len 5000 \ --tensor-parallel-size 4
- 
When I attempt a simple inference request I see: - the server receives the request 👍
- the GPU utilization spike as it is performing inference 👍
- "Inference" (i.e. the spike in GPU utilization) runs for much longer than expected, around 2 minutes instead of seconds 😕
- The logs seem to show everything went fine
- The inference response I receive looks normal except for the "content" is just an empty string 👎
 
 
Im stumped as to what is happening here, do you have any suggestions?
A side note if it helps: I was able to get vLLM Docker + Gemma3-27B running across 2 GPUs on a different GPU cluster (lonovo HGX - 4x80GB H100 GPUs) and it works fantastic. This GPU cluster is obviously a couple generations older, but was hoping to get this running on both.
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.