Description
Your current environment
- Hardware: Nvidia DGX-2 - 16x32GB V100 GPUs
- Ubuntu 20.04.6
- Docker version 24.0.7
- Docker Image: vllm/vllm-openai:v0.8.2
- CUDA information:
  - nvidia-smi: CUDA Version: 12.2
  - nvcc --version: Cuda compilation tools, release 12.8
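For reference, the device generation can be double-checked from inside the running container with something like the one-liner below (purely illustrative; the container name is the one from my commands further down). The V100s report compute capability (7, 0), i.e. no native bfloat16, which is why I pass --dtype float16.

docker exec vLLM-Gemma3-27B python3 -c \
  "import torch; print(torch.cuda.get_device_name(0), torch.cuda.get_device_capability(0))"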
How would you like to use vllm
Hello and thank you for this awesome tool!
Background:
- My goal is to get Gemma3-27B running on a completely offline Nvidia DGX-2 GPU cluster (16x32GB V100 GPUs = 512GB VRAM) using vLLM's v0.8.2 Docker Image.
- The smaller Gemma3-1B runs perfectly on just one of the GPUs, with no problems 👍

docker run -d --name vLLM-Gemma3-1B --runtime nvidia \
  --gpus='"device=10"' \
  -v /raid/models/google/:/root/.cache/huggingface \
  -p 8001:8000 \
  --ipc=host \
  --restart=unless-stopped \
  offline-image-repo:8180/vllm-openai:v0.8.2 \
  --model /root/.cache/huggingface/gemma-3-1b-it \
  --dtype float16 \
  --served-model-name google/gemma-3-1b-it
The Problem
- I can get the Gemma3-27B container to start and run with no errors reported in the docker logs; everything seems good.

docker run -d --name vLLM-Gemma3-27B --runtime nvidia \
  --gpus='"device=0,1,2,3"' \
  -v /raid/models/google:/root/.cache/huggingface \
  -p 8000:8000 \
  --ipc=host \
  --restart=unless-stopped \
  offline-image-repo:8180/vllm-openai:v0.8.2 \
  --model /root/.cache/huggingface/gemma-3-27b-it \
  --dtype float16 \
  --served-model-name google/gemma-3-27b-it \
  --max-model-len 5000 \
  --tensor-parallel-size 4
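For reference, watching the logs with something along these lines (the grep pattern is purely illustrative) turns up nothing suspicious:

docker logs -f vLLM-Gemma3-27B 2>&1 | grep -iE "error|warn|traceback"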
When I attempt a simple inference request (a representative example is shown after this list), I see:
- the server receives the request 👍
- GPU utilization spikes as it performs inference 👍
- the "inference" (i.e. the spike in GPU utilization) runs for much longer than expected, around 2 minutes instead of seconds 😕
- the logs seem to show everything went fine
- the inference response I receive looks normal, except that "content" is just an empty string 👎
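A request along these lines is representative of what I'm sending (the exact prompt and max_tokens are just illustrative):

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "google/gemma-3-27b-it",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 100
      }'

The response structure itself looks normal; only the "content" field of the returned message comes back as an empty string.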
I'm stumped as to what is happening here. Do you have any suggestions?
A side note, if it helps: I was able to get vLLM Docker + Gemma3-27B running across 2 GPUs on a different GPU cluster (a Lenovo HGX with 4x80GB H100 GPUs) and it works fantastically. The DGX-2 is obviously a couple of GPU generations older, but I was hoping to get this running on both.
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.