[Bug]: Wrongly reuse KV, for V1 PD disaggregation with multimodal input #21175

@herotai214

Description

Your current environment

/

🐛 Describe the bug

I use the model Qwen2.5-VL-3B-Instruct for V1 PD disaggregation with 1 image input.
Everything is fine until I send 2 different requests with an identical text prompt but different input images of the same image size. vLLM wrongly reuses the KV for the later request, which causes it to return an output identical to the previous request's.

I use the SharedStorageConnector to facilitate the V1 PD disagg, but I guess there is no design/implementation for this issue yet for other connectors either?

I tried these requests in a single-instance (PD mix) case and in a 1P1D disaggregation case in V1.
The outputs in PD disagg are similar/identical to the outputs from PD mix: the prefiller encodes the mm_input, then prefills, and the decoder can successfully load the KV stored by the prefiller and give decent results. I think we can assume that V1 PD disagg with mm input works fine in general, except for the bug situation below:

# prompt:
#  {"type": "text", "text": "Repeat all below: What is following the sequence 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 49 51 53 55 57 59 61 63 65 67 69 71? Repeat the whole sequence; What is in the image? Write in around 100 words; Lastly, End your answer with the word `BYE`"},

I use the exact same text prompt for every following request.

There are 3 images: Image A (1200x675), Image B (1200x900), Image C (1200x900)

Results are fine for the 1st (Image A) and 2nd (Image B) requests, where the image sizes are different;
2 different KV cache folders are generated under the shared_storage_path.

Request 1 with Image A:

Image1

Request 2 with Image B:

Image2

Request 3 with Image C (Abnormal):

However, for the 3rd request (Image C), since the image size is also 1200x900, it reuses the KV cached from the 2nd request and gives output identical to the 2nd request's, even though Image B and Image C are entirely different images. No new KV cache folder is generated under the shared_storage_path.
Image3

Request 4 with Image C, after emptying shared_storage_path (Abnormal):

This one is even weirder: I emptied the shared_storage_path and sent the request with Image C again, but the output is still identical to Image B's.
I need to restart the PD instances and empty the shared_storage_path to get the proper output for Image C.
Image4

This is likely because the key for KV transfer depends only on prompt_token_ids, and the number of placeholder tokens in it is the same for different images of the same size. I think this issue happens not only in SharedStorageConnector but in V1 PD disagg in general, since I didn't find any design specifically for multimodal V1 PD disagg.
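To illustrate the suspected collision, here is a minimal Python sketch. It does not use vLLM's actual connector code; the helper names, placeholder token id, and placeholder count are all illustrative stand-ins for how a key derived only from prompt_token_ids collides for two different images of the same size, and how folding a content hash of the multimodal input into the key would avoid that.

```python
import hashlib

def kv_key_tokens_only(prompt_token_ids):
    # Simplified stand-in for a cache key derived only from the prompt
    # token IDs (the behavior this report suspects).
    return hashlib.sha256(str(prompt_token_ids).encode()).hexdigest()

# Two different 1200x900 images expand to the SAME number of image
# placeholder tokens, so with identical text the token sequences match.
PLACEHOLDER = 151655          # illustrative image-pad token id
text_tokens = [101, 2023, 2003, 1037, 3231]
prompt_b = text_tokens + [PLACEHOLDER] * 1196  # Image B, 1200x900
prompt_c = text_tokens + [PLACEHOLDER] * 1196  # Image C, also 1200x900

# Collision: different images, identical cache key.
assert kv_key_tokens_only(prompt_b) == kv_key_tokens_only(prompt_c)

def kv_key_with_mm_hash(prompt_token_ids, mm_hashes):
    # One possible fix (hypothetical): also hash the content of each
    # multimodal input, so different images never share a key.
    h = hashlib.sha256(str(prompt_token_ids).encode())
    for mm_hash in mm_hashes:
        h.update(mm_hash.encode())
    return h.hexdigest()

hash_b = hashlib.sha256(b"image-b-bytes").hexdigest()
hash_c = hashlib.sha256(b"image-c-bytes").hexdigest()
assert kv_key_with_mm_hash(prompt_b, [hash_b]) != kv_key_with_mm_hash(prompt_c, [hash_c])
```

This also matches the observed behavior of Request 4: as long as the key ignores image content, restarting is the only way to invalidate the stale match.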

To reproduce:

vllm version: v0.9.2

To start the PD instances:

MODEL_NAME=models/Qwen2.5-VL-3B-Instruct/   # model path here

# a function that waits for a vLLM server to start
wait_for_server() {
  local port=$1
  timeout 12000 bash -c "
    until curl -s localhost:${port}/v1/chat/completions > /dev/null; do
      sleep 1
    done" && return 0 || return 1
}

# prefilling instance, which is the KV producer
CUDA_VISIBLE_DEVICES=0 vllm serve $MODEL_NAME \
    --port 7104 \
    --gpu-memory-utilization 0.8 \
    --trust-remote-code \
    --kv-transfer-config \
    '{"kv_connector":"SharedStorageConnector","kv_role":"kv_producer","kv_rank":0,"kv_parallel_size":2,"kv_connector_extra_config": {"shared_storage_path": "/workspace/test/kv"}}' &

# decoding instance, which is the KV consumer
CUDA_VISIBLE_DEVICES=1 vllm serve $MODEL_NAME \
    --port 7204 \
    --gpu-memory-utilization 0.8 \
    --trust-remote-code \
    --kv-transfer-config \
    '{"kv_connector":"SharedStorageConnector","kv_role":"kv_consumer","kv_rank":1,"kv_parallel_size":2,"kv_connector_extra_config": {"shared_storage_path": "/workspace/test/kv"}}'  &

# wait until prefill and decode instances are ready
wait_for_server 7104
wait_for_server 7204

# launch the proxy server
python3 examples/online_serving/disaggregated_serving/disagg_proxy_demo.py  \
    --model $MODEL_NAME  \
    --prefill localhost:7104   \
    --decode localhost:7204   \
    --port 9774
# prompt:
#  {"type": "text", "text": "Repeat all below: What is following the sequence 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 49 51 53 55 57 59 61 63 65 67 69 71? Repeat the whole sequence; What is in the image? Write in around 100 words; Lastly, End your answer with the word `BYE`"},

and send the prompt together with the images, one by one.
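For completeness, a client sketch for sending those requests through the proxy. The image filenames are placeholders, the prompt is abbreviated (use the full prompt from above), and the payload shape is the standard OpenAI-compatible chat-completions format that vLLM serves; adjust paths to your setup.

```python
import base64
import requests

PROXY = "http://localhost:9774/v1/chat/completions"
MODEL = "models/Qwen2.5-VL-3B-Instruct/"
PROMPT = ("Repeat all below: ... What is in the image? Write in around "
          "100 words; Lastly, End your answer with the word `BYE`")

def build_payload(image_path):
    # Encode the image as a base64 data URL alongside the fixed text prompt.
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return {
        "model": MODEL,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": PROMPT},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
        "max_tokens": 512,
    }

def ask(image_path):
    return requests.post(PROXY, json=build_payload(image_path), timeout=600).json()

# Send the same prompt with each image, one by one (filenames illustrative).
# for path in ["image_a.jpg", "image_b.jpg", "image_c.jpg"]:
#     print(ask(path)["choices"][0]["message"]["content"])
```

With this loop, requests 2 and 3 (both 1200x900) return identical text even though the images differ, reproducing the bug.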

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

Labels: bug (Something isn't working)