Closed as not planned
Anything you want to discuss about vllm.
In qwen2vl's mrope implementation, vLLM decides at RUNTIME whether the input positions are multimodal. So when the input is text-only, the input positions have shape (seqlen).
However, vLLM's CUDA graph uses positions with shape (3, seqlen).

Does that mean we cannot use CUDA graph for qwen2vl with text-only input? Otherwise we pass positions of shape (seqlen), but the CUDA graph treats them as (3, seqlen).
However, I ran some tests, and there seems to be no difference in the final results between CUDA graph and eager mode with text-only input, so I am wondering why.
PS: I used nsys to profile the whole process, and the CUDA graph run DOES have two more kernels than eager mode.
Left is cuda-graph, right is eager.
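
To make the shape question concrete, here is a minimal pure-Python sketch of one possible explanation. It is an assumption on my side, not a confirmed description of vLLM's code path: if text-only positions are expanded into three identical rows before the graph runs, the M-RoPE angles come out exactly the same as ordinary 1D RoPE angles. The helper names (`rope_angles`, `mrope_angles`, `expand_text_positions`) and the toy `mrope_section=[2, 3, 3]` are hypothetical and only illustrate the idea.

```python
def inverse_frequencies(rotary_dim, base=10000.0):
    """Per-slot inverse frequencies used by RoPE (rotary_dim // 2 slots)."""
    return [base ** (-2.0 * i / rotary_dim) for i in range(rotary_dim // 2)]


def rope_angles(positions, rotary_dim, base=10000.0):
    """1D RoPE: positions has shape (seqlen); returns (seqlen, rotary_dim // 2)."""
    inv_freq = inverse_frequencies(rotary_dim, base)
    return [[p * f for f in inv_freq] for p in positions]


def mrope_angles(positions_3d, rotary_dim, mrope_section, base=10000.0):
    """M-RoPE: positions_3d has shape (3, seqlen) with rows (t, h, w).

    Each frequency slot reads its position from the row whose section
    contains that slot. The contiguous section layout is an assumption
    made for this sketch.
    """
    inv_freq = inverse_frequencies(rotary_dim, base)
    assert sum(mrope_section) == len(inv_freq)
    # Map every frequency slot to the row (0=t, 1=h, 2=w) that drives it.
    row_of_slot = [row for row, width in enumerate(mrope_section)
                   for _ in range(width)]
    seqlen = len(positions_3d[0])
    return [[positions_3d[row_of_slot[i]][t] * inv_freq[i]
             for i in range(len(inv_freq))]
            for t in range(seqlen)]


def expand_text_positions(positions):
    """Hypothetical helper: copy (seqlen) text positions into (3, seqlen)."""
    return [list(positions) for _ in range(3)]


text_positions = list(range(6))
angles_1d = rope_angles(text_positions, rotary_dim=16)
angles_3d = mrope_angles(expand_text_positions(text_positions),
                         rotary_dim=16, mrope_section=[2, 3, 3])
# Identical rows make every section read the same position value,
# so the two angle tables match element for element.
assert angles_1d == angles_3d
```

If something like this expansion happens on the CUDA graph path, it would be consistent with both observations: the outputs match eager mode, and a couple of extra kernels show up in the profile. I have not verified that this is what those kernels are.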
