[Usage]: Qwen2VL model mrope implementation in cuda graph #9546

@gujiewen

Anything you want to discuss about vllm.

In Qwen2-VL's M-RoPE implementation, vLLM decides at runtime whether the input positions are multimodal, using this check:
image

So when the input is text-only, the positions tensor has shape (seqlen).
However, vLLM's CUDA graph uses a positions tensor of shape (3, seqlen).
image

Does that mean we cannot use CUDA graphs for Qwen2-VL with text-only input? Otherwise the positions tensor has shape (seqlen), but the CUDA graph treats it as (3, seqlen).

However, in my tests the final results show no difference between CUDA graph and eager mode with text-only input, and I am wondering why.
PS: I profiled the whole process with nsys, and the CUDA graph run does launch two more kernels than eager mode.
Left is CUDA graph, right is eager.
image
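
One possible reason for the identical results, written as a toy sketch and not as vLLM's actual code: if the text-only positions end up replicated across the three (t, h, w) rows, then M-RoPE reduces to ordinary 1D RoPE. Whether the CUDA graph path really fills its (3, seqlen) buffer this way is an assumption I have not verified. The function names, the tiny `rotary_dim`, and the `sections` split below are all made up for illustration.

```python
import math

def rope_angles(positions, rotary_dim=8, base=10000.0):
    """Ordinary 1D RoPE: one rotation angle per (token, frequency slot)."""
    half = rotary_dim // 2
    inv_freq = [base ** (-2.0 * i / rotary_dim) for i in range(half)]
    return [[p * f for f in inv_freq] for p in positions]

def mrope_angles(positions_3d, sections=(1, 1, 2), rotary_dim=8, base=10000.0):
    """M-RoPE-style angles: the frequency slots are split into (t, h, w)
    sections, and each section takes its position from the matching row.
    The section sizes here are hypothetical, chosen to sum to rotary_dim // 2."""
    assert sum(sections) == rotary_dim // 2
    per_row = [rope_angles(row, rotary_dim, base) for row in positions_3d]
    # row_of_slot[j] is the row (0=t, 1=h, 2=w) that owns frequency slot j
    row_of_slot = [r for r, n in enumerate(sections) for _ in range(n)]
    seqlen = len(positions_3d[0])
    return [[per_row[row_of_slot[j]][tok][j] for j in range(len(row_of_slot))]
            for tok in range(seqlen)]

# Text-only input: positions of shape (seqlen,)
text_positions = list(range(5))
# The same positions replicated into shape (3, seqlen)
broadcast = [text_positions] * 3
same = mrope_angles(broadcast) == rope_angles(text_positions)

# Image-like input where the h and w rows differ from the t row
image_like = [[0, 0, 0, 0, 0], [0, 0, 1, 1, 2], [0, 1, 0, 1, 0]]
differs = mrope_angles(image_like) != rope_angles(text_positions)

print(same, differs)
```

If that assumption holds, the (3, seqlen) graph would produce the same rotary angles as the (seqlen) eager path for text, and the two extra kernels might simply be the copy or split work on the larger positions buffer.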

Labels: misc, stale (over 90 days of inactivity)