[Core] Cast multimodal input in hf processor #18862
Conversation
Overall this approach does make more sense than what I originally did in #18756, thanks!
Force-pushed from b6d42a2 to a25576c
Can you fix the failing tests? Looks like you need to import
Head branch was pushed to by a user without write access
Force-pushed from e61670b to 0292449
Ah sorry, should be fixed in 02924492824112f2f6d43f247ba70c91059d9989. Looks like the type annotation needs to be a string to support older versions of Python.
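For reference, a minimal sketch of the string-annotation workaround (hypothetical function, not the exact change): quoting the annotation keeps the `X | Y` union syntax importable on Python versions older than 3.10, since the string is never evaluated eagerly.

```python
import torch

def _to_dtype(value: "torch.Tensor | None", dtype: torch.dtype) -> "torch.Tensor | None":
    # The quoted annotation is a forward reference, so it parses fine on older Python.
    return value.to(dtype) if value is not None else value

# Alternatively, `from __future__ import annotations` at the top of the module
# defers evaluation of every annotation in the file.
```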
Head branch was pushed to by a user without write access
Looks like not all items in the dict are tensors, which breaks CI. 9d0d47c58009ef4dd646daa8a2955280406f65f7 should fix that.
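A minimal sketch of that guard (hypothetical helper, assuming the processor output is a dict-like mapping): only floating-point tensors get cast, everything else is passed through unchanged.

```python
import torch

def cast_floating_tensors(data: dict, dtype: torch.dtype) -> dict:
    # Only cast floating-point tensors; leave ints, lists, strings, etc. as-is.
    return {
        key: value.to(dtype)
        if isinstance(value, torch.Tensor) and value.is_floating_point()
        else value
        for key, value in data.items()
    }
```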
Looks like the V1 test is hanging in this PR, can you investigate it?
Force-pushed from 9d0d47c to b48017f
cc @njhill any idea?
The tokenizers warning is a red herring and shouldn't be an issue. I don't think we should change the mp method in the test as a workaround. If you can repro locally, you could check where things may be stuck by running with the env var
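For anyone reproducing this locally, a generic way to get stack dumps from a hung process (plain stdlib `faulthandler`, not necessarily the vLLM env var referenced above):

```python
import faulthandler
import signal

# Register a handler so that sending SIGUSR1 to the process dumps the
# tracebacks of all threads to stderr.
faulthandler.register(signal.SIGUSR1, all_threads=True)

# Then, from another shell:  kill -USR1 <pid>
```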
Here are the dumps from both processes: |
Force-pushed from b1344b4 to 5d5c4b4
Looks like it's stuck in transformers here: huggingface/transformers@a31fa21#diff-d3478155ac25ae1107d16a4464001bfd54770bf12cd9bab881233a1b4d216e3fR240 🤔 |
Yes, I also checked with transformers v4.51.3, which is before this change, and it still gets stuck.
Force-pushed from 5d5c4b4 to 04779ea
It seems like not calling
Force-pushed from 04779ea to bf31e39
Looks like I'm now running into #16054 on CI. Rebased onto main to re-trigger CI.
Nice, thanks for looking into the deadlock problem!
@lgeiger
@vadiklyutiy I'm not sure I fully understand, could you elaborate? For context, this PR is a follow-up to #18756. In #18756 the dtype conversion has already been removed from DeepseekVL2 and Gemma3. These are now unnecessary, as the multimodal input will already have the correct dtype when passed to
For Qwen2.5-VL in
I guess the assumption is that the image should be converted in
This is a follow-up to #18756 that instead performs the multimodal input casting directly as part of the Hugging Face preprocessing.
I think this is a bit cleaner and has two advantages: it moves the blocking cast/copy off the main thread, and it performs the conversion before serialisation, which also reduces the amount of data that needs to be serialised and deserialised. @DarkLight1337 Let me know if you see any disadvantages of doing this.
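As a rough illustration of the serialisation argument (hypothetical shape and dtypes, chosen only to show the effect of casting before the data crosses the process boundary):

```python
import torch

# A single fp32 image tensor as produced by a typical HF processor.
pixel_values = torch.randn(1, 3, 336, 336, dtype=torch.float32)
print(pixel_values.numel() * pixel_values.element_size())  # ~1.35 MB per image

# Casting to the model dtype before serialisation roughly halves the payload.
casted = pixel_values.to(torch.bfloat16)
print(casted.numel() * casted.element_size())  # ~0.68 MB per image
```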
Before: (screenshot)
After: (screenshot)
I've done some quick benchmarks, and for some models I'm seeing a very small improvement in throughput (0.5%-0.7%).