[Model] enable data parallel for InternVL vision encoder #23909
Conversation
Code Review
This pull request introduces data parallelism for the InternVL vision encoder, a valuable optimization for multi-modal models. The implementation correctly uses a `use_data_parallel` flag to switch between tensor-parallel and replicated linear layers within the vision transformer's attention and MLP blocks. The forward pass for the data-parallel mode is handled by `run_dp_sharded_vision_model`, which correctly shards the input batch across the available GPUs. The changes are well-contained, follow existing design patterns within the vLLM codebase, and include the necessary updates to the model configuration and documentation. The implementation appears solid and I have no major concerns.
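For readers skimming the thread, here is a minimal sketch of the pattern the review describes (illustrative only, not the PR's actual code; the class name `VisionAttentionSketch` is made up, while the vLLM linear layers are real):

```python
import torch.nn as nn

from vllm.model_executor.layers.linear import (QKVParallelLinear,
                                               ReplicatedLinear,
                                               RowParallelLinear)


class VisionAttentionSketch(nn.Module):
    """Illustrative only: a use_data_parallel flag picks replicated vs.
    tensor-parallel linear layers inside the vision encoder."""

    def __init__(self, hidden_size: int, num_heads: int,
                 use_data_parallel: bool = False):
        super().__init__()
        head_dim = hidden_size // num_heads
        if use_data_parallel:
            # DP mode: weights are fully replicated on every rank, so each
            # rank can run the whole vision encoder on its own shard of the
            # image batch.
            self.qkv = ReplicatedLinear(hidden_size, 3 * hidden_size)
            self.proj = ReplicatedLinear(hidden_size, hidden_size)
        else:
            # TP mode: weights stay sharded across the TP ranks as before.
            self.qkv = QKVParallelLinear(hidden_size, head_dim, num_heads)
            self.proj = RowParallelLinear(hidden_size, hidden_size)
```

In DP mode the image batch itself is split across the TP group before the encoder runs and the outputs are gathered afterwards, which is what `run_dp_sharded_vision_model` takes care of.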
Signed-off-by: Yiwen Chen <[email protected]>
Signed-off-by: Yiwen Chen <[email protected]>
Signed-off-by: Yiwen Chen <[email protected]>
Thanks, can you attach benchmark results to show the performance improvement?
Signed-off-by: Yiwen Chen <[email protected]>
Signed-off-by: Yiwen Chen <[email protected]>
Hi @DarkLight1337 I tested with InternVL2-1B and dp size 2 but did not see a significant improvement. Should I try a larger dp size? Let me know if you have any insights :)
Test Plan
TP:
DP:
Benchmark:
Test Result
TP:
DP:
@666even666 Could you reduce the `max-concurrency`?
I lowered `max-concurrency` to 32 and here are the results.
TP (baseline):
DP:
Total Token throughput slightly improved. I wonder if it is because a tp size of 2 is too small to see any significant improvement?
Here is my benchmark result and I did not see a significant improvement either. It's a little weird that the
DP:
TP:
When decreasing the maximum request concurrency to 8:
DP:
TP:
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: YiwenC <[email protected]>
@ZJY0516 Thanks a lot for running the test!! This looks much better than what I had with dp size of 2. Looks like the performance improvement is aligned with other multimodal models (#23168). @DarkLight1337 If the changes look good to you, we can probably merge it? BTW @ZJY0516 what's the dp size you are using?
I will try to profile this today.
Benchmark Settings
Result
encoder dp
normal
The profile result is a little weird. According to the profile data below, encoder dp mode has better performance in the language model forward. Additionally, my benchmark results indicate that encoder DP mode leads to an end-to-end speedup (i.e., a decrease in benchmark duration), which contradicts the findings from @666even666. Could you verify it? @666even666
vision model part
dp
tp
language part
dp
tp
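For reference, one way such a per-component comparison can be produced is with `torch.profiler` around a short generation run (a generic sketch, not necessarily the setup used above; the checkpoint and prompt are placeholders):

```python
from torch.profiler import ProfilerActivity, profile

from vllm import LLM, SamplingParams

# Placeholder checkpoint; any InternVL variant works the same way.
llm = LLM(model="OpenGVLab/InternVL2-1B", trust_remote_code=True)
params = SamplingParams(max_tokens=64)

# Note: this captures in-process execution; multi-GPU worker processes may
# need vLLM's own profiling hooks instead.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    llm.generate(["Describe the image."], params)

# Aggregate operator time so vision-encoder and language-model kernels can be
# compared between an encoder-DP run and an encoder-TP run.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=30))
```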
Can you try increasing the number of prompts? Usually I use 500 or 1000. I think 128 might not be enough to get reliable results.
The profile data will be quite large, but I'll try tomorrow.
Random-mm dataset: each request has 3 images.
command
dp
tp
However, shouldn't the execution time of each operator remain consistent regardless of the number of prompts? The benchmark results I obtained with 128 prompts are similar to those from 1024 prompts.
From the results @ZJY0516, this is actually not related to the scheduler policy, but more about the benchmark script. If you don't set I suggest running this benchmark without
You are right.
encoder dp
encoder tp
Oh, I usually just set infinite QPS for the benchmark.
@666even666 Could you run some eval benchmarks to make sure this change does not affect model quality?
I am also implementing data parallelism for InternVL. My code is similar to yours, but I found that your code seems to be missing one piece, namely this processing
Interesting! Thanks for letting us know. I put the same logic under forward() of InternVisionModel (https://github.com/vllm-project/vllm/pull/23909/files#diff-2df945bfeea35cb93797dc3b4f8a393a27b06de6d60e2f9b0334920d40721e10R506), which should be executed every time self.vision_model() is called. @DarkLight1337 @ZJY0516 Do you think this could cause the performance difference?
I tested InternVL3-8B on L20 with reference to mistral-evals (https://github.com/mistralai/mistral-evals), and this is my result:
---------tp2--------------
Can you run the performance benchmark with infinite QPS and show the output as well?
Sorry for the delay! Here is the result of InternVL3-8B tested against chartqa
Hello, did you get this result by running that program file?
Yeah, the numbers are exactly the same. I will double check today.
Signed-off-by: Yiwen Chen <[email protected]>
Head branch was pushed to by a user without write access
Signed-off-by: Yiwen Chen <[email protected]>
Sorry for the delay, let me merge this.
Purpose
This PR enables DP for the InternVL vision encoder.
FIX #23876
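For context, a hedged usage sketch of how this could be enabled from the offline API (assuming InternVL picks up the same `mm_encoder_tp_mode` convention used by other DP-ViT models in vLLM; the model name is just an example):

```python
from vllm import LLM

# Assumption: with this PR, InternVL honors mm_encoder_tp_mode="data", which
# replicates the vision encoder weights and shards the image batch across the
# TP ranks, instead of sharding the encoder weights ("weights", the default).
llm = LLM(
    model="OpenGVLab/InternVL3-8B",
    trust_remote_code=True,
    tensor_parallel_size=2,
    mm_encoder_tp_mode="data",
)
```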
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
`supported_models.md` and `examples` for a new model.