[Misc] Enable V1 LoRA by default #15320
Conversation
|
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can add the ready label 🚀 |
|
@robertgshaw2-redhat @jeejeelee please take a look! Thanks 🙌 |
|
V0 vs V1 (serving default args)
V0 vs V1 (max_num_seqs = 256, max_num_batched_tokens=4096 - V0 default args)
I have updated the benchmark link in this PR to the "max_num_seqs=256, max_num_batched_tokens=4096" comparison; I'll follow up on the default-args question in another PR / thread. |
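For reference, a minimal sketch of constructing an engine with that configuration and LoRA enabled (an offline `LLM` instance standing in for the serving setup; the model name is a placeholder, not necessarily the one used in the benchmark):

```python
from vllm import LLM

# Minimal sketch: mirror the "max_num_seqs=256, max_num_batched_tokens=4096"
# benchmark configuration with LoRA enabled. The model name is a placeholder.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    enable_lora=True,
    max_num_seqs=256,
    max_num_batched_tokens=4096,
)
```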
LGTM!
|
Failing test: test_long_context.py fails with V1. For the test to work we need #14241. TL;DR: this PR is waiting for #15503 to land. |
Force-pushed from 323f5c0 to a67b58e.
tests/lora/test_baichuan.py
Cosmetic change: moved the run_with_both_engines_lora fixture above the utility functions.
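For context, a minimal sketch of the fixture pattern these test modules use (assuming run_with_both_engines_lora is provided by the LoRA tests' conftest and re-runs each test under both the V0 and V1 engines; the wrapper name here is illustrative):

```python
import pytest


# Illustrative placement: this autouse wrapper sits near the top of the test
# module, above the utility functions, so every test in the file is exercised
# under both engines via the conftest-provided fixture.
@pytest.fixture(autouse=True)
def run_both_engines(run_with_both_engines_lora):
    # run_with_both_engines_lora (assumed to come from conftest) handles the
    # engine switching; nothing else is needed here.
    pass
```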
tests/lora/test_chatglm3_tp.py
Cosmetic change: moved the run_with_both_engines_lora fixture above the utility functions.
tests/lora/test_gemma.py
I am checking if the tests pass with latest main 🤞
The CI still OOMs; turning the test off for now.
tests/lora/test_layers.py
This has been a no-op since #14685 landed, since both V0 and V1 use the same kernels.
tests/lora/test_llama_tp.py
I believe these numeric test failures are resolved. This was added in #13726, and the failure was due to the "can't find tokenizer file" error. Discussion on Slack: https://vllm-dev.slack.com/archives/C07R5PAL2L9/p1742010357759669
tests/lora/test_transfomers_model.py
This worked in local testing; turning it on to see if it passes on CI.
|
@jeejeelee - I missed some changes to the LoRA tests.
fyi @WoosukKwon Sorry, I am re-requesting review on the PR as the changes to the tests are somewhat significant, so we don't merge it by mistake. I'll keep an eye on the tests. Thanks 🙌 |
Ah, sorry, I missed it. Created a separate PR for this here: #15558 |
|
Test times on this PR vs. test times on main: the increase in TP test times is due to un-skipping some tests. fyi @jeejeelee |
|
I think we can consider merging now. I plan to try optimizing the LoRA test time next week. |
|
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from 3915e39 to 38329bd.
|
I am looking at what seems to be valid |
Head branch was pushed to by a user without write access
Force-pushed from f13f588 to a4132a8.
Force-pushed from 3fffaa1 to 97773e0.
|
@varun-sundar-rabindranath Hi Varun, I'm trying to reproduce the results in your Google sheet but I'm seeing similar performance for adapters of different rank. Do you have an idea what may be going wrong? If possible, can you share the exact methodology that you used to measure the numbers you quoted in the sheet? Here are my measurements for 1 QPS. I see the same trend in both V1 and V0. This is a little time-sensitive as I'm chasing a deadline. Any hints would be extremely helpful. Thank you. |
Hi @shashwatj07, the methodology is described in the first sheet of the Google Sheets file. Can you share the command you are using to benchmark? IIRC, the numbers on the sheet are with 1 GPU. At TP 8, it may be that the LoRA matmuls are already too small and ranks 8 vs 128 don't make much of a difference. Can you try:
I think these experiments should shed some light on what is going on. Thanks. |
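A hypothetical micro-benchmark along these lines (not the script referenced in the spreadsheet; the model name, adapter paths, and the single-token TTFT proxy are all assumptions for illustration):

```python
import time

from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Compare latency for adapters of different rank on a single GPU (TP=1).
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    enable_lora=True,
    max_lora_rank=64,
    tensor_parallel_size=1,
)
params = SamplingParams(max_tokens=1)  # one decoded token as a rough TTFT proxy

adapters = {"rank8": "/path/to/rank8-adapter", "rank64": "/path/to/rank64-adapter"}
for i, (name, path) in enumerate(adapters.items(), start=1):
    req = LoRARequest(lora_name=name, lora_int_id=i, lora_path=path)
    start = time.perf_counter()
    llm.generate(["Summarize vLLM in one sentence."], params, lora_request=req)
    print(f"{name}: {time.perf_counter() - start:.3f}s")
```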
|
Thanks for your prompt reply @varun-sundar-rabindranath
I use a simple script. Here.
I see the point. However, I still think getting the exact same TTFT isn't reasonable. I tried running a Llama 3.1 8B on a single A100-40GB and the results are consistent with what I stated earlier.
This is faster as expected.
I will try this and let you know. It may take a while to run. However, I feel something is off that doesn't concern TP. Just to note, I'm not using FlashInfer. Not sure if that's relevant. The server crashes when I use it. Also, what do you think? |
|
Hi @shashwatj07 - Thanks for the information. My benchmarking script is at https://drive.google.com/file/d/14Z4TInncSDALi0Lxcr_gMwRv1JXtthtW/view (this information is on the spreadsheet). I can try out your script to see what is happening, but I am traveling for the next few days and will only be able to get to it next week.
Thanks for reporting the crash with FlashInfer. Could you make a GitHub issue for it, please? cc: @jeejeelee
|
Hi @varun-sundar-rabindranath, I finally figured it out. The runtime is dictated by the max_lora_rank flag set at engine initialization rather than by the rank of the adapter actually served: with a small value the latency is low, and with a large value it is high regardless of the adapter's actual rank. I believe we should let the maximum-rank adapter being used in the batch be the bottleneck, and not the engine initialization parameters (like it's the case in S-LoRA). What do you think? Is there a hack to achieve this in the current codebase? (This would be helpful for the deadline I'm chasing.) |
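To illustrate the behavior being described, a minimal sketch of the relevant engine argument (assuming the flag in question is max_lora_rank; the model and values are placeholders):

```python
from vllm import LLM

# Sketch only: LoRA buffers/kernels appear to be sized for max_lora_rank at
# engine construction, so (per the observation above) a low-rank adapter pays
# roughly the same cost as a rank-128 adapter when max_lora_rank=128.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder
    enable_lora=True,
    max_loras=1,
    max_lora_rank=128,  # engine-wide padded rank, set once at init
)
```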
|
Hi @varun-sundar-rabindranath |
fyi @mgoin |
|
Hi @shashwatj07 - Thank you for the detailed experiments.
I agree, and I think we should be able to accommodate it for prefill at least (i.e. no CUDA graphs). However, I don't think there is a way to do this without a code change. I'll take a stab at it today. |
Thank you so much @varun-sundar-rabindranath! This will be very helpful.
I wasn't able to reproduce it for some reason. I'll try to figure it out again and open an issue if I get it. |
@varun-sundar-rabindranath @jeejeelee @mgoin |
|
@varun-sundar-rabindranath Hi Varun, any luck with this? |
|
@varun-sundar-rabindranath just wanted to ask if you got a chance to look at this. |
|
Hi @shashwatj07 - Sorry, I haven't been able to look into this. Can you create a GitHub issue for this task, please, so we can request help from the community? Thanks.
Enable V1 LoRA by default
Benchmarks:
Link - please check the sheets "V0 vs V1 - rank 8" and "V0 vs V1 - rank 64". I see improved TTFT and TPOT in V1.
Tests status: link