Conversation

@varun-sundar-rabindranath
Contributor

@varun-sundar-rabindranath varun-sundar-rabindranath commented Mar 22, 2025

Enable V1 LoRA by default

Benchmarks:
Link - please check the sheets "V0 vs V1 - rank 8" and "V0 vs V1 - rank 64".
I see improved TTFT and TPOT in V1.

Tests status: link

@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the v1 label Mar 22, 2025
@varun-sundar-rabindranath
Contributor Author

@robertgshaw2-redhat @jeejeelee please take a look! Thanks 🙌

@varun-sundar-rabindranath
Contributor Author

V0 vs V1 (serving default args)
V0 default args - max_num_seqs=256, max_num_batched_tokens=4096
V1 default args - max_num_seqs=1024, max_num_batched_tokens=2048
Look at the "V0 vs V1 - No LoRA" sheet. It looks like with the default args, V0 is better than V1 when the server is overloaded.

V0 vs V1 - default args

V0 vs V1 (max_num_seqs = 256, max_num_batched_tokens=4096 - V0 default args)
V1 looks better with these defaults when the server is overloaded.

V0 vs V1 - Use V0 default args

I have updated the benchmark link in this PR to the "max_num_seqs=256, max_num_batched_tokens=4096" runs, as that V0 vs V1 comparison looks better.

I'll follow up on the default args in another PR / thread.

fyi @WoosukKwon @robertgshaw2-redhat
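For readers reproducing this comparison: below is a minimal offline sketch of pinning these scheduler knobs so both engines run with identical settings. The model name is a placeholder; when serving, the same values map to the --max-num-seqs and --max-num-batched-tokens flags of vllm serve.

```python
from vllm import LLM

# Pin the scheduler knobs to the V0 serving defaults mentioned above so that
# V0 and V1 are compared with identical batching settings.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model for illustration
    enable_lora=True,
    max_num_seqs=256,
    max_num_batched_tokens=4096,
)
```

Without pinning these, the two engines schedule very different batch shapes (V1 defaults to max_num_seqs=1024, max_num_batched_tokens=2048), which is what the default-args comparison above reflects.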

Collaborator

@WoosukKwon WoosukKwon left a comment

LGTM!

@varun-sundar-rabindranath
Contributor Author

Failing test: test_long_context.py fails with V1. For the test to work we need #14241.
After speaking with @jeejeelee about this, we decided to retire this test entirely, as we suspect there aren't any users of this feature. The idea is to add this test back based on user feedback.

TL;DR: This PR is waiting for #15503 to land.

Contributor Author

Cosmetic change: moving run_with_both_engines_lora above the utility functions.
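For context, run_with_both_engines_lora is the pytest fixture that makes each LoRA test run under both engines. A rough sketch of the pattern, assuming it simply toggles VLLM_USE_V1 via monkeypatch (the actual fixture in tests/lora/conftest.py may differ):

```python
import pytest


@pytest.fixture(params=[False, True])
def run_with_both_engines_lora(request, monkeypatch):
    # Any test that requests this fixture is parametrized to run twice:
    # once against the V0 engine and once against the V1 engine.
    use_v1 = request.param
    monkeypatch.setenv("VLLM_USE_V1", "1" if use_v1 else "0")
    yield
```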

Contributor Author

Cosmetic change: moving run_with_both_engines_lora above the utility functions.

Contributor Author

I am checking if the tests pass with the latest main 🤞

Contributor Author

The CI still OOMs - turning the test off for now.

Contributor Author

This has been a no-op since #14685 landed, as both V0 and V1 use the same kernels.

Contributor Author

I believe these numeric test failures are resolved. This was added in #13726, and the failure was due to the "can't find tokenizer file" error. Discussion on Slack: https://vllm-dev.slack.com/archives/C07R5PAL2L9/p1742010357759669

Contributor Author

This worked in local testing - turning this on to see if it passes in CI.

@varun-sundar-rabindranath
Contributor Author

@jeejeelee - I missed some changes to the LoRA tests:

  • test_gemma.py
  • test_llama_tp.py
  • test_lora_manager.py
  • test_transformers_model.py

Changes to other tests are superficial. I made this doc to record the state of the tests with this PR.

fyi @WoosukKwon

Sorry, I am re-requesting review on the PR as the changes to the tests are somewhat significant, so we don't merge it by mistake. I'll keep an eye on the tests.

Thanks 🙌

@jeejeelee
Collaborator

jeejeelee commented Mar 26, 2025

https://github.com/neuralmagic/vllm/tree/varun/v1-lora-enablement/tests/lora/data and https://github.com/neuralmagic/vllm/blob/varun/v1-lora-enablement/tests/lora/conftest.py#L240-L251 should be deleted; #15503 forgot to delete them.

@varun-sundar-rabindranath
Contributor Author

Test times on this PR:
LoRA Test 1: 13m
LoRA Test 2: 23m
LoRA Test 3: 25m
LoRA Test 4: 19m
LoRA TP Test: 46m

Test times on main:
LoRA Test 1: 13m
LoRA Test 2: 22m
LoRA Test 3: 23m
LoRA Test 4: 19m
LoRA TP test: 37m

The increase in TP test time is due to un-skipping some tests (test_llama_tp.py and test_transformers_model.py) for V1.

fyi @jeejeelee

@jeejeelee
Collaborator

I think we can consider merging now. I plan to try optimizing the LoRA test time next week.

@mergify

mergify bot commented Mar 27, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @varun-sundar-rabindranath.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Mar 27, 2025
@mergify mergify bot removed the needs-rebase label Mar 27, 2025
@jeejeelee jeejeelee enabled auto-merge (squash) March 27, 2025 16:22
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Mar 27, 2025
@varun-sundar-rabindranath
Contributor Author

I am looking at what seems to be a valid entrypoints-test failure.

auto-merge was automatically disabled March 28, 2025 17:06

Head branch was pushed to by a user without write access

Varun Sundar Rabindranath added 5 commits March 31, 2025 22:55
Signed-off-by: Varun Sundar Rabindranath <[email protected]>
Signed-off-by: Varun Sundar Rabindranath <[email protected]>
Signed-off-by: Varun Sundar Rabindranath <[email protected]>
Signed-off-by: Varun Sundar Rabindranath <[email protected]>
Signed-off-by: Varun Sundar Rabindranath <[email protected]>
@jeejeelee jeejeelee merged commit 79455cf into vllm-project:main Apr 1, 2025
40 checks passed
Alex4210987 pushed a commit to LeiWang1999/vllm-bitblas that referenced this pull request Apr 5, 2025
Signed-off-by: Varun Sundar Rabindranath <[email protected]>
Co-authored-by: Varun Sundar Rabindranath <[email protected]>
Signed-off-by: xinyuxiao <[email protected]>
lulmer pushed a commit to lulmer/vllm that referenced this pull request Apr 7, 2025
Signed-off-by: Varun Sundar Rabindranath <[email protected]>
Co-authored-by: Varun Sundar Rabindranath <[email protected]>
Signed-off-by: Louis Ulmer <[email protected]>
lk-chen pushed a commit to lk-chen/vllm that referenced this pull request Apr 29, 2025
Signed-off-by: Varun Sundar Rabindranath <[email protected]>
Co-authored-by: Varun Sundar Rabindranath <[email protected]>
shreyankg pushed a commit to shreyankg/vllm that referenced this pull request May 3, 2025
Signed-off-by: Varun Sundar Rabindranath <[email protected]>
Co-authored-by: Varun Sundar Rabindranath <[email protected]>
RichardoMrMu pushed a commit to RichardoMrMu/vllm that referenced this pull request May 12, 2025
Signed-off-by: Varun Sundar Rabindranath <[email protected]>
Co-authored-by: Varun Sundar Rabindranath <[email protected]>
Signed-off-by: Mu Huai <[email protected]>
@shashwatj07

shashwatj07 commented Jul 11, 2025

@varun-sundar-rabindranath Hi Varun, I'm trying to reproduce the results in your Google sheet, but I'm seeing similar performance for adapters of different ranks. Do you have an idea of what may be going wrong? If possible, can you share the exact methodology you used to measure the numbers quoted in the sheet?

vllm serve meta-llama/Llama-3.1-70B -tp 8 --chat-template ./template.jinja --no-enable-prefix-caching --enable-lora --max-cpu-loras 8 --max-num-seqs 16 --max-loras 2 --max-lora-rank 128 --lora-modules rank-8=<path1> rank-128=<path2>

Here are my measurements for 1 QPS. I see the same trend in both V1 and V0.

name,prompt_size,decode_size,avg_ttft,avg_tbt
rank-8,10000,128,9.8747,0.1514
rank-128,10000,128,9.8766,0.1514

This is a little time-sensitive as I'm chasing a deadline. Any hints would be extremely helpful. Thank you.

@varun-sundar-rabindranath
Contributor Author

Hi @shashwatj07, the methodology is described in the first sheet of the Google Sheets file.

Can you share the command you are using to benchmark?

IIRC, the numbers in the sheet were collected on 1 GPU. At TP=8, it may be that the LoRA matmuls are already too small for rank 8 vs 128 to make much of a difference.

Can you try:

  • benchmarking TP=4
  • benchmarking with non-LoRA requests
  • benchmarking with incremental QPS (1, 10, 20, 30, 40, inf) - see the sketch after this list

I think these experiments should shed some light on what is going on. Thanks.
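For anyone reproducing this, here is a rough sketch of the kind of QPS sweep suggested above - not the script behind the spreadsheet - measuring TTFT against a running vllm serve instance through the OpenAI-compatible streaming API. The endpoint, adapter names, prompt construction, and request counts are all assumptions:

```python
import asyncio
import time

from openai import AsyncOpenAI  # pip install openai

# Assumes `vllm serve ... --lora-modules rank-8=<path1> rank-128=<path2>`
# is listening on localhost:8000.
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
PROMPT = "word " * 10000          # crude stand-in for a ~10k-token prompt
MODELS = ["rank-8", "rank-128"]   # adapter names registered via --lora-modules


async def one_request(model: str) -> float:
    """Send one streaming completion and return the time to first token."""
    start = time.perf_counter()
    ttft = None
    stream = await client.completions.create(
        model=model, prompt=PROMPT, max_tokens=128, stream=True
    )
    async for _ in stream:
        if ttft is None:  # the first streamed chunk marks TTFT
            ttft = time.perf_counter() - start
    return ttft if ttft is not None else float("nan")


async def sweep(model: str, qps: float, num_requests: int = 32) -> None:
    """Fire requests at roughly `qps` requests/sec and report the mean TTFT."""
    tasks = []
    for _ in range(num_requests):
        tasks.append(asyncio.create_task(one_request(model)))
        await asyncio.sleep(1.0 / qps)
    ttfts = await asyncio.gather(*tasks)
    print(f"{model} @ {qps} QPS: mean TTFT {sum(ttfts) / len(ttfts):.3f}s")


async def main() -> None:
    for qps in (1, 10, 20, 30, 40):
        for model in MODELS:
            await sweep(model, qps)


if __name__ == "__main__":
    asyncio.run(main())
```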

@shashwatj07

shashwatj07 commented Jul 11, 2025

Thanks for your prompt reply @varun-sundar-rabindranath

Can you share the command you are using to benchmark?

I use a simple script. Here.

IIRC, the numbers in the sheet were collected on 1 GPU. At TP=8, it may be that the LoRA matmuls are already too small for rank 8 vs 128 to make much of a difference.

I see the point. However, I still think getting the exact same TTFT isn't reasonable. I tried running a Llama 3.1 8B on a single A100-40GB and the results are consistent with what I stated earlier.

benchmarking TP=4

results are consistent

benchmarking with non-LoRA requests

This is faster as expected.

meta-llama/Llama-3.1-70B,10000,128,4.2373,0.1046

benchmarking with incremental QPS (1, 10, 20, 30, 40, inf)

I will try this and let you know. It may take a while to run. However, I feel something is off that doesn't concern TP.

Just to note, I'm not using FlashInfer. Not sure if that's relevant. The server crashes when I use it.

Also, --fully-sharded-loras seems to have a bug, so I'm not using it. It times out while building the graph on startup.

What do you think?

@varun-sundar-rabindranath
Contributor Author

varun-sundar-rabindranath commented Jul 11, 2025

Hi @shashwatj07 - Thanks for the information. My benchmarking script is at https://drive.google.com/file/d/14Z4TInncSDALi0Lxcr_gMwRv1JXtthtW/view (this information is on the spreadsheet).

I can try out your script to see what is happening - but I am traveling for the next few days and will only be able to get to it next week.

Benchmarking with incremental QPS (1, 10, 20, 30, 40, inf) could give us some insights. Also, could you try increasing the LoRA ranks (maybe to 256 and 512)? Really, I am trying to see if there is some configuration where there is a difference.

Also, --fully-sharded-loras seems to have a bug, so I'm not using it. It times out while building the graph on startup.

Thanks for reporting this. Could you make a GitHub issue for this, please?

cc: @jeejeelee

@shashwatj07

shashwatj07 commented Jul 12, 2025

Hi @varun-sundar-rabindranath, I finally figured it out.

The runtime is dictated by the flag --max-lora-rank, not by the rank of the adapter used for inference. I was expecting that if vLLM is executing a batch of requests that use adapters of maximum rank 8, the runtime would reflect that. However, only the value of --max-lora-rank dictates performance, not the actual adapter being used. For example, I ran a 1 QPS experiment for rank 8 against two servers, one with this flag set to 8 and another with 128. Every other flag was the same. The client was the same. The only adapter being requested was the rank-8 one, in both cases.

With --max-lora-rank 128, I got

model,prompt_size,decode_size,avg_ttft,avg_tbt
meta-llama/Llama-3.1-70B,10000,128,5.0409,0.1146
meta-llama/Llama-3.1-70B-rank-8,10000,128,16.4437,0.1772

With --max-lora-rank 8, I got

model,prompt_size,decode_size,avg_ttft,avg_tbt
meta-llama/Llama-3.1-70B,10000,128,2.3875,0.1393
meta-llama/Llama-3.1-70B-rank-8,10000,128,6.3457,0.1922

I believe we should let the maximum-rank adapter used in the batch be the bottleneck, not the engine initialization parameters (as is the case in S-LoRA). What do you think?

Is there a hack to achieve this in the current codebase? (This would be helpful for the deadline I'm chasing)
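To make the observation above concrete: max_lora_rank is fixed when the engine is constructed, so every LoRA request pays for the configured maximum rank regardless of the rank of the adapter it actually uses. Until that changes, the practical workaround is to size max_lora_rank (or --max-lora-rank when serving) to the largest adapter the server will actually load, or to split adapters across servers by rank. A minimal offline sketch; the model name and adapter path are placeholders:

```python
from vllm import LLM
from vllm.lora.request import LoRARequest

# max_lora_rank is an engine-construction parameter: the LoRA buffers and
# kernels are sized for this value, not for the rank of the adapter used by
# any individual request, so keep it as small as the served adapters allow.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    enable_lora=True,
    max_lora_rank=8,
)

outputs = llm.generate(
    "Hello, my name is",
    lora_request=LoRARequest("rank-8", 1, "/path/to/rank-8-adapter"),  # placeholder path
)
print(outputs[0].outputs[0].text)
```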

@shashwatj07

Hi @varun-sundar-rabindranath
Any hints on the last question would be extremely helpful. Thank you.

@varun-sundar-rabindranath
Contributor Author

Also, --fully-sharded-loras seems to have a bug, so I'm not using it. It times out while building the graph on startup.

fyi @mgoin

@varun-sundar-rabindranath
Contributor Author

Hi @shashwatj07 - Thank you for the detailed experiments.

I believe we should let the maximum-rank adapter used in the batch be the bottleneck, not the engine initialization parameters (as is the case in S-LoRA). What do you think?

I agree, and I think we should be able to accommodate it for prefill at least (i.e., no CUDA graphs). However, I don't think there is a way to do this without a code change. I'll take a stab at it today.
@jeejeelee, what do you think?

@shashwatj07

I agree, and I think we should be able to accommodate it for prefill at least (i.e., no CUDA graphs). However, I don't think there is a way to do this without a code change. I'll take a stab at it today.

Thank you so much @varun-sundar-rabindranath! This will be very helpful.

Also, --fully-sharded-loras seems to have a bug, so I'm not using it. It times out while building the graph on startup.

Thanks for reporting this. Could you make a GitHub issue for this, please?

I wasn't able to reproduce it for some reason. I'll try to figure it out again and open an issue if I get it.

@shashwatj07

shashwatj07 commented Jul 14, 2025

Also, --fully-sharded-loras seems to have a bug, so I'm not using it. It times out while building the graph on startup.

@varun-sundar-rabindranath @jeejeelee @mgoin
I am able to reproduce it now. It only shows up with V1. #20944

@shashwatj07

@varun-sundar-rabindranath Hi Varun, any luck with this?

@shashwatj07

@varun-sundar-rabindranath just wanted to ask if you got a chance to look at this.

@varun-sundar-rabindranath
Contributor Author

Hi @shashwatj07 - Sorry, I haven't been able to look into this. Can you create a GitHub issue for this task, please, so we may request help from the community? Thanks.
