Conversation

@varun-sundar-rabindranath
Contributor

@varun-sundar-rabindranath varun-sundar-rabindranath commented Mar 22, 2025

Enable V1 LoRA by default

Benchmarks:
Link - please check the sheets "V0 vs V1 - rank 8" and "V0 vs V1 - rank 64".
I see improved TTFT and TPOT in V1.

Tests status: link

@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the v1 label Mar 22, 2025
@varun-sundar-rabindranath
Contributor Author

@robertgshaw2-redhat @jeejeelee please take a look! Thanks 🙌

@varun-sundar-rabindranath
Contributor Author

V0 vs V1 (serving default args)
V0 default args - max_num_seqs=256, max_num_batched_tokens=4096
V1 default args - max_num_seqs=1024, max_num_batched_tokens=2048
Look at the "V0 vs V1 - No LoRA" sheet. It looks like with the default args, V0 is better than V1 when the server is overloaded.

V0 vs V1 - default args

V0 vs V1 (max_num_seqs = 256, max_num_batched_tokens=4096 - V0 default args)
V1 looks better with these defaults when the server is overloaded.

V0 vs V1 - Use V0 default args

I have updated the benchmark link in this PR to the "max_num_seqs=256, max_num_batched_tokens=4096" runs, as that V0 vs V1 comparison looks better.

I'll follow up on the default args in another PR / thread.

fyi @WoosukKwon @robertgshaw2-redhat
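For readers reproducing this comparison: below is a minimal offline sketch of pinning these scheduler knobs so both engines run with identical settings. The model name is a placeholder; when serving, the same values map to the --max-num-seqs and --max-num-batched-tokens flags of vllm serve.

```python
from vllm import LLM

# Pin the scheduler knobs to the V0 serving defaults mentioned above so that
# V0 and V1 are compared with identical batching settings.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model for illustration
    enable_lora=True,
    max_num_seqs=256,
    max_num_batched_tokens=4096,
)
```

Without pinning these, the two engines schedule very different batch shapes (V1 defaults to max_num_seqs=1024, max_num_batched_tokens=2048), which is what the default-args comparison above reflects.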

Collaborator

@WoosukKwon WoosukKwon left a comment

LGTM!

@varun-sundar-rabindranath
Contributor Author

Failing test: test_long_context.py fails with V1. For the test to work we need #14241.
After speaking with @jeejeelee about this, we decided to retire this test entirely, as we suspect there aren't any users of this feature. The idea is to add this test back based on user feedback.

TL;DR: This PR is waiting for #15503 to land.

Contributor Author

Cosmetic change: moving run_with_both_engines_lora above the utility functions.
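For context, run_with_both_engines_lora is the pytest fixture that makes each LoRA test run under both engines. A rough sketch of the pattern, assuming it simply toggles VLLM_USE_V1 via monkeypatch (the actual fixture in tests/lora/conftest.py may differ):

```python
import pytest


@pytest.fixture(params=[False, True])
def run_with_both_engines_lora(request, monkeypatch):
    # Any test that requests this fixture is parametrized to run twice:
    # once against the V0 engine and once against the V1 engine.
    use_v1 = request.param
    monkeypatch.setenv("VLLM_USE_V1", "1" if use_v1 else "0")
    yield
```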

Contributor Author

Cosmetic change: moving run_with_both_engines_lora above the utility functions.

Contributor Author

I am checking if the tests pass with the latest main 🤞

Contributor Author

The CI still OOMs - turning the test off for now.

Contributor Author

This has been a no-op since #14685 landed, as both V0 and V1 use the same kernels.

Contributor Author

I believe these numeric test failures are resolved. This was added in #13726, and the failure was due to the "can't find tokenizer file" error. Discussion on Slack: https://vllm-dev.slack.com/archives/C07R5PAL2L9/p1742010357759669

Contributor Author

This worked in local testing - turning this on to see if it passes in CI.

@varun-sundar-rabindranath
Contributor Author

@jeejeelee - I missed some changes to the LoRA tests:

  • test_gemma.py
  • test_llama_tp.py
  • test_lora_manager.py
  • test_transformers_model.py

Changes to other tests are superficial. I made this doc to record the state of the tests with this PR.

fyi @WoosukKwon

Sorry, I am re-requesting review on the PR as the changes to the tests are somewhat significant, so we don't merge it by mistake. I'll keep an eye on the tests.

Thanks 🙌

@jeejeelee
Collaborator

jeejeelee commented Mar 26, 2025

https://github.com/neuralmagic/vllm/tree/varun/v1-lora-enablement/tests/lora/data and https://github.com/neuralmagic/vllm/blob/varun/v1-lora-enablement/tests/lora/conftest.py#L240-L251 should be deleted; #15503 forgot to delete them.

@varun-sundar-rabindranath
Contributor Author

Test times on this PR:
LoRA Test 1: 13m
LoRA Test 2: 23m
LoRA Test 3: 25m
LoRA Test 4: 19m
LoRA TP Test: 46m

Test times on main:
LoRA Test 1: 13m
LoRA Test 2: 22m
LoRA Test 3: 23m
LoRA Test 4: 19m
LoRA TP test: 37m

The increase in TP test time is due to un-skipping some tests (test_llama_tp.py and test_transformers_model.py) for V1.

fyi @jeejeelee

@jeejeelee
Collaborator

I think we can consider merging now. I plan to try optimizing the LoRA test time next week.

@mergify

mergify bot commented Mar 27, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @varun-sundar-rabindranath.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Mar 27, 2025
@mergify mergify bot removed the needs-rebase label Mar 27, 2025
@jeejeelee jeejeelee enabled auto-merge (squash) March 27, 2025 16:22
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Mar 27, 2025
@varun-sundar-rabindranath
Contributor Author

I am looking at what seems to be a valid entrypoints-test failure.

auto-merge was automatically disabled March 28, 2025 17:06

Head branch was pushed to by a user without write access

Varun Sundar Rabindranath added 5 commits March 31, 2025 22:55
Signed-off-by: Varun Sundar Rabindranath <[email protected]>
Signed-off-by: Varun Sundar Rabindranath <[email protected]>
Signed-off-by: Varun Sundar Rabindranath <[email protected]>
Signed-off-by: Varun Sundar Rabindranath <[email protected]>
Signed-off-by: Varun Sundar Rabindranath <[email protected]>
@jeejeelee jeejeelee merged commit 79455cf into vllm-project:main Apr 1, 2025
40 checks passed
Alex4210987 pushed a commit to LeiWang1999/vllm-bitblas that referenced this pull request Apr 5, 2025
Signed-off-by: Varun Sundar Rabindranath <[email protected]>
Co-authored-by: Varun Sundar Rabindranath <[email protected]>
Signed-off-by: xinyuxiao <[email protected]>
lulmer pushed a commit to lulmer/vllm that referenced this pull request Apr 7, 2025
Signed-off-by: Varun Sundar Rabindranath <[email protected]>
Co-authored-by: Varun Sundar Rabindranath <[email protected]>
Signed-off-by: Louis Ulmer <[email protected]>
lk-chen pushed a commit to lk-chen/vllm that referenced this pull request Apr 29, 2025
Signed-off-by: Varun Sundar Rabindranath <[email protected]>
Co-authored-by: Varun Sundar Rabindranath <[email protected]>
shreyankg pushed a commit to shreyankg/vllm that referenced this pull request May 3, 2025
Signed-off-by: Varun Sundar Rabindranath <[email protected]>
Co-authored-by: Varun Sundar Rabindranath <[email protected]>
RichardoMrMu pushed a commit to RichardoMrMu/vllm that referenced this pull request May 12, 2025
Signed-off-by: Varun Sundar Rabindranath <[email protected]>
Co-authored-by: Varun Sundar Rabindranath <[email protected]>
Signed-off-by: Mu Huai <[email protected]>
@shashwatj07

shashwatj07 commented Jul 11, 2025

@varun-sundar-rabindranath Hi Varun, I'm trying to reproduce the results in your Google sheet, but I'm seeing similar performance for adapters of different ranks. Do you have an idea of what may be going wrong? If possible, can you share the exact methodology you used to measure the numbers quoted in the sheet?

vllm serve meta-llama/Llama-3.1-70B -tp 8 --chat-template ./template.jinja --no-enable-prefix-caching --enable-lora --max-cpu-loras 8 --max-num-seqs 16 --max-loras 2 --max-lora-rank 128 --lora-modules rank-8=<path1> rank-128=<path2>

Here are my measurements for 1 QPS. I see the same trend in both V1 and V0.

name,prompt_size,decode_size,avg_ttft,avg_tbt
rank-8,10000,128,9.8747,0.1514
rank-128,10000,128,9.8766,0.1514

This is a little time-sensitive as I'm chasing a deadline. Any hints would be extremely helpful. Thank you.

@varun-sundar-rabindranath
Contributor Author

Hi @shashwatj07, the methodology is described in the first sheet of the Google Sheets file.

Can you share the command you are using to benchmark?

IIRC, the numbers in the sheet were collected on 1 GPU. At TP=8, it may be that the LoRA matmuls are already too small for rank 8 vs 128 to make much of a difference.

Can you try:

  • benchmarking TP=4
  • benchmarking with non-LoRA requests
  • benchmarking with incremental QPS (1, 10, 20, 30, 40, inf) - see the sketch after this list

I think these experiments should shed some light on what is going on. Thanks.
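For anyone reproducing this, here is a rough sketch of the kind of QPS sweep suggested above - not the script behind the spreadsheet - measuring TTFT against a running vllm serve instance through the OpenAI-compatible streaming API. The endpoint, adapter names, prompt construction, and request counts are all assumptions:

```python
import asyncio
import time

from openai import AsyncOpenAI  # pip install openai

# Assumes `vllm serve ... --lora-modules rank-8=<path1> rank-128=<path2>`
# is listening on localhost:8000.
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
PROMPT = "word " * 10000          # crude stand-in for a ~10k-token prompt
MODELS = ["rank-8", "rank-128"]   # adapter names registered via --lora-modules


async def one_request(model: str) -> float:
    """Send one streaming completion and return the time to first token."""
    start = time.perf_counter()
    ttft = None
    stream = await client.completions.create(
        model=model, prompt=PROMPT, max_tokens=128, stream=True
    )
    async for _ in stream:
        if ttft is None:  # the first streamed chunk marks TTFT
            ttft = time.perf_counter() - start
    return ttft if ttft is not None else float("nan")


async def sweep(model: str, qps: float, num_requests: int = 32) -> None:
    """Fire requests at roughly `qps` requests/sec and report the mean TTFT."""
    tasks = []
    for _ in range(num_requests):
        tasks.append(asyncio.create_task(one_request(model)))
        await asyncio.sleep(1.0 / qps)
    ttfts = await asyncio.gather(*tasks)
    print(f"{model} @ {qps} QPS: mean TTFT {sum(ttfts) / len(ttfts):.3f}s")


async def main() -> None:
    for qps in (1, 10, 20, 30, 40):
        for model in MODELS:
            await sweep(model, qps)


if __name__ == "__main__":
    asyncio.run(main())
```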

@shashwatj07

shashwatj07 commented Jul 11, 2025

Thanks for your prompt reply @varun-sundar-rabindranath

Can you share the command you are using to benchmark?

I use a simple script. Here.

IIRC, the numbers in the sheet were collected on 1 GPU. At TP=8, it may be that the LoRA matmuls are already too small for rank 8 vs 128 to make much of a difference.

I see the point. However, I still think getting the exact same TTFT isn't reasonable. I tried running a Llama 3.1 8B on a single A100-40GB and the results are consistent with what I stated earlier.

benchmarking TP=4

results are consistent

benchmarking with non-LoRA requests

This is faster as expected.

meta-llama/Llama-3.1-70B,10000,128,4.2373,0.1046

benchmarking with incremental QPS (1, 10, 20, 30, 40, inf)

I will try this and let you know. It may take a while to run. However, I feel something is off that doesn't concern TP.

Just to note, I'm not using FlashInfer. Not sure if that's relevant. The server crashes when I use it.

Also, --fully-sharded-loras seems to have a bug, so I'm not using it. It times out while building the graph on startup.

What do you think?

@varun-sundar-rabindranath
Contributor Author

varun-sundar-rabindranath commented Jul 11, 2025

Hi @shashwatj07 - Thanks for the information. My benchmarking script is at https://drive.google.com/file/d/14Z4TInncSDALi0Lxcr_gMwRv1JXtthtW/view (this information is on the spreadsheet).

I can try out your script to see what is happening - but I am traveling for the next few days and will only be able to get to it next week.

Benchmarking with incremental QPS (1, 10, 20, 30, 40, inf) could give us some insights. Also, could you try increasing the LoRA ranks (maybe to 256 and 512)? Really, I am trying to see if there is some configuration where there is a difference.

Also, --fully-sharded-loras seems to have a bug, so I'm not using it. It times out while building the graph on startup.

Thanks for reporting this. Could you make a GitHub issue for this, please?

cc: @jeejeelee

@shashwatj07

shashwatj07 commented Jul 12, 2025

Hi @varun-sundar-rabindranath, I finally figured it out.

The runtime is dictated by the flag --max-lora-rank, not by the rank of the adapter used for inference. I was expecting that if vLLM is executing a batch of requests that use adapters of maximum rank 8, the runtime would reflect that. However, only the value of --max-lora-rank dictates performance, not the actual adapter being used. For example, I ran a 1 QPS experiment for rank 8 against two servers, one with this flag set to 8 and another with 128. Every other flag was the same. The client was the same. The only adapter being requested was the rank-8 one, in both cases.

With --max-lora-rank 128, I got

model,prompt_size,decode_size,avg_ttft,avg_tbt
meta-llama/Llama-3.1-70B,10000,128,5.0409,0.1146
meta-llama/Llama-3.1-70B-rank-8,10000,128,16.4437,0.1772

With --max-lora-rank 8, I got

model,prompt_size,decode_size,avg_ttft,avg_tbt
meta-llama/Llama-3.1-70B,10000,128,2.3875,0.1393
meta-llama/Llama-3.1-70B-rank-8,10000,128,6.3457,0.1922

I believe we should let the maximum-rank adapter used in the batch be the bottleneck, not the engine initialization parameters (as is the case in S-LoRA). What do you think?

Is there a hack to achieve this in the current codebase? (This would be helpful for the deadline I'm chasing)
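To make the observation above concrete: max_lora_rank is fixed when the engine is constructed, so every LoRA request pays for the configured maximum rank regardless of the rank of the adapter it actually uses. Until that changes, the practical workaround is to size max_lora_rank (or --max-lora-rank when serving) to the largest adapter the server will actually load, or to split adapters across servers by rank. A minimal offline sketch; the model name and adapter path are placeholders:

```python
from vllm import LLM
from vllm.lora.request import LoRARequest

# max_lora_rank is an engine-construction parameter: the LoRA buffers and
# kernels are sized for this value, not for the rank of the adapter used by
# any individual request, so keep it as small as the served adapters allow.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    enable_lora=True,
    max_lora_rank=8,
)

outputs = llm.generate(
    "Hello, my name is",
    lora_request=LoRARequest("rank-8", 1, "/path/to/rank-8-adapter"),  # placeholder path
)
print(outputs[0].outputs[0].text)
```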

@shashwatj07

Hi @varun-sundar-rabindranath
Any hints on the last question would be extremely helpful. Thank you.

@varun-sundar-rabindranath
Contributor Author

Also, --fully-sharded-loras seems to have a bug, so I'm not using it. It times out while building the graph on startup.

fyi @mgoin

@varun-sundar-rabindranath
Contributor Author

Hi @shashwatj07 - Thank you for the detailed experiments.

I believe we should let the maximum-rank adapter used in the batch be the bottleneck, not the engine initialization parameters (as is the case in S-LoRA). What do you think?

I agree, and I think we should be able to accommodate it for prefill at least (i.e., no CUDA graphs). However, I don't think there is a way to do this without a code change. I'll take a stab at it today.
@jeejeelee, what do you think?

@shashwatj07

I agree, and I think we should be able to accommodate it for prefill at least (i.e., no CUDA graphs). However, I don't think there is a way to do this without a code change. I'll take a stab at it today.

Thank you so much @varun-sundar-rabindranath! This will be very helpful.

Also, --fully-sharded-loras seems to have a bug, so I'm not using it. It times out while building the graph on startup.

Thanks for reporting this. Could you make a GitHub issue for this, please?

I wasn't able to reproduce it for some reason. I'll try to figure it out again and open an issue if I get it.

@shashwatj07

shashwatj07 commented Jul 14, 2025

Also, --fully-sharded-loras seems to have a bug, so I'm not using it. It times out while building the graph on startup.

@varun-sundar-rabindranath @jeejeelee @mgoin
I am able to reproduce it now. It only shows up with V1. #20944

@shashwatj07

@varun-sundar-rabindranath Hi Varun, any luck with this?

@shashwatj07

@varun-sundar-rabindranath just wanted to ask if you got a chance to look at this.

@varun-sundar-rabindranath
Contributor Author

Hi @shashwatj07 - Sorry, I haven't been able to look into this. Can you create a GitHub issue for this task, please, so we may request help from the community? Thanks.
