[Core] Support load and unload LoRA in api server #6566
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, please make sure to run full CI, as it is required to merge (or just use auto-merge).
vllm/lora/request.py
we should mark the dataclass as frozen if we want to use hash
yeah. good catch. I will add @dataclass(frozen=True) then
vllm/lora/request.py:37: error: Property "lora_path" defined in "LoRARequest" is read-only [misc]
vllm/lora/request.py:72: error: Property "lora_path" defined in "LoRARequest" is read-only [misc]
I ran into some issues when adding frozen; the problem is that we keep compatibility support for a deprecated field. At the moment nothing modifies lora_name, so I think it should be safe. Once we remove the deprecated field, we can add frozen back.
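For context, here is a minimal sketch of the conflict and workaround being discussed. The class and field names are illustrative rather than the exact vLLM code: with `@dataclass(frozen=True)`, the setter of the backwards-compatibility property can no longer assign to the underlying field (hence the mypy errors quoted above), so one option is to keep the dataclass mutable and define `__eq__`/`__hash__` explicitly.

```python
# Illustrative sketch, not the exact vLLM code: why frozen=True conflicts with a
# deprecated, settable property, and a manual __eq__/__hash__ as a workaround.
from dataclasses import dataclass


@dataclass(eq=False)  # not frozen: the deprecated property below still needs a setter
class LoRARequestSketch:
    lora_name: str
    lora_int_id: int
    lora_path: str = ""

    # Backwards-compatible alias for an older field name (hypothetical here).
    @property
    def lora_local_path(self) -> str:
        return self.lora_path

    @lora_local_path.setter
    def lora_local_path(self, value: str) -> None:
        # With @dataclass(frozen=True), this assignment is what mypy flags as
        # 'Property "lora_path" defined in "LoRARequest" is read-only'.
        self.lora_path = value

    # Identity is keyed on the adapter name, so requests stay hashable even
    # though the dataclass is not frozen.
    def __eq__(self, other: object) -> bool:
        return isinstance(other, LoRARequestSketch) and self.lora_name == other.lora_name

    def __hash__(self) -> int:
        return hash(self.lora_name)
```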
We should allow operators to disable those methods, and have them disabled by default; having users specify the path to the adapter themselves is inherently unsafe.
Makes sense. Let me add a feature gate for it.
I added a feature flag and left a TODO here. Even once it's enabled, the risk remains. The key problem is that if we want to support admin operations, we have to distinguish between users, which isn't supported yet. In our system we distinguish users at the proxy level; if vLLM is exposed to users directly, that's a problem.
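A minimal sketch of the kind of feature gate being discussed is shown below. The environment-variable name and handler are illustrative, not the exact vLLM implementation: the idea is simply that dynamic adapter loading stays off unless the operator opts in explicitly.

```python
# Sketch only: gate the dynamic-LoRA endpoints behind an operator-controlled flag.
import os

from fastapi import FastAPI
from fastapi.responses import JSONResponse

app = FastAPI()

# Disabled by default; the operator must opt in explicitly.
ALLOW_RUNTIME_LORA_UPDATING = os.environ.get(
    "VLLM_ALLOW_RUNTIME_LORA_UPDATING", "0") in ("1", "true", "True")


@app.post("/v1/load_lora_adapter")
async def load_lora_adapter(request: dict):
    if not ALLOW_RUNTIME_LORA_UPDATING:
        return JSONResponse(
            status_code=400,
            content={"error": "Dynamic LoRA loading is disabled. "
                              "Set VLLM_ALLOW_RUNTIME_LORA_UPDATING to enable it."})
    # ... hand off to the serving engine to validate and load the adapter ...
    return JSONResponse(content={"status": f"loaded {request.get('lora_name')}"})
```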
Hey @Jeffwan, are you still planning on finishing up this branch? LMK if I can help in any way, this is a great feature!

@jgordley Yeah, I am still working on it. I was off last week and I will spend some time fixing the UTs. Thanks for offering the help, I will let you know if I run into problems.
This could become duplicated with multiple edits:
Example [1,2,3,4] -> remove 3 -> [1,2,4] -> add new adapter -> [1,2,4,4]
This is a known issue and another PR will fix it. Even if we change to a global counter, the user can still pass an int id and override it, which means the counter may generate a duplicate id next time. Technically, we should always use the model name from the user, and the id should be managed by the engine.
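A small sketch of the duplication described above, versus a monotonically increasing counter; the names here are illustrative, not the PR's code.

```python
# Deriving the next id from the current number of adapters breaks after a removal:
ids = [1, 2, 3, 4]
ids.remove(3)              # -> [1, 2, 4]
next_id = len(ids) + 1     # -> 4, which already exists
ids.append(next_id)        # -> [1, 2, 4, 4]  duplicated id

# A monotonically increasing counter avoids reuse, as long as callers cannot
# override the generated value with their own int id:
class Counter:
    def __init__(self, start: int = 0) -> None:
        self._value = start

    def inc(self) -> int:
        self._value += 1
        return self._value

counter = Counter(4)              # seeded past the highest id ever issued
fresh_ids = [1, 2, 4]
fresh_ids.append(counter.inc())   # -> [1, 2, 4, 5]
```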
@Yard1 can you help review the updated commits? There are some test failures due to connection timeouts and I am retrying those in the meantime.

@jeejeelee can you help review this one? Thank you!

If there's no rush, I'll take some time to look at it this weekend.
@Jeffwan Thank you very much for your great work. I've left some comments, thanks~

@jeejeelee you might need to submit your review in GitHub, as it is not showing here.

I will address all the comments by tomorrow.
@jeejeelee Hi, I addressed all the comments; please take another look.

/ready
@jeejeelee doc update looks good to me. I accepted the suggestions.
Stamp given @jeejeelee's comment. Thanks @Jeffwan for the hard work and thanks @jeejeelee for the shepherding!
self.served_model_names = served_model_names
self.lora_id_counter = AtomicCounter(0)
@Jeffwan this is not required here. asyncio operations all happen in the same thread. Can change this to be a simple int field.
@njhill yeah, that makes sense. Let me file a follow up PR to improve it.
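For illustration, the follow-up being suggested could look roughly like the sketch below (class and method names are assumptions, not the PR's code): because the OpenAI serving handlers all run on one asyncio event-loop thread, a plain int field is sufficient and no atomic counter is needed.

```python
# Illustrative follow-up sketch: a plain int suffices because these handlers all
# run on the same asyncio event-loop thread, and there is no await between the
# increment and the read below.
class OpenAIServingModelsSketch:
    def __init__(self) -> None:
        self.lora_id_counter = 0  # simple int instead of AtomicCounter(0)

    async def load_lora_adapter(self, lora_name: str, lora_path: str) -> int:
        self.lora_id_counter += 1
        lora_int_id = self.lora_id_counter
        # ... construct the LoRARequest with (lora_name, lora_int_id, lora_path)
        #     and ask the engine to load it ...
        return lora_int_id
```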
Addresses #6275 #3308 #5491.

This PR adds two methods to the API server so users can load/unload LoRA adapters explicitly:

- Update serving_engine.py and add load and unload with request checking.
- Update api_server.py and add the API endpoints and request objects.
- Add __eq__ and __hash__ to LoRARequest. We used to use lora_int_id as the identifier, but the tricky problem we met is that users may not be able to persist the id on the client side every time; using the model name is more straightforward. What's more, there's no need to load the same LoRA adapter multiple times.

My proposal is to use lora_name as the identifier. The other benefit is that we can use consistent names across instances. For example, say we have 2-3 instances and we want all of them to load a specific adapter and later unload it; using the name is more consistent in such scenarios. A sketch of the resulting endpoints is shown below.
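A condensed sketch of how the load/unload flow described above can fit together. The endpoint paths, request models, and field names follow the shape of the discussion but are illustrative rather than the exact contents of this PR; identity is keyed on the adapter name, so the same name can be used consistently across several instances.

```python
# Sketch only: name-keyed load/unload endpoints for LoRA adapters.
from typing import Optional

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class LoadLoRAAdapterRequest(BaseModel):
    lora_name: str
    lora_path: str


class UnloadLoRAAdapterRequest(BaseModel):
    lora_name: str
    lora_int_id: Optional[int] = None


# In-memory registry keyed by adapter name, so reloading the same name is a no-op.
loaded_adapters: dict[str, str] = {}


@app.post("/v1/load_lora_adapter")
async def load_lora_adapter(request: LoadLoRAAdapterRequest):
    if request.lora_name in loaded_adapters:
        return {"status": f"adapter '{request.lora_name}' already loaded"}
    loaded_adapters[request.lora_name] = request.lora_path
    # ... forward to the engine so the adapter is actually loaded ...
    return {"status": f"loaded adapter '{request.lora_name}'"}


@app.post("/v1/unload_lora_adapter")
async def unload_lora_adapter(request: UnloadLoRAAdapterRequest):
    loaded_adapters.pop(request.lora_name, None)
    # ... forward to the engine so the adapter is actually removed ...
    return {"status": f"unloaded adapter '{request.lora_name}'"}
```

With a layout like this, loading an adapter is a single POST of `{"lora_name": ..., "lora_path": ...}` to the load endpoint, and unloading only needs the name, which matches the name-based identity argument above.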
PR Checklist
Thank you for your contribution to vLLM! Before submitting the pull request, please ensure the PR meets the following criteria. This helps vLLM maintain the code quality and improve the efficiency of the review process.
PR Title and Classification
Only specific types of PRs will be reviewed. The PR title is prefixed appropriately to indicate the type of change. Please use one of the following:
- [Bugfix] for bug fixes.
- [CI/Build] for build or continuous integration improvements.
- [Doc] for documentation fixes and improvements.
- [Model] for adding a new model or improving an existing model. Model name should appear in the title.
- [Frontend] for changes on the vLLM frontend (e.g., OpenAI API server, LLM class, etc.)
- [Kernel] for changes affecting CUDA kernels or other compute kernels.
- [Core] for changes in the core vLLM logic (e.g., LLMEngine, AsyncLLMEngine, Scheduler, etc.)
- [Hardware][Vendor] for hardware-specific changes. Vendor name should appear in the prefix (e.g., [Hardware][AMD]).
- [Misc] for PRs that do not fit the above categories. Please use this sparingly.

Note: If the PR spans more than one category, please include all relevant prefixes.
Code Quality
The PR needs to meet the following code quality standards:

- Use format.sh to format your code.
- Add documentation to docs/source/ if the PR modifies the user-facing behaviors of vLLM. It helps vLLM users understand and utilize the new features or changes.

Notes for Large Changes
Please keep the changes as concise as possible. For major architectural changes (>500 LOC excluding kernel/data/config/test), we would expect a GitHub issue (RFC) discussing the technical design and justification. Otherwise, we will tag it with rfc-required and might not go through the PR.

What to Expect for the Reviews
The goal of the vLLM team is to be a transparent reviewing machine. We would like to make the review process transparent and efficient and make sure no contributor feels confused or frustrated. However, the vLLM team is small, so we need to prioritize some PRs over others. Here is what you can expect from the review process:

- We will apply an action-required label on the PR if there are changes required. The contributor should address the comments and ping the reviewer to re-review the PR.

Thank You
Finally, thank you for taking the time to read these guidelines and for your interest in contributing to vLLM. Your contributions make vLLM a great tool for everyone!