Prefix caching and deallocation mechanism #2511
Conversation
@zhuohan123 I have added the prefix caching and the deallocation of prefixes in this PR. Feel free to take a look. I will keep adding test cases, but I think it is in a state where we can start reviewing it.
Force-pushed from 2a6ee48 to 6bf3c45.
Pardon the long comment that follows, but I am trying to be as precise and thorough as possible, striving to provide a solution that is correct. Below I describe two alternative methodologies for deallocating prefixes; in this PR, we implement solution 1. Description of the two methodologies for deallocating prefixes:
vllm/config.py
Outdated
How about profiling the maximum capacity from the current max GPU memory if the user doesn't set this parameter, and raising an error when the server cannot afford a larger maximum capacity?
@esmeetu I agree with you. I think we should be more sophisticated: instead of measuring prefix_pool_max_capacity as the number of prefixes the prefix pool can contain, we should measure the max capacity as the percentage of GPU memory we are willing to devote to the prefix pool. This makes the implementation of the PrefixPool class a little more involved, but all of the complexity can be contained inside that class, which is a good thing.
For now, I am going to keep it the way it is and introduce that improvement in a future PR. I will move fast on it: as soon as this one is accepted and merged (if accepted), I will move on to implementing that more sophisticated approach; a rough sketch of the idea is below.
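For concreteness, a minimal sketch of how a GPU-memory fraction could be translated into a block budget for the pool. The function name, the `block_bytes` parameter, and the use of device 0 are assumptions for illustration, not code from this PR:

```python
import torch

def prefix_pool_block_budget(memory_fraction: float, block_bytes: int) -> int:
    """Translate a user-supplied GPU-memory fraction into the number of
    KV-cache blocks the prefix pool is allowed to occupy (illustrative only)."""
    total_bytes = torch.cuda.get_device_properties(0).total_memory
    return int(total_bytes * memory_fraction) // block_bytes
```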
@esmeetu The last commit changed the PrefixPool implementation to use the number of blocks instead of the number of prefixes; a sketch of what that interface could look like follows.
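A minimal sketch of a block-capped pool with least-recently-used deallocation, assuming hypothetical names (`Prefix`, `add_or_get_prefix`) that may differ from the actual code in this PR:

```python
from collections import OrderedDict
from typing import Optional, Tuple


class Prefix:
    """A cached prefix of token ids whose KV cache spans whole blocks."""

    def __init__(self, token_ids: Tuple[int, ...], block_size: int) -> None:
        self.token_ids = token_ids
        self.block_size = block_size
        self.num_blocks = len(token_ids) // block_size
        self.computed = False  # set once the KV cache for it has been filled


class PrefixPool:
    """Pool of cached prefixes, capped by the total number of KV-cache
    blocks the prefixes occupy rather than by the number of prefixes."""

    def __init__(self, block_size: int, max_num_blocks: int) -> None:
        self.block_size = block_size
        self.max_num_blocks = max_num_blocks
        self.num_blocks_in_use = 0
        # token ids -> Prefix, ordered from least to most recently used.
        self._prefixes: "OrderedDict[Tuple[int, ...], Prefix]" = OrderedDict()

    def add_or_get_prefix(self, token_ids: Tuple[int, ...]) -> Optional[Prefix]:
        # Truncate to a whole number of blocks; too-short prefixes
        # are not worth caching.
        length = len(token_ids) // self.block_size * self.block_size
        if length == 0:
            return None
        key = tuple(token_ids[:length])
        if key in self._prefixes:
            self._prefixes.move_to_end(key)  # mark as recently used
            return self._prefixes[key]
        num_blocks = length // self.block_size
        if num_blocks > self.max_num_blocks:
            return None  # the prefix alone exceeds the whole budget
        # Deallocate least-recently-used prefixes until the new one fits.
        while (self.num_blocks_in_use + num_blocks > self.max_num_blocks
               and self._prefixes):
            _, evicted = self._prefixes.popitem(last=False)
            self.num_blocks_in_use -= evicted.num_blocks
        prefix = Prefix(key, self.block_size)
        self._prefixes[key] = prefix
        self.num_blocks_in_use += num_blocks
        return prefix
```

Capping by blocks rather than by prefix count means one very long prefix cannot silently consume far more memory than many short ones.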
Force-pushed from 574d9a8 to 7685f71.
These issues have been resolved by doing the following in the
Force-pushed from f19a2f2 to 0fd470c.
@WoosukKwon @simon-mo @zhuohan123 I think this PR is ready for review. All issues regarding memory leaks and syncing with the master branch have been resolved.
vllm/core/scheduler.py
Outdated
Note to self: Remember to remove these comments when discussed
Force-pushed from 25c0c21 to 8f83072.
This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!
This pull request has merge conflicts that must be resolved before it can be merged.
This branch introduces caching of prefixes and deallocation of old prefixes.
I have already added some tests for the PrefixPool class, and I will be adding extra tests for the caching and deallocation mechanisms; the kind of check I have in mind is sketched below.
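A hypothetical pytest-style check of the block-budget cap, reusing the illustrative PrefixPool sketch above (its names are assumptions, not necessarily this PR's API):

```python
def test_pool_evicts_when_block_budget_exceeded():
    pool = PrefixPool(block_size=4, max_num_blocks=2)
    first = pool.add_or_get_prefix(tuple(range(8)))  # occupies 2 blocks
    assert pool.num_blocks_in_use == 2
    pool.add_or_get_prefix(tuple(range(8, 16)))  # over budget: evicts `first`
    assert pool.num_blocks_in_use == 2
    # The first prefix was deallocated, so re-adding it creates a new object.
    assert pool.add_or_get_prefix(tuple(range(8))) is not first
```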