Prefix caching and deallocation mechanism #2511
Conversation
@zhuohan123 I have added the prefix caching and the deallocation of prefixes in this PR. Feel free to take a look. I will keep adding test cases, but I think it is in a state where we can start reviewing it.
Force-pushed from 2a6ee48 to 6bf3c45.
Pardon the long comment that follows, but I am trying to be as precise and thorough as possible, striving to provide a solution that is correct. Below I describe two alternative methodologies for deallocating prefixes; in this PR, we implement solution 1. Description of the two methodologies for deallocating prefixes:
vllm/config.py
Outdated
How about profiling the maximum capacity from the current max GPU memory if the user doesn't set this parameter, and raising an error when the server cannot afford a larger maximum capacity?
@esmeetu I agree with you. I think we should be more sophisticated: instead of measuring prefix_pool_max_capacity as the number of prefixes the prefix pool can contain, we should measure the max capacity as the percentage of GPU memory we are willing to devote to the prefix pool. This makes the implementation of the PrefixPool class a little more involved, but all of the complexity can be contained inside that class, which is a good thing.
For now, I am going to keep it the way it is and introduce that improvement in a future PR. I will move fast on it: as soon as this one is accepted and merged (if accepted), I will move on to implementing that more sophisticated approach; a rough sketch of the idea is below.
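For concreteness, a minimal sketch of how a GPU-memory fraction could be translated into a block budget for the pool. The function name, the `block_bytes` parameter, and the use of device 0 are assumptions for illustration, not code from this PR:

```python
import torch

def prefix_pool_block_budget(memory_fraction: float, block_bytes: int) -> int:
    """Translate a user-supplied GPU-memory fraction into the number of
    KV-cache blocks the prefix pool is allowed to occupy (illustrative only)."""
    total_bytes = torch.cuda.get_device_properties(0).total_memory
    return int(total_bytes * memory_fraction) // block_bytes
```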
@esmeetu The last commit changed the PrefixPool implementation to use the number of blocks instead of the number of prefixes; a sketch of what that interface could look like follows.
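A minimal sketch of a block-capped pool with least-recently-used deallocation, assuming hypothetical names (`Prefix`, `add_or_get_prefix`) that may differ from the actual code in this PR:

```python
from collections import OrderedDict
from typing import Optional, Tuple


class Prefix:
    """A cached prefix of token ids whose KV cache spans whole blocks."""

    def __init__(self, token_ids: Tuple[int, ...], block_size: int) -> None:
        self.token_ids = token_ids
        self.block_size = block_size
        self.num_blocks = len(token_ids) // block_size
        self.computed = False  # set once the KV cache for it has been filled


class PrefixPool:
    """Pool of cached prefixes, capped by the total number of KV-cache
    blocks the prefixes occupy rather than by the number of prefixes."""

    def __init__(self, block_size: int, max_num_blocks: int) -> None:
        self.block_size = block_size
        self.max_num_blocks = max_num_blocks
        self.num_blocks_in_use = 0
        # token ids -> Prefix, ordered from least to most recently used.
        self._prefixes: "OrderedDict[Tuple[int, ...], Prefix]" = OrderedDict()

    def add_or_get_prefix(self, token_ids: Tuple[int, ...]) -> Optional[Prefix]:
        # Truncate to a whole number of blocks; too-short prefixes
        # are not worth caching.
        length = len(token_ids) // self.block_size * self.block_size
        if length == 0:
            return None
        key = tuple(token_ids[:length])
        if key in self._prefixes:
            self._prefixes.move_to_end(key)  # mark as recently used
            return self._prefixes[key]
        num_blocks = length // self.block_size
        if num_blocks > self.max_num_blocks:
            return None  # the prefix alone exceeds the whole budget
        # Deallocate least-recently-used prefixes until the new one fits.
        while (self.num_blocks_in_use + num_blocks > self.max_num_blocks
               and self._prefixes):
            _, evicted = self._prefixes.popitem(last=False)
            self.num_blocks_in_use -= evicted.num_blocks
        prefix = Prefix(key, self.block_size)
        self._prefixes[key] = prefix
        self.num_blocks_in_use += num_blocks
        return prefix
```

Capping by blocks rather than by prefix count means one very long prefix cannot silently consume far more memory than many short ones.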
Force-pushed from 574d9a8 to 7685f71.
These issues have been resolved by doing the following in the
Force-pushed from f19a2f2 to 0fd470c.
@WoosukKwon @simon-mo @zhuohan123 I think this PR is ready for review. All issues regarding memory leaks and syncing with the master branch have been resolved.
vllm/core/scheduler.py
Outdated
Note to self: Remember to remove these comments when discussed
Force-pushed from 25c0c21 to 8f83072.
This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!
This pull request has merge conflicts that must be resolved before it can be merged.
This branch introduces caching of prefixes and deallocation of old prefixes.
I have already added some tests for the PrefixPool class, and I will be adding extra tests for the caching and deallocation mechanisms; the kind of check I have in mind is sketched below.
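A hypothetical pytest-style check of the block-budget cap, reusing the illustrative PrefixPool sketch above (its names are assumptions, not necessarily this PR's API):

```python
def test_pool_evicts_when_block_budget_exceeded():
    pool = PrefixPool(block_size=4, max_num_blocks=2)
    first = pool.add_or_get_prefix(tuple(range(8)))  # occupies 2 blocks
    assert pool.num_blocks_in_use == 2
    pool.add_or_get_prefix(tuple(range(8, 16)))  # over budget: evicts `first`
    assert pool.num_blocks_in_use == 2
    # The first prefix was deallocated, so re-adding it creates a new object.
    assert pool.add_or_get_prefix(tuple(range(8))) is not first
```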