
Conversation

@jadielam commented Jan 19, 2024

This branch introduces caching of prefixes and deallocation of old prefixes.

  1. For caching it introduces the concept of max_capacity in the prefix pool.
  2. For deallocation, it deallocates prefixes that have been evicted from the prefix pool (using an LRU policy) and that are no longer in use by any SequenceGroup.
  • There was a certain complexity with the deallocation because of the swap_in and swap_out mechanisms for blocks: block.ref_count was not a good representation of how many sequences were sharing a prefix. Because of that, a new seq_ref_counter was introduced in the Prefix class.

I have already added some tests for the PrefixPool class and will be adding extra tests for the caching and deallocation mechanisms.
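For reviewers who have not opened the diff yet, here is a rough, self-contained sketch of the kind of pool described above. It is not the PR's actual code; names such as `add_or_get_prefix`, `evicted`, and `get_deallocatable_prefixes` are illustrative assumptions, and the real implementation also has to coordinate with the block manager.

```python
from collections import OrderedDict
from typing import List, Optional, Tuple


class Prefix:
    """Illustrative sketch of a shared prefix with its own sequence reference counter."""

    def __init__(self, token_ids: Tuple[int, ...], block_size: int) -> None:
        self.token_ids = token_ids
        self.block_size = block_size
        self.allocated = False      # whether physical blocks are currently assigned
        self.seq_ref_count = 0      # running SequenceGroups currently sharing this prefix

    def __len__(self) -> int:
        return len(self.token_ids)


class PrefixPool:
    """Illustrative LRU-bounded pool: evicted prefixes become deallocation candidates."""

    def __init__(self, block_size: int, max_capacity: int) -> None:
        self.block_size = block_size
        self.max_capacity = max_capacity
        self.prefixes: "OrderedDict[Tuple[int, ...], Prefix]" = OrderedDict()
        self.evicted: List[Prefix] = []    # candidates to deallocate later

    def add_or_get_prefix(self, token_ids: List[int]) -> Optional[Prefix]:
        # Only whole blocks can be shared, so truncate to a block boundary.
        usable_len = (len(token_ids) // self.block_size) * self.block_size
        if usable_len == 0:
            return None
        key = tuple(token_ids[:usable_len])
        if key in self.prefixes:
            self.prefixes.move_to_end(key)            # refresh LRU position
            return self.prefixes[key]
        prefix = Prefix(key, self.block_size)
        self.prefixes[key] = prefix
        if len(self.prefixes) > self.max_capacity:
            _, lru_prefix = self.prefixes.popitem(last=False)   # evict least recently used
            self.evicted.append(lru_prefix)
        return prefix

    def get_deallocatable_prefixes(self) -> List[Prefix]:
        # Evicted, already allocated, and no longer shared by any running SequenceGroup.
        ready = [p for p in self.evicted if p.allocated and p.seq_ref_count == 0]
        self.evicted = [p for p in self.evicted if p not in ready]
        return ready
```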

@jadielam (Author) commented:

@zhuohan123 I have added prefix caching and deallocation of prefixes in this PR. Feel free to take a look. I will keep adding test cases, but I think it is in a state where review can start.

@simon-mo requested a review from zhuohan123 on January 19, 2024 at 22:25
@jadielam (Author) commented Jan 21, 2024

Pardon the long comment that follows, but I am trying to be as precise and thorough as possible, striving to provide a solution that is correct. Below I describe two alternative methodologies for deallocating prefixes. In this PR we are implementing solution 1.1, but as you will see, solution 1.1 has two issues, which I describe below. I discuss alternatives to solve those issues and conclude that solution 1.1.1.1.3 is the simplest and most elegant. I will follow up with a commit to implement that solution, but wanted to share with the group the thought process that took me there.

1. Description of two different methodologies for deallocating prefixes:
1.1. At the end of each step, call the prefix pool and ask it for all prefixes that have been evicted from the cache's OrderedDict (and moved to a list of candidates to remove), that have already been allocated, and that have a seq_ref_count == 0. When a prefix P has seq_ref_count == 0, there is no sequence group that has been removed from the waiting list, has not yet finished computing, and shares prefix P, so it is safe to deallocate prefix P.
1.1.1. But this methodology has a problem:
1.1.1.1: Consider a Sequence Group S that uses a prefix P and is still in the waiting list when prefix P is deallocated because it satisfied the conditions described in 1.1. When this sequence group S finishes computing or is aborted, at the end of that step (and of all later steps), when we ask the prefix pool for all the prefixes that can be deallocated, this prefix (which by now has been allocated a second time, but not added to the prefix pool a second time) will not be returned again for deallocation, because it has already been removed from the pool.
1.1.1.1.1: One way of solving this is using methodology 1.2.
1.1.1.1.2: Another way of solving this is to only add a prefix to the prefix pool once it has been allocated, but this means modifying the current code a little too much.
1.1.1.1.3: Another way of solving this is to also keep track, for a given prefix, of the Sequence Groups in the waiting list that reference it, and, while that count is greater than 0, not deallocate it yet. (I am leaning towards this one, because it will be simpler to implement than the previous two solutions; see the sketch after this list.)
1.2. Instead of asking the prefix pool, at the end of each step, for a list of prefixes that can be deallocated, simply check, for each sequence group that either finishes computing or is aborted, whether its prefix has already been removed from the prefix pool's OrderedDict, and if it has, deallocate it.
1.2.1. But this methodology has a problem: Suppose that the last Sequence Group ever sent to the server that uses a given prefix P finishes computing, but prefix P has not yet been removed from the OrderedDict of the prefix pool, then this prefix will never be deallocated from memory.
1.2.1.1. Methodology 1.1 does not have this issue, so one potential solution is to use methodologies 1.1 and 1.2 in conjunction for deallocating prefixes. But methodology 1.2 was introduced here in the first place as a fix to the issue with methodology 1.1. I am leaning towards implementing what is described in 1.1.1.1.3 because it is simpler to do.
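To make 1.1.1.1.3 concrete, here is a minimal sketch of the extra bookkeeping it implies. It is only illustrative: the hook names and the `waiting_ref_count` attribute are assumptions, not the code in this branch.

```python
class Prefix:
    def __init__(self) -> None:
        self.allocated = False
        self.seq_ref_count = 0       # running SequenceGroups sharing this prefix
        self.waiting_ref_count = 0   # SequenceGroups still in the waiting list

    def can_deallocate(self) -> bool:
        # Safe only when no running *and* no waiting SequenceGroup uses it.
        return self.allocated and self.seq_ref_count == 0 and self.waiting_ref_count == 0


# Hypothetical scheduler hooks that keep the counters in sync:

def on_request_added(prefix: Prefix) -> None:
    prefix.waiting_ref_count += 1    # request enters the waiting list


def on_request_scheduled(prefix: Prefix) -> None:
    prefix.waiting_ref_count -= 1    # leaves the waiting list ...
    prefix.seq_ref_count += 1        # ... and starts running


def on_request_finished_or_aborted(prefix: Prefix) -> None:
    # (A request aborted while still waiting would decrement waiting_ref_count instead.)
    prefix.seq_ref_count -= 1


def free_evicted_prefixes(evicted: list, block_manager) -> None:
    # Called at the end of each step, as in methodology 1.1.
    for prefix in list(evicted):
        if prefix.can_deallocate():
            block_manager.free_prefix(prefix)   # hypothetical block-manager hook
            evicted.remove(prefix)
```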

Review comment on vllm/config.py (outdated)

@esmeetu (Member) commented:
How about profiling the maximum capacity from the currently available GPU memory if the user doesn't set this parameter, and raising an error when the server cannot afford a larger maximum capacity?

@jadielam (Author) replied on Jan 22, 2024:

@esmeetu I agree with you. I think we should be more sophisticated: instead of measuring prefix_pool_max_capacity in terms of the number of prefixes the prefix pool can contain, measure the max capacity in terms of the percentage of GPU memory we are willing to dedicate to the prefix pool.

This makes the implementation of the PrefixPool class a little more involved, but I think that all the complexity can be contained inside that class, which is a good thing.

For now, I think I am going to keep it the way it is, and then in a future PR I can introduce that improvement. I will move fast on it: as soon as this one is accepted and merged (if accepted), I will move on to implementing that more sophisticated approach.
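As a very rough illustration of what the GPU-memory-based variant could look like (this is not in the PR; the function name, parameters, the 10% default, and the use of torch.cuda.mem_get_info are all assumptions for the sketch):

```python
from typing import Optional

import torch


def resolve_prefix_pool_capacity(requested_blocks: Optional[int],
                                 block_bytes: int,
                                 memory_fraction: float = 0.10) -> int:
    """Pick a prefix-pool capacity (in blocks) from free GPU memory, or
    validate a user-supplied value. Illustrative sketch only."""
    free_bytes, _total_bytes = torch.cuda.mem_get_info()
    affordable_blocks = int(free_bytes * memory_fraction) // block_bytes
    if requested_blocks is None:
        return affordable_blocks           # default: profile it from free memory
    if requested_blocks > affordable_blocks:
        raise ValueError(
            f"Requested prefix pool capacity of {requested_blocks} blocks, but the "
            f"GPU can only afford about {affordable_blocks} blocks.")
    return requested_blocks
```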

@jadielam (Author) replied:
@esmeetu The last commit changes the PrefixPool implementation to measure capacity in number of blocks instead of number of prefixes.
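Roughly, measuring capacity in blocks means the pool tracks how many blocks its cached prefixes occupy and evicts least-recently-used prefixes until it is back under budget. A sketch (with assumed names such as `max_num_blocks` and `cached_block_count`; not the literal PR code):

```python
def blocks_needed(prefix_len: int, block_size: int) -> int:
    # Prefixes are truncated to block boundaries, so this divides evenly.
    return prefix_len // block_size


def evict_until_within_budget(pool) -> None:
    # `pool.prefixes` is the LRU OrderedDict; `pool.evicted` holds deallocation candidates.
    while pool.cached_block_count > pool.max_num_blocks and pool.prefixes:
        _, lru_prefix = pool.prefixes.popitem(last=False)
        pool.cached_block_count -= blocks_needed(len(lru_prefix), pool.block_size)
        pool.evicted.append(lru_prefix)
```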

@jadielam (Author) commented:

The issues described in my Jan 21 comment above have been resolved by the following changes in the [Solves memory leak issue with prefixes in waiting list](https://github.com/vllm-project/vllm/pull/2511/commits/d7ee5498b3b2bb63ed172b2691e08e21b1e90fd0) commit:

  1. Whenever a prefix is removed from the list of candidates to free in the prefix pool, mark that prefix as expired.
  2. Add a @property method for prefix in the SequenceGroup class that sets self._prefix to None if self._prefix is expired, and then returns self._prefix. Now, when a sequence group is in the waiting list but its prefix has been removed from the PrefixPool for some reason (and is hence expired), by the time this sequence group is scheduled it effectively has no prefix, so allocation proceeds the normal way, without re-allocating the already expired prefix and causing a memory leak.
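A minimal sketch of that expiration mechanism, assuming an `expired` flag on Prefix and the property described in point 2 (names and structure are illustrative, not the exact commit):

```python
from typing import Optional


class Prefix:
    def __init__(self) -> None:
        self.expired = False   # set once the pool has dropped this prefix for good


class PrefixPool:
    def discard_candidate(self, prefix: Prefix) -> None:
        # Point 1: when a prefix leaves the list of candidates to free,
        # mark it as expired so no waiting SequenceGroup tries to reuse it.
        prefix.expired = True


class SequenceGroup:
    def __init__(self, prefix: Optional[Prefix] = None) -> None:
        self._prefix = prefix

    @property
    def prefix(self) -> Optional[Prefix]:
        # Point 2: a waiting SequenceGroup whose prefix has expired behaves as
        # if it never had a prefix, so the expired prefix is never re-allocated
        # and its blocks cannot leak.
        if self._prefix is not None and self._prefix.expired:
            self._prefix = None
        return self._prefix
```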

@jadielam (Author) commented:

@WoosukKwon @simon-mo @zhuohan123 I think that this PR is ready for review. All issues regarding memory leaks and sync with the master branch have been solved.

@jadielam (Author) commented on Jan 26, 2024:

Note to self: Remember to remove these comments when discussed

@zhuohan123 (Member) commented:

Thanks for the contribution! However, we are moving to a more automated method as in #2614 and #2762. Feel free to comment more over there!


This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!

@github-actions bot added the stale (Over 90 days of inactivity) label on Oct 30, 2024
@mergify bot added the documentation (Improvements or additions to documentation) and frontend labels on Oct 30, 2024

mergify bot commented Oct 30, 2024

This pull request has merge conflicts that must be resolved before it can be merged. @jadielam please rebase it: https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify bot added the needs-rebase label on Oct 30, 2024
@github-actions bot added the unstale (Received activity after being labelled stale) label and removed the stale (Over 90 days of inactivity) label on Nov 2, 2024
@hmellor closed this on Feb 17, 2025