Labels: performance (Performance-related issues)
Description
Proposal to improve performance
Currently, we consistently use list[int] to represent output tokens in ModelRunnerOutput, which is very inefficient from a GC perspective.
The default GC thresholds are (700, 10, 10), which means:
- GC0 is triggered once (allocations - deallocations) in generation 0 reaches 700
- GC1 is triggered after every 10 GC0 passes
- GC2 is triggered after every 10 GC1 passes
In large-batch-size scenarios (small models), each batch can contain as many as 1024 requests, so the per-step burst of list allocations triggers GC0 every decode cycle, GC1 every 10 decode cycles, and GC2 every 100 decode cycles. This is very inefficient!
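The effect above can be reproduced in isolation. This is an illustrative sketch (not vLLM code): with the default thresholds, building one Python list per request for a 1024-request batch already exceeds the generation-0 trigger of 700 net container allocations, so a GC0 pass runs during every such decode step.

```python
import gc

gc.enable()
gc.set_threshold(700, 10, 10)     # CPython defaults, made explicit here
gc.collect()                      # start from a clean slate
gen0_before = gc.get_stats()[0]["collections"]

batch_size = 1024                 # stand-in for max_batch_size
# One list per request, as pinned.tolist() in _to_list would produce.
sampled_token_ids = [[i] for i in range(batch_size)]

gen0_after = gc.get_stats()[0]["collections"]
# The allocation burst alone crossed the gen-0 threshold, so at least
# one generation-0 collection fired mid-batch.
assert gen0_after > gen0_before
```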
Up to max_batch_size lists are created here:
vllm/vllm/v1/worker/gpu_model_runner.py
Lines 4504 to 4517 in a38c1bf
def _to_list(self, sampled_token_ids: torch.Tensor) -> list[list[int]]:
    # This is a short term mitigation for issue mentioned in
    # https://github.com/vllm-project/vllm/issues/22754.
    # `tolist` would trigger a cuda wise stream sync, which
    # would block other copy ops from other cuda streams. A
    # cuda event sync would avoid such a situation. Since
    # this is in the critical path of every single model
    # forward loop, this has caused perf issue for a disagg
    # setup.
    pinned = self.sampled_token_ids_pinned_cpu[: sampled_token_ids.shape[0]]
    pinned.copy_(sampled_token_ids, non_blocking=True)
    self.transfer_event.record()
    self.transfer_event.synchronize()
    return pinned.tolist()
Proposal #1 (Change OutputToken from list[int] to Union[int, list[int]])
#26368 shows very promising results of this proposal
- 19% throughput boost in infinite request rate scenario for facebook-125m
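A minimal sketch of the idea behind Proposal #1 (the names below are illustrative, not the actual vLLM types): in the common decode case each request samples exactly one token, so storing a bare int instead of a one-element list removes one GC-tracked container per request per step, since ints are not tracked by the cyclic GC while lists are.

```python
from typing import Union

def pack_sampled(token_ids: list[list[int]]) -> list[Union[int, list[int]]]:
    # Collapse single-token outputs to plain ints; keep multi-token
    # outputs (e.g. speculative decoding) as lists.
    return [ids[0] if len(ids) == 1 else ids for ids in token_ids]

def unpack_one(entry: Union[int, list[int]]) -> list[int]:
    # Consumers that still need a list can rebuild it on demand.
    return [entry] if isinstance(entry, int) else entry

packed = pack_sampled([[7], [3, 9], [42]])
print(packed)          # [7, [3, 9], 42]
print(unpack_one(7))   # [7]
```

With this representation, a 1024-request decode step allocates roughly one list (the outer container) instead of 1025, which keeps the per-step allocation count well below the generation-0 threshold.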
Proposal #2 (Semi-Hacky but simple)
Increase the GC0 threshold by max_batch_size, which ensures GC0 won't be triggered right after the sampled-output tensor is converted to list[list[int]]. But I think this is hacky, and it sounds more like a short-term mitigation than a long-term solution.
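A sketch of what Proposal #2 would amount to (the wiring is assumed, not actual vLLM code): raise only the generation-0 threshold by the maximum batch size, so the per-step burst of list allocations no longer crosses it on its own.

```python
import gc

max_batch_size = 1024             # hypothetical engine config value

# Bump only the gen-0 threshold; leave the gen-1/gen-2 ratios alone.
gen0, gen1, gen2 = gc.get_threshold()
gc.set_threshold(gen0 + max_batch_size, gen1, gen2)
print(gc.get_threshold())         # e.g. (1724, 10, 10) from the defaults
```

Note this is a process-global knob: it also delays collection for every other allocation site, which is part of why it reads as a mitigation rather than a fix.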
Report of performance regression
No response
Misc discussion on performance
No response
Your current environment (if you think it is necessary)
No response
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.