[Performance]: Use int over list[int] as output_tokens to reduce GC overhead #26369

@Jialin

Description

Proposal to improve performance

Currently, we consistently use list[int] to represent output_tokens in ModelRunnerOutput, which is very inefficient from a GC perspective.

The default GC thresholds are (700, 10, 10), which means:

  • a generation-0 collection (GC0) runs once allocated_objects - deallocated_objects >= 700;
  • GC1 is triggered after every 10 GC0 runs;
  • GC2 is triggered after every 10 GC1 runs.

In large-batch scenarios (small models), a single batch can hold up to 1024 sequences, so the per-step list allocations alone exceed the GC0 threshold: GC0 fires on every decode step, GC1 every 10 decode steps, and GC2 every 100 decode steps, which is very inefficient!

Up to max_batch_size lists are created here:

def _to_list(self, sampled_token_ids: torch.Tensor) -> list[list[int]]:
    # This is a short-term mitigation for the issue mentioned in
    # https://github.com/vllm-project/vllm/issues/22754.
    # `tolist` would trigger a CUDA-wide stream sync, which
    # would block copy ops on other CUDA streams. A CUDA event
    # sync avoids that. Since this is on the critical path of
    # every model forward loop, it has caused perf issues for
    # disaggregated setups.
    pinned = self.sampled_token_ids_pinned_cpu[: sampled_token_ids.shape[0]]
    pinned.copy_(sampled_token_ids, non_blocking=True)
    self.transfer_event.record()
    self.transfer_event.synchronize()
    return pinned.tolist()

Proposal #1 (Change OutputToken from list[int] to Union[int, list[int]])

#26368 shows very promising results for this proposal:

  • 19% throughput boost in infinite request rate scenario for facebook-125m
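A minimal sketch of what Proposal #1 could look like (hypothetical helper names, not the actual #26368 code): the common single-token decode step stores a bare int, and only multi-token outputs (e.g. speculative decoding) fall back to list[int]. A bare int avoids allocating one fresh GC-tracked list per sequence per step.

```python
from typing import Union

# Hypothetical type for output_tokens under Proposal #1.
OutputTokens = Union[int, list[int]]

def as_token_list(tokens: OutputTokens) -> list[int]:
    # Hypothetical consumer-side normalization: widen a bare
    # int back to a list only where a list is actually needed.
    return [tokens] if isinstance(tokens, int) else tokens

print(as_token_list(5))      # [5]
print(as_token_list([5, 7])) # [5, 7]
```

The design choice is to pay the isinstance branch at the few consumption sites rather than allocate a container per sequence on every decode step.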

Proposal #2 (Semi-Hacky but simple)

Increase the GC0 threshold by max_batch_size, which ensures GC0 won't be triggered right after the sampled-output tensor is converted to list[list[int]]. But this is hacky, and it sounds more like a short-term mitigation than a long-term solution.
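Proposal #2 amounts to a one-line gc tweak at engine startup. A hedged sketch, assuming max_batch_size comes from the engine config:

```python
import gc

max_batch_size = 1024  # assumed engine config value

# Bump only the gen0 threshold so that one batch's
# tensor-to-list[list[int]] conversion (~max_batch_size new
# lists) cannot by itself trip a gen0 collection every step.
gen0, gen1, gen2 = gc.get_threshold()
gc.set_threshold(gen0 + max_batch_size, gen1, gen2)
```

Note the tradeoff: a larger gen0 threshold delays all collections process-wide, so short-lived garbage from unrelated code also lives longer between collections.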

Report of performance regression

No response

Misc discussion on performance

No response

Your current environment (if you think it is necessary)

No response

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

Labels

performance (Performance-related issues)