Labels: performance (Performance-related issues)
Description
Proposal to improve performance
Currently, we consistently use list[int] to represent output tokens in ModelRunnerOutput, which is very inefficient from a GC perspective.
The default GC thresholds are (700, 10, 10), which means:
- GC0 is triggered once (allocations - deallocations) in generation 0 reaches 700
- GC1 is triggered after every 10 GC0 passes
- GC2 is triggered after every 10 GC1 passes
In large-batch-size scenarios (small models), each batch can contain as many as 1024 requests, so the per-step burst of list allocations triggers GC0 every decode cycle, GC1 every 10 decode cycles, and GC2 every 100 decode cycles. This is very inefficient!
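The effect above can be reproduced in isolation. This is an illustrative sketch (not vLLM code): with the default thresholds, building one Python list per request for a 1024-request batch already exceeds the generation-0 trigger of 700 net container allocations, so a GC0 pass runs during every such decode step.

```python
import gc

gc.enable()
gc.set_threshold(700, 10, 10)     # CPython defaults, made explicit here
gc.collect()                      # start from a clean slate
gen0_before = gc.get_stats()[0]["collections"]

batch_size = 1024                 # stand-in for max_batch_size
# One list per request, as pinned.tolist() in _to_list would produce.
sampled_token_ids = [[i] for i in range(batch_size)]

gen0_after = gc.get_stats()[0]["collections"]
# The allocation burst alone crossed the gen-0 threshold, so at least
# one generation-0 collection fired mid-batch.
assert gen0_after > gen0_before
```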
Up to max_batch_size lists are created here:
vllm/vllm/v1/worker/gpu_model_runner.py
Lines 4504 to 4517 in a38c1bf
def _to_list(self, sampled_token_ids: torch.Tensor) -> list[list[int]]:
    # This is a short term mitigation for issue mentioned in
    # https://github.com/vllm-project/vllm/issues/22754.
    # `tolist` would trigger a cuda wise stream sync, which
    # would block other copy ops from other cuda streams. A
    # cuda event sync would avoid such a situation. Since
    # this is in the critical path of every single model
    # forward loop, this has caused perf issue for a disagg
    # setup.
    pinned = self.sampled_token_ids_pinned_cpu[: sampled_token_ids.shape[0]]
    pinned.copy_(sampled_token_ids, non_blocking=True)
    self.transfer_event.record()
    self.transfer_event.synchronize()
    return pinned.tolist()
Proposal #1 (Change OutputToken from list[int] to Union[int, list[int]])
#26368 shows very promising results of this proposal
- 19% throughput boost in infinite request rate scenario for facebook-125m
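A minimal sketch of the idea behind Proposal #1 (the names below are illustrative, not the actual vLLM types): in the common decode case each request samples exactly one token, so storing a bare int instead of a one-element list removes one GC-tracked container per request per step, since ints are not tracked by the cyclic GC while lists are.

```python
from typing import Union

def pack_sampled(token_ids: list[list[int]]) -> list[Union[int, list[int]]]:
    # Collapse single-token outputs to plain ints; keep multi-token
    # outputs (e.g. speculative decoding) as lists.
    return [ids[0] if len(ids) == 1 else ids for ids in token_ids]

def unpack_one(entry: Union[int, list[int]]) -> list[int]:
    # Consumers that still need a list can rebuild it on demand.
    return [entry] if isinstance(entry, int) else entry

packed = pack_sampled([[7], [3, 9], [42]])
print(packed)          # [7, [3, 9], 42]
print(unpack_one(7))   # [7]
```

With this representation, a 1024-request decode step allocates roughly one list (the outer container) instead of 1025, which keeps the per-step allocation count well below the generation-0 threshold.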
Proposal #2 (Semi-Hacky but simple)
Increase the GC0 threshold by max_batch_size, which ensures GC0 won't be triggered right after the sampled-output tensor is converted to list[list[int]]. But I think this is hacky, and it sounds more like a short-term mitigation than a long-term solution.
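A sketch of what Proposal #2 would amount to (the wiring is assumed, not actual vLLM code): raise only the generation-0 threshold by the maximum batch size, so the per-step burst of list allocations no longer crosses it on its own.

```python
import gc

max_batch_size = 1024             # hypothetical engine config value

# Bump only the gen-0 threshold; leave the gen-1/gen-2 ratios alone.
gen0, gen1, gen2 = gc.get_threshold()
gc.set_threshold(gen0 + max_batch_size, gen1, gen2)
print(gc.get_threshold())         # e.g. (1724, 10, 10) from the defaults
```

Note this is a process-global knob: it also delays collection for every other allocation site, which is part of why it reads as a mitigation rather than a fix.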
Report of performance regression
No response
Misc discussion on performance
No response
Your current environment (if you think it is necessary)
No response
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.