Misc discussion on performance
Checking #6164, `_prepare_model_input_tensors` has been refactored to improve performance. I investigated the performance of `_prepare_model_input_tensors` with respect to different batch sizes, input sequence lengths, output sequence lengths, and tensor parallel sizes by running `benchmark_latency.py`. I found that the duration of `_prepare_model_input_tensors` grows roughly in direct proportion to the batch size (i.e., the number of seq_groups), which points to an obvious speedup: parallelizing the loop in `_prepare_model_input_tensors`.
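For reference, this is roughly how I collected the timings: a minimal monkey-patch sketch wrapped around the benchmark run. It assumes the method lives on `GPUModelRunnerBase` in `vllm.worker.model_runner`, as in the version I tested; the exact module path and class name may differ in other versions.

```python
import time

# Assumed import path for the post-#6164 refactor; adjust for your version.
from vllm.worker.model_runner import GPUModelRunnerBase

durations = []
_original = GPUModelRunnerBase._prepare_model_input_tensors

def _timed(self, *args, **kwargs):
    # Record the wall-clock duration of each call.
    start = time.perf_counter()
    result = _original(self, *args, **kwargs)
    durations.append(time.perf_counter() - start)
    return result

GPUModelRunnerBase._prepare_model_input_tensors = _timed

# ... run benchmark_latency.py's workload here, then summarize:
# print(f"mean: {sum(durations) / len(durations) * 1e3:.2f} ms "
#       f"over {len(durations)} calls")
```

Here are my questions about the follow-ups mentioned in #6164: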
- What will the design of "Parallelize the loop `for seq_group_metadata in seq_group_metadata_list`" look like? Using a thread pool? (A rough sketch of this idea follows the list below.)
- Are we going to implement a CUDA kernel to "Remove the loop `for seq_id in seq_ids` in `ModelRunnerInputBuilder._add_seq_group()`"?
- When will these follow-up optimizations be available? I would like to know whether I can contribute.
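To make the first question concrete, here is a hypothetical sketch of what a thread-pool version of the loop might look like. `process_seq_group` is a stand-in for the per-group work that the builder does today; it is not a real vLLM function.

```python
from concurrent.futures import ThreadPoolExecutor

def process_seq_group(seq_group_metadata):
    """Placeholder for the per-group input building done per iteration."""
    ...

def prepare_inputs_parallel(seq_group_metadata_list, max_workers=8):
    # executor.map preserves input order, which matters if the builder
    # later concatenates per-group results positionally.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(process_seq_group, seq_group_metadata_list))
```

One caveat with this approach: since CPython's GIL serializes pure-Python bytecode, a thread pool only yields real speedup if the per-group work spends most of its time outside the GIL (e.g., in C/CUDA tensor ops). If the loop body is mostly Python-level bookkeeping, threads would leave it effectively serial, which may be why the CUDA-kernel follow-up is also on the table.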