
[Performance]: Discussion about optimizing _prepare_model_input_tensors #6684

@phantomlei3

Description


Misc discussion on performance

Checking #6164, `_prepare_model_input_tensors` was refactored to improve performance. I investigated the performance of `_prepare_model_input_tensors` with respect to batch size, input sequence length, output sequence length, and tensor parallel size by running benchmark_latency.py. I found that the duration of `_prepare_model_input_tensors` is directly proportional to the batch size (i.e., the number of seq_groups), which makes it an obvious candidate for a speedup by parallelizing the loop in `_prepare_model_input_tensors` (a minimal timing sketch of this measurement follows the questions below). This leads to my questions about the follow-up work mentioned in #6164:

  1. What is the planned design for "Parallelize the loop for seq_group_metadata in seq_group_metadata_list"? A thread pool? (See the sketch after this list.)
  2. Is the plan to implement a CUDA kernel that can "Remove the loop for seq_id in seq_ids in ModelRunnerInputBuilder._add_seq_group()"?
  3. When will these follow-up optimizations be available? I would like to know whether I can contribute.
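For reference, here is a minimal sketch of the measurement described above, not vLLM's actual benchmark harness. It assumes a `model_runner` exposing `_prepare_model_input_tensors(seq_group_metadata_list)` and a hypothetical `make_seq_group_metadata` helper that builds one dummy `SequenceGroupMetadata` (in benchmark_latency.py the scheduler produces these):

```python
import time

def time_prepare(model_runner, make_seq_group_metadata, batch_sizes, repeats=10):
    """Time _prepare_model_input_tensors at each batch size."""
    results = {}
    for bs in batch_sizes:
        # make_seq_group_metadata is a hypothetical helper that builds one
        # dummy SequenceGroupMetadata for this sketch.
        seq_group_metadata_list = [make_seq_group_metadata(i) for i in range(bs)]
        # Warm-up call so one-time allocations don't skew the first sample.
        model_runner._prepare_model_input_tensors(seq_group_metadata_list)
        start = time.perf_counter()
        for _ in range(repeats):
            model_runner._prepare_model_input_tensors(seq_group_metadata_list)
        results[bs] = (time.perf_counter() - start) / repeats
    return results

# If the per-seq_group loop dominates, results[bs] grows roughly linearly
# with bs, which is what I observed.
```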
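And here is a rough sketch of what a thread-pool version of the loop (question 1) might look like. `process_one_group` and `merge` are hypothetical methods: the current builder mutates shared state in `_add_seq_group`, so making the per-group work side-effect free is the prerequisite refactor. Also, because the per-group work is pure Python, a thread pool only helps if the GIL is released; otherwise a process pool or the CUDA-kernel route from question 2 may be needed.

```python
from concurrent.futures import ThreadPoolExecutor

_POOL = ThreadPoolExecutor(max_workers=8)  # would be sized to CPU cores in practice

def build_inputs_parallel(builder, seq_group_metadata_list):
    """Map/reduce variant of the per-seq_group loop (hypothetical API)."""
    # Map: process each seq_group independently. process_one_group is a
    # hypothetical side-effect-free version of _add_seq_group that returns
    # per-group partial results instead of mutating shared builder state.
    partials = list(_POOL.map(builder.process_one_group, seq_group_metadata_list))
    # Reduce: merge the partial results into the final flat input tensors.
    return builder.merge(partials)  # hypothetical merge step
```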

Metadata


Labels

performance (Performance-related issues), stale (Over 90 days of inactivity)
