Misc discussion on performance
Checking #6164, `_prepare_model_input_tensors` has been refactored to improve performance. I investigated the performance of `_prepare_model_input_tensors` with respect to different batch sizes, input sequence lengths, output sequence lengths, and tensor parallel sizes by running `benchmark_latency.py`. I found that the duration of `_prepare_model_input_tensors` grows roughly in direct proportion to the batch size (i.e., the number of seq_groups), which points to an obvious speedup: parallelizing the loop in `_prepare_model_input_tensors`.
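For reference, this is roughly how I collected the timings: a minimal monkey-patch sketch wrapped around the benchmark run. It assumes the method lives on `GPUModelRunnerBase` in `vllm.worker.model_runner`, as in the version I tested; the exact module path and class name may differ in other versions.

```python
import time

# Assumed import path for the post-#6164 refactor; adjust for your version.
from vllm.worker.model_runner import GPUModelRunnerBase

durations = []
_original = GPUModelRunnerBase._prepare_model_input_tensors

def _timed(self, *args, **kwargs):
    # Record the wall-clock duration of each call.
    start = time.perf_counter()
    result = _original(self, *args, **kwargs)
    durations.append(time.perf_counter() - start)
    return result

GPUModelRunnerBase._prepare_model_input_tensors = _timed

# ... run benchmark_latency.py's workload here, then summarize:
# print(f"mean: {sum(durations) / len(durations) * 1e3:.2f} ms "
#       f"over {len(durations)} calls")
```

Here are my questions about the follow-ups mentioned in #6164: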
- What will the design of "Parallelize the loop `for seq_group_metadata in seq_group_metadata_list`" look like? Using a thread pool? (A rough sketch of this idea follows the list below.)
- Are we going to implement a CUDA kernel to "Remove the loop `for seq_id in seq_ids` in `ModelRunnerInputBuilder._add_seq_group()`"?
- When will these follow-up optimizations be available? I would like to know whether I can contribute.
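To make the first question concrete, here is a hypothetical sketch of what a thread-pool version of the loop might look like. `process_seq_group` is a stand-in for the per-group work that the builder does today; it is not a real vLLM function.

```python
from concurrent.futures import ThreadPoolExecutor

def process_seq_group(seq_group_metadata):
    """Placeholder for the per-group input building done per iteration."""
    ...

def prepare_inputs_parallel(seq_group_metadata_list, max_workers=8):
    # executor.map preserves input order, which matters if the builder
    # later concatenates per-group results positionally.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(process_seq_group, seq_group_metadata_list))
```

One caveat with this approach: since CPython's GIL serializes pure-Python bytecode, a thread pool only yields real speedup if the per-group work spends most of its time outside the GIL (e.g., in C/CUDA tensor ops). If the loop body is mostly Python-level bookkeeping, threads would leave it effectively serial, which may be why the CUDA-kernel follow-up is also on the table.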