Commit 6f8720e
[None][chore] Make tile_tokens_dim calculation just in time before kernel launching.
`tile_tokens_dim` depends directly on `num_tokens`, which is a dynamic shape during both tuning and inference. When the AutoTuner prepares dummy tensors with different `num_tokens`, it does not update `tile_tokens_dim` accordingly, so the value stored in the AutoTuner cache becomes misaligned with the actual input. This causes many cache misses during inference and significantly hurts performance.
To avoid this issue, we move the calculation of `tile_tokens_dim` to just before kernel launch, so its value always matches the `num_tokens` of the current input tensor used by the kernel runner. With this change, the extra autotuning warmup steps for all the CUDA graph batch sizes can be removed, avoiding the extra warmup time cost.
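The just-in-time approach described above can be sketched as follows. This is a minimal illustration, not the actual TensorRT-LLM code: the names `tile_tokens_dim_for` and `launch_kernel`, and the power-of-two tiling policy, are all hypothetical stand-ins.

```python
def tile_tokens_dim_for(num_tokens: int) -> int:
    """Hypothetical policy: pick the smallest power-of-two tile size
    covering num_tokens, clamped to [8, 64]."""
    tile = 8
    while tile < num_tokens and tile < 64:
        tile *= 2
    return tile


def launch_kernel(num_tokens: int) -> int:
    # Just-in-time: recompute tile_tokens_dim from the *current* input
    # shape at launch, instead of reading a value that the AutoTuner
    # cached while tuning with dummy tensors of a different num_tokens.
    tile_tokens_dim = tile_tokens_dim_for(num_tokens)
    # ... the real code would launch the kernel with this tile size ...
    return tile_tokens_dim


print(launch_kernel(3), launch_kernel(100))  # -> 8 64
```

Because the tile size is derived from the live input shape rather than stored alongside tuning results, the cached tuner entries can no longer go stale when `num_tokens` changes between tuning and inference.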
Signed-off-by: Yukun He <[email protected]>
1 parent 7f3f658
File tree (4 files changed, +282 −216 lines):
- tensorrt_llm/_torch
  - custom_ops
  - modules/fused_moe
  - pyexecutor
- tests/unittest/_torch/thop/parallel