Commit 6f8720e
[None][chore] Make tile_tokens_dim calculation just in time before kernel launching.
`tile_tokens_dim` depends directly on `num_tokens`, which is a dynamic shape during both tuning and inference. When the AutoTuner prepares dummy tensors with different `num_tokens`, it does not update `tile_tokens_dim` accordingly, so the value stored in the AutoTuner cache becomes misaligned with the actual input. This causes many cache misses during inference and significantly hurts performance.
To avoid this issue, we move the calculation of `tile_tokens_dim` to just before kernel launch, so its value always matches the `num_tokens` of the current input tensor used by the kernel runner. With this change, the extra autotuning warmup steps for all the CUDA graph batch sizes can be removed, avoiding the extra warmup time cost.
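The just-in-time approach described above can be sketched as follows. This is a minimal illustration, not the actual TensorRT-LLM code: the names `tile_tokens_dim_for` and `launch_kernel`, and the power-of-two tiling policy, are all hypothetical stand-ins.

```python
def tile_tokens_dim_for(num_tokens: int) -> int:
    """Hypothetical policy: pick the smallest power-of-two tile size
    covering num_tokens, clamped to [8, 64]."""
    tile = 8
    while tile < num_tokens and tile < 64:
        tile *= 2
    return tile


def launch_kernel(num_tokens: int) -> int:
    # Just-in-time: recompute tile_tokens_dim from the *current* input
    # shape at launch, instead of reading a value that the AutoTuner
    # cached while tuning with dummy tensors of a different num_tokens.
    tile_tokens_dim = tile_tokens_dim_for(num_tokens)
    # ... the real code would launch the kernel with this tile size ...
    return tile_tokens_dim


print(launch_kernel(3), launch_kernel(100))  # -> 8 64
```

Because the tile size is derived from the live input shape rather than stored alongside tuning results, the cached tuner entries can no longer go stale when `num_tokens` changes between tuning and inference.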
Signed-off-by: Yukun He <[email protected]>
1 parent 7f3f658
File tree (4 files changed, +282 −216 lines):
- tensorrt_llm/_torch
  - custom_ops
  - modules/fused_moe
  - pyexecutor
- tests/unittest/_torch/thop/parallel