Commit cd80e0a

[None][fix] Make tile_tokens_dim calculation just in time before kernel launching. (#7529)
tile_tokens_dim depends directly on num_tokens, which is a dynamic shape during both tuning and inference. When the AutoTuner prepares dummy tensors with different num_tokens values, it does not update tile_tokens_dim accordingly, so the value stored in the AutoTuner cache becomes misaligned with the actual input. This causes frequent cache misses during inference and significantly hurts performance.

To avoid this, the calculation of tile_tokens_dim is moved to just before kernel launch, so that its value is always consistent with the num_tokens of the current input tensor passed to the kernel runner. In addition, tile_tokens_dim is now computed from the token count of the tuned bucket rather than from the raw input token count: tuning is performed per bucket, not per raw token count, so deriving the value from the bucket avoids unexpected misalignment between tile_tokens_dim and the tuned shape.

This PR also removes the warmup requests with extra input shapes that were triggered during the CUDA graph warmup phase.

Signed-off-by: Yukun He <[email protected]>
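The fix can be illustrated with a minimal sketch. The helper names (`next_power_of_2`, `calculate_tile_tokens_dim`, `MoERunner`) and the clamping formula below are illustrative assumptions, not the actual code in this commit; the point is that tile_tokens_dim is derived at launch time from the tuned bucket's token count instead of being read from a value cached during tuning:

```python
import math

def next_power_of_2(n: int) -> int:
    """Round n up to the nearest power of two (minimum 1)."""
    return 1 << max(0, math.ceil(math.log2(max(n, 1))))

def calculate_tile_tokens_dim(num_tokens: int, num_experts: int, top_k: int) -> int:
    # Hypothetical formula: estimate the average number of tokens routed
    # to each expert, then clamp the tile size to a hardware-friendly range.
    tokens_per_expert = num_tokens * top_k // num_experts
    return min(max(next_power_of_2(tokens_per_expert), 8), 64)

class MoERunner:
    """Illustrative kernel-runner shell, not the repository's class."""

    def __init__(self, num_experts: int, top_k: int):
        self.num_experts = num_experts
        self.top_k = top_k

    def launch(self, tuned_bucket_tokens: int) -> int:
        # Just-in-time: compute tile_tokens_dim from the token count of the
        # tuned bucket right before kernel launch, so it always matches the
        # shape the AutoTuner actually profiled, avoiding cache misses.
        tile_tokens_dim = calculate_tile_tokens_dim(
            tuned_bucket_tokens, self.num_experts, self.top_k
        )
        return tile_tokens_dim
```

Because the value is recomputed from the bucket on every launch, nothing stale can survive in the AutoTuner cache: the cache key and the launch-time configuration are derived from the same bucketed token count.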
1 parent 327e5e5 commit cd80e0a

File tree

4 files changed, +263 −215 lines
