Conversation


@DomBrown DomBrown commented Jun 3, 2025

Description

Extend the TRT-LLM Gen FP8 BMM torch operator to integrate with the PyTorch workflow kernel autotuner, so that the best kernel config can be selected by profiling for a given kernel runner config and set of matrix shapes.
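As a rough illustration of what the autotuner integration does, the sketch below times a set of candidate kernel configs ("tactics") for one set of input shapes and keeps the fastest. The helper name and the `run(tactic)` callable are illustrative assumptions, not the TRT-LLM autotuner's actual interfaces.

```python
# Minimal sketch of kernel-config autotuning, assuming `run(tactic)` launches
# the FP8 BMM kernel with a given config; not the real TRT-LLM implementation.
import torch

def pick_best_tactic(run, tactics, warmup=3, iters=10):
    """Time each candidate kernel config and return the fastest one."""
    best_tactic, best_ms = None, float("inf")
    for tactic in tactics:
        for _ in range(warmup):
            run(tactic)  # warm-up runs are not timed
        torch.cuda.synchronize()
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(iters):
            run(tactic)
        end.record()
        torch.cuda.synchronize()
        ms = start.elapsed_time(end) / iters
        if ms < best_ms:
            best_tactic, best_ms = tactic, ms
    return best_tactic
```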

Slightly modifies the kernel tuner to handle None tensors, as some TRT-LLM Gen kernels have input tensors that are optional based on the kernel configuration.
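The None handling can be pictured roughly as follows: when the tuner prepares dummy profiling buffers from the example inputs, entries that arrive as None are passed through untouched instead of being materialized. This is only a sketch of the behaviour described above, not the tuner's actual code.

```python
# Sketch only: build profiling buffers from example inputs while tolerating
# None entries for tensors that are optional for the chosen kernel config.
from typing import List, Optional
import torch

def make_profiling_inputs(
        example_inputs: List[Optional[torch.Tensor]]) -> List[Optional[torch.Tensor]]:
    buffers: List[Optional[torch.Tensor]] = []
    for t in example_inputs:
        if t is None:
            buffers.append(None)  # optional input not used by this configuration
        else:
            buffers.append(torch.zeros(t.shape, dtype=t.dtype, device=t.device))
    return buffers
```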

Test Coverage

tests/unittest/_torch/thop/test_tllmg_bmm.py

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--disable-fail-fast --skip-test --stage-list "A10-1, xxx" --gpu-type "A30, H100_PCIe" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-[Post-Merge]-1, xxx"]

Launch build/test pipelines. All previously running jobs will be killed.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests. Will also run L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-[Post-Merge]-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-[Post-Merge]-1, xxx".
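For example, to run just one test stage with fail-fast disabled, the options above can be combined in a single comment (illustrative combination of the documented flags):

/bot run --stage-list "A10-1" --disable-fail-fast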

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since a lack of user care and validation can cause the top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since a lack of user care and validation can cause the top of tree to break.

@DomBrown DomBrown requested a review from nekorobov June 3, 2025 16:04
@DomBrown DomBrown self-assigned this Jun 3, 2025
@DomBrown DomBrown requested a review from a team as a code owner June 3, 2025 16:04
@DomBrown DomBrown requested review from HuiGao-NV and lucaslie June 3, 2025 16:04
@DomBrown DomBrown force-pushed the dev/autotune_bmm_poc branch 3 times, most recently from 3326526 to f9cbe90 Compare June 4, 2025 14:07

DomBrown commented Jun 4, 2025

All fixed. Might wait on CI as it seems like some jobs might get killed to prioritise others


DomBrown commented Jun 4, 2025

/bot run

@tensorrt-cicd

PR_Github #7541 [ run ] triggered by Bot

@tensorrt-cicd

PR_Github #7541 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #5471 completed with status: 'FAILURE'


DomBrown commented Jun 4, 2025

/bot run

@tensorrt-cicd

PR_Github #7553 [ run ] triggered by Bot

@tensorrt-cicd

PR_Github #7553 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #5481 completed with status: 'FAILURE'

@DomBrown DomBrown force-pushed the dev/autotune_bmm_poc branch from 56d723d to ecace65 Compare June 5, 2025 08:03

DomBrown commented Jun 5, 2025

/bot run

@tensorrt-cicd

PR_Github #7666 [ run ] triggered by Bot


DomBrown commented Jun 5, 2025

/bot kill

@DomBrown DomBrown force-pushed the dev/autotune_bmm_poc branch from ecace65 to 41648d0 Compare June 5, 2025 10:04
@tensorrt-cicd

PR_Github #7702 [ kill ] triggered by Bot

@tensorrt-cicd

PR_Github #7666 [ run ] completed with state ABORTED

@tensorrt-cicd

PR_Github #7702 [ kill ] completed with state SUCCESS
Successfully killed previous jobs for commit 41648d0


DomBrown commented Jun 5, 2025

/bot run

@tensorrt-cicd

PR_Github #7713 [ run ] triggered by Bot

@tensorrt-cicd

PR_Github #7713 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #5593 completed with status: 'FAILURE'

@DomBrown DomBrown force-pushed the dev/autotune_bmm_poc branch from 41648d0 to 30d43f1 Compare June 5, 2025 13:45

DomBrown commented Jun 5, 2025

/bot run

@tensorrt-cicd

PR_Github #7752 [ run ] triggered by Bot

@tensorrt-cicd

PR_Github #7752 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #5616 completed with status: 'FAILURE'

DomBrown added 2 commits June 6, 2025 12:10
…ch workflow kernel autotuner

WIP, does not compile yet

Signed-off-by: Dom Brown <[email protected]>

Fix compile with a slight refactor

Signed-off-by: Dom Brown <[email protected]>

WIP

Signed-off-by: Dom Brown <[email protected]>

WIP

Signed-off-by: Dom Brown <[email protected]>

Further WIP

Signed-off-by: Dom Brown <[email protected]>

2/5 tests passing

Signed-off-by: Dom Brown <[email protected]>

Works when tuning is disabled.

Signed-off-by: Dom Brown <[email protected]>

Fix tests by specifying correct constraints when use_deepseek_fp8 is true

Signed-off-by: Dom Brown <[email protected]>

Fix autotuner typo

Signed-off-by: Dom Brown <[email protected]>

Clean up test

Signed-off-by: Dom Brown <[email protected]>

Small cleanup

Signed-off-by: Dom Brown <[email protected]>

Adjust to support None tensor inputs.

Signed-off-by: Dom Brown <[email protected]>

Small python cleanup

Signed-off-by: Dom Brown <[email protected]>

Fix type hints

Signed-off-by: Dom Brown <[email protected]>

refactor plus remove old OP

Signed-off-by: Dom Brown <[email protected]>

Comment

Signed-off-by: Dom Brown <[email protected]>

Address reviewer comments

Signed-off-by: Dom Brown <[email protected]>
@DomBrown DomBrown force-pushed the dev/autotune_bmm_poc branch from 30d43f1 to a2dab4c Compare June 6, 2025 11:10

DomBrown commented Jun 6, 2025

/bot run

@tensorrt-cicd

PR_Github #7902 [ run ] triggered by Bot

@tensorrt-cicd

PR_Github #7902 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #5710 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

@HuiGao-NV

LGTM

@DomBrown DomBrown merged commit 9c012d5 into NVIDIA:main Jun 9, 2025
3 checks passed
@DomBrown DomBrown deleted the dev/autotune_bmm_poc branch June 9, 2025 12:30
hyukn added a commit to hyukn/TensorRT-LLM that referenced this pull request Jul 7, 2025
…r kernel configs.

The motivation for this PR is NVIDIA#4872, in which AutoTuner is applied to the FP8 batched GEMM op with tile_size and epilog_tile_m in the argument list. There are two ways to expose such configs to the tuner (sketched after this message):
* Encode the different configs into a list of numeric tactic IDs starting from 0. This will be implemented inside the kernels and used through get_valid_tactics.
* Define each part of the config separately and let AutoTuner iterate over the combinations. This is more readable and flexible: users can use each part of the config directly, and there is no encoding/decoding step.

Add a config entry in the tuning config to define the valid candidates for each part of the config.
* AutoTuner will loop over a search grid generated from the config combinations.
* Each config will be tuned along with the specific input profile.
* The best config will be recorded in the cache value (instead of the cache key), and it will be recovered and used in the tunable runner's forward.

Other enhancements:
* Use the decorator to make the tuning config definition more natural and efficient. This is an independent enhancement.
* Allow the user to not specify gen_tuning_buckets or the map_to_tuning_buckets function.
* Code refactoring.

Signed-off-by: Yukun He <[email protected]>
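A schematic picture of the two options above, using the tile_size and epilog_tile_m names from that message; the candidate values and variable names below are made up for illustration and are not the actual kernel configs.

```python
# Illustration only: how an explicit config search grid relates to flat numeric
# tactic IDs. Candidate values are hypothetical.
import itertools

tile_size_candidates = [8, 16, 32]
epilog_tile_m_candidates = [64, 128]

# Option 1: flatten every combination into a numeric tactic ID that the kernel
# side has to decode again (what get_valid_tactics would enumerate).
combos = list(itertools.product(tile_size_candidates, epilog_tile_m_candidates))
tactic_ids = list(range(len(combos)))

# Option 2: keep the dimensions explicit and let the tuner walk the grid itself,
# so the winning (tile_size, epilog_tile_m) pair can be cached and reused directly.
for tile_size, epilog_tile_m in combos:
    pass  # profile the kernel with this combination and record the fastest
```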