Conversation


@DomBrown DomBrown commented Jun 3, 2025

Description

Extend the TRT-LLM Gen FP8 BMM torch operator to integrate with the PyTorch workflow kernel autotuner, so that the best kernel config can be selected by profiling for a given kernel runner config and set of matrix shapes.
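As a rough illustration of what the autotuner integration does, the sketch below times a set of candidate kernel configs ("tactics") for one set of input shapes and keeps the fastest. The helper name and the `run(tactic)` callable are illustrative assumptions, not the TRT-LLM autotuner's actual interfaces.

```python
# Minimal sketch of kernel-config autotuning, assuming `run(tactic)` launches
# the FP8 BMM kernel with a given config; not the real TRT-LLM implementation.
import torch

def pick_best_tactic(run, tactics, warmup=3, iters=10):
    """Time each candidate kernel config and return the fastest one."""
    best_tactic, best_ms = None, float("inf")
    for tactic in tactics:
        for _ in range(warmup):
            run(tactic)  # warm-up runs are not timed
        torch.cuda.synchronize()
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(iters):
            run(tactic)
        end.record()
        torch.cuda.synchronize()
        ms = start.elapsed_time(end) / iters
        if ms < best_ms:
            best_tactic, best_ms = tactic, ms
    return best_tactic
```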

Slightly modifies the kernel tuner to handle None tensors, as some TRT-LLM Gen kernels have input tensors that are optional based on the kernel configuration.
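The None handling can be pictured roughly as follows: when the tuner prepares dummy profiling buffers from the example inputs, entries that arrive as None are passed through untouched instead of being materialized. This is only a sketch of the behaviour described above, not the tuner's actual code.

```python
# Sketch only: build profiling buffers from example inputs while tolerating
# None entries for tensors that are optional for the chosen kernel config.
from typing import List, Optional
import torch

def make_profiling_inputs(
        example_inputs: List[Optional[torch.Tensor]]) -> List[Optional[torch.Tensor]]:
    buffers: List[Optional[torch.Tensor]] = []
    for t in example_inputs:
        if t is None:
            buffers.append(None)  # optional input not used by this configuration
        else:
            buffers.append(torch.zeros(t.shape, dtype=t.dtype, device=t.device))
    return buffers
```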

Test Coverage

tests/unittest/_torch/thop/test_tllmg_bmm.py

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--disable-fail-fast --skip-test --stage-list "A10-1, xxx" --gpu-type "A30, H100_PCIe" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-[Post-Merge]-1, xxx"]

Launch build/test pipelines. All previously running jobs will be killed.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests. Will also run L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-[Post-Merge]-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-[Post-Merge]-1, xxx".
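For example, to run just one test stage with fail-fast disabled, the options above can be combined in a single comment (illustrative combination of the documented flags):

/bot run --stage-list "A10-1" --disable-fail-fast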

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since a lack of user care and validation can cause the top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since a lack of user care and validation can cause the top of tree to break.

@DomBrown DomBrown requested a review from nekorobov June 3, 2025 16:04
@DomBrown DomBrown self-assigned this Jun 3, 2025
@DomBrown DomBrown requested a review from a team as a code owner June 3, 2025 16:04
@DomBrown DomBrown requested review from HuiGao-NV and lucaslie June 3, 2025 16:04
@DomBrown DomBrown force-pushed the dev/autotune_bmm_poc branch 3 times, most recently from 3326526 to f9cbe90 Compare June 4, 2025 14:07

DomBrown commented Jun 4, 2025

All fixed. Might wait on CI as it seems like some jobs might get killed to prioritise others


DomBrown commented Jun 4, 2025

/bot run

@tensorrt-cicd

PR_Github #7541 [ run ] triggered by Bot

@tensorrt-cicd

PR_Github #7541 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #5471 completed with status: 'FAILURE'


DomBrown commented Jun 4, 2025

/bot run

@tensorrt-cicd

PR_Github #7553 [ run ] triggered by Bot

@tensorrt-cicd

PR_Github #7553 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #5481 completed with status: 'FAILURE'

@DomBrown DomBrown force-pushed the dev/autotune_bmm_poc branch from 56d723d to ecace65 Compare June 5, 2025 08:03

DomBrown commented Jun 5, 2025

/bot run

@tensorrt-cicd

PR_Github #7666 [ run ] triggered by Bot


DomBrown commented Jun 5, 2025

/bot kill

@DomBrown DomBrown force-pushed the dev/autotune_bmm_poc branch from ecace65 to 41648d0 Compare June 5, 2025 10:04
@tensorrt-cicd

PR_Github #7702 [ kill ] triggered by Bot

@tensorrt-cicd

PR_Github #7666 [ run ] completed with state ABORTED

@tensorrt-cicd

PR_Github #7702 [ kill ] completed with state SUCCESS
Successfully killed previous jobs for commit 41648d0


DomBrown commented Jun 5, 2025

/bot run

@tensorrt-cicd

PR_Github #7713 [ run ] triggered by Bot

@tensorrt-cicd

PR_Github #7713 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #5593 completed with status: 'FAILURE'

@DomBrown DomBrown force-pushed the dev/autotune_bmm_poc branch from 41648d0 to 30d43f1 Compare June 5, 2025 13:45

DomBrown commented Jun 5, 2025

/bot run

@tensorrt-cicd

PR_Github #7752 [ run ] triggered by Bot

@tensorrt-cicd

PR_Github #7752 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #5616 completed with status: 'FAILURE'

DomBrown added 2 commits June 6, 2025 12:10
…ch workflow kernel autotuner

WIP, does not compile yet

Signed-off-by: Dom Brown <[email protected]>

Fix compile with a slight refactor

Signed-off-by: Dom Brown <[email protected]>

WIP

Signed-off-by: Dom Brown <[email protected]>

WIP

Signed-off-by: Dom Brown <[email protected]>

Further WIP

Signed-off-by: Dom Brown <[email protected]>

2/5 tests passing

Signed-off-by: Dom Brown <[email protected]>

Works when tuning is disabled.

Signed-off-by: Dom Brown <[email protected]>

Fix tests by specifying correct constraints when use_deepseek_fp8 is true

Signed-off-by: Dom Brown <[email protected]>

Fix autotuner typo

Signed-off-by: Dom Brown <[email protected]>

Clean up test

Signed-off-by: Dom Brown <[email protected]>

Small cleanup

Signed-off-by: Dom Brown <[email protected]>

Adjust to support None tensor inputs.

Signed-off-by: Dom Brown <[email protected]>

Small python cleanup

Signed-off-by: Dom Brown <[email protected]>

Fix type hints

Signed-off-by: Dom Brown <[email protected]>

refactor plus remove old OP

Signed-off-by: Dom Brown <[email protected]>

Comment

Signed-off-by: Dom Brown <[email protected]>

Address reviewer comments

Signed-off-by: Dom Brown <[email protected]>
@DomBrown DomBrown force-pushed the dev/autotune_bmm_poc branch from 30d43f1 to a2dab4c Compare June 6, 2025 11:10

DomBrown commented Jun 6, 2025

/bot run

@tensorrt-cicd

PR_Github #7902 [ run ] triggered by Bot

@tensorrt-cicd

PR_Github #7902 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #5710 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

@HuiGao-NV

LGTM

@DomBrown DomBrown merged commit 9c012d5 into NVIDIA:main Jun 9, 2025
3 checks passed
@DomBrown DomBrown deleted the dev/autotune_bmm_poc branch June 9, 2025 12:30
hyukn added a commit to hyukn/TensorRT-LLM that referenced this pull request Jul 7, 2025
…r kernel configs.

The motivation for this PR is NVIDIA#4872, in which AutoTuner is applied to the FP8 batched GEMM op with tile_size and epilog_tile_m in the argument list. There are two ways to expose such configs to the tuner (sketched after this message):
* Encode the different configs into a list of numeric tactic IDs starting from 0. This will be implemented inside the kernels and used through get_valid_tactics.
* Define each part of the config separately and let AutoTuner iterate over the combinations. This is more readable and flexible: users can use each part of the config directly, and there is no encoding/decoding step.

Add a config entry in the tuning config to define the valid candidates for each part of the config.
* AutoTuner will loop over a search grid generated from the config combinations.
* Each config will be tuned along with the specific input profile.
* The best config will be recorded in the cache value (instead of the cache key), and it will be recovered and used in the tunable runner's forward.

Other enhancements:
* Use the decorator to make the tuning config definition more natural and efficient. This is an independent enhancement.
* Allow the user to not specify gen_tuning_buckets or the map_to_tuning_buckets function.
* Code refactoring.

Signed-off-by: Yukun He <[email protected]>
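A schematic picture of the two options above, using the tile_size and epilog_tile_m names from that message; the candidate values and variable names below are made up for illustration and are not the actual kernel configs.

```python
# Illustration only: how an explicit config search grid relates to flat numeric
# tactic IDs. Candidate values are hypothetical.
import itertools

tile_size_candidates = [8, 16, 32]
epilog_tile_m_candidates = [64, 128]

# Option 1: flatten every combination into a numeric tactic ID that the kernel
# side has to decode again (what get_valid_tactics would enumerate).
combos = list(itertools.product(tile_size_candidates, epilog_tile_m_candidates))
tactic_ids = list(range(len(combos)))

# Option 2: keep the dimensions explicit and let the tuner walk the grid itself,
# so the winning (tile_size, epilog_tile_m) pair can be cached and reused directly.
for tile_size, epilog_tile_m in combos:
    pass  # profile the kernel with this combination and record the fastest
```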