[TRTLLM-5589] feat: Integrate TRT-LLM Gen FP8 Batched GEMM with Pytorch workflow kernel autotuner #4872
Conversation
Force-pushed from 3326526 to f9cbe90.
All fixed. Might wait on CI as it seems like some jobs might get killed to prioritise others
/bot run
PR_Github #7541 [ run ] triggered by Bot
PR_Github #7541 [ run ] completed with state
/bot run
PR_Github #7553 [ run ] triggered by Bot
PR_Github #7553 [ run ] completed with state
Force-pushed from 56d723d to ecace65.
/bot run
PR_Github #7666 [ run ] triggered by Bot
/bot kill
Force-pushed from ecace65 to 41648d0.
PR_Github #7702 [ kill ] triggered by Bot
PR_Github #7666 [ run ] completed with state
PR_Github #7702 [ kill ] completed with state
/bot run
PR_Github #7713 [ run ] triggered by Bot
PR_Github #7713 [ run ] completed with state
Force-pushed from 41648d0 to 30d43f1.
/bot run
PR_Github #7752 [ run ] triggered by Bot
PR_Github #7752 [ run ] completed with state
…ch workflow kernel autotuner

- WIP, does not compile yet
- Fix compile with a slight refactor
- WIP
- WIP
- Further WIP
- 2/5 tests passing
- Works when tuning is disabled.
- Fix tests by specifying correct constraints when use_deepseek_fp8 is true
- Fix autotuner typo
- Clean up test
- Small cleanup
- Adjust to support None tensor inputs.
- Small python cleanup
- Fix type hints
- Refactor plus remove old OP
- Comment
- Address reviewer comments

Each step signed off by: Dom Brown <[email protected]>
Force-pushed from 30d43f1 to a2dab4c.
/bot run
PR_Github #7902 [ run ] triggered by Bot
PR_Github #7902 [ run ] completed with state
LGTM
…r kernel configs. The motivation for this PR is NVIDIA#4872, in which AutoTuner is applied to the FP8 batched GEMM op with tile_size and epilog_tile_m in the argument list. Two ways of exposing the kernel configs are compared:

* Encoding the different configs into a list of numeric tactic IDs starting from 0. This would be implemented inside the kernels and used through get_valid_tactics.
* Defining each part of the config separately and letting AutoTuner iterate over the combinations. This is more readable and flexible: users can use each part of the config directly, and there is no encoding-decoding process. A config entry is added to the tuning config to define the valid candidates for each part of the config.

With the second approach:

* AutoTuner loops over a search grid generated from the config combinations.
* Each config is tuned along with the specific input profile.
* The best config is recorded in the cache value (instead of the cache key), and it is recovered and used in the tunable runner forward.

Other enhancements:

* Use the decorator to make the tuning config definition more natural and efficient. This is an independent enhancement.
* Allow the user to not specify the gen_tuning_buckets or the map_to_tuning_buckets function.
* Code refactoring.

Signed-off-by: Yukun He <[email protected]>
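A minimal sketch of the config-combination approach described above, assuming a simplified tuner interface (the function names, cache layout, and candidate values below are illustrative, not the actual TensorRT-LLM AutoTuner API):

import itertools

# Illustrative candidate values for each part of the config; in a real
# tuning config these would come from the kernel's supported configurations.
TILE_SIZES = (8, 16, 32)
EPILOGUE_TILE_M = (64, 128)

def search_grid():
    # The tuner loops over the Cartesian product of the config parts instead
    # of requiring the kernel to encode every combination into one tactic ID.
    return list(itertools.product(TILE_SIZES, EPILOGUE_TILE_M))

def tune(profile_key, benchmark, cache):
    # benchmark(tile_size, epilogue_tile_m) -> measured latency (caller-supplied).
    # The best config is stored in the cache *value*, keyed by the input
    # profile, so the runner's forward pass can recover it later.
    best = min(search_grid(), key=lambda cfg: benchmark(*cfg))
    cache[profile_key] = best
    return best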
Description
Extends the TRT-LLM Gen FP8 BMM torch operator to integrate with the PyTorch workflow kernel autotuner, so that the best kernel configuration can be profiled for a given kernel runner config and set of matrix shapes.
Slightly modifies the kernel tuner to handle None tensors, since some TRT-LLM Gen kernels have input tensors that are optional depending on the kernel configuration.
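A rough sketch of the calling pattern this enables; the wrapper name, argument list, and tuner interface below are assumptions for illustration, not the actual operator signature:

import torch

def fp8_batched_gemm(a, b, scale_a, scale_b, bias=None, tuner=None):
    # Hypothetical wrapper around the TRT-LLM Gen FP8 batched GEMM op.
    # 'bias' is optional: some kernel configurations take no bias tensor,
    # so the tuner has to tolerate None entries in its input list.
    inputs = [a, b, scale_a, scale_b, bias]   # None entries are allowed
    if tuner is not None:
        config = tuner.choose(inputs)         # profiled once per input profile, then cached
    else:
        config = None                         # fall back to a default kernel config
    # Placeholder math standing in for the tuned kernel dispatch:
    return torch.bmm(a.to(torch.float32) * scale_a,
                     b.to(torch.float32) * scale_b)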
Test Coverage
tests/unittest/_torch/thop/test_tllmg_bmm.py
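The test can be run directly with pytest, for example: pytest tests/unittest/_torch/thop/test_tllmg_bmm.py (this assumes a machine with a supported GPU and a build that includes the TRT-LLM Gen kernels).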
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user-friendly way for developers to interact with a Jenkins server.
Run
/bot [-h|--help]
to print this help message. See details below for each supported subcommand.
run [--disable-fail-fast --skip-test --stage-list "A10-1, xxx" --gpu-type "A30, H100_PCIe" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-[Post-Merge]-1, xxx"]
Launch build/test pipelines. All previously running jobs will be killed.
--disable-fail-fast
(OPTIONAL) : Disable fail fast on build/tests/infra failures.
--skip-test
(OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
--stage-list "A10-1, xxx"
(OPTIONAL) : Only run the specified test stages. Examples: "A10-1, xxx". Note: Does NOT update GitHub check status.
--gpu-type "A30, H100_PCIe"
(OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
--only-multi-gpu-test
(OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
--disable-multi-gpu-test
(OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
--add-multi-gpu-test
(OPTIONAL) : Force run the multi-GPU tests. Will also run L0 pre-merge pipeline.
--post-merge
(OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
--extra-stage "H100_PCIe-[Post-Merge]-1, xxx"
(OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-[Post-Merge]-1, xxx".
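A typical invocation that restricts the run to specific stages and disables fail-fast (stage name taken from the examples above) might look like:
/bot run --stage-list "A10-1" --disable-fail-fast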
kill
Kill all running builds associated with the pull request.
skip
skip --comment COMMENT
Skip testing for the latest commit on the pull request.
--comment "Reason for skipping build/test"
is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.