refactoring: port customized kernels with public cutlass version #5027
Conversation
/bot run
PR_Github #8271 [ run ] triggered by Bot
Squashed commit history (each commit Signed-off-by: yunruis <[email protected]>):
- … kernels
- moe_gemm passed
- fix license bug
- waive debug mode
- fix debug mode compile bug
- open source GEMM+AR kernels, contains blackwell fixes, support all reduce_gemm cutlass kernel
- fix bug
- fix credential symbol
- drop credential symbol
- add debug info and test ok
- fix loraparams namespace bug
- fix rebase bug
- fix moe gemm bug on sm90
- fix low_latency_gemm internal error
- …upport FP8xMXFP4. And add open-sourced moe_gemm micro-benchmark and unittest
PR_Github #8271 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #8280 [ run ] triggered by Bot
/bot run --disable-fail-fast
PR_Github #8282 [ run ] triggered by Bot
PR_Github #8280 [ run ] completed with state
Thanks for the hard work, Yunrui.
Let's merge this PR ASAP to unblock the dependency, and keep refining it in subsequent PRs.
/bot run --disable-fail-fast
PR_Github #8317 [ run ] triggered by Bot
PR_Github #8282 [ run ] completed with state
PR_Github #8317 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #8324 [ run ] triggered by Bot
/bot kill
PR_Github #8324 [ run ] completed with state
/bot run
PR_Github #8594 [ run ] triggered by Bot
PR_Github #8594 [ run ] completed with state
/bot help
GitHub Bot Help
Provides a user-friendly way for developers to interact with a Jenkins server. See details below for each supported subcommand.
run
Launch build/test pipelines. All previously running jobs will be killed.
kill
Kill all running builds associated with the pull request.
skip
Skip testing for the latest commit on the pull request.
reuse-pipeline
Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can cause the top of tree to break.
/bot run --stage-list "DGX_H100-4_GPUs-PyTorch-Others-1"
PR_Github #8730 [ run ] triggered by Bot
PR_Github #8730 [ run ] completed with state
/bot skip --comment "PR_Github #8730 and PR_Github #8594 run a full pre-merge CI Pipeline with multi GPU test"
PR_Github #8768 [ skip ] triggered by Bot
PR_Github #8768 [ skip ] completed with state
pytest -s tests/unittest/_torch/modules/test_fused_moe.py

# refactoring: port customized kernels with public cutlass version
This PR open-sources several internal Cutlass kernels. Meanwhile, to ensure stability and preserve the most optimized performance path, the previous method of calling these kernels via static libraries is retained as an alternative.
This document introduces how to use the newly open-sourced Cutlass kernels, and how to switch back to the internal Cutlass kernels provided via static libraries.
Compilation
The open-sourced Cutlass kernels are `low_latency_gemm`, `moe_gemm`, `fp4_gemm`, and `allreduce_gemm`. The switch between the open-sourced Cutlass kernels and the static-library Cutlass kernels is controlled by the `USING_OSS_CUTLASS_*` macros, giving kernel-level control. By default, the open-source Cutlass kernels are used. For example:
python3 ./scripts/build_wheel.py --skip_building_wheel --linking_install_binary --use_ccache --cuda_architectures "90-real;100-real" --python_bindings --install --micro_benchmarks
This builds with the open-sourced Cutlass kernels.
If users prefer the internal Cutlass kernels from the static library, they can control this at compile time by setting the corresponding `USING_OSS_CUTLASS_*` macro to OFF. For instance, to use the static-library implementation for `low_latency_gemm` and `fused_moe_gemm`, the following compilation command can be used:
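The command itself is not shown above. A plausible form is sketched below, under the assumption that `build_wheel.py` can forward extra CMake definitions; the `--extra-cmake-vars` flag and the exact macro names `USING_OSS_CUTLASS_LOW_LATENCY_GEMM` and `USING_OSS_CUTLASS_MOE_GEMM` are assumptions following the `USING_OSS_CUTLASS_*` pattern, not verified against the build scripts:

```shell
# Hypothetical sketch: disable the open-source paths for low_latency_gemm and
# fused_moe_gemm so the internal static-library kernels are used instead.
# Flag and macro names below are assumptions, not confirmed by this PR.
python3 ./scripts/build_wheel.py --skip_building_wheel --linking_install_binary \
  --use_ccache --cuda_architectures "90-real;100-real" \
  --python_bindings --install --micro_benchmarks \
  --extra-cmake-vars "USING_OSS_CUTLASS_LOW_LATENCY_GEMM=OFF;USING_OSS_CUTLASS_MOE_GEMM=OFF"
```

Check the actual option names exposed by `scripts/build_wheel.py` and the project's CMakeLists before relying on this invocation.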