dynamic dispatch of fp8 kernels #14245
Conversation
This mostly affects ROCm, where the hardware may support one or the other FP8 type. All fp8 kernels are now templated on fp8_type instead of assuming a single fp8 type via `using FP8_TYPE = `. CUDA is largely unaffected: its fp8 kernels only instantiate the OCP type, so there is no binary bloat. For ROCm, two kernel templates are instantiated, one for each fp8 type.

- FP8_TYPE is removed.
- All fp8 kernels take an additional fp8_type template parameter.
- FP8_E4M3_MAX is replaced with the templated class FP8_E4M3_ADJUSTED_MAX.
- is_fp8_ocp(): new C++ function for host runtime query of the preferred fp8 type.
- fp8 kernel launches for ROCm get both OCP and FNUZ templated variants.
- Add Python APIs current_platform.supports_fp8() and is_fp8_fnuz().

Signed-off-by: Jeff Daily <[email protected]>
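To make the dispatch pattern concrete, here is a minimal sketch of what the description above amounts to. It is illustrative only: the kernel name, launch helper, and scaling details are simplified stand-ins, and only `is_fp8_ocp()` and the two c10 fp8 types are taken from the PR description itself.

```cpp
#include <cuda_runtime.h>
#include <c10/util/Float8_e4m3fn.h>    // OCP e4m3
#include <c10/util/Float8_e4m3fnuz.h>  // FNUZ e4m3 (native on some ROCm GPUs)

// Kernels are templated on the fp8 storage type instead of relying on a
// single global `using FP8_TYPE = ...` alias. (Scaling/clamping simplified.)
template <typename scalar_t, typename fp8_type>
__global__ void scaled_fp8_quant_kernel(fp8_type* out, const scalar_t* in,
                                        const float* scale, int64_t n) {
  int64_t i = blockIdx.x * (int64_t)blockDim.x + threadIdx.x;
  if (i < n) {
    out[i] = static_cast<fp8_type>(static_cast<float>(in[i]) / *scale);
  }
}

bool is_fp8_ocp();  // host-side runtime query of the preferred fp8 type (added by this PR)

// Host-side dispatch: ROCm builds instantiate both variants and pick one at
// runtime; CUDA builds only ever instantiate the OCP variant.
template <typename scalar_t>
void launch_scaled_fp8_quant(void* out, const scalar_t* in, const float* scale,
                             int64_t n, cudaStream_t stream) {
  dim3 block(256);
  dim3 grid(static_cast<unsigned int>((n + block.x - 1) / block.x));
#ifdef USE_ROCM
  if (is_fp8_ocp()) {
    scaled_fp8_quant_kernel<scalar_t, c10::Float8_e4m3fn><<<grid, block, 0, stream>>>(
        static_cast<c10::Float8_e4m3fn*>(out), in, scale, n);
  } else {
    scaled_fp8_quant_kernel<scalar_t, c10::Float8_e4m3fnuz><<<grid, block, 0, stream>>>(
        static_cast<c10::Float8_e4m3fnuz*>(out), in, scale, n);
  }
#else
  scaled_fp8_quant_kernel<scalar_t, c10::Float8_e4m3fn><<<grid, block, 0, stream>>>(
      static_cast<c10::Float8_e4m3fn*>(out), in, scale, n);
#endif
}
```

The new Python APIs `current_platform.supports_fp8()` and `is_fp8_fnuz()` play the analogous role on the Python side.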
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Jeff Daily <[email protected]>
Signed-off-by: Jeff Daily <[email protected]>
ProExpertProg
left a comment
Nice PR! Thanks for removing TODOs and cleaning up a variety of nested if statements etc.
I left a few comments for further improving the code quality - this is a pretty messy part of our codebase and we should really keep it from getting more convoluted - thanks for doing your part already!
Signed-off-by: Jeff Daily <[email protected]>
- FP8_E4M3_ADJUSTED_MAX -> fp8_e4m3_adjusted_max_v
- struct fp8_e4m3_adjusted_max
- ScaledQuant: collapse duplicate fp8 definition

Signed-off-by: Jeff Daily <[email protected]>
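As a rough illustration of the renaming in this commit (the numeric values are the standard finite maxima for each e4m3 flavor; the exact member names in the PR may differ):

```cpp
#include <c10/util/Float8_e4m3fn.h>
#include <c10/util/Float8_e4m3fnuz.h>

// Primary template intentionally left undefined so that unsupported types
// fail to compile rather than silently picking a wrong maximum.
template <typename T>
struct fp8_e4m3_adjusted_max;

template <>
struct fp8_e4m3_adjusted_max<c10::Float8_e4m3fn> {
  static constexpr float value = 448.0f;  // finite max of OCP e4m3
};

template <>
struct fp8_e4m3_adjusted_max<c10::Float8_e4m3fnuz> {
  static constexpr float value = 240.0f;  // finite max of FNUZ e4m3
};

// Variable-template shorthand replacing the old FP8_E4M3_MAX constant,
// following the standard library's *_v naming convention.
template <typename T>
constexpr float fp8_e4m3_adjusted_max_v = fp8_e4m3_adjusted_max<T>::value;
```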
- added template-specialized utility function fp8::cvt_c10
- reverted update-by-reference back to returning a value

Signed-off-by: Jeff Daily <[email protected]>
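And a sketch of the cvt_c10 helper named above, under the assumption that it simply converts a float to the requested c10 fp8 type and returns it by value (per the second bullet):

```cpp
#include <c10/util/Float8_e4m3fn.h>
#include <c10/util/Float8_e4m3fnuz.h>

namespace fp8 {

// Template-specialized conversion utility; returns the converted value
// instead of writing through a reference parameter.
template <typename fp8_type>
__device__ __forceinline__ fp8_type cvt_c10(float x);

template <>
__device__ __forceinline__ c10::Float8_e4m3fn cvt_c10<c10::Float8_e4m3fn>(float x) {
  return static_cast<c10::Float8_e4m3fn>(x);
}

template <>
__device__ __forceinline__ c10::Float8_e4m3fnuz cvt_c10<c10::Float8_e4m3fnuz>(float x) {
  return static_cast<c10::Float8_e4m3fnuz>(x);
}

}  // namespace fp8
```

Inside a kernel this reads as `out[i] = fp8::cvt_c10<fp8_type>(scaled);`, keeping the fp8 flavor a pure template parameter.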
Signed-off-by: Jeff Daily <[email protected]>
@ProExpertProg I really appreciate your thoughtful review. It greatly improved the quality.
ProExpertProg
left a comment
Thanks for addressing the comments! It looks great now.
Added a couple more nits, and I think it would be nice to add a comment on the C++ side somewhere that explains there are two FP8 types on ROCm (not sure where the best place for that is). Otherwise LGTM!
Signed-off-by: Jeff Daily <[email protected]>
@ProExpertProg Getting some CUDA build failures: looks like fp8_e4m3_adjusted_max_v is the problem, at least for nvcc. Any suggestions?
Maybe add
try to fix: fp8_e4m3_adjusted_max_v is undefined in device code on CUDA
Signed-off-by: Jeff Daily <[email protected]>
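For context on this failure mode: nvcc does not always make a namespace-scope constexpr variable template visible to device code, which matches the "undefined in device code" error named in the commit above. One common workaround, sketched below, is to route device-side uses through a host+device constexpr function; this is an assumption about the shape of the fix, not necessarily what the PR ultimately landed, and the helper name fp8_e4m3_adjusted_max_f is made up for the example.

```cpp
#include <c10/macros/Macros.h>       // C10_HOST_DEVICE -> __host__ __device__ under nvcc/hipcc
#include <c10/util/Float8_e4m3fn.h>

// Reuse the struct-style definition from the earlier commit sketch.
template <typename T>
struct fp8_e4m3_adjusted_max;

template <>
struct fp8_e4m3_adjusted_max<c10::Float8_e4m3fn> {
  static constexpr float value = 448.0f;
};

// Hypothetical helper: a C10_HOST_DEVICE constexpr function is callable from
// both host and device code, unlike the bare constexpr variable template.
template <typename T>
C10_HOST_DEVICE constexpr float fp8_e4m3_adjusted_max_f() {
  return fp8_e4m3_adjusted_max<T>::value;
}
```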
@jeffdaily it looks like you missed an import in Could you get that fixed, and we can enable the full CI?
Signed-off-by: Jeff Daily <[email protected]>
ProExpertProg
left a comment
I think this should be good to go!
robertgshaw2-redhat
left a comment
This is an excellent PR
@jeffdaily I am planning to look into it more, but do you know if this is a ROCm version issue? This PR is breaking the build on
Looks like a ROCm version issue.
Could you help with the fix? I assume there's an alternative for ROCm 6.2? It's not the CI, it's a bug in a local build.
Working on a fix now. Can we forward-fix with a new PR or do you need to revert this one?
I think forward-fixing is fine; the revert wouldn't be immediate either anyway. Unless you think the fix will be complicated?
Fix shouldn't be complicated. Will need to isolate the use of
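The symbol that has to be isolated is elided above, but the general shape of such a forward fix is a ROCm-version guard around toolchain-dependent fp8 definitions. The sketch below is an assumption for illustration: the macro name VLLM_HIP_HAS_OCP_FP8 and the 6.3 threshold are made up, and the guarded symbol may differ from what the actual fix touched.

```cpp
#include <hip/hip_version.h>  // HIP_VERSION_MAJOR / HIP_VERSION_MINOR

// Assumed guard: only enable definitions that require a newer ROCm toolchain,
// so that older installs (e.g. ROCm 6.2, as reported above) keep building.
#if HIP_VERSION_MAJOR > 6 || (HIP_VERSION_MAJOR == 6 && HIP_VERSION_MINOR >= 3)
  #define VLLM_HIP_HAS_OCP_FP8 1
#else
  #define VLLM_HIP_HAS_OCP_FP8 0
#endif

#if VLLM_HIP_HAS_OCP_FP8
  // Newer ROCm: instantiate both the OCP and FNUZ kernel variants.
#else
  // Older ROCm: fall back to the FNUZ-only path.
#endif
```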
Okay, ping me on vLLM Slack once done so we can merge ASAP.
@ProExpertProg I'm not on Slack yet. Here's the PR fix.