Revert "[Performance] Move apply_w8a8_block_fp8_linear to an op class… #25607
Conversation
…vllm-project#24666)" This reverts commit 6340025. Signed-off-by: Tyler Michael Smith <[email protected]>
For posterity: this fails when capturing the first cudagraph with a token count that is not a multiple of 4 (in this case, 2). This commit made it so we started using the cutlass kernel instead of the Triton one, hence the introduction of padding.
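For illustration, kernels that require the token dimension to be aligned (here, to a multiple of 4) typically round the batch size up before launch. A minimal pure-Python sketch of that round-up; the helper name is hypothetical, not vLLM's actual code:

```python
def round_up_tokens(num_tokens: int, multiple: int = 4) -> int:
    """Round a token count up to the next multiple of `multiple`
    (e.g. a kernel's alignment requirement)."""
    return (num_tokens + multiple - 1) // multiple * multiple

# A cudagraph captured with 2 tokens would run the kernel over a padded count of 4.
print(round_up_tokens(2))  # -> 4
print(round_up_tokens(4))  # -> 4 (already aligned)
print(round_up_tokens(5))  # -> 8
```

The padding is the mismatch discussed in this thread: the Triton path did not need it, so switching to cutlass introduced it.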
@tlrmchlsmth I assume this is on Hopper, can you post repro instructions?
@LucasWilkinson are you saying we were dynamically dispatching to Triton based on
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Tyler Michael Smith <[email protected]>
EDIT: unrelated issue below. I extracted the issue into #25623; I got an IMA (illegal memory access) without using the fp8 block-quant path and confirmed it still happens even with this revert:
vllm-project#25607) Signed-off-by: Tyler Michael Smith <[email protected]>
#25607) Signed-off-by: Tyler Michael Smith <[email protected]> Signed-off-by: yewentao256 <[email protected]>
Revert #24666 due to #25623, reapplied in #25696.
This reverts commit 6340025.
That commit is causing an illegal memory access when torch.compile is used with decode DBO (dual-batch overlap):
Repro instructions:
I'm deploying vLLM using the llm-d WideEP well-lit-path
See the decoder manifest here:
https://github.com/llm-d/llm-d/blob/4970c7c2703dc23605719491c4fb380973b13517/guides/wide-ep-lws/manifests/modelserver/base/decode.yaml
In particular this is the vLLM launch command.
From investigations of @LucasWilkinson:
the weird part is that it is failing in triton_poi_fused.to_copy_add_constant_pad_nd_mean_mul_pow_rsqrt_2
and DBO isn't even running
the fishy thing is that torch.compile appears to be rounding the input up to 4?
Possibly related to vllm/vllm/model_executor/layers/quantization/utils/fp8_utils.py, lines 229 to 233 at e6750d0.
This is failing on a store:
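The failing snippet itself is not reproduced above. As a hypothetical illustration (not vLLM's actual code) of how this class of bug turns into an out-of-bounds store: if the iteration space is rounded up to a multiple of 4 but the output buffer is still sized for the original token count, the store loop runs past the end of the buffer:

```python
def padded_kernel_store(num_tokens: int, multiple: int = 4) -> list[float]:
    """Toy model of the suspected mismatch: iterate over the padded
    token count, but store into a buffer sized for the unpadded count."""
    padded = (num_tokens + multiple - 1) // multiple * multiple
    out = [0.0] * num_tokens        # buffer allocated for the real token count
    for i in range(padded):         # "kernel" launched over the padded count
        out[i] = 1.0                # stores past the end when padded > num_tokens
    return out

try:
    padded_kernel_store(2)          # padded count 4, buffer of size 2
except IndexError:
    print("out-of-bounds store (the Python analogue of an IMA)")
```

In the real kernel the same pattern would write past the allocation on the GPU, which surfaces as the illegal memory access reported in this thread.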