fix: fix cuda graph padding for spec decoding #4853
05bd143
to
3e2a4ca
Compare
/bot run |
PR_Github #7289 [ run ] triggered by Bot |
LGTM
PR_Github #7289 [ run ] completed with state |
/bot run |
PR_Github #7307 [ run ] triggered by Bot |
PR_Github #7307 [ run ] completed with state |
/bot run --disable-fail-fast |
PR_Github #7344 [ run ] triggered by Bot |
PR_Github #7344 [ run ] completed with state |
/bot run |
PR_Github #7413 [ run ] triggered by Bot |
PR_Github #7413 [ run ] completed with state |
3e2a4ca
to
34d8474
Compare
/bot run |
PR_Github #7441 [ run ] triggered by Bot |
PR_Github #7441 [ run ] completed with state |
/bot run |
PR_Github #7469 [ run ] triggered by Bot |
PR_Github #7469 [ run ] completed with state |
34d8474
to
ebf52ff
Compare
/bot run --disable-fail-fast |
PR_Github #7542 [ run ] triggered by Bot |
PR_Github #7542 [ run ] completed with state |
/bot kill |
PR_Github #7691 [ kill ] triggered by Bot |
PR_Github #7688 [ run ] completed with state |
PR_Github #7691 [ kill ] completed with state |
08fa91c
to
270c7c2
Compare
/bot run --disable-fail-fast |
PR_Github #7699 [ run ] triggered by Bot |
PR_Github #7699 [ run ] completed with state |
270c7c2
to
b7a6c8c
Compare
/bot run --disable-fail-fast |
PR_Github #7773 [ run ] triggered by Bot |
PR_Github #7773 [ run ] completed with state |
b7a6c8c
to
48e1bf2
Compare
/bot run --disable-fail-fast |
PR_Github #7831 [ run ] triggered by Bot |
PR_Github #7831 [ run ] completed with state |
Signed-off-by: Fanrong Li <[email protected]>
48e1bf2
to
e48a027
Compare
/bot run --disable-fail-fast |
PR_Github #7849 [ run ] triggered by Bot |
PR_Github #7849 [ run ] completed with state |
Signed-off-by: Fanrong Li <[email protected]>
Description
Root cause:
The code previously used `if request.py_batch_idx is None` to identify dummy requests when the overlap scheduler is enabled. However, after the CUDA Graph padding changes in "[TRTLLM-5516] perf: replicate dummy request for cuda graph padding" (#4729), the same dummy request is reused across all model forward passes, which causes `request.py_batch_idx` to be non-None. The check therefore no longer recognizes the dummy request, and this leads to an error.

In this PR:
Use `if request.is_dummy` to identify dummy requests.

Test Coverage
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_fp8_block_scales_cuda_graph_padding[mtp_nextn=2]
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_fp8_block_scales_cuda_graph_padding_4gpus[attention_dp=True-mtp_nextn=0]
accuracy/test_llm_api_pytorch.py::TestDeepSeekV3Lite::test_fp8_block_scales_cuda_graph_padding_4gpus[attention_dp=True-mtp_nextn=2]
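
The root cause described above can be illustrated with a small standalone sketch. This is not the TensorRT-LLM implementation: the `Request` class, `forward_pass`, and both helper functions are hypothetical stand-ins, and only the attribute names `py_batch_idx` and `is_dummy` are taken from the description. The sketch assumes a forward pass records each request's batch slot in `py_batch_idx`, which is what makes a reused dummy request look like a real one on later passes.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Request:
    # Hypothetical stand-in for the real request type; only the attribute
    # names py_batch_idx and is_dummy come from the PR description.
    request_id: int
    is_dummy: bool = False
    py_batch_idx: Optional[int] = None


def forward_pass(batch: List[Request]) -> None:
    """Toy forward pass (assumption): records each request's slot in the batch."""
    for idx, req in enumerate(batch):
        req.py_batch_idx = idx


def looks_dummy_old(req: Request) -> bool:
    # Old heuristic: a request with no recorded batch index is treated as a dummy.
    return req.py_batch_idx is None


def looks_dummy_new(req: Request) -> bool:
    # Check adopted in this PR: rely on the explicit flag.
    return req.is_dummy


# One dummy request that is reused for padding across passes.
dummy = Request(request_id=-1, is_dummy=True)
real = Request(request_id=0)

old_before = looks_dummy_old(dummy)   # no pass has run yet
forward_pass([real, dummy])           # first pass records a batch index
old_after = looks_dummy_old(dummy)    # heuristic now misclassifies the dummy
new_after = looks_dummy_new(dummy)    # explicit flag is unaffected by reuse

print(old_before, old_after, new_after)
```

In this toy model the old heuristic only holds while the dummy request is freshly created for each pass; once the same object is reused, the explicit flag is the only check that stays valid.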