[TRTLLM-5847][feat] Support n-gram speculative decoding with disagg #5732

raayandhar · 2025-07-03T21:00:22Z


Description

Currently, n-gram speculative decoding is not supported in the disaggregated (disagg) setting. In disagg, prepare_draft_tokens is called before _forward_step, so py_batch_idx is not yet set on the requests, and on the subsequent generation step the sort in prepare_draft_tokens fails. Sorting by request_id whenever py_batch_idx is None avoids this failure while maintaining the same ordering; on later iterations py_batch_idx is set and is used for the sort as before.
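A minimal sketch of the fallback described above, for illustration only. The `Request` dataclass and the `draft_sort_key` and `sort_for_drafting` helpers are hypothetical stand-ins and not the actual code in py_executor.py or ngram.py; only the field names request_id and py_batch_idx come from this PR. The per-request fallback shown here is one way to express the idea, under the assumption that py_batch_idx is either unset for the whole batch or set for the whole batch.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Request:
    # Hypothetical stand-in for the executor's request object.
    request_id: int
    py_batch_idx: Optional[int] = None  # unset until the first forward step


def draft_sort_key(req: Request) -> int:
    # Fall back to request_id while py_batch_idx has not been assigned yet.
    if req.py_batch_idx is None:
        return req.request_id
    return req.py_batch_idx


def sort_for_drafting(requests: List[Request]) -> List[Request]:
    # Hypothetical helper: returns a new list ordered for draft-token preparation.
    return sorted(requests, key=draft_sort_key)


# First iteration in disagg: py_batch_idx is still None, so request_id is used.
pending = [Request(7), Request(3), Request(5)]
first_order = [r.request_id for r in sort_for_drafting(pending)]

# Later iteration: py_batch_idx has been assigned and takes over as the key.
for idx, req in zip([2, 0, 1], pending):
    req.py_batch_idx = idx
later_order = [r.request_id for r in sort_for_drafting(pending)]
```

Sorting a list of two or more plain `None` keys raises a `TypeError` in Python 3, which is one way a sort keyed only on an unset py_batch_idx could fail; whether that is the exact failure seen here is an assumption.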

Previously, n-gram speculative decoding failed when is_keep_all=False because of a few small typos; these are fixed here. Additionally, MMLU on its own seems too weak as an integration test: with a bug present I got only ~28% on GSM8K, yet MMLU still passed. GSM8K seems to be the more robust check.

Test Coverage

Added disagg n-gram integration unit test.
Added disagg n-gram accuracy test (using GSM8K).
Changed agg n-gram accuracy test to also use GSM8K instead of just MMLU.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--disable-fail-fast --skip-test --stage-list "A10-1, xxx" --gpu-type "A30, H100_PCIe" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-[Post-Merge]-1, xxx"]

Launch build/test pipelines. All previously running jobs will be killed.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests. Will also run L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-[Post-Merge]-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-[Post-Merge]-1, xxx".

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md.

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

raayandhar · 2025-07-03T21:00:51Z

cc: @pcastonguay

raayandhar · 2025-07-03T21:40:16Z

/bot run

tensorrt-cicd · 2025-07-03T21:45:16Z

PR_Github #10871 [ run ] triggered by Bot

tensorrt-cicd · 2025-07-03T23:53:31Z

PR_Github #10871 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #8036 completed with status: 'FAILURE'

pcastonguay · 2025-07-04T16:49:50Z

/bot run --disable-fail-fast --add-multi-gpu-test

pcastonguay · 2025-07-04T16:53:41Z

@raayandhar could you rebase now that #5558 is merged? Thanks.

pcastonguay · 2025-07-04T16:54:42Z

@Tabrizian @SimengLiu-nv could you review the changes to py_executor.py and ngram.py respectively? Thank you

tensorrt_llm/_torch/pyexecutor/py_executor.py

tensorrt_llm/_torch/speculative/ngram.py

tests/integration/defs/accuracy/test_llm_api_pytorch.py

Signed-off-by: raayandhar <[email protected]>

tensorrt_llm/_torch/speculative/ngram.py

Signed-off-by: raayandhar <[email protected]>

raayandhar · 2025-07-07T18:11:31Z

/bot run --disable-fail-fast --add-multi-gpu-test

tensorrt-cicd · 2025-07-07T18:17:02Z

PR_Github #11176 [ run ] triggered by Bot

Signed-off-by: raayandhar <[email protected]>

tensorrt-cicd · 2025-07-07T21:34:16Z

PR_Github #11176 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #8267 completed with status: 'FAILURE'

raayandhar · 2025-07-07T22:26:56Z

/bot run --disable-fail-fast --add-multi-gpu-test

tensorrt-cicd · 2025-07-07T22:32:25Z

PR_Github #11184 [ run ] triggered by Bot

tensorrt-cicd · 2025-07-08T03:18:41Z

PR_Github #11184 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #8273 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

…VIDIA#5732) Signed-off-by: raayandhar <[email protected]> Signed-off-by: Yuxin <[email protected]>

raayandhar requested review from a team as code owners July 3, 2025 21:00

raayandhar requested review from achartier, litaotju and dongxuy04 July 3, 2025 21:00

raayandhar force-pushed the disagg-ngram branch 2 times, most recently from 36fc4e4 to 83ac339 Compare July 3, 2025 21:33

pcastonguay self-requested a review July 4, 2025 16:47

pcastonguay requested review from Tabrizian and SimengLiu-nv and removed request for achartier, litaotju and dongxuy04 July 4, 2025 16:50