feat: Add non-streaming support for trtllm serve bench script & fixed prompt and output token length #4971

yizhang-nv · 2025-06-06T01:59:53Z

feat: Add non-streaming support for trtllm serve bench script & fixed prompt and output token length

Updated async_request_trt_llm, async_request_openai_completions, and async_request_openai_chat_completions to accept a streaming flag, allowing for flexible response handling.
Generate more robust prompts whose input_ids remain unchanged after a detokenize -> tokenize round trip.
Allow passing input IDs directly to the server.
Add a streaming configuration option to the CLI.
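
As a rough illustration of the streaming flag described above, the sketch below builds an OpenAI-style completions payload and reads back either a single JSON body or a sequence of server-sent-event lines. It is not the code from this PR: the names `build_payload`, `collect_text`, and `parse_args`, as well as the `--non-streaming` flag, are hypothetical, and it assumes the server follows the OpenAI completions wire format (including accepting a list of token IDs as `prompt`).

```python
import argparse
import json
from typing import Iterable, List, Union


def build_payload(model: str, prompt: Union[str, List[int]], max_tokens: int,
                  streaming: bool) -> dict:
    """Build a completions request body; `prompt` may be text or token IDs."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "stream": streaming,
    }
    if streaming:
        # Usage statistics are only requested through stream_options when streaming.
        payload["stream_options"] = {"include_usage": True}
    return payload


def collect_text(body_lines: Iterable[str], streaming: bool) -> str:
    """Return the generated text from a streaming or non-streaming response."""
    if not streaming:
        # Non-streaming: the whole body is one JSON document.
        data = json.loads("".join(body_lines))
        return data["choices"][0]["text"]
    parts = []
    for line in body_lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines between events
        chunk = line[len("data:"):].strip()
        if chunk == "[DONE]":
            break
        data = json.loads(chunk)
        # A final usage-only chunk may carry an empty `choices` list.
        for choice in data.get("choices", []):
            parts.append(choice.get("text") or "")
    return "".join(parts)


def parse_args(argv=None) -> argparse.Namespace:
    """Hypothetical CLI: streaming stays on unless --non-streaming is given."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--non-streaming", action="store_true",
                        help="Request a single response body instead of a stream.")
    return parser.parse_args(argv)
```

A caller would derive the flag as `streaming = not parse_args().non_streaming` and pass it to both helpers, so the same request path serves both modes.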


yizhang-nv · 2025-06-06T02:00:57Z

/bot run

tensorrt-cicd · 2025-06-06T02:07:03Z

PR_Github #7816 [ run ] triggered by Bot

yizhang-nv · 2025-06-06T02:10:46Z

/bot kill

yizhang-nv · 2025-06-06T02:10:59Z

/bot run

tensorrt-cicd · 2025-06-06T02:16:49Z

PR_Github #7818 [ kill ] triggered by Bot

tensorrt-cicd · 2025-06-06T02:16:50Z

PR_Github #7819 [ ] completed with state ABORTED

tensorrt-cicd · 2025-06-06T02:17:21Z

PR_Github #7816 [ run ] completed with state ABORTED

tensorrt-cicd · 2025-06-06T02:17:51Z

PR_Github #7818 [ kill ] completed with state SUCCESS
Successfully killed previous jobs for commit ed51cb8

yizhang-nv · 2025-06-09T02:22:43Z

/bot run

tensorrt-cicd · 2025-06-09T02:28:58Z

PR_Github #8078 [ run ] triggered by Bot

tensorrt-cicd · 2025-06-09T08:39:11Z

PR_Github #8078 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #5858 completed with status: 'SUCCESS'

yizhang-nv · 2025-06-10T01:48:52Z

/bot run

tensorrt-cicd · 2025-06-10T01:55:25Z

PR_Github #8181 [ run ] triggered by Bot

tensorrt-cicd · 2025-06-10T10:24:50Z

PR_Github #8181 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #5932 completed with status: 'FAILURE'

LinPoly

I'm not sure that tokenizing the random dataset is a good workaround (WAR). In my previous benchmark results, tokenization was not a CPU bottleneck in the max-throughput scenario (though detokenization with streaming may be). I'm also not sure how TTFT will be affected by omitting tokenization. A recent PR that omitted tokenization for decoding-server performance may be relevant; @kaiyux may know more context.

tensorrt_llm/serve/scripts/backend_request_func.py

tensorrt_llm/serve/scripts/benchmark_dataset.py

tensorrt-cicd · 2025-06-17T02:48:53Z

PR_Github #9103 [ run ] triggered by Bot

yizhang-nv · 2025-06-17T03:16:41Z

/bot run

yizhang-nv · 2025-06-17T03:18:44Z

/bot run

tensorrt-cicd · 2025-06-17T03:23:38Z

PR_Github #9111 [ run ] triggered by Bot

tensorrt-cicd · 2025-06-17T03:23:42Z

PR_Github #9103 [ run ] completed with state ABORTED

tensorrt-cicd · 2025-06-17T03:24:28Z

PR_Github #9112 [ run ] triggered by Bot

tensorrt-cicd · 2025-06-17T03:24:31Z

PR_Github #9111 [ run ] completed with state ABORTED

tensorrt-cicd · 2025-06-17T04:03:27Z

PR_Github #9112 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #6666 completed with status: 'FAILURE'

yizhang-nv · 2025-06-17T04:39:28Z

/bot run

tensorrt-cicd · 2025-06-17T04:46:13Z

PR_Github #9128 [ run ] triggered by Bot

LinPoly

LGTM. I left two comments on potential improvements.

tensorrt_llm/serve/scripts/benchmark_dataset.py

tensorrt_llm/serve/scripts/benchmark_serving.py

tensorrt-cicd · 2025-06-17T07:41:08Z

PR_Github #9128 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #6681 completed with status: 'SUCCESS'

yizhang-nv · 2025-06-17T10:37:20Z

/bot run

tensorrt-cicd · 2025-06-17T10:43:04Z

PR_Github #9191 [ run ] triggered by Bot

…unctions
- Introduced `no_kv_cache_reuse` parameter in `get_llm_args` and `serve` functions for better cache management.
- Updated `async_request_trt_llm`, `async_request_openai_completions`, and `async_request_openai_chat_completions` to accept a `streaming` flag, allowing for flexible response handling.
- Modified benchmark scripts to incorporate streaming functionality, enhancing performance testing capabilities.

Signed-off-by: Yi Zhang <[email protected]>

yizhang-nv · 2025-06-18T02:44:25Z

/bot run

tensorrt-cicd · 2025-06-18T02:49:57Z

PR_Github #9293 [ run ] triggered by Bot

tensorrt-cicd · 2025-06-18T05:37:14Z

PR_Github #9293 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #6819 completed with status: 'SUCCESS'

samuellees · 2025-08-06T09:01:43Z

... but I am not sure how TTFT will be affected by omitting tokenization, a recent PR that omitted tokenization for decoding server performance may be relevant, @kaiyux may know more context.

I think tokenization has only a very slight impact on TTFT.

The key issue is that the random token IDs are detokenized into a meaningless prompt before being sent to the TRTLLM server. The server then re-tokenizes that prompt and gets a new sequence of token IDs whose length differs from that of the original random token IDs (it is longer in most cases), so the benchmark loses its value (mainly affecting).
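
The length drift described above can be reproduced with a toy tokenizer, and one possible way to obtain round-trip-stable prompts is to re-encode, trim or pad back to the target length, and repeat until nothing changes. The sketch below is only an illustration under those assumptions: `ToyTokenizer` and `stable_prompt_ids` are hypothetical names, the three-piece vocabulary is invented, and this is not necessarily how the benchmark script generates its prompts.

```python
import random
from typing import List


class ToyTokenizer:
    """Greedy longest-match tokenizer over a tiny, made-up vocabulary."""

    def __init__(self) -> None:
        self.pieces = ["a", "b", "ab"]  # token id -> text piece
        self.vocab_size = len(self.pieces)
        # Try longer pieces first so "ab" wins over "a" followed by "b".
        self._by_length = sorted(range(self.vocab_size),
                                 key=lambda i: -len(self.pieces[i]))

    def decode(self, ids: List[int]) -> str:
        return "".join(self.pieces[i] for i in ids)

    def encode(self, text: str) -> List[int]:
        ids, pos = [], 0
        while pos < len(text):
            for i in self._by_length:
                if text.startswith(self.pieces[i], pos):
                    ids.append(i)
                    pos += len(self.pieces[i])
                    break
            else:
                raise ValueError(f"cannot tokenize at position {pos}")
        return ids


def stable_prompt_ids(tok: ToyTokenizer, target_len: int, rng: random.Random,
                      max_iters: int = 100) -> List[int]:
    """Return `target_len` token IDs that survive decode -> encode unchanged."""
    ids = [rng.randrange(tok.vocab_size) for _ in range(target_len)]
    for _ in range(max_iters):
        round_tripped = tok.encode(tok.decode(ids))
        if round_tripped == ids:
            return ids
        # Keep what the tokenizer actually produced, then restore the length
        # by truncating or by appending fresh random tokens.
        ids = round_tripped[:target_len]
        ids += [rng.randrange(tok.vocab_size)
                for _ in range(target_len - len(ids))]
    raise RuntimeError("no round-trip-stable prompt found")


tok = ToyTokenizer()
# IDs [0, 1] decode to "ab", which re-encodes as the single token [2].
print(tok.encode(tok.decode([0, 1])))
print(stable_prompt_ids(tok, 8, random.Random(0)))
```

With a real tokenizer the same check (`encode(decode(ids)) == ids`) shows whether the server sees the intended prompt length. Sending the token IDs directly avoids the round trip altogether.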

yizhang-nv requested review from litaotju and kaiyux June 6, 2025 01:59

yizhang-nv marked this pull request as ready for review June 6, 2025 02:00

yizhang-nv force-pushed the fix-serve-bench branch from e608ce7 to ed7a9b3 Compare June 6, 2025 02:00

yizhang-nv enabled auto-merge (squash) June 6, 2025 02:01

yizhang-nv force-pushed the fix-serve-bench branch from ed7a9b3 to 7e01ce2 Compare June 6, 2025 02:10

yizhang-nv force-pushed the fix-serve-bench branch from 7e01ce2 to ed51cb8 Compare June 6, 2025 02:10

yizhang-nv force-pushed the fix-serve-bench branch from ed51cb8 to 20e51f1 Compare June 9, 2025 02:22

yizhang-nv force-pushed the fix-serve-bench branch 4 times, most recently from 62abe19 to 7e0bd8e Compare June 10, 2025 01:36

yizhang-nv changed the title ~~feat: Add no_kv_cache_reuse option and streaming support for trtllm serve bench~~ feat: Add non-streaming support for trtllm serve bench script & fixed prompt and output token length Jun 10, 2025

kaiyux requested a review from LinPoly June 11, 2025 05:18

LinPoly reviewed Jun 11, 2025

tensorrt_llm/serve/scripts/backend_request_func.py (outdated, resolved)

tensorrt_llm/serve/scripts/benchmark_dataset.py (outdated, resolved)

yizhang-nv force-pushed the fix-serve-bench branch from 7e0bd8e to 4694214 Compare June 12, 2025 13:15

yizhang-nv force-pushed the fix-serve-bench branch from 8b19246 to 0b01369 Compare June 17, 2025 03:18

yizhang-nv force-pushed the fix-serve-bench branch from 0b01369 to 3e78f58 Compare June 17, 2025 04:39

LinPoly approved these changes Jun 17, 2025

tensorrt_llm/serve/scripts/benchmark_dataset.py (outdated, resolved)

tensorrt_llm/serve/scripts/benchmark_serving.py (outdated, resolved)

yizhang-nv force-pushed the fix-serve-bench branch from 209f67f to 028dc94 Compare June 17, 2025 10:36

yizhang-nv added 5 commits June 18, 2025 10:44

Fix chat

a3e7ba8

Signed-off-by: Yi Zhang <[email protected]>

Better random prompt generator

3954307

Signed-off-by: Yi Zhang <[email protected]>

Better error log

9e79de3

Signed-off-by: Yi Zhang <[email protected]>

Fix typo

94bbd52

Signed-off-by: Yi Zhang <[email protected]>

yizhang-nv force-pushed the fix-serve-bench branch from 028dc94 to 94bbd52 Compare June 18, 2025 02:44

yizhang-nv merged commit e44f768 into NVIDIA:main Jun 18, 2025
3 checks passed

yizhang-nv deleted the fix-serve-bench branch June 18, 2025 06:04

samuellees mentioned this pull request Aug 18, 2025

[None][fix] Fix bug of prompt and output token length of the RandomDataset under --random-token-ids option #6974

Open
