
Conversation

venkywonka (Collaborator)

Description

This PR adds the remaining llama_v3.3_nemotron_super_49b tests that previously timed out on high-request settings.

These tests add four new coverage points in the low-concurrency setting (2 ctx/gen shapes × BF16/FP8).
This time, the total request count was reduced from 512 to 4 to fit the time budget.

Test-plan overview

| Invariant            | Value                       |
|----------------------|-----------------------------|
| GPUs                 | 4                           |
| Backend              | cpp (TRT engines are built) |
| Benchmarking runtime | trtllm-bench                |
| Max batch size       | 64                          |
| Requests             | 4                           |
| Concurrency          | 1                           |

Shmoo-ed parameters

  • Quant mode: native BF16 → post-quantised FP8
  • Sequence shape:
    • Long-ctx / short-gen: input=5 000, output=500
    • Short-ctx / long-gen: input=500, output=2 000
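
As a point of reference, here is a sketch of how the four new coverage points could appear in the release perf test list. Only the BF16 / 500,2000 entry is quoted verbatim in the review comment further down; the other three names, including the quant:fp8 token and its position, are assumptions that follow the same naming pattern.

```yaml
# tests/integration/test_lists/qa/trt_llm_release_perf_test.yml (sketch)
# Only the first entry is quoted verbatim in this thread; the remaining
# names, including the "quant:fp8" token and where it sits, are assumptions.
- perf/test_perf.py::test_perf[llama_v3.3_nemotron_super_49b-bench-bfloat16-maxbs:64-input_output_len:500,2000-reqs:4-con:1-gpus:4]
- perf/test_perf.py::test_perf[llama_v3.3_nemotron_super_49b-bench-bfloat16-maxbs:64-input_output_len:5000,500-reqs:4-con:1-gpus:4]
- perf/test_perf.py::test_perf[llama_v3.3_nemotron_super_49b-bench-bfloat16-maxbs:64-input_output_len:500,2000-quant:fp8-reqs:4-con:1-gpus:4]
- perf/test_perf.py::test_perf[llama_v3.3_nemotron_super_49b-bench-bfloat16-maxbs:64-input_output_len:5000,500-quant:fp8-reqs:4-con:1-gpus:4]
```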

Throughput & latency summary

| Shape     | Quant | Req/s  | Output TPS (tok/s) | Token TPS (tok/s) | Total Latency (ms) | Avg Latency (ms) | TPS/GPU | TPS/User |
|-----------|-------|--------|--------------------|-------------------|--------------------|------------------|---------|----------|
| 5 k × 500 | BF16  | 0.1327 | 66.35              | 729.89            | 30 141             | 7 535            | 16.59   | 66.35    |
| 5 k × 500 | FP8   | 0.1902 | 95.10              | 1 046.14          | 21 030             | 5 257            | 23.78   | 95.10    |
| 500 × 2 k | BF16  | 0.0356 | 71.24              | 89.04             | 112 304            | 28 076           | 17.81   | 71.24    |
| 500 × 2 k | FP8   | 0.0507 | 101.35             | 126.69            | 78 931             | 19 733           | 25.34   | 101.35   |

Latency percentiles (ms)

| Shape     | Quant | P50      | P90      | P95      | P99      | Min      | Max      |
|-----------|-------|----------|----------|----------|----------|----------|----------|
| 5 k × 500 | BF16  | 7 535.5  | 7 536.4  | 7 536.4  | 7 536.4  | 7 534.4  | 7 536.4  |
| 5 k × 500 | FP8   | 5 258.7  | 5 258.9  | 5 258.9  | 5 258.9  | 5 254.9  | 5 258.9  |
| 500 × 2 k | BF16  | 28 076.6 | 28 079.0 | 28 079.0 | 28 079.0 | 28 072.2 | 28 079.0 |
| 500 × 2 k | FP8   | 19 734.7 | 19 735.8 | 19 735.8 | 19 735.8 | 19 730.2 | 19 735.8 |

@venkywonka venkywonka changed the title test(perf): Pt. 2 - Add Llama-3_3-Nemotron-Super-49B-v1 integration-perf-tests (cpp) test(perf): Pt.2 Add Llama-3_3-Nemotron-Super-49B-v1 integration-perf-tests (cpp) May 20, 2025
@venkywonka venkywonka marked this pull request as ready for review May 20, 2025 16:59
@venkywonka venkywonka force-pushed the user/venky/ll-nemo-super-cpp-low-con-perf-tests branch from 9e4e51a to de538af May 20, 2025 16:59
@venkywonka (Collaborator, Author)

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator)

PR_Github #5898 [ run ] triggered by Bot

@Copilot Copilot AI (Contributor) left a comment


Pull Request Overview

This PR enables the integration performance tests for the Llama-3.3-Nemotron-Super-49B-v1 model by adding new test configurations with adjusted parameters, reducing the number of total requests to meet the time budget.

  • Enabled four new performance test cases with increased max batch sizes (64) and a reduced request count (4).
  • Removed the previously commented-out timeout test cases.
Comments suppressed due to low confidence (1)

tests/integration/test_lists/qa/trt_llm_release_perf_test.yml:290

  • Verify that the new 'reqs:4' setting provides adequate coverage for the intended low-concurrency performance scenarios and aligns with the overall test plan.
  - perf/test_perf.py::test_perf[llama_v3.3_nemotron_super_49b-bench-bfloat16-maxbs:64-input_output_len:500,2000-reqs:4-con:1-gpus:4]
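
For anyone reproducing the flagged case locally, a rough sketch of one possible invocation follows; the working directory and any extra options required by the repo's integration-test harness are assumptions, not confirmed in this thread.

```bash
# Hypothetical local run of the quoted test case; additional harness flags
# (model/engine paths, output directories) may be required in practice.
cd tests/integration/defs
pytest -v "perf/test_perf.py::test_perf[llama_v3.3_nemotron_super_49b-bench-bfloat16-maxbs:64-input_output_len:500,2000-reqs:4-con:1-gpus:4]"
```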

@tensorrt-cicd
Copy link
Collaborator

PR_Github #5898 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #4322 completed with status: 'SUCCESS'

@MartinMarciniszyn MartinMarciniszyn merged commit 0a8461d into NVIDIA:main May 21, 2025
3 checks passed
venkywonka added a commit to venkywonka/TensorRT-LLM that referenced this pull request May 22, 2025
…rf-tests (cpp) (NVIDIA#4499)

add low concurrency perf tests

Signed-off-by: Venky <[email protected]>
chzblych pushed a commit that referenced this pull request May 28, 2025
darraghdog pushed a commit to darraghdog/TensorRT-LLM that referenced this pull request Jun 3, 2025
…rf-tests (cpp) (NVIDIA#4499)

add low concurrency perf tests

Signed-off-by: Venky <[email protected]>
Signed-off-by: darraghdog <[email protected]>