
Conversation

venkywonka (Collaborator)

Description

This PR adds the remaining llama_v3.3_nemotron_super_49b tests that previously timed out on high-request settings.

These tests add four new coverage points in the low-concurrency setting (2 ctx/gen shapes × BF16/FP8).
This time, the total request count was reduced from 512 to 4 to fit the time budget.

Test-plan overview

| Invariant            | Value                       |
|----------------------|-----------------------------|
| GPUs                 | 4                           |
| Backend              | cpp (TRT engines are built) |
| Benchmarking runtime | trtllm-bench                |
| Max batch size       | 64                          |
| Requests             | 4                           |
| Concurrency          | 1                           |

Shmoo-ed parameters

  • Quant mode: native BF16 → post-quantised FP8
  • Sequence shape:
    • Long-ctx / short-gen: input=5 000, output=500
    • Short-ctx / long-gen: input=500, output=2 000
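
As a point of reference, here is a sketch of how the four new coverage points could appear in the release perf test list. Only the BF16 / 500,2000 entry is quoted verbatim in the review comment further down; the other three names, including the quant:fp8 token and its position, are assumptions that follow the same naming pattern.

```yaml
# tests/integration/test_lists/qa/trt_llm_release_perf_test.yml (sketch)
# Only the first entry is quoted verbatim in this thread; the remaining
# names, including the "quant:fp8" token and where it sits, are assumptions.
- perf/test_perf.py::test_perf[llama_v3.3_nemotron_super_49b-bench-bfloat16-maxbs:64-input_output_len:500,2000-reqs:4-con:1-gpus:4]
- perf/test_perf.py::test_perf[llama_v3.3_nemotron_super_49b-bench-bfloat16-maxbs:64-input_output_len:5000,500-reqs:4-con:1-gpus:4]
- perf/test_perf.py::test_perf[llama_v3.3_nemotron_super_49b-bench-bfloat16-maxbs:64-input_output_len:500,2000-quant:fp8-reqs:4-con:1-gpus:4]
- perf/test_perf.py::test_perf[llama_v3.3_nemotron_super_49b-bench-bfloat16-maxbs:64-input_output_len:5000,500-quant:fp8-reqs:4-con:1-gpus:4]
```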

Throughput & latency summary

| Shape     | Quant | Req/s  | Output TPS (tok/s) | Token TPS (tok/s) | Total Latency (ms) | Avg Latency (ms) | TPS/GPU | TPS/User |
|-----------|-------|--------|--------------------|-------------------|--------------------|------------------|---------|----------|
| 5 k × 500 | BF16  | 0.1327 | 66.35              | 729.89            | 30 141             | 7 535            | 16.59   | 66.35    |
| 5 k × 500 | FP8   | 0.1902 | 95.10              | 1 046.14          | 21 030             | 5 257            | 23.78   | 95.10    |
| 500 × 2 k | BF16  | 0.0356 | 71.24              | 89.04             | 112 304            | 28 076           | 17.81   | 71.24    |
| 500 × 2 k | FP8   | 0.0507 | 101.35             | 126.69            | 78 931             | 19 733           | 25.34   | 101.35   |

Latency percentiles (ms)

| Shape     | Quant | P50      | P90      | P95      | P99      | Min      | Max      |
|-----------|-------|----------|----------|----------|----------|----------|----------|
| 5 k × 500 | BF16  | 7 535.5  | 7 536.4  | 7 536.4  | 7 536.4  | 7 534.4  | 7 536.4  |
| 5 k × 500 | FP8   | 5 258.7  | 5 258.9  | 5 258.9  | 5 258.9  | 5 254.9  | 5 258.9  |
| 500 × 2 k | BF16  | 28 076.6 | 28 079.0 | 28 079.0 | 28 079.0 | 28 072.2 | 28 079.0 |
| 500 × 2 k | FP8   | 19 734.7 | 19 735.8 | 19 735.8 | 19 735.8 | 19 730.2 | 19 735.8 |

@venkywonka venkywonka changed the title test(perf): Pt. 2 - Add Llama-3_3-Nemotron-Super-49B-v1 integration-perf-tests (cpp) test(perf): Pt.2 Add Llama-3_3-Nemotron-Super-49B-v1 integration-perf-tests (cpp) May 20, 2025
@venkywonka venkywonka marked this pull request as ready for review May 20, 2025 16:59
@venkywonka venkywonka force-pushed the user/venky/ll-nemo-super-cpp-low-con-perf-tests branch from 9e4e51a to de538af May 20, 2025 16:59
@venkywonka (Collaborator, Author)

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator)

PR_Github #5898 [ run ] triggered by Bot

@Copilot Copilot AI (Contributor) left a comment


Pull Request Overview

This PR enables the integration performance tests for the Llama-3.3-Nemotron-Super-49B-v1 model by adding new test configurations with adjusted parameters, reducing the number of total requests to meet the time budget.

  • Enabled four new performance test cases with increased max batch sizes (64) and a reduced request count (4).
  • Removed the previously commented-out timeout test cases.
Comments suppressed due to low confidence (1)

tests/integration/test_lists/qa/trt_llm_release_perf_test.yml:290

  • Verify that the new 'reqs:4' setting provides adequate coverage for the intended low-concurrency performance scenarios and aligns with the overall test plan.
  - perf/test_perf.py::test_perf[llama_v3.3_nemotron_super_49b-bench-bfloat16-maxbs:64-input_output_len:500,2000-reqs:4-con:1-gpus:4]
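
For anyone reproducing the flagged case locally, a rough sketch of one possible invocation follows; the working directory and any extra options required by the repo's integration-test harness are assumptions, not confirmed in this thread.

```bash
# Hypothetical local run of the quoted test case; additional harness flags
# (model/engine paths, output directories) may be required in practice.
cd tests/integration/defs
pytest -v "perf/test_perf.py::test_perf[llama_v3.3_nemotron_super_49b-bench-bfloat16-maxbs:64-input_output_len:500,2000-reqs:4-con:1-gpus:4]"
```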

@tensorrt-cicd
Copy link
Collaborator

PR_Github #5898 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #4322 completed with status: 'SUCCESS'

@MartinMarciniszyn MartinMarciniszyn merged commit 0a8461d into NVIDIA:main May 21, 2025
3 checks passed
venkywonka added a commit to venkywonka/TensorRT-LLM that referenced this pull request May 22, 2025
…rf-tests (cpp) (NVIDIA#4499)

add low concurrency perf tests

Signed-off-by: Venky <[email protected]>
chzblych pushed a commit that referenced this pull request May 28, 2025
darraghdog pushed a commit to darraghdog/TensorRT-LLM that referenced this pull request Jun 3, 2025
…rf-tests (cpp) (NVIDIA#4499)

add low concurrency perf tests

Signed-off-by: Venky <[email protected]>
Signed-off-by: darraghdog <[email protected]>