Conversation


@venkywonka venkywonka commented May 16, 2025

Expand PyT llama_v3.1_nemotron_nano_8b perf tests coverage

Description

This PR adds end-to-end performance results for the llama_v3.1_nemotron_nano_8b bfloat16 engine on 1 H100.
Two broad load patterns were evaluated on the PyT backend across various ISL/OSL combinations:

  • Low concurrency: concurrency = 1, requests = 8
  • High concurrency: concurrency = 250, requests = 500

All tests use max_batch_size = 512.

Performance Summary
| Concurrency | Input Len | Output Len | #Reqs | Req Throughput (req/s) | Per GPU Output TPS (tps/gpu) | Avg Latency (ms) |
|---|---|---|---|---|---|---|
| 1 | 500 | 2000 | 8 | 0.0629 | 125.79 | 15898.9 |
| 1 | 1000 | 1000 | 8 | 0.1660 | 166.00 | 6023.7 |
| 1 | 5000 | 500 | 8 | 0.2971 | 148.54 | 3365.91 |
| 1 | 20000 | 2000 | 8 | 0.0639 | 127.72 | 15659.59 |
| 250 | 5000 | 500 | 500 | 2.7919 | 1395.94 | 77524.8 |
| 250 | 500 | 2000 | 500 | 3.2334 | 6466.84 | 67673.7 |
| 250 | 1000 | 1000 | 500 | 6.0589 | 6058.94 | 40414.9 |
| 250 | 20000 | 2000 | 500 | 0.2835 | 566.96 | 686971.0 |

NOTE: the above numbers were generated with prefill chunking disabled (which is the default behavior)
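The reported metrics are internally consistent: on a single GPU, per-GPU output TPS should equal request throughput (req/s) multiplied by the output length (tokens per request). A minimal sketch that sanity-checks a few rows of the table above (the tolerance value is an assumption to absorb rounding in the reported figures):

```python
# Sanity-check reported metrics: per-GPU output TPS should equal
# request throughput (req/s) * output length (tokens/req) on 1 GPU.
rows = [
    # (concurrency, isl, osl, reqs, req_tps, out_tps_per_gpu, avg_latency_ms)
    (1, 500, 2000, 8, 0.0629, 125.79, 15898.9),
    (1, 1000, 1000, 8, 0.1660, 166.00, 6023.7),
    (250, 500, 2000, 500, 3.2334, 6466.84, 67673.7),
    (250, 1000, 1000, 500, 6.0589, 6058.94, 40414.9),
]

NUM_GPUS = 1  # all runs in this PR used a single H100

for _, isl, osl, _, req_tps, out_tps, _ in rows:
    derived = req_tps * osl / NUM_GPUS
    # ~1% slack for rounding in the published numbers
    assert abs(derived - out_tps) / out_tps < 0.01, (isl, osl, derived, out_tps)
```

For the concurrency=1 runs, average latency (ms) is likewise approximately `1000 / req_throughput`, since exactly one request is in flight at a time.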


@Copilot Copilot AI left a comment


Pull Request Overview

This PR expands end-to-end performance test coverage for the llama_v3.1_nemotron_nano_8b engine on the PyTorch backend, evaluating low and high concurrency patterns across various input/output lengths.

  • Adds 8 new pytest entries under a new “torch backend” section for both low (concurrency=1, requests=8) and high (concurrency=250, requests=500) loads.
  • Removes two outdated PyTorch backend tests using default input lengths.
  • Ensures max batch size is set to 512 in all new scenarios.
Comments suppressed due to low confidence (1)

tests/integration/test_lists/qa/trt_llm_release_perf_test.yml:26

  • [nitpick] Consider renaming the section comment '# torch backend' to '# pytorch backend' for consistency and clarity in labeling.
# torch backend

@venkywonka venkywonka force-pushed the user/venky/ll-nemo-nano-pyt-perf-ext branch from a94c959 to e8b3fee Compare May 22, 2025 13:36
@venkywonka

/bot run --disable-fail-fast

@tensorrt-cicd

PR_Github #6156 [ run ] triggered by Bot

@tensorrt-cicd

PR_Github #6156 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #4501 completed with status: 'SUCCESS'

@venkywonka venkywonka changed the base branch from main to release/0.20 May 22, 2025 19:54
@venkywonka venkywonka requested review from a team as code owners May 22, 2025 19:54
@venkywonka venkywonka changed the base branch from release/0.20 to main May 22, 2025 19:55
This is because the test harness defaults to no prefill chunking, which means the specified ISL is the true context length.
When left unspecified in the test harness, the `maxnt` passed down to `trtllm-bench` is 2048.
This gives `trtllm-bench` conflicting inputs when ISL > 2048 but `maxnt` = 2048; hence `maxnt` is overridden to be consistent with the ISL in such cases.

Signed-off-by: Venky <[email protected]>
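The override described in the commit message can be sketched as follows. This is an illustrative reconstruction, not the actual harness code; the function name and signature are hypothetical:

```python
# Hypothetical sketch of the maxnt override (illustrative names, not the
# actual test-harness code). With prefill chunking disabled, the whole
# prompt must fit in one forward pass, so max_num_tokens must cover the ISL.
DEFAULT_MAX_NUM_TOKENS = 2048  # harness default passed to trtllm-bench


def resolve_max_num_tokens(isl, maxnt=None):
    """Return a max_num_tokens value consistent with the input seq length."""
    if maxnt is None:
        maxnt = DEFAULT_MAX_NUM_TOKENS
    # Without prefill chunking, isl > maxnt is a conflicting configuration,
    # so raise maxnt to at least the ISL.
    return max(maxnt, isl)
```

Under this sketch, `resolve_max_num_tokens(20000)` returns 20000 (the ISL wins over the 2048 default), while `resolve_max_num_tokens(500)` keeps the 2048 default.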
@venkywonka venkywonka force-pushed the user/venky/ll-nemo-nano-pyt-perf-ext branch from e8b3fee to 67260d3 Compare May 22, 2025 19:59
@venkywonka venkywonka requested a review from a team as a code owner May 22, 2025 19:59
@venkywonka venkywonka requested review from dcampora and litaotju May 22, 2025 19:59
@venkywonka venkywonka changed the base branch from main to release/0.20 May 22, 2025 19:59
@LarryXFly LarryXFly merged commit d15ceae into NVIDIA:release/0.20 May 23, 2025
1 of 2 checks passed
amirkl94 pushed a commit to amirkl94/TensorRT-LLM that referenced this pull request May 28, 2025
…(pyt) (NVIDIA#4407)

* extend pyt nano tests perf coverage

Signed-off-by: Venky <[email protected]>

* explicitly set maxnt for some cases

This is because the test harness defaults to no prefill chunking, which means the specified ISL is the true context length.
When left unspecified in the test harness, the `maxnt` passed down to `trtllm-bench` is 2048.
This gives `trtllm-bench` conflicting inputs when ISL > 2048 but `maxnt` = 2048; hence `maxnt` is overridden to be consistent with the ISL in such cases.

Signed-off-by: Venky <[email protected]>

---------

Signed-off-by: Venky <[email protected]>