test(perf): Extend the Llama-Nemotron-Nano-8B perf-integration-tests (pyt) #4407
Conversation
Pull Request Overview
This PR expands end-to-end performance test coverage for the llama_v3.1_nemotron_nano_8b
engine on the PyTorch backend, evaluating low and high concurrency patterns across various input/output lengths.
- Adds 8 new pytest entries under a new “torch backend” section for both low (concurrency=1, requests=8) and high (concurrency=250, requests=500) loads.
- Removes two outdated PyTorch backend tests using default input lengths.
- Ensures max batch size is set to 512 in all new scenarios.
Comments suppressed due to low confidence (1)
tests/integration/test_lists/qa/trt_llm_release_perf_test.yml:26
- [nitpick] Consider renaming the section comment '# torch backend' to '# pytorch backend' for consistency and clarity in labeling.
# torch backend
This is because the test harness defaults to no prefill chunking, which means the ISL specified is the true context length. When left unspecified in the test harness, the `maxnt` passed down to `trtllm-bench` is 2048. As a result, `trtllm-bench` receives conflicting inputs whenever ISL > 2048 but maxnt = 2048; `maxnt` is therefore overridden to be consistent with the ISL in those cases. Signed-off-by: Venky <[email protected]>
Expand PyT llama_v3.1_nemotron_nano_8b perf tests coverage
Description
This PR adds end-to-end performance results for the llama_v3.1_nemotron_nano_8b bfloat16 engine on 1 H100.
Two broad load patterns were evaluated on the PyT backend for various ISL/OSL combinations:
- concurrency = 1, requests = 8
- concurrency = 250, requests = 500
All tests use max_batch_size = 512.
Performance Summary
[Performance summary table omitted: columns included request throughput (req/s), per-GPU token throughput (tps/gpu), and latency (ms).]
NOTE: the above numbers were generated with prefill chunking disabled (which is the default behavior)
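The scenario matrix described above (two load patterns crossed with several ISL/OSL combinations, all at max_batch_size = 512) could be built along these lines. This is a sketch under assumptions: the ISL/OSL values and all names are illustrative and do not mirror the actual YAML test-list entries.

```python
# Hypothetical sketch of the load-pattern matrix described in this PR.
# The ISL/OSL combos below are illustrative placeholders, not the
# actual values from the test list.
import itertools

LOAD_PATTERNS = [
    {"concurrency": 1, "requests": 8},      # low-load pattern
    {"concurrency": 250, "requests": 500},  # high-load pattern
]
ISL_OSL_COMBOS = [(128, 128), (512, 32), (5000, 500), (500, 2000)]  # illustrative

def build_scenarios():
    """Cross every load pattern with every ISL/OSL combo; all scenarios
    share max_batch_size = 512, as stated in the PR description."""
    scenarios = []
    for load, (isl, osl) in itertools.product(LOAD_PATTERNS, ISL_OSL_COMBOS):
        scenarios.append({**load, "isl": isl, "osl": osl, "max_batch_size": 512})
    return scenarios

print(len(build_scenarios()))  # 2 patterns x 4 combos = 8 scenarios
```

Crossing two load patterns with four length combinations yields the eight new test entries the PR overview mentions.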