test(perf): Pt.2 Add Llama-3_3-Nemotron-Super-49B-v1 integration-perf-tests (cpp) #4499
Signed-off-by: Venky <[email protected]>
9e4e51a to de538af
/bot run --disable-fail-fast
PR_Github #5898 [ run ] triggered by Bot
Pull Request Overview
This PR enables the integration performance tests for the Llama-3.3-Nemotron-Super-49B-v1 model by adding new test configurations with adjusted parameters, reducing the number of total requests to meet the time budget.
- Enabled four new performance test cases with increased max batch sizes (64) and a reduced request count (4).
- Removed the previously commented-out timeout test cases.
Comments suppressed due to low confidence (1)
tests/integration/test_lists/qa/trt_llm_release_perf_test.yml:290
- Verify that the new 'reqs:4' setting provides adequate coverage for the intended low-concurrency performance scenarios and aligns with the overall test plan.
- perf/test_perf.py::test_perf[llama_v3.3_nemotron_super_49b-bench-bfloat16-maxbs:64-input_output_len:500,2000-reqs:4-con:1-gpus:4]
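The test ID above packs the whole configuration into its bracketed suffix. As a minimal sketch for reading it, the snippet below splits that suffix into positional labels and `key:value` settings. `parse_perf_test_id` is a hypothetical helper, not part of the repository, and the hyphen-delimited layout is an assumption inferred from this single ID.

```python
# The test ID quoted in the review comment above.
TEST_ID = (
    "perf/test_perf.py::test_perf["
    "llama_v3.3_nemotron_super_49b-bench-bfloat16-maxbs:64-"
    "input_output_len:500,2000-reqs:4-con:1-gpus:4]"
)


def parse_perf_test_id(test_id):
    """Hypothetical helper: split the bracketed part of a perf test ID.

    Assumes tokens are separated by '-' and settings use 'key:value',
    which is inferred from the one ID shown in this PR.
    """
    start = test_id.index("[") + 1
    end = test_id.rindex("]")
    positional, fields = [], {}
    for token in test_id[start:end].split("-"):
        if ":" in token:
            key, value = token.split(":", 1)
            fields[key] = value
        else:
            positional.append(token)
    return positional, fields


positional, fields = parse_perf_test_id(TEST_ID)
print(positional)
print(fields)
```

Read this way, the ID describes a BF16 run with a maximum batch size of 64, an input/output length of 500/2000, 4 total requests, a concurrency of 1, and 4 GPUs.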
PR_Github #5898 [ run ] completed with state |
…rf-tests (cpp) (NVIDIA#4499) add low concurrency perf tests Signed-off-by: Venky <[email protected]>
…tegration-perf-tests (cpp) (#4499) (#4588) Signed-off-by: Venky <[email protected]>
…rf-tests (cpp) (NVIDIA#4499) add low concurrency perf tests Signed-off-by: Venky <[email protected]> Signed-off-by: darraghdog <[email protected]>
Description
This PR adds the remaining llama_v3.3_nemotron_super_49b tests that previously timed out on high-request settings. These tests add four new coverage points in the low-concurrency setting (2 ctx/gen shapes × BF16/FP8).
This time, the total number of requests was reduced from 512 to 4 to fit the time budget.
Test-plan overview
Shmoo-ed parameters
input=5 000, output=500
input=500, output=2 000
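The four coverage points follow from crossing the two input/output shapes with the two precisions named in the description. The sketch below only enumerates that matrix. The constant names and dictionary keys are illustrative assumptions, and it does not reproduce the actual test IDs from the test list.

```python
from itertools import product

# Values taken from the PR description; names are illustrative only.
SHAPES = [(5000, 500), (500, 2000)]  # (input length, output length)
PRECISIONS = ["BF16", "FP8"]
TOTAL_REQUESTS = 4                   # reduced from 512 to fit the time budget


def coverage_points():
    """Return one entry per precision/shape combination."""
    return [
        {
            "precision": precision,
            "input_len": input_len,
            "output_len": output_len,
            "requests": TOTAL_REQUESTS,
        }
        for precision, (input_len, output_len) in product(PRECISIONS, SHAPES)
    ]


for point in coverage_points():
    print(point)
```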
Throughput & latency summary (tok/s, ms)
Latency percentiles (ms)