Conversation

@brb-nv (Collaborator) commented Jun 24, 2025

Description

Recently, we updated the Mistral Small multimodal test to run with BS=8 instead of BS=1 in post-merge. This intermittently runs into a couple of issues:

  1. CUDA OOM, seen in 2 of 33 runs.
  2. PIL.UnidentifiedImageError, seen in 1 of 33 runs. This can happen when the downloaded image is invalid or corrupt (see the sketch below this list).
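
A minimal sketch of how issue 2 can surface, assuming the test fetches images with requests and decodes them with Pillow; the names below are illustrative and not taken from the actual test code:

import io

import requests
from PIL import Image


def load_image(url: str) -> Image.Image:
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    # If the response body is not a recognizable image (e.g. an HTML error
    # page served in place of a JPEG), the next line raises
    # PIL.UnidentifiedImageError.
    return Image.open(io.BytesIO(resp.content))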

Unfortunately, the logs referenced in the bug reports are no longer available.

While I'm not able to reproduce the issue locally on an H100 PCIe (ran 5 times), this PR makes the following changes to keep the test running while hopefully stabilizing CI:

  1. Clear the CUDA cache before all multimodal tests, as is already done for the unit tests: 09929bd
  2. Catch exceptions per image URL so that problematic URLs are reported (a sketch of changes 1 and 2 follows this list).
  3. Run the post-merge test with BS=1 (which never failed in the previous 80 runs) while keeping the BS=8 test in example_test_lists.txt. This will hopefully stabilize CI.
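
A minimal sketch of what changes 1 and 2 could look like, using hypothetical names; the actual diff may differ:

import io

import pytest
import requests
import torch
from PIL import Image


@pytest.fixture(autouse=True)
def clear_cuda_cache():
    # Change 1: free cached CUDA allocations before each multimodal test,
    # mirroring what the unit tests already do.
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    yield


def load_images(urls):
    # Change 2: download images one at a time and report which URL failed,
    # so a corrupt download points at the offending URL instead of surfacing
    # as an anonymous PIL.UnidentifiedImageError.
    images = []
    for url in urls:
        try:
            resp = requests.get(url, timeout=30)
            resp.raise_for_status()
            images.append(Image.open(io.BytesIO(resp.content)))
        except Exception as e:
            raise RuntimeError(f"Failed to load image from {url}") from e
    return images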

Test Coverage

$ pytest tests/integration/defs/examples/test_multimodal.py::test_llm_multimodal_general[Mistral-Small-3.1-24B-Instruct-2503-pp:1-tp:1-bfloat16-bs:8-cpp_e2e:False-nb:1] -s -v
$ pytest tests/integration/defs/examples/test_multimodal.py::test_llm_multimodal_general[Mistral-Small-3.1-24B-Instruct-2503-pp:1-tp:1-bfloat16-bs:1-cpp_e2e:False-nb:1] -s -v

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--disable-fail-fast --skip-test --stage-list "A10-1, xxx" --gpu-type "A30, H100_PCIe" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-[Post-Merge]-1, xxx"]

Launch build/test pipelines. All previously running jobs will be killed.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests. Will also run L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-[Post-Merge]-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-[Post-Merge]-1, xxx".

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md.
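
For example, a hypothetical invocation that runs the pipeline on a specific GPU type with fail-fast disabled (using only the flags listed above):

/bot run --disable-fail-fast --gpu-type "H100_PCIe"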

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since a lack of care and validation can break the top of tree.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since a lack of care and validation can break the top of tree.

@brb-nv brb-nv requested a review from a team as a code owner June 24, 2025 23:00
@brb-nv brb-nv requested review from yiqingy0 and omera-nv June 24, 2025 23:32
…tral Small multimodal for BS=8

Signed-off-by: Balaram Buddharaju <[email protected]>
@brb-nv brb-nv force-pushed the user/brb/fix-mistral-small-intermittent-oom branch from fa3ae53 to 456d5cd Compare June 24, 2025 23:37
@brb-nv (Collaborator, Author) commented Jun 24, 2025

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator)

PR_Github #9767 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator)

PR_Github #9767 [ run ] completed with state SUCCESS
/LLM/release-0.21/L0_MergeRequest_PR pipeline #22 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

@chzblych chzblych merged commit 32f50de into NVIDIA:release/0.21 Jun 25, 2025
3 checks passed
dc3671 pushed a commit to dc3671/TensorRT-LLM that referenced this pull request Jul 1, 2025
…tral Small multimodal for BS=8 (NVIDIA#5453)

Signed-off-by: Balaram Buddharaju <[email protected]>
dc3671 pushed a commit that referenced this pull request Jul 1, 2025
…tral Small multimodal for BS=8 (#5453)

Signed-off-by: Balaram Buddharaju <[email protected]>
Shunkangz pushed a commit to Shunkangz/TensorRT-LLM that referenced this pull request Jul 2, 2025
…tral Small multimodal for BS=8 (NVIDIA#5453)

Signed-off-by: Balaram Buddharaju <[email protected]>
@brb-nv brb-nv deleted the user/brb/fix-mistral-small-intermittent-oom branch July 11, 2025 23:26