Conversation

@brb-nv (Collaborator) commented Jun 24, 2025

Description

Recently, we updated the Mistral Small multimodal test to run with BS=8 instead of BS=1 in post-merge. This intermittently runs into a couple of issues:

  1. CUDA OOM, seen in 2 of 33 runs.
  2. PIL.UnidentifiedImageError, seen in 1 of 33 runs. This can happen when the downloaded image is invalid or corrupt (see the sketch below this list).
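
A minimal sketch of how issue 2 can surface, assuming the test fetches images with requests and decodes them with Pillow; the names below are illustrative and not taken from the actual test code:

import io

import requests
from PIL import Image


def load_image(url: str) -> Image.Image:
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    # If the response body is not a recognizable image (e.g. an HTML error
    # page served in place of a JPEG), the next line raises
    # PIL.UnidentifiedImageError.
    return Image.open(io.BytesIO(resp.content))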

Unfortunately, the logs referenced in the bug reports are no longer available.

While I'm not able to reproduce the issue locally on an H100 PCIe (ran 5 times), this PR makes the following changes to keep the test running while hopefully stabilizing CI:

  1. Clear the CUDA cache before all multimodal tests, as is already done for the unit tests: 09929bd
  2. Catch exceptions per image URL so that problematic URLs are reported (a sketch of changes 1 and 2 follows this list).
  3. Run the post-merge test with BS=1 (which never failed in the previous 80 runs) while keeping the BS=8 test in example_test_lists.txt. This will hopefully stabilize CI.
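
A minimal sketch of what changes 1 and 2 could look like, using hypothetical names; the actual diff may differ:

import io

import pytest
import requests
import torch
from PIL import Image


@pytest.fixture(autouse=True)
def clear_cuda_cache():
    # Change 1: free cached CUDA allocations before each multimodal test,
    # mirroring what the unit tests already do.
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    yield


def load_images(urls):
    # Change 2: download images one at a time and report which URL failed,
    # so a corrupt download points at the offending URL instead of surfacing
    # as an anonymous PIL.UnidentifiedImageError.
    images = []
    for url in urls:
        try:
            resp = requests.get(url, timeout=30)
            resp.raise_for_status()
            images.append(Image.open(io.BytesIO(resp.content)))
        except Exception as e:
            raise RuntimeError(f"Failed to load image from {url}") from e
    return images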

Test Coverage

$ pytest tests/integration/defs/examples/test_multimodal.py::test_llm_multimodal_general[Mistral-Small-3.1-24B-Instruct-2503-pp:1-tp:1-bfloat16-bs:8-cpp_e2e:False-nb:1] -s -v
$ pytest tests/integration/defs/examples/test_multimodal.py::test_llm_multimodal_general[Mistral-Small-3.1-24B-Instruct-2503-pp:1-tp:1-bfloat16-bs:1-cpp_e2e:False-nb:1] -s -v

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--disable-fail-fast --skip-test --stage-list "A10-1, xxx" --gpu-type "A30, H100_PCIe" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-[Post-Merge]-1, xxx"]

Launch build/test pipelines. All previously running jobs will be killed.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests. Will also run L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-[Post-Merge]-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-[Post-Merge]-1, xxx".

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md.
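
For example, a hypothetical invocation that runs the pipeline on a specific GPU type with fail-fast disabled (using only the flags listed above):

/bot run --disable-fail-fast --gpu-type "H100_PCIe"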

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since a lack of care and validation can break the top of tree.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since a lack of care and validation can break the top of tree.

@brb-nv brb-nv requested a review from a team as a code owner June 24, 2025 23:00
@brb-nv brb-nv requested review from yiqingy0 and omera-nv June 24, 2025 23:32
…tral Small multimodal for BS=8

Signed-off-by: Balaram Buddharaju <[email protected]>
@brb-nv brb-nv force-pushed the user/brb/fix-mistral-small-intermittent-oom branch from fa3ae53 to 456d5cd Compare June 24, 2025 23:37
@brb-nv (Collaborator, Author) commented Jun 24, 2025

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator)

PR_Github #9767 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator)

PR_Github #9767 [ run ] completed with state SUCCESS
/LLM/release-0.21/L0_MergeRequest_PR pipeline #22 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

@chzblych chzblych merged commit 32f50de into NVIDIA:release/0.21 Jun 25, 2025
3 checks passed
dc3671 pushed a commit to dc3671/TensorRT-LLM that referenced this pull request Jul 1, 2025
…tral Small multimodal for BS=8 (NVIDIA#5453)

Signed-off-by: Balaram Buddharaju <[email protected]>
dc3671 pushed a commit that referenced this pull request Jul 1, 2025
…tral Small multimodal for BS=8 (#5453)

Signed-off-by: Balaram Buddharaju <[email protected]>
Shunkangz pushed a commit to Shunkangz/TensorRT-LLM that referenced this pull request Jul 2, 2025
…tral Small multimodal for BS=8 (NVIDIA#5453)

Signed-off-by: Balaram Buddharaju <[email protected]>
@brb-nv brb-nv deleted the user/brb/fix-mistral-small-intermittent-oom branch July 11, 2025 23:26