Conversation

brb-nv
Collaborator

@brb-nv brb-nv commented May 30, 2025

Description

This addresses https://nvbugspro.nvidia.com/bug/5301221.

We unset the WORLD_SIZE environment variable when running tests on specific cluster nodes to work around a bug in the transformers library. However, Trainer initialization in the get_dummy_spec_decoding_heads() function fails if WORLD_SIZE is unset. This PR preemptively skips the affected tests when WORLD_SIZE is unset.
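A minimal sketch of the skip condition described above, using only the standard library. The diff itself is not shown in this conversation, so the helper `world_size_is_set` and the test class are hypothetical names for illustration, not the code this PR adds.

```python
import os
import unittest


def world_size_is_set(env=None):
    """Return True if WORLD_SIZE is present in the given environment mapping.

    Hypothetical helper; defaults to the real process environment.
    """
    env = os.environ if env is None else env
    return "WORLD_SIZE" in env


class SpecDecodingHeadsTest(unittest.TestCase):  # hypothetical test class
    # The decorator is evaluated at import time, so the test is reported as
    # skipped (not failed) on nodes where WORLD_SIZE has been unset.
    @unittest.skipUnless(world_size_is_set(),
                         "WORLD_SIZE is unset; Trainer initialization would fail")
    def test_dummy_heads(self):
        # Placeholder body: a real test would exercise
        # get_dummy_spec_decoding_heads() here.
        self.assertTrue(world_size_is_set())
```

The same check can be written with a pytest `skipif` marker if the test suite uses pytest. The point is that the environment is only read, never modified.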

Alternative:
Set WORLD_SIZE during Trainer initialization (#4742). I'm avoiding this because it could pollute the environment.
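To illustrate the pollution concern: assigning `os.environ["WORLD_SIZE"]` directly leaves the value set for every later test in the same process. The sketch below shows one way such a change could be scoped, using `unittest.mock.patch.dict`. It is an assumption for illustration only: the contents of #4742 are not shown here, and `run_with_world_size` is a hypothetical helper.

```python
import os
from unittest import mock


def run_with_world_size(fn, world_size="1"):
    """Call fn() with WORLD_SIZE set, then restore the previous environment.

    Hypothetical helper: patch.dict puts os.environ back to its original
    contents on exit, even if fn() raises.
    """
    with mock.patch.dict(os.environ, {"WORLD_SIZE": world_size}):
        return fn()
```

Even when scoped like this, code that runs inside `fn()` still sees the overridden value, which is one reason skipping the tests can be the simpler choice.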

Test Coverage

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--disable-fail-fast --skip-test --stage-list "A10-1, xxx" --gpu-type "A30, H100_PCIe" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-[Post-Merge]-1, xxx"]

Launch build/test pipelines. All previously running jobs will be killed.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests. Will also run L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-[Post-Merge]-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-[Post-Merge]-1, xxx".

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

@brb-nv brb-nv requested a review from a team as a code owner May 30, 2025 03:14
@brb-nv brb-nv force-pushed the user/brb/skip-tests-when-worldsize-missing branch from 82feaee to 581fbb0 Compare May 30, 2025 03:14
@brb-nv
Collaborator Author

brb-nv commented May 30, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #7008 [ run ] triggered by Bot

@brb-nv brb-nv requested a review from xinhe-nv May 30, 2025 03:33
@tensorrt-cicd
Collaborator

PR_Github #7008 [ run ] completed with state SUCCESS
/LLM/release-0.20/L0_MergeRequest_PR pipeline #132 completed with status: 'FAILURE'

@brb-nv
Collaborator Author

brb-nv commented May 30, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #7087 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #7087 [ run ] completed with state SUCCESS
/LLM/release-0.20/L0_MergeRequest_PR pipeline #139 completed with status: 'FAILURE'

@brb-nv
Collaborator Author

brb-nv commented May 30, 2025

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #7096 Bot args parsing error: usage: /bot [-h]
{run,kill,skip,submit,reviewers,reuse-pipeline,reuse-review} ...
/bot: error: unrecognized arguments: disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #7099 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #7099 [ run ] completed with state SUCCESS
/LLM/release-0.20/L0_MergeRequest_PR pipeline #145 completed with status: 'FAILURE'

@SimengLiu-nv
Collaborator

/bot skip --comment "Fail on an unknown error that's unrelated to the PR."

@tensorrt-cicd
Collaborator

PR_Github #7107 Bot args parsing error: Failed to parse bot args

@SimengLiu-nv
Collaborator

/bot skip --comment "Pipeline passes for related tests. The failure is on a known CI issue."

@tensorrt-cicd
Collaborator

PR_Github #7109 [ skip ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #7109 [ skip ] completed with state SUCCESS
Skipping testing for commit 581fbb0

@brb-nv brb-nv force-pushed the user/brb/skip-tests-when-worldsize-missing branch from 83aa855 to 0c51b3f Compare May 31, 2025 21:58
@brb-nv
Collaborator Author

brb-nv commented May 31, 2025

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #7145 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #7145 [ run ] completed with state SUCCESS
/LLM/release-0.20/L0_MergeRequest_PR pipeline #150 completed with status: 'FAILURE'

@brb-nv
Collaborator Author

brb-nv commented Jun 1, 2025

/bot skip --comment "Failing tests unrelated to incoming changes."

@tensorrt-cicd
Collaborator

PR_Github #7150 [ skip ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #7150 [ skip ] completed with state SUCCESS
Skipping testing for commit 0c51b3f

@MartinMarciniszyn MartinMarciniszyn merged commit 7a2cd25 into NVIDIA:release/0.20 Jun 2, 2025
3 checks passed
omera-nv pushed a commit to omera-nv/TensorRT-LLM that referenced this pull request Jun 7, 2025
@brb-nv brb-nv deleted the user/brb/skip-tests-when-worldsize-missing branch July 11, 2025 23:26
5 participants