@ixlmar
Collaborator

@ixlmar ixlmar commented May 20, 2025

feat: forward exceptions to Python and catch OOMs

Description

This PR prints additional diagnostic information when PyExecutor creation fails due to insufficient GPU memory.

Note: Commit 609e1bc is from PR #4493.

For instance, if ncclCommInitRank fails in AllgatherOp.initialize, the following error message is emitted:

RuntimeError: Executor creation failed with an error which might indicate insufficient GPU memory.

The following component could not be created: Additional executor resources (temporary for KV cache size estimation)
Total GPU memory (GiB): 95.09
Free GPU memory before component creation attempt (GiB): 0.32

Previously created components and free GPU memory before/after creation (GiB):
Model: 93.99 / 10.60
Sampler: 10.60 / 6.38
Initial KV cache (temporary for KV cache size estimation): 6.38 / 0.32

Please refer to the TensorRT-LLM documentation for information on how to control the memory usage through TensorRT-LLM configuration options. Possible options include:
  Model: reduce max_num_tokens and/or shard the model weights across GPUs by enabling pipeline and/or tensor parallelism
  Sampler: reduce max_seq_len and/or max_attention_window_size
  Initial KV cache (temporary for KV cache size estimation): reduce max_num_tokens
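
The bookkeeping behind a report like the one above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from this PR: `MemoryTracker`, `ComponentCreationError`, and the injected `free_gib` callable are hypothetical names chosen for the example, and the sketch reproduces only the memory summary, not the configuration hints.

```python
from contextlib import contextmanager
from typing import Callable, Iterator, List, Tuple


class ComponentCreationError(RuntimeError):
    """Hypothetical error type; the output above shows a plain RuntimeError."""


class MemoryTracker:
    """Records free GPU memory before/after each component is created."""

    def __init__(self, free_gib: Callable[[], float], total_gib: float) -> None:
        # free_gib is injected so the sketch runs without a GPU (assumption).
        self._free_gib = free_gib
        self._total_gib = total_gib
        # (component name, free GiB before, free GiB after)
        self.created: List[Tuple[str, float, float]] = []

    @contextmanager
    def creating(self, name: str) -> Iterator[None]:
        before = self._free_gib()
        try:
            yield
        except Exception as exc:
            # Keep the original error as __cause__ so no detail is lost.
            raise ComponentCreationError(self._report(name, before)) from exc
        self.created.append((name, before, self._free_gib()))

    def _report(self, failed: str, before: float) -> str:
        lines = [
            "Executor creation failed with an error which might indicate "
            "insufficient GPU memory.",
            "",
            f"The following component could not be created: {failed}",
            f"Total GPU memory (GiB): {self._total_gib:.2f}",
            f"Free GPU memory before component creation attempt (GiB): {before:.2f}",
            "",
            "Previously created components and free GPU memory "
            "before/after creation (GiB):",
        ]
        lines += [f"{n}: {b:.2f} / {a:.2f}" for n, b, a in self.created]
        return "\n".join(lines)
```

Each component would be built inside `with tracker.creating("Model"): ...`. On a real GPU, `free_gib` could be backed by `torch.cuda.mem_get_info()`, which returns free and total memory in bytes.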

The message says "might indicate" because the NCCL error text is only "unhandled cuda error (run with NCCL_DEBUG=INFO for details)"; the underlying out-of-memory failure becomes visible only when running with NCCL_DEBUG=WARN:

... [0] include/alloc.h:228 NCCL WARN Cuda failure 2 'out of memory'
... [0] include/alloc.h:346 NCCL WARN Failed to CUDA calloc async 32 bytes

However, NCCL currently offers no API to retrieve this extra information (because ncclCommInitRank itself fails, ncclGetLastError cannot be used).
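
Until such an API exists, one workaround is to scan captured NCCL log output for the CUDA failure line shown above. The sketch below is a hedged illustration, not part of this PR: `find_cuda_failure` and `looks_like_oom` are hypothetical helpers, the regular expression is derived only from the two sample WARN lines, and how the log lines are captured is left open.

```python
import os
import re
from typing import Iterable, Optional, Tuple

# NCCL_DEBUG is a real NCCL environment variable. It has to be set before
# NCCL is initialized in the process for the WARN lines to be printed.
os.environ.setdefault("NCCL_DEBUG", "WARN")

# Pattern inferred from the sample log lines above (assumption: other NCCL
# versions may format this message differently).
_CUDA_FAILURE = re.compile(r"NCCL WARN Cuda failure (\d+) '([^']*)'")


def find_cuda_failure(log_lines: Iterable[str]) -> Optional[Tuple[int, str]]:
    """Return (CUDA error code, error text) from the first matching line."""
    for line in log_lines:
        match = _CUDA_FAILURE.search(line)
        if match:
            return int(match.group(1)), match.group(2)
    return None


def looks_like_oom(log_lines: Iterable[str]) -> bool:
    """True if the captured log contains a CUDA 'out of memory' failure."""
    failure = find_cuda_failure(log_lines)
    return failure is not None and failure[1] == "out of memory"
```

Because this relies on log text rather than a stable interface, a negative result does not rule out an out-of-memory condition.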

Test Coverage

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--disable-fail-fast --skip-test --stage-list "A10-1, xxx" --gpu-type "A30, H100_PCIe" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-[Post-Merge]-1, xxx"]

Launch build/test pipelines. All previously running jobs will be killed.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests. Will also run L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-[Post-Merge]-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-[Post-Merge]-1, xxx".

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

@ixlmar ixlmar requested review from DomBrown and dcampora May 20, 2025 16:13
@ixlmar
Collaborator Author

ixlmar commented May 20, 2025

/bot run

@ixlmar ixlmar marked this pull request as ready for review May 20, 2025 16:47
@tensorrt-cicd
Collaborator

PR_Github #5897 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #5897 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #4321 completed with status: 'FAILURE'

@ixlmar ixlmar requested a review from a team as a code owner May 22, 2025 08:56
@ixlmar ixlmar requested a review from hyukn May 22, 2025 08:56
@ixlmar
Collaborator Author

ixlmar commented May 22, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #6131 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #6131 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #4480 completed with status: 'FAILURE'

@ixlmar
Collaborator Author

ixlmar commented May 23, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #6256 [ run ] triggered by Bot

@ixlmar
Collaborator Author

ixlmar commented May 23, 2025

/bot kill

@tensorrt-cicd
Collaborator

PR_Github #6271 [ kill ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #6256 [ run ] completed with state ABORTED

@tensorrt-cicd
Collaborator

PR_Github #6271 [ kill ] completed with state SUCCESS
Successfully killed previous jobs for commit 502eaed

@ixlmar
Collaborator Author

ixlmar commented May 23, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #6275 [ run ] triggered by Bot

@ixlmar ixlmar requested a review from hyukn May 23, 2025 09:10
@tensorrt-cicd
Collaborator

PR_Github #6275 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #4584 completed with status: 'FAILURE'

@ixlmar
Collaborator Author

ixlmar commented May 23, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #6453 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #4722 completed with status: 'FAILURE'

@ixlmar
Collaborator Author

ixlmar commented May 27, 2025

/bot run

@ixlmar
Collaborator Author

ixlmar commented May 27, 2025

/bot kill

@tensorrt-cicd
Collaborator

PR_Github #6556 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #6559 [ kill ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #6556 [ run ] completed with state ABORTED

@tensorrt-cicd
Collaborator

PR_Github #6559 [ kill ] completed with state SUCCESS
Successfully killed previous jobs for commit 8d759da

@ixlmar
Collaborator Author

ixlmar commented May 27, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #6572 [ run ] triggered by Bot

@ixlmar
Collaborator Author

ixlmar commented May 27, 2025

/bot kill

@tensorrt-cicd
Collaborator

PR_Github #6583 [ kill ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #6572 [ run ] completed with state ABORTED

@tensorrt-cicd
Collaborator

PR_Github #6583 [ kill ] completed with state SUCCESS
Successfully killed previous jobs for commit 8d759da

@ixlmar
Collaborator Author

ixlmar commented May 27, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #6588 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #6588 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #4815 completed with status: 'SUCCESS'

@ixlmar
Collaborator Author

ixlmar commented May 28, 2025

/bot reuse-pipeline

@tensorrt-cicd
Collaborator

PR_Github #6738 [ reuse-pipeline ] triggered by Bot

@ixlmar
Collaborator Author

ixlmar commented May 28, 2025

/bot reuse-pipeline

@tensorrt-cicd
Collaborator

PR_Github #6748 [ reuse-pipeline ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #6748 [ reuse-pipeline ] completed with state SUCCESS
Reusing PR_Github #6588 for commit 24ab553

@dcampora dcampora merged commit fbe4db2 into NVIDIA:main May 28, 2025
3 checks passed
@ixlmar ixlmar deleted the feat/catch-oom branch May 28, 2025 09:59
darraghdog pushed a commit to darraghdog/TensorRT-LLM that referenced this pull request Jun 3, 2025