omera-nv
Collaborator

Clear torch CUDA cache before unittests

We encountered a test that failed with OOM errors, which were resolved by calling torch.cuda.empty_cache() at the start of the test. This PR adds that call to all unit tests, so each one starts with an empty torch CUDA cache and can make full use of the available device memory.
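
For illustration only, here is a minimal sketch of how a per-test cache clear could be wired up with pytest. The diff itself is not shown in this conversation, so the `conftest.py` placement, the helper name `clear_cuda_cache`, and the use of the `pytest_runtest_setup` hook are all assumptions, not necessarily what this PR does.

```python
# conftest.py (hypothetical placement; the PR's actual diff is not shown here)


def clear_cuda_cache() -> bool:
    """Release unused cached CUDA memory held by PyTorch's caching allocator.

    Returns True if the cache was cleared, False if torch or CUDA is unavailable.
    """
    try:
        import torch  # imported lazily so CPU-only environments still work
    except ImportError:
        return False
    if not torch.cuda.is_available():
        return False
    torch.cuda.empty_cache()
    return True


def pytest_runtest_setup(item):
    """pytest hook invoked during the setup phase of every collected test."""
    clear_cuda_cache()
```

Note that `torch.cuda.empty_cache()` only returns cached blocks that no tensor is using anymore. Memory still referenced by live tensors from an earlier test is not released, so a setup like this may need to be combined with dropping those references first.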

@omera-nv omera-nv requested review from kaiyux and tomeras91 June 11, 2025 07:42
@omera-nv
Collaborator Author

/bot run

@omera-nv omera-nv changed the title [fix] clear cuda cache before unittests automatically [fix][test] clear cuda cache before unittests automatically Jun 11, 2025
@tensorrt-cicd
Collaborator

PR_Github #8436 [ run ] triggered by Bot

@omera-nv
Collaborator Author

/bot kill

@omera-nv omera-nv force-pushed the fix/clear_cuda_cache_before_unittests branch from c676235 to b031afa Compare June 11, 2025 07:51
@omera-nv
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #8441 [ kill ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #8436 [ run ] completed with state ABORTED

@tensorrt-cicd
Collaborator

PR_Github #8441 [ kill ] completed with state SUCCESS
Successfully killed previous jobs for commit b031afa

@tensorrt-cicd
Collaborator

PR_Github #8445 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #8445 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #6116 completed with status: 'FAILURE'

@omera-nv
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #8501 [ run ] triggered by Bot

@omera-nv omera-nv force-pushed the fix/clear_cuda_cache_before_unittests branch from b031afa to 8102c47 Compare June 11, 2025 19:38
@omera-nv
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #8537 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #8537 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #6190 completed with status: 'SUCCESS'

@omera-nv omera-nv force-pushed the fix/clear_cuda_cache_before_unittests branch from 8102c47 to 647f992 Compare June 16, 2025 20:51
@omera-nv
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #9095 [ run ] triggered by Bot

@omera-nv
Collaborator Author

/bot kill

@tensorrt-cicd
Collaborator

PR_Github #9104 [ kill ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #9095 [ run ] completed with state ABORTED

@tensorrt-cicd
Collaborator

PR_Github #9104 [ kill ] completed with state SUCCESS
Successfully killed previous jobs for commit 647f992

@omera-nv omera-nv force-pushed the fix/clear_cuda_cache_before_unittests branch from 647f992 to 85e6939 Compare June 17, 2025 10:44
@omera-nv
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #9193 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #9193 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #6737 completed with status: 'FAILURE'

@omera-nv
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #9202 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #9202 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #6744 completed with status: 'FAILURE'

@omera-nv omera-nv force-pushed the fix/clear_cuda_cache_before_unittests branch from 5efca1d to 1abca40 Compare June 17, 2025 19:23
@omera-nv
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #9244 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #9244 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #6780 completed with status: 'FAILURE'

@omera-nv omera-nv force-pushed the fix/clear_cuda_cache_before_unittests branch from 1abca40 to 38eab7d Compare June 17, 2025 23:51
@omera-nv
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #9250 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #9250 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #6786 completed with status: 'FAILURE'

@omera-nv
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #9306 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #9306 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #6829 completed with status: 'FAILURE'

@omera-nv
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #9410 [ run ] triggered by Bot

@omera-nv omera-nv force-pushed the fix/clear_cuda_cache_before_unittests branch from ed2b4ab to 09929bd Compare June 18, 2025 18:15
@omera-nv
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #9411 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #9410 [ run ] completed with state ABORTED

@tensorrt-cicd
Collaborator

PR_Github #9411 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #6904 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

@omera-nv omera-nv merged commit 0b6d005 into NVIDIA:main Jun 18, 2025
3 checks passed
4 participants