
Conversation

@HuiGao-NV
Collaborator

Use runtime total GPU memory to calculate KV cache memory and log more memory information.
This avoids under-reporting the available KV cache memory.
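
A minimal sketch of the approach, assuming a PyTorch runtime; the function name `estimate_kv_cache_bytes` and the `free_gpu_memory_fraction` parameter are illustrative, not the actual TensorRT-LLM API:

```python
import torch

def estimate_kv_cache_bytes(free_gpu_memory_fraction: float = 0.9) -> int:
    """Illustrative sketch: size the KV cache from the GPU memory that is
    actually free at runtime, rather than from the device's nominal total."""
    # (free_bytes, total_bytes) as reported by the CUDA runtime; memory already
    # held by other allocations is excluded from free_bytes.
    free_bytes, total_bytes = torch.cuda.mem_get_info()

    # Memory used outside the torch caching allocator (CUDA context, NCCL,
    # other libraries) is the gap between device-reported usage and what
    # torch itself has reserved.
    non_torch_bytes = (total_bytes - free_bytes) - torch.cuda.memory_reserved()

    kv_cache_bytes = int(free_bytes * free_gpu_memory_fraction)

    # Log the full breakdown so an undersized KV cache is easy to diagnose.
    print(f"total={total_bytes >> 20} MiB, free={free_bytes >> 20} MiB, "
          f"non_torch={non_torch_bytes >> 20} MiB, kv_cache={kv_cache_bytes >> 20} MiB")
    return kv_cache_bytes
```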

@HuiGao-NV HuiGao-NV requested review from a team as code owners May 26, 2025 08:51
@HuiGao-NV HuiGao-NV changed the base branch from main to release/0.20 May 26, 2025 08:51
@HuiGao-NV HuiGao-NV requested a review from a team as a code owner May 26, 2025 08:51
@HuiGao-NV HuiGao-NV requested a review from litaotju May 26, 2025 08:51
@HuiGao-NV
Collaborator Author

/bot run

1 similar comment
@HuiGao-NV
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #6467 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #6467 [ run ] completed with state SUCCESS
/LLM/release-0.20/L0_MergeRequest_PR pipeline #70 completed with status: 'FAILURE'

@HuiGao-NV
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #6510 [ run ] triggered by Bot

@HuiGao-NV
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #6514 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #6510 [ run ] completed with state ABORTED
/LLM/release-0.20/L0_MergeRequest_PR pipeline #74 completed with status: 'FAILURE'

@HuiGao-NV HuiGao-NV changed the title from "Use runtime total gpu memory to calculate kv cache memory and log more memory information" to "fix: [nvbug5300494] Use runtime total gpu memory to calculate kv cache memory and log more memory information" May 27, 2025
@tensorrt-cicd
Collaborator

PR_Github #6514 [ run ] completed with state SUCCESS
/LLM/release-0.20/L0_MergeRequest_PR pipeline #75 completed with status: 'FAILURE'

@HuiGao-NV
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #6586 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #6586 [ run ] completed with state SUCCESS
/LLM/release-0.20/L0_MergeRequest_PR pipeline #82 completed with status: 'FAILURE'

@HuiGao-NV
Collaborator Author

/bot run --stage-list="H100_PCIe-PyTorch-3"

@tensorrt-cicd
Collaborator

PR_Github #6651 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #6651 [ run ] completed with state SUCCESS
/LLM/release-0.20/L0_MergeRequest_PR pipeline #87 (Partly Tested) completed with status: 'SUCCESS'

@HuiGao-NV HuiGao-NV enabled auto-merge (squash) May 27, 2025 23:41
Change the method used to compute peak memory
Set a new peak memory threshold for the test_ptq_quickstart_advanced_mtp case
Get non-torch memory at the start of KV cache memory estimation

Signed-off-by: Hui Gao <[email protected]>
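
A hedged sketch of the two memory-accounting changes listed above, assuming a PyTorch runtime; the helper names `snapshot_non_torch_bytes` and `estimate_peak_bytes` are hypothetical, not the actual implementation:

```python
import torch

def snapshot_non_torch_bytes() -> int:
    # Non-torch memory at the start of KV cache estimation: everything the
    # device reports as used minus what the torch caching allocator reserves.
    free_bytes, total_bytes = torch.cuda.mem_get_info()
    return (total_bytes - free_bytes) - torch.cuda.memory_reserved()

def estimate_peak_bytes(non_torch_baseline: int) -> int:
    # Peak memory = torch's peak allocation during the profiling run plus the
    # non-torch baseline captured beforehand, so the estimate reflects
    # everything resident on the GPU, not just torch tensors.
    return torch.cuda.max_memory_allocated() + non_torch_baseline
```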
@HuiGao-NV
Collaborator Author

/bot skip --comment="CI has passed."

@tensorrt-cicd
Collaborator

PR_Github #6687 [ skip ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #6687 [ skip ] completed with state SUCCESS
Skipping testing for commit d97ee2e

@HuiGao-NV HuiGao-NV merged commit 1bfc7d4 into NVIDIA:release/0.20 May 28, 2025
3 checks passed
shaharmor98 pushed a commit to shaharmor98/tekit that referenced this pull request May 28, 2025
…e memory and log more memory information (NVIDIA#4660)

Signed-off-by: Hui Gao <[email protected]>
@HuiGao-NV HuiGao-NV deleted the oom_testing branch June 3, 2025 00:58
omera-nv pushed a commit to omera-nv/TensorRT-LLM that referenced this pull request Jun 7, 2025
…e memory and log more memory information (NVIDIA#4660)

Signed-off-by: Hui Gao <[email protected]>