
Conversation

@hyukn (Collaborator) commented Jul 3, 2025

The output shapes of the fusedLayerNorm plugin for nvFP4 are mismatched. The resulting out-of-range writes pollute the barrier buffer of the one-shot allreduce kernel, which causes the hang.
This is likely also the root cause of the accuracy issue that @zihaok recently reported, since other data buffers are corrupted by the same out-of-range memory writes.

@hyukn hyukn requested review from liji-nv and zihaok July 3, 2025 06:25
@hyukn hyukn requested a review from a team as a code owner July 3, 2025 06:25
@hyukn hyukn force-pushed the fix/5321981 branch 2 times, most recently from 6de8fa2 to d694659 Compare July 3, 2025 06:39
@hyukn (Collaborator, Author) commented Jul 3, 2025

/bot run --disable-fail-fast --add-multi-gpu-test

@tensorrt-cicd

PR_Github #10770 [ run ] triggered by Bot

@hyukn hyukn requested a review from litaotju July 3, 2025 06:58
@hyukn hyukn changed the title [5321981] fix: Fix the Llama-405B hanging issue. [5321981] fix: Fix the Llama 3.1-405B hanging issue. Jul 3, 2025
@hyukn hyukn changed the title [5321981] fix: Fix the Llama 3.1-405B hanging issue. [5321981] fix: Fix the Llama3.1 405B hanging issue. Jul 3, 2025
@tensorrt-cicd

PR_Github #10770 [ run ] completed with state SUCCESS
/LLM/release-0.21/L0_MergeRequest_PR pipeline #143 completed with status: 'FAILURE'

@hyukn (Collaborator, Author) commented Jul 3, 2025

/bot run --disable-fail-fast --add-multi-gpu-test

@tensorrt-cicd

PR_Github #10827 [ run ] triggered by Bot

@tensorrt-cicd

PR_Github #10827 [ run ] completed with state SUCCESS
/LLM/release-0.21/L0_MergeRequest_PR pipeline #148 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

@hyukn (Collaborator, Author) commented Jul 4, 2025

/bot run

@tensorrt-cicd

PR_Github #10887 [ run ] triggered by Bot

@tensorrt-cicd

PR_Github #10887 [ run ] completed with state SUCCESS
/LLM/release-0.21/L0_MergeRequest_PR pipeline #157 completed with status: 'SUCCESS'

@hyukn hyukn merged commit b0354ef into NVIDIA:release/0.21 Jul 4, 2025
3 checks passed
dc3671 pushed a commit to dc3671/TensorRT-LLM that referenced this pull request Jul 10, 2025
Correct the output shape of the fusedLayerNormPlugin.

Signed-off-by: Yukun He <[email protected]>
nvzhihanj pushed a commit to nvzhihanj/TensorRT-LLM that referenced this pull request Jul 10, 2025
Correct the output shape of the fusedLayerNormPlugin.

Signed-off-by: Yukun He <[email protected]>
hyukn added a commit that referenced this pull request Jul 10, 2025
nvzhihanj added a commit that referenced this pull request Jul 11, 2025
zhou-yuxin pushed a commit to zhou-yuxin/TensorRT-LLM that referenced this pull request Jul 15, 2025