yuxianq
Collaborator

@yuxianq yuxianq commented May 19, 2025

DecoderModel's __pp_init__ is called before DecoderModelForCausalLM's __post_init__, so it fails to skip the weights defined in create_weights, which are only created inside __post_init__.
To fix this, we call DecoderModel's __pp_init__ inside DecoderModelForCausalLM's __pp_init__, which runs after its __post_init__.
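
The ordering issue described above can be illustrated with a minimal toy sketch. This is not the TensorRT-LLM implementation: ToyDecoderModel, ToyCausalLM, build, the weight names, and the "skip everything" rule are all hypothetical stand-ins, and the hooks are called by hand rather than by whatever mechanism the library uses. It only shows why a skip pass that runs before create_weights misses the weights created later.

```python
class ToyDecoderModel:
    """Hypothetical stand-in for the inner decoder model."""

    def __init__(self):
        self.weights = {"layers.0.attn": "allocated"}
        self.skipped = set()

    def create_weights(self):
        # Stand-in for weights that only come into existence late.
        self.weights["layers.0.moe_scales"] = "allocated"

    def __pp_init__(self):
        # Toy assumption: every weight that exists *right now* belongs to
        # another pipeline stage and is therefore marked as skipped.
        self.skipped = set(self.weights)


class ToyCausalLM:
    """Hypothetical stand-in for the outer causal-LM wrapper."""

    def __init__(self, inner_pp_init_early):
        self.inner_pp_init_early = inner_pp_init_early
        self.model = ToyDecoderModel()
        if inner_pp_init_early:
            # Old ordering: the inner skip pass runs before __post_init__.
            self.model.__pp_init__()

    def __post_init__(self):
        self.model.create_weights()

    def __pp_init__(self):
        if not self.inner_pp_init_early:
            # New ordering: the inner skip pass runs after __post_init__.
            self.model.__pp_init__()


def build(inner_pp_init_early):
    """Construct the toy wrapper and run its hooks in the stated order."""
    lm = ToyCausalLM(inner_pp_init_early)
    lm.__post_init__()
    lm.__pp_init__()
    return lm


if __name__ == "__main__":
    print(sorted(build(True).model.skipped))   # ['layers.0.attn']
    print(sorted(build(False).model.skipped))  # ['layers.0.attn', 'layers.0.moe_scales']
```

With the early ordering, the late-created weight is never marked as skipped; with the deferred ordering, it is.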

@yuxianq
Collaborator Author

yuxianq commented May 19, 2025

/bot run --disable-fail-fast --add-multi-gpu-test

@tensorrt-cicd
Collaborator

PR_Github #5739 [ run ] triggered by Bot

Collaborator

@Barry-Delaney Barry-Delaney left a comment

LGTM. Local tests passed.

Collaborator

@amukkara amukkara left a comment

does any CI test fail before this change?

@tensorrt-cicd
Collaborator

PR_Github #5739 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #4195 completed with status: 'FAILURE'

@yuxianq
Collaborator Author

yuxianq commented May 20, 2025

> does any CI test fail before this change?

@amukkara No. @Barry-Delaney hit an OOM issue when running python examples/pytorch/quickstart_advanced.py --model_dir /llm-models/DeepSeek-R1/DeepSeek-R1-W4AFP8 --tp_size 2 --pp_size 2 --moe_ep_size 1 --moe_tp_size 2 on H200x4. With this PR applied, that run passes. Our CI does not currently include any DeepSeek-R1 test.

@yuxianq
Collaborator Author

yuxianq commented May 20, 2025

/bot run --disable-fail-fast --add-multi-gpu-test

@tensorrt-cicd
Collaborator

PR_Github #5812 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #5812 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #4257 completed with status: 'FAILURE'

@yuxianq
Collaborator Author

yuxianq commented May 20, 2025

/bot run --disable-fail-fast --add-multi-gpu-test

@tensorrt-cicd
Collaborator

PR_Github #5871 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #5871 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #4302 completed with status: 'SUCCESS'

@Barry-Delaney Barry-Delaney merged commit 62c16b6 into NVIDIA:main May 21, 2025
3 checks passed
yuxianq added a commit to yuxianq/TensorRT-LLM that referenced this pull request May 21, 2025
yuxianq added a commit that referenced this pull request May 21, 2025
fix: skip weights defined in create_weights for pp. (#4447)

Signed-off-by: Yuxian Qiu <[email protected]>
4 participants