Conversation

amukkara
Collaborator

@amukkara amukkara commented Apr 24, 2025

Description

For some model and PP size combinations, num_hidden_layers % pp_size != 0. In such cases, this PR creates a balanced assignment of layers to PP ranks, with a few ranks each assigned just one extra layer.

For example, DeepSeek-V3 has 61 layers; with pp_size = 8, 61 % 8 = 5:
The first 5 ranks get 8 layers each, and the last 3 ranks get 7 layers each.

Before this change, the first 7 ranks got 7 layers each and the last rank got 12 layers, causing OOM on the last rank.
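
The arithmetic above can be sketched in plain Python. This is only an illustration, not the code from this PR: the function names `balanced_layer_counts` and `unbalanced_layer_counts` are hypothetical, and the "before" behavior is reconstructed from the description (floor division, with the remainder placed on the last rank).

```python
from typing import List


def balanced_layer_counts(num_hidden_layers: int, pp_size: int) -> List[int]:
    """Hypothetical helper: the first (num_hidden_layers % pp_size) ranks
    get one extra layer, so counts differ by at most one across ranks."""
    base, extra = divmod(num_hidden_layers, pp_size)
    return [base + 1 if rank < extra else base for rank in range(pp_size)]


def unbalanced_layer_counts(num_hidden_layers: int, pp_size: int) -> List[int]:
    """Hypothetical reconstruction of the old behavior described above:
    every rank gets the floor, and the last rank absorbs the remainder."""
    base = num_hidden_layers // pp_size
    return [base] * (pp_size - 1) + [num_hidden_layers - base * (pp_size - 1)]


if __name__ == "__main__":
    # DeepSeek-V3 example from the description: 61 layers, pp_size = 8.
    print(balanced_layer_counts(61, 8))
    print(unbalanced_layer_counts(61, 8))
```

With the balanced scheme the largest rank holds 8 layers instead of 12, which is what removes the memory pressure on the last rank.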

@amukkara
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #3247 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #3247 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #2255 completed with status: 'FAILURE'

@amukkara
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #3305 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #3305 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #2303 completed with status: 'FAILURE'

@amukkara
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #3339 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #3339 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #2330 completed with status: 'FAILURE'

@amukkara
Collaborator Author

/bot run

@amukkara amukkara requested a review from yuxianq April 28, 2025 21:14
@amukkara
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #3648 [ run ] triggered by Bot

Collaborator

@chang-l chang-l left a comment

Can you use tensor_split (ref) here for simplicity? I think it should do the same thing.

@tensorrt-cicd
Collaborator

PR_Github #3648 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #2581 completed with status: 'FAILURE'

@amukkara amukkara requested a review from a team as a code owner April 29, 2025 18:58
@amukkara
Collaborator Author

> Can you use tensor_split (ref) here for simplicity? I think it should do the same thing.

@chang-l done in 1c9388eecbba058020795df4901d5338c56d1389
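
For reference, a minimal sketch of how `torch.tensor_split` yields the same balanced split when given an integer number of sections. It assumes PyTorch is installed and is not the code from the commit above; the variable names are illustrative only.

```python
import torch

num_hidden_layers, pp_size = 61, 8

# With an integer section count, tensor_split gives the first
# (num_hidden_layers % pp_size) chunks one extra element each.
chunks = torch.tensor_split(torch.arange(num_hidden_layers), pp_size)

# Layer indices owned by each PP rank, and the per-rank layer counts.
layers_per_rank = [chunk.tolist() for chunk in chunks]
counts = [len(layers) for layers in layers_per_rank]

if __name__ == "__main__":
    print(counts)
```

The split is contiguous, so each rank owns a consecutive block of layer indices.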

@chang-l chang-l self-requested a review April 29, 2025 19:57
@amukkara amukkara removed the request for review from a team May 1, 2025 15:17
@amukkara amukkara force-pushed the pp-layer-balance branch from 1c9388e to 05ebebb Compare May 1, 2025 15:17
@amukkara
Collaborator Author

amukkara commented May 1, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #3928 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #3928 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #2787 completed with status: 'FAILURE'

@amukkara
Collaborator Author

amukkara commented May 2, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #3952 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #3952 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #2802 completed with status: 'FAILURE'

@amukkara
Collaborator Author

amukkara commented May 2, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #3974 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #3974 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #2819 completed with status: 'FAILURE'

@amukkara
Collaborator Author

This change is included in #4399 and #4034.

@amukkara amukkara closed this May 16, 2025
@amukkara amukkara deleted the pp-layer-balance branch August 14, 2025 01:20
5 participants