[https://nvbugs/5154414][fix] Balanced layer to PP rank assignment #3827
Conversation
Can you use `tensor_split` (ref) here for simplicity? I think it should do the same thing.
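The `tensor_split` the reviewer mentions is presumably `torch.tensor_split`, which splits a sequence into `sections` nearly equal contiguous chunks, giving the first `n % sections` chunks one extra element. A pure-Python sketch of those chunk semantics, applied to layer indices (the helper name is mine, not from the PR):

```python
def layer_ranges(num_layers: int, pp_size: int) -> list[range]:
    """Split layer indices 0..num_layers-1 into pp_size contiguous
    chunks the way torch.tensor_split would: the first
    num_layers % pp_size chunks get one extra layer."""
    base, extra = divmod(num_layers, pp_size)
    ranges, start = [], 0
    for rank in range(pp_size):
        size = base + (1 if rank < extra else 0)
        ranges.append(range(start, start + size))
        start += size
    return ranges

# Deepseek-V3 case from this PR: 61 layers, pp_size = 8
print([len(r) for r in layer_ranges(61, 8)])  # [8, 8, 8, 8, 8, 7, 7, 7]
```

This yields the same chunk sizes as the balanced assignment the PR implements, which supports the reviewer's point that `tensor_split` could replace the hand-rolled logic.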
Signed-off-by: Anurag Mukkara <[email protected]>
Description
For some model and PP-size combinations, `num_hidden_layers % pp_size != 0`. This PR creates a balanced assignment of layers to PP ranks in such cases, with a few ranks assigned just one extra layer.

For example, Deepseek-V3 has 61 layers; with pp_size = 8, 61 % 8 = 5, so the first 5 ranks get 8 layers each and the last 3 ranks get 7 layers each.

Before this change, the first 7 ranks got 7 layers each and the last rank got 12 layers, causing an OOM on the last rank.
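The balanced scheme described above can be sketched as a per-rank layer count: each rank gets `num_hidden_layers // pp_size` layers, and the first `num_hidden_layers % pp_size` ranks get one extra. This is a hypothetical helper illustrating the assignment, not the PR's actual code:

```python
def layers_per_rank(num_hidden_layers: int, pp_size: int) -> list[int]:
    """Balanced layer-to-PP-rank assignment: the first
    num_hidden_layers % pp_size ranks each take one extra layer,
    so no rank differs from another by more than one layer."""
    base, extra = divmod(num_hidden_layers, pp_size)
    return [base + 1 if rank < extra else base for rank in range(pp_size)]

# Deepseek-V3 example from the description: 61 layers, pp_size = 8
print(layers_per_rank(61, 8))  # [8, 8, 8, 8, 8, 7, 7, 7]
```

Contrast with the old behavior, where the remainder (12 - 7 = 5 extra layers) all piled onto the last rank and pushed it out of memory.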