Skip to content

Conversation

@NickLucche
Copy link
Collaborator

@NickLucche NickLucche commented May 13, 2025

What this PR does:

  • Adds support for heterogenous TP sizes. So far tested with P_TP=1 and D_TP=2/4.
  • It does so by splitting the remote kv cache among D_TP/P_TP D workers along the kv_head dim. block_size and kv precision must match.
  • (JIT) Discovery is done by first querying the rank0 of the dst_engine_id to get the TP size of destination group. After that, every D TP worker will only pull from a single remote P TP worker, hence the setup only requires two exchanges on the side channel (rank0 and target rank_j).

How to test

DECODER_TP_SIZE=2 PREFILLER_TP_SIZE=2 NUM_DECODE_INSTANCES=1 bash tests/v1/kv_connector/nixl_integration/run_accuracy_test.sh 

DP TP worker rank and kv split assignment:

image

@NickLucche NickLucche changed the title [P/D] Heterogenous tp [P/D] Heterogenous TP May 13, 2025
@mergify mergify bot added the v1 label May 13, 2025
async def send_request_to_service(client_info: dict, endpoint: str,
req_data: dict, request_id: str):
"""
Send a request to a service using a client from the pool.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pulled this from the old nm/disagg_pd_dev branch. The one on was triggering some TP communication issues between workers.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's hold until I figure this out

}

set_cli_args() {
PREFILLER_TP_SIZE=1
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose this could be a simpler pair of env vars

@mergify
Copy link

mergify bot commented May 15, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @NickLucche.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label May 15, 2025
PORT=$((8200 + i))
# Calculate side channel port
SIDE_CHANNEL_PORT=$((5659 + i))
SIDE_CHANNEL_PORT=$((5659 + i * $DECODER_TP_SIZE))
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is bugged only works with 1P 1D btw

NickLucche added 10 commits May 20, 2025 07:45
fix descr indexing

change remote worker selection indexing; test ptp2-dtp4

Signed-off-by: nicklucche <[email protected]>
Signed-off-by: nicklucche <[email protected]>
Signed-off-by: nicklucche <[email protected]>
Signed-off-by: nicklucche <[email protected]>
Signed-off-by: nicklucche <[email protected]>
Signed-off-by: nicklucche <[email protected]>
Signed-off-by: nicklucche <[email protected]>
@NickLucche
Copy link
Collaborator Author

Update:
this PR was tested on lm-eval (run_accuracy_test.sh ) on nixl 0.2.1 (4f37f07) .

Unfortunately it does not work with nixl 0.1.1, as during lm-eval the decoder will hang, triggering TimeoutError.
I've spent some time trying to nail this down, but I am still unsure about the root cause.

Signed-off-by: nicklucche <[email protected]>
@mergify
Copy link

mergify bot commented May 21, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @NickLucche.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label May 21, 2025
@robertgshaw2-redhat robertgshaw2-redhat changed the title [P/D] Heterogenous TP [P/D] Heterogeneous TP May 25, 2025
@NickLucche
Copy link
Collaborator Author

Closing in favor of #18833.
The higher number of (small-size) descriptors this PR proposed didn't work with nixl, causing significant slowdowns.

@NickLucche NickLucche closed this Jun 4, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant