[P/D] Heterogeneous TP #18079
async def send_request_to_service(client_info: dict, endpoint: str,
                                  req_data: dict, request_id: str):
    """
    Send a request to a service using a client from the pool.
I pulled this from the old nm/disagg_pd_dev branch. The one on was triggering some TP communication issues between workers.
Let's hold off until I figure this out.
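For context on what the hunk above does, here is a minimal sketch of sending requests through a round-robin client pool. It assumes each `client_info` dict carries a `"client"` object with an async `post(endpoint, json=..., headers=...)` method (the shape mirrors `httpx.AsyncClient.post`), and that the request id travels in an `X-Request-Id` header; `StubClient`, `make_round_robin_pool`, `demo`, and that header name are hypothetical, not taken from the PR.

```python
import asyncio
import itertools
from dataclasses import dataclass, field


@dataclass
class StubClient:
    """Hypothetical stand-in for an async HTTP client bound to one service instance."""
    host: str
    port: int
    sent: list = field(default_factory=list)

    async def post(self, endpoint: str, json: dict, headers: dict) -> dict:
        # Record the call instead of doing real network I/O.
        self.sent.append((endpoint, json, headers))
        await asyncio.sleep(0)
        return {"host": self.host, "port": self.port, "endpoint": endpoint}


async def send_request_to_service(client_info: dict, endpoint: str,
                                  req_data: dict, request_id: str) -> dict:
    """Send a request to a service using a client from the pool."""
    client = client_info["client"]
    # Assumption: the request id travels as a header so both sides can log it.
    headers = {"X-Request-Id": request_id}
    return await client.post(endpoint, json=req_data, headers=headers)


def make_round_robin_pool(instances: list[tuple[str, int]]):
    """Cycle over one client_info dict per (host, port) instance."""
    infos = [{"client": StubClient(host, port), "host": host, "port": port}
             for host, port in instances]
    return itertools.cycle(infos)


async def demo() -> list[int]:
    pool = make_round_robin_pool([("localhost", 8100), ("localhost", 8101)])
    ports = []
    for i in range(3):
        reply = await send_request_to_service(
            next(pool), "/v1/completions", {"prompt": "hi"}, f"req-{i}")
        ports.append(reply["port"])
    return ports


if __name__ == "__main__":
    print(asyncio.run(demo()))  # [8100, 8101, 8100]
```

Round-robin via `itertools.cycle` is just the simplest selection policy; a real proxy might pick by load instead.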
}

set_cli_args() {
    PREFILLER_TP_SIZE=1
I suppose this could be a simpler pair of env vars.
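The env-var idea, sketched in Python for illustration: read `PREFILLER_TP_SIZE` and `DECODER_TP_SIZE` with a default of 1. The divisibility check encodes an assumption (decoder TP a multiple of prefiller TP, as in a P TP=2 / D TP=4 setup) rather than a rule stated in the PR, and `read_tp_sizes` is a hypothetical helper name.

```python
import os
from typing import Mapping, Optional


def read_tp_sizes(env: Optional[Mapping[str, str]] = None) -> tuple[int, int]:
    """Read (prefiller TP, decoder TP) from two env vars, defaulting to 1."""
    env = os.environ if env is None else env
    prefill_tp = int(env.get("PREFILLER_TP_SIZE", "1"))
    decode_tp = int(env.get("DECODER_TP_SIZE", "1"))
    if prefill_tp < 1 or decode_tp < 1:
        raise ValueError("TP sizes must be >= 1")
    # Assumption: the decoder TP size must be a multiple of the prefiller
    # TP size (e.g. PREFILLER_TP_SIZE=2, DECODER_TP_SIZE=4).
    if decode_tp % prefill_tp != 0:
        raise ValueError(
            f"DECODER_TP_SIZE={decode_tp} is not a multiple of "
            f"PREFILLER_TP_SIZE={prefill_tp}")
    return prefill_tp, decode_tp


if __name__ == "__main__":
    print(read_tp_sizes({"PREFILLER_TP_SIZE": "2", "DECODER_TP_SIZE": "4"}))  # (2, 4)
```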
This pull request has merge conflicts that must be resolved before it can be merged.
 PORT=$((8200 + i))
 # Calculate side channel port
-SIDE_CHANNEL_PORT=$((5659 + i))
+SIDE_CHANNEL_PORT=$((5659 + i * $DECODER_TP_SIZE))
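A small model of why the base port is strided by `DECODER_TP_SIZE`, assuming (not confirmed by the diff) that TP worker `rank` of decoder instance `i` listens on its instance's base port plus `rank`. Under that assumption, the old `5659 + i` layout makes neighbouring instances share ports as soon as TP > 1, while the strided layout gives each instance its own block. It models decoder instances only; `side_channel_ports` and `has_collision` are hypothetical names.

```python
def side_channel_ports(num_decoders: int, decoder_tp_size: int,
                       base: int = 5659,
                       stride_by_tp: bool = True) -> dict[int, list[int]]:
    """Map each decoder instance i to the side-channel ports of its TP workers.

    Assumption: TP worker `rank` of instance i listens on start + rank, where
    start = base + i * decoder_tp_size (new formula) or base + i (old formula).
    """
    ports = {}
    for i in range(num_decoders):
        start = base + (i * decoder_tp_size if stride_by_tp else i)
        ports[i] = [start + rank for rank in range(decoder_tp_size)]
    return ports


def has_collision(ports: dict[int, list[int]]) -> bool:
    """True if any port is claimed by more than one worker."""
    all_ports = [p for worker_ports in ports.values() for p in worker_ports]
    return len(all_ports) != len(set(all_ports))


if __name__ == "__main__":
    old = side_channel_ports(2, 2, stride_by_tp=False)
    new = side_channel_ports(2, 2)
    print(old, has_collision(old))  # {0: [5659, 5660], 1: [5660, 5661]} True
    print(new, has_collision(new))  # {0: [5659, 5660], 1: [5661, 5662]} False
```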
This is buggy: it only works with 1P1D (one prefiller, one decoder), btw.
fix descr indexing; change remote worker selection indexing; test ptp2-dtp4
Signed-off-by: nicklucche <[email protected]>
Force-pushed from 64a89b8 to 355c2f1.
Update: unfortunately this does not work with nixl 0.1.1. During lm-eval the decoder hangs, triggering a TimeoutError.
Closing in favor of #18833.
What this PR does:
How to test
DP TP worker rank and kv split assignment:
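One plausible TP-only scheme for this assignment, sketched under stated assumptions: the decoder TP size is a multiple of the prefiller TP size (as in the ptp2-dtp4 test from the commit log), and the KV heads divide evenly across decoder ranks. Each group of `tp_ratio` consecutive decoder ranks reads from one prefiller rank, and each rank in the group takes a contiguous slice of that prefiller rank's heads. DP ranks are not modeled, and `KVSource` / `kv_source_for_decoder_rank` are hypothetical names, not the connector's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KVSource:
    remote_rank: int  # prefiller TP rank that holds this decoder rank's KV heads
    head_start: int   # first head to read, indexed within that prefiller rank
    head_end: int     # one past the last head to read


def kv_source_for_decoder_rank(decode_rank: int, prefill_tp: int,
                               decode_tp: int, num_kv_heads: int) -> KVSource:
    """Hypothetical mapping from a decoder TP rank to its prefiller KV slice."""
    if decode_tp % prefill_tp != 0:
        raise ValueError("decoder TP size must be a multiple of prefiller TP size")
    if num_kv_heads % decode_tp != 0:
        raise ValueError("KV heads must split evenly across decoder ranks")
    tp_ratio = decode_tp // prefill_tp        # decoder ranks per prefiller rank
    heads_per_decode_rank = num_kv_heads // decode_tp
    remote_rank = decode_rank // tp_ratio     # which prefiller rank to read from
    offset = (decode_rank % tp_ratio) * heads_per_decode_rank
    return KVSource(remote_rank, offset, offset + heads_per_decode_rank)


if __name__ == "__main__":
    # P TP=2, D TP=4, 8 KV heads: each prefiller rank holds 4 heads and
    # each decoder rank reads 2 of them.
    for r in range(4):
        print(r, kv_source_for_decoder_rank(r, prefill_tp=2, decode_tp=4,
                                            num_kv_heads=8))
```

With equal TP sizes, `tp_ratio` is 1 and every decoder rank reads all heads of the prefiller rank with the same index, which is the homogeneous case.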