[Core] Enable Scaling Down for Multi-Host TPU Replicas #43470
This is on a critical code path. We should have more testing. Let's discuss it in today's sync.
    Signed-off-by: Ryan O'Leary <[email protected]>
This PR was manually tested as follows:
Prerequisites:
Testing:
Sure, I edited the comment to include more detail.
@ryanaoleary could you also rebase your branch to fix the CI error? Thanks!
@can-anyscale could you retry the failed test? It is unrelated to this PR. Thanks!
The RLlib tests still fail after the retry, but I don't think that is related to this PR, because this PR only touches KubeRay. cc @jjyao @can-anyscale
Why are these changes needed?
Adds support for the Ray autoscaler and the KubeRay NodeProvider to scale down TPU podslices. TPU podslices are atomic, so all Ray nodes belonging to a podslice must be scaled down together. This PR associates each node with the replica (representing a podslice) of the TPU worker group it belongs to, using a replicaIndex Pod label that is set through a GKE webhook. When a TPU node is deleted, the other nodes in that replica (tracked through a mapping) are scheduled for deletion as well.
Related PR: #45105
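For illustration only, the replica-level scale-down described above could be sketched roughly as follows. This is not the code from this PR: the function names, the plain-dict representation of Pod labels, and the example Pod names are all hypothetical, and only the replicaIndex label name comes from the description.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set

# Label name taken from the PR description; everything else here is hypothetical.
REPLICA_INDEX_LABEL = "replicaIndex"


def build_replica_map(pod_labels: Dict[str, Dict[str, str]]) -> Dict[str, Set[str]]:
    """Group node (Pod) names by the value of their replicaIndex label."""
    replica_to_nodes: Dict[str, Set[str]] = defaultdict(set)
    for node, labels in pod_labels.items():
        replica = labels.get(REPLICA_INDEX_LABEL)
        if replica is not None:  # nodes without the label are not part of a podslice
            replica_to_nodes[replica].add(node)
    return dict(replica_to_nodes)


def expand_to_whole_replicas(
    to_delete: Iterable[str], pod_labels: Dict[str, Dict[str, str]]
) -> Set[str]:
    """Return the nodes to delete, widened so every affected replica is removed whole."""
    requested = list(to_delete)
    replica_to_nodes = build_replica_map(pod_labels)
    expanded = set(requested)
    for node in requested:
        replica = pod_labels.get(node, {}).get(REPLICA_INDEX_LABEL)
        if replica is not None:
            # Podslices are atomic: pull in every sibling node of the same replica.
            expanded |= replica_to_nodes[replica]
    return expanded


# Hypothetical cluster state: two 2-host TPU replicas and one CPU worker.
pods = {
    "tpu-0-a": {"replicaIndex": "tpu-group-0"},
    "tpu-0-b": {"replicaIndex": "tpu-group-0"},
    "tpu-1-a": {"replicaIndex": "tpu-group-1"},
    "tpu-1-b": {"replicaIndex": "tpu-group-1"},
    "cpu-0": {},
}
```

In this sketch, asking to delete a single TPU host widens the request to its whole replica, while nodes without the label are deleted individually as before.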
Related issue number
Checks
git commit -s) in this PR.
scripts/format.sh to lint the changes in this PR.
method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.