Regressive resource scaling and accelerators validation #277

Conversation
Commits:
Merge upstream changes
…tests to account for new default values
| "ml.i3en.24xlarge": {"cpu": 96, "gpu": 0, "trainium": 0, "memory": 768} | ||
| } | ||
|
|
||
| MAX_MEMORY_PROPORTION = 0.85 |
Just curious: what is the source of these magic numbers?
This was a hard cap from a previous iteration, limiting max memory requests to a fixed proportion of total capacity; I didn't realize it was still included and will remove it. The latest iteration uses a regressively scaled cap instead, which works better.
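For illustration, a minimal sketch of what a regressively scaled memory cap can look like; the tier breakpoints and fractions below are invented for this example and are not the PR's actual values:

```python
# Illustrative sketch only: reserve a shrinking fraction of each
# successive chunk of capacity, so large instances give back
# proportionally less headroom than a flat 85% cap would take.
def regressive_memory_cap(total_gib: float) -> float:
    """Return a schedulable memory cap in GiB (hypothetical tiers)."""
    tiers = [(64, 0.10), (192, 0.05), (float("inf"), 0.02)]
    reserved, remaining = 0.0, total_gib
    for chunk_size, fraction in tiers:
        chunk = min(remaining, chunk_size)
        reserved += chunk * fraction
        remaining -= chunk
        if remaining <= 0:
            break
    return total_gib - reserved
```

Under these made-up tiers, a 192 GiB instance keeps about 179 GiB schedulable versus roughly 163 GiB under a flat 0.85 cap, and the gap widens with instance size.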
)

MAX_MEMORY_PROPORTION = 0.85
MAX_CPU_PROPORTION = 0.92
Nit: to avoid redundancy, these constants could be moved to a common location and shared by both src and tst.
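One way to do that, assuming a module path invented for this example (the actual package layout may differ):

```python
# src/sagemaker/hyperpod/constants.py  (hypothetical location)
# Single source of truth for the resource-cap proportions, importable
# by both production code and tests.
MAX_MEMORY_PROPORTION = 0.85  # cap memory requests at 85% of capacity
MAX_CPU_PROPORTION = 0.92     # cap CPU requests at 92% of capacity
```

Tests would then import these values rather than redefining them, e.g. `from sagemaker.hyperpod.constants import MAX_CPU_PROPORTION`, so the test copies can never silently drift from the source values.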
/hyperpod-cluster-stack-template/build
/hyperpod-pytorch-job-template/build
/hyperpod-custom-inference-template/build
/hyperpod-jumpstart-inference-template/build
I think we need to keep these lines in .gitignore?
Yes; I'm not sure why/how these were removed. I'll add these back in.
Known flaky tests are failing; they were passing in the previous revision.
What's changing and why?
Adds validation logic for accelerator parameters and implements resource request caps for CPU and memory to prevent scheduling failures.
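As a rough sketch of the validation idea (the function and dictionary names are illustrative, not the PR's actual code; the ml.i3en.24xlarge entry mirrors the diff above, and the ml.g5.12xlarge figures are the published instance specs):

```python
# Hypothetical validation: reject jobs whose accelerator type/count
# don't match what the chosen instance type actually provides.
INSTANCE_RESOURCES = {
    "ml.g5.12xlarge": {"cpu": 48, "gpu": 4, "trainium": 0, "memory": 192},
    "ml.i3en.24xlarge": {"cpu": 96, "gpu": 0, "trainium": 0, "memory": 768},
}

def validate_accelerators(instance_type, accelerator=None, count=None):
    capacity = INSTANCE_RESOURCES.get(instance_type)
    if capacity is None:
        raise ValueError(f"unknown instance type: {instance_type}")
    if accelerator is None and count is None:
        return  # nothing requested; defaults apply
    if (accelerator is None) != (count is None):
        # Scenario 2: one of the two accelerator parameters is missing
        raise ValueError("accelerator type and count must be set together")
    available = capacity.get(accelerator, 0)
    if available == 0:
        # Scenario 1: accelerator type mismatched with the instance
        raise ValueError(f"{instance_type} has no {accelerator} accelerators")
    if count > available:
        raise ValueError(
            f"requested {count} {accelerator}, but {instance_type} "
            f"provides only {available}"
        )
```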
Before/After UX
Scenario 1: Mismatched Accelerator Parameters
Before:
After:
Scenario 2: Missing Accelerator Parameter
Before:
After:
Resources:
  Limits:
    Memory:          96Gi
    nvidia.com/gpu:  2
  Requests:
    Cpu:             24
    Memory:          96Gi
    nvidia.com/gpu:  2

Scenario 3: Default Resource Allocation
Before: CLI requested max capacity, causing resource contention failures
hyp create hyp-pytorch-job --version 1.1 --job-name resource-contention-failure --image 162705258397.dkr.ecr.us-west-2.amazonaws.com/ptjob:latest --pull-policy "Always" --tasks-per-node 1 --max-retry 1 --namespace aws-hyperpod --instance-type ml.g5.12xlarge

After: Automatically caps at safe thresholds (85% memory, 92% CPU)
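For ml.g5.12xlarge (48 vCPU, 192 GiB), that works out to roughly 0.92 × 48 ≈ 44 CPUs and 0.85 × 192 ≈ 163 GiB; the 164Gi shown below suggests the implementation rounds slightly differently, so treat this arithmetic as approximate.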
Spec:
  Containers:
    Image:              162705258397.dkr.ecr.us-west-2.amazonaws.com/ptjob:latest
    Image Pull Policy:  Always
    Name:               pytorch-job-container
    Resources:
      Limits:
        Memory:          164Gi
        nvidia.com/gpu:  4
      Requests:
        Cpu:             44
        Memory:          164Gi
        nvidia.com/gpu:  4
  Node Selector:
    node.kubernetes.io/instance-type: ml.g5.12xlarge

How was this change tested?
Are unit tests added?
Yes
Are integration tests added?
Yes
Test results:
https://quip-amazon.com/1UHaAG17sDMq/Test-Results-for-HP-CLI-Updates
Reviewer Guidelines
One of the following must be true: