Conversation

Sean1783
Collaborator

What's changing and why?

Adds validation logic for accelerator parameters and implements resource request caps for CPU and memory to prevent scheduling failures.

  • Added validation for accelerator parameters consistency
  • Implemented regressively scaled resource utilization caps
  • Automated parameter matching when only one accelerator parameter is provided
  • Aligned memory limits with Kubernetes best practices
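The validation and auto-matching behavior described above can be sketched roughly as follows. The function and parameter names here are hypothetical, not the actual CLI implementation; the underlying constraint is real, though: Kubernetes requires extended resources such as `nvidia.com/gpu` to have equal requests and limits.

```python
def resolve_accelerators(accelerators, accelerators_limit):
    """Validate or auto-match the accelerator request/limit pair.

    Hypothetical helper: Kubernetes requires extended-resource requests
    (e.g. nvidia.com/gpu) to equal their limits, so a mismatch is
    rejected, and when only one value is provided it is mirrored to
    the other side.
    """
    if accelerators is not None and accelerators_limit is not None:
        if accelerators != accelerators_limit:
            raise ValueError("Accelerator request must equal accelerator limit")
        return accelerators, accelerators_limit
    # Only one side was given: mirror it so request == limit.
    value = accelerators if accelerators is not None else accelerators_limit
    return value, value
```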

Before/After UX

Scenario 1: Mismatched Accelerator Parameters

Before:

hyp create hyp-pytorch-job \
  --accelerators-limit 3 \
  --accelerators 2
  • Fails silently due to an accelerator parameter mismatch

After:

# Same command now fails with:
"Error: Accelerator request must equal accelerator limit"

Scenario 2: Missing Accelerator Parameter

Before:

hyp create hyp-pytorch-job \
  --instance-type ml.g5.12xlarge \
  --accelerators-limit 2
  • Silent failure

After:

  • Successfully creates job with:
Resources:
  Limits:
    Memory: 96Gi
    nvidia.com/gpu: 2
  Requests:
    Cpu: 24
    Memory: 96Gi
    nvidia.com/gpu: 2
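The CPU and memory figures above follow from the instance shape: ml.g5.12xlarge has 4 GPUs, 48 vCPUs, and 192 GiB of memory, so a request for 2 of the 4 GPUs is granted half the CPU and memory. A rough sketch of that arithmetic (the instance table and helper are illustrative, not the actual code):

```python
# Illustrative subset of an instance-spec table (vCPU / GPU / memory in GiB).
INSTANCE_SPECS = {
    "ml.g5.12xlarge": {"cpu": 48, "gpu": 4, "memory": 192},
}

def proportional_resources(instance_type, gpus_requested):
    """Scale CPU and memory to the fraction of the node's GPUs requested."""
    spec = INSTANCE_SPECS[instance_type]
    fraction = gpus_requested / spec["gpu"]
    return {
        "cpu": int(spec["cpu"] * fraction),
        "memory_gib": int(spec["memory"] * fraction),
        "gpu": gpus_requested,
    }
```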

Scenario 3: Default Resource Allocation

Before: CLI requested max capacity, causing resource contention failures

hyp create hyp-pytorch-job \
  --version 1.1 \
  --job-name resource-contention-failure \
  --image 162705258397.dkr.ecr.us-west-2.amazonaws.com/ptjob:latest \
  --pull-policy "Always" \
  --tasks-per-node 1 \
  --max-retry 1 \
  --namespace aws-hyperpod \
  --instance-type ml.g5.12xlarge

After: Automatically caps at safe thresholds (85% memory, 92% CPU)

      Spec:
        Containers:
          Image:              162705258397.dkr.ecr.us-west-2.amazonaws.com/ptjob:latest
          Image Pull Policy:  Always
          Name:               pytorch-job-container
          Resources:
            Limits:
              Memory:          164Gi
              nvidia.com/gpu:  4
            Requests:
              Cpu:             44
              Memory:          164Gi
              nvidia.com/gpu:  4
        Node Selector:
          node.kubernetes.io/instance-type:  ml.g5.12xlarge
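The 44-CPU and 164Gi figures match the stated thresholds applied to ml.g5.12xlarge capacity (48 vCPU, 192 GiB). A sketch of that calculation; the rounding directions are assumptions inferred from the output above, not confirmed against the code:

```python
import math

MAX_MEMORY_PROPORTION = 0.85
MAX_CPU_PROPORTION = 0.92

def capped_defaults(cpu_capacity, memory_capacity_gib):
    """Cap default requests below node capacity to leave headroom for the
    kubelet and system daemons. Rounding here (floor for CPU, ceil for
    memory) is chosen to reproduce the sample output and is illustrative.
    """
    return {
        "cpu": math.floor(cpu_capacity * MAX_CPU_PROPORTION),
        "memory_gib": math.ceil(memory_capacity_gib * MAX_MEMORY_PROPORTION),
    }
```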

How was this change tested?

  • Verified with ml.g5.12xlarge and ml.g5.8xlarge instances
  • Added unit tests for new validation functions
  • Updated existing unit tests for new default caps
  • Added integration tests following existing patterns

Are unit tests added?

Yes

Are integration tests added?

Yes

Test results:

https://quip-amazon.com/1UHaAG17sDMq/Test-Results-for-HP-CLI-Updates

Reviewer Guidelines

‼️ Merge Requirements: PRs with failing integration tests cannot be merged without justification.

One of the following must be true:

  • All automated PR checks pass
  • Failed tests include local run results/screenshots proving they work
  • Changes are documentation-only

@Sean1783 Sean1783 requested a review from a team as a code owner September 24, 2025 17:41
"ml.i3en.24xlarge": {"cpu": 96, "gpu": 0, "trainium": 0, "memory": 768}
}

MAX_MEMORY_PROPORTION = 0.85
Collaborator

just curious what is the source of these magic numbers?

Collaborator Author

This was a hard cap from a previous iteration that limited max memory requests as a proportion of total capacity; I didn't realize it was still included and can remove it. The latest iteration uses a regressively scaled cap, which is better.
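For context, "regressively scaled" presumably means the reserved headroom shrinks as a fraction of capacity on larger nodes, similar in spirit to GKE's tiered kube-reserved memory formula (25% of the first 4 GiB, 20% of the next 4 GiB, 10% of the next 8 GiB, 6% of the next 112 GiB, 2% beyond that). A hypothetical sketch of that general idea, not the formula this PR actually uses:

```python
def reserved_memory_gib(capacity_gib):
    """Tiered reservation in the style of GKE's kube-reserved memory
    formula: larger nodes reserve a smaller *fraction* of capacity.
    Illustrative only; not the formula used by this PR.
    """
    tiers = [(4, 0.25), (4, 0.20), (8, 0.10), (112, 0.06), (float("inf"), 0.02)]
    reserved, remaining = 0.0, capacity_gib
    for size, rate in tiers:
        chunk = min(remaining, size)
        reserved += chunk * rate
        remaining -= chunk
        if remaining <= 0:
            break
    return reserved
```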

)

MAX_MEMORY_PROPORTION = 0.85
MAX_CPU_PROPORTION = 0.92
Collaborator

nit : to avoid redundancy, these constants can be moved to some common location to be used by both src and tst.

/hyperpod-cluster-stack-template/build
/hyperpod-pytorch-job-template/build
/hyperpod-custom-inference-template/build
/hyperpod-jumpstart-inference-template/build
Collaborator

I think we need to keep these lines in the gitignore?

Collaborator Author

Yes; I'm not sure why/how these were removed. I'll add these back in.

@jam-jee
Collaborator

jam-jee commented Sep 30, 2025

Known flaky tests failing, were passing in previous revision.

@jam-jee jam-jee merged commit 160cd80 into aws:main Sep 30, 2025
5 of 6 checks passed