Conversation

Sean1783
Collaborator

What's changing and why?

Adds validation logic for accelerator parameters and implements resource request caps for CPU and memory to prevent scheduling failures.

  • Added validation for accelerator parameters consistency
  • Implemented regressively scaled resource utilization caps
  • Automated parameter matching when only one accelerator parameter is provided
  • Aligned memory limits with Kubernetes best practices
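The validation and auto-matching behavior described above can be sketched roughly as follows. The function and parameter names here are hypothetical, not the actual CLI implementation; the underlying constraint is real, though: Kubernetes requires extended resources such as `nvidia.com/gpu` to have equal requests and limits.

```python
def resolve_accelerators(accelerators, accelerators_limit):
    """Validate or auto-match the accelerator request/limit pair.

    Hypothetical helper: Kubernetes requires extended-resource requests
    (e.g. nvidia.com/gpu) to equal their limits, so a mismatch is
    rejected, and when only one value is provided it is mirrored to
    the other side.
    """
    if accelerators is not None and accelerators_limit is not None:
        if accelerators != accelerators_limit:
            raise ValueError("Accelerator request must equal accelerator limit")
        return accelerators, accelerators_limit
    # Only one side was given: mirror it so request == limit.
    value = accelerators if accelerators is not None else accelerators_limit
    return value, value
```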

Before/After UX

Scenario 1: Mismatched Accelerator Parameters

Before:

hyp create hyp-pytorch-job \
  --accelerators-limit 3 \
  --accelerators 2
  • Fails silently due to an accelerator parameter mismatch

After:

# Same command now fails with:
"Error: Accelerator request must equal accelerator limit"

Scenario 2: Missing Accelerator Parameter

Before:

hyp create hyp-pytorch-job \
  --instance-type ml.g5.12xlarge \
  --accelerators-limit 2
  • Silent failure

After:

  • Successfully creates job with:
Resources:
  Limits:
    Memory: 96Gi
    nvidia.com/gpu: 2
  Requests:
    Cpu: 24
    Memory: 96Gi
    nvidia.com/gpu: 2
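The CPU and memory figures above follow from the instance shape: ml.g5.12xlarge has 4 GPUs, 48 vCPUs, and 192 GiB of memory, so a request for 2 of the 4 GPUs is granted half the CPU and memory. A rough sketch of that arithmetic (the instance table and helper are illustrative, not the actual code):

```python
# Illustrative subset of an instance-spec table (vCPU / GPU / memory in GiB).
INSTANCE_SPECS = {
    "ml.g5.12xlarge": {"cpu": 48, "gpu": 4, "memory": 192},
}

def proportional_resources(instance_type, gpus_requested):
    """Scale CPU and memory to the fraction of the node's GPUs requested."""
    spec = INSTANCE_SPECS[instance_type]
    fraction = gpus_requested / spec["gpu"]
    return {
        "cpu": int(spec["cpu"] * fraction),
        "memory_gib": int(spec["memory"] * fraction),
        "gpu": gpus_requested,
    }
```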

Scenario 3: Default Resource Allocation

Before: CLI requested max capacity, causing resource contention failures

hyp create hyp-pytorch-job \
  --version 1.1 \
  --job-name resource-contention-failure \
  --image 162705258397.dkr.ecr.us-west-2.amazonaws.com/ptjob:latest \
  --pull-policy "Always" \
  --tasks-per-node 1 \
  --max-retry 1 \
  --namespace aws-hyperpod \
  --instance-type ml.g5.12xlarge

After: Automatically caps at safe thresholds (85% memory, 92% CPU)

      Spec:
        Containers:
          Image:              162705258397.dkr.ecr.us-west-2.amazonaws.com/ptjob:latest
          Image Pull Policy:  Always
          Name:               pytorch-job-container
          Resources:
            Limits:
              Memory:          164Gi
              nvidia.com/gpu:  4
            Requests:
              Cpu:             44
              Memory:          164Gi
              nvidia.com/gpu:  4
        Node Selector:
          node.kubernetes.io/instance-type:  ml.g5.12xlarge
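The 44-CPU and 164Gi figures match the stated thresholds applied to ml.g5.12xlarge capacity (48 vCPU, 192 GiB). A sketch of that calculation; the rounding directions are assumptions inferred from the output above, not confirmed against the code:

```python
import math

MAX_MEMORY_PROPORTION = 0.85
MAX_CPU_PROPORTION = 0.92

def capped_defaults(cpu_capacity, memory_capacity_gib):
    """Cap default requests below node capacity to leave headroom for the
    kubelet and system daemons. Rounding here (floor for CPU, ceil for
    memory) is chosen to reproduce the sample output and is illustrative.
    """
    return {
        "cpu": math.floor(cpu_capacity * MAX_CPU_PROPORTION),
        "memory_gib": math.ceil(memory_capacity_gib * MAX_MEMORY_PROPORTION),
    }
```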

How was this change tested?

  • Verified with ml.g5.12xlarge and ml.g5.8xlarge instances
  • Added unit tests for new validation functions
  • Updated existing unit tests for new default caps
  • Added integration tests following existing patterns

Are unit tests added?

Yes

Are integration tests added?

Yes

Test results:

https://quip-amazon.com/1UHaAG17sDMq/Test-Results-for-HP-CLI-Updates

Reviewer Guidelines

‼️ Merge Requirements: PRs with failing integration tests cannot be merged without justification.

One of the following must be true:

  • All automated PR checks pass
  • Failed tests include local run results/screenshots proving they work
  • Changes are documentation-only

@Sean1783 Sean1783 requested a review from a team as a code owner September 24, 2025 17:41
"ml.i3en.24xlarge": {"cpu": 96, "gpu": 0, "trainium": 0, "memory": 768}
}

MAX_MEMORY_PROPORTION = 0.85
Collaborator

just curious what is the source of these magic numbers?

Collaborator Author

This was a hard cap from a previous iteration that limited max memory requests as a proportion of total capacity; I didn't realize it was still included and can remove it. The latest iteration uses a regressively scaled cap, which is better.
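For context, "regressively scaled" presumably means the reserved headroom shrinks as a fraction of capacity on larger nodes, similar in spirit to GKE's tiered kube-reserved memory formula (25% of the first 4 GiB, 20% of the next 4 GiB, 10% of the next 8 GiB, 6% of the next 112 GiB, 2% beyond that). A hypothetical sketch of that general idea, not the formula this PR actually uses:

```python
def reserved_memory_gib(capacity_gib):
    """Tiered reservation in the style of GKE's kube-reserved memory
    formula: larger nodes reserve a smaller *fraction* of capacity.
    Illustrative only; not the formula used by this PR.
    """
    tiers = [(4, 0.25), (4, 0.20), (8, 0.10), (112, 0.06), (float("inf"), 0.02)]
    reserved, remaining = 0.0, capacity_gib
    for size, rate in tiers:
        chunk = min(remaining, size)
        reserved += chunk * rate
        remaining -= chunk
        if remaining <= 0:
            break
    return reserved
```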

)

MAX_MEMORY_PROPORTION = 0.85
MAX_CPU_PROPORTION = 0.92
Collaborator

nit : to avoid redundancy, these constants can be moved to some common location to be used by both src and tst.

/hyperpod-cluster-stack-template/build
/hyperpod-pytorch-job-template/build
/hyperpod-custom-inference-template/build
/hyperpod-jumpstart-inference-template/build
Collaborator

I think we need to keep these lines in the gitignore?

Collaborator Author

Yes; I'm not sure why/how these were removed. I'll add these back in.

@jam-jee
Collaborator

jam-jee commented Sep 30, 2025

Known flaky tests failing, were passing in previous revision.

@jam-jee jam-jee merged commit 160cd80 into aws:main Sep 30, 2025
5 of 6 checks passed