99 commits
223af40
Update telemetry status to be Integer for parity (#130)
Aditi2424 Jul 18, 2025
cf77296
Release new version for Health Monitoring Agent (1.0.643.0_1.0.192.0)…
maheshxb Jul 18, 2025
0342f60
Release new version for Health Monitoring Agent (1.0.674.0_1.0.199.0)…
jiayelamazon Jul 18, 2025
631ddf9
update inference CLI describe command print for better visualization …
mollyheamazon Jul 21, 2025
dc440c3
Update inference integ test to add dependency to improve telemetry ex…
mollyheamazon Jul 22, 2025
cc08405
Manual release v3.0.1 (#143)
mollyheamazon Jul 22, 2025
079fafd
change security-monitoring metrics data destination to us-east-2 for …
mollyheamazon Jul 22, 2025
29a16c5
feat: Add region detection to install Health Monitoring Agent and use…
haardm Jul 22, 2025
66232ed
Add unique time string to integ test (#150)
zhaoqizqwang Jul 23, 2025
9fbec4a
update example notebook for inference CLI (#151)
mollyheamazon Jul 23, 2025
8034a24
Training: Main documentation update (#153)
rsareddy0329 Jul 23, 2025
0bcee6d
Update inferenece SDK examples (#155)
zhaoqizqwang Jul 23, 2025
d2130e9
update help text to avoid truncation (#158)
mollyheamazon Jul 24, 2025
e3fafe0
Enable telemetry for cli (#165)
rsareddy0329 Jul 29, 2025
293f9b9
Add an option to disable the deployment of KubeFlow TrainingOperator …
DaniilGlazkoTR Jul 29, 2025
9f534b4
Remove unused param from documentation (#170)
nargokul Jul 30, 2025
ec8800d
Update volume flag to support hostPath and pvc (#171)
mollyheamazon Jul 31, 2025
95e073e
Restructure list-cluster output (#173)
pintaoz-aws Jul 31, 2025
a8a2baf
Update inference config and integ tests (#167)
zhaoqizqwang Jul 31, 2025
2908a62
Update readme for volume flag (#176)
mollyheamazon Jul 31, 2025
9b7220c
Manual release v3.0.2 (#177)
pintaoz-aws Jul 31, 2025
36fac66
Add schema pattern check to pytorch-job template (#178)
mollyheamazon Aug 1, 2025
0de2138
Add version comptability check between server K8s and Client python K…
papriwal Aug 1, 2025
dcbc8fb
Fix training test (#184)
zhaoqizqwang Aug 5, 2025
28424e4
Update logging information for submitting and deleting training job (…
pintaoz-aws Aug 5, 2025
17cfdbd
Merge Documentation changes to main for Launch (#196)
rsareddy0329 Aug 6, 2025
6553766
Added new column 'deploymeny configs' to the itable that allows user'…
mohamedzeidan2021 Aug 6, 2025
63ff3b4
Add instance type support for ml.p6e-gb200.36xlarge (#204)
zhaoqizqwang Aug 8, 2025
e3f697a
changed endpoint name from value user has to manually insert to place…
mohamedzeidan2021 Aug 12, 2025
d16d1b3
Enable PR checks on feature branches (#207)
rsareddy0329 Aug 12, 2025
0fd2bef
Release tg (#209)
jam-jee Aug 14, 2025
9560a48
Update generate_click_command inject logic to not expose unwanted fla…
mollyheamazon Aug 15, 2025
96c5b2b
update CHANGELOG.md (#175)
jam-jee Aug 15, 2025
7fda684
Minor update on README, example notebooks and documentation (#216)
mollyheamazon Aug 18, 2025
f747815
Add metadata_name argument to js and custom endpoint to match with SD…
mollyheamazon Aug 19, 2025
a4f0465
Add cert mgr installation which is required by HPTO (#180)
emeraldbay Aug 19, 2025
9c07154
Implementing hyp version command (#223)
jam-jee Aug 19, 2025
21d7ca2
FIX README DOCUMENTATION ISSUES (#221)
papriwal Aug 19, 2025
73a41b3
Update description for scheduler type (#222)
zhaoqizqwang Aug 19, 2025
743bd4d
fix: Set cert mgr installation disable by default (#224)
emeraldbay Aug 20, 2025
99121e7
Release new version for Health Monitoring Agent (1.0.742.0_1.0.241.0)…
992X Aug 20, 2025
853dfa8
feat: add get_operator_logs to pytorch job (#218)
rsareddy0329 Aug 20, 2025
d2bd3c2
Change default container name in pytorch template (#220)
mollyheamazon Aug 20, 2025
cc9eec6
Enhanced Error Handling for all hyp commands
mohamedzeidan2021 Aug 21, 2025
f571859
update v1.1 pytorch job template to match parity with v1.0 change in …
mollyheamazon Aug 22, 2025
935a4d9
Update list_pods to only display pods of corresponding endpoint type …
pintaoz-aws Aug 22, 2025
84aabcf
Implementing Task Gov. feature for SDK flow (#230)
jam-jee Aug 25, 2025
da607d2
Update warning message string for k8s version compatibility check (#229)
papriwal Aug 25, 2025
6f452bf
Implemented parallel processing for list-cluster operation to improve…
jam-jee Aug 25, 2025
91504e9
Add enpoint_name argument for list_pods() (#232)
pintaoz-aws Aug 25, 2025
e3cfe1d
Adding thread sleep before deleting resources in integ test (#236)
jam-jee Aug 26, 2025
5cff2a7
Release Cluster Management (#233)
nargokul Aug 26, 2025
3ad70ec
Create README.md (#237)
nargokul Aug 26, 2025
12730ca
Fix list_pods and AZ_ID error message (#238)
zhaoqizqwang Aug 26, 2025
16b48dd
Update setup.py to enable cluster creation template (#243)
nargokul Aug 27, 2025
e1ac050
Update docs for Cluster Management (#240)
papriwal Aug 27, 2025
0bf0782
Update CHANGELOG.md for 3.2.1 (#245)
rsareddy0329 Aug 27, 2025
1590894
Bug fix for cluster creation integ test, fixed cfn cleanup, wait for …
aviruthen Aug 27, 2025
0d7c810
update jumpstart and pytorch template for release (#248)
mollyheamazon Aug 27, 2025
4e73b0e
Update CHANGELOG.md for training and inference templates (#247)
rsareddy0329 Aug 27, 2025
5a346e8
Update pyproject.toml for inference templates (#249)
rsareddy0329 Aug 28, 2025
c7a285b
Update PR template with review information (#235)
aviruthen Sep 2, 2025
d30a69b
revert node-count val (#253)
mohamedzeidan2021 Sep 2, 2025
1ffbd65
Release new version for Health Monitoring Agent (1.0.790.0_1.0.266.0)…
992X Sep 3, 2025
d45d5c8
changelog version update (#256)
mohamedzeidan2021 Sep 4, 2025
58cff10
Fix README documentation and broken anchor links (#252)
papriwal Sep 4, 2025
60982da
Small bug fix to print debug messages for inference logger (PySDK) (#…
aviruthen Sep 8, 2025
050901b
Add code-coverage workflow to GitHub workflows (#257)
aviruthen Sep 9, 2025
162fb79
Bump version to 3.2.2 (#260)
papriwal Sep 10, 2025
458bd63
Update readme to include review guidelines (#261)
zhaoqizqwang Sep 10, 2025
883e534
Feature: Delete Cluster Command (#250)
mohamedzeidan2021 Sep 11, 2025
dffcc3d
Code Coverage for Integ Tests (#262)
aviruthen Sep 11, 2025
88bfd93
Release new version for Health Monitoring Agent (1.0.819.0_1.0.267.0)…
jiayelamazon Sep 16, 2025
5c42bcd
Removing duplicate cluster-creating integ test (#266)
aviruthen Sep 16, 2025
0b1bc8f
Access entry fix (#267)
mohamedzeidan2021 Sep 17, 2025
da2df2f
Fix Slurm failures from missing orchestration key (#268)
aviruthen Sep 17, 2025
d08aefb
Bump versions for release (#270)
aviruthen Sep 19, 2025
7421a76
Update CHANGELOG.md (#274)
rsareddy0329 Sep 23, 2025
3ee6d51
Integration tests for init experience (#242)
aviruthen Sep 8, 2025
6ed3031
return SDK class in pytorch model.py for v1_0 and v1_1, update pytorc…
mollyheamazon Sep 8, 2025
757d4ec
Init experience template agnostic change (TODO: CFN) (#241)
mollyheamazon Sep 9, 2025
72c14f7
Jumpstart and custom inference template agnostic change (#244)
mollyheamazon Sep 10, 2025
3f7c7bc
Cluster-stack template agnostic change (#245)
mollyheamazon Sep 11, 2025
68c28f9
Bugbash fix (#246)
mollyheamazon Sep 16, 2025
7c09e6a
delete cluster functionality (#247)
mohamedzeidan2021 Sep 16, 2025
4b1e0fb
Add telemetry and dog fooding fixes (#248)
mollyheamazon Sep 18, 2025
7697ded
Merge from public repo to resolve merge conflict before init experien…
mollyheamazon Sep 18, 2025
8bc72cf
add example notebooks for init experience, update README to match wit…
mollyheamazon Sep 19, 2025
a3e7efa
Fix test_hp_endpoint create_from_dict test
Sep 23, 2025
75c601b
Fix tox.ini to fix coverage issue
Sep 23, 2025
315f7ec
Fix tox.ini to fix coverage issue
Sep 24, 2025
9891db4
added describe cluster cmd (#278)
mohamedzeidan2021 Sep 25, 2025
dc2096a
Update aws-efa-k8s-device-plugin version to 0.5.10 (#282)
KeitaW Sep 30, 2025
160cd80
Regressive resource scaling and accelerators validation (#277)
Sean1783 Sep 30, 2025
7c185e3
Update README.md
mollyheamazon Oct 6, 2025
c5edf2d
Add ml.p5.4xlarge instance type support (#283)
ytlee93 Oct 6, 2025
79b0342
Update jinja template handling logic for inference and training (#279)
mollyheamazon Oct 8, 2025
98137da
Update cluster creation template url with versioning (#285)
pintaoz-aws Oct 8, 2025
0ae955c
Release new version for Health Monitoring Agent (1.0.935.0_1.0.282.0)…
nathanng17 Oct 14, 2025
60 changes: 29 additions & 31 deletions .github/pull_request_template.md
@@ -1,32 +1,30 @@
# PR Approval Steps

## For Requester

1. Description
- [ ] Check the PR title and description for clarity. It should describe the changes made and the reason behind them.
- [ ] Ensure that the PR follows the contribution guidelines, if applicable.
2. Security requirements
- [ ] Ensure that a Pull Request (PR) does not expose passwords and other sensitive information by using git-secrets and upload relevant evidence: https://github.com/awslabs/git-secrets
- [ ] Ensure commit has GitHub Commit Signature
3. Manual review
1. Click on the Files changed tab to see the code changes. Review the changes thoroughly:
- [ ] Code Quality: Check for coding standards, naming conventions, and readability.
- [ ] Functionality: Ensure that the changes meet the requirements and that all necessary code paths are tested.
- [ ] Security: Check for any security issues or vulnerabilities.
- [ ] Documentation: Confirm that any necessary documentation (code comments, README updates, etc.) has been updated.
4. Check for Merge Conflicts:
- [ ] Verify if there are any merge conflicts with the base branch. GitHub will usually highlight this. If there are conflicts, you should resolve them.

## For Reviewer

1. Go through `For Requester` section to double check each item.
2. Request Changes or Approve the PR:
1. If the PR is ready to be merged, click Review changes and select Approve.
2. If changes are required, select Request changes and provide feedback. Be constructive and clear in your feedback.
3. Merging the PR
1. Check the Merge Method:
1. Decide on the appropriate merge method based on your repository's guidelines (e.g., Squash and merge, Rebase and merge, or Merge).
2. Merge the PR:
1. Click the Merge pull request button.
2. Confirm the merge by clicking Confirm merge.
## What's changing and why?
<!-- Describe what you're changing and the motivation behind it -->


## Before/After UX
<!-- Show the user experience before and after your changes -->
**Before:**


**After:**


## How was this change tested?
<!-- Describe your testing approach -->


## Are unit tests added?


## Are integration tests added?


## Reviewer Guidelines

‼️ **Merge Requirements**: PRs with failing integration tests cannot be merged without justification.

One of the following must be true:
- [ ] All automated PR checks pass
- [ ] Failed tests include local run results/screenshots proving they work
- [ ] Changes are documentation-only
3 changes: 1 addition & 2 deletions .github/workflows/codebuild-ci.yml
@@ -2,8 +2,7 @@ name: PR Checks
 on:
   pull_request_target:
     branches:
-      - "master*"
-      - "main*"
+      - "*"

 concurrency:
   group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.head_ref }}
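To see what the branch filter change in `codebuild-ci.yml` means in practice, here is a rough Python sketch that compares the old and new pattern lists. It assumes the documented GitHub Actions glob semantics, where `*` matches any run of characters except `/` and `**` matches any run including `/`. The helper names `branch_filter_to_regex` and `matches_any` are hypothetical, and other special characters (`?`, `+`, `[]`, `!`) are deliberately not handled.

```python
import re


def branch_filter_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a simplified branch filter into a compiled regex.

    Assumption: only `*` and `**` are special; everything else is literal.
    """
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            parts.append(".*")      # `**` may cross `/` boundaries
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")   # `*` stops at `/`
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))


def matches_any(branch: str, patterns: list) -> bool:
    """Return True if the branch name fully matches at least one pattern."""
    return any(branch_filter_to_regex(p).fullmatch(branch) for p in patterns)


OLD_PATTERNS = ["master*", "main*"]  # before this change
NEW_PATTERNS = ["*"]                 # after this change

if __name__ == "__main__":
    for branch in ["main", "main-v3", "release-3.2", "feature/init-experience"]:
        print(branch, matches_any(branch, OLD_PATTERNS), matches_any(branch, NEW_PATTERNS))
```

If this approximation holds, `"*"` widens the trigger to any base branch whose name has no slash, while a base branch such as `feature/init-experience` would still need `"**"`. That is worth confirming against the GitHub Actions filter pattern documentation before relying on it.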
2 changes: 1 addition & 1 deletion .github/workflows/security-monitoring.yml
@@ -73,7 +73,7 @@ jobs:
         uses: aws-actions/configure-aws-credentials@12e3392609eaaceb7ae6191b3f54bbcb85b5002b
         with:
           role-to-assume: ${{ secrets.MONITORING_ROLE_ARN }}
-          aws-region: us-west-2
+          aws-region: us-east-2
       - name: Put Dependabot Alert Metric Data
         run: |
           if [ "${{ needs.check-dependabot-alerts.outputs.dependabot_alert_status }}" == "1" ]; then
11 changes: 10 additions & 1 deletion .gitignore
@@ -16,14 +16,23 @@ __pycache__/
/.mypy_cache

/doc/_apidoc/
doc/_build/
/build

/sagemaker-hyperpod/build
/sagemaker-hyperpod/.coverage
/sagemaker-hyperpod/.coverage.*

/hyperpod-cluster-stack-template/build
/hyperpod-pytorch-job-template/build
/hyperpod-custom-inference-template/build
/hyperpod-jumpstart-inference-template/build

# Ignore all contents of result and results directories
/result/
/results/

.idea/
.idea/

.venv*
venv
20 changes: 20 additions & 0 deletions .readthedocs.yaml
@@ -0,0 +1,20 @@
version: 2

build:
os: ubuntu-22.04
tools:
python: "3.9"

python:
install:
- method: pip
path: .
- requirements: doc/requirements.txt

sphinx:
configuration: doc/conf.py
fail_on_warning: false

formats:
- pdf
- epub
65 changes: 59 additions & 6 deletions CHANGELOG.md
@@ -1,23 +1,76 @@
# Changelog

## v2.0.0 (2024-12-04)
## v.3.3.0 (2025-09-23)

### Features

- feature: The HyperPod CLI now support ([Hyperpod recipes](https://github.com/aws/sagemaker-hyperpod-recipes.git)). The HyperPod recipes enable customers to get started training and fine-tuning popular publicly-available foundation models like Llama 3.1 405B in minutes. Learn more ([here](https://github.com/aws/sagemaker-hyperpod-recipes.git)).
* Init Experience
* Init, Validate, and Create JumpStart endpoint, Custom endpoint, and PyTorch Training Job with local configuration
* Cluster management
* Bug fixes for cluster creation


## v1.0.0 (2024-09-09)
## v.3.2.2 (2025-09-10)

### Features

- feature: Add support for SageMaker HyperPod CLI
* Fix for production canary failures caused by bad training job template.
* New version for Health Monitoring Agent (1.0.790.0_1.0.266.0) with minor improvements and bug fixes.

## v3.2.1 (2025-08-27)

### Features

* Cluster management
* Bug Fixes with cluster creation
* Enable cluster template to be installed with hyperpod CLI .

## v3.2.0 (2025-08-25)

### Features

* Cluster management
* Creation of cluster stack
* Describing and listing a cluster stack
* Updating a cluster
* Init Experience
* Init, Validate, Create with local configurations


## v3.1.0 (2025-08-13)

### Features
* Task Governance feature for training jobs.


## v3.0.2 (2025-07-31)

### Features

* Update volume flag to support hostPath and PVC
* Add an option to disable the deployment of KubeFlow TrainingOperator
* Enable telemetry for CLI

## v1.0.0] ([2025]-[07]-[10])
## v3.0.0 (2025-07-10)

### Features

* Training Job - Create, List , Get
* Inference Jumpstart - Create , List, Get, Invoke
* Inference Custom - Create , List, Get, Invoke
* Observability changes
* Observability changes

## v2.0.0 (2024-12-04)

### Features

- feature: The HyperPod CLI now support ([Hyperpod recipes](https://github.com/aws/sagemaker-hyperpod-recipes.git)). The HyperPod recipes enable customers to get started training and fine-tuning popular publicly-available foundation models like Llama 3.1 405B in minutes. Learn more ([here](https://github.com/aws/sagemaker-hyperpod-recipes.git)).

## v1.0.0 (2024-09-09)

### Features

- feature: Add support for SageMaker HyperPod CLI
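The release headings in this CHANGELOG use two prefix styles (`v.3.3.0` and `v3.2.1`). As a rough illustration of how they could be checked automatically, the sketch below parses the headings shown in this diff and verifies that versions and dates both run newest-first. The regex and the `parse_releases` helper are assumptions made for this example, not part of the repository.

```python
import re
from datetime import date

# Accepts both "## v3.2.1 (2025-08-27)" and "## v.3.3.0 (2025-09-23)".
HEADING = re.compile(r"^## v\.?(\d+)\.(\d+)\.(\d+) \((\d{4})-(\d{2})-(\d{2})\)\s*$")


def parse_releases(text: str) -> list:
    """Return [((major, minor, patch), release_date), ...] in file order."""
    releases = []
    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            major, minor, patch, year, month, day = map(int, match.groups())
            releases.append(((major, minor, patch), date(year, month, day)))
    return releases


# Headings copied from the new side of the CHANGELOG.md diff.
CHANGELOG_HEADINGS = """\
## v.3.3.0 (2025-09-23)
## v.3.2.2 (2025-09-10)
## v3.2.1 (2025-08-27)
## v3.2.0 (2025-08-25)
## v3.1.0 (2025-08-13)
## v3.0.2 (2025-07-31)
## v3.0.0 (2025-07-10)
## v2.0.0 (2024-12-04)
## v1.0.0 (2024-09-09)
"""

releases = parse_releases(CHANGELOG_HEADINGS)
versions = [version for version, _ in releases]
dates = [released for _, released in releases]

if __name__ == "__main__":
    print("newest first by version:", versions == sorted(versions, reverse=True))
    print("newest first by date:", dates == sorted(dates, reverse=True))
```

A malformed heading such as the replaced `## v1.0.0] ([2025]-[07]-[10])` does not match the pattern, so a check along these lines would flag it by its absence from the parsed list.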


