generated from amazon-archives/__template_Apache-2.0
-
Notifications
You must be signed in to change notification settings - Fork 49
Release new version for Health Monitoring Agent (1.0.790.0_1.0.266.0)…. #254
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
… with minor improvements and bug fixes.
huaziyao
approved these changes
Sep 3, 2025
zhaoqizqwang
approved these changes
Sep 3, 2025
Integ tests failed due to known errors. Change is this PR is only in helm_charts folder so overriding PR check |
papriwal
pushed a commit
to papriwal/sagemaker-hyperpod-cli
that referenced
this pull request
Sep 4, 2025
… with minor improvements and bug fixes. (aws#254)
mohamedzeidan2021
pushed a commit
to mohamedzeidan2021/sagemaker-hyperpod-cli
that referenced
this pull request
Sep 10, 2025
… with minor improvements and bug fixes. (aws#254)
rsareddy0329
pushed a commit
that referenced
this pull request
Sep 23, 2025
…ce launch (#249) * Release new version for Health Monitoring Agent (1.0.790.0_1.0.266.0) with minor improvements and bug fixes. (#254) * changelog version update (#256) Co-authored-by: Mohamed Zeidan <[email protected]> * Fix README documentation and broken anchor links (#252) **Description** - Updated README.md to fix broken internal navigation links, corrected SDK import paths, added proper syntax highlighting to code blocks. - Fixed training SDK imports, observability utils import path, and cluster management workflow examples. **Testing Done** - Verified all anchor links work correctly in table of contents and usage sections - Cross-referenced SDK imports against actual source code in src/sagemaker/hyperpod/ - Validated CLI commands match implementation in hyp_cli.py - Confirmed code examples use correct class names and method signatures * Small bug fix to print debug messages for inference logger (PySDK) (#246) * Draft of inference logger bug fix * Draft fix of inference logger for SDK * Revert adding --debug flag * Add debug parameter to failing unit tests * Fix create_from_dict to not have hardcoded debug flag * Add code-coverage workflow to GitHub workflows (#257) * Add code coverage workflow * Update artifact version to v4 * Fixed report upload * Simplified workflow using tox.ini * Make sure coverage is on right source files * Bug fix for 0 percent code coverage error * Bump version to 3.2.2 (#260) * Bump version to 3.2.2 **Description** Update package version from 3.2.1 to 3.2.2 in pyproject.toml and setup.py files. **Testing Done** Version bump only - no functional changes requiring additional testing. * Changelog update for v3.2.2 **Description** Added detaisl for Health Monitoring Agent updates to changelog **Testing Done** Production canary failure fixes validated. * Changelog update for v3.2.2 **Description** Updated the release date to represent the correct date. **Testing Done** No breaking changes. * Bump hyperpod-pytorch-job-template to v1.1.2 **Description** Update hyperpod-pytorch-job-template version from 1.1.1 to 1.1.2 and add changelog entry for node-count validation revert. **Testing Done** Version bump and changelog update - node-count validation revert functionality verified. * Update readme to include review guidelines (#261) * Update PR template * Update template * Update template format * Update format * Fix readme * Feature: Delete Cluster Command (#250) * delete cluster stack * delete cluster stack * removed unnecessary file * unit tests * more modular code * refactored modular code * sdk code added and improved modularity * cleanup * removed silent failure for sdk * fixed unit tests * integ tests * 2 integ happycase tests * changed test to use iam role instead of s3 bucket --------- Co-authored-by: Mohamed Zeidan <[email protected]> * Code Coverage for Integ Tests (#262) * Code Coverage for Integ Tests * Making sure target of coverage is correct * Removing duplicate implementation * Release new version for Health Monitoring Agent (1.0.819.0_1.0.267.0) with minor improvements and bug fixes. (#265) 1. New feature NVML API Check to detect hardware failure. Disabled Nvidia SMI query check 2. HMA will be able to detect File system read only error 3. For compatibility with AL2023, Non-NVIDIA devices will use a separate daemonset for deployment. * Removing duplicate cluster-creating integ test (#266) * Access entry fix (#267) * Fix Slurm failures from missing orchestration key (#268) * slurm-eks-helper-fix * Small fix to test to reflect new changes * small fix after resolving merge conflict --------- Co-authored-by: Xichao Wang <[email protected]> Co-authored-by: Mohamed Zeidan <[email protected]> Co-authored-by: Mohamed Zeidan <[email protected]> Co-authored-by: papriwal <[email protected]> Co-authored-by: aviruthen <[email protected]> Co-authored-by: Zhaoqi <[email protected]> Co-authored-by: jiayelamazon <[email protected]>
rsareddy0329
pushed a commit
that referenced
this pull request
Sep 24, 2025
…ce launch (#249) * Release new version for Health Monitoring Agent (1.0.790.0_1.0.266.0) with minor improvements and bug fixes. (#254) * changelog version update (#256) Co-authored-by: Mohamed Zeidan <[email protected]> * Fix README documentation and broken anchor links (#252) **Description** - Updated README.md to fix broken internal navigation links, corrected SDK import paths, added proper syntax highlighting to code blocks. - Fixed training SDK imports, observability utils import path, and cluster management workflow examples. **Testing Done** - Verified all anchor links work correctly in table of contents and usage sections - Cross-referenced SDK imports against actual source code in src/sagemaker/hyperpod/ - Validated CLI commands match implementation in hyp_cli.py - Confirmed code examples use correct class names and method signatures * Small bug fix to print debug messages for inference logger (PySDK) (#246) * Draft of inference logger bug fix * Draft fix of inference logger for SDK * Revert adding --debug flag * Add debug parameter to failing unit tests * Fix create_from_dict to not have hardcoded debug flag * Add code-coverage workflow to GitHub workflows (#257) * Add code coverage workflow * Update artifact version to v4 * Fixed report upload * Simplified workflow using tox.ini * Make sure coverage is on right source files * Bug fix for 0 percent code coverage error * Bump version to 3.2.2 (#260) * Bump version to 3.2.2 **Description** Update package version from 3.2.1 to 3.2.2 in pyproject.toml and setup.py files. **Testing Done** Version bump only - no functional changes requiring additional testing. * Changelog update for v3.2.2 **Description** Added detaisl for Health Monitoring Agent updates to changelog **Testing Done** Production canary failure fixes validated. * Changelog update for v3.2.2 **Description** Updated the release date to represent the correct date. **Testing Done** No breaking changes. * Bump hyperpod-pytorch-job-template to v1.1.2 **Description** Update hyperpod-pytorch-job-template version from 1.1.1 to 1.1.2 and add changelog entry for node-count validation revert. **Testing Done** Version bump and changelog update - node-count validation revert functionality verified. * Update readme to include review guidelines (#261) * Update PR template * Update template * Update template format * Update format * Fix readme * Feature: Delete Cluster Command (#250) * delete cluster stack * delete cluster stack * removed unnecessary file * unit tests * more modular code * refactored modular code * sdk code added and improved modularity * cleanup * removed silent failure for sdk * fixed unit tests * integ tests * 2 integ happycase tests * changed test to use iam role instead of s3 bucket --------- Co-authored-by: Mohamed Zeidan <[email protected]> * Code Coverage for Integ Tests (#262) * Code Coverage for Integ Tests * Making sure target of coverage is correct * Removing duplicate implementation * Release new version for Health Monitoring Agent (1.0.819.0_1.0.267.0) with minor improvements and bug fixes. (#265) 1. New feature NVML API Check to detect hardware failure. Disabled Nvidia SMI query check 2. HMA will be able to detect File system read only error 3. For compatibility with AL2023, Non-NVIDIA devices will use a separate daemonset for deployment. * Removing duplicate cluster-creating integ test (#266) * Access entry fix (#267) * Fix Slurm failures from missing orchestration key (#268) * slurm-eks-helper-fix * Small fix to test to reflect new changes * small fix after resolving merge conflict --------- Co-authored-by: Xichao Wang <[email protected]> Co-authored-by: Mohamed Zeidan <[email protected]> Co-authored-by: Mohamed Zeidan <[email protected]> Co-authored-by: papriwal <[email protected]> Co-authored-by: aviruthen <[email protected]> Co-authored-by: Zhaoqi <[email protected]> Co-authored-by: jiayelamazon <[email protected]>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
… with minor improvements and bug fixes.
What's changing and why?
Before/After UX
Before:
N/A
After:
N/A
How was this change tested?
Tested through node repair and fault detection through a newly created HyperPod EKS cluster on G5 and Trn1 instances.
Unit test coverage
Do we need integration tests?
Checklist