Update readme to show which charts are required for specific features #105

vivek-koppuru · 2025-06-18T23:44:45Z

Adds more information to the README about which Helm charts are required for each feature. Addresses #95.

PR Approval Steps

For Requester

Description
- Check the PR title and description for clarity. It should describe the changes made and the reason behind them.
- Ensure that the PR follows the contribution guidelines, if applicable.
Security requirements
- Ensure that a Pull Request (PR) does not expose passwords and other sensitive information by using git-secrets and upload relevant evidence: https://github.com/awslabs/git-secrets
- Ensure each commit has a GitHub commit signature.
Manual review
1. Click on the Files changed tab to see the code changes. Review the changes thoroughly:
  - Code Quality: Check for coding standards, naming conventions, and readability.
  - Functionality: Ensure that the changes meet the requirements and that all necessary code paths are tested.
  - Security: Check for any security issues or vulnerabilities.
  - Documentation: Confirm that any necessary documentation (code comments, README updates, etc.) has been updated.
Check for Merge Conflicts:
- Verify if there are any merge conflicts with the base branch. GitHub will usually highlight this. If there are conflicts, you should resolve them.

For Reviewer

Go through the For Requester section to double-check each item.
Request Changes or Approve the PR:
1. If the PR is ready to be merged, click Review changes and select Approve.
2. If changes are required, select Request changes and provide feedback. Be constructive and clear in your feedback.
Merging the PR
1. Check the Merge Method:
  1. Decide on the appropriate merge method based on your repository's guidelines (e.g., Squash and merge, Rebase and merge, or Merge).
2. Merge the PR:
  1. Click the Merge pull request button.
  2. Confirm the merge by clicking Confirm merge.

nghtm

This PR is an improvement, but a customer coming to this page wanting to know "What do I need to install on my HyperPod cluster, and what is extra?" should have that question answered more clearly by this README.

nghtm · 2025-06-19T00:04:08Z

helm_chart/readme.md


+Here are the list of dependent charts and plugins that can be installed as part of the HyperPod Helm chart. Features required for HyperPod Resiliency are recommended to enable cluster resiliency. Features required for HyperPod Task Governance are optional but help set access control on your cluster. More information about orchestration features for cluster admins [here](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks.html).
+
 | Chart Name                   | Usage                                                                                                                                                                                   | Required For | Enable by default |


Please make it more explicit what customers are required to install versus what is optional.

If possible, please also provide example commands for turning features on and off, using --set flags or a values.yaml file.
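
As a rough sketch of what such an example might look like: Helm parent charts commonly gate each dependency behind a `condition` key such as `<subchart>.enabled`, which can be overridden in a values file or with `--set`. The key names below (`mpi-operator`, `neuron-device-plugin`) and the file name are assumptions for illustration only. They would need to be checked against the chart's actual values.yaml.

```yaml
# values-override.yaml (hypothetical file name)
# Key names are assumed, not taken from the chart; confirm them in its values.yaml.
mpi-operator:
  enabled: false   # skip a subchart you do not plan to use
neuron-device-plugin:
  enabled: true    # keep a subchart installed
```

Under those assumptions, the file would be applied with `helm upgrade --install <release> <chart-path> -f values-override.yaml`, and a single key could be set inline with `--set mpi-operator.enabled=false`.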

nghtm · 2025-06-19T00:04:42Z

helm_chart/readme.md

-| training-operators           | Installs operators for managing various machine learning training jobs, such as TensorFlow, PyTorch, and MXNet, providing native Kubernetes support for distributed training workloads. |              | Yes               |
-| HyperPod patching            | Deploys the RBAC and controller resources needed for orchestrating rolling updates and patching workflows in SageMaker HyperPod clusters. Includes pod eviction and node monitoring.    |              | Yes               |
-| aws-efa-k8s-device-plugin    | This plugin enables AWS Elastic Fabric Adapter (EFA) metrics on the EKS clusters.                                                                                                        |              | Yes               |
+| MPI Operators                | Orchestrates MPI (Message Passing Interface) jobs on Kubernetes, providing an efficient way to manage distributed machine learning or high-performance computing (HPC) workloads.       | HyperPod Resiliency             | Yes               |


Is this required for resiliency? I have had customers tell me they do not plan to use the MPI Operator. Do the resiliency features actually depend on it?

Please use the singular form: "MPI Operator", "Training Operator", etc.
Refer to official names here: https://github.com/kubeflow/mpi-operator and here: https://github.com/kubeflow/trainer/tree/v1.8-branch

nghtm · 2025-07-07T13:11:19Z

helm_chart/readme.md

-| namespaced-role-and-bindings | Creates roles and role bindings within a specific namespace to manage fine-grained access control for Kubernetes resources in a limited scope.                                          |              | No                |
-| neuron-device-plugin         | Deploys the AWS Neuron device plugin for Kubernetes, enabling support for AWS Inferentia chips to accelerate machine learning model inference workloads.                                |              | Yes               |
-| storage                      | Manages persistent storage resources for Kubernetes applications, ensuring that data is retained and accessible across pod restarts and cluster upgrades.                               |              | No                |
-| training-operators           | Installs operators for managing various machine learning training jobs, such as TensorFlow, PyTorch, and MXNet, providing native Kubernetes support for distributed training workloads. |              | Yes               |


How is this different from the HyperPod Training Operator? Two different concepts with the same name will lead to confusion.

https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-operator.html

I went ahead and renamed it to Kubeflow Training Operator to avoid the confusion, since we probably also want to clarify how the HyperPod Training Operator conflicts with it. I will also share this PR with the appropriate team for review.


nghtm · 2025-07-10T19:51:35Z

Can we please push this PR?

vivek-koppuru · 2025-07-10T21:30:05Z

> Can we please push this PR?

I was looking to have @jtbangani review this, will follow up.

nghtm · 2025-07-10T22:35:58Z

The readme formatting looks off when viewing the file

vivek-koppuru · 2025-07-10T22:42:07Z

> The readme formatting looks off when viewing the file

It should be updated now; I forgot to push the saved file.

vivek-koppuru requested a review from a team as a code owner June 18, 2025 23:44

vivek-koppuru had a problem deploying to manual-approval June 18, 2025 23:44 — with GitHub Actions Error

nghtm reviewed Jun 19, 2025


vivek-koppuru force-pushed the update-required branch from ab857d7 to ae31e65 Compare July 7, 2025 07:28

vivek-koppuru had a problem deploying to manual-approval July 7, 2025 07:29 — with GitHub Actions Error

vivek-koppuru force-pushed the update-required branch from ae31e65 to 20750db Compare July 7, 2025 07:30

vivek-koppuru had a problem deploying to manual-approval July 7, 2025 07:31 — with GitHub Actions Error

nghtm approved these changes Jul 7, 2025


vivek-koppuru force-pushed the update-required branch from 20750db to e39e019 Compare July 7, 2025 22:34

vivek-koppuru had a problem deploying to manual-approval July 7, 2025 22:35 — with GitHub Actions Error

nargokul pushed a commit that referenced this pull request Jul 10, 2025

Minor negative case update for Helm release name lookup during RIG Helm installation (#105)

63b0481

vivek-koppuru force-pushed the update-required branch from e39e019 to 0b170fe Compare July 10, 2025 21:34

vivek-koppuru had a problem deploying to manual-approval July 10, 2025 21:34 — with GitHub Actions Error

Update readme to show which charts are required for specific features

6605f6f

vivek-koppuru force-pushed the update-required branch from 0b170fe to 6605f6f Compare July 10, 2025 22:41

vivek-koppuru temporarily deployed to manual-approval July 10, 2025 22:41 — with GitHub Actions Inactive

jtbangani approved these changes Jul 14, 2025


rsareddy0329 merged commit c9571ae into aws:main Jul 17, 2025
4 of 16 checks passed




5 participants
