Update ReadMe for Helm Charts to specify required charts and their purpose (#101)
* Update readme.md formatting to add a "required for" column
* Update readme.md
* Update readme.md with some initial requirements for Helm chart dependencies
- | Cluster role and binding | Defines cluster-wide roles and bindings for Kubernetes resources, allowing cluster administrators to assign and manage permissions across the entire cluster. | No |
- | Team role and binding | Defines cluster and namespaced roles and bindings, allowing cluster administrators to create scientist roles with sufficient permissions to submit jobs to the accessible teams. | No |
- | Deep health check | Implements advanced health checks for Kubernetes services and pods to ensure deep monitoring of resource status and functionality beyond basic liveness and readiness probes. | Yes |
- | Health monitoring agent | Deploys an agent to continuously monitor the health of Kubernetes applications, providing detailed insights and alerting for potential issues. | Yes |
- | Job auto restart | Configures automatic restart policies for Kubernetes jobs, ensuring failed or terminated jobs are restarted based on predefined conditions for high availability. | Yes |
- | MLflow | Installs the MLflow platform for managing machine learning experiments, tracking models, and storing model artifacts in a scalable manner within the Kubernetes cluster. | No |
- | MPI Operators | Orchestrates MPI (Message Passing Interface) jobs on Kubernetes, providing an efficient way to manage distributed machine learning or high-performance computing (HPC) workloads. | Yes |
- | namespaced-role-and-bindings | Creates roles and role bindings within a specific namespace to manage fine-grained access control for Kubernetes resources in a limited scope. | No |
- | neuron-device-plugin | Deploys the AWS Neuron device plugin for Kubernetes, enabling support for AWS Inferentia chips to accelerate machine learning model inference workloads. | Yes |
- | storage | Manages persistent storage resources for Kubernetes applications, ensuring that data is retained and accessible across pod restarts and cluster upgrades. | No |
- | training-operators | Installs operators for managing various machine learning training jobs, such as TensorFlow, PyTorch, and MXNet, providing native Kubernetes support for distributed training workloads. | Yes |
- | HyperPod patching | Deploys the RBAC and controller resources needed for orchestrating rolling updates and patching workflows in SageMaker HyperPod clusters. Includes pod eviction and node monitoring. | Yes |
- | aws-efa-k8s-device-plugin | This plugin enables AWS Elastic Fabric Adapter (EFA) metrics on the EKS clusters. | Yes |
+ | Chart Name | Usage | Required For | Enable by default |
+ | Cluster role and binding | Defines cluster-wide roles and bindings for Kubernetes resources, allowing cluster administrators to assign and manage permissions across the entire cluster. ||No |
+ | Team role and binding | Defines cluster and namespaced roles and bindings, allowing cluster administrators to create scientist roles with sufficient permissions to submit jobs to the accessible teams. ||No |
+ | Deep health check | Implements advanced health checks for Kubernetes services and pods to ensure deep monitoring of resource status and functionality beyond basic liveness and readiness probes. |[Deep Health Check](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-resiliency-deep-health-checks.html)|Yes |
+ | Health monitoring agent | Deploys an agent to continuously monitor the health of Kubernetes applications, providing detailed insights and alerting for potential issues. |[Health Checks done by Health Monitoring Agent](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-resiliency-health-monitoring-agent.html)|Yes |
+ | Job auto restart | Configures automatic restart policies for Kubernetes jobs, ensuring failed or terminated jobs are restarted based on predefined conditions for high availability. ||Yes |
+ | MLflow | Installs the MLflow platform for managing machine learning experiments, tracking models, and storing model artifacts in a scalable manner within the Kubernetes cluster. ||No |
+ | MPI Operators | Orchestrates MPI (Message Passing Interface) jobs on Kubernetes, providing an efficient way to manage distributed machine learning or high-performance computing (HPC) workloads. ||Yes |
+ | namespaced-role-and-bindings | Creates roles and role bindings within a specific namespace to manage fine-grained access control for Kubernetes resources in a limited scope. ||No |
+ | neuron-device-plugin | Deploys the AWS Neuron device plugin for Kubernetes, enabling support for AWS Inferentia chips to accelerate machine learning model inference workloads. ||Yes |
+ | storage | Manages persistent storage resources for Kubernetes applications, ensuring that data is retained and accessible across pod restarts and cluster upgrades. ||No |
+ | training-operators | Installs operators for managing various machine learning training jobs, such as TensorFlow, PyTorch, and MXNet, providing native Kubernetes support for distributed training workloads. ||Yes |
+ | HyperPod patching | Deploys the RBAC and controller resources needed for orchestrating rolling updates and patching workflows in SageMaker HyperPod clusters. Includes pod eviction and node monitoring. ||Yes |
+ | aws-efa-k8s-device-plugin | This plugin enables AWS Elastic Fabric Adapter (EFA) metrics on the EKS clusters. || Yes |