Skip to content

Commit 8d1136e

Browse files
authored
Update ReadMe for Helm Charts to specify required charts and their purpose (#101)
* Update readme.md formatting to add a "required for" column * Update readme.md * Update readme.md with some initial requirements for helm chart depedencies
1 parent 84ba244 commit 8d1136e

File tree

1 file changed

+15
-15
lines changed

1 file changed

+15
-15
lines changed

helm_chart/readme.md

Lines changed: 15 additions & 15 deletions
Original file line numberDiff line numberDiff line change
@@ -13,21 +13,21 @@ chmod 700 get_helm.sh
1313

1414
## 2. Package structure
1515

16-
| Chart Name | Usage | Enable by default |
17-
|------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------|
18-
| Cluster role and binding | Defines cluster-wide roles and bindings for Kubernetes resources, allowing cluster administrators to assign and manage permissions across the entire cluster. | No |
19-
| Team role and binging | Defines cluster and namespaced roles and bindings, allowing cluster administrators to create scientist roles with sufficient permissions to submit jobs to the accessible teams. | No |
20-
| Deep health check | Implements advanced health checks for Kubernetes services and pods to ensure deep monitoring of resource status and functionality beyond basic liveness and readiness probes. | Yes |
21-
| Health monitoring agent | Deploys an agent to continuously monitor the health of Kubernetes applications, providing detailed insights and alerting for potential issues. | Yes |
22-
| Job auto restart | Configures automatic restart policies for Kubernetes jobs, ensuring failed or terminated jobs are restarted based on predefined conditions for high availability. | Yes |
23-
| MLflow | Installs the MLflow platform for managing machine learning experiments, tracking models, and storing model artifacts in a scalable manner within the Kubernetes cluster. | No |
24-
| MPI Operators | Orchestrates MPI (Message Passing Interface) jobs on Kubernetes, providing an efficient way to manage distributed machine learning or high-performance computing (HPC) workloads. | Yes |
25-
| namespaced-role-and-bindings | Creates roles and role bindings within a specific namespace to manage fine-grained access control for Kubernetes resources in a limited scope. | No |
26-
| neuron-device-plugin | Deploys the AWS Neuron device plugin for Kubernetes, enabling support for AWS Inferentia chips to accelerate machine learning model inference workloads. | Yes |
27-
| storage | Manages persistent storage resources for Kubernetes applications, ensuring that data is retained and accessible across pod restarts and cluster upgrades. | No |
28-
| training-operators | Installs operators for managing various machine learning training jobs, such as TensorFlow, PyTorch, and MXNet, providing native Kubernetes support for distributed training workloads. | Yes |
29-
| HyperPod patching | Deploys the RBAC and controller resources needed for orchestrating rolling updates and patching workflows in SageMaker HyperPod clusters. Includes pod eviction and node monitoring. | Yes |
30-
| aws-efa-k8s-device-plugin | This plugin enables AWS Elastic Fabric Adapter (EFA) metrics on the EKS clusters. | Yes |
16+
| Chart Name | Usage | Required For | Enable by default |
17+
|------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------|-------------------|
18+
| Cluster role and binding | Defines cluster-wide roles and bindings for Kubernetes resources, allowing cluster administrators to assign and manage permissions across the entire cluster. | | No |
19+
| Team role and binging | Defines cluster and namespaced roles and bindings, allowing cluster administrators to create scientist roles with sufficient permissions to submit jobs to the accessible teams. | | No |
20+
| Deep health check | Implements advanced health checks for Kubernetes services and pods to ensure deep monitoring of resource status and functionality beyond basic liveness and readiness probes. | [Deep Health Check](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-resiliency-deep-health-checks.html) | Yes |
21+
| Health monitoring agent | Deploys an agent to continuously monitor the health of Kubernetes applications, providing detailed insights and alerting for potential issues. | [Health Checks done by Health Monitoring Agent](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-resiliency-health-monitoring-agent.html) | Yes |
22+
| Job auto restart | Configures automatic restart policies for Kubernetes jobs, ensuring failed or terminated jobs are restarted based on predefined conditions for high availability. | | Yes |
23+
| MLflow | Installs the MLflow platform for managing machine learning experiments, tracking models, and storing model artifacts in a scalable manner within the Kubernetes cluster. | | No |
24+
| MPI Operators | Orchestrates MPI (Message Passing Interface) jobs on Kubernetes, providing an efficient way to manage distributed machine learning or high-performance computing (HPC) workloads. | | Yes |
25+
| namespaced-role-and-bindings | Creates roles and role bindings within a specific namespace to manage fine-grained access control for Kubernetes resources in a limited scope. | | No |
26+
| neuron-device-plugin | Deploys the AWS Neuron device plugin for Kubernetes, enabling support for AWS Inferentia chips to accelerate machine learning model inference workloads. | | Yes |
27+
| storage | Manages persistent storage resources for Kubernetes applications, ensuring that data is retained and accessible across pod restarts and cluster upgrades. | | No |
28+
| training-operators | Installs operators for managing various machine learning training jobs, such as TensorFlow, PyTorch, and MXNet, providing native Kubernetes support for distributed training workloads. | | Yes |
29+
| HyperPod patching | Deploys the RBAC and controller resources needed for orchestrating rolling updates and patching workflows in SageMaker HyperPod clusters. Includes pod eviction and node monitoring. | | Yes |
30+
| aws-efa-k8s-device-plugin | This plugin enables AWS Elastic Fabric Adapter (EFA) metrics on the EKS clusters. | | Yes |
3131

3232
## 3. Test the Chart Locally
3333

0 commit comments

Comments
 (0)