Commit ab857d7

Update readme to show which charts are required for specific features
1 parent 0f9f7b7 commit ab857d7

1 file changed

+52
-12
lines changed

helm_chart/readme.md

Lines changed: 52 additions & 12 deletions
@@ -13,21 +13,61 @@ chmod 700 get_helm.sh
1313

1414
## 2. Package structure
1515

16+
Here is the list of dependent charts and plugins that can be installed as part of the HyperPod Helm chart. Charts marked as required for HyperPod Resiliency are recommended, because they enable cluster resiliency. Charts marked as required for HyperPod Task Governance are optional, but they help you set access control on your cluster. For more information about orchestration features for cluster admins, see the [SageMaker HyperPod EKS documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks.html).
17+
1618
| Chart Name | Usage | Required For | Enable by default |
1719
|------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------|-------------------|
18-
| Cluster role and binding | Defines cluster-wide roles and bindings for Kubernetes resources, allowing cluster administrators to assign and manage permissions across the entire cluster. | | No |
19-
| Team role and binging | Defines cluster and namespaced roles and bindings, allowing cluster administrators to create scientist roles with sufficient permissions to submit jobs to the accessible teams. | | No |
20-
| Deep health check | Implements advanced health checks for Kubernetes services and pods to ensure deep monitoring of resource status and functionality beyond basic liveness and readiness probes. | [Deep Health Check](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-resiliency-deep-health-checks.html) | Yes |
21-
| Health monitoring agent | Deploys an agent to continuously monitor the health of Kubernetes applications, providing detailed insights and alerting for potential issues. | [Health Checks done by Health Monitoring Agent](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-resiliency-health-monitoring-agent.html) | Yes |
22-
| Job auto restart | Configures automatic restart policies for Kubernetes jobs, ensuring failed or terminated jobs are restarted based on predefined conditions for high availability. | | Yes |
20+
| Cluster role and binding | Defines cluster-wide roles and bindings for Kubernetes resources, allowing cluster administrators to assign and manage permissions across the entire cluster. | HyperPod Task Governance | No |
21+
| Team role and binding | Defines cluster and namespaced roles and bindings, allowing cluster administrators to create scientist roles with sufficient permissions to submit jobs to the accessible teams. | HyperPod Task Governance | No |
22+
| Deep health check | Implements advanced health checks for Kubernetes services and pods to ensure deep monitoring of resource status and functionality beyond basic liveness and readiness probes. | HyperPod Resiliency | Yes |
23+
| Health monitoring agent | Deploys an agent to continuously monitor the health of Kubernetes applications, providing detailed insights and alerting for potential issues. | HyperPod Resiliency | Yes |
24+
| Job auto restart | Configures automatic restart policies for Kubernetes jobs, ensuring failed or terminated jobs are restarted based on predefined conditions for high availability. | HyperPod Resiliency | Yes |
2325
| MLflow | Installs the MLflow platform for managing machine learning experiments, tracking models, and storing model artifacts in a scalable manner within the Kubernetes cluster. | | No |
24-
| MPI Operators | Orchestrates MPI (Message Passing Interface) jobs on Kubernetes, providing an efficient way to manage distributed machine learning or high-performance computing (HPC) workloads. | | Yes |
25-
| namespaced-role-and-bindings | Creates roles and role bindings within a specific namespace to manage fine-grained access control for Kubernetes resources in a limited scope. | | No |
26-
| neuron-device-plugin | Deploys the AWS Neuron device plugin for Kubernetes, enabling support for AWS Inferentia chips to accelerate machine learning model inference workloads. | | Yes |
27-
| storage | Manages persistent storage resources for Kubernetes applications, ensuring that data is retained and accessible across pod restarts and cluster upgrades. | | No |
28-
| training-operators | Installs operators for managing various machine learning training jobs, such as TensorFlow, PyTorch, and MXNet, providing native Kubernetes support for distributed training workloads. | | Yes |
29-
| HyperPod patching | Deploys the RBAC and controller resources needed for orchestrating rolling updates and patching workflows in SageMaker HyperPod clusters. Includes pod eviction and node monitoring. | | Yes |
30-
| aws-efa-k8s-device-plugin | This plugin enables AWS Elastic Fabric Adapter (EFA) metrics on the EKS clusters. | | Yes |
26+
| MPI Operators | Orchestrates MPI (Message Passing Interface) jobs on Kubernetes, providing an efficient way to manage distributed machine learning or high-performance computing (HPC) workloads. | HyperPod Resiliency | Yes |
27+
| Namespaced Role and Bindings | Creates roles and role bindings within a specific namespace to manage fine-grained access control for Kubernetes resources in a limited scope. | HyperPod Task Governance | No |
28+
| Storage | Manages persistent storage resources for Kubernetes applications, ensuring that data is retained and accessible across pod restarts and cluster upgrades. | | No |
29+
| Training Operators | Installs operators for managing various machine learning training jobs, such as TensorFlow, PyTorch, and MXNet, providing native Kubernetes support for distributed training workloads. | | Yes |
30+
| HyperPod patching | Deploys the RBAC and controller resources needed for orchestrating rolling updates and patching workflows in SageMaker HyperPod clusters. Includes pod eviction and node monitoring. | HyperPod Resiliency | Yes |
31+
32+
The following plugins are required for HyperPod Resiliency only if you use the corresponding supported devices, such as GPU or Neuron instances. You do not need to enable them here if you install these plugins on your own.
33+
34+
| Plugin Name | Usage | Required For | Enable by default |
35+
|------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------|-------------------|
36+
| neuron-device-plugin | Deploys the AWS Neuron device plugin for Kubernetes, enabling support for AWS Inferentia chips to accelerate machine learning model inference workloads. | HyperPod Resiliency with AWS Neuron | Yes |
37+
| aws-efa-k8s-device-plugin | This plugin enables AWS Elastic Fabric Adapter (EFA) metrics on the EKS clusters. | HyperPod Resiliency with AWS EFA | Yes |
38+
| nvidia-device-plugin | This plugin is a DaemonSet that exposes the number of GPUs on each node, keeps track of health metrics, and enables running GPU-enabled containers in EKS clusters. | HyperPod Resiliency with NVIDIA GPUs | Yes |
39+
40+
If you install these plugins on your own, make sure that the following configurations are set to work with your HyperPod EKS clusters:
41+
42+
Tolerations (across all plugins):
43+
```
44+
- key: sagemaker.amazonaws.com/node-health-status
45+
  operator: Equal
46+
  value: Unschedulable
47+
  effect: NoSchedule
48+
```
49+
50+
Node Affinities (for neuron and nvidia plugins):
51+
```
52+
affinity:
53+
  nodeAffinity:
54+
    requiredDuringSchedulingIgnoredDuringExecution:
55+
      nodeSelectorTerms:
56+
        - matchExpressions:
57+
            - key: "node.kubernetes.io/instance-type"
58+
              operator: In
59+
              values:
60+
                - <your HyperPod instance types>
61+
```
62+
63+
Supported Instance Labels (for efa plugin):
64+
Set this in your `values.yaml`:
65+
```
66+
supportedInstanceLabels:
67+
  values:
68+
    - <your HyperPod instance types>
69+
```
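
For a self-managed Neuron or NVIDIA device plugin, the toleration and node affinity above would typically both sit under the pod template of the plugin's DaemonSet. The fragment below is a minimal sketch of that placement, not the manifest shipped by any plugin: the DaemonSet name, labels, container name, and image placeholder are hypothetical, and the instance-type placeholder is left as in the README.

```yaml
# Sketch only: the name, labels, and container below are hypothetical placeholders.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: my-device-plugin            # hypothetical name
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: my-device-plugin
  template:
    metadata:
      labels:
        name: my-device-plugin
    spec:
      # Toleration from the README, so the plugin can still be scheduled on
      # nodes that carry the HyperPod node-health-status taint.
      tolerations:
        - key: sagemaker.amazonaws.com/node-health-status
          operator: Equal
          value: Unschedulable
          effect: NoSchedule
      # Node affinity from the README, restricting the plugin to the
      # instance types used by your HyperPod cluster.
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: "node.kubernetes.io/instance-type"
                    operator: In
                    values:
                      - <your HyperPod instance types>
      containers:
        - name: device-plugin        # hypothetical container name
          image: <your plugin image>
```

If the plugin is installed from its own Helm chart, the same `tolerations` and `affinity` keys are usually set through that chart's values file instead. The exact value names depend on the chart, so check its documentation before relying on this layout.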
70+
3171

3272
## 3. Test the Chart Locally
3373
