# Building, Packaging, and Testing Helm Charts

This guide walks cluster Admin users through the process of creating, packaging, and testing the HyperPod Helm chart. More information is available in the official AWS documentation [here](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-install-packages-using-helm-chart.html).
Here is the list of dependent charts and plugins that can be installed as part of the HyperPod Helm chart. The features required for HyperPod Resiliency, as noted below, are recommended to enable cluster resiliency. The features required for HyperPod Task Governance, as noted below, are optional but help set access control on your cluster.

More information about HyperPod task governance is available [here](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-operate-console-ui-governance.html).

More information about orchestration features for cluster admins is available [here](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks.html).
| Chart Name | Usage | Required For | Enabled by default |
|---|---|---|---|
| [Cluster role and binding](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-setup-rbac.html) | Defines cluster-wide roles and bindings for Kubernetes resources, allowing cluster administrators to assign and manage permissions across the entire cluster. | HyperPod Task Governance | No |
| [Namespaced Role and Bindings](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-setup-rbac.html) | Creates roles and role bindings within a specific namespace to manage fine-grained access control for Kubernetes resources in a limited scope. | HyperPod Task Governance | No |
| [Team role and binding](#5-create-team-role) | Defines cluster and namespaced roles and bindings, allowing cluster administrators to create scientist roles with sufficient permissions to submit jobs to the accessible teams. | HyperPod Task Governance | No |
| [Deep health check](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-resiliency-deep-health-checks.html) | Implements advanced health checks for Kubernetes services and pods to ensure deep monitoring of resource status and functionality beyond basic liveness and readiness probes. | HyperPod Resiliency | Yes |
| [Health monitoring agent](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-resiliency-health-monitoring-agent.html) | Deploys an agent to continuously monitor the health of Kubernetes applications, providing detailed insights and alerting for potential issues. | HyperPod Resiliency | Yes |
| Job auto restart | Configures automatic restart policies for Kubernetes jobs, ensuring failed or terminated jobs are restarted based on predefined conditions for high availability. | HyperPod Resiliency | Yes |
| MLflow | Installs the MLflow platform for managing machine learning experiments, tracking models, and storing model artifacts in a scalable manner within the Kubernetes cluster. | | No |
| [MPI Operator](https://www.kubeflow.org/docs/components/trainer/legacy-v1/user-guides/mpi/) | Orchestrates MPI (Message Passing Interface) jobs on Kubernetes, providing an efficient way to manage distributed machine learning or high-performance computing (HPC) workloads. | HyperPod Resiliency with MPIJobs | Yes |
| Storage | Manages persistent storage resources for Kubernetes applications, ensuring that data is retained and accessible across pod restarts and cluster upgrades. | | No |
| [Kubeflow Training Operator](https://www.kubeflow.org/docs/components/trainer/legacy-v1/overview/) | Installs operators for managing various machine learning training jobs, such as TensorFlow, PyTorch, and MXNet, providing native Kubernetes support for distributed training workloads. | | Yes |
| HyperPod patching | Deploys the RBAC and controller resources needed for orchestrating rolling updates and patching workflows in SageMaker HyperPod clusters. Includes pod eviction and node monitoring. | HyperPod Resiliency | Yes |
| hyperpod-inference-operator | Installs the HyperPod Inference Operator and its dependencies to the cluster, allowing cluster deployment and inferencing of JumpStart, S3-hosted, and FSx-hosted models. | | No |
> **_Note_** The `mpijob` scheme is disabled in the Training Operator helm chart to avoid conflicting with the MPI Operator.

If you would like to disable a helm chart that is enabled by default, such as the Training Operator, pass in `--set trainingOperators.enabled=false` when installing or upgrading the main chart, or set the following in the values.yaml file:

```
trainingOperators:
  enabled: false
```

If you would like to enable a helm chart that is disabled by default, such as the Storage chart, pass in `--set storage.enabled=true` when installing or upgrading the main chart, or set the following in the values.yaml file:

```
storage:
  enabled: true
```
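Multiple charts can be toggled in a single values.yaml override. As a sketch combining the two examples above (only `trainingOperators` and `storage` are keys confirmed by this guide; other charts follow the same `<chartName>.enabled` pattern):

```
trainingOperators:
  enabled: false
storage:
  enabled: true
```

The equivalent on the command line is repeating the flag: `--set trainingOperators.enabled=false --set storage.enabled=true`.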
---

The following plugins are required for HyperPod Resiliency only if you are using the corresponding supported devices, such as GPU or Neuron instances, and have not already installed these plugins on your own.
| Plugin Name | Usage | Required For | Enabled by default |
|---|---|---|---|
| neuron-device-plugin | Deploys the AWS Neuron device plugin for Kubernetes, enabling support for AWS Inferentia chips to accelerate machine learning model inference workloads. | HyperPod Resiliency with AWS Neuron | Yes |
| aws-efa-k8s-device-plugin | Enables AWS Elastic Fabric Adapter (EFA) metrics on EKS clusters. | HyperPod Resiliency with AWS EFA | Yes |
| nvidia-device-plugin | A DaemonSet that exposes the number of GPUs on each node, tracks GPU health metrics, and enables running GPU-enabled containers in EKS clusters. | HyperPod Resiliency with NVIDIA GPUs | Yes |

If you install these plugins on your own, make sure the following configurations are set so that the plugins work with your HyperPod EKS clusters.

Tolerations (across all plugins):

```
- key: sagemaker.amazonaws.com/node-health-status
  operator: Equal
  value: Unschedulable
  effect: NoSchedule
```

Node Affinities (for the neuron and nvidia plugins):

```
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: "node.kubernetes.io/instance-type"
              operator: In
              values:
                - <your HyperPod instance types>
```
83
+
84
+
Supported Instance Labels (for efa plugin):
85
+
Set this in your values.yaml
86
+
```
87
+
supportedInstanceLabels:
88
+
values:
89
+
- <your HyperPod instance types>
90
+
```
91
+

## 3. Test the Chart Locally

To ensure that your chart is properly defined, use the helm lint command:
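For example (the chart directory path is a placeholder; point it at your local copy of the chart):

```
helm lint <path-to-chart-directory>
```

`helm lint` reports any rules that the chart fails before you package or install it.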
## 5. Create Team Role

* To create a role for HyperPod cluster users, set the value of `computeQuotaTarget.targetId` when installing or upgrading the chart. This value is the same as the `targetId` of the quota allocation.
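As a sketch, the value can be passed on the command line when upgrading (the release name, chart path, and target ID below are placeholders):

```
helm upgrade <release-name> <path-to-chart-directory> \
  --set computeQuotaTarget.targetId=<quota-allocation-target-id>
```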