Commit d42d2bd

revise sagemaker and IBM addons
1 parent 788b06a commit d42d2bd

2 files changed: 51 additions, 0 deletions

latest/ug/workloads/workloads-add-ons-available-eks.adoc

Lines changed: 27 additions & 0 deletions
@@ -66,6 +66,10 @@ You can use any of the following Amazon EKS add-ons.
 |<<addons-hyperpod-observability>>
 |EC2, EKS Auto Mode,
 
+|Amazon SageMaker HyperPod training operator enables efficient distributed training on Amazon EKS clusters with advanced scheduling and resource management capabilities.
+|<<addons-hyperpod-training-operator>>
+|EC2, EKS Auto Mode
+
 |A Kubernetes agent that collects and reports network flow data to Amazon CloudWatch, enabling comprehensive monitoring of TCP connections across cluster nodes.
 |<<addons-network-flow>>
 |EC2, EKS Auto Mode
@@ -437,6 +441,29 @@ This add-on uses the IAM roles for service accounts capability of Amazon EKS. Fo
 
 To learn more about the add-on and its capabilities, see link:sagemaker/latest/dg/sagemaker-hyperpod-eks-cluster-observability-cluster.html["SageMaker HyperPod Observability",type="documentation"].
 
+[#addons-hyperpod-training-operator]
+== Amazon SageMaker HyperPod training operator
+
+The Amazon SageMaker HyperPod training operator helps you accelerate generative AI model development by efficiently managing distributed training across large GPU clusters. It introduces intelligent fault recovery, hang job detection, and process-level management capabilities that minimize training disruptions and reduce costs. Unlike traditional training infrastructure that requires complete job restarts when failures occur, this operator implements surgical process recovery to keep your training jobs running smoothly.
+
+The operator also works with HyperPod's health monitoring and observability functions, providing real-time visibility into training execution and automatic monitoring of critical metrics like loss spikes and throughput degradation. You can define recovery policies through simple YAML configurations without code changes, allowing you to quickly respond to and recover from unrecoverable training states. These monitoring and recovery capabilities work together to maintain optimal training performance while minimizing operational overhead.
+
+The Amazon EKS add-on name is `amazon-sagemaker-hyperpod-training-operator`.
+
+For more information, see link:sagemaker/latest/dg/sagemaker-eks-operator.html[Using the HyperPod training operator,type="documentation"] in the _Amazon SageMaker Developer Guide_.
+
+=== Required IAM permissions
+
+This add-on requires IAM permissions and uses EKS Pod Identity.
+
+{aws} suggests the `AmazonSageMakerHyperPodTrainingOperatorAccess` link:aws-managed-policy/latest/reference/AmazonSageMakerHyperPodTrainingOperatorAccess.html["managed policy",type="documentation"].
+
+For more information, see link:sagemaker/latest/dg/sagemaker-eks-operator-install.html#sagemaker-eks-operator-install-operator[Installing the training operator,type="documentation"] in the _Amazon SageMaker Developer Guide_.
+
+=== Additional information
+
+To learn more about the add-on, see link:sagemaker/latest/dg/sagemaker-eks-operator.html["SageMaker HyperPod training operator",type="documentation"].
+
 [#addons-network-flow]
 == {aws} Network Flow Monitor Agent

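The add-on name documented in this file, `amazon-sagemaker-hyperpod-training-operator`, is the value you pass when installing the add-on through the EKS API. The following is a minimal sketch of that step, assuming the boto3 SDK and its EKS `create_addon` call. The helper names, the placeholder cluster name `my-cluster`, and the optional version argument are illustrative and not part of this commit. The sketch does not set up the EKS Pod Identity association or the IAM role that the add-on requires.

```python
from typing import Any, Dict, Optional

# Add-on name taken from the documentation text in this change.
HYPERPOD_TRAINING_OPERATOR_ADDON = "amazon-sagemaker-hyperpod-training-operator"


def build_create_addon_request(
    cluster_name: str, addon_version: Optional[str] = None
) -> Dict[str, Any]:
    """Build keyword arguments for the EKS CreateAddon API (hypothetical helper)."""
    if not cluster_name:
        raise ValueError("cluster_name must be a non-empty string")
    request: Dict[str, Any] = {
        "clusterName": cluster_name,
        "addonName": HYPERPOD_TRAINING_OPERATOR_ADDON,
    }
    # Only pin a version when the caller asks for one.
    if addon_version is not None:
        request["addonVersion"] = addon_version
    return request


def install_addon(cluster_name: str, addon_version: Optional[str] = None) -> Dict[str, Any]:
    """Call EKS CreateAddon. Requires boto3 and valid AWS credentials."""
    import boto3  # imported lazily so the request builder works without boto3

    eks = boto3.client("eks")
    return eks.create_addon(**build_create_addon_request(cluster_name, addon_version))


if __name__ == "__main__":
    # "my-cluster" is a placeholder cluster name, not taken from the source.
    print(build_create_addon_request("my-cluster"))
```

Keeping the request construction separate from the API call lets you inspect or log the parameters before anything is sent to AWS.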
latest/ug/workloads/workloads-add-ons-available-vendors.adoc

Lines changed: 24 additions & 0 deletions
@@ -333,6 +333,30 @@ A managed policy isn't used with this add-on.
 
 Custom permissions aren't used with this add-on.
 
+[#add-on-instana]
+== IBM Instana
+
+The add-on name is `instana_agent` and the namespace is `instana-agent`. IBM publishes the add-on.
+
+For information about the add-on, see link:blogs/ibm-redhat/implement-observability-for-amazon-eks-workloads-using-the-instana-amazon-eks-add-on/[Implement observability for Amazon EKS workloads using the Instana Amazon EKS add-on,type="marketing"] and link:blogs/ibm-redhat/monitor-and-optimize-amazon-eks-costs-with-ibm-instana-and-kubecost/[Monitor and optimize Amazon EKS costs with IBM Instana and Kubecost,type="marketing"] in the {aws} Blog.
+
+Instana Observability (Instana) offers an Amazon EKS Add-on that deploys Instana agents to Amazon EKS clusters. Customers can use this add-on to collect and analyze real-time performance data to gain insights into their containerized applications. The Instana Amazon EKS add-on provides visibility across your Kubernetes environments. Once deployed, the Instana agent automatically discovers components within your Amazon EKS clusters including nodes, namespaces, deployments, services, and pods.
+
+[#add-on-instana-service-account-name]
+=== Service account name
+
+A service account isn't used with this add-on.
+
+[#add-on-instana-managed-policy]
+=== {aws} managed IAM policy
+
+A managed policy isn't used with this add-on.
+
+[#add-on-instana-custom-permissions]
+=== Custom IAM permissions
+
+Custom permissions aren't used with this add-on.
+
 [#add-on-grafana]
 == Grafana Labs
