Merged
112 changes: 106 additions & 6 deletions latest/ug/ml/node-efa.adoc
@@ -124,17 +124,117 @@ If you don't have an existing cluster, you can run the following command to crea
eksctl create cluster -f efa-cluster.yaml
----
+
. Deploy the EFA Kubernetes device plugin.
[NOTE]
====
Because the instance type used in this example has GPUs, `eksctl` automatically installs the NVIDIA Kubernetes device plugin on each instance for you when using Amazon Linux 2. This is not necessary for Bottlerocket, as the NVIDIA device plugin is built into Bottlerocket's EKS NVIDIA variant. When `efaEnabled` is set to `true` in the node group configuration, `eksctl` also automatically deploys the EFA device plugin on the nodes.
====

[#efa-bottlerocket]
=== Using Bottlerocket with EFA

Bottlerocket AMI versions 1.28.0 and later include official support for EFA. To use Bottlerocket for EFA-enabled nodes, specify `amiFamily: Bottlerocket` in your configuration. If you need to use a custom AMI ID, you must use standard `nodeGroups` instead of `managedNodeGroups`.

The following example configuration creates a cluster with an EFA-enabled Bottlerocket managed node group:

[source,yaml,subs="verbatim,attributes"]
----
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: my-efa-bottlerocket-cluster
  region: region-code
  version: "1.XX"

iam:
  withOIDC: true

availabilityZones: ["us-west-2a", "us-west-2c"]

managedNodeGroups:
  - name: my-efa-bottlerocket-ng
    instanceType: p5.48xlarge
    minSize: 1
    desiredCapacity: 2
    maxSize: 3
    availabilityZones: ["us-west-2a"]
    volumeSize: 300
    privateNetworking: true
    efaEnabled: true
    amiFamily: Bottlerocket
    bottlerocket:
      enableAdminContainer: true
      settings:
        kernel:
          sysctl:
            "vm.nr_hugepages": "3000" # Configures 3000 * 2Mi = 6000Mi hugepages
----

The `vm.nr_hugepages` sysctl setting above configures the number of 2Mi hugepages. In this example, 3000 means 3000 * 2Mi = 6000Mi of hugepages.
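
The arithmetic behind this setting can be checked with a short shell sketch. The variable names are illustrative only and are not part of `eksctl` or Bottlerocket. The 2Mi page size is taken from the example above.

```shell
# Convert a vm.nr_hugepages count into the total memory it reserves.
# Assumption: each hugepage is 2Mi, as in the example configuration.
nr_hugepages=3000
hugepage_size_mi=2

total_mi=$((nr_hugepages * hugepage_size_mi))
echo "${total_mi}Mi"
```

With the values from the example, this prints `6000Mi`. To size the setting for a different workload, change `nr_hugepages` and rerun the calculation.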

[#verify-efa-device-plugin]
=== Verify EFA device plugin installation

When you create a node group with `efaEnabled: true`, `eksctl` automatically deploys the EFA Kubernetes device plugin for you. You can verify that the device plugin is installed and functioning correctly:

. Check the DaemonSet status:
+
[source,bash,subs="verbatim,attributes"]
----
kubectl get daemonsets -n kube-system
----
+
Sample output:
+
[source,bash,subs="verbatim,attributes"]
----
NAME                                  DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
aws-efa-k8s-device-plugin-daemonset   2         2         2       2            2           <none>          6m16s
...
----
+
Here, the EFA device plugin DaemonSet is running on two nodes. Both are READY and AVAILABLE.

. Next, verify the pods created by the DaemonSet:
+
[source,bash,subs="verbatim,attributes"]
----
kubectl get pods -n kube-system -l name=aws-efa-k8s-device-plugin
----
+
The EFA Kubernetes device plugin detects and advertises EFA interfaces as allocatable resources to Kubernetes. An application can consume the extended resource type `vpc.amazonaws.com/efa` in a Pod request spec just like CPU and memory. For more information, see https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#consuming-extended-resources[Consuming extended resources] in the Kubernetes documentation. Once requested, the plugin automatically assigns and mounts an EFA interface to the Pod. Using the device plugin simplifies EFA setup and does not require a Pod to run in privileged mode.
+
Sample output:
+
[source,bash,subs="verbatim,attributes"]
----
NAME                                        READY   STATUS    RESTARTS   AGE
aws-efa-k8s-device-plugin-daemonset-d68bs   1/1     Running   0          6m16s
aws-efa-k8s-device-plugin-daemonset-w4l8t   1/1     Running   0          6m16s
----
+
The EFA device plugin pods are in a Running state, confirming that the plugin is successfully deployed and operational.

. Verify resource registration:
+
You can confirm that the `vpc.amazonaws.com/efa` resource is registered with the kubelet by describing the nodes:
+
[source,bash,subs="verbatim,attributes"]
----
kubectl describe nodes
----
+
If the EFA resource is properly registered, you will see it listed under the node's Capacity and Allocatable resources. For example:
+
[source,bash,subs="verbatim,attributes"]
----
Capacity:
  ...
  vpc.amazonaws.com/efa:  4
Allocatable:
  ...
  vpc.amazonaws.com/efa:  4
----
+
This output confirms that the node recognizes the EFA resource, making it available for pods that request it.
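
On clusters with many nodes, the full `kubectl describe nodes` output can be long. As a sketch of a more compact check, the following command prints only each node's name and its allocatable EFA count. The `columns` variable and the `NODE` and `EFA` column headings are illustrative choices, not names from the EFA device plugin.

```shell
# Custom-columns spec. The dots inside the resource name are escaped with
# backslashes so that kubectl reads "vpc.amazonaws.com/efa" as a single key.
columns='NODE:.metadata.name,EFA:.status.allocatable.vpc\.amazonaws\.com/efa'

# Print one row per node with its allocatable EFA interface count.
kubectl get nodes -o custom-columns="$columns"
```

Nodes that don't advertise the `vpc.amazonaws.com/efa` resource show `<none>` in the `EFA` column.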

[#efa-application]
== (Optional) Test the performance of the EFA
@@ -305,4 +405,4 @@ View the log for the `nccl-tests-launcher` Pod. Replace [.replaceable]`nbql9` wi
kubectl logs -f nccl-tests-launcher-nbql9
----

If the test completed successfully, you can deploy your applications that use the NVIDIA Collective Communications Library (NCCL).