Commit cb9856c

Merge pull request #939 from ytsssun/update-efa-for-bottlerocket
node-efa: extend doc for Bottlerocket
2 parents 5a56cf3 + da964bf commit cb9856c

1 file changed

+106
-6
lines changed

latest/ug/ml/node-efa.adoc

Lines changed: 106 additions & 6 deletions
@@ -124,17 +124,117 @@ If you don't have an existing cluster, you can run the following command to crea
124124
eksctl create cluster -f efa-cluster.yaml
125125
----
126126
+
127-
NOTE: Because the instance type used in this example has GPUs, `eksctl` automatically installs the NVIDIA Kubernetes device plugin on each instance for you.
128-
. Deploy the EFA Kubernetes device plugin.
127+
[NOTE]
128+
====
129+
Because the instance type used in this example has GPUs, `eksctl` automatically installs the NVIDIA Kubernetes device plugin on each instance when you use Amazon Linux 2. This is not necessary for Bottlerocket, because the NVIDIA device plugin is built into Bottlerocket's EKS NVIDIA variant. When `efaEnabled` is set to `true` in the nodegroup configuration, `eksctl` also automatically deploys the EFA device plugin on the nodes.
130+
====
131+
132+
[#efa-bottlerocket]
133+
=== Using Bottlerocket with EFA
134+
135+
Bottlerocket AMI versions 1.28.0 and later include official support for EFA. To use Bottlerocket for EFA-enabled nodes, specify `amiFamily: Bottlerocket` in your configuration. If you need to use a custom AMI ID, you must use standard `nodeGroups` instead of `managedNodeGroups`.
136+
137+
Here's an example configuration:
138+
139+
[source,yaml,subs="verbatim,attributes"]
140+
----
141+
apiVersion: eksctl.io/v1alpha5
142+
kind: ClusterConfig
143+
144+
metadata:
145+
name: my-efa-bottlerocket-cluster
146+
region: region-code
147+
version: "1.XX"
148+
149+
iam:
150+
withOIDC: true
151+
152+
availabilityZones: ["us-west-2a", "us-west-2c"]
153+
154+
managedNodeGroups:
155+
- name: my-efa-bottlerocket-ng
156+
instanceType: p5.48xlarge
157+
minSize: 1
158+
desiredCapacity: 2
159+
maxSize: 3
160+
availabilityZones: ["us-west-2a"]
161+
volumeSize: 300
162+
privateNetworking: true
163+
efaEnabled: true
164+
amiFamily: Bottlerocket
165+
bottlerocket:
166+
enableAdminContainer: true
167+
settings:
168+
kernel:
169+
sysctl:
170+
"vm.nr_hugepages": "3000" # Configures 3000 * 2Mi = 6000Mi hugepages
171+
----
172+
173+
The `vm.nr_hugepages` sysctl setting above configures the number of 2Mi hugepages. In this example, 3000 means 3000 * 2Mi = 6000Mi of hugepages.
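
The arithmetic above can be sketched as a small shell helper for sizing a different hugepage budget. This is an illustrative sketch only: the function name `hugepages_mib` is hypothetical, and it assumes the 2Mi hugepage size described above unless a second argument overrides it.

```shell
# Hypothetical helper: total hugepage memory in MiB for a vm.nr_hugepages value.
# $1: number of hugepages; $2: hugepage size in MiB (assumed to be 2 if omitted).
hugepages_mib() {
  echo $(( $1 * ${2:-2} ))
}

# The value used in the example configuration.
hugepages_mib 3000   # prints 6000
```

On a running Linux node, the `HugePages_Total` and `Hugepagesize` fields in `/proc/meminfo` should reflect the configured count and page size.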
174+
175+
[#verify-efa-device-plugin]
176+
=== Verify EFA device plugin installation
177+
178+
When you create a node group with `efaEnabled: true`, `eksctl` automatically deploys the EFA Kubernetes device plugin for you. You can verify that the device plugin is installed and functioning correctly:
179+
180+
. Check the DaemonSet status:
181+
+
182+
[source,bash,subs="verbatim,attributes"]
183+
----
184+
kubectl get daemonsets -n kube-system
185+
----
186+
+
187+
Sample output:
188+
+
189+
[source,bash,subs="verbatim,attributes"]
190+
----
191+
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
192+
aws-efa-k8s-device-plugin-daemonset 2 2 2 2 2 <none> 6m16s
193+
...
194+
----
195+
+
196+
Here, the EFA device plugin DaemonSet is running on two nodes. Both are READY and AVAILABLE.
197+
198+
. Next, verify the pods created by the DaemonSet:
199+
+
200+
[source,bash,subs="verbatim,attributes"]
201+
----
202+
kubectl get pods -n kube-system -l name=aws-efa-k8s-device-plugin
203+
----
129204
+
130-
The EFA Kubernetes device plugin detects and advertises EFA interfaces as allocatable resources to Kubernetes. An application can consume the extended resource type `vpc.amazonaws.com/efa` in a Pod request spec just like CPU and memory. For more information, see https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#consuming-extended-resources[Consuming extended resources] in the Kubernetes documentation. Once requested, the plugin automatically assigns and mounts an EFA interface to the Pod. Using the device plugin simplifies EFA setup and does not require a Pod to run in privileged mode.
205+
Sample output:
131206
+
132207
[source,bash,subs="verbatim,attributes"]
133208
----
134-
helm repo add eks https://aws.github.io/eks-charts
135-
helm install aws-efa-k8s-device-plugin --namespace kube-system eks/aws-efa-k8s-device-plugin
209+
NAME READY STATUS RESTARTS AGE
210+
aws-efa-k8s-device-plugin-daemonset-d68bs 1/1 Running 0 6m16s
211+
aws-efa-k8s-device-plugin-daemonset-w4l8t 1/1 Running 0 6m16s
136212
----
213+
+
214+
The EFA device plugin pods are in a Running state, confirming that the plugin is successfully deployed and operational.
137215

216+
. Verify resource registration:
217+
+
218+
You can confirm that the `vpc.amazonaws.com/efa` resource is registered with the kubelet by describing the nodes:
219+
+
220+
[source,bash,subs="verbatim,attributes"]
221+
----
222+
kubectl describe nodes
223+
----
224+
+
225+
If the EFA resource is properly registered, you will see it listed under the node's Capacity and Allocatable resources. For example:
226+
+
227+
[source,bash,subs="verbatim,attributes"]
228+
----
229+
Capacity:
230+
...
231+
vpc.amazonaws.com/efa: 4
232+
Allocatable:
233+
...
234+
vpc.amazonaws.com/efa: 4
235+
----
236+
+
237+
This output confirms that the node recognizes the EFA resource, making it available for pods that request it.
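
Because `kubectl describe nodes` prints a lot of output, it can help to filter it down to the EFA lines. The following is a minimal sketch, not part of the documented procedure: the function name `efa_summary` is hypothetical, and it assumes the `Name:` and `vpc.amazonaws.com/efa` lines appear as in the sample output above.

```shell
# Hypothetical filter: keep only node names and EFA resource lines
# from `kubectl describe nodes` output read on stdin.
efa_summary() {
  grep -E '^Name:|vpc\.amazonaws\.com/efa'
}

# Usage against a live cluster (requires kubectl access):
#   kubectl describe nodes | efa_summary
```

For each node, the filtered output should show the EFA resource twice: once under Capacity and once under Allocatable.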
138238

139239
[#efa-application]
140240
== (Optional) Test the performance of the EFA
@@ -305,4 +405,4 @@ View the log for the `nccl-tests-launcher` Pod. Replace [.replaceable]`nbql9` wi
305405
kubectl logs -f nccl-tests-launcher-nbql9
306406
----
307407

308-
If the test completed successfully, you can deploy your applications that use the Nvidia Collective Communication Library.
408+
If the test completed successfully, you can deploy your applications that use the Nvidia Collective Communication Library.
