Commit f7b43e5

node-efa: extend doc for Bottlerocket
Signed-off-by: Yutong Sun <[email protected]>
1 parent bea440e commit f7b43e5

1 file changed: +105 -6 lines changed

latest/ug/ml/node-efa.adoc

Lines changed: 105 additions & 6 deletions
@@ -124,17 +124,116 @@ If you don't have an existing cluster, you can run the following command to crea
124124
eksctl create cluster -f efa-cluster.yaml
125125
----
126126
+
127-
NOTE: Because the instance type used in this example has GPUs, `eksctl` automatically installs the NVIDIA Kubernetes device plugin on each instance for you.
128-
. Deploy the EFA Kubernetes device plugin.
127+
[NOTE]
128+
====
129+
Because the instance type used in this example has GPUs, `eksctl` automatically installs the NVIDIA Kubernetes device plugin on each instance for you when using Amazon Linux 2. This is not necessary for Bottlerocket, as the NVIDIA device plugin is built into Bottlerocket's EKS NVIDIA variant. When `efaEnabled` is set to `true` in the nodegroup configuration, `eksctl` will also automatically deploy the EFA device plugin on the nodes.
130+
====
131+
[#efa-bottlerocket]
132+
=== Using Bottlerocket with EFA
133+
134+
Bottlerocket AMI versions 1.28.0 and later include official support for EFA. To use Bottlerocket for EFA-enabled nodes, specify `amiFamily: Bottlerocket` in your configuration. If you need to use a custom AMI ID, you must use standard `nodeGroups` instead of `managedNodeGroups`.
135+
136+
Here's an example configuration:
137+
138+
[source,yaml,subs="verbatim,attributes"]
139+
----
140+
apiVersion: eksctl.io/v1alpha5
141+
kind: ClusterConfig
142+
143+
metadata:
144+
  name: my-efa-bottlerocket-cluster
145+
  region: region-code
146+
  version: "1.XX"
147+
148+
iam:
149+
  withOIDC: true
150+
151+
availabilityZones: ["us-west-2a", "us-west-2c"]
152+
153+
managedNodeGroups:
154+
  - name: my-efa-bottlerocket-ng
155+
    instanceType: p5.48xlarge
156+
    minSize: 1
157+
    desiredCapacity: 2
158+
    maxSize: 3
159+
    availabilityZones: ["us-west-2a"]
160+
    volumeSize: 300
161+
    privateNetworking: true
162+
    efaEnabled: true
163+
    amiFamily: Bottlerocket
164+
    bottlerocket:
165+
      enableAdminContainer: true
166+
      settings:
167+
        kernel:
168+
          sysctl:
169+
            "vm.nr_hugepages": "3000" # Configures 3000 * 2Mi = 6000Mi hugepages
170+
----
171+
172+
The `vm.nr_hugepages` sysctl setting above configures the number of 2Mi hugepages. In this example, 3000 means 3000 * 2Mi = 6000Mi of hugepages.
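
As a rough sanity check of the arithmetic above, the following shell sketch converts a hugepage count into total memory. The function name `hugepages_mib` is hypothetical, and the 2 MiB default page size is an assumption that matches this example; on a running node, the `Hugepagesize` field in `/proc/meminfo` reports the actual value.

```shell
# Hypothetical helper: total hugepage memory in MiB.
# $1 = number of pages (the vm.nr_hugepages value)
# $2 = page size in MiB (assumed to be 2 when omitted)
hugepages_mib() {
  local pages="$1"
  local size_mib="${2:-2}"
  echo $(( pages * size_mib ))
}

hugepages_mib 3000   # prints 6000
```

Passing a second argument, for example `hugepages_mib 10 1024`, covers nodes configured with a different hugepage size.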
173+
174+
[#verify-efa-device-plugin]
175+
=== Verify EFA device plugin installation
176+
177+
When you create a node group with `efaEnabled: true`, `eksctl` automatically deploys the EFA Kubernetes device plugin for you. You can verify that the device plugin is installed and functioning correctly:
178+
179+
. Check the DaemonSet status:
180+
+
181+
[source,bash,subs="verbatim,attributes"]
182+
----
183+
kubectl get daemonsets -n kube-system
184+
----
185+
+
186+
Sample output:
187+
+
188+
[source,bash,subs="verbatim,attributes"]
189+
----
190+
NAME                                  DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
191+
aws-efa-k8s-device-plugin-daemonset   2         2         2       2            2           <none>          6m16s
192+
...
193+
----
194+
+
195+
Here, the EFA device plugin DaemonSet is running on two nodes. Both are READY and AVAILABLE.
196+
197+
. Next, verify the pods created by the DaemonSet:
198+
+
199+
[source,bash,subs="verbatim,attributes"]
200+
----
201+
kubectl get pods -n kube-system -l name=aws-efa-k8s-device-plugin
202+
----
129203
+
130-
The EFA Kubernetes device plugin detects and advertises EFA interfaces as allocatable resources to Kubernetes. An application can consume the extended resource type `vpc.amazonaws.com/efa` in a Pod request spec just like CPU and memory. For more information, see https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#consuming-extended-resources[Consuming extended resources] in the Kubernetes documentation. Once requested, the plugin automatically assigns and mounts an EFA interface to the Pod. Using the device plugin simplifies EFA setup and does not require a Pod to run in privileged mode.
204+
Sample output:
131205
+
132206
[source,bash,subs="verbatim,attributes"]
133207
----
134-
helm repo add eks https://aws.github.io/eks-charts
135-
helm install aws-efa-k8s-device-plugin --namespace kube-system eks/aws-efa-k8s-device-plugin
208+
NAME                                        READY   STATUS    RESTARTS   AGE
209+
aws-efa-k8s-device-plugin-daemonset-d68bs   1/1     Running   0          6m16s
210+
aws-efa-k8s-device-plugin-daemonset-w4l8t   1/1     Running   0          6m16s
136211
----
212+
+
213+
The EFA device plugin pods are in a Running state, confirming that the plugin is successfully deployed and operational.
137214

215+
. Verify resource registration:
216+
+
217+
You can confirm that the `vpc.amazonaws.com/efa` resource is registered with the kubelet by describing the nodes:
218+
+
219+
[source,bash,subs="verbatim,attributes"]
220+
----
221+
kubectl describe nodes
222+
----
223+
+
224+
If the EFA resource is properly registered, you will see it listed under the node's Capacity and Allocatable resources. For example:
225+
+
226+
[source,bash,subs="verbatim,attributes"]
227+
----
228+
Capacity:
229+
  ...
230+
  vpc.amazonaws.com/efa: 4
231+
Allocatable:
232+
  ...
233+
  vpc.amazonaws.com/efa: 4
234+
----
235+
+
236+
This output confirms that the node recognizes the EFA resource, making it available for pods that request it.
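
To check many nodes at once, the output of `kubectl describe nodes` can be filtered down to the allocatable EFA count per node. The sketch below is illustrative only: `efa_allocatable` is a hypothetical helper, not part of `kubectl` or `eksctl`, and it assumes the text layout shown in the sample output above (a `Name:` line for each node and an indented `vpc.amazonaws.com/efa` entry under `Allocatable:`).

```shell
# Hypothetical helper: reads `kubectl describe nodes` text on stdin and
# prints "<node-name> <allocatable-efa-count>" for each node that lists
# vpc.amazonaws.com/efa under its Allocatable section.
efa_allocatable() {
  awk '
    /^Name:/        { node = $2 }            # remember the current node name
    /^Allocatable:/ { in_alloc = 1; next }   # entering the Allocatable section
    /^[^ \t]/       { in_alloc = 0 }         # any other top-level key ends it
    in_alloc && $1 == "vpc.amazonaws.com/efa:" { print node, $2 }
  '
}

# Usage (requires access to a cluster):
#   kubectl describe nodes | efa_allocatable
```

Nodes that do not advertise the resource produce no line, so an empty result suggests that the device plugin has not registered EFA interfaces on any node.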
138237

139238
[#efa-application]
140239
== (Optional) Test the performance of the EFA
@@ -305,4 +404,4 @@ View the log for the `nccl-tests-launcher` Pod. Replace [.replaceable]`nbql9` wi
305404
kubectl logs -f nccl-tests-launcher-nbql9
306405
----
307406

308-
If the test completed successfully, you can deploy your applications that use the NVIDIA Collective Communications Library.
407+
If the test completed successfully, you can deploy your applications that use the NVIDIA Collective Communications Library.
