Commit 0704fd3

node-efa: extend doc for Bottlerocket
Signed-off-by: Yutong Sun <[email protected]>
1 parent bea440e commit 0704fd3

1 file changed

+103
-6
lines changed

latest/ug/ml/node-efa.adoc

Lines changed: 103 additions & 6 deletions
@@ -124,17 +124,114 @@ If you don't have an existing cluster, you can run the following command to crea
 eksctl create cluster -f efa-cluster.yaml
 ----
 +
-NOTE: Because the instance type used in this example has GPUs, `eksctl` automatically installs the NVIDIA Kubernetes device plugin on each instance for you.
-. Deploy the EFA Kubernetes device plugin.
+NOTE: Because the instance type used in this example has GPUs, `eksctl` automatically installs the NVIDIA Kubernetes device plugin on each instance for you. When `efaEnabled: true` is set in your configuration, `eksctl` also automatically deploys the EFA device plugin on your nodes.
+
+[#efa-bottlerocket]
+=== Using Bottlerocket with EFA
+
+Bottlerocket AMI version 1.28.0 and later include official support for EFA. To use Bottlerocket for EFA-enabled nodes, specify `amiFamily: Bottlerocket` in your configuration. If you need to use a custom AMI ID, you must use standard `nodeGroups` instead of `managedNodeGroups`.
+
+Here's an example configuration:
+
+[source,yaml,subs="verbatim,attributes"]
+----
+apiVersion: eksctl.io/v1alpha5
+kind: ClusterConfig
+
+metadata:
+  name: my-efa-bottlerocket-cluster
+  region: region-code
+  version: "1.XX"
+
+iam:
+  withOIDC: true
+
+availabilityZones: ["us-west-2a", "us-west-2c"]
+
+nodeGroups:
+  - name: my-efa-bottlerocket-ng
+    instanceType: p5.48xlarge
+    minSize: 1
+    desiredCapacity: 2
+    maxSize: 3
+    availabilityZones: ["us-west-2a"]
+    volumeSize: 300
+    privateNetworking: true
+    efaEnabled: true
+    amiFamily: Bottlerocket
+    bottlerocket:
+      enableAdminContainer: true
+      settings:
+        kernel:
+          sysctl:
+            "vm.nr_hugepages": "3000" # Configures 3000 * 2Mi = 6000Mi hugepages
+----
+
+The `vm.nr_hugepages` sysctl setting above configures the number of 2Mi hugepages. In this example, 3000 means 3000 * 2Mi = 6000Mi of hugepages.
+
+[#verify-efa-device-plugin]
+=== Verify EFA Device Plugin Installation
+
+When you create a node group with `efaEnabled: true`, eksctl automatically deploys the EFA Kubernetes device plugin for you. You can verify that the device plugin is installed and functioning correctly:
+
+. Check the DaemonSet status:
++
+[source,bash,subs="verbatim,attributes"]
+----
+kubectl get daemonsets -n kube-system
+----
 +
-The EFA Kubernetes device plugin detects and advertises EFA interfaces as allocatable resources to Kubernetes. An application can consume the extended resource type `vpc.amazonaws.com/efa` in a Pod request spec just like CPU and memory. For more information, see https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#consuming-extended-resources[Consuming extended resources] in the Kubernetes documentation. Once requested, the plugin automatically assigns and mounts an EFA interface to the Pod. Using the device plugin simplifies EFA setup and does not require a Pod to run in privileged mode.
+Sample output:
 +
 [source,bash,subs="verbatim,attributes"]
 ----
-helm repo add eks https://aws.github.io/eks-charts
-helm install aws-efa-k8s-device-plugin --namespace kube-system eks/aws-efa-k8s-device-plugin
+NAME                                  DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
+aws-efa-k8s-device-plugin-daemonset   2         2         2       2            2           <none>          6m16s
+...
 ----
++
+Here, the EFA device plugin DaemonSet is running on two nodes. Both are READY and AVAILABLE.
 
+. Next, verify the pods created by the DaemonSet:
++
+[source,bash,subs="verbatim,attributes"]
+----
+kubectl get pods -n kube-system -l name=aws-efa-k8s-device-plugin
+----
++
+Sample output:
++
+[source,bash,subs="verbatim,attributes"]
+----
+NAME                                        READY   STATUS    RESTARTS   AGE
+aws-efa-k8s-device-plugin-daemonset-d68bs   1/1     Running   0          6m16s
+aws-efa-k8s-device-plugin-daemonset-w4l8t   1/1     Running   0          6m16s
+----
++
+The EFA device plugin pods are in a Running state, confirming that the plugin is successfully deployed and operational.
+
+. Verify resource registration:
++
+You can confirm that the `vpc.amazonaws.com/efa` resource is registered with the kubelet by describing the nodes:
++
+[source,bash,subs="verbatim,attributes"]
+----
+kubectl describe nodes
+----
++
+If the EFA resource is properly registered, you will see it listed under the node's Capacity and Allocatable resources. For example:
++
+[source,bash,subs="verbatim,attributes"]
+----
+Capacity:
+  ...
+  vpc.amazonaws.com/efa: 4
+Allocatable:
+  ...
+  vpc.amazonaws.com/efa: 4
+----
++
+This output confirms that the node recognizes the EFA resource, making it available for pods that request it.
 
 [#efa-application]
 == (Optional) Test the performance of the EFA
@@ -305,4 +402,4 @@ View the log for the `nccl-tests-launcher` Pod. Replace [.replaceable]`nbql9` wi
 kubectl logs -f nccl-tests-launcher-nbql9
 ----
 
-If the test completed successfully, you can deploy your applications that use the Nvidia Collective Communication Library.
+If the test completed successfully, you can deploy your applications that use the Nvidia Collective Communication Library.
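
The "Verify resource registration" step added in this commit asks the reader to look through `kubectl describe nodes` output for `vpc.amazonaws.com/efa` under Capacity and Allocatable. As a rough illustration only (it is not part of the commit or of the EKS documentation), the same check could be scripted against the JSON that `kubectl get nodes -o json` returns. The function names below are hypothetical, and the sketch assumes `kubectl` is on the PATH and pointed at the cluster.

```python
import json
import subprocess

# Extended resource name advertised by the EFA device plugin (from the doc text).
EFA_RESOURCE = "vpc.amazonaws.com/efa"


def efa_counts(node_list):
    """Return {node name: (capacity, allocatable)} for the EFA resource.

    `node_list` is the parsed JSON of a Kubernetes NodeList. Nodes that do
    not advertise the resource are reported as (0, 0).
    """
    counts = {}
    for node in node_list.get("items", []):
        name = node["metadata"]["name"]
        status = node.get("status", {})
        capacity = int(status.get("capacity", {}).get(EFA_RESOURCE, "0"))
        allocatable = int(status.get("allocatable", {}).get(EFA_RESOURCE, "0"))
        counts[name] = (capacity, allocatable)
    return counts


def nodes_missing_efa(counts):
    """Return the sorted names of nodes with no allocatable EFA interfaces."""
    return sorted(name for name, (_, alloc) in counts.items() if alloc == 0)


def load_nodes_from_kubectl():
    """Fetch the NodeList by shelling out to kubectl (hypothetical helper)."""
    result = subprocess.run(
        ["kubectl", "get", "nodes", "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)
```

Calling `nodes_missing_efa(efa_counts(load_nodes_from_kubectl()))` would then list any nodes where the device plugin has not registered the resource. An empty list corresponds to the "properly registered" case described in the step.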
