Commit b8498b4

node-efa: extend doc for Bottlerocket
Signed-off-by: Yutong Sun <[email protected]>
1 parent bea440e commit b8498b4

1 file changed: +103 -6 lines changed

latest/ug/ml/node-efa.adoc

Lines changed: 103 additions & 6 deletions
@@ -124,17 +124,114 @@ If you don't have an existing cluster, you can run the following command to crea
124124
eksctl create cluster -f efa-cluster.yaml
125125
----
126126
+
127-
NOTE: Because the instance type used in this example has GPUs, `eksctl` automatically installs the NVIDIA Kubernetes device plugin on each instance for you.
128-
. Deploy the EFA Kubernetes device plugin.
127+
NOTE: Because the instance type used in this example has GPUs, `eksctl` automatically installs the NVIDIA Kubernetes device plugin on each instance when you use Amazon Linux 2. This step isn't needed for Bottlerocket, because the NVIDIA device plugin is built into Bottlerocket's EKS NVIDIA variant. When `efaEnabled` is set to `true` in the node group configuration, `eksctl` also automatically deploys the EFA device plugin on the nodes.
128+
129+
[#efa-bottlerocket]
130+
=== Using Bottlerocket with EFA
131+
132+
Bottlerocket AMI versions 1.28.0 and later include official support for EFA. To use Bottlerocket for EFA-enabled nodes, specify `amiFamily: Bottlerocket` in your configuration. If you need to use a custom AMI ID, you must use standard `nodeGroups` instead of `managedNodeGroups`.
133+
134+
Here's an example configuration:
135+
136+
[source,yaml,subs="verbatim,attributes"]
137+
----
138+
apiVersion: eksctl.io/v1alpha5
139+
kind: ClusterConfig
140+
141+
metadata:
142+
  name: my-efa-bottlerocket-cluster
143+
  region: region-code
144+
  version: "1.XX"
145+
146+
iam:
147+
  withOIDC: true
148+
149+
availabilityZones: ["us-west-2a", "us-west-2c"]
150+
151+
managedNodeGroups:
152+
  - name: my-efa-bottlerocket-ng
153+
    instanceType: p5.48xlarge
154+
    minSize: 1
155+
    desiredCapacity: 2
156+
    maxSize: 3
157+
    availabilityZones: ["us-west-2a"]
158+
    volumeSize: 300
159+
    privateNetworking: true
160+
    efaEnabled: true
161+
    amiFamily: Bottlerocket
162+
    bottlerocket:
163+
      enableAdminContainer: true
164+
      settings:
165+
        kernel:
166+
          sysctl:
167+
            "vm.nr_hugepages": "3000"  # Configures 3000 * 2Mi = 6000Mi hugepages
168+
----
169+
170+
The `vm.nr_hugepages` sysctl setting above configures the number of 2Mi hugepages. In this example, 3000 means 3000 * 2Mi = 6000Mi of hugepages.
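
The arithmetic above can also be run in reverse, to find the `vm.nr_hugepages` value for a target amount of memory. The following is an illustrative sketch only. The function name `hugepages_for_mi` is hypothetical and isn't part of `eksctl` or Bottlerocket, and the sketch assumes the 2Mi hugepage size described above.

```shell
# Hypothetical helper (not part of eksctl or Bottlerocket).
# Prints the vm.nr_hugepages value needed to reserve at least the given
# amount of memory in Mi, assuming 2Mi hugepages.
hugepages_for_mi() {
  target_mi="$1"
  page_size_mi=2
  # Round up so the reservation is never smaller than the target.
  echo $(( (target_mi + page_size_mi - 1) / page_size_mi ))
}

hugepages_for_mi 6000   # prints 3000
```

The printed value is what you would place in the `"vm.nr_hugepages"` setting of the node group configuration.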
171+
172+
[#verify-efa-device-plugin]
173+
=== Verify EFA Device Plugin Installation
174+
175+
When you create a node group with `efaEnabled: true`, `eksctl` automatically deploys the EFA Kubernetes device plugin for you. To verify that the device plugin is installed and working correctly, complete the following steps:
176+
177+
. Check the DaemonSet status:
178+
+
179+
[source,bash,subs="verbatim,attributes"]
180+
----
181+
kubectl get daemonsets -n kube-system
182+
----
129183
+
130-
The EFA Kubernetes device plugin detects and advertises EFA interfaces as allocatable resources to Kubernetes. An application can consume the extended resource type `vpc.amazonaws.com/efa` in a Pod request spec just like CPU and memory. For more information, see https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#consuming-extended-resources[Consuming extended resources] in the Kubernetes documentation. Once requested, the plugin automatically assigns and mounts an EFA interface to the Pod. Using the device plugin simplifies EFA setup and does not require a Pod to run in privileged mode.
184+
Sample output:
131185
+
132186
[source,bash,subs="verbatim,attributes"]
133187
----
134-
helm repo add eks https://aws.github.io/eks-charts
135-
helm install aws-efa-k8s-device-plugin --namespace kube-system eks/aws-efa-k8s-device-plugin
188+
NAME                                  DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
189+
aws-efa-k8s-device-plugin-daemonset   2         2         2       2            2           <none>          6m16s
190+
...
136191
----
192+
+
193+
Here, the EFA device plugin DaemonSet is running on two nodes. Both are READY and AVAILABLE.
137194

195+
. Next, verify the pods created by the DaemonSet:
196+
+
197+
[source,bash,subs="verbatim,attributes"]
198+
----
199+
kubectl get pods -n kube-system -l name=aws-efa-k8s-device-plugin
200+
----
201+
+
202+
Sample output:
203+
+
204+
[source,bash,subs="verbatim,attributes"]
205+
----
206+
NAME                                        READY   STATUS    RESTARTS   AGE
207+
aws-efa-k8s-device-plugin-daemonset-d68bs   1/1     Running   0          6m16s
208+
aws-efa-k8s-device-plugin-daemonset-w4l8t   1/1     Running   0          6m16s
209+
----
210+
+
211+
The EFA device plugin pods are in a Running state, confirming that the plugin is successfully deployed and operational.
212+
213+
. Verify resource registration:
214+
+
215+
You can confirm that the `vpc.amazonaws.com/efa` resource is registered with the kubelet by describing the nodes:
216+
+
217+
[source,bash,subs="verbatim,attributes"]
218+
----
219+
kubectl describe nodes
220+
----
221+
+
222+
If the EFA resource is properly registered, you will see it listed under the node's Capacity and Allocatable resources. For example:
223+
+
224+
[source,bash,subs="verbatim,attributes"]
225+
----
226+
Capacity:
227+
  ...
228+
  vpc.amazonaws.com/efa:  4
229+
Allocatable:
230+
  ...
231+
  vpc.amazonaws.com/efa:  4
232+
----
233+
+
234+
This output confirms that the node recognizes the EFA resource, making it available for pods that request it.
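
The checks in the preceding steps can also be scripted. The following is a minimal sketch, not an official tool. The function names `daemonset_ready` and `efa_allocatable` are hypothetical helpers that only parse text piped in from `kubectl`. The sketch assumes the column order shown in the DaemonSet sample output, and it assumes that `kubectl describe nodes` prints `Allocatable:` at the start of a line with its entries indented beneath it.

```shell
# Hypothetical helpers; they are not part of kubectl or eksctl.

# Succeeds if the named DaemonSet row reports DESIRED equal to READY.
# Expects `kubectl get daemonsets -n kube-system` output on stdin.
daemonset_ready() {
  awk -v name="$1" '
    $1 == name { found = 1; ok = ($2 == $4) }   # $2 = DESIRED, $4 = READY
    END { if (found && ok) exit 0; exit 1 }
  '
}

# Prints the vpc.amazonaws.com/efa count from each Allocatable section.
# Expects `kubectl describe nodes` output on stdin.
efa_allocatable() {
  awk '
    /^Allocatable:/ { in_alloc = 1; next }      # entering an Allocatable section
    /^[^[:space:]]/ { in_alloc = 0 }            # any other unindented line ends it
    in_alloc && $1 == "vpc.amazonaws.com/efa:" { print $2 }
  '
}
```

For example, `kubectl get daemonsets -n kube-system | daemonset_ready aws-efa-k8s-device-plugin-daemonset` returns a zero exit status when every desired plugin Pod is ready, and `kubectl describe nodes | efa_allocatable` prints one EFA count per node that advertises the resource.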
138235

139236
[#efa-application]
140237
== (Optional) Test the performance of the EFA
@@ -305,4 +402,4 @@ View the log for the `nccl-tests-launcher` Pod. Replace [.replaceable]`nbql9` wi
305402
kubectl logs -f nccl-tests-launcher-nbql9
306403
----
307404

308-
If the test completed successfully, you can deploy your applications that use the Nvidia Collective Communication Library.
405+
If the test completed successfully, you can deploy your applications that use the Nvidia Collective Communication Library.
