Commit 33bec3e

Merge pull request #53022 from StephenJamesSmith/TELCODOCS-756-AWS

TELCODOCS-756-AWS: First draft

2 parents f583c1d + c152823

3 files changed: +282 -0 lines changed
machine_management/creating_machinesets/creating-machineset-aws.adoc

Lines changed: 6 additions & 0 deletions

@@ -37,3 +37,9 @@ include::modules/machineset-non-guaranteed-instance.adoc[leveloffset=+1]
 
 //Creating Spot Instances by using compute machine sets
 include::modules/machineset-creating-non-guaranteed-instances.adoc[leveloffset=+2]
+
+//Adding a GPU node to a machine set (stesmith)
+include::modules/nvidia-gpu-aws-adding-a-gpu-node.adoc[leveloffset=+1]
+
+//Deploying the Node Feature Discovery Operator (stesmith)
+include::modules/nvidia-gpu-aws-deploying-the-node-feature-discovery-operator.adoc[leveloffset=+1]
modules/nvidia-gpu-aws-adding-a-gpu-node.adoc

Lines changed: 199 additions & 0 deletions
@@ -0,0 +1,199 @@

// Module included in the following assemblies:
//
// * machine_management/creating_machinesets/creating-machineset-aws.adoc

:_content-type: PROCEDURE
[id="nvidia-gpu-aws-adding-a-gpu-node_{context}"]
= Adding a GPU node to an existing {product-title} cluster

You can copy and modify a default compute machine set configuration to create a GPU-enabled machine set and machines for the AWS EC2 cloud provider.

The following table lists the validated instance types:

[cols="1,1,1,1"]
|===
|Instance type |NVIDIA GPU accelerator |Maximum number of GPUs |Architecture

|`p4d.24xlarge`
|A100
|8
|x86

|`g4dn.xlarge`
|T4
|1
|x86
|===
.Procedure

. View the existing nodes, machines, and machine sets by running the following command. Note that each node is an instance of a machine definition with a specific AWS region and {product-title} role.
+
[source,terminal]
----
$ oc get nodes
----
+
.Example output
+
[source,terminal]
----
NAME                                        STATUS   ROLES                  AGE     VERSION
ip-10-0-52-50.us-east-2.compute.internal    Ready    worker                 3d17h   v1.25.4+86bd4ff
ip-10-0-58-24.us-east-2.compute.internal    Ready    control-plane,master   3d17h   v1.25.4+86bd4ff
ip-10-0-68-148.us-east-2.compute.internal   Ready    worker                 3d17h   v1.25.4+86bd4ff
ip-10-0-68-68.us-east-2.compute.internal    Ready    control-plane,master   3d17h   v1.25.4+86bd4ff
ip-10-0-72-170.us-east-2.compute.internal   Ready    control-plane,master   3d17h   v1.25.4+86bd4ff
ip-10-0-74-50.us-east-2.compute.internal    Ready    worker                 3d17h   v1.25.4+86bd4ff
----
. View the machines and machine sets that exist in the `openshift-machine-api` namespace by running the following command. Each compute machine set is associated with a different availability zone within the AWS region. The installer automatically load balances compute machines across availability zones.
+
[source,terminal]
----
$ oc get machinesets -n openshift-machine-api
----
+
.Example output
+
[source,terminal]
----
NAME                                        DESIRED   CURRENT   READY   AVAILABLE   AGE
preserve-dsoc12r4-ktjfc-worker-us-east-2a   1         1         1       1           3d11h
preserve-dsoc12r4-ktjfc-worker-us-east-2b   2         2         2       2           3d11h
----
. View the machines that exist in the `openshift-machine-api` namespace by running the following command. At this time, there is only one compute machine per machine set, though a compute machine set could be scaled to add a node in a particular region and zone.
+
[source,terminal]
----
$ oc get machines -n openshift-machine-api | grep worker
----
+
.Example output
+
[source,terminal]
----
preserve-dsoc12r4-ktjfc-worker-us-east-2a-dts8r   Running   m5.xlarge   us-east-2   us-east-2a   3d11h
preserve-dsoc12r4-ktjfc-worker-us-east-2b-dkv7w   Running   m5.xlarge   us-east-2   us-east-2b   3d11h
preserve-dsoc12r4-ktjfc-worker-us-east-2b-k58cw   Running   m5.xlarge   us-east-2   us-east-2b   3d11h
----
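The zone spread can be checked mechanically. The following is a small sketch, not part of the original procedure, that tallies machines per availability zone; the example output above stands in for a live cluster query.

```shell
# Count compute machines per availability zone. The here-string reuses the
# example output above; on a live cluster you would pipe
# `oc get machines -n openshift-machine-api | grep worker` instead.
machines='preserve-dsoc12r4-ktjfc-worker-us-east-2a-dts8r   Running   m5.xlarge   us-east-2   us-east-2a   3d11h
preserve-dsoc12r4-ktjfc-worker-us-east-2b-dkv7w   Running   m5.xlarge   us-east-2   us-east-2b   3d11h
preserve-dsoc12r4-ktjfc-worker-us-east-2b-k58cw   Running   m5.xlarge   us-east-2   us-east-2b   3d11h'

# Field 5 is the availability zone; tally occurrences and print sorted counts.
az_counts=$(printf '%s\n' "$machines" | awk '{count[$5]++} END {for (az in count) print az, count[az]}' | sort)
echo "$az_counts"
```
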
. Make a copy of one of the existing compute `MachineSet` definitions and output the result to a JSON file by running the following command. This will be the basis for the GPU-enabled compute machine set definition.
+
[source,terminal]
----
$ oc get machineset preserve-dsoc12r4-ktjfc-worker-us-east-2a -n openshift-machine-api -o json > <output_file.json>
----
. Edit the JSON file and make the following changes to the new `MachineSet` definition:
+
* Replace `worker` with `gpu`. This will be the name of the new machine set.
* Change the instance type of the new `MachineSet` definition to `g4dn`, which includes an NVIDIA Tesla T4 GPU. To learn more about AWS `g4dn` instance types, see link:https://aws.amazon.com/ec2/instance-types/#Accelerated_Computing[Accelerated Computing].
+
[source,terminal]
----
$ jq .spec.template.spec.providerSpec.value.instanceType preserve-dsoc12r4-ktjfc-worker-gpu-us-east-2a.json

"g4dn.xlarge"
----
+
The `<output_file.json>` file is saved as `preserve-dsoc12r4-ktjfc-worker-gpu-us-east-2a.json`.

. Update the following fields in `preserve-dsoc12r4-ktjfc-worker-gpu-us-east-2a.json`:
+
* `.metadata.name` to a name containing `gpu`.
* `.spec.selector.matchLabels["machine.openshift.io/cluster-api-machineset"]` to match the new `.metadata.name`.
* `.spec.template.metadata.labels["machine.openshift.io/cluster-api-machineset"]` to match the new `.metadata.name`.
* `.spec.template.spec.providerSpec.value.instanceType` to `g4dn.xlarge`.
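The four field updates can also be scripted instead of edited by hand. The following is a hedged sketch, not part of the original module: it applies the same edits with `jq` to a deliberately minimal stand-in for the exported `MachineSet` JSON. A real export carries many more fields; the field paths and label key are taken from the list above.

```shell
# Minimal stand-in for the exported MachineSet JSON (a real export is larger).
cat > /tmp/ms.json <<'EOF'
{
  "metadata": {"name": "preserve-dsoc12r4-ktjfc-worker-us-east-2a"},
  "spec": {
    "selector": {"matchLabels": {"machine.openshift.io/cluster-api-machineset": "preserve-dsoc12r4-ktjfc-worker-us-east-2a"}},
    "template": {
      "metadata": {"labels": {"machine.openshift.io/cluster-api-machineset": "preserve-dsoc12r4-ktjfc-worker-us-east-2a"}},
      "spec": {"providerSpec": {"value": {"instanceType": "m5.xlarge"}}}
    }
  }
}
EOF

# Apply the four edits from the step above in one jq pass.
jq --arg name "preserve-dsoc12r4-ktjfc-worker-gpu-us-east-2a" '
  .metadata.name = $name
  | .spec.selector.matchLabels["machine.openshift.io/cluster-api-machineset"] = $name
  | .spec.template.metadata.labels["machine.openshift.io/cluster-api-machineset"] = $name
  | .spec.template.spec.providerSpec.value.instanceType = "g4dn.xlarge"
' /tmp/ms.json > /tmp/ms-gpu.json

# Show the renamed machine set and its new instance type.
jq -r '.metadata.name, .spec.template.spec.providerSpec.value.instanceType' /tmp/ms-gpu.json
```
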
. To verify your changes, perform a `diff` of the original compute definition and the new GPU-enabled node definition by running the following command:
+
[source,terminal]
----
$ oc -n openshift-machine-api get machineset preserve-dsoc12r4-ktjfc-worker-us-east-2a -o json | diff preserve-dsoc12r4-ktjfc-worker-gpu-us-east-2a.json -
----
+
.Example output
+
[source,terminal]
----
10c10
< "name": "preserve-dsoc12r4-ktjfc-worker-gpu-us-east-2a",
---
> "name": "preserve-dsoc12r4-ktjfc-worker-us-east-2a",
21c21
< "machine.openshift.io/cluster-api-machineset": "preserve-dsoc12r4-ktjfc-worker-gpu-us-east-2a"
---
> "machine.openshift.io/cluster-api-machineset": "preserve-dsoc12r4-ktjfc-worker-us-east-2a"
31c31
< "machine.openshift.io/cluster-api-machineset": "preserve-dsoc12r4-ktjfc-worker-gpu-us-east-2a"
---
> "machine.openshift.io/cluster-api-machineset": "preserve-dsoc12r4-ktjfc-worker-us-east-2a"
60c60
< "instanceType": "g4dn.xlarge",
---
> "instanceType": "m5.xlarge",
----
. Create the GPU-enabled compute machine set from the definition by running the following command:
+
[source,terminal]
----
$ oc create -f preserve-dsoc12r4-ktjfc-worker-gpu-us-east-2a.json
----
+
.Example output
+
[source,terminal]
----
machineset.machine.openshift.io/preserve-dsoc12r4-ktjfc-worker-gpu-us-east-2a created
----
.Verification

. View the machine set you created by running the following command:
+
[source,terminal]
----
$ oc -n openshift-machine-api get machinesets | grep gpu
----
+
The `MachineSet` replica count is set to `1`, so a new `Machine` object is created automatically.
+
.Example output
+
[source,terminal]
----
preserve-dsoc12r4-ktjfc-worker-gpu-us-east-2a   1   1   1   1   4m21s
----

. View the `Machine` object that the machine set created by running the following command:
+
[source,terminal]
----
$ oc -n openshift-machine-api get machines | grep gpu
----
+
.Example output
+
[source,terminal]
----
preserve-dsoc12r4-ktjfc-worker-gpu-us-east-2a   running   g4dn.xlarge   us-east-2   us-east-2a   4m36s
----

Note that there is no need to specify a namespace for the node. The node definition is cluster scoped.
modules/nvidia-gpu-aws-deploying-the-node-feature-discovery-operator.adoc

Lines changed: 77 additions & 0 deletions
@@ -0,0 +1,77 @@

// Module included in the following assemblies:
//
// * machine_management/creating_machinesets/creating-machineset-aws.adoc

:_content-type: PROCEDURE
[id="nvidia-gpu-aws-deploying-the-node-feature-discovery-operator_{context}"]
= Deploying the Node Feature Discovery Operator

After the GPU-enabled node is created, you need to discover the GPU-enabled node so it can be scheduled. To do this, install the Node Feature Discovery (NFD) Operator. The NFD Operator identifies hardware device features in nodes. It solves the general problem of identifying and cataloging hardware resources in the infrastructure nodes so that they can be made available to {product-title}.

.Procedure

. Install the Node Feature Discovery Operator from *OperatorHub* in the {product-title} console.

. After installing the NFD Operator from *OperatorHub*, select *Node Feature Discovery* from the installed Operators list and select *Create instance*. This installs the `nfd-master` and `nfd-worker` pods, one `nfd-worker` pod for each compute node, in the `openshift-nfd` namespace.

. Verify that the Operator is installed and running by running the following command:
+
[source,terminal]
----
$ oc get pods -n openshift-nfd
----
+
.Example output
+
[source,terminal]
----
NAME                                      READY   STATUS    RESTARTS     AGE
nfd-controller-manager-8646fcbb65-x5qgk   2/2     Running   7 (8h ago)   1d
----

. Browse to the installed Operator in the console and select *Create Node Feature Discovery*.

. Select *Create* to build an NFD custom resource. This creates NFD pods in the `openshift-nfd` namespace that poll the {product-title} nodes for hardware resources and catalog them.

.Verification

. After a successful build, verify that an NFD pod is running on each node by running the following command:
+
[source,terminal]
----
$ oc get pods -n openshift-nfd
----
+
.Example output
+
[source,terminal]
----
NAME                                      READY   STATUS    RESTARTS        AGE
nfd-controller-manager-8646fcbb65-x5qgk   2/2     Running   7 (8h ago)      12d
nfd-master-769656c4cb-w9vrv               1/1     Running   0               12d
nfd-worker-qjxb2                          1/1     Running   3 (3d14h ago)   12d
nfd-worker-xtz9b                          1/1     Running   5 (3d14h ago)   12d
----
+
The NFD Operator uses vendor PCI IDs to identify hardware in a node. NVIDIA uses the PCI ID `10de`.

. View the NVIDIA GPU discovered by the NFD Operator by running the following command:
+
[source,terminal]
----
$ oc describe node ip-10-0-132-138.us-east-2.compute.internal | egrep 'Roles|pci'
----
+
.Example output
+
[source,terminal]
----
Roles: worker

feature.node.kubernetes.io/pci-1013.present=true
feature.node.kubernetes.io/pci-10de.present=true
feature.node.kubernetes.io/pci-1d0f.present=true
----
+
`10de` appears in the node feature list for the GPU-enabled node. This means the NFD Operator correctly identified the node from the GPU-enabled `MachineSet`.
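As a quick sanity check, the PCI vendor IDs can be pulled out of those labels with standard text tools. This sketch is not part of the original module: NVIDIA's vendor ID `10de` comes from the text above, while the parsing itself is illustrative and reuses the example labels in place of a live node.

```shell
# NFD feature labels from the example output above; on a live cluster you
# would feed `oc describe node <node> | grep pci` instead.
labels='feature.node.kubernetes.io/pci-1013.present=true
feature.node.kubernetes.io/pci-10de.present=true
feature.node.kubernetes.io/pci-1d0f.present=true'

# Extract the 4-hex-digit PCI vendor ID from each label.
vendors=$(printf '%s\n' "$labels" | sed -n 's/.*pci-\([0-9a-f]\{4\}\)\.present=true/\1/p')
echo "$vendors"

# 10de is NVIDIA's PCI vendor ID, so its presence marks a GPU-enabled node.
if printf '%s\n' "$vendors" | grep -qx '10de'; then
  echo "NVIDIA device present"
fi
```
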
