Commit 33bec3e

Merge pull request #53022 from StephenJamesSmith/TELCODOCS-756-AWS

TELCODOCS-756-AWS: First draft

2 parents f583c1d + c152823

3 files changed: +282 -0 lines changed
machine_management/creating_machinesets/creating-machineset-aws.adoc

Lines changed: 6 additions & 0 deletions

@@ -37,3 +37,9 @@ include::modules/machineset-non-guaranteed-instance.adoc[leveloffset=+1]
 
 //Creating Spot Instances by using compute machine sets
 include::modules/machineset-creating-non-guaranteed-instances.adoc[leveloffset=+2]
+
+//Adding a GPU node to a machine set (stesmith)
+include::modules/nvidia-gpu-aws-adding-a-gpu-node.adoc[leveloffset=+1]
+
+//Deploying the Node Feature Discovery Operator (stesmith)
+include::modules/nvidia-gpu-aws-deploying-the-node-feature-discovery-operator.adoc[leveloffset=+1]
modules/nvidia-gpu-aws-adding-a-gpu-node.adoc

Lines changed: 199 additions & 0 deletions
@@ -0,0 +1,199 @@

// Module included in the following assemblies:
//
// * machine_management/creating_machinesets/creating-machineset-aws.adoc

:_content-type: PROCEDURE
[id="nvidia-gpu-aws-adding-a-gpu-node_{context}"]
= Adding a GPU node to an existing {product-title} cluster

You can copy and modify a default compute machine set configuration to create a GPU-enabled machine set and machines for the AWS EC2 cloud provider.

The following table lists the validated instance types:

[cols="1,1,1,1"]
|===
|Instance type |NVIDIA GPU accelerator |Maximum number of GPUs |Architecture

|`p4d.24xlarge`
|A100
|8
|x86

|`g4dn.xlarge`
|T4
|1
|x86
|===
.Procedure

. View the existing nodes, machines, and machine sets by running the following command. Note that each node is an instance of a machine definition with a specific AWS region and {product-title} role.
+
[source,terminal]
----
$ oc get nodes
----
+
.Example output
+
[source,terminal]
----
NAME                                        STATUS   ROLES                  AGE     VERSION
ip-10-0-52-50.us-east-2.compute.internal    Ready    worker                 3d17h   v1.25.4+86bd4ff
ip-10-0-58-24.us-east-2.compute.internal    Ready    control-plane,master   3d17h   v1.25.4+86bd4ff
ip-10-0-68-148.us-east-2.compute.internal   Ready    worker                 3d17h   v1.25.4+86bd4ff
ip-10-0-68-68.us-east-2.compute.internal    Ready    control-plane,master   3d17h   v1.25.4+86bd4ff
ip-10-0-72-170.us-east-2.compute.internal   Ready    control-plane,master   3d17h   v1.25.4+86bd4ff
ip-10-0-74-50.us-east-2.compute.internal    Ready    worker                 3d17h   v1.25.4+86bd4ff
----
. View the machines and machine sets that exist in the `openshift-machine-api` namespace by running the following command. Each compute machine set is associated with a different availability zone within the AWS region. The installer automatically load balances compute machines across availability zones.
+
[source,terminal]
----
$ oc get machinesets -n openshift-machine-api
----
+
.Example output
+
[source,terminal]
----
NAME                                        DESIRED   CURRENT   READY   AVAILABLE   AGE
preserve-dsoc12r4-ktjfc-worker-us-east-2a   1         1         1       1           3d11h
preserve-dsoc12r4-ktjfc-worker-us-east-2b   2         2         2       2           3d11h
----
. View the machines that exist in the `openshift-machine-api` namespace by running the following command. At this time, there is only one compute machine per machine set, though a compute machine set could be scaled to add a node in a particular region and zone.
+
[source,terminal]
----
$ oc get machines -n openshift-machine-api | grep worker
----
+
.Example output
+
[source,terminal]
----
preserve-dsoc12r4-ktjfc-worker-us-east-2a-dts8r   Running   m5.xlarge   us-east-2   us-east-2a   3d11h
preserve-dsoc12r4-ktjfc-worker-us-east-2b-dkv7w   Running   m5.xlarge   us-east-2   us-east-2b   3d11h
preserve-dsoc12r4-ktjfc-worker-us-east-2b-k58cw   Running   m5.xlarge   us-east-2   us-east-2b   3d11h
----
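The zone spread can be checked mechanically. The following is a small sketch, not part of the original procedure, that tallies machines per availability zone; the example output above stands in for a live cluster query.

```shell
# Count compute machines per availability zone. The here-string reuses the
# example output above; on a live cluster you would pipe
# `oc get machines -n openshift-machine-api | grep worker` instead.
machines='preserve-dsoc12r4-ktjfc-worker-us-east-2a-dts8r   Running   m5.xlarge   us-east-2   us-east-2a   3d11h
preserve-dsoc12r4-ktjfc-worker-us-east-2b-dkv7w   Running   m5.xlarge   us-east-2   us-east-2b   3d11h
preserve-dsoc12r4-ktjfc-worker-us-east-2b-k58cw   Running   m5.xlarge   us-east-2   us-east-2b   3d11h'

# Field 5 is the availability zone; tally occurrences and print sorted counts.
az_counts=$(printf '%s\n' "$machines" | awk '{count[$5]++} END {for (az in count) print az, count[az]}' | sort)
echo "$az_counts"
```
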
. Make a copy of one of the existing compute `MachineSet` definitions and output the result to a JSON file by running the following command. This will be the basis for the GPU-enabled compute machine set definition.
+
[source,terminal]
----
$ oc get machineset preserve-dsoc12r4-ktjfc-worker-us-east-2a -n openshift-machine-api -o json > <output_file.json>
----
. Edit the JSON file and make the following changes to the new `MachineSet` definition:
+
* Replace `worker` with `gpu`. This will be the name of the new machine set.
* Change the instance type of the new `MachineSet` definition to `g4dn`, which includes an NVIDIA Tesla T4 GPU. To learn more about AWS `g4dn` instance types, see link:https://aws.amazon.com/ec2/instance-types/#Accelerated_Computing[Accelerated Computing].
+
[source,terminal]
----
$ jq .spec.template.spec.providerSpec.value.instanceType preserve-dsoc12r4-ktjfc-worker-gpu-us-east-2a.json

"g4dn.xlarge"
----
+
The `<output_file.json>` file is saved as `preserve-dsoc12r4-ktjfc-worker-gpu-us-east-2a.json`.

. Update the following fields in `preserve-dsoc12r4-ktjfc-worker-gpu-us-east-2a.json`:
+
* `.metadata.name` to a name containing `gpu`.
* `.spec.selector.matchLabels["machine.openshift.io/cluster-api-machineset"]` to match the new `.metadata.name`.
* `.spec.template.metadata.labels["machine.openshift.io/cluster-api-machineset"]` to match the new `.metadata.name`.
* `.spec.template.spec.providerSpec.value.instanceType` to `g4dn.xlarge`.
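The four field updates can also be scripted instead of edited by hand. The following is a hedged sketch, not part of the original module: it applies the same edits with `jq` to a deliberately minimal stand-in for the exported `MachineSet` JSON. A real export carries many more fields; the field paths and label key are taken from the list above.

```shell
# Minimal stand-in for the exported MachineSet JSON (a real export is larger).
cat > /tmp/ms.json <<'EOF'
{
  "metadata": {"name": "preserve-dsoc12r4-ktjfc-worker-us-east-2a"},
  "spec": {
    "selector": {"matchLabels": {"machine.openshift.io/cluster-api-machineset": "preserve-dsoc12r4-ktjfc-worker-us-east-2a"}},
    "template": {
      "metadata": {"labels": {"machine.openshift.io/cluster-api-machineset": "preserve-dsoc12r4-ktjfc-worker-us-east-2a"}},
      "spec": {"providerSpec": {"value": {"instanceType": "m5.xlarge"}}}
    }
  }
}
EOF

# Apply the four edits from the step above in one jq pass.
jq --arg name "preserve-dsoc12r4-ktjfc-worker-gpu-us-east-2a" '
  .metadata.name = $name
  | .spec.selector.matchLabels["machine.openshift.io/cluster-api-machineset"] = $name
  | .spec.template.metadata.labels["machine.openshift.io/cluster-api-machineset"] = $name
  | .spec.template.spec.providerSpec.value.instanceType = "g4dn.xlarge"
' /tmp/ms.json > /tmp/ms-gpu.json

# Show the renamed machine set and its new instance type.
jq -r '.metadata.name, .spec.template.spec.providerSpec.value.instanceType' /tmp/ms-gpu.json
```
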
. To verify your changes, perform a `diff` of the original compute definition and the new GPU-enabled node definition by running the following command:
+
[source,terminal]
----
$ oc -n openshift-machine-api get machineset preserve-dsoc12r4-ktjfc-worker-us-east-2a -o json | diff preserve-dsoc12r4-ktjfc-worker-gpu-us-east-2a.json -
----
+
.Example output
+
[source,terminal]
----
10c10
< "name": "preserve-dsoc12r4-ktjfc-worker-gpu-us-east-2a",
---
> "name": "preserve-dsoc12r4-ktjfc-worker-us-east-2a",
21c21
< "machine.openshift.io/cluster-api-machineset": "preserve-dsoc12r4-ktjfc-worker-gpu-us-east-2a"
---
> "machine.openshift.io/cluster-api-machineset": "preserve-dsoc12r4-ktjfc-worker-us-east-2a"
31c31
< "machine.openshift.io/cluster-api-machineset": "preserve-dsoc12r4-ktjfc-worker-gpu-us-east-2a"
---
> "machine.openshift.io/cluster-api-machineset": "preserve-dsoc12r4-ktjfc-worker-us-east-2a"
60c60
< "instanceType": "g4dn.xlarge",
---
> "instanceType": "m5.xlarge",
----
. Create the GPU-enabled compute machine set from the definition by running the following command:
+
[source,terminal]
----
$ oc create -f preserve-dsoc12r4-ktjfc-worker-gpu-us-east-2a.json
----
+
.Example output
+
[source,terminal]
----
machineset.machine.openshift.io/preserve-dsoc12r4-ktjfc-worker-gpu-us-east-2a created
----
.Verification

. View the machine set you created by running the following command:
+
[source,terminal]
----
$ oc -n openshift-machine-api get machinesets | grep gpu
----
+
The `MachineSet` replica count is set to `1`, so a new `Machine` object is created automatically.
+
.Example output
+
[source,terminal]
----
preserve-dsoc12r4-ktjfc-worker-gpu-us-east-2a   1   1   1   1   4m21s
----

. View the `Machine` object that the machine set created by running the following command:
+
[source,terminal]
----
$ oc -n openshift-machine-api get machines | grep gpu
----
+
.Example output
+
[source,terminal]
----
preserve-dsoc12r4-ktjfc-worker-gpu-us-east-2a   running   g4dn.xlarge   us-east-2   us-east-2a   4m36s
----

Note that there is no need to specify a namespace for the node. The node definition is cluster scoped.
modules/nvidia-gpu-aws-deploying-the-node-feature-discovery-operator.adoc

Lines changed: 77 additions & 0 deletions
@@ -0,0 +1,77 @@

// Module included in the following assemblies:
//
// * machine_management/creating_machinesets/creating-machineset-aws.adoc

:_content-type: PROCEDURE
[id="nvidia-gpu-aws-deploying-the-node-feature-discovery-operator_{context}"]
= Deploying the Node Feature Discovery Operator

After the GPU-enabled node is created, you need to discover the GPU-enabled node so it can be scheduled. To do this, install the Node Feature Discovery (NFD) Operator. The NFD Operator identifies hardware device features in nodes. It solves the general problem of identifying and cataloging hardware resources in the infrastructure nodes so that they can be made available to {product-title}.

.Procedure

. Install the Node Feature Discovery Operator from *OperatorHub* in the {product-title} console.

. After installing the NFD Operator from *OperatorHub*, select *Node Feature Discovery* from the installed Operators list and select *Create instance*. This installs the `nfd-master` and `nfd-worker` pods, one `nfd-worker` pod for each compute node, in the `openshift-nfd` namespace.

. Verify that the Operator is installed and running by running the following command:
+
[source,terminal]
----
$ oc get pods -n openshift-nfd
----
+
.Example output
+
[source,terminal]
----
NAME                                      READY   STATUS    RESTARTS     AGE
nfd-controller-manager-8646fcbb65-x5qgk   2/2     Running   7 (8h ago)   1d
----

. Browse to the installed Operator in the console and select *Create Node Feature Discovery*.

. Select *Create* to build an NFD custom resource. This creates NFD pods in the `openshift-nfd` namespace that poll the {product-title} nodes for hardware resources and catalog them.

.Verification

. After a successful build, verify that an NFD pod is running on each node by running the following command:
+
[source,terminal]
----
$ oc get pods -n openshift-nfd
----
+
.Example output
+
[source,terminal]
----
NAME                                      READY   STATUS    RESTARTS        AGE
nfd-controller-manager-8646fcbb65-x5qgk   2/2     Running   7 (8h ago)      12d
nfd-master-769656c4cb-w9vrv               1/1     Running   0               12d
nfd-worker-qjxb2                          1/1     Running   3 (3d14h ago)   12d
nfd-worker-xtz9b                          1/1     Running   5 (3d14h ago)   12d
----
+
The NFD Operator uses vendor PCI IDs to identify hardware in a node. NVIDIA uses the PCI ID `10de`.

. View the NVIDIA GPU discovered by the NFD Operator by running the following command:
+
[source,terminal]
----
$ oc describe node ip-10-0-132-138.us-east-2.compute.internal | egrep 'Roles|pci'
----
+
.Example output
+
[source,terminal]
----
Roles: worker

feature.node.kubernetes.io/pci-1013.present=true
feature.node.kubernetes.io/pci-10de.present=true
feature.node.kubernetes.io/pci-1d0f.present=true
----
+
`10de` appears in the node feature list for the GPU-enabled node. This means the NFD Operator correctly identified the node from the GPU-enabled `MachineSet`.
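As a quick sanity check, the PCI vendor IDs can be pulled out of those labels with standard text tools. This sketch is not part of the original module: NVIDIA's vendor ID `10de` comes from the text above, while the parsing itself is illustrative and reuses the example labels in place of a live node.

```shell
# NFD feature labels from the example output above; on a live cluster you
# would feed `oc describe node <node> | grep pci` instead.
labels='feature.node.kubernetes.io/pci-1013.present=true
feature.node.kubernetes.io/pci-10de.present=true
feature.node.kubernetes.io/pci-1d0f.present=true'

# Extract the 4-hex-digit PCI vendor ID from each label.
vendors=$(printf '%s\n' "$labels" | sed -n 's/.*pci-\([0-9a-f]\{4\}\)\.present=true/\1/p')
echo "$vendors"

# 10de is NVIDIA's PCI vendor ID, so its presence marks a GPU-enabled node.
if printf '%s\n' "$vendors" | grep -qx '10de'; then
  echo "NVIDIA device present"
fi
```
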
