
Commit fb1558a

pr review edits
1 parent 157a14d commit fb1558a

File tree

1 file changed, +20 -21 lines changed


articles/openshift/howto-gpu-workloads.md

Lines changed: 20 additions & 21 deletions
@@ -31,14 +31,14 @@ Linux:
sudo dnf install jq moreutils gettext
```

-MacOS
+macOS
```bash
brew install jq moreutils gettext
```

## Request GPU quota

-All GPU quotas in Azure are 0 by default. You will need to sign in to the Azure portal and request GPU quota. Since there is a lot of competition for GPU workers, you may have to provision an ARO cluster in a region where you can actually reserve GPU.
+All GPU quotas in Azure are 0 by default. You will need to sign in to the Azure portal and request GPU quota. Due to competition for GPU workers, you may have to provision an ARO cluster in a region where you can actually reserve GPU.

ARO supports the following GPU workers:
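
The quota request itself happens in the Azure portal, but current GPU usage and limits for a candidate region can be cross-checked from the CLI first. A minimal sketch, assuming the Azure CLI is installed; the region and the `NC` family filter are placeholders, not values from the article:

```bash
# List compute usage and limits for a candidate region, filtered to NC-series (GPU) families.
az vm list-usage --location southcentralus --output table | grep -i "NC"
```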

@@ -58,7 +58,7 @@ ARO supports the following GPU workers:

1. Configure quota.

-:::image type="content" source="media/howto-gpu-workloads/gpu-quota-azure.png" alt-text="Screen capture of quotas page on Azure portal.":::
+:::image type="content" source="media/howto-gpu-workloads/gpu-quota-azure.png" alt-text="Screenshot of quotas page on Azure portal.":::

## Sign in to your ARO cluster
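
The next hunk signs in with `oc login` using the kubeadmin credentials. Those values can be pulled from the ARO resource first; a minimal sketch, where `$CLUSTER` and `$RESOURCEGROUP` are placeholders for your own cluster name and resource group:

```bash
# Retrieve the kubeadmin password and the API server URL for an existing ARO cluster.
az aro list-credentials --name "$CLUSTER" --resource-group "$RESOURCEGROUP"
az aro show --name "$CLUSTER" --resource-group "$RESOURCEGROUP" --query apiserverProfile.url -o tsv
```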

@@ -68,7 +68,7 @@ Sign in to OpenShift with a user account with cluster-admin privileges. The exam
oc login <apiserver> -u kubeadmin -p <kubeadminpass>
```

-## Pull secret (Conditional)
+## Pull secret (conditional)

Update your pull secret to make sure you can install operators and connect to [cloud.redhat.com](https://cloud.redhat.com/).
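
The pull-secret steps work with a local copy of the cluster's pull secret merged with the secret downloaded from cloud.redhat.com. A minimal sketch of that flow, reusing the file names the article deletes later; treating `pull-secret.txt` as the downloaded secret and the `jq` merge as the intermediate step are assumptions, not the article's exact commands:

```bash
# Export the current cluster pull secret, merge in the downloaded cloud.redhat.com secret,
# and push the combined JSON back into openshift-config.
oc get secret/pull-secret -n openshift-config -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d > export-pull.json
jq -s '.[0] * .[1]' export-pull.json pull-secret.txt > new-pull-secret.json
oc set data secret/pull-secret -n openshift-config --from-file=.dockerconfigjson=new-pull-secret.json
```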

@@ -102,15 +102,15 @@ Update your pull secret to make sure you can install operators and connect to [c
oc set data secret/pull-secret -n openshift-config --from-file=.dockerconfigjson=new-pull-secret.json
```

-> You may need to wait about 1 hour for everything to sync up with cloud.redhat.com.
+> You may need to wait about 1 hour for everything to sync up with cloud.redhat.com.

1. Delete secrets.

```bash
rm pull-secret.txt export-pull.json new-pull-secret.json
```

-## GPU Machine Set
+## GPU machine set

ARO uses Kubernetes MachineSet to create machine sets. The procedure below explains how to export the first machine set in a cluster and use that as a template to build a single GPU machine.
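
The paragraph above describes exporting the first machine set as a template for the GPU machine. A minimal sketch of that export step; the output file name and the example `vmSize` are assumptions, and the article's own edits to the template may differ:

```bash
# Grab the first worker machine set and save it as a template for the GPU machine set.
MACHINE_SET=$(oc get machineset -n openshift-machine-api -o jsonpath='{.items[0].metadata.name}')
oc get machineset "$MACHINE_SET" -n openshift-machine-api -o json > gpu-machineset.json
# Edit metadata.name, set spec.replicas to 1, and change the vmSize (for example Standard_NC4as_T4_v3) before creating it.
```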

@@ -178,7 +178,7 @@ ARO uses Kubernetes MachineSet to create machine sets. The procedure below expla

#### Create GPU machine set

-Use the following steps to create the new GPU machine. It may take 10-15 minutes to provision a new GPU machine. If this step fails, sign in to [azure portal](https://portal.azure.com) and ensure there are no availability issues. To do so, go to **Virtual Machines** and search for the worker name you created previously to see the status of VMs.
+Use the following steps to create the new GPU machine. It may take 10-15 minutes to provision a new GPU machine. If this step fails, sign in to [Azure portal](https://portal.azure.com) and ensure there are no availability issues. To do so, go to **Virtual Machines** and search for the worker name you created previously to see the status of VMs.

1. Create the GPU Machine set.
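
The step above creates the GPU machine set from the edited template; provisioning can also be watched from the CLI rather than the portal. A minimal sketch, where `gpu-machineset.json` is the illustrative file name from the earlier sketch:

```bash
# Create the GPU machine set and watch the new machine until it reaches the Running phase.
oc create -f gpu-machineset.json
oc get machine -n openshift-machine-api -w
```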

@@ -270,9 +270,9 @@ This section explains how to create the `nvidia-gpu-operator` namespace, set up

Don't proceed until you have verified that the operator has finished installing. Also, ensure that your GPU worker is online.

-:::image type="content" source="media/howto-gpu-workloads/nvidia-installed.png" alt-text="Screen shot of installed operators on namespace.":::
+:::image type="content" source="media/howto-gpu-workloads/nvidia-installed.png" alt-text="Screenshot of installed operators on namespace.":::

-#### Install Node Feature Discovery Operator
+#### Install node feature discovery operator

The node feature discovery operator will discover the GPU on your nodes and appropriately label the nodes so you can target them for workloads.
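
Both checks described above (operator install finished, GPU worker online and labeled) can also be done from the CLI. A minimal sketch; the `pci-10de` label is the standard node feature discovery label for NVIDIA's PCI vendor ID and only appears once NFD is running:

```bash
# Operator install status should show Succeeded.
oc get csv -n nvidia-gpu-operator
# Nodes that NFD has labeled with an NVIDIA (vendor ID 10de) PCI device.
oc get nodes -l feature.node.kubernetes.io/pci-10de.present=true
```
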
@@ -463,11 +463,11 @@ Official Documentation for Installing [Node Feature Discovery Operator](https://

The status of this operator should show as **Available**.

-:::image type="content" source="media/howto-gpu-workloads/nfd-ready-for-use.png" alt-text="Screen shot of node feature discovery operator.":::
+:::image type="content" source="media/howto-gpu-workloads/nfd-ready-for-use.png" alt-text="Screenshot of node feature discovery operator.":::

#### Apply Nvidia Cluster Config

-This sections explains how to apply the Nvidia cluster config. Please read the [Nvidia documentation](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/openshift/install-gpu-ocp.html) on customizing this if you have your own private repos or specific settings. This process may take several minutes to complete.
+This section explains how to apply the Nvidia cluster config. Please read the [Nvidia documentation](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/openshift/install-gpu-ocp.html) on customizing this if you have your own private repos or specific settings. This process may take several minutes to complete.

1. Apply cluster config.
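
The cluster config applied in the step above creates a ClusterPolicy resource, and it can take several minutes to become ready. A minimal sketch for polling its state from the CLI; the exact resource name and state string depend on the GPU operator version:

```bash
# Print the state of the GPU operator's ClusterPolicy; it should eventually report ready.
oc get clusterpolicy -o jsonpath='{.items[0].status.state}{"\n"}'
```
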
@@ -525,7 +525,7 @@ This sections explains how to apply the Nvidia cluster config. Please read the [

Log in to the OpenShift console and browse to Operators. Ensure you're in the `nvidia-gpu-operator` namespace. It should say `State: Ready` once everything is complete.

-:::image type="content" source="media/howto-gpu-workloads/nvidia-cluster-policy.png" alt-text="Screen shot of existing cluster policies.":::
+:::image type="content" source="media/howto-gpu-workloads/nvidia-cluster-policy.png" alt-text="Screenshot of existing cluster policies on OpenShift console.":::

## Validate GPU
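
As a first validation step, the GPU node's allocatable resources should now include `nvidia.com/gpu`. A minimal sketch using the `jq` installed in the prerequisites; non-GPU nodes simply print `null`:

```bash
# Show the nvidia.com/gpu allocatable count for every node.
oc get nodes -o json | jq '.items[].status.allocatable["nvidia.com/gpu"]'
```
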
@@ -548,7 +548,7 @@ It may take some time for the Nvidia Operator and NFD to completely install and

You can see the node labels by logging into the OpenShift console -> Compute -> Nodes -> nvidia-worker-southcentralus1-. You should see multiple Nvidia GPU labels and the pci-10de device from above.

-:::image type="content" source="media/howto-gpu-workloads/node-labels.png" alt-text="Screen shot of GPU labels.":::
+:::image type="content" source="media/howto-gpu-workloads/node-labels.png" alt-text="Screenshot of GPU labels on OpenShift console.":::

1. Nvidia SMI tool verification.
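
The SMI verification step above runs `nvidia-smi` from inside one of the GPU operator's driver pods. A minimal sketch; the `app=nvidia-driver-daemonset` label selector is an assumption about the operator's pod labels, not a value from the article:

```bash
# Exec nvidia-smi in the driver daemonset pod to list the GPUs visible on the worker.
POD=$(oc get pod -n nvidia-gpu-operator -l app=nvidia-driver-daemonset -o jsonpath='{.items[0].metadata.name}')
oc exec -n nvidia-gpu-operator "$POD" -- nvidia-smi
```
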
@@ -559,9 +559,9 @@ It may take some time for the Nvidia Operator and NFD to completely install and

You should see output that shows the GPUs available on the host such as this example screenshot. (Varies depending on GPU worker type)

-:::image type="content" source="media/howto-gpu-workloads/test-gpu.png" alt-text="Screen shot of output showing available GPUs.":::
+:::image type="content" source="media/howto-gpu-workloads/test-gpu.png" alt-text="Screenshot of output showing available GPUs.":::

-2. Create Pod to run a GPU workload
+1. Create Pod to run a GPU workload

```yaml
oc project nvidia-gpu-operator
@@ -583,16 +583,16 @@ It may take some time for the Nvidia Operator and NFD to completely install and
EOF
```

-3. View logs.
+1. View logs.

```bash
oc logs cuda-vector-add --tail=-1
```

-> [!NOTE]
-> If you get an error `Error from server (BadRequest): container "cuda-vector-add" in pod "cuda-vector-add" is waiting to start: ContainerCreating`, try running `oc delete pod cuda-vector-add` and then re-run the create statement above.
+> [!NOTE]
+> If you get an error `Error from server (BadRequest): container "cuda-vector-add" in pod "cuda-vector-add" is waiting to start: ContainerCreating`, try running `oc delete pod cuda-vector-add` and then re-run the create statement above.

-The output should be similar to the following (depending on GPU):
+The output should be similar to the following (depending on GPU):

```bash
[Vector addition of 5000 elements]
@@ -602,8 +602,7 @@ It may take some time for the Nvidia Operator and NFD to completely install and
Test PASSED
Done
```

-
-4. If successful, the pod can be deleted.
+If successful, the pod can be deleted:

```bash
oc delete pod cuda-vector-add
