articles/openshift/howto-gpu-workloads.md
20 additions & 21 deletions
@@ -31,14 +31,14 @@ Linux:

```bash
sudo dnf install jq moreutils gettext
```

- MacOS
+ macOS

```bash
brew install jq moreutils gettext
```

## Request GPU quota
- All GPU quotas in Azure are 0 by default. You will need to sign in to the Azure portal and request GPU quota. Since there is a lot of competition for GPU workers, you may have to provision an ARO cluster in a region where you can actually reserve GPU.
+ All GPU quotas in Azure are 0 by default. You will need to sign in to the Azure portal and request GPU quota. Due to competition for GPU workers, you may have to provision an ARO cluster in a region where you can actually reserve GPU.
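Before filing the request, you can check the current GPU quota in a candidate region from the Azure CLI. This is a sketch: the region is only an example, and the `NC|ND|NV` filter is an assumption covering Azure's GPU VM family prefixes.

```shell
# List current vCPU usage and limits in a candidate region,
# filtered to the GPU (NC/ND/NV) VM families; the region is an example.
az vm list-usage --location southcentralus --output table | grep -E 'NC|ND|NV'
```

A `CurrentValue` equal to the `Limit` means you need a quota increase before provisioning.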
ARO supports the following GPU workers:
@@ -58,7 +58,7 @@ ARO supports the following GPU workers:

1. Configure quota.

- :::image type="content" source="media/howto-gpu-workloads/gpu-quota-azure.png" alt-text="Screen capture of quotas page on Azure portal.":::
+ :::image type="content" source="media/howto-gpu-workloads/gpu-quota-azure.png" alt-text="Screenshot of quotas page on Azure portal.":::

## Sign in to your ARO cluster
@@ -68,7 +68,7 @@ Sign in to OpenShift with a user account with cluster-admin privileges. The exam
ARO uses Kubernetes MachineSet to create machine sets. The procedure below explains how to export the first machine set in a cluster and use that as a template to build a single GPU machine.
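That export-and-edit flow can be sketched with `oc` and `jq`. The machine set name, replica count, VM size, and file names below are illustrative placeholders, not values taken from this article:

```shell
# Export the first machine set as JSON to serve as a template
MACHINESET=$(oc get machineset -n openshift-machine-api -o jsonpath='{.items[0].metadata.name}')
oc get machineset "$MACHINESET" -n openshift-machine-api -o json > gpu_machineset.json

# Rename it, set a single replica, and switch the VM size to a GPU SKU
jq '.metadata.name = "nvidia-worker" |
    .spec.replicas = 1 |
    .spec.template.spec.providerSpec.value.vmSize = "Standard_NC4as_T4_v3"' \
  gpu_machineset.json > gpu_machineset_new.json
```

Cluster-managed fields (`.status`, `resourceVersion`, and similar) should also be stripped from the exported JSON before creating the new object.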
@@ -178,7 +178,7 @@ ARO uses Kubernetes MachineSet to create machine sets. The procedure below expla
#### Create GPU machine set

- Use the following steps to create the new GPU machine. It may take 10-15 minutes to provision a new GPU machine. If this step fails, sign in to [azure portal](https://portal.azure.com) and ensure there are no availability issues. To do so, go to **Virtual Machines** and search for the worker name you created previously to see the status of VMs.
+ Use the following steps to create the new GPU machine. It may take 10-15 minutes to provision a new GPU machine. If this step fails, sign in to the [Azure portal](https://portal.azure.com) and ensure there are no availability issues. To do so, go to **Virtual Machines** and search for the worker name you created previously to see the status of VMs.
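If you prefer the CLI to the portal for that status check, something along these lines works; the resource group variable is an assumption you'd set to your cluster's node resource group:

```shell
# Show power state of the GPU worker VMs (resource group is a placeholder)
az vm list --resource-group "$ARO_NODE_RESOURCE_GROUP" --show-details \
  --query "[?contains(name, 'nvidia-worker')].{name:name, state:powerState}" \
  --output table
```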
1. Create the GPU machine set.
@@ -270,9 +270,9 @@ This section explains how to create the `nvidia-gpu-operator` namespace, set up

Don't proceed until you have verified that the operator has finished installing. Also, ensure that your GPU worker is online.

- :::image type="content" source="media/howto-gpu-workloads/nvidia-installed.png" alt-text="Screen shot of installed operators on namespace.":::
+ :::image type="content" source="media/howto-gpu-workloads/nvidia-installed.png" alt-text="Screenshot of installed operators on namespace.":::
- #### Install Node Feature Discovery Operator
+ #### Install node feature discovery operator

The node feature discovery operator will discover the GPU on your nodes and appropriately label the nodes so you can target them for workloads.
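Once the operator has run, the labeling can be confirmed from the CLI; NFD advertises PCI devices by vendor ID, and `10de` is Nvidia's:

```shell
# List nodes that NFD labeled with the Nvidia PCI vendor ID (0x10de)
oc get nodes -l feature.node.kubernetes.io/pci-10de.present=true

# Inspect the full label set on one GPU worker (node name is a placeholder)
oc get node <gpu-node-name> --show-labels
```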
@@ -463,11 +463,11 @@ Official Documentation for Installing [Node Feature Discovery Operator](https://

The status of this operator should show as **Available**.

- :::image type="content" source="media/howto-gpu-workloads/nfd-ready-for-use.png" alt-text="Screen shot of node feature discovery operator.":::
+ :::image type="content" source="media/howto-gpu-workloads/nfd-ready-for-use.png" alt-text="Screenshot of node feature discovery operator.":::
#### Apply Nvidia Cluster Config

- This sections explains how to apply the Nvidia cluster config. Please read the [Nvidia documentation](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/openshift/install-gpu-ocp.html) on customizing this if you have your own private repos or specific settings. This process may take several minutes to complete.
+ This section explains how to apply the Nvidia cluster config. Read the [Nvidia documentation](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/openshift/install-gpu-ocp.html) on customizing this if you have your own private repos or specific settings. This process may take several minutes to complete.
1. Apply cluster config.
@@ -525,7 +525,7 @@ This sections explains how to apply the Nvidia cluster config. Please read the [

Log in to the OpenShift console and browse to Operators. Ensure you're in the `nvidia-gpu-operator` namespace. It should say `State: Ready` once everything is complete.
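The same readiness check is available from the CLI. The policy name `gpu-cluster-policy` is the one Nvidia's install instructions typically use, so adjust it if yours differs; note the CRD reports the state in lowercase:

```shell
# Report the cluster policy state; "ready" means deployment is complete
# (ClusterPolicy is cluster-scoped, so no namespace flag is needed)
oc get clusterpolicy gpu-cluster-policy -o jsonpath='{.status.state}{"\n"}'
```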
- :::image type="content" source="media/howto-gpu-workloads/nvidia-cluster-policy.png" alt-text="Screen shot of existing cluster policies.":::
+ :::image type="content" source="media/howto-gpu-workloads/nvidia-cluster-policy.png" alt-text="Screenshot of existing cluster policies on OpenShift console.":::
## Validate GPU
@@ -548,7 +548,7 @@ It may take some time for the Nvidia Operator and NFD to completely install and

You can see the node labels by logging into the OpenShift console -> Compute -> Nodes -> nvidia-worker-southcentralus1-. You should see multiple Nvidia GPU labels and the pci-10de device from above.
- :::image type="content" source="media/howto-gpu-workloads/node-labels.png" alt-text="Screen shot of GPU labels.":::
+ :::image type="content" source="media/howto-gpu-workloads/node-labels.png" alt-text="Screenshot of GPU labels on OpenShift console.":::
1. Nvidia SMI tool verification.
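The verification typically runs `nvidia-smi` inside the driver daemonset pod; the label selector below is an assumption based on the GPU operator's usual pod labels:

```shell
# Find a driver daemonset pod and run nvidia-smi inside it
oc project nvidia-gpu-operator
POD=$(oc get pods -l app=nvidia-driver-daemonset -o jsonpath='{.items[0].metadata.name}')
oc exec -it "$POD" -- nvidia-smi
```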
@@ -559,9 +559,9 @@ It may take some time for the Nvidia Operator and NFD to completely install and

You should see output that shows the GPUs available on the host, such as this example screenshot. (Output varies depending on the GPU worker type.)

- :::image type="content" source="media/howto-gpu-workloads/test-gpu.png" alt-text="Screen shot of output showing available GPUs.":::
+ :::image type="content" source="media/howto-gpu-workloads/test-gpu.png" alt-text="Screenshot of output showing available GPUs.":::
- 2. Create Pod to run a GPU workload
+ 1. Create Pod to run a GPU workload
```yaml
oc project nvidia-gpu-operator
@@ -583,16 +583,16 @@ It may take some time for the Nvidia Operator and NFD to completely install and
EOF
```
- 3. View logs.
+ 1. View logs.
```bash
oc logs cuda-vector-add --tail=-1
```
- > [!NOTE]
- > If you get an error `Error from server (BadRequest): container "cuda-vector-add" in pod "cuda-vector-add" is waiting to start: ContainerCreating`, try running `oc delete pod cuda-vector-add` and then re-run the create statement above.
+ > [!NOTE]
+ > If you get an error `Error from server (BadRequest): container "cuda-vector-add" in pod "cuda-vector-add" is waiting to start: ContainerCreating`, try running `oc delete pod cuda-vector-add` and then re-run the create statement above.
- The output should be similar to the following (depending on GPU):
+ The output should be similar to the following (depending on GPU):
```bash
[Vector addition of 5000 elements]
@@ -602,8 +602,7 @@ It may take some time for the Nvidia Operator and NFD to completely install and