
Commit 3e5d61b

Update docs for OL8
1 parent 1b7f2d4 commit 3e5d61b

9 files changed (+348 −313 lines)

README.md

Lines changed: 3 additions & 3 deletions
````diff
@@ -6,6 +6,6 @@ Please visit OKE documentation page for more information: https://docs.oracle.co
 
 This repository will focus on two workload types using GPUs: RDMA workloads using OCI's high performance network with support for RDMA (e.g. training jobs) and non-RDMA workloads that don't need to use the RDMA network (e.g. inference jobs).
 
-### [Running RDMA workloads on OKE](./docs/running-rdma-workloads-on-oke.md)
-
-### [Running non-RDMA workloads on OKE](./docs/running-non-rdma-workloads-on-oke.md)
+### Running RDMA workloads on OKE
+[Using Nvidia A100 shapes](./docs/running-rdma-workloads-on-oke-a100.md)
+[Using Nvidia H100 shapes](./docs/running-rdma-workloads-on-oke-h100.md)
````

docs/running-non-rdma-workloads-on-oke.md

Lines changed: 0 additions & 71 deletions
This file was deleted.

docs/running-rdma-workloads-on-oke.md renamed to docs/running-rdma-workloads-on-oke-a100.md

Lines changed: 13 additions & 19 deletions
````diff
@@ -61,18 +61,6 @@ NAME STATUS ROLES AGE VERSION
 10.0.96.81   Ready   node   2d23h   v1.25.6
 ```
 
-### Deploy the OCI RDMA Health Check daemonset
-> [!IMPORTANT]
-> Deploying this daemonset is important.
-> When a new node joins the OKE cluster, it reports itself as ready. However, configuring the RDMA network of a node usually takes longer than joining the cluster. The health check daemonset checks the status of the RDMA interfaces and removes the `oci.oraclecloud.com/oci-rdma-health-check` taint that is added via cloud-init.
-
-```
-kubectl apply -f https://raw.githubusercontent.com/oracle-quickstart/oci-hpc-oke/main/manifests/oci-rdma-health-check-ds.yaml
-```
-
-### Build the GPU Operator driver container image for Oracle Linux
-You can follow the instructions [here](./building-ol7-gpu-operator-driver-image.md) for building the GPU Operator driver container image.
-
 ### Get the latest Helm 3 version
 ```sh
 curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
````
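A note on the health-check flow removed above, for readers migrating off it: nodes that still carry the taint can be listed with plain kubectl and jq. A minimal sketch; the taint key comes from the removed text, while the jq filter is just a generic taint lookup:

```sh
# List nodes still carrying the RDMA health-check taint, i.e. nodes whose
# RDMA interfaces are not yet configured (taint key from the removed section).
kubectl get nodes -o json \
  | jq -r '.items[]
           | select(any(.spec.taints[]?; .key == "oci.oraclecloud.com/oci-rdma-health-check"))
           | .metadata.name'
```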
````diff
@@ -95,11 +83,10 @@ Change the `driver.repository` and `driver.version` in the Helm command below.
 helm install --wait \
   -n gpu-operator --create-namespace \
   gpu-operator nvidia/gpu-operator \
-  --version v23.3.2 \
+  --version v23.9.1 \
+  --set driver.enabled=false \
   --set operator.defaultRuntime=crio \
-  --set driver.repository=<The repository that you pushed your image> \
-  --set driver.version=<The driver version in your pushed image. Only the version, don't add ol7.9 at the end> \
-  --set toolkit.version=v1.13.5-centos7 \
+  --set toolkit.version=v1.14.5-ubi8 \
   --set driver.rdma.enabled=true \
   --set driver.rdma.useHostMofed=true
 ```
````
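Not part of the diff, but a quick post-install sanity check works for either version of this command: watch the operator pods settle. A sketch; the daemonset name is an assumption based on what recent GPU Operator releases create:

```sh
# All pods in the namespace should end up Running (validator jobs end up Completed).
kubectl get pods -n gpu-operator

# Full rollout of the device plugin daemonset signals that GPUs are being
# advertised as allocatable resources. Name is assumed; adjust if yours differs.
kubectl rollout status ds/nvidia-device-plugin-daemonset -n gpu-operator
```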
````diff
@@ -113,12 +100,14 @@ Wait until all network operator pods are running with `kubectl get pods -n gpu-o
 helm install --wait \
   -n network-operator --create-namespace \
   network-operator nvidia/network-operator \
-  --version v23.5.0 \
+  --version v23.10.0 \
   --set deployCR=true \
   --set nfd.enabled=false \
   --set rdmaSharedDevicePlugin.deploy=false \
   --set nvPeerDriver.deploy=true \
   --set sriovDevicePlugin.deploy=true \
+  --set secondaryNetwork.ipamPlugin.deploy=false \
+  --set nvIpam.deploy=true \
   --set-json sriovDevicePlugin.resources='[{"name": "sriov_rdma_vf", "drivers": ["mlx5_core"], "devices": ["101a"], "isRdma": [true]}]'
 ```
 
````
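For context on the `sriov_rdma_vf` entry in the `--set-json` flag above: once the device plugin is up, workloads request the VFs like any other extended resource. A hypothetical sketch, not from the commit; the pod name and image are placeholders, and real RDMA pods additionally reference the network attachment definition created in a later step:

```sh
# Hypothetical pod requesting one RDMA VF; the resource name
# nvidia.com/sriov_rdma_vf matches the sriovDevicePlugin.resources entry
# above and the capacity column shown further down this page.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: rdma-vf-smoke-test   # placeholder name
spec:
  restartPolicy: Never
  containers:
    - name: test
      image: oraclelinux:8   # placeholder image
      command: ["sleep", "3600"]
      resources:
        limits:
          nvidia.com/sriov_rdma_vf: 1
EOF
```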
````diff
@@ -142,7 +131,12 @@ By default, we create one Virtual Function per Physical Function. So for the A10
 You can run the following command to see all allocatable resources of a node:
 
 ```
-kubectl get node <node name> -o json | jq '.status.allocatable'
+kubectl get nodes -l 'node.kubernetes.io/instance-type in (BM.GPU.H100.8, BM.GPU.A100-v2.8, BM.GPU4.8, BM.GPU.B4.8)' --sort-by=.status.capacity."nvidia\.com/gpu" -o=custom-columns='NODE:metadata.name,GPUs:status.capacity.nvidia\.com/gpu,RDMA-VFs:status.capacity.nvidia\.com/sriov_rdma_vf'
+
+NODE            GPUs   RDMA-VFs
+10.79.148.115   8      16
+10.79.151.167   8      16
+10.79.156.205   8      16
 ```
 
 ### Create Network Attachment Definition
````
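A jq equivalent of the new multi-node command above, for inspecting a single node in the old style; `<node-name>` stays a placeholder, and both resource names are taken from the columns shown above:

```sh
# Per-node view of the two extended resources surfaced by the operators.
kubectl get node <node-name> -o json \
  | jq '{gpus: .status.allocatable["nvidia.com/gpu"],
         rdma_vfs: .status.allocatable["nvidia.com/sriov_rdma_vf"]}'
```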
````diff
@@ -156,7 +150,7 @@ kubectl apply -f https://raw.githubusercontent.com/oracle-quickstart/oci-hpc-oke
 kubectl apply -f https://raw.githubusercontent.com/kubeflow/mpi-operator/v0.4.0/deploy/v2beta1/mpi-operator.yaml
 ```
 
-### Run NCCL test
+### Optional - Run NCCL test
 
 Run the test with `kubectl apply -f nccl-test.yaml`.
 
````
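After `kubectl apply -f nccl-test.yaml`, the usual next step is tailing the launcher pod for the NCCL bandwidth table. A sketch; the `training.kubeflow.org/job-role=launcher` label follows mpi-operator v2beta1 conventions and is an assumption here, not something stated in the commit:

```sh
# Follow the NCCL test output from the MPIJob launcher pod.
# The label selector is assumed from mpi-operator v2beta1; adjust if needed.
kubectl logs -f -l training.kubeflow.org/job-role=launcher --tail=-1
```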