# Running RDMA (remote direct memory access) GPU workloads on OKE

[Oracle Cloud Infrastructure Kubernetes Engine (OKE)](https://docs.oracle.com/en-us/iaas/Content/ContEng/Concepts/contengoverview.htm) is a fully-managed, scalable, and highly available service that you can use to deploy your containerized applications to the cloud.
### Supported Operating Systems

- Ubuntu 22.04
- Oracle Linux 8 (except for the GPU & RDMA worker pool)
### Required policies

The OCI Resource Manager stack template uses the [Self Managed Nodes](https://docs.oracle.com/en-us/iaas/Content/ContEng/Tasks/contengworkingwithselfmanagednodes.htm) functionality of OKE.

The following policies are required. The OCI Resource Manager stack will create them for you if you have the necessary permissions. If you don't have the permissions, please find more information about the policies below.

- [Policy Configuration for Cluster Creation and Deployment](https://docs.oracle.com/en-us/iaas/Content/ContEng/Concepts/contengpolicyconfig.htm)
- [Creating a Dynamic Group and a Policy for Self-Managed Nodes](https://docs.oracle.com/en-us/iaas/Content/ContEng/Tasks/contengdynamicgrouppolicyforselfmanagednodes.htm)
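
As a rough, hedged sketch of what the dynamic group and policy from the second link typically look like (the dynamic group name, compartment OCID, and compartment name below are placeholders; the linked documentation is authoritative):

```
# Dynamic group matching rule (placeholder compartment OCID)
ALL {instance.compartment.id = 'ocid1.compartment.oc1..<unique_id>'}

# Policy allowing instances in the dynamic group to join the OKE cluster
Allow dynamic-group <dynamic-group-name> to {CLUSTER_JOIN} in compartment <compartment-name>
```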
## Instructions for deploying an OKE cluster with GPUs and RDMA connectivity

You will need a CPU pool and a GPU pool. The OCI Resource Manager stack deploys an operational worker pool by default, and you can choose to deploy additional CPU/GPU worker pools.

You can use the following images for both CPU and GPU pools.
> [!NOTE]
> The GPU image has the GPU drivers pre-installed.

**Images for NVIDIA shapes**

- [GPU driver 570 & CUDA 12.8](https://objectstorage.ca-montreal-1.oraclecloud.com/p/ts6fjAuj7hY4io5x_jfX3fyC70HRCG8-9gOFqAjuF0KE0s-6tgDZkbRRZIbMZmoN/n/hpc_limited_availability/b/images/o/Canonical-Ubuntu-22.04-2025.05.20-0-OFED-24.10-1.1.4.0-GPU-570-OPEN-CUDA-12.8-2025.07.22-0)
- [GPU driver 550 & CUDA 12.4](https://objectstorage.ca-montreal-1.oraclecloud.com/p/ts6fjAuj7hY4io5x_jfX3fyC70HRCG8-9gOFqAjuF0KE0s-6tgDZkbRRZIbMZmoN/n/hpc_limited_availability/b/images/o/Canonical-Ubuntu-22.04-2025.05.20-0-OFED-24.10-1.1.4.0-GPU-550-CUDA-12.4-2025.07.22-0)
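
If the images need to be brought into your own tenancy first, a minimal sketch of an import with the OCI CLI follows; the compartment OCID and display name are placeholders, and the full image URL from the list above is passed as the source URI:

```sh
# Import a pre-built image into your tenancy from its Object Storage URL
# (placeholder compartment OCID and display name)
oci compute image import from-object-uri \
  --uri "<image URL from the list above>" \
  --compartment-id "ocid1.compartment.oc1..<unique_id>" \
  --display-name "ubuntu-2204-gpu-570-cuda-12.8"
```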
### Deploy the cluster using the Oracle Cloud Resource Manager template
You can easily deploy the cluster using the **Deploy to Oracle Cloud** button below.

[![Deploy to Oracle Cloud](https://oci-resourcemanager-plugin.plugins.oci.oraclecloud.com/latest/deploy-to-oracle-cloud.svg)](https://github.com/oracle-quickstart/oci-hpc-oke/releases/latest/download/oke-gpu-rdma-quickstart.zip)

For the image ID, use the ID of the image that you imported in the previous step.

The template will deploy a `bastion` instance and an `operator` instance by default. The `operator` instance will have access to the OKE cluster. You can connect to the `operator` instance via SSH with `ssh -J ubuntu@<bastion IP> ubuntu@<operator IP>`.
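
If you connect often, an `~/.ssh/config` entry saves retyping the jump host. A minimal sketch, where the host aliases and IPs are placeholders you fill in:

```
# ~/.ssh/config (placeholder aliases and IPs)
Host oke-bastion
    HostName <bastion IP>
    User ubuntu

Host oke-operator
    HostName <operator IP>
    User ubuntu
    ProxyJump oke-bastion
```

After this, `ssh oke-operator` connects through the bastion in one step.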
You can also find this information under the **Application information** tab in the OCI Resource Manager stack.

From the `operator` instance, you can confirm that the worker nodes have joined the cluster:

```sh
kubectl get nodes

NAME           STATUS   ROLES    AGE     VERSION
10.0.103.73    Ready    <none>   2d23h   v1.31.1
10.0.127.206   Ready    node     2d3h    v1.31.1
10.0.127.32    Ready    node     2d3h    v1.31.1
10.0.83.93     Ready    <none>   2d23h   v1.31.1
10.0.96.82     Ready    node     2d23h   v1.31.1
```
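
As a quick sanity check that the GPU nodes advertise their GPUs to the scheduler, you can list the allocatable GPU count per node. This is a minimal sketch, assuming the NVIDIA device plugin is already running on the GPU pool:

```sh
# Show allocatable NVIDIA GPUs per node (empty means no GPUs, or plugin not ready)
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
```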
### Add a Service Account Authentication Token (optional but recommended)

More info [here](https://docs.oracle.com/en-us/iaas/Content/ContEng/Tasks/contengaddingserviceaccttoken.htm).
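
As a hedged sketch of the general flow in the linked page — create a service account, grant it access, mint a token, and attach the token to a kubeconfig user — with placeholder names throughout (the linked documentation is authoritative, and describes a long-lived secret-based token rather than the short-lived one shown here):

```sh
# Create a service account and bind it to cluster-admin (placeholder names)
kubectl -n kube-system create serviceaccount oke-kubeconfig-sa
kubectl create clusterrolebinding oke-kubeconfig-sa-admin \
  --clusterrole=cluster-admin \
  --serviceaccount=kube-system:oke-kubeconfig-sa

# Mint a token and attach it to a kubeconfig user entry
TOKEN=$(kubectl -n kube-system create token oke-kubeconfig-sa)
kubectl config set-credentials oke-kubeconfig-sa --token="$TOKEN"
```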
> The NCCL parameters are different between the H100 and A100 shapes. Please make sure that you are using the correct manifest for your bare metal GPU shapes.

The initial pull of the container will take a while. Once the master pod `nccl-allreduce-job0-mpimaster-0` starts running, you can check its logs for the NCCL test result.
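
For example, you can stream the logs with the standard `kubectl logs` command (the pod name is the master pod above); the first line it prints looks like the snippet below:

```sh
# Follow the NCCL test output from the master pod
kubectl logs -f nccl-allreduce-job0-mpimaster-0
```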
```sh
Defaulted container "mpimaster" out of: mpimaster, wait-for-workers (init)
```
Please see the instructions [here](./docs/running-pytorch-jobs-on-oke-using-hostnetwork-with-rdma.md) for the best practices on running PyTorch jobs.
### I have large container images. Can I import them from a shared location instead of downloading them?

Yes, you can use OCI's File Storage Service (FSS) with `skopeo` to accomplish that. You can find the instructions [here](./docs/importing-images-from-fss-skopeo.md).
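
As a minimal sketch of the idea, assuming an FSS file system mounted at `/mnt/fss` and a placeholder image (the linked doc covers the full setup, including how nodes consume the shared copy):

```sh
# Copy a large image from its registry into the shared FSS mount once,
# so nodes can import it locally instead of each pulling it from the registry
skopeo copy docker://docker.io/library/nginx:latest oci:/mnt/fss/images/nginx:latest
```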
### How can I run GPU & RDMA health checks in my nodes?

You can deploy the health check script with Node Problem Detector by following the instructions [here](./docs/running-gpu-rdma-healtchecks-with-node-problem-detector.md).
### Can I autoscale my RDMA enabled nodes in a Cluster Network?

You can set up autoscaling for your nodes in a Cluster Network using the instructions [here](./docs/using-cluster-autoscaler-with-cluster-networks.md).
### How do I use network locality information when running workloads on OKE?

You can follow the instructions [here](./docs/using-rdma-network-locality-when-running-workloads-on-oke.md).