# Running RDMA (Remote Direct Memory Access) GPU Workloads on OKE
This guide provides instructions for deploying and managing GPU workloads with RDMA connectivity on Oracle Cloud Infrastructure Kubernetes Engine (OKE). OKE is a fully-managed, scalable, and highly available Kubernetes service that enables you to deploy containerized applications to the cloud.
## Supported Operating Systems
- Ubuntu 22.04
- Oracle Linux 8 (except for the GPU & RDMA worker pool)
## Required Policies
The following policies are required. The OCI Resource Manager stack will create them for you if you have the necessary permissions. If you don't have the permissions, please refer to the policy documentation below.
- [Policy Configuration for Cluster Creation and Deployment](https://docs.oracle.com/en-us/iaas/Content/ContEng/Concepts/contengpolicyconfig.htm)
- [Creating a Dynamic Group and a Policy for Self-Managed Nodes](https://docs.oracle.com/en-us/iaas/Content/ContEng/Tasks/contengdynamicgrouppolicyforselfmanagednodes.htm)
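
For reference, the dynamic group rule and the policy statement for self-managed nodes typically look like the following. The first line is the dynamic group's matching rule and the second is the policy that allows those instances to join the cluster; the group, compartment names, and OCID are placeholders, and the linked documentation remains the authoritative source.

```
ALL {instance.compartment.id = '<compartment-ocid-of-your-worker-nodes>'}

Allow dynamic-group <your-dynamic-group-name> to {CLUSTER_JOIN} in compartment <your-compartment-name>
```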
## Deploying an OKE Cluster with GPUs and RDMA Connectivity
You will need a CPU pool and a GPU pool. The OCI Resource Manager stack deploys a system worker pool by default, and you can choose to deploy additional CPU/GPU worker pools.
You can use the following images for both CPU and GPU pools.
> [!NOTE]
> The GPU image has the GPU drivers pre-installed.
### Images to Use
You can use the instructions [here](https://docs.oracle.com/en-us/iaas/Content/Compute/Tasks/imageimportexport.htm#Importing) for importing the images below to your tenancy.
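
The linked documentation covers the console-based import flow. If you prefer the OCI CLI, an import from an Object Storage URL looks roughly like the sketch below; the compartment OCID and display name are placeholders, and the command and flags should be verified against the current CLI reference before use.

```sh
# Illustrative only: import a custom image from an Object Storage URL.
# Replace the compartment OCID and display name with your own values,
# and use one of the image URLs listed below as --uri.
oci compute image import from-object-uri \
  --uri "<image-url-from-the-list-below>" \
  --compartment-id "ocid1.compartment.oc1..<your-compartment-ocid>" \
  --display-name "OKE-GPU-RDMA-worker-image"
```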
#### Image for Non-GPU Nodes

- [Link to import the image](https://objectstorage.us-chicago-1.oraclecloud.com/p/O1VP9Rx0p7uWKRQW6739ZzTbnUPK5F8cvlN0apUaiO_cF5x9R2ESYN6yskW0FUVq/n/hpc_limited_availability/b/oke-images-do-not-delete/o/Canonical-Ubuntu-22.04-2025.03.28-0-OKE)

#### Images for NVIDIA x86 Shapes (B200, H200, H100, A100, L40s, A10)
- [GPU driver 570 & CUDA 12.8](https://objectstorage.ca-montreal-1.oraclecloud.com/p/ts6fjAuj7hY4io5x_jfX3fyC70HRCG8-9gOFqAjuF0KE0s-6tgDZkbRRZIbMZmoN/n/hpc_limited_availability/b/images/o/Canonical-Ubuntu-22.04-2025.05.20-0-OFED-24.10-1.1.4.0-GPU-570-OPEN-CUDA-12.8-2025.07.22-0)
- [GPU driver 550 & CUDA 12.4](https://objectstorage.ca-montreal-1.oraclecloud.com/p/ts6fjAuj7hY4io5x_jfX3fyC70HRCG8-9gOFqAjuF0KE0s-6tgDZkbRRZIbMZmoN/n/hpc_limited_availability/b/images/o/Canonical-Ubuntu-22.04-2025.05.20-0-OFED-24.10-1.1.4.0-GPU-550-CUDA-12.4-2025.07.22-0)
#### Images for NVIDIA Arm Shapes (GB200)
- [GPU driver 570 & CUDA 12.8](https://objectstorage.ca-montreal-1.oraclecloud.com/p/ts6fjAuj7hY4io5x_jfX3fyC70HRCG8-9gOFqAjuF0KE0s-6tgDZkbRRZIbMZmoN/n/hpc_limited_availability/b/images/o/Canonical-Ubuntu-22.04-aarch64-2025.05.20-0-DOCA-OFED-3.0.0-GPU-570-OPEN-CUDA-12.8-2025.07.24-0)
### Deploy the Cluster Using the Oracle Cloud Resource Manager Template
You can easily deploy the cluster using the **Deploy to Oracle Cloud** button below.

[Deploy to Oracle Cloud](https://cloud.oracle.com/resourcemanager/stacks/create?zipUrl=https://github.com/oracle-quickstart/oci-hpc-oke/releases/latest/download/oke-gpu-rdma-quickstart.zip)

The template will deploy a `bastion` instance and an `operator` instance by default. You can also find this information under the **Application information** tab in the OCI Resource Manager stack.

![instance-principal-setting](./docs/images/instance-principal.png)

### Verify Node Availability
Wait until all nodes are ready in the cluster:
```sh
kubectl get nodes
NAME         STATUS   ROLES   AGE     VERSION
10.0.96.82   Ready    node    2d23h   v1.31.1
```
### Add a Service Account Authentication Token (Optional but Recommended)
For more information, see [Adding a Service Account Token](https://docs.oracle.com/en-us/iaas/Content/ContEng/Tasks/contengaddingserviceaccttoken.htm).
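
As a rough sketch of the flow (the names below are illustrative, and the exact steps in the linked documentation take precedence), you create a service account, bind it to a cluster role, and add a token for it to your kubeconfig:

```sh
# Illustrative sketch only -- follow the linked documentation for the exact steps.
# Create a service account and bind it to cluster-admin (example names).
kubectl -n kube-system create serviceaccount kubeconfig-sa
kubectl create clusterrolebinding kubeconfig-sa-admin \
  --clusterrole=cluster-admin \
  --serviceaccount=kube-system:kubeconfig-sa

# Issue a token and add it to the current kubeconfig context.
TOKEN=$(kubectl -n kube-system create token kubeconfig-sa)
kubectl config set-credentials kubeconfig-sa-user --token="$TOKEN"
kubectl config set-context --current --user=kubeconfig-sa-user
```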
The initial container image pull may take some time. Once the launcher pod `nccl-test-launcher-XXXXX` starts running, you can check its logs for the NCCL test results.
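
For example, assuming the default namespace and using a placeholder for the actual launcher pod name:

```sh
# Find the launcher pod, then follow its logs for the NCCL test results
kubectl get pods | grep nccl-test-launcher
kubectl logs -f <nccl-test-launcher-pod-name>
```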
### Example Output
```sh
Waiting for workers to be ready...
```
Yes, you can use OCI's File Storage Service (FSS) with `skopeo` to accomplish that. You can find the instructions [here](./docs/importing-images-from-fss-skopeo.md).
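
As a rough illustration of the idea (the registry, image name, and FSS mount path below are hypothetical, and the linked guide is the authoritative procedure), you copy each image once into a directory on an FSS mount that is shared with the worker nodes:

```sh
# Hypothetical example: copy an image from a registry into an OCI-layout
# directory on an FSS mount that the worker nodes can read.
skopeo copy \
  docker://ghcr.io/example-org/training-image:latest \
  oci:/mnt/fss/images/training-image:latest
```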
### How can I run GPU & RDMA health checks on my nodes?
You can deploy the health check script with Node Problem Detector by following the instructions [here](./docs/running-gpu-rdma-healthchecks-with-node-problem-detector.md).
### Can I autoscale my RDMA-enabled nodes in a Cluster Network?
You can set up autoscaling for your nodes in a Cluster Network using the instructions [here](./docs/using-cluster-autoscaler-with-cluster-networks.md).
# Adding SSH Public Keys to Worker Nodes
When you create worker nodes with the OCI Resource Manager stack, a single SSH public key is added by default. You may need to add additional SSH keys for team access, automation, or administrative purposes. This guide explains how to add multiple SSH keys to your worker nodes using Kubernetes resources.
## Prerequisites
- Access to your OKE cluster
- kubectl configured and authenticated
- SSH public keys that you want to add to the worker nodes
## Procedure
### Step 1: Create a ConfigMap for SSH Keys
Create a `ConfigMap` containing the SSH public keys you want to add to the worker nodes:
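
A minimal sketch of such a `ConfigMap` is shown below. The name, namespace, and data key are illustrative (the DaemonSet used with this guide may expect a specific name and key), and the SSH keys are placeholders:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: authorized-ssh-keys      # illustrative name
  namespace: kube-system
data:
  # One public key per line, as it should appear in authorized_keys
  authorized_keys: |
    ssh-ed25519 AAAAC3...example user1@example.com
    ssh-rsa AAAAB3...example user2@example.com
```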
A companion DaemonSet applies these keys to the worker nodes. When its pods start, they will:

- Update the `authorized_keys` file on each node at pod startup
- Work with both Ubuntu (user: `ubuntu`) and Oracle Linux (user: `opc`) nodes
> [!NOTE]
> The keys are applied when the DaemonSet pods start. To update keys after the initial deployment, you will need to restart the pods (see "Adding or Updating Keys" section below).
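
For example, assuming the DaemonSet is named `authorized-ssh-keys` in the `kube-system` namespace (adjust to the actual name used in your deployment):

```sh
# Restart the DaemonSet pods so they re-read the ConfigMap and reapply the keys
kubectl -n kube-system rollout restart daemonset authorized-ssh-keys
```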
## Verification
To verify that the SSH keys have been successfully added:
1. Check that the DaemonSet pods are running on all nodes:
```sh
kubectl get pods -n kube-system -l app=authorized-ssh-keys -o wide
```
2. Check the logs of a DaemonSet pod to confirm key updates:
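
For example (the pod name is a placeholder; use one of the pod names returned by the command above):

```sh
# Inspect a pod's logs to confirm the authorized_keys updates were applied
kubectl logs -n kube-system <authorized-ssh-keys-pod-name>
```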