
Commit d9fcd2d

Update docs
1 parent 72c584b

14 files changed: +956 −1,932 lines

README.md

Lines changed: 50 additions & 44 deletions
# Running RDMA (Remote Direct Memory Access) GPU Workloads on OKE

This guide provides instructions for deploying and managing GPU workloads with RDMA connectivity on Oracle Cloud Infrastructure Kubernetes Engine (OKE). OKE is a fully-managed, scalable, and highly available Kubernetes service that enables you to deploy containerized applications to the cloud.

## Supported Operating Systems

- Ubuntu 22.04
- Oracle Linux 8 (except for the GPU & RDMA worker pool)

## Required Policies

The following policies are required. The OCI Resource Manager stack will create them for you if you have the necessary permissions. If you don't have the permissions, please refer to the policy documentation below.

- [Policy Configuration for Cluster Creation and Deployment](https://docs.oracle.com/en-us/iaas/Content/ContEng/Concepts/contengpolicyconfig.htm)
- [Creating a Dynamic Group and a Policy for Self-Managed Nodes](https://docs.oracle.com/en-us/iaas/Content/ContEng/Tasks/contengdynamicgrouppolicyforselfmanagednodes.htm)

## Deploying an OKE Cluster with GPUs and RDMA Connectivity

You will need a CPU pool and a GPU pool. The OCI Resource Manager stack deploys a system worker pool by default, and you can choose to deploy additional CPU/GPU worker pools.

You can use the following images for both CPU and GPU pools.

> [!NOTE]
> The GPU image has the GPU drivers pre-installed.
### Images to Use

You can use the instructions [here](https://docs.oracle.com/en-us/iaas/Content/Compute/Tasks/imageimportexport.htm#Importing) for importing the images below to your tenancy.

#### Images for NVIDIA Shapes

- [GPU driver 570 & CUDA 12.8](https://objectstorage.ca-montreal-1.oraclecloud.com/p/ts6fjAuj7hY4io5x_jfX3fyC70HRCG8-9gOFqAjuF0KE0s-6tgDZkbRRZIbMZmoN/n/hpc_limited_availability/b/images/o/Canonical-Ubuntu-22.04-2025.05.20-0-OFED-24.10-1.1.4.0-GPU-570-OPEN-CUDA-12.8-2025.07.22-0)
- [GPU driver 550 & CUDA 12.4](https://objectstorage.ca-montreal-1.oraclecloud.com/p/ts6fjAuj7hY4io5x_jfX3fyC70HRCG8-9gOFqAjuF0KE0s-6tgDZkbRRZIbMZmoN/n/hpc_limited_availability/b/images/o/Canonical-Ubuntu-22.04-2025.05.20-0-OFED-24.10-1.1.4.0-GPU-550-CUDA-12.4-2025.07.22-0)

#### Images for AMD Shapes

- [ROCm 6.3.2](https://objectstorage.ca-montreal-1.oraclecloud.com/p/ts6fjAuj7hY4io5x_jfX3fyC70HRCG8-9gOFqAjuF0KE0s-6tgDZkbRRZIbMZmoN/n/hpc_limited_availability/b/images/o/Canonical-Ubuntu-22.04-2025.05.20-0-OFED-24.10-1.1.4.0-AMD-ROCM-632-2025.07.23-0)
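If you prefer to script the import rather than use the console, the OCI CLI can import an image directly from an object storage URL. This is a minimal sketch; the compartment OCID and display name are placeholders, and the `--uri` value should be one of the pre-authenticated image URLs listed above:

```sh
# Import a worker node image into your tenancy from a pre-authenticated object URL.
oci compute image import from-object-uri \
  --uri "<pre-authenticated image URL from the list above>" \
  --compartment-id "<your-compartment-ocid>" \
  --display-name "oke-gpu-worker-image"
```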

### Deploy the Cluster Using the Oracle Cloud Resource Manager Template

You can easily deploy the cluster using the **Deploy to Oracle Cloud** button below.
[![Deploy to Oracle Cloud](https://oci-resourcemanager-plugin.plugins.oci.oraclecloud.com/latest/deploy-to-oracle-cloud.svg)](https://cloud.oracle.com/resourcemanager/stacks/create?zipUrl=https://github.com/oracle-quickstart/oci-hpc-oke/releases/latest/download/oke-gpu-rdma-quickstart.zip)
The template will deploy a `bastion` instance and an `operator` instance by default. […]

You can also find this information under the **Application information** tab in the OCI Resource Manager stack.

![Application Information Tab](./docs/images/rms-application-information.png)

### Verify Node Availability

Wait until all nodes are ready in the cluster:

```sh
kubectl get nodes

NAME           STATUS   ROLES   AGE     VERSION
...
10.0.96.82     Ready    node    2d23h   v1.31.1
```
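Optionally, you can also confirm that the GPU worker nodes advertise their GPUs to the scheduler. The check below is illustrative: the `nvidia.com/gpu` resource name assumes the NVIDIA device plugin is running on the nodes, and the column will be empty for CPU-only nodes.

```sh
# Show the allocatable GPU count per node.
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
```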

### Add a Service Account Authentication Token (Optional but Recommended)

For more information, see [Adding a Service Account Token](https://docs.oracle.com/en-us/iaas/Content/ContEng/Tasks/contengaddingserviceaccttoken.htm).

```sh
kubectl -n kube-system create serviceaccount kubeconfig-sa

# ... (token creation and retrieval steps not shown in this excerpt) ...

kubectl config set-credentials kubeconfig-sa --token=$TOKEN
kubectl config set-context --current --user=kubeconfig-sa
```
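To confirm that the new context works, any read-only request will do. The commands below are illustrative; the second one checks a specific permission for the service account created above:

```sh
# Should list the cluster nodes using the kubeconfig-sa credentials.
kubectl get nodes

# Check a specific permission for the service account.
kubectl auth can-i list nodes --as=system:serviceaccount:kube-system:kubeconfig-sa
```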

### Using Host RDMA Network Interfaces in Manifests

To use the RDMA interfaces on the host in your pods, include the following sections in your manifests:

```yaml
spec:
# ... (the RDMA-related sections of the example manifest are collapsed in this excerpt) ...
            sleep 1000000
```
## Optional: Deploy Kueue & MPI Operator to Run NCCL Tests

Kueue and MPI Operator are required for running the optional NCCL tests.

### Deploy MPI Operator and Kueue

```sh
kubectl apply --server-side -f https://raw.githubusercontent.com/kubeflow/mpi-operator/v0.6.0/deploy/v2beta1/mpi-operator.yaml

helm install kueue oci://registry.k8s.io/kueue/charts/kueue --version="0.14.1" --create-namespace --namespace=kueue-system
```
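Before running the tests, it is worth confirming that both controllers are up. A quick illustrative check; the `mpi-operator` namespace name is an assumption based on the upstream manifest, while `kueue-system` matches the Helm command above:

```sh
kubectl get pods -n mpi-operator
kubectl get pods -n kueue-system
```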

### Run the NCCL/RCCL Tests

> [!IMPORTANT]
> The NCCL parameters differ between GPU shapes. Ensure that you use the correct manifest for your specific bare metal GPU shape.
#### BM.GPU.GB200-v2.4

```sh
kubectl apply -f https://raw.githubusercontent.com/oracle-quickstart/oci-hpc-oke/main/manifests/nccl-tests/kueue/BM.GPU.GB200-v2.4.yaml
```

#### BM.GPU.GB200.4

```sh
kubectl apply -f https://raw.githubusercontent.com/oracle-quickstart/oci-hpc-oke/main/manifests/nccl-tests/kueue/BM.GPU.GB200.4.yaml
```

#### BM.GPU.H200

```sh
kubectl apply -f https://raw.githubusercontent.com/oracle-quickstart/oci-hpc-oke/main/manifests/nccl-tests/kueue/BM.GPU.H200.8.yaml
```

#### BM.GPU.H100

```sh
kubectl apply -f https://raw.githubusercontent.com/oracle-quickstart/oci-hpc-oke/main/manifests/nccl-tests/kueue/BM.GPU.H100.8.yaml
```

#### BM.GPU.A100-v2.8

```sh
kubectl apply -f https://raw.githubusercontent.com/oracle-quickstart/oci-hpc-oke/main/manifests/nccl-tests/kueue/BM.GPU.A100-v2.8.yaml
```

#### BM.GPU4.8

```sh
kubectl apply -f https://raw.githubusercontent.com/oracle-quickstart/oci-hpc-oke/main/manifests/nccl-tests/kueue/BM.GPU4.8.yaml
```

#### BM.GPU.B4.8

```sh
kubectl apply -f https://raw.githubusercontent.com/oracle-quickstart/oci-hpc-oke/main/manifests/nccl-tests/kueue/BM.GPU.B4.8.yaml
```

The initial container image pull may take some time. Once the launcher pod `nccl-test-launcher-XXXXX` starts running, you can check its logs for the NCCL test results.

### Example Output

```sh
Waiting for workers to be ready...
...
```
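To follow the results yourself, locate the launcher pod once it is running and stream its logs. The commands below are illustrative and assume the test job runs in your current namespace; the pod name is a placeholder:

```sh
# Locate the launcher pod created by the NCCL test job.
kubectl get pods | grep nccl-test-launcher

# Stream its logs to see the NCCL test results.
kubectl logs -f <nccl-test-launcher-pod-name>
```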
[…]

Yes, you can use OCI's File Storage Service (FSS) with `skopeo` to accomplish that. You can find the instructions [here](./docs/importing-images-from-fss-skopeo.md).
### How can I run GPU & RDMA health checks on my nodes?
You can deploy the health check script with Node Problem Detector by following the instructions [here](./docs/running-gpu-rdma-healthchecks-with-node-problem-detector.md).

### Can I autoscale my RDMA-enabled nodes in a Cluster Network?
You can set up autoscaling for your nodes in a Cluster Network using the instructions [here](./docs/using-cluster-autoscaler-with-cluster-networks.md).

docs/adding-ssh-keys-to-worker-nodes.md

Lines changed: 92 additions & 4 deletions
# Adding SSH Public Keys to Worker Nodes

When you create worker nodes with the OCI Resource Manager stack, a single SSH public key is added by default. You may need to add additional SSH keys for team access, automation, or administrative purposes. This guide explains how to add multiple SSH keys to your worker nodes using Kubernetes resources.

## Prerequisites

- Access to your OKE cluster
- kubectl configured and authenticated
- SSH public keys that you want to add to the worker nodes

## Procedure

### Step 1: Create a ConfigMap for SSH Keys

Create a `ConfigMap` containing the SSH public keys you want to add to the worker nodes:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  # Name and namespace assumed from the kubectl commands used later in this guide.
  name: authorized-ssh-keys
  namespace: kube-system
data:
  # Add one entry per public key you want to distribute.
  key1.pub: 'ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQD....'
  key2.pub: 'ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQD....'
```
Apply the ConfigMap:

```sh
kubectl apply -f configmap.yaml
```

### Step 2: Deploy the DaemonSet

Deploy a `DaemonSet` to automatically distribute and manage the SSH keys across all worker nodes:

```yaml
apiVersion: apps/v1
kind: DaemonSet
# ... (most of the DaemonSet manifest is collapsed in this excerpt) ...
          - { name: root, mountPath: /host }
          - { name: authorized-ssh-keys, mountPath: /authorized }
```
Apply the DaemonSet:

```sh
kubectl apply -f daemonset.yaml
```

The DaemonSet will automatically:

- Deploy a pod on each worker node
- Read SSH keys from the ConfigMap
- Update the `authorized_keys` file on each node at pod startup
- Work with both Ubuntu (user: `ubuntu`) and Oracle Linux (user: `opc`) nodes

> [!NOTE]
> The keys are applied when the DaemonSet pods start. To update keys after the initial deployment, you will need to restart the pods (see the "Adding or Updating Keys" section below).
## Verification

To verify that the SSH keys have been successfully added:

1. Check that the DaemonSet pods are running on all nodes:

```sh
kubectl get pods -n kube-system -l app=authorized-ssh-keys -o wide
```

2. Check the logs of a DaemonSet pod to confirm key updates:

```sh
kubectl logs -n kube-system -l app=authorized-ssh-keys --tail=20
```
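As a final check, you can attempt an SSH connection to a worker node with one of the newly added keys. The private key path and node IP below are placeholders; use the `ubuntu` user for Ubuntu nodes or `opc` for Oracle Linux nodes:

```sh
# Connect to a worker node with a newly added key (run from a host that can reach the node).
ssh -i ~/.ssh/new_team_key ubuntu@<worker-node-ip>
```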
## Adding or Updating Keys

To add or update SSH keys after the initial deployment:

1. Edit the ConfigMap to add or modify keys:

```sh
kubectl edit configmap authorized-ssh-keys -n kube-system
```

2. Restart the DaemonSet pods to apply the changes:

```sh
kubectl rollout restart daemonset/authorized-ssh-keys -n kube-system
```

The pods will be restarted with a rolling update strategy, ensuring continuous availability while applying the new keys across all nodes.
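If you prefer a non-interactive workflow (for example, in automation), the same update can be done with a merge patch instead of `kubectl edit`. This is a sketch; the `key3.pub` entry and its value are placeholders for your own key:

```sh
# Add or update a single key without opening an editor, then restart the DaemonSet pods.
kubectl patch configmap authorized-ssh-keys -n kube-system \
  --type merge -p '{"data":{"key3.pub":"ssh-ed25519 AAAA... user@example.com"}}'
kubectl rollout restart daemonset/authorized-ssh-keys -n kube-system
```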
## Removing Keys

To remove an SSH key:

1. Delete the key entry from the ConfigMap:

```sh
kubectl edit configmap authorized-ssh-keys -n kube-system
```

2. Remove the specific key line from the `data` section and save.

3. Restart the DaemonSet pods to apply the changes:

```sh
kubectl rollout restart daemonset/authorized-ssh-keys -n kube-system
```

The key will be removed from all worker nodes as the pods restart.
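To confirm that a key is gone, you can inspect a node's `authorized_keys` file through one of the DaemonSet pods. This assumes an Ubuntu node and the `/host` root mount shown in the DaemonSet above; adjust the path for Oracle Linux (`/host/home/opc/.ssh/authorized_keys`):

```sh
# Pick one DaemonSet pod and print the authorized_keys file it manages on that node.
POD=$(kubectl get pods -n kube-system -l app=authorized-ssh-keys -o jsonpath='{.items[0].metadata.name}')
kubectl exec -n kube-system "$POD" -- cat /host/home/ubuntu/.ssh/authorized_keys
```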

docs/building-ol7-gpu-operator-driver-image.md

Lines changed: 1 addition & 1 deletion
Example:

```
docker push oguzpastirmaci/driver:510.85.02-ol7.9
```
File renamed without changes (688 KB).
