
Commit 3e2da79

Merge pull request #82 from oracle-quickstart/update-docs: Update docs
Parents: 461cf73, 1ebaf3e

14 files changed: 1122 additions, 2001 deletions

README.md

Lines changed: 56 additions & 44 deletions
@@ -1,44 +1,46 @@
-# Running RDMA (remote direct memory access) GPU workloads on OKE
-Oracle Cloud Infrastructure Kubernetes Engine (OKE) is a fully-managed, scalable, and highly available service that you can use to deploy your containerized applications to the cloud.
+# Running RDMA (Remote Direct Memory Access) GPU Workloads on OKE
 
-### Supported Operating Systems
+This guide provides instructions for deploying and managing GPU workloads with RDMA connectivity on Oracle Cloud Infrastructure Kubernetes Engine (OKE). OKE is a fully-managed, scalable, and highly available Kubernetes service that enables you to deploy containerized applications to the cloud.
+
+## Supported Operating Systems
 - Ubuntu 22.04
 - Oracle Linux 8 (except for the GPU & RDMA worker pool)
 
-### Required policies
-The following policies are required. The OCI Resource Manager stack will create them for you if you have the necessary permissions. If you don't have the permissions, please find more information about the policies below.
+## Required Policies
+The following policies are required. The OCI Resource Manager stack will create them for you if you have the necessary permissions. If you don't have the permissions, please refer to the policy documentation below.
 
 - [Policy Configuration for Cluster Creation and Deployment](https://docs.oracle.com/en-us/iaas/Content/ContEng/Concepts/contengpolicyconfig.htm)
 - [Creating a Dynamic Group and a Policy for Self-Managed Nodes](https://docs.oracle.com/en-us/iaas/Content/ContEng/Tasks/contengdynamicgrouppolicyforselfmanagednodes.htm)
 
-## Instructions for deploying an OKE cluster with GPUs and RDMA connectivity
-You will need a CPU pool and a GPU pool. The OCI Resource Manager stack deploys an operational worker pool by default and you choose to deploy additional CPU/GPU worker pools.
+## Deploying an OKE Cluster with GPUs and RDMA Connectivity
+
+You will need a CPU pool and a GPU pool. The OCI Resource Manager stack deploys a system worker pool by default, and you can choose to deploy additional CPU/GPU worker pools.
 
 You can use the following images for both CPU and GPU pools.
 
 > [!NOTE]
 > The GPU image has the GPU drivers pre-installed.
 
-#### Images to use
-You can use the instructions [here](https://docs.oracle.com/en-us/iaas/Content/Compute/Tasks/imageimportexport.htm#Importing) for importing the below image to your tenancy.
+### Images to Use
 
-**Image to use for non-GPU nodes**
+You can use the instructions [here](https://docs.oracle.com/en-us/iaas/Content/Compute/Tasks/imageimportexport.htm#Importing) for importing the images below to your tenancy.
 
-- [Link to import the image](https://objectstorage.us-chicago-1.oraclecloud.com/p/O1VP9Rx0p7uWKRQW6739ZzTbnUPK5F8cvlN0apUaiO_cF5x9R2ESYN6yskW0FUVq/n/hpc_limited_availability/b/oke-images-do-not-delete/o/Canonical-Ubuntu-22.04-2025.03.28-0-OKE)
-
-**Images for NVIDIA shapes**
+#### Images for NVIDIA x86 Shapes (B200, H200, H100, A100, L40s, A10)
 
 - [GPU driver 570 & CUDA 12.8](https://objectstorage.ca-montreal-1.oraclecloud.com/p/ts6fjAuj7hY4io5x_jfX3fyC70HRCG8-9gOFqAjuF0KE0s-6tgDZkbRRZIbMZmoN/n/hpc_limited_availability/b/images/o/Canonical-Ubuntu-22.04-2025.05.20-0-OFED-24.10-1.1.4.0-GPU-570-OPEN-CUDA-12.8-2025.07.22-0)
-
 - [GPU driver 550 & CUDA 12.4](https://objectstorage.ca-montreal-1.oraclecloud.com/p/ts6fjAuj7hY4io5x_jfX3fyC70HRCG8-9gOFqAjuF0KE0s-6tgDZkbRRZIbMZmoN/n/hpc_limited_availability/b/images/o/Canonical-Ubuntu-22.04-2025.05.20-0-OFED-24.10-1.1.4.0-GPU-550-CUDA-12.4-2025.07.22-0)
 
+#### Images for NVIDIA Arm Shapes (GB200)
 
-**Image for AMD shapes**
+- [GPU driver 570 & CUDA 12.8](https://objectstorage.ca-montreal-1.oraclecloud.com/p/ts6fjAuj7hY4io5x_jfX3fyC70HRCG8-9gOFqAjuF0KE0s-6tgDZkbRRZIbMZmoN/n/hpc_limited_availability/b/images/o/Canonical-Ubuntu-22.04-aarch64-2025.05.20-0-DOCA-OFED-3.0.0-GPU-570-OPEN-CUDA-12.8-2025.07.24-0)
 
-- [ROCm 6.3.2](https://objectstorage.ca-montreal-1.oraclecloud.com/p/ts6fjAuj7hY4io5x_jfX3fyC70HRCG8-9gOFqAjuF0KE0s-6tgDZkbRRZIbMZmoN/n/hpc_limited_availability/b/images/o/Canonical-Ubuntu-22.04-2025.05.20-0-OFED-24.10-1.1.4.0-AMD-ROCM-632-2025.07.23-0)
+#### Images for AMD Shapes
+
+- [ROCm 6.4.3](https://objectstorage.ca-montreal-1.oraclecloud.com/p/ts6fjAuj7hY4io5x_jfX3fyC70HRCG8-9gOFqAjuF0KE0s-6tgDZkbRRZIbMZmoN/n/hpc_limited_availability/b/images/o/Canonical-Ubuntu-22.04-2025.07.23-0-DOCA-OFED-3.1.0-AMD-ROCM-643-2025.09.25-0)
 
+- [ROCm 6.3.2](https://objectstorage.ca-montreal-1.oraclecloud.com/p/ts6fjAuj7hY4io5x_jfX3fyC70HRCG8-9gOFqAjuF0KE0s-6tgDZkbRRZIbMZmoN/n/hpc_limited_availability/b/images/o/Canonical-Ubuntu-22.04-2025.05.20-0-OFED-24.10-1.1.4.0-AMD-ROCM-632-2025.07.23-0)
 
-### Deploy the cluster using the Oracle Cloud Resource Manager template
+### Deploy the Cluster Using the Oracle Cloud Resource Manager Template
 You can easily deploy the cluster using the **Deploy to Oracle Cloud** button below.
 
 [![Deploy to Oracle Cloud](https://oci-resourcemanager-plugin.plugins.oci.oraclecloud.com/latest/deploy-to-oracle-cloud.svg)](https://cloud.oracle.com/resourcemanager/stacks/create?zipUrl=https://github.com/oracle-quickstart/oci-hpc-oke/releases/latest/download/oke-gpu-rdma-quickstart.zip)
@@ -49,7 +51,11 @@ The template will deploy a `bastion` instance and an `operator` instance by defa
 
 You can also find this information under the **Application information** tab in the OCI Resource Manager stack.
 
-### Wait until you see all nodes in the cluster
+![Application Information Tab](./docs/images/rms-application-information.png)
+
+### Verify Node Availability
+
+Wait until all nodes are ready in the cluster:
 
 ```sh
 kubectl get nodes
@@ -62,8 +68,9 @@ NAME STATUS ROLES AGE VERSION
 10.0.96.82 Ready node 2d23h v1.31.1
 ```
 
-### Add a Service Account Authentication Token (optional but recommended)
-More info [here](https://docs.oracle.com/en-us/iaas/Content/ContEng/Tasks/contengaddingserviceaccttoken.htm).
+### Add a Service Account Authentication Token (Optional but Recommended)
+
+For more information, see [Adding a Service Account Token](https://docs.oracle.com/en-us/iaas/Content/ContEng/Tasks/contengaddingserviceaccttoken.htm).
 
 ```
 kubectl -n kube-system create serviceaccount kubeconfig-sa
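The node-readiness check shown in the hunks above can be scripted when you are waiting on a large pool to join. A minimal sketch, assuming the standard `kubectl get nodes` column layout; the sample data below is illustrative, and in a live cluster you would pipe real `kubectl get nodes` output into the function:

```sh
# Count nodes whose STATUS column is not "Ready" (skips the header row).
count_not_ready() {
  awk 'NR > 1 && $2 != "Ready" { n++ } END { print n + 0 }'
}

# Illustrative sample (not real cluster output):
count_not_ready <<'EOF'
NAME         STATUS     ROLES   AGE     VERSION
10.0.96.82   Ready      node    2d23h   v1.31.1
10.0.97.11   NotReady   node    2m      v1.31.1
EOF
# prints: 1
```

Against a live cluster you would loop on `kubectl get nodes | count_not_ready` until it prints `0`.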
@@ -79,8 +86,9 @@ kubectl config set-credentials kubeconfig-sa --token=$TOKEN
 kubectl config set-context --current --user=kubeconfig-sa
 ```
 
-### Using the host RDMA network interfaces in manifests
-In order to use the RDMA interfaces on the host in your pods, you should have the below sections in your manifests:
+### Using Host RDMA Network Interfaces in Manifests
+
+To use the RDMA interfaces on the host in your pods, include the following sections in your manifests:
 
 ```yaml
 spec:
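A note on the `--token=$TOKEN` step shown in the hunk header above: Kubernetes stores the service account token base64-encoded in the account's secret, so it must be decoded before being handed to `kubectl config set-credentials`. A minimal sketch of just the decode step; the secret name in the comment is a hypothetical placeholder and the token value is illustrative:

```sh
# In a cluster the encoded token would come from the secret, e.g.
# (hypothetical secret name, not defined in this guide):
#   kubectl -n kube-system get secret kubeconfig-sa-token -o jsonpath='{.data.token}'

# The decode step itself, with illustrative data:
ENCODED=$(printf 'illustrative-token' | base64)
TOKEN=$(printf '%s' "$ENCODED" | base64 --decode)
echo "$TOKEN"
# prints: illustrative-token
```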
@@ -139,56 +147,60 @@ spec:
 sleep 1000000
 ```
 
-### Optional - Deploy Kueue & MPI Operator to run NCCL tests
-Kueue & MPI Operator are needed for running the optional NCCL tests.
+## Optional: Deploy Kueue & MPI Operator to Run NCCL Tests
 
-#### Deploy MPI Operator & Kueue
+Kueue and MPI Operator are required for running the optional NCCL tests.
+
+### Deploy MPI Operator and Kueue
 ```sh
 kubectl apply --server-side -f https://raw.githubusercontent.com/kubeflow/mpi-operator/v0.6.0/deploy/v2beta1/mpi-operator.yaml
 
-helm install kueue oci://registry.k8s.io/kueue/charts/kueue --version="0.13.4" --create-namespace --namespace=kueue-system
+helm install kueue oci://registry.k8s.io/kueue/charts/kueue --version="0.14.1" --create-namespace --namespace=kueue-system
 ```
 
-#### Run the NCCL/RCCL tests
+### Run the NCCL/RCCL Tests
+
 > [!IMPORTANT]
-> The NCCL parameters are different between different shapes. Please make sure that you are using the correct manifest for your bare metal GPU shapes.
+> The NCCL parameters differ between GPU shapes. Ensure that you use the correct manifest for your specific bare metal GPU shape.
 
-##### BM.GPU.GB200-v2.4
-```
+#### BM.GPU.GB200-v2.4
+```sh
 kubectl apply -f https://raw.githubusercontent.com/oracle-quickstart/oci-hpc-oke/main/manifests/nccl-tests/kueue/BM.GPU.GB200-v2.4.yaml
 ```
 
-##### BM.GPU.GB200.4
-```
+#### BM.GPU.GB200.4
+```sh
 kubectl apply -f https://raw.githubusercontent.com/oracle-quickstart/oci-hpc-oke/main/manifests/nccl-tests/kueue/BM.GPU.GB200.4.yaml
 ```
 
-##### BM.GPU.H200
-```
+#### BM.GPU.H200
+```sh
 kubectl apply -f https://raw.githubusercontent.com/oracle-quickstart/oci-hpc-oke/main/manifests/nccl-tests/kueue/BM.GPU.H200.8.yaml
 ```
 
-##### BM.GPU.H100
-```
+#### BM.GPU.H100
+```sh
 kubectl apply -f https://raw.githubusercontent.com/oracle-quickstart/oci-hpc-oke/main/manifests/nccl-tests/kueue/BM.GPU.H100.8.yaml
 ```
 
-##### BM.GPU.A100-v2.8
-```
+#### BM.GPU.A100-v2.8
+```sh
 kubectl apply -f https://raw.githubusercontent.com/oracle-quickstart/oci-hpc-oke/main/manifests/nccl-tests/kueue/BM.GPU.A100-v2.8.yaml
 ```
 
-##### BM.GPU4.8
-```
+#### BM.GPU4.8
+```sh
 kubectl apply -f https://raw.githubusercontent.com/oracle-quickstart/oci-hpc-oke/main/manifests/nccl-tests/kueue/BM.GPU4.8.yaml
 ```
 
-##### BM.GPU.B4.8
-```
+#### BM.GPU.B4.8
+```sh
 kubectl apply -f https://raw.githubusercontent.com/oracle-quickstart/oci-hpc-oke/main/manifests/nccl-tests/kueue/BM.GPU.B4.8.yaml
 ```
 
-The initial pull of the container will take long. Once the launcher pod `nccl-test-launcher-XXXXX` starts running, you can check its logs for the NCCL test result.
+The initial container image pull may take some time. Once the launcher pod `nccl-test-launcher-XXXXX` starts running, you can check its logs for the NCCL test results.
+
+### Example Output
 
 ```sh
 Waiting for workers to be ready...
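Once the launcher pod finishes, the tail of its log carries the nccl-tests summary. A small sketch for pulling the average bus bandwidth out of a saved log; the sample lines below mimic the nccl-tests summary format and the value is illustrative, not a real result:

```sh
# Extract the value from the "Avg bus bandwidth" summary line.
avg_busbw() {
  awk -F: '/Avg bus bandwidth/ { gsub(/ /, "", $2); print $2 }'
}

# Illustrative log tail (not real results):
avg_busbw <<'EOF'
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 185.23
EOF
# prints: 185.23
```

With a live job you would feed it the launcher pod's logs, e.g. `kubectl logs <launcher-pod> | avg_busbw`.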
@@ -260,7 +272,7 @@ Please see the instructions [here](./docs/running-pytorch-jobs-on-oke-using-host
 Yes, you can use OCI's File Storage Service (FSS) with `skopeo` to accomplish that. You can find the instructions [here](./docs/importing-images-from-fss-skopeo.md).
 
 ### How can I run GPU & RDMA health checks in my nodes?
-You can deploy the health check script with Node Problem Detector by following the instructions [here](./docs/running-gpu-rdma-healtchecks-with-node-problem-detector.md).
+You can deploy the health check script with Node Problem Detector by following the instructions [here](./docs/running-gpu-rdma-healthchecks-with-node-problem-detector.md).
 
 ### Can I autoscale my RDMA enabled nodes in a Cluster Network?
 You can set up autoscaling for your nodes in a Cluster Network using the instructions [here](./docs/using-cluster-autoscaler-with-cluster-networks.md).

docs/adding-ssh-keys-to-worker-nodes.md

Lines changed: 92 additions & 4 deletions
@@ -1,7 +1,18 @@
-### Adding SSH public keys to worker nodes
-When you create worker nodes with the stack, it adds one SSH public key. If you need to add other SSH keys to the worker nodes, you can use the following manifest.
+# Adding SSH Public Keys to Worker Nodes
 
-1 - Create a `ConfigMap` for the keys you want to add
+When you create worker nodes with the OCI Resource Manager stack, a single SSH public key is added by default. You may need to add additional SSH keys for team access, automation, or administrative purposes. This guide explains how to add multiple SSH keys to your worker nodes using Kubernetes resources.
+
+## Prerequisites
+
+- Access to your OKE cluster
+- kubectl configured and authenticated
+- SSH public keys that you want to add to the worker nodes
+
+## Procedure
+
+### Step 1: Create a ConfigMap for SSH Keys
+
+Create a `ConfigMap` containing the SSH public keys you want to add to the worker nodes:
 
 ```yaml
 apiVersion: v1
@@ -14,7 +25,15 @@ data:
 key2.pub: 'ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQD....'
 ```
 
-2 - Use the below `DaemonSet` to add the keys in the ConfigMap to nodes.
+Apply the ConfigMap:
+
+```sh
+kubectl apply -f configmap.yaml
+```
+
+### Step 2: Deploy the DaemonSet
+
+Deploy a `DaemonSet` to automatically distribute and manage the SSH keys across all worker nodes:
 
 ```yaml
 apiVersion: apps/v1
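Instead of editing the ConfigMap YAML by hand, the same object can be generated from local `.pub` files. A pure-shell sketch under that assumption; the directory and key material below are illustrative, and `kubectl create configmap authorized-ssh-keys -n kube-system --from-file=<dir> --dry-run=client -o yaml` would produce an equivalent manifest:

```sh
# Emit a ConfigMap manifest whose data keys are the .pub file names,
# matching the authorized-ssh-keys object used in this guide.
build_configmap() {
  printf 'apiVersion: v1\nkind: ConfigMap\nmetadata:\n  name: authorized-ssh-keys\n  namespace: kube-system\ndata:\n'
  for f in "$@"; do
    printf "  %s: '%s'\n" "$(basename "$f")" "$(cat "$f")"
  done
}

# Illustrative key material in a temporary directory:
keydir=$(mktemp -d)
echo 'ssh-ed25519 AAAAC3Nza... user@example' > "$keydir/key1.pub"
build_configmap "$keydir"/*.pub
```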
@@ -80,3 +99,72 @@ spec:
 - { name: root, mountPath: /host }
 - { name: authorized-ssh-keys, mountPath: /authorized }
 ```
+
+Apply the DaemonSet:
+
+```sh
+kubectl apply -f daemonset.yaml
+```
+
+The DaemonSet will automatically:
+- Deploy a pod on each worker node
+- Read SSH keys from the ConfigMap
+- Update the `authorized_keys` file on each node at pod startup
+- Work with both Ubuntu (user: `ubuntu`) and Oracle Linux (user: `opc`) nodes
+
+> [!NOTE]
+> The keys are applied when the DaemonSet pods start. To update keys after the initial deployment, you will need to restart the pods (see "Adding or Updating Keys" section below).
+
+## Verification
+
+To verify that the SSH keys have been successfully added:
+
+1. Check that the DaemonSet pods are running on all nodes:
+
+```sh
+kubectl get pods -n kube-system -l app=authorized-ssh-keys -o wide
+```
+
+2. Check the logs of a DaemonSet pod to confirm key updates:
+
+```sh
+kubectl logs -n kube-system -l app=authorized-ssh-keys --tail=20
+```
+
+## Adding or Updating Keys
+
+To add or update SSH keys after the initial deployment:
+
+1. Edit the ConfigMap to add or modify keys:
+
+```sh
+kubectl edit configmap authorized-ssh-keys -n kube-system
+```
+
+2. Restart the DaemonSet pods to apply the changes:
+
+```sh
+kubectl rollout restart daemonset/authorized-ssh-keys -n kube-system
+```
+
+The pods will be restarted with a rolling update strategy, ensuring continuous availability while applying the new keys across all nodes.
+
+## Removing Keys
+
+To remove an SSH key:
+
+1. Delete the key entry from the ConfigMap:
+
+```sh
+kubectl edit configmap authorized-ssh-keys -n kube-system
+```
+
+2. Remove the specific key line from the `data` section and save.
+
+3. Restart the DaemonSet pods to apply the changes:
+
+```sh
+kubectl rollout restart daemonset/authorized-ssh-keys -n kube-system
+```
+
+The key will be removed from all worker nodes as the pods restart.
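After the pods restart, you can confirm from a worker node (or over SSH) that a key actually landed in, or was removed from, the user's `authorized_keys` file (under `~ubuntu/.ssh/` on Ubuntu, `~opc/.ssh/` on Oracle Linux). A minimal sketch over the file's contents; the sample keys are illustrative:

```sh
# Report whether an exact public-key line is present in authorized_keys content (stdin).
key_state() {
  grep -qxF "$1" && echo present || echo absent
}

# Illustrative authorized_keys content:
key_state 'ssh-ed25519 AAAAC3Nza... user@example' <<'EOF'
ssh-rsa AAAAB3Nza... admin@example
ssh-ed25519 AAAAC3Nza... user@example
EOF
# prints: present
```

On a node you would run it against the real file, e.g. `key_state '<key line>' < /home/ubuntu/.ssh/authorized_keys`.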

docs/building-ol7-gpu-operator-driver-image.md

Lines changed: 1 addition & 1 deletion
@@ -82,4 +82,4 @@ Example:
 
 ```
 docker push oguzpastirmaci/driver:510.85.02-ol7.9
-```
\ No newline at end of file
+```
File renamed without changes (688 KB).
