Commit 3961e71

FSS fix and OL8 support with managed node pools
1 parent 114a53e commit 3961e71

5 files changed

+105
-52
lines changed

README.md

Lines changed: 27 additions & 27 deletions
Original file line number | Diff line number | Diff line change
@@ -1,23 +1,20 @@
11
# Running RDMA (remote direct memory access) GPU workloads on OKE
2-
Oracle Cloud Infrastructure Kubernetes Engine (OKE) is a fully-managed, scalable, and highly available service that you can use to deploy your containerized applications to the cloud.
3-
4-
Please visit the [OKE documentation page](https://docs.oracle.com/en-us/iaas/Content/ContEng/Concepts/contengoverview.htm) for more information.
2+
[Oracle Cloud Infrastructure Kubernetes Engine (OKE)](https://docs.oracle.com/en-us/iaas/Content/ContEng/Concepts/contengoverview.htm) is a fully-managed, scalable, and highly available service that you can use to deploy your containerized applications to the cloud.
53

64
### Supported Operating Systems
75
- Ubuntu 22.04
6+
- Oracle Linux 8 (except for the GPU & RDMA worker pool)
87

98
### Required policies
10-
The OCI Resource Manager stack template uses the [Self Managed Nodes](https://docs.oracle.com/en-us/iaas/Content/ContEng/Tasks/contengworkingwithselfmanagednodes.htm) functionality of OKE.
11-
12-
Below policies are required. The OCI Resource Manager stack will create them for you if you have the necessary permissions. If you don't have the permissions, please find more information about the policies below.
9+
The following policies are required. The OCI Resource Manager stack will create them for you if you have the necessary permissions. If you don't have the permissions, please find more information about the policies below.
1310

1411
- [Policy Configuration for Cluster Creation and Deployment](https://docs.oracle.com/en-us/iaas/Content/ContEng/Concepts/contengpolicyconfig.htm)
1512
- [Creating a Dynamic Group and a Policy for Self-Managed Nodes](https://docs.oracle.com/en-us/iaas/Content/ContEng/Tasks/contengdynamicgrouppolicyforselfmanagednodes.htm)
1613

1714
## Instructions for deploying an OKE cluster with GPUs and RDMA connectivity
18-
You will need a CPU pool and a GPU pool. The OCI Resource Manager stack deploys an operational worker pool by default and you choose to deploy addidional CPU/GPU worker pools.
15+
You will need a CPU pool and a GPU pool. The OCI Resource Manager stack deploys an operational worker pool by default, and you can choose to deploy additional CPU/GPU worker pools.
1916

20-
You can use the below images for both CPU and GPU pools.
17+
You can use the following images for both CPU and GPU pools.
2118

2219
> [!NOTE]
2320
> The GPU image has the GPU drivers pre-installed.
@@ -31,26 +28,24 @@ You can use the instructions [here](https://docs.oracle.com/en-us/iaas/Content/C
3128

3229
**Images for NVIDIA shapes**
3330

34-
- [GPU driver 570 & CUDA 12.8](https://objectstorage.ca-montreal-1.oraclecloud.com/p/ts6fjAuj7hY4io5x_jfX3fyC70HRCG8-9gOFqAjuF0KE0s-6tgDZkbRRZIbMZmoN/n/hpc_limited_availability/b/images/o/Canonical-Ubuntu-22.04-2024.10.04-0-OCA-OFED-24.10-1.1.4.0-GPU-570-CUDA-12.8-2025.03.26-0)
31+
- [GPU driver 570 & CUDA 12.8](https://objectstorage.ca-montreal-1.oraclecloud.com/p/ts6fjAuj7hY4io5x_jfX3fyC70HRCG8-9gOFqAjuF0KE0s-6tgDZkbRRZIbMZmoN/n/hpc_limited_availability/b/images/o/Canonical-Ubuntu-22.04-2025.05.20-0-OFED-24.10-1.1.4.0-GPU-570-OPEN-CUDA-12.8-2025.07.22-0)
3532

36-
- [GPU driver 560 & CUDA 12.6](https://objectstorage.ca-montreal-1.oraclecloud.com/p/ts6fjAuj7hY4io5x_jfX3fyC70HRCG8-9gOFqAjuF0KE0s-6tgDZkbRRZIbMZmoN/n/hpc_limited_availability/b/images/o/Canonical-Ubuntu-22.04-2024.10.04-0-OCA-OFED-24.10-1.1.4.0-GPU-560-CUDA-12.6-2025.03.26-0)
37-
38-
- [GPU driver 550 & CUDA 12.4](https://objectstorage.ca-montreal-1.oraclecloud.com/p/ts6fjAuj7hY4io5x_jfX3fyC70HRCG8-9gOFqAjuF0KE0s-6tgDZkbRRZIbMZmoN/n/hpc_limited_availability/b/images/o/Canonical-Ubuntu-22.04-2024.10.04-0-OCA-OFED-24.10-1.1.4.0-GPU-550-CUDA-12.4-2025.03.26-0)
33+
- [GPU driver 550 & CUDA 12.4](https://objectstorage.ca-montreal-1.oraclecloud.com/p/ts6fjAuj7hY4io5x_jfX3fyC70HRCG8-9gOFqAjuF0KE0s-6tgDZkbRRZIbMZmoN/n/hpc_limited_availability/b/images/o/Canonical-Ubuntu-22.04-2025.05.20-0-OFED-24.10-1.1.4.0-GPU-550-CUDA-12.4-2025.07.22-0)
3934

4035

4136
**Image for AMD shapes**
4237

43-
- [ROCm 6.3](https://objectstorage.ca-montreal-1.oraclecloud.com/p/ts6fjAuj7hY4io5x_jfX3fyC70HRCG8-9gOFqAjuF0KE0s-6tgDZkbRRZIbMZmoN/n/hpc_limited_availability/b/images/o/Canonical-Ubuntu-22.04-2024.10.04-0-OCA-OFED-24.10-1.1.4.0-AMD-ROCM-632-2025.03.26-0)
38+
- [ROCm 6.3.2](https://objectstorage.ca-montreal-1.oraclecloud.com/p/ts6fjAuj7hY4io5x_jfX3fyC70HRCG8-9gOFqAjuF0KE0s-6tgDZkbRRZIbMZmoN/n/hpc_limited_availability/b/images/o/Canonical-Ubuntu-22.04-2025.05.20-0-OFED-24.10-1.1.4.0-AMD-ROCM-632-2025.07.23-0)
4439

4540

4641
### Deploy the cluster using the Oracle Cloud Resource Manager template
4742
You can easily deploy the cluster using the **Deploy to Oracle Cloud** button below.
4843

49-
[![Deploy to Oracle Cloud](https://oci-resourcemanager-plugin.plugins.oci.oraclecloud.com/latest/deploy-to-oracle-cloud.svg)](https://cloud.oracle.com/resourcemanager/stacks/create?zipUrl=https://github.com/oracle-quickstart/oci-hpc-oke/releases/download/v25.5.1/oke-rdma-quickstart-v25.5.1.zip)
44+
[![Deploy to Oracle Cloud](https://oci-resourcemanager-plugin.plugins.oci.oraclecloud.com/latest/deploy-to-oracle-cloud.svg)](https://github.com/oracle-quickstart/oci-hpc-oke/releases/latest/download/oke-gpu-rdma-quickstart.zip)
5045

5146
For the image ID, use the ID of the image that you imported in the previous step.
5247

53-
The template will deploy a `bastion` instance and an `operator` instance. The `operator` instance will have access to the OKE cluster. You can connect to the `operator` instance via SSH with `ssh -J ubuntu@<bastion IP> ubuntu@<operator IP>`.
48+
The template will deploy a `bastion` instance and an `operator` instance by default. The `operator` instance will have access to the OKE cluster. You can connect to the `operator` instance via SSH with `ssh -J ubuntu@<bastion IP> ubuntu@<operator IP>`.
5449

5550
You can also find this information under the **Application information** tab in the OCI Resource Manager stack.
5651

@@ -60,15 +55,15 @@ You can also find this information under the **Application information** tab in
6055
kubectl get nodes
6156

6257
NAME STATUS ROLES AGE VERSION
63-
10.0.103.73 Ready <none> 2d23h v1.25.6
64-
10.0.127.206 Ready node 2d3h v1.25.6
65-
10.0.127.32 Ready node 2d3h v1.25.6
66-
10.0.83.93 Ready <none> 2d23h v1.25.6
67-
10.0.96.82 Ready node 2d23h v1.25.6
58+
10.0.103.73 Ready <none> 2d23h v1.31.1
59+
10.0.127.206 Ready node 2d3h v1.31.1
60+
10.0.127.32 Ready node 2d3h v1.31.1
61+
10.0.83.93 Ready <none> 2d23h v1.31.1
62+
10.0.96.82 Ready node 2d23h v1.31.1
6863
```
6964

7065
### Add a Service Account Authentication Token (optional but recommended)
71-
More info [here.](https://docs.oracle.com/en-us/iaas/Content/ContEng/Tasks/contengaddingserviceaccttoken.htm)
66+
More info [here](https://docs.oracle.com/en-us/iaas/Content/ContEng/Tasks/contengaddingserviceaccttoken.htm).
7267

7368
```
7469
kubectl -n kube-system create serviceaccount kubeconfig-sa
@@ -107,7 +102,7 @@ securityContext:
107102
- { mountPath: /dev/infiniband, name: devinf }
108103
- { mountPath: /dev/shm, name: shm }
109104
```
110-
Here's a simple example. You can also look at the NCCL test manifests in the repo [here.](./manifests/)
105+
Here's a simple example. You can also look at the NCCL test manifests in the repo [here](./manifests/).
111106
112107
```yaml
113108
apiVersion: v1
@@ -160,6 +155,11 @@ kubectl create rolebinding default-view --namespace default --serviceaccount def
160155
> [!IMPORTANT]
161156
> The NCCL parameters are different between the H100 and A100 shapes. Please make sure that you are using the correct manifest for your bare metal GPU shapes.
162157
158+
##### BM.GPU.H200
159+
```
160+
kubectl apply -f https://raw.githubusercontent.com/oracle-quickstart/oci-hpc-oke/main/manifests/nccl-tests/BM.GPU.H200.8-nccl-test.yaml
161+
```
162+
163163
##### BM.GPU.H100
164164
```
165165
kubectl apply -f https://raw.githubusercontent.com/oracle-quickstart/oci-hpc-oke/main/manifests/nccl-tests/BM.GPU.H100.8-nccl-test.yaml
@@ -185,7 +185,7 @@ kubectl apply -f https://raw.githubusercontent.com/oracle-quickstart/oci-hpc-oke
185185
kubectl apply -f https://raw.githubusercontent.com/oracle-quickstart/oci-hpc-oke/main/manifests/rccl-tests/BM.GPU.MI300X.8.yaml
186186
```
187187

188-
The initial pull of the container will take long. Once the master pod `nccl-allreduce-job0-mpimaster-0` starts running, you can check it logs for the NCCL test result.
188+
The initial pull of the container image will take a long time. Once the master pod `nccl-allreduce-job0-mpimaster-0` starts running, you can check its logs for the NCCL test result.
189189

190190
```sh
191191
Defaulted container "mpimaster" out of: mpimaster, wait-for-workers (init)
@@ -280,16 +280,16 @@ You can follow the instructions [here](./docs/adding-ssh-keys-to-worker-nodes.md
280280
Please see the instructions [here](./docs/running-pytorch-jobs-on-oke-using-hostnetwork-with-rdma.md) for the best practices on running PyTorch jobs.
281281

282282
### I have large container images. Can I import them from a shared location instead of downloading them?
283-
Yes, you can use OCI's File Storage Service (FSS) with `skopeo` to accomplish that. You can find the instructions [here.](./docs/importing-images-from-fss-skopeo.md)
283+
Yes, you can use OCI's File Storage Service (FSS) with `skopeo` to accomplish that. You can find the instructions [here](./docs/importing-images-from-fss-skopeo.md).
284284

285285
### How can I run GPU & RDMA health checks in my nodes?
286-
You can deploy the health check script with Node Problem Detector by following the instructions [here.](./docs/running-gpu-rdma-healtchecks-with-node-problem-detector.md)
286+
You can deploy the health check script with Node Problem Detector by following the instructions [here](./docs/running-gpu-rdma-healtchecks-with-node-problem-detector.md).
287287

288288
### Can I autoscale my RDMA enabled nodes in a Cluster Network?
289-
You can setup autoscaling for your nodes in a Cluster Network using the instructions [here.](./docs/using-cluster-autoscaler-with-cluster-networks.md)
289+
You can set up autoscaling for your nodes in a Cluster Network using the instructions [here](./docs/using-cluster-autoscaler-with-cluster-networks.md).
290290

291291
### How do I use network locality information when running workloads on OKE?
292-
You can follow the instructions [here.](./docs/using-rdma-network-locality-when-running-workloads-on-oke.md)
292+
You can follow the instructions [here](./docs/using-rdma-network-locality-when-running-workloads-on-oke.md).
293293

294294
## Contributing
295295

files/oke-nvme-raid.sh

Lines changed: 27 additions & 5 deletions
Original file line number | Diff line number | Diff line change
@@ -1,4 +1,3 @@
1-
21
#!/usr/bin/env bash
32
# shellcheck disable=SC2086,SC2174
43
set -o errexit -o nounset -o pipefail -x
@@ -16,6 +15,14 @@ if [ ${#devices[@]} -eq 0 ]; then
1615
exit 0
1716
fi
1817

18+
# Exit if cannot detect OS (Ubuntu and Oracle Linux are supported)
19+
if [[ -f /etc/os-release ]]; then
20+
. /etc/os-release
21+
else
22+
echo "Cannot detect OS: /etc/os-release missing"
23+
exit 0
24+
fi
25+
1926
# Used for boot volume replacement - check if an array exists
2027
legacy_dev_paths=(/dev/md/0 /dev/md/0_0 /dev/md127)
2128
mdadm --assemble --scan --quiet || true
@@ -93,9 +100,24 @@ for mount in "${mount_extra[@]}"; do
93100
WantedBy=multi-user.target
94101
EOF
95102
systemd-analyze verify "${mount_unit_name}"
96-
systemctl enable "${mount_unit_name}" --now
103+
systemctl enable "${mount_unit_name}" --now
97104
done
98105

99-
mdadm --detail --scan --verbose >> /etc/mdadm/mdadm.conf
100-
101-
update-initramfs -u
106+
case "$ID" in
107+
ubuntu)
108+
MDADM_CONF="/etc/mdadm/mdadm.conf"
109+
[[ -f $MDADM_CONF ]] || touch "$MDADM_CONF"
110+
mdadm --detail --scan --verbose >> "$MDADM_CONF"
111+
update-initramfs -u
112+
;;
113+
ol)
114+
MDADM_CONF="/etc/mdadm.conf"
115+
[[ -f $MDADM_CONF ]] || touch "$MDADM_CONF"
116+
mdadm --detail --scan --verbose >> "$MDADM_CONF"
117+
dracut --force
118+
;;
119+
*)
120+
echo "Unsupported OS: $ID"
121+
exit 1
122+
;;
123+
esac
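
Note on the `case "$ID"` block above: the two branches differ only in the mdadm config path and in the command that rebuilds the initramfs. As a minimal sketch of that mapping, assuming a hypothetical helper named `mdadm_settings_for` (it is not part of this commit), the selection could be isolated so it can be checked without touching any devices:

```sh
#!/usr/bin/env bash
# Hypothetical helper, not part of the commit: map an /etc/os-release ID to
# the mdadm config path and the initramfs rebuild command that the script's
# case statement uses for that OS.
mdadm_settings_for() {
  local os_id="$1"
  case "$os_id" in
    ubuntu) echo "/etc/mdadm/mdadm.conf update-initramfs -u" ;;
    ol)     echo "/etc/mdadm.conf dracut --force" ;;
    *)      echo "Unsupported OS: $os_id" >&2; return 1 ;;
  esac
}

mdadm_settings_for ubuntu   # prints: /etc/mdadm/mdadm.conf update-initramfs -u
```

Any other ID returns a non-zero status, which mirrors the `exit 1` in the script's fallback branch.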

files/oke-ubuntu-cloud-init.sh

Lines changed: 50 additions & 19 deletions
Original file line numberDiff line numberDiff line change
@@ -1,31 +1,62 @@
1-
#!/bin/bash
2-
set -x
1+
#!/usr/bin/env bash
2+
set -euo pipefail
33

4-
source /etc/os-release
4+
if [[ -f /etc/os-release ]]; then
5+
. /etc/os-release
6+
else
7+
echo "Cannot detect OS: /etc/os-release missing"
8+
exit 1
9+
fi
510

6-
kubernetes_version=$1
7-
oke_package_version="${kubernetes_version:1}"
8-
oke_package_repo_version="${oke_package_version:0:4}"
9-
oke_package_name="oci-oke-node-all-$oke_package_version"
10-
oke_package_repo="https://odx-oke.objectstorage.us-sanjose-1.oci.customer-oci.com/n/odx-oke/b/okn-repositories/o/prod/ubuntu-$VERSION_CODENAME/kubernetes-$oke_package_repo_version"
11+
case "$ID" in
12+
ubuntu)
13+
echo "Detected Ubuntu"
14+
if command -v oke >/dev/null 2>&1; then
15+
echo "[Ubuntu] oke binary already present → running bootstrap only"
16+
oke bootstrap
17+
else
18+
echo "[Ubuntu] oke binary not found → installing package"
19+
kubernetes_version="$1"
20+
oke_package_version="${kubernetes_version:1}"
21+
oke_package_repo_version="${oke_package_version:0:4}"
22+
oke_package_name="oci-oke-node-all-$oke_package_version"
23+
oke_package_repo="https://odx-oke.objectstorage.us-sanjose-1.oci.customer-oci.com/n/odx-oke/b/okn-repositories/o/prod/ubuntu-$VERSION_CODENAME/kubernetes-$oke_package_repo_version"
1124

12-
# Add OKE Ubuntu package repo
13-
tee /etc/apt/sources.list.d/oke-node-client.sources <<EOF
25+
tee /etc/apt/sources.list.d/oke-node-client.sources > /dev/null <<EOF
1426
Enabled: yes
1527
Types: deb
16-
URIs: https://odx-oke.objectstorage.us-sanjose-1.oci.customer-oci.com/n/odx-oke/b/okn-repositories/o/prod/ubuntu-$VERSION_CODENAME/kubernetes-$oke_package_repo_version
28+
URIs: $oke_package_repo
1729
Suites: stable
1830
Components: main
1931
Trusted: yes
2032
EOF
2133

22-
# Wait for apt lock and install the package
23-
while fuser /var/{lib/{dpkg/{lock,lock-frontend},apt/lists},cache/apt/archives}/lock >/dev/null 2>&1; do
24-
echo "Waiting for dpkg/apt lock"
25-
sleep 1
26-
done
34+
apt-get -y update
35+
apt-get -y install "$oke_package_name"
2736

28-
apt-get -y update && apt-get -y install $oke_package_name
37+
echo "[Ubuntu] Running bootstrap"
38+
oke bootstrap
39+
fi
40+
;;
41+
ol)
42+
echo "Detected Oracle Linux"
43+
if command -v oke >/dev/null 2>&1; then
44+
echo "[Oracle Linux] oke binary already present → running bootstrap only"
45+
oke bootstrap
46+
else
47+
echo "[Oracle Linux] oke binary not found, fetching init script"
48+
curl --fail -H "Authorization: Bearer Oracle" \
49+
-L0 http://169.254.169.254/opc/v2/instance/metadata/oke_init_script \
50+
| base64 --decode >/var/run/oke-init.sh
2951

30-
# OKE bootstrap
31-
oke bootstrap
52+
echo "[Oracle Linux] Running init script"
53+
bash /var/run/oke-init.sh
54+
fi
55+
;;
56+
*)
57+
echo "Unsupported OS: $ID"
58+
exit 1
59+
;;
60+
esac
61+
62+
echo "OKE setup completed successfully."
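
Note on the Ubuntu branch above: the package name and repository path are derived from the Kubernetes version argument by plain substring expansion. The sketch below reproduces only that derivation; the wrapper function `oke_package_info` is hypothetical and the sample version `v1.31.1` is taken from the README node listing, not from the script itself.

```sh
#!/usr/bin/env bash
# Hypothetical wrapper, not part of the commit: repeat the script's string
# handling so the derived names can be inspected in isolation.
oke_package_info() {
  local kubernetes_version="$1"                                # e.g. v1.31.1
  local oke_package_version="${kubernetes_version:1}"          # drop the leading "v"
  local oke_package_repo_version="${oke_package_version:0:4}"  # first four characters
  echo "oci-oke-node-all-$oke_package_version kubernetes-$oke_package_repo_version"
}

oke_package_info v1.31.1   # prints: oci-oke-node-all-1.31.1 kubernetes-1.31
```

The four-character slice assumes a single-digit major and a two-digit minor version; an input such as `v1.9.0` would yield `1.9.` for the repository segment.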

terraform/versions.tf

Lines changed: 1 addition & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -8,7 +8,7 @@ terraform {
88
oci = {
99
configuration_aliases = [oci.home]
1010
source = "oracle/oci"
11-
version = ">= 4.115.0"
11+
version = "= 7.15.0"
1212
}
1313
local = {
1414
source = "hashicorp/local"
