Commit 5b790c6 (2 parents: 3c67f9b + 4522ea8)

Merge pull request #25 from OguzPastirmaci/main: Update instructions for using hostnetwork

19 files changed: +573 / -1209 lines

README.md: 81 additions & 155 deletions
@@ -1,53 +1,38 @@
-# Running RDMA (remote direct memory access) GPU workloads on OKE using GPU Operator and Network Operator
-
+# Running RDMA (remote direct memory access) GPU workloads on OKE
 Oracle Cloud Infrastructure Container Engine for Kubernetes (OKE) is a fully-managed, scalable, and highly available service that you can use to deploy your containerized applications to the cloud.

 Please visit the [OKE documentation page](https://docs.oracle.com/en-us/iaas/Content/ContEng/Concepts/contengoverview.htm) for more information.

-This guide has the instructions for deploying an OKE cluster using H100 & A100 bare metal nodes with RDMA connectivity using the [GPU Operator](https://github.com/NVIDIA/gpu-operator) and [Network Operator](https://github.com/Mellanox/network-operator).
-
-> [!IMPORTANT]
-> Currently, creating SR-IOV Virtual Functions is supported in limited regions. For H100, all regions with H100s are supported. For A100s, Phoenix (PHX) and Osaka (KIX) regions are supported. For other regions, please contact your sales representative.
-
-### What is NVIDIA GPU Operator?
-Kubernetes provides access to special hardware resources such as NVIDIA GPUs, NICs, Infiniband adapters and other devices through the device plugin framework. However, configuring and managing nodes with these hardware resources requires configuration of multiple software components such as drivers, container runtimes or other libraries which are difficult and prone to errors. The NVIDIA GPU Operator uses the operator framework within Kubernetes to automate the management of all NVIDIA software components needed to provision GPU. These components include the NVIDIA drivers (to enable CUDA), Kubernetes device plugin for GPUs, the NVIDIA Container Runtime, automatic node labelling, DCGM based monitoring and others.
-
-### What is NVIDIA Network Operator?
-NVIDIA Network Operator leverages Kubernetes CRDs and Operator SDK to manage Networking related Components in order to enable Fast networking, RDMA and GPUDirect for workloads in a Kubernetes cluster.
-
-The Goal of Network Operator is to manage all networking related components to enable execution of RDMA and GPUDirect RDMA workloads in a kubernetes cluster.
-
 ### Supported Operating Systems
-For the A100 and H100 shapes (BM.GPU.H100.8, BM.GPU.A100-v2.8, BM.GPU4.8), Oracle Linux 8 with the Red Hat Compatible Kernel (RHCK) is supported.
+For the A100 and H100 shapes (BM.GPU.H100.8, BM.GPU.A100-v2.8, BM.GPU4.8, BM.GPU.B4.8), Ubuntu 22.04 is supported.

 ### Required policies
-The Terraform deployment template uses the [Self Managed Nodes](https://docs.oracle.com/en-us/iaas/Content/ContEng/Tasks/contengworkingwithselfmanagednodes.htm) functionality of OKE.
+The OCI Resource Manager stack template uses the [Self Managed Nodes](https://docs.oracle.com/en-us/iaas/Content/ContEng/Tasks/contengworkingwithselfmanagednodes.htm) functionality of OKE.

-You must create the necessary OKE policies:
+The following policies are required. The OCI Resource Manager stack will create them for you if you have the necessary permissions. If you don't have the permissions, you can find more information about the policies below.

 - [Policy Configuration for Cluster Creation and Deployment](https://docs.oracle.com/en-us/iaas/Content/ContEng/Concepts/contengpolicyconfig.htm)
 - [Creating a Dynamic Group and a Policy for Self-Managed Nodes](https://docs.oracle.com/en-us/iaas/Content/ContEng/Tasks/contengdynamicgrouppolicyforselfmanagednodes.htm)
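If you need to create the self-managed node policies yourself, the linked documentation has the exact statements. A minimal sketch, where the dynamic group name, compartment name, and compartment OCID are placeholders:

```
# Dynamic group matching rule, e.g. for a dynamic group named oke-self-managed-nodes
ALL {instance.compartment.id = 'ocid1.compartment.oc1..<unique_ID>'}

# Policy statement that lets instances in that dynamic group join the OKE cluster
Allow dynamic-group oke-self-managed-nodes to {CLUSTER_JOIN} in compartment <compartment-name>
```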

 ## Instructions for deploying an OKE cluster with GPUs and RDMA connectivity
+You will need a CPU pool and a GPU pool. The OCI Resource Manager stack deploys an operational worker pool by default, and you can choose to deploy additional CPU/GPU worker pools.

-You will need a CPU and a GPU pool. The Terraform template deploys an operational/system worker pool (CPU) and a GPU worker pool.
-
-The GPU pool requires you to use an image provided by the Oracle HPC team, you can find the import link below. This image included the OFED drivers and necessary packages configured for RDMA.
-
-For the non-GPU worker pools, you can use the default OKE images (no need to specify them in the Terraform template).
+You can use the image below for both the CPU and GPU pools.

 > [!NOTE]
-> The GPU image has the GPU drivers pre-installed (GPU driver version 535.154.05 with CUDA 12.2). Deploying the GPU driver as a container with the GPU Operator is currently not supported.
+> The GPU image has the GPU drivers pre-installed (GPU driver version 535.154.05 with CUDA 12.2).

 #### Image to import and use for the H100 and A100 nodes
-[OracleLinux-8-OCA-RHCK-OFED-5.8-3.0.7.0-GPU-535-OKE-2024.02.12-0](https://objectstorage.us-ashburn-1.oraclecloud.com/p/f6mKO0d_OG7gL4EyE5rvOWObL6LBgQ1XXtpM2H67SYmFHQ-tBwxyg7Wmii94VYc8/n/hpc_limited_availability/b/images/o/OracleLinux-8-OCA-RHCK-OFED-5.8-3.0.7.0-GPU-535-OKE-2024.02.12-0)
+You can use the instructions [here](https://docs.oracle.com/en-us/iaas/Content/Compute/Tasks/imageimportexport.htm#Importing) to import the image below into your tenancy.
+
+[Image to import](https://objectstorage.ca-toronto-1.oraclecloud.com/p/oXC6BcCkB0lXhycxV-0UuDqGGnVtFWfLOkwuJWA5WbsBDb4FkHwnsOHa_ElRcfL2/n/hpc_limited_availability/b/images/o/Ubuntu-22-OCA-OFED-23.10-2.1.3.1-GPU-535-CUDA-12.2-2024.03.15-0)
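If you prefer the CLI over the console import flow linked above, an import along these lines should work, assuming the OCI CLI is configured and with the compartment OCID as a placeholder:

```sh
# Import the pre-authenticated image object into your tenancy as a custom image
oci compute image import from-object-uri \
  --compartment-id ocid1.compartment.oc1..<unique_ID> \
  --display-name Ubuntu-22-OCA-OFED-23.10-2.1.3.1-GPU-535-CUDA-12.2-2024.03.15-0 \
  --uri "https://objectstorage.ca-toronto-1.oraclecloud.com/p/oXC6BcCkB0lXhycxV-0UuDqGGnVtFWfLOkwuJWA5WbsBDb4FkHwnsOHa_ElRcfL2/n/hpc_limited_availability/b/images/o/Ubuntu-22-OCA-OFED-23.10-2.1.3.1-GPU-535-CUDA-12.2-2024.03.15-0"
```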

-### Deploy the cluster using the Terraform template
-You can find the template in the [terraform directory](./terraform/).
+### Deploy the cluster using the Oracle Cloud Resource Manager template
+You can easily deploy the cluster using the **Deploy to Oracle Cloud** button below.

-Make sure to update the variables in the `worker pools` blocks.
+[![Deploy to Oracle Cloud](https://oci-resourcemanager-plugin.plugins.oci.oraclecloud.com/latest/deploy-to-oracle-cloud.svg)](https://cloud.oracle.com/resourcemanager/stacks/create?zipUrl=https://github.com/oracle-quickstart/oci-hpc-oke/releases/download/v24.6.0/oke-rdma-quickstart-v24.6.0.zip)

-You can find more information on setting up Terraform for OCI [here](https://docs.oracle.com/en-us/iaas/developer-tutorials/tutorials/tf-provider/01-summary.htm).
+For the image ID, use the ID of the image that you imported in the previous step.

 The template will deploy a `bastion` instance and an `operator` instance. The `operator` instance will have access to the OKE cluster. You can connect to the `operator` instance via SSH with `ssh -J opc@<bastion IP> opc@<operator IP>`.
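As a quick sanity check once the stack completes, a session like the following (the bastion and operator IPs come from the stack outputs) confirms that the operator host can reach the cluster:

```sh
# Jump through the bastion to the operator host
ssh -J opc@<bastion IP> opc@<operator IP>

# From the operator host, confirm that the cluster is reachable
kubectl get nodes
```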
@@ -61,136 +46,67 @@ NAME STATUS ROLES AGE VERSION
 10.0.127.206 Ready node 2d3h v1.25.6
 10.0.127.32 Ready node 2d3h v1.25.6
 10.0.83.93 Ready <none> 2d23h v1.25.6
-10.0.96.81 Ready node 2d23h v1.25.6
-```
-
-### Get the latest Helm 3 version
-```sh
-curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
-chmod 700 get_helm.sh
-./get_helm.sh
-```
-
-### Add Helm repos for Network Operator and GPU Operator
-```sh
-helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
-helm repo update
-```
-
-### Deploy GPU Operator
-```
-helm install --wait \
-  -n gpu-operator --create-namespace \
-  gpu-operator nvidia/gpu-operator \
-  --version v23.9.1 \
-  --set driver.enabled=false \
-  --set operator.defaultRuntime=crio \
-  --set toolkit.version=v1.14.5-ubi8 \
-  --set driver.rdma.enabled=true \
-  --set driver.rdma.useHostMofed=true
-```
-
-Wait until all network operator pods are running with `kubectl get pods -n gpu-operator`.
-
-### Deploy Network Operator
-
-> [!IMPORTANT]
-> The device name you will use when deploying the Network Operator is different between A100 and H100 shapes. Please make sure that you are running the correct command based on your shape.
-
-#### A100 shapes (BM.GPU.A100-v2.8, BM.GPU4.8)
-```
-helm install --wait \
-  -n network-operator --create-namespace \
-  network-operator nvidia/network-operator \
-  --version v23.10.0 \
-  --set deployCR=true \
-  --set nfd.enabled=false \
-  --set rdmaSharedDevicePlugin.deploy=false \
-  --set nvPeerDriver.deploy=true \
-  --set sriovDevicePlugin.deploy=true \
-  --set secondaryNetwork.ipamPlugin.deploy=false \
-  --set nvIpam.deploy=true \
-  --set-json sriovDevicePlugin.resources='[{"name": "sriov_rdma_vf", "drivers": ["mlx5_core"], "devices": ["101a"], "isRdma": [true]}]'
-```
-
-#### H100 shapes (BM.GPU.H100.8)
-```
-helm install --wait \
-  -n network-operator --create-namespace \
-  network-operator nvidia/network-operator \
-  --version v23.10.0 \
-  --set deployCR=true \
-  --set nfd.enabled=false \
-  --set rdmaSharedDevicePlugin.deploy=false \
-  --set nvPeerDriver.deploy=true \
-  --set sriovDevicePlugin.deploy=true \
-  --set secondaryNetwork.ipamPlugin.deploy=false \
-  --set nvIpam.deploy=true \
-  --set-json sriovDevicePlugin.resources='[{"name": "sriov_rdma_vf", "drivers": ["mlx5_core"], "devices": ["101e"], "isRdma": [true]}]'
-```
-
-### Deploy SR-IOV CNI
+10.0.96.82 Ready node 2d23h v1.25.6
 ```
-kubectl apply -f https://raw.githubusercontent.com/openshift/sriov-cni/master/images/k8s-v1.16/sriov-cni-daemonset.yaml
-```
-
-### Deploy RDMA CNI
-```
-kubectl apply -f https://raw.githubusercontent.com/k8snetworkplumbingwg/rdma-cni/master/deployment/rdma-cni-daemonset.yaml
-```
-
-Wait until all network operator pods are running with `kubectl get pods -n network-operator`.
-
-### Deploy the Virtual Function Configuration daemonset
-```
-kubectl apply -f https://raw.githubusercontent.com/oracle-quickstart/oci-hpc-oke/main/manifests/vf-config.yaml
-```
-### Create Network Attachment Definition

-```sh
-kubectl apply -f https://raw.githubusercontent.com/oracle-quickstart/oci-hpc-oke/main/manifests/network-attachment-definition.yaml
-```
-
-### Create the IP Pool for Nvidia IPAM
-```
-kubectl apply -f https://raw.githubusercontent.com/oracle-quickstart/oci-hpc-oke/main/manifests/ip-pool.yaml
-```
-
-### Create the topology ConfigMap
-This step creates a ConfigMap that can be used as the NCCL topology file when running your jobs that use NCCL as the backend.
+### Using the host RDMA network interfaces in manifests
+In order to use the RDMA interfaces on the host in your pods, you should have the following sections in your manifests:

-You can find the topology files in the [topology directory](https://github.com/oracle-quickstart/oci-hpc-oke/tree/main/manifests/topology) in this repo. Please make sure you use the correct topology file based on your shape when creating the ConfigMap.
-
-```
-SHAPE=<your GPU shape>
-
-curl -s -o ./topo.xml https://raw.githubusercontent.com/oracle-quickstart/oci-hpc-oke/main/manifests/topology/$SHAPE.xml
-
-kubectl create configmap nccl-topology --from-file ./topo.xml
+```yaml
+spec:
+  hostNetwork: true
+  dnsPolicy: ClusterFirstWithHostNet
+  volumes:
+  - { name: devinf, hostPath: { path: /dev/infiniband }}
+  - { name: shm, emptyDir: { medium: Memory, sizeLimit: 32Gi }}
 ```

-### Confirm that the GPUs are Virtual Functions (VFs) are correctly exposed
-Once the Network Operator pods are deployed, the GPU nodes with RDMA NICs will start reporting `nvidia.com/sriov_rdma_vf` as an available resource. You can request that resource in your pod manifests for assigning RDMA VFs to pods.
-
-By default, we create one Virtual Function per Physical Function. So for the H100 and A100 bare metal shapes, you will see 16 VFs per node exposed as a resource.
-
+```yaml
+securityContext:
+  privileged: true
+  capabilities:
+    add: [ "IPC_LOCK" ]
 ```
-kubectl get nodes -l 'node.kubernetes.io/instance-type in (BM.GPU.H100.8, BM.GPU.A100-v2.8, BM.GPU4.8, BM.GPU.B4.8)' --sort-by=.status.capacity."nvidia\.com/gpu" -o=custom-columns='NODE:metadata.name,GPUs:status.capacity.nvidia\.com/gpu,RDMA-VFs:status.capacity.nvidia\.com/sriov_rdma_vf'
-
-NODE GPUs RDMA-VFs
-10.79.148.115 8 16
-10.79.151.167 8 16
-10.79.156.205 8 16
+```yaml
+volumeMounts:
+- { mountPath: /dev/infiniband, name: devinf }
+- { mountPath: /dev/shm, name: shm }
 ```
-
-### Requesting VFs in manifests
-Network Operator exposes the RDMA Virtual Functions (VFs) as allocatable resources. To use them, you need to add the following annotation to your manifests. The next step in this guide has an example for running the NCCL test, you can use that manifest as an example.
+Here's a simple example. You can also look at the NCCL test manifests in the repo [here](../manifests/).

 ```yaml
-template:
-  metadata:
-    annotations:
-      k8s.v1.cni.cncf.io/networks: oci-rdma-sriov,oci-rdma-sriov,oci-rdma-sriov,oci-rdma-sriov,oci-rdma-sriov,oci-rdma-sriov,oci-rdma-sriov,oci-rdma-sriov,oci-rdma-sriov,oci-rdma-sriov,oci-rdma-sriov,oci-rdma-sriov,oci-rdma-sriov,oci-rdma-sriov,oci-rdma-sriov,oci-rdma-sriov
+apiVersion: v1
+kind: Pod
+metadata:
+  name: rdma-test-pod-1
+spec:
+  hostNetwork: true
+  dnsPolicy: ClusterFirstWithHostNet
+  volumes:
+  - { name: devinf, hostPath: { path: /dev/infiniband }}
+  - { name: shm, emptyDir: { medium: Memory, sizeLimit: 32Gi }}
+  restartPolicy: OnFailure
+  containers:
+  - image: oguzpastirmaci/mofed-perftest:5.4-3.6.8.1-ubuntu20.04-amd64
+    name: mofed-test-ctr
+    securityContext:
+      privileged: true
+      capabilities:
+        add: [ "IPC_LOCK" ]
+    volumeMounts:
+    - { mountPath: /dev/infiniband, name: devinf }
+    - { mountPath: /dev/shm, name: shm }
+    resources:
+      requests:
+        cpu: 8
+        ephemeral-storage: 32Gi
+        memory: 2Gi
+    command:
+    - sh
+    - -c
+    - |
+      ls -l /dev/infiniband /sys/class/net
+      sleep 1000000
 ```

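Once a pod like `rdma-test-pod-1` above is running, you can check that the host RDMA devices are visible inside it. A quick sketch, assuming the perftest image ships the standard ibverbs utilities:

```sh
# List the host InfiniBand device nodes mounted into the pod
kubectl exec -it rdma-test-pod-1 -- ls -l /dev/infiniband

# Query the RDMA devices visible to the pod
kubectl exec -it rdma-test-pod-1 -- ibv_devinfo
```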
 ### Optional - Deploy Volcano and run the NCCL test
@@ -207,16 +123,26 @@ kubectl create rolebinding default-view --namespace default --serviceaccount def

 #### Run the NCCL test
 > [!IMPORTANT]
-> The NCCL parameters are different between the H100 and A100 shapes. Please make sure that you are using the correct manifest.
+> The NCCL parameters are different between the H100 and A100 shapes. Please make sure that you are using the correct manifest for your bare metal GPU shape.
+
+##### BM.GPU.H100.8
+```
+kubectl apply -f https://raw.githubusercontent.com/oracle-quickstart/oci-hpc-oke/main/manifests/BM.GPU.H100.8-nccl-test.yaml
+```
+
+##### BM.GPU.A100-v2.8
+```
+kubectl apply -f https://raw.githubusercontent.com/oracle-quickstart/oci-hpc-oke/main/manifests/BM.GPU.A100-v2.8-nccl-test.yaml
+```

-##### H100
+##### BM.GPU4.8
 ```
-kubectl apply -f https://raw.githubusercontent.com/oracle-quickstart/oci-hpc-oke/main/manifests/h100-nccl-test.yaml
+kubectl apply -f https://raw.githubusercontent.com/oracle-quickstart/oci-hpc-oke/main/manifests/BM.GPU4.8-nccl-test.yaml
 ```

-##### A100
+##### BM.GPU.B4.8
 ```
-kubectl apply -f https://raw.githubusercontent.com/oracle-quickstart/oci-hpc-oke/main/manifests/a100-nccl-test.yaml
+kubectl apply -f https://raw.githubusercontent.com/oracle-quickstart/oci-hpc-oke/main/manifests/BM.GPU.B4.8-nccl-test.yaml
 ```

 The initial pull of the container will take a long time. Once the master pod `nccl-allreduce-job0-mpimaster-0` starts running, you can check its logs for the NCCL test result.
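For example, to follow the result from the master pod:

```sh
kubectl logs -f nccl-allreduce-job0-mpimaster-0
```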
