# Running RDMA (remote direct memory access) GPU workloads on OKE
Oracle Cloud Infrastructure Container Engine for Kubernetes (OKE) is a fully-managed, scalable, and highly available service that you can use to deploy your containerized applications to the cloud.
Please visit the [OKE documentation page](https://docs.oracle.com/en-us/iaas/Content/ContEng/Concepts/contengoverview.htm) for more information.
This guide has the instructions for deploying an OKE cluster with H100 and A100 bare metal nodes that have RDMA connectivity, using the [GPU Operator](https://github.com/NVIDIA/gpu-operator).
### What is NVIDIA GPU Operator?
Kubernetes provides access to special hardware resources such as NVIDIA GPUs, NICs, InfiniBand adapters, and other devices through the device plugin framework. However, configuring and managing nodes with these hardware resources requires configuring multiple software components such as drivers, container runtimes, and other libraries, which is difficult and error-prone. The NVIDIA GPU Operator uses the operator framework within Kubernetes to automate the management of all NVIDIA software components needed to provision GPUs. These components include the NVIDIA drivers (to enable CUDA), the Kubernetes device plugin for GPUs, the NVIDIA Container Runtime, automatic node labelling, DCGM-based monitoring, and others.
### Supported Operating Systems
For the A100 and H100 shapes (BM.GPU.H100.8, BM.GPU.A100-v2.8, BM.GPU4.8, BM.GPU.B4.8), Ubuntu 22.04 is supported.
### Required policies
The OCI Resource Manager stack template uses the [Self Managed Nodes](https://docs.oracle.com/en-us/iaas/Content/ContEng/Tasks/contengworkingwithselfmanagednodes.htm) functionality of OKE.
The following policies are required. The OCI Resource Manager stack will create them for you if you have the necessary permissions. If you don't have the permissions, please see the links below for more information about the policies; an illustrative sketch of the statements follows the links.
- [Policy Configuration for Cluster Creation and Deployment](https://docs.oracle.com/en-us/iaas/Content/ContEng/Concepts/contengpolicyconfig.htm)
- [Creating a Dynamic Group and a Policy for Self-Managed Nodes](https://docs.oracle.com/en-us/iaas/Content/ContEng/Tasks/contengdynamicgrouppolicyforselfmanagednodes.htm)
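
As an illustration only (the exact statements are in the documentation linked above), the self-managed nodes setup boils down to a dynamic group that matches the worker instances and a policy that lets them join the cluster. The dynamic group name and compartment values below are placeholders:

```
# Dynamic group matching rule (placeholder compartment OCID for the worker nodes):
ALL {instance.compartment.id = 'ocid1.compartment.oc1..<unique_id>'}

# Policy allowing instances in that dynamic group to join the OKE cluster:
Allow dynamic-group <self-managed-nodes-dynamic-group> to {CLUSTER_JOIN} in compartment <compartment-name>
```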
## Instructions for deploying an OKE cluster with GPUs and RDMA connectivity
You will need a CPU pool and a GPU pool. The OCI Resource Manager stack deploys an operational worker pool by default, and you can choose to deploy additional CPU/GPU worker pools.
You can use the below image, provided by the Oracle HPC team, for both CPU and GPU pools. The image includes the OFED drivers and the necessary packages configured for RDMA.
> [!NOTE]
> The GPU image has the GPU drivers pre-installed (GPU driver version 535.154.05 with CUDA 12.2).
#### Image to import and use for the H100 and A100 nodes
You can use the instructions [here](https://docs.oracle.com/en-us/iaas/Content/Compute/Tasks/imageimportexport.htm#Importing) to import the below image into your tenancy.
[Image to import](https://objectstorage.ca-toronto-1.oraclecloud.com/p/oXC6BcCkB0lXhycxV-0UuDqGGnVtFWfLOkwuJWA5WbsBDb4FkHwnsOHa_ElRcfL2/n/hpc_limited_availability/b/images/o/Ubuntu-22-OCA-OFED-23.10-2.1.3.1-GPU-535-CUDA-12.2-2024.03.15-0)
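
If you prefer the CLI to the Console, a sketch of importing the image with the OCI CLI (the display name and compartment OCID are placeholders; the `--uri` value is the pre-authenticated image URL above):

```bash
# Import the pre-built image from its Object Storage URL into your compartment
oci compute image import from-object-uri \
  --uri "<pre-authenticated image URL above>" \
  --compartment-id "ocid1.compartment.oc1..<unique_id>" \
  --display-name "Ubuntu-22-OFED-23.10-GPU-535-CUDA-12.2"
```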
### Deploy the cluster using the Oracle Cloud Resource Manager template
You can easily deploy the cluster using the **Deploy to Oracle Cloud** button below.
[Deploy to Oracle Cloud](https://cloud.oracle.com/resourcemanager/stacks/create?zipUrl=https://github.com/oracle-quickstart/oci-hpc-oke/releases/download/v24.6.0/oke-rdma-quickstart-v24.6.0.zip)
For the image ID, use the ID of the image that you imported in the previous step.
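
If you want to look up the image OCID without the Console, a sketch using the OCI CLI (compartment OCID and display name are placeholders):

```bash
# Print the OCID of the imported image so you can paste it into the stack variables
oci compute image list \
  --compartment-id "ocid1.compartment.oc1..<unique_id>" \
  --display-name "Ubuntu-22-OFED-23.10-GPU-535-CUDA-12.2" \
  --query "data[0].id" --raw-output
```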
The template will deploy a `bastion` instance and an `operator` instance. The `operator` instance will have access to the OKE cluster. You can connect to the `operator` instance via SSH with `ssh -J opc@<bastion IP> opc@<operator IP>`.
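
For convenience, you can capture the jump-host setup in `~/.ssh/config` so that `ssh operator` works directly (host aliases and IPs below are placeholders):

```
Host bastion
    HostName <bastion public IP>
    User opc

Host operator
    HostName <operator private IP>
    User opc
    ProxyJump bastion
```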
Wait until all GPU operator pods are running with `kubectl get pods -n gpu-operator`.
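
If you prefer to block until the pods report ready rather than polling, something like the following works (the timeout value is arbitrary):

```bash
# Wait for all pods in the gpu-operator namespace to become Ready, up to 10 minutes
kubectl wait --for=condition=Ready pods --all -n gpu-operator --timeout=600s
```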
This step creates a ConfigMap that can be used as the NCCL topology file when running jobs that use NCCL as the backend. You can find the topology files in the [topology directory](https://github.com/oracle-quickstart/oci-hpc-oke/tree/main/manifests/topology) in this repo. Please make sure you use the correct topology file based on your shape when creating the ConfigMap.
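
If you are creating the ConfigMap manually, a sketch along these lines should work (the ConfigMap name and local file name are placeholders; use the topology file that matches your shape):

```bash
# Create a ConfigMap from a locally downloaded topology file
kubectl create configmap nccl-topology --from-file=topo.xml
```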
### Using the host RDMA network interfaces in manifests
In order to use the RDMA interfaces on the host in your pods, you should have the below sections in your manifests:
```yaml
securityContext:
  privileged: true
  capabilities:
    add: [ "IPC_LOCK" ]
```
```yaml
volumeMounts:
  - { mountPath: /dev/infiniband, name: devinf }
  - { mountPath: /dev/shm, name: shm }
```
Here's a simple example. You can also look at the NCCL test manifests in the repo [here](../manifests/).
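
A minimal pod sketch that puts these pieces together. The image name, GPU request, and volume definitions (a `hostPath` mount for `/dev/infiniband` and an in-memory `emptyDir` for `/dev/shm`) are illustrative assumptions; adapt them to your workload:

```yaml
# Illustrative pod spec: privileged container with host RDMA devices and a large /dev/shm
apiVersion: v1
kind: Pod
metadata:
  name: rdma-test
spec:
  restartPolicy: Never
  containers:
    - name: rdma-test
      image: <your RDMA-enabled container image>   # placeholder
      command: ["sleep", "infinity"]
      securityContext:
        privileged: true
        capabilities:
          add: [ "IPC_LOCK" ]
      resources:
        requests:
          nvidia.com/gpu: 8                        # placeholder GPU count
        limits:
          nvidia.com/gpu: 8
      volumeMounts:
        - { mountPath: /dev/infiniband, name: devinf }
        - { mountPath: /dev/shm, name: shm }
  volumes:
    - name: devinf
      hostPath:
        path: /dev/infiniband
    - name: shm
      emptyDir:
        medium: Memory
        sizeLimit: 8Gi                             # placeholder size
```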
> The NCCL parameters are different between the H100 and A100 shapes. Please make sure that you are using the correct manifest for your bare metal GPU shapes.
The initial pull of the container will take a while. Once the master pod `nccl-allreduce-job0-mpimaster-0` starts running, you can check its logs for the NCCL test result.