Commit 67e0816

Update docs
1 parent 086f86e commit 67e0816

2 files changed, 21 additions and 7 deletions

README.md

Lines changed: 15 additions & 4 deletions
@@ -174,7 +174,20 @@ curl -s -o ./topo.xml https://raw.githubusercontent.com/oracle-quickstart/oci-hp
 kubectl create configmap nccl-topology --from-file ./topo.xml
 ```

-### Optional - Deploy Volcano
+### Requesting the Virtual Functions in manifests
+The Network Operator exposes the RDMA Virtual Functions (VFs) as allocatable resources. To use them, add the following annotation to your manifests. The next step in this guide includes a manifest for running the NCCL test, which you can use as an example.
+
+```yaml
+template:
+  metadata:
+    annotations:
+      k8s.v1.cni.cncf.io/networks: oci-rdma-sriov,oci-rdma-sriov,oci-rdma-sriov,oci-rdma-sriov,oci-rdma-sriov,oci-rdma-sriov,oci-rdma-sriov,oci-rdma-sriov,oci-rdma-sriov,oci-rdma-sriov,oci-rdma-sriov,oci-rdma-sriov,oci-rdma-sriov,oci-rdma-sriov,oci-rdma-sriov,oci-rdma-sriov
+```
+
+### Optional - Deploy Volcano and run the NCCL test
+Volcano is needed for running the optional NCCL test. It is not required for the regular operation of the cluster, so you can remove it after you finish running the NCCL test.
+
+#### Deploy Volcano
 ```sh
 helm repo add volcano-sh https://volcano-sh.github.io/helm-charts
 helm install volcano volcano-sh/volcano -n volcano-system --create-namespace
@@ -183,7 +196,7 @@ kubectl create serviceaccount -n default mpi-worker-view
 kubectl create rolebinding default-view --namespace default --serviceaccount default:mpi-worker-view --clusterrole view
 ```

-### Optional - Run the NCCL test
+#### Run the NCCL test
 > [!IMPORTANT]
 > The NCCL parameters are different between the H100 and A100 shapes. Please make sure that you are using the correct manifest.

@@ -199,8 +212,6 @@ kubectl apply -f https://raw.githubusercontent.com/oracle-quickstart/oci-hpc-oke

 The initial pull of the container takes a long time. Once the master pod `nccl-allreduce-job0-mpimaster-0` starts running, you can check its logs for the NCCL test result.

-
-
 ```sh
 Defaulted container "mpimaster" out of: mpimaster, wait-for-workers (init)
 Warning: Permanently added 'nccl-allreduce-job0-mpiworker-0.nccl-allreduce-job0' (ED25519) to the list of known hosts.
terraform/variables.tf

Lines changed: 6 additions & 3 deletions
@@ -6,8 +6,11 @@ variable "compartment_id" { type = string }
 variable "ssh_public_key_path" { type = string }
 variable "ssh_private_key_path" { type = string }

-variable gpu_image { default = "" }
-variable gpu_shape { default = "" }
+variable system_pool_image { default = "" }
+variable a100_image { default = "" }
+variable a100_shape { default = "" }
 variable kubernetes_version { default = "v1.27.2" }
 variable cluster_type { default = "enhanced" }
-variable cni_type {default = "flannel"}
+variable cluster_name { default = "a100-cluster" }
+variable cni_type {default = "flannel"}
+variable cluster_name { default = "oke-gpu-rdma-quickstart" }
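
This hunk replaces `gpu_image` and `gpu_shape` with `a100_image` and `a100_shape`, so an existing variable file that sets the old names would presumably need the same rename. The following is a minimal sketch using `sed` on a throwaway file. The file contents and values are placeholders invented for illustration, and the sketch does not set the new `system_pool_image` variable.

```sh
# Sketch: rename the old variable names in a tfvars-style file.
# Assumption: assignments start at the beginning of a line, as in
# this hypothetical sample file created only for the demonstration.
tfvars="$(mktemp)"
cat > "$tfvars" <<'EOF'
gpu_image = "ocid1.image.oc1..example"
gpu_shape = "BM.GPU4.8"
EOF

# Rewrite only the variable name, keeping the spacing before "=".
sed -e 's/^gpu_image\([[:space:]]*=\)/a100_image\1/' \
    -e 's/^gpu_shape\([[:space:]]*=\)/a100_shape\1/' \
    "$tfvars" > "${tfvars}.new"

cat "${tfvars}.new"
```

Writing to a separate `.new` file keeps the original untouched, so the result can be reviewed before it replaces the real variable file.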
