Commit f1456d8

Update docs
1 parent 67e0816 commit f1456d8

1 file changed

+14
-15
lines changed

README.md

Lines changed: 14 additions & 15 deletions
Original file line number | Diff line number | Diff line change
@@ -135,21 +135,6 @@ Wait until all network operator pods are running with `kubectl get pods -n netwo
135135
```
136136
kubectl apply -f https://raw.githubusercontent.com/oracle-quickstart/oci-hpc-oke/main/manifests/vf-config.yaml
137137
```
138-
139-
### Confirm that the GPUs and VFs are correctly exposed
140-
Once the Network Operator pods are deployed, the GPU nodes with RDMA NICs will start reporting `nvidia.com/sriov_rdma_vf` as an available resource. You can request that resource in your pod manifests for assigning RDMA VFs to pods.
141-
142-
By default, we create one Virtual Function per Physical Function. So for the H100 and A100 bare metal shapes, you will see 16 VFs per node exposed as a resource.
143-
144-
```
145-
kubectl get nodes -l 'node.kubernetes.io/instance-type in (BM.GPU.H100.8, BM.GPU.A100-v2.8, BM.GPU4.8, BM.GPU.B4.8)' --sort-by=.status.capacity."nvidia\.com/gpu" -o=custom-columns='NODE:metadata.name,GPUs:status.capacity.nvidia\.com/gpu,RDMA-VFs:status.capacity.nvidia\.com/sriov_rdma_vf'
146-
147-
NODE GPUs RDMA-VFs
148-
10.79.148.115 8 16
149-
10.79.151.167 8 16
150-
10.79.156.205 8 16
151-
```
152-
153138
### Create Network Attachment Definition
154139

155140
```sh
@@ -174,6 +159,20 @@ curl -s -o ./topo.xml https://raw.githubusercontent.com/oracle-quickstart/oci-hp
174159
kubectl create configmap nccl-topology --from-file ./topo.xml
175160
```
176161

162+
### Confirm that the GPUs and VFs are correctly exposed
163+
Once the Network Operator pods are deployed, the GPU nodes with RDMA NICs will start reporting `nvidia.com/sriov_rdma_vf` as an available resource. You can request that resource in your pod manifests for assigning RDMA VFs to pods.
164+
165+
By default, we create one Virtual Function per Physical Function. So for the H100 and A100 bare metal shapes, you will see 16 VFs per node exposed as a resource.
166+
167+
```
168+
kubectl get nodes -l 'node.kubernetes.io/instance-type in (BM.GPU.H100.8, BM.GPU.A100-v2.8, BM.GPU4.8, BM.GPU.B4.8)' --sort-by=.status.capacity."nvidia\.com/gpu" -o=custom-columns='NODE:metadata.name,GPUs:status.capacity.nvidia\.com/gpu,RDMA-VFs:status.capacity.nvidia\.com/sriov_rdma_vf'
169+
170+
NODE GPUs RDMA-VFs
171+
10.79.148.115 8 16
172+
10.79.151.167 8 16
173+
10.79.156.205 8 16
174+
```
175+
177176
### Requesting the Virtual Functions in manifests
178177
Network Operator exposes the RDMA Virtual Functions (VFs) as allocatable resources. To use them, you need to add the following annotation to your manifests. The next step in this guide has an example for running the NCCL test; you can use that manifest as a reference.
179178

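The node listing moved by this commit can also be checked mechanically. The following is a minimal sketch, not part of the README: `check_vfs` is a hypothetical helper name, and the sample text is the output shown in the diff. It assumes the three-column layout (`NODE GPUs RDMA-VFs`) produced by the `kubectl get nodes` command and flags any node whose VF count differs from the expected 16.

```sh
# Hypothetical helper: reads the NODE/GPUs/RDMA-VFs table on stdin and
# reports every node whose RDMA-VFs column differs from the expected count.
# Exits non-zero if at least one node is off.
check_vfs() {
  expected="$1"
  awk -v want="$expected" '
    NR > 1 && $3 != want {            # skip the header row
      print $1 " reports " $3 " VFs, expected " want
      bad = 1
    }
    END { exit bad }
  '
}

# Sample output copied from the README; on a real cluster, pipe the
# kubectl command output into check_vfs instead.
sample='NODE GPUs RDMA-VFs
10.79.148.115 8 16
10.79.151.167 8 16
10.79.156.205 8 16'

if printf '%s\n' "$sample" | check_vfs 16; then
  echo "all nodes expose 16 VFs"
fi
```

A node that has not yet registered the resource typically shows `<none>` in the `RDMA-VFs` column, which this check also reports as a mismatch.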