Commit b5126cf

Update using-rdma-network-locality-when-running-workloads-on-oke.md
1 parent 294cd71 commit b5126cf

1 file changed: +103 -1 lines changed

docs/using-rdma-network-locality-when-running-workloads-on-oke.md

Lines changed: 103 additions & 1 deletion
@@ -173,7 +173,7 @@ spec:
```

-### Using `kueue`
+### Using Kueue

You will need to [enable the feature gate](https://kueue.sigs.k8s.io/docs/installation/#change-the-feature-gates-configuration) for [Topology Aware Scheduling (TAS)](https://kueue.sigs.k8s.io/docs/concepts/topology_aware_scheduling) in Kueue. Topology Aware Scheduling has been in alpha since Kueue v0.9.
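
As a minimal sketch (assuming a default Kueue installation in the `kueue-system` namespace and the standard `manager` container name), the gate can be enabled by adding the `--feature-gates` flag to the `kueue-controller-manager` Deployment:

```yaml
# Sketch only: enable the TopologyAwareScheduling feature gate on the
# kueue-controller-manager Deployment, for example via
#   kubectl -n kueue-system edit deployment kueue-controller-manager
# Namespace and container name assume a default install; keep the args
# that are already present on the container.
spec:
  template:
    spec:
      containers:
        - name: manager
          args:
            - --feature-gates=TopologyAwareScheduling=true
```
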
The following example uses `node.kubernetes.io/instance-type: "BM.GPU.H100.8"` to select H100 nodes, but you can use any label that exists on all of the nodes you are targeting with the ResourceFlavor.
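
For reference, the snippet below is a minimal sketch of that ResourceFlavor selector only. The flavor name and the `topologyName` value are placeholders; `topologyName` must reference the Topology object you define with the RDMA locality node labels, as shown in the full example in this guide.

```yaml
# Sketch only: the ResourceFlavor portion referred to above. metadata.name and
# topologyName are placeholders -- topologyName must point at a Topology object
# whose levels are the RDMA locality node labels used elsewhere in this guide.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: bm-gpu-h100-8
spec:
  nodeLabels:
    node.kubernetes.io/instance-type: "BM.GPU.H100.8"
  topologyName: "oke-rdma-topology"
```
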
@@ -263,4 +263,106 @@ spec:
restartPolicy: Never
```

### Using the node ordering script as an init container with the MPI Operator

If your workload can consume an ordered list of hosts or a rankfile (for example with `mpirun`), you can run the Python node ordering script in an init container to generate that file, and then use the generated ordered hostfile or rankfile in your job.

The script creates these files from the same information that is available in the instance metadata service.

Example with the MPI Operator:

```yaml
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: nccl-tests
spec:
  slotsPerWorker: 8
  runPolicy:
    cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          initContainers:
            # Runs the node ordering script and writes the ordered hostfile to
            # the shared emptyDir volume for mpirun to consume.
            - name: node-ordering
              image: iad.ocir.io/hpc_limited_availability/node-ordering:mpi-operator-v0.1
              volumeMounts:
                - name: node-ordering
                  mountPath: "/node-ordering"
                # Hostfile and SSH volumes that the MPI Operator adds to the launcher pod
                - name: mpi-job-config
                  mountPath: /etc/mpi
                - name: ssh-auth
                  mountPath: /root/.ssh
          volumes:
            - name: node-ordering
              emptyDir: {}
          containers:
            - image: iad.ocir.io/hpc_limited_availability/nccl-tests:pytorch-24.11-nccl-2.23.4-1
              name: nccl-tests
              volumeMounts:
                - name: node-ordering
                  mountPath: "/node-ordering"
              env:
                - name: OMPI_ALLOW_RUN_AS_ROOT
                  value: "1"
                - name: OMPI_ALLOW_RUN_AS_ROOT_CONFIRM
                  value: "1"
              command: ["/bin/bash", "-c"]
              args: ["mpirun \
                  -mca coll ^hcoll -mca plm_rsh_args '-p 2222' \
                  -mca coll_hcoll_enable 0 \
                  --bind-to numa \
                  -hostfile /node-ordering/ordered_hostfile \
                  -x NCCL_SOCKET_NTHREADS=16 \
                  -x NCCL_DEBUG=WARN \
                  -x NCCL_CUMEM_ENABLE=0 \
                  -x NCCL_IB_SPLIT_DATA_ON_QPS=0 \
                  -x NCCL_IB_QPS_PER_CONNECTION=16 \
                  -x NCCL_IB_GID_INDEX=3 \
                  -x NCCL_IB_HCA==mlx5_0,mlx5_1,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7,mlx5_8,mlx5_9,mlx5_10,mlx5_12,mlx5_13,mlx5_14,mlx5_15,mlx5_16,mlx5_17 \
                  -x NCCL_IB_TC=41 \
                  -x NCCL_IB_SL=0 \
                  -x NCCL_IB_TIMEOUT=22 \
                  -x HCOLL_ENABLE_MCAST_ALL=0 \
                  -x UCX_TLS=tcp \
                  -x UCX_NET_DEVICES=eth0 \
                  -x RX_QUEUE_LEN=8192 \
                  -x IB_RX_QUEUE_LEN=8192 \
                  -x NCCL_SOCKET_IFNAME=eth0 \
                  -x NCCL_IGNORE_CPU_AFFINITY=1 \
                  /workspace/nccl-tests/build/all_reduce_perf -b 8 -f 2 -g 1 -e 4G -c 1"]
              resources:
                requests:
                  cpu: 2
                  memory: 128Mi
    Worker:
      replicas: 2
      template:
        metadata:
        spec:
          containers:
            - image: iad.ocir.io/hpc_limited_availability/nccl-tests:pytorch-24.11-nccl-2.23.4-1
              securityContext:
                capabilities:
                  add: [ "IPC_LOCK" ]
              name: nccl
              resources:
                requests:
                  cpu: 100
                  memory: 750Gi
                  nvidia.com/gpu: 8
                limits:
                  nvidia.com/gpu: 8
              volumeMounts:
                - mountPath: /dev/shm
                  name: dshm
          volumes:
            - emptyDir:
                medium: Memory
              name: dshm
```
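
The example above consumes the ordered hostfile. If your MPI launcher takes a rankfile instead, the node ordering script can generate that as well; below is a minimal sketch of the launcher `command`/`args` variant. The `/node-ordering/rankfile` path is an assumption (use whatever file name the script actually writes), `--rankfile` is the Open MPI 4.x spelling (newer Open MPI versions moved rankfiles under the `--map-by` option), and the NCCL/UCX `-x` settings from the example above should be kept but are omitted here for brevity.

```yaml
# Sketch only: launcher args that consume a rankfile instead of the ordered
# hostfile. /node-ordering/rankfile is an assumed file name; keep the same
# NCCL/UCX -x settings as in the full example above.
command: ["/bin/bash", "-c"]
args:
  - >
    mpirun
    -mca coll ^hcoll -mca plm_rsh_args '-p 2222'
    -mca coll_hcoll_enable 0
    --bind-to numa
    --rankfile /node-ordering/rankfile
    -x NCCL_SOCKET_IFNAME=eth0
    /workspace/nccl-tests/build/all_reduce_perf -b 8 -f 2 -g 1 -e 4G -c 1
```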