Within an SSH session to the Slurm cluster, go to the quickcheck directory:
cd /opt/slurm-test/quickcheck

There are 4 sbatch scripts:
- hello.sh: Performs basic checks of the Slurm cluster: jobs can be executed and resources can be allocated.
- containers.sh: Launches jobs inside enroot containers.
- nccl_single_node.sh: Executes the single-node NCCL test "all_reduce_perf" twice: once using NVLink and once using the closest InfiniBand switch.
- nccl_multi_node.sh: Executes the multi-node NCCL test "all_reduce_perf_mpi".

To run them, execute the following commands:
sbatch hello.sh && \
tail -f results/hello.out

sbatch containers.sh && \
tail -f results/containers.out

sbatch nccl_single_node.sh && \
tail -f results/nccl_single_node.out

sbatch --nodes=4 nccl_multi_node.sh && \
tail -f results/nccl_multi_node.out
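
Because sbatch returns as soon as the job is queued, the output file may not exist yet when tail starts. The following is a minimal sketch of one way to handle that, assuming each script writes to results/<script name>.out as in the commands above. The helper names wait_for_file and run_and_follow, and the timeout values, are hypothetical and not part of the quickcheck directory.

```shell
#!/usr/bin/env bash

# Hypothetical helper: block until a file exists, or fail after a
# timeout given in seconds (defaults to 60).
wait_for_file() {
  local path="$1" timeout="${2:-60}" waited=0
  until [ -e "$path" ]; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "timed out waiting for $path" >&2
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
}

# Hypothetical wrapper: submit a script with optional sbatch options,
# wait for its output file to appear, then follow it.
# Assumes the output path is results/<script name without .sh>.out.
run_and_follow() {
  local script="$1"
  shift
  local out="results/${script%.sh}.out"
  sbatch "$@" "$script" && wait_for_file "$out" 300 && tail -f "$out"
}
```

After sourcing these functions in the quickcheck directory, a run might look like `run_and_follow hello.sh` or `run_and_follow nccl_multi_node.sh --nodes=4`. Press Ctrl+C to stop following the output; this does not cancel the job.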