I wonder how much overhead Soperator introduces for ML workloads compared with native Slurm. This is an important concern for me, and I'd like to know whether you have any statistics.
Some scenarios I have in mind (a rough sketch of the kind of benchmark I mean follows the list):

Single machine
- 1-GPU training benchmark
- 8-GPU distributed training benchmark (NVLink involved)

Distributed (multi-node)
- 16-GPU distributed training benchmark (both NVLink and InfiniBand involved)
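
For context, here is a minimal sketch of the kind of distributed training micro-benchmark I mean, launched the same way under native Slurm and under Soperator (e.g. via `sbatch`/`srun` + `torchrun`) so that the measured samples/s can be compared directly. The toy model, batch size, and step counts below are placeholders I picked for illustration, not part of any existing benchmark suite.

```python
# Minimal DDP throughput micro-benchmark sketch (assumes PyTorch with CUDA and
# a torchrun/srun launcher that sets RANK, LOCAL_RANK, and WORLD_SIZE).
import os
import time

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # NCCL backend: all-reduce traffic goes over NVLink within a node and
    # InfiniBand across nodes, which is what the overhead comparison targets.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Toy model: a stack of large linear layers, enough to exercise both
    # GPU compute and gradient all-reduce. Sizes are illustrative only.
    model = torch.nn.Sequential(
        *[torch.nn.Linear(4096, 4096) for _ in range(8)]
    ).cuda()
    model = DDP(model, device_ids=[local_rank])
    opt = torch.optim.SGD(model.parameters(), lr=1e-3)
    loss_fn = torch.nn.MSELoss()

    batch = torch.randn(64, 4096, device="cuda")
    target = torch.randn(64, 4096, device="cuda")

    # Warm-up steps so CUDA/NCCL initialization does not skew the timing.
    for _ in range(10):
        opt.zero_grad()
        loss_fn(model(batch), target).backward()
        opt.step()
    torch.cuda.synchronize()

    # Timed steps; report global throughput in samples/s from rank 0.
    steps = 100
    start = time.time()
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(model(batch), target).backward()
        opt.step()
    torch.cuda.synchronize()
    elapsed = time.time() - start

    if dist.get_rank() == 0:
        world = dist.get_world_size()
        print(f"{steps * 64 * world / elapsed:.1f} samples/s across {world} GPUs")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Running the identical script on the same hardware once on bare-metal Slurm and once on a Soperator-provisioned cluster would give the kind of apples-to-apples numbers I'm asking about.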