I wonder how much overhead Soperator introduces for ML workloads compared with native Slurm. This is an important concern for me, and I'd like to know whether you have any statistics.
Some scenarios I have in mind (a rough sketch of the kind of benchmark I mean follows the list):

Single machine
- 1-GPU training benchmark
- 8-GPU distributed training benchmark (NVLink involved)

Distributed (multi-node)
- 16-GPU distributed training benchmark (both NVLink and InfiniBand involved)
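
For context, here is a minimal sketch of the kind of distributed training micro-benchmark I mean, launched the same way under native Slurm and under Soperator (e.g. via `sbatch`/`srun` + `torchrun`) so that the measured samples/s can be compared directly. The toy model, batch size, and step counts below are placeholders I picked for illustration, not part of any existing benchmark suite.

```python
# Minimal DDP throughput micro-benchmark sketch (assumes PyTorch with CUDA and
# a torchrun/srun launcher that sets RANK, LOCAL_RANK, and WORLD_SIZE).
import os
import time

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # NCCL backend: all-reduce traffic goes over NVLink within a node and
    # InfiniBand across nodes, which is what the overhead comparison targets.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Toy model: a stack of large linear layers, enough to exercise both
    # GPU compute and gradient all-reduce. Sizes are illustrative only.
    model = torch.nn.Sequential(
        *[torch.nn.Linear(4096, 4096) for _ in range(8)]
    ).cuda()
    model = DDP(model, device_ids=[local_rank])
    opt = torch.optim.SGD(model.parameters(), lr=1e-3)
    loss_fn = torch.nn.MSELoss()

    batch = torch.randn(64, 4096, device="cuda")
    target = torch.randn(64, 4096, device="cuda")

    # Warm-up steps so CUDA/NCCL initialization does not skew the timing.
    for _ in range(10):
        opt.zero_grad()
        loss_fn(model(batch), target).backward()
        opt.step()
    torch.cuda.synchronize()

    # Timed steps; report global throughput in samples/s from rank 0.
    steps = 100
    start = time.time()
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(model(batch), target).backward()
        opt.step()
    torch.cuda.synchronize()
    elapsed = time.time() - start

    if dist.get_rank() == 0:
        world = dist.get_world_size()
        print(f"{steps * 64 * world / elapsed:.1f} samples/s across {world} GPUs")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Running the identical script on the same hardware once on bare-metal Slurm and once on a Soperator-provisioned cluster would give the kind of apples-to-apples numbers I'm asking about.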