perf: request some benchmarks and compare them with results in native slurm #156

@CrackedPoly

Description

I wonder how much overhead soperator introduces in ML workloads compared with native Slurm. This is an important concern, and I would like to know if you have any statistics.

Some scenarios

Single machine

  • 1 GPU training benchmark
  • 8 GPUs distributed training benchmark (NVLink involved)

Distributed

  • 16 GPUs distributed training benchmark (both NVLink and InfiniBand involved)
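
A useful way to keep such a comparison fair is to submit the exact same job script to both the soperator-managed cluster and native Slurm. The sketch below is a hypothetical sbatch script for the 16-GPU scenario (assuming 2 nodes with 8 GPUs each); the `train.py` entrypoint, its `--report-throughput` flag, and the rendezvous port are illustrative assumptions, not anything specified in this issue.

```shell
#!/bin/bash
# Hypothetical 2-node x 8-GPU benchmark job; identical script submitted
# to soperator and to native Slurm so only the scheduler layer differs.
#SBATCH --job-name=bench-16gpu
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=8
#SBATCH --time=01:00:00

# Pick the first allocated node as the rendezvous host for torchrun.
head_node=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)

# One torchrun launcher per node, 8 worker processes each (16 ranks total).
# train.py and --report-throughput are placeholders for the actual benchmark.
srun torchrun \
  --nnodes="$SLURM_NNODES" \
  --nproc_per_node=8 \
  --rdzv_backend=c10d \
  --rdzv_endpoint="${head_node}:29500" \
  train.py --report-throughput
```

Comparing the reported throughput (e.g. samples/sec) and job wall time between the two environments would give a first-order estimate of the operator's overhead.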

Metadata

Labels: documentation (Improvements or additions to documentation)
