
Slurm Scheduler doesn't support multi-cluster #3559

@JimPaine

Description


I am trying to run a series of tests across multiple clusters in Slurm.

  • I have accounting configured and can interact with the clusters using sbatch, sinfo, sacct, etc., passing the -M flag to specify the cluster
  • I use the access property to specify the cluster in my partitions in my ReFrame cluster config, and I can see that jobs are submitted to the correct cluster
  • when running with verbose output I can see that the sacct arguments are pre-defined and there is no way to pass additional values to sacct, as seen
    here

Snippet of core/schedulers/slurm.py

completed = _run_strict(
    f'sacct -S {t_start} -P '
    f'-j {",".join(job.jobid for job in jobs)} '
    f'-o jobid,state,exitcode,end,nodelist'
)
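For comparison, here is a minimal sketch (not ReFrame's actual code; the helper names are invented) of why submission honours the access options while polling does not: the partition's access list is prepended to the sbatch command line, but the sacct polling command is assembled from fixed arguments only.

```python
# Illustrative only -- these helpers are not ReFrame's real implementation.

def submit_command(script, access=()):
    # The partition's 'access' options (e.g. '-M clusterB') are injected
    # here, so the job lands on the right cluster.
    return ' '.join(['sbatch', *access, script])

def poll_command(t_start, jobid):
    # No access options are forwarded here -- this is the reported gap:
    # sacct always queries the default cluster.
    return (f'sacct -S {t_start} -P -j {jobid} '
            f'-o jobid,state,exitcode,end,nodelist')

print(submit_command('rfm_job.sh', access=['-M clusterB']))
print(poll_command('2025-09-30', '54'))
```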

Am I missing something obvious? If not, could we add something that lets us set additional scheduler options per ReFrame system partition, similar to sched_access_in_submit, or perhaps an additional_args option?
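As a rough sketch of what such forwarding might look like (the sched_access parameter and the flag filtering below are assumptions for illustration, not existing ReFrame API), the poll command could pass along any cluster-selection flags from the partition's access options:

```python
# Hypothetical sketch: forward cluster-selection flags into the sacct poll.

def build_poll_command(t_start, jobids, sched_access=()):
    """Build the sacct command, forwarding any -M/--clusters options."""
    # Forward only cluster-selection flags from the partition's access list,
    # since other sbatch options (accounts, reservations, ...) are not
    # meaningful to sacct in the same form.
    extra = [opt for opt in sched_access
             if opt.startswith(('-M', '--clusters'))]
    cmd = ['sacct', '-S', t_start, '-P',
           '-j', ','.join(jobids),
           '-o', 'jobid,state,exitcode,end,nodelist']
    return ' '.join(cmd + extra)

print(build_poll_command('2025-09-30', ['54'], sched_access=['-M clusterB']))
```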

Section of cluster.py

site_configuration = {
    'systems': [
        {

...

            'partitions': [
                {
                    'name': 'clusterA',
                    'descr': 'clusterA',
                    'scheduler': 'slurm',
                    'launcher': 'srun',
                    'environs': ['clusterA'],
                    'access': ['-M clusterA']
                },
                {
                    'name': 'clusterB',
                    'descr': 'clusterB',
                    'scheduler': 'slurm',
                    'launcher': 'srun',
                    'environs': ['clusterB'],
                    'access': ['-M clusterB']
                }
            ]
        }
    ],
    'environments': [
        {
            'name': 'clusterA',
            
            ...
        },
        {
            'name': 'clusterB',

            ...
        }
    ]
}

Cluster A is my default cluster, so tests run fine there.

Output of sacct

Since no -M is passed, sacct queries the default cluster only.

Entering stage: run
Entering stage: run
Entering stage: run_wait
[CMD] 'sacct -S 2025-09-30 -P -j 54 -o jobid,state,exitcode,end,nodelist'
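For context, with -P sacct emits pipe-separated rows matching the -o field list, and a job that lives on clusterB simply produces no rows when the default cluster is queried, so its state can never be resolved. A tiny illustrative parser (the sample row is fabricated, and the header row that sacct normally prints is omitted for brevity):

```python
# Fields match '-o jobid,state,exitcode,end,nodelist' in the polled command.
FIELDS = ['jobid', 'state', 'exitcode', 'end', 'nodelist']

def parse_sacct(output):
    """Parse 'sacct -P' pipe-separated rows into dicts (header row omitted)."""
    return [dict(zip(FIELDS, line.split('|')))
            for line in output.splitlines() if line]

# Fabricated sample row for a job on the default cluster:
print(parse_sacct('54|COMPLETED|0:0|2025-09-30T12:00:00|nodeA001'))
# A clusterB job queried without '-M clusterB' yields no rows at all:
print(parse_sacct(''))
```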
