
Remove/revise Slurm launch integration #12471

@rhc54

Description


I've been getting reports lately of problems launching OMPI and PRRTE jobs under Slurm. Apparently, Slurm is now attempting to modify the srun command line that ORTE/PRRTE uses internally to launch the mpirun daemons, by injecting an MCA parameter into the environment when the allocation is created.

This isn't the first time we have encountered problems with either discovering the Slurm allocation (where we rely on specific envars) or launching the daemons (where we rely on a stable srun cmd line). It is a recurring problem that rears its head every couple of years. The other two MPIs that support the Slurm environment (MPICH and Intel MPI) follow the same integration approach we do, so this is something that has impacted us all for some time.
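For context, a rough illustration of the envars in question is below. The exact set the RAS component consumes isn't spelled out here; these are standard Slurm allocation variables and the values are made up.

```shell
# Inside an allocation (e.g. created with "salloc -N 2"), Slurm exports
# variables like these; the allocation-discovery code parses them to
# determine nodes and slots. Values shown are illustrative only.
$ env | grep ^SLURM_JOB | sort
SLURM_JOB_ID=123456
SLURM_JOB_NODELIST=node[01-02]     # compressed node list
SLURM_JOB_NUM_NODES=2
$ echo "$SLURM_TASKS_PER_NODE"
48(x2)                             # slots per node, compressed form
```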

In talking with SchedMD, the feeling on their side is that the various MPIs are interfering with their ability to develop Slurm, particularly when trying to add features. The requirement that specific envars and the srun cmd line remain stable is simply no longer going to be acceptable - some of their new features introduced last year just cannot accommodate it.

What they were trying to do was use the MCA param system to add a flag to the internal srun command instructing srun to fall back to a "legacy" mode. However, that doesn't work reliably. They can only inject the flag at the envar level, but we also read values from the default param files (which are then silently overridden by Slurm's injected value) and give final precedence to the cmd line - which in turn overrides the Slurm setting. So either the user unwittingly removes the new flag (and subsequently gets bizarre behavior as the daemons don't launch properly), or sys admins and users wind up confused when their flags simply "disappear".
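To make the precedence problem concrete, here is a minimal sketch. The plm_slurm_args param and the OMPI_MCA_ envar form are real; the injected flag value, the file contents, and the command line are purely illustrative.

```shell
# Slurm injects its flag at allocation time as an environment variable
# (the flag value here is hypothetical):
export OMPI_MCA_plm_slurm_args="--some-new-srun-flag"

# MCA precedence is roughly: default param files < environment < command line.
# A site/user default such as this is silently overridden by the injected envar:
#   $HOME/.openmpi/mca-params.conf:
#     plm_slurm_args = --exclusive
#
# And anything the user puts on the command line overrides the envar in turn,
# unknowingly stripping Slurm's flag from the internal srun invocation:
mpirun --mca plm_slurm_args "--exclusive" ./a.out
```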

One could argue that the param in question (plm_slurm_args) isn't that heavily used. However, that may not be universally true - and is impossible to really validate. My initial report came from an AWS user who couldn't get PRRTE to launch. Wenduo and I tried to advise and help debug the situation, to no avail. It was only after the user gave up that I finally figured out the root cause of the problem - aided by getting other reports of similar issues.

I have tried to propose alternative approaches based on what other environments have done. For example, LSF and PBS provide us with a launch API that is stable, but leaves them free to modify the underlying implementation at will to conform to their own changes. For resource discovery, pretty much everyone else provides a hostfile that we can read.

Unfortunately, SchedMD has rejected those proposals. Their basic position at the moment is that (a) we shouldn't be constraining their development path, and (b) they shouldn't have to provide us with alternative integration methods. I'm still pressing them for a positive proposal, but not getting very far at the moment. My sense is that they are frustrated by the problems over the years and are basically just giving up.

Assuming their stance doesn't change, the only alternative I can see is to remove the RAS and PLM Slurm components and fall back to using hostfiles and ssh for starting the job. Since SchedMD is refusing to provide a hostfile when making the allocation, this would place the burden on the user to generate one (note that executing srun hostname may not produce the correct hostfile due to recent changes in srun behavior; one possible workaround is sketched below).
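For what it's worth, one recipe a user could follow from inside the allocation, assuming they want to stay off srun entirely; scontrol show hostnames and SLURM_JOB_NODELIST are standard Slurm facilities, and the slot count is just an example:

```shell
# Expand Slurm's compressed nodelist (e.g. "node[01-02]") into one hostname per line
scontrol show hostnames "$SLURM_JOB_NODELIST" > hostfile.txt

# Optionally append a slots count per node (48 is an illustrative value)
sed -i 's/$/ slots=48/' hostfile.txt

# Launch via the hostfile/ssh path instead of the Slurm PLM
mpirun --hostfile hostfile.txt ./a.out
```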

Beyond the issue of generating the hostfile, a move to ssh means the loss of fine-grained accounting, since Slurm will not see the processes launched by mpirun. When we launch the daemons via srun, the accounting system "sees" each daemon and therefore tracks all resources used by the daemon and its children (i.e., the app procs). When launched via ssh, this doesn't happen. The overall job still gets accounted for, since the system "sees" mpirun itself and the use of resources on the nodes - but there is no per-proc tracking.

So we should discuss how we want to resolve this. The breakage first appears with the Slurm 23.11 release, so it is still early in the adoption process - but it breaks all existing OMPI releases, which is a problem we cannot control.
