We run BATS on zinal or other CSCS systems, so it is fair to assume that Slurm is available. I would recommend writing the test to assume that Slurm (along with scontrol) is present, so that it runs against the actual Slurm and validates that PMIx works with MPI applications.
I'm working on it. I'll let you know after I fix tests to use the real scontrol.
I adjusted the test so that it uses the system scontrol if it's there, although I don't think the test result should depend on the system configuration if that can be avoided.
I still don't see an MPI application test.
src/pmix/PMIxHook.cpp
Outdated
    libsarus::mount::validatedBindMount(pathAppdirJobStep, pathAppdirJobStep, userIdentity, pathRootFS, mount_flags);
    log(boost::format("Mounted spmix_appdir: %s") % pathAppdirJobStep, libsarus::LogLevel::INFO);
} catch (...) {
    // Respectfully ignore. ("nofail")
This catch should not ignore everything raised from the validatedBindMount.
If the pmi dirs fail to mount, the hook basically failed to do what it needed to do.
If the container requires pmix functionality, then the container should fail.
I saw the "nofail" mount option in the reference hook for Enroot, which means a failure to mount the spmix_appdir directory (the mount enclosed in this try-catch block) would not have caused an error there. If you think this should raise an error, we can simply remove the try-catch block or throw an error from within the try block.
(I thought there must have been a reason why there was a "nofail" option before.)
What is this mount used for in enabling the PMIx functionality? Do we need it? If not, why not just skip it from the start?
I don't know about the purpose of this mount at all. I opened a Slack thread to collect more information on this.
@Madeeks said on Slack that the 'nofail' option came from the initial reference from NVIDIA. Since we don't know why they put it there, and we're (re)developing the hooks now, this is a good opportunity to make it an error and see how it goes in testing and pre-production. I removed the try-catch block accordingly.
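For illustration only, here is a minimal sketch of the two behaviours discussed in this thread: swallowing a mount failure ("nofail") versus letting it fail the hook. `MountFn`, `mountNoFail` and `mountOrFail` are hypothetical names introduced for this sketch. They are not part of libsarus or of the hook in this PR, and `MountFn` only stands in for a bind-mount call that throws on failure.

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a bind-mount routine that throws on failure.
// This is NOT the libsarus API; it only models "mount source onto destination".
using MountFn = std::function<void(const std::string& source,
                                   const std::string& destination)>;

// "nofail" behaviour: a failed mount is swallowed and only reported
// through the return value, so the hook itself keeps going.
bool mountNoFail(const MountFn& mount, const std::string& path) {
    assert(!path.empty());
    try {
        mount(path, path);
        return true;
    } catch (const std::exception&) {
        return false;  // the container starts without the directory
    }
}

// Strict behaviour: a failed mount propagates to the caller and fails the
// hook. The rethrow only adds the path to the message for context.
void mountOrFail(const MountFn& mount, const std::string& path) {
    assert(!path.empty());
    try {
        mount(path, path);
    } catch (const std::exception& e) {
        throw std::runtime_error("failed to mount " + path + ": " + e.what());
    }
}
```

Removing the try block altogether, as done in this PR, also makes the failure propagate. The rethrow in `mountOrFail` is an optional extra that only makes the error message easier to trace.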
@fcruzcscs The tests in this PR can only pass after PR #9 is merged and this PR is rebased onto it.
Merged and re-ran, but the pipelines failed because the PMIx test did not pass.
…lti-node testing)
… (stopgap for the 'precreate' hook)
… (stopgap for the 'precreate' hook)
…so that srun-spawned process can see on different nodes)
83e7e14 to d8feb4b
Something went wrong while rebasing. All tests are passing now.
Resolves: VCUE-1062
Added a PMIx hook, based on the equivalent hook for Enroot (https://git.cscs.ch/alps-platforms/vservices/vs-container-engine/-/blob/main/ce-enroot/templates/50-slurm-pmix.sh.tftpl?ref_type=heads). Also added a BATS test.
It passed a local test using a mock "scontrol." The main pipeline is waiting for 'zinal' as of writing.