Skip to content

Add PMIx hook#7

Merged
fcruzcscs merged 42 commits intomainfrom
VCUE-1062_pmix-hook
Oct 24, 2025
Merged

Add PMIx hook#7
fcruzcscs merged 42 commits intomainfrom
VCUE-1062_pmix-hook

Conversation

@gwangmu
Copy link
Copy Markdown
Contributor

@gwangmu gwangmu commented Sep 16, 2025

Resolves: VCUE-1062

Added a PMIx hook, based on the equivalent hook for Enroot (https://git.cscs.ch/alps-platforms/vservices/vs-container-engine/-/blob/main/ce-enroot/templates/50-slurm-pmix.sh.tftpl?ref_type=heads). Also added a BATS test.

It passed a local test using a mock "scontrol." The main pipeline is waiting for 'zinal' as of writing.

@gwangmu gwangmu requested a review from fcruzcscs September 16, 2025 17:50
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We run bats in zinal or other cscs systems, it is fair to assume that slurm is available. I would recommend to write the test to assume that slurm is present (along with scontrol) to run against the actual slurm and validate that pmix is working with mpi applications.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm working on it. I'll let you know after I fix tests to use the real scontrol.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I adjusted the test so that the test uses the system scontrol if it's there, although I don't think the test result should depend on the system setting if possible.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still dont see an mpi application test.

libsarus::mount::validatedBindMount(pathAppdirJobStep, pathAppdirJobStep, userIdentity, pathRootFS, mount_flags);
log(boost::format("Mounted spmix_appdir: %s") % pathAppdirJobStep, libsarus::LogLevel::INFO);
} catch (...) {
// Respecfully ignore. ("nofail")
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This catch should not ignore everything raised from the validatedBindMount.
If the pmi dirs fail to mount, the hook basically failed to do what it needed to do.
If the container requires pmix functionality, then the container should fail.

Copy link
Copy Markdown
Contributor Author

@gwangmu gwangmu Oct 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I saw the "nofail" mount option in the reference hook for Enroot, meaning that it wouldn't have caused an error if the hook couldn't mount the spmix_appdir directory (enclosed in this try-cache block). If you think this should raise an error, we can easily remove the try-cache block, or generate an error in the try block.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I thought there must have been a reason why there was a "nofail" option before.)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this mount used for in order to enable the PMIx functionality? do we need it? if not, why not just skip it from the start?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know about the purpose of this mount at all. I opened a Slack thread to collect more information on this.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Madeeks said on Slack that the 'nofail' option came from the initial reference from NVIDIA. Since we don't know why they put it there, and we're (re)developing hooks now, it's a good chance to make it an error and see how it goes in testing and during pre-production. I removed the try-cache block accordingly.

@gwangmu gwangmu mentioned this pull request Oct 22, 2025
@gwangmu
Copy link
Copy Markdown
Contributor Author

gwangmu commented Oct 24, 2025

@fcruzcscs The tests of this PR can only pass after PR #9 is merged first (and it's rebased to it).

@fcruzcscs
Copy link
Copy Markdown
Member

merged and re-run but pipelines failed on PMIx test not passing

@gwangmu gwangmu force-pushed the VCUE-1062_pmix-hook branch from 83e7e14 to d8feb4b Compare October 24, 2025 13:29
@gwangmu
Copy link
Copy Markdown
Contributor Author

gwangmu commented Oct 24, 2025

There was something wrong while rebasing. Now all tests are passing.

@fcruzcscs fcruzcscs merged commit e8bb1d8 into main Oct 24, 2025
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants