Commit bb65c3f
Author: Piotr Lesnicki
pmi2: check failures and warn. Singleton check must be elsewhere.
Depending on the Slurm version, PMI2 can fail in different ways when there is no server. It can fail to contact its server for two different reasons:
- no slurm flag `--mpi=pmi2` given: this needs a warning
- singleton (no server because there is no srun): there should be no warning here
Here both cases are handled with a warning, so the singleton case must be disqualified before this check.
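
The commit message leaves open how the singleton case gets disqualified before this check. A minimal sketch of one possible pre-check follows. It assumes that srun exports `SLURM_STEP_ID` to the tasks of a job step, so a missing value suggests a singleton start. The helper name `pmi2_server_expected` is hypothetical and is not part of Open MPI; the real fix may use a different test entirely.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper (not an Open MPI function).
 * step_id is the value of SLURM_STEP_ID, or NULL when it is unset.
 * Assumption: srun sets this variable for every task it launches, so a
 * missing or empty value means no PMI2 server should be expected and
 * the caller can skip PMI2 silently instead of printing a warning.
 */
bool pmi2_server_expected(const char *step_id)
{
    return step_id != NULL && step_id[0] != '\0';
}
```

A caller would pass `getenv("SLURM_STEP_ID")` and only attempt PMI2 initialization, with its warning on failure, when the helper returns true.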
1 parent 9c83f9d commit bb65c3f

2 files changed: +3 -8 lines changed

opal/mca/common/pmi/common_pmi.c

Lines changed: 2 additions & 7 deletions
@@ -46,7 +46,6 @@ bool mca_common_pmi_init (void) {
     rank = -1;
     appnum = -1;
 
-
     /* if we can't startup PMI, we can't be used */
     if (PMI2_Initialized ()) {
         return true;
@@ -57,17 +56,13 @@ bool mca_common_pmi_init (void) {
         mca_common_pmi_init_count--;
         return false;
     }
-    if (size < 0 || rank < 0 ) {
+    /* depending on slurm versions, we may get bad rank/size or bad jobid */
+    if (size < 0 || rank < 0 || PMI2_SUCCESS != PMI2_Job_GetId(buf, PMI2_MAX_VALLEN)) {
         opal_show_help("help-common-pmi.txt", "pmi2-init-returned-bad-values", true);
         mca_common_pmi_init_count--;
         return false;
     }
 
-    if (PMI2_SUCCESS != PMI2_Job_GetId(buf, PMI2_MAX_VALLEN)) {
-        /* PMI2 can't be used if no job in singloton mode */
-        mca_common_pmi_init_count--;
-        return false;
-    }
     mca_common_pmi_init_size = size;
     mca_common_pmi_init_rank = rank;
     mca_common_pmi_init_count--;
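
The merged condition can be modelled outside Open MPI to show its effect. This is a sketch under stated assumptions, not the project's code. `struct pmi2_result`, `pmi2_values_ok`, and `SKETCH_PMI2_SUCCESS` are hypothetical stand-ins for the values the real function gets from PMI2, and `fprintf` stands in for `opal_show_help`. In the real condition the `||` operator short-circuits, so `PMI2_Job_GetId` is only called when both `size` and `rank` are non-negative.

```c
#include <stdbool.h>
#include <stdio.h>

/* Stand-in for PMI2_SUCCESS; the value 0 is an assumption of this sketch. */
#define SKETCH_PMI2_SUCCESS 0

/* Hypothetical snapshot of what PMI2 reported after initialization. */
struct pmi2_result {
    int size;     /* negative when PMI2 returned a bad size */
    int rank;     /* negative when PMI2 returned a bad rank */
    int jobid_rc; /* return code of the job id query */
};

/*
 * Mirrors the merged condition from the diff: a bad size, a bad rank,
 * or a failed job id query all take the same warning path.
 */
bool pmi2_values_ok(const struct pmi2_result *r)
{
    if (r->size < 0 || r->rank < 0 || SKETCH_PMI2_SUCCESS != r->jobid_rc) {
        fprintf(stderr,
                "PMI2 initialized but returned bad values for size/rank/jobid.\n");
        return false;
    }
    return true;
}
```

Before this commit, a failed job id query returned false without any message. In this model it prints the same warning as a bad size or rank, which is why a singleton start has to be filtered out earlier.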

opal/mca/common/pmi/help-common-pmi.txt

Lines changed: 1 addition & 1 deletion
@@ -13,7 +13,7 @@ We cannot use PMI2 at this time, and your job will
 likely abort.
 #
 [pmi2-init-returned-bad-values]
-PMI2 initialized but returned bad values for size and rank.
+PMI2 initialized but returned bad values for size/rank/jobid.
 This is symptomatic of either a failure to use the
 "--mpi=pmi2" flag in SLURM, or a borked PMI2 installation.
 If running under SLURM, try adding "-mpi=pmi2" to your
