MPI_Probe crash with messages from large number of ranks on Aurora #7711

@ryanstocks00

Description
Instructions for reproducing with the default modules on Aurora (confirmed by @colleeneb):

git clone https://github.com/ryanstocks00/DynaMPI.git
cd DynaMPI
git checkout 2ed2c4a8afb1b6a9dc377c22cd9adf1eabec9aaa
./benchmark/aurora/aurora_compile.sh # (Note you will likely need to do this on a login node the first time as it automatically downloads google test dependencies)
qsub -I -l select=16 -l walltime=30:00 -l filesystems=home:flare -A XXX -q debug-scaling NODE_LIST=16 
mpirun -n 1632 --ppn 102 ./build/benchmark/strong_scaling_distribution_rate --expected_us 1000 --distribution naive --nodes 16

This runs a dynamic master-slave distribution (i.e. all ranks send a message to rank 0 requesting a task, and rank 0 receives the request and sends a "task" back; the worker then spins for expected_us microseconds before requesting another task). This fails non-deterministically with errors such as
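For clarity, the request/dispatch pattern described above can be sketched as follows. This is NOT the DynaMPI source, just a minimal illustration of the communication structure that triggers the crash; the tags, the `-1` stop sentinel, and the task count are all invented for the sketch:

```c
/* Minimal sketch of the dynamic master-worker loop described above.
 * Assumptions (not from DynaMPI): TAG_REQUEST/TAG_TASK tags, a -1
 * "stop" task, and 10 tasks per worker. Requires an MPI launcher. */
#include <mpi.h>

#define TAG_REQUEST 1
#define TAG_TASK    2

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        /* Master: answer each request with the next task id,
         * then with -1 once the task pool is exhausted. */
        int next_task = 0, stops_sent = 0;
        const int num_tasks = 10 * (size - 1);
        while (stops_sent < size - 1) {
            MPI_Status st;
            /* The reported "Fatal error in internal_Probe" occurs here
             * under hardware matching when many ranks send at once. */
            MPI_Probe(MPI_ANY_SOURCE, TAG_REQUEST, MPI_COMM_WORLD, &st);
            int dummy;
            MPI_Recv(&dummy, 1, MPI_INT, st.MPI_SOURCE, TAG_REQUEST,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            int task = (next_task < num_tasks) ? next_task++ : -1;
            if (task == -1) stops_sent++;
            MPI_Send(&task, 1, MPI_INT, st.MPI_SOURCE, TAG_TASK,
                     MPI_COMM_WORLD);
        }
    } else {
        /* Worker: request a task, receive it, spin for expected_us
         * microseconds (elided), repeat until the stop sentinel. */
        for (;;) {
            int dummy = 0, task;
            MPI_Send(&dummy, 1, MPI_INT, 0, TAG_REQUEST, MPI_COMM_WORLD);
            MPI_Recv(&task, 1, MPI_INT, 0, TAG_TASK, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            if (task == -1) break;
        }
    }
    MPI_Finalize();
    return 0;
}
```

At 1632 ranks, rank 0 is probing a flood of small unexpected messages from every other rank, which is the regime where the CXI hardware match-mode limits come into play.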

x4220c5s7b0n0.hsn.cm.aurora.alcf.anl.gov: rank 0 died from signal 6
x4220c6s0b0n0.hsn.cm.aurora.alcf.anl.gov: rank 118 died from signal 15

Sometimes the output also contains many lines like
Abort(15) on node 1328 (rank 1328 in comm 496): Fatal error in internal_Probe: Other MPI error

Interestingly, reducing the duration of the tasks (e.g. --expected_us 1), which should produce more conflicting messages, seems to make the run more likely to succeed.

The issue goes away if I use export FI_CXI_RX_MATCH_MODE=software; however, performance then seems significantly degraded (not sure whether this is due to the matching mode or something else) and is up to two orders of magnitude slower than on Frontier.

May be related to #7427
