MPI_Probe crash with messages from large number of ranks on Aurora #7711

@ryanstocks00

Description
Instructions for reproducing with the default modules on Aurora (confirmed by @colleeneb):

git clone https://github.com/ryanstocks00/DynaMPI.git
cd DynaMPI
git checkout 2ed2c4a8afb1b6a9dc377c22cd9adf1eabec9aaa
./benchmark/aurora/aurora_compile.sh # (Note you will likely need to do this on a login node the first time as it automatically downloads google test dependencies)
qsub -I -l select=16 -l walltime=30:00 -l filesystems=home:flare -A XXX -q debug-scaling NODE_LIST=16 
mpirun -n 1632 --ppn 102 ./build/benchmark/strong_scaling_distribution_rate --expected_us 1000 --distribution naive --nodes 16

This runs a dynamic master-slave distribution (i.e. all ranks send a message to rank 0 requesting a task, and rank 0 receives the request and sends a "task" back; the worker then spins for expected_us microseconds before requesting another task). This fails non-deterministically with errors such as
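For clarity, the request/dispatch pattern described above can be sketched as follows. This is NOT the DynaMPI source, just a minimal illustration of the communication structure that triggers the crash; the tags, the `-1` stop sentinel, and the task count are all invented for the sketch:

```c
/* Minimal sketch of the dynamic master-worker loop described above.
 * Assumptions (not from DynaMPI): TAG_REQUEST/TAG_TASK tags, a -1
 * "stop" task, and 10 tasks per worker. Requires an MPI launcher. */
#include <mpi.h>

#define TAG_REQUEST 1
#define TAG_TASK    2

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        /* Master: answer each request with the next task id,
         * then with -1 once the task pool is exhausted. */
        int next_task = 0, stops_sent = 0;
        const int num_tasks = 10 * (size - 1);
        while (stops_sent < size - 1) {
            MPI_Status st;
            /* The reported "Fatal error in internal_Probe" occurs here
             * under hardware matching when many ranks send at once. */
            MPI_Probe(MPI_ANY_SOURCE, TAG_REQUEST, MPI_COMM_WORLD, &st);
            int dummy;
            MPI_Recv(&dummy, 1, MPI_INT, st.MPI_SOURCE, TAG_REQUEST,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            int task = (next_task < num_tasks) ? next_task++ : -1;
            if (task == -1) stops_sent++;
            MPI_Send(&task, 1, MPI_INT, st.MPI_SOURCE, TAG_TASK,
                     MPI_COMM_WORLD);
        }
    } else {
        /* Worker: request a task, receive it, spin for expected_us
         * microseconds (elided), repeat until the stop sentinel. */
        for (;;) {
            int dummy = 0, task;
            MPI_Send(&dummy, 1, MPI_INT, 0, TAG_REQUEST, MPI_COMM_WORLD);
            MPI_Recv(&task, 1, MPI_INT, 0, TAG_TASK, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            if (task == -1) break;
        }
    }
    MPI_Finalize();
    return 0;
}
```

At 1632 ranks, rank 0 is probing a flood of small unexpected messages from every other rank, which is the regime where the CXI hardware match-mode limits come into play.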

x4220c5s7b0n0.hsn.cm.aurora.alcf.anl.gov: rank 0 died from signal 6
x4220c6s0b0n0.hsn.cm.aurora.alcf.anl.gov: rank 118 died from signal 15

Sometimes the output also contains many lines like
Abort(15) on node 1328 (rank 1328 in comm 496): Fatal error in internal_Probe: Other MPI error

Interestingly, reducing the duration of the tasks (e.g. --expected_us 1), which should produce more conflicting messages, seems to make the run more likely to succeed.

The issue goes away if I use export FI_CXI_RX_MATCH_MODE=software; however, performance then seems significantly degraded (not sure whether this is due to the matching mode or something else) and is up to two orders of magnitude slower than on Frontier.

May be related to #7427
