Description
Instructions for reproducing using the default modules on Aurora (confirmed by @colleeneb):
```shell
git clone https://github.com/ryanstocks00/DynaMPI.git
cd DynaMPI
git checkout 2ed2c4a8afb1b6a9dc377c22cd9adf1eabec9aaa
# Note: you will likely need to run this on a login node the first time,
# as it automatically downloads the Google Test dependencies.
./benchmark/aurora/aurora_compile.sh
qsub -I -l select=16 -l walltime=30:00 -l filesystems=home:flare -A XXX -q debug-scaling NODE_LIST=16
mpirun -n 1632 --ppn 102 ./build/benchmark/strong_scaling_distribution_rate --expected_us 1000 --distribution naive --nodes 16
```
This runs a dynamic master-slave distribution: every rank sends a message to rank 0 requesting a task, rank 0 receives the request and sends a "task" back, and the worker then spins for expected_us microseconds before requesting another task. This fails non-deterministically with errors such as:
```
x4220c5s7b0n0.hsn.cm.aurora.alcf.anl.gov: rank 0 died from signal 6
x4220c6s0b0n0.hsn.cm.aurora.alcf.anl.gov: rank 118 died from signal 15
```
Sometimes the output also includes many lines like:
```
Abort(15) on node 1328 (rank 1328 in comm 496): Fatal error in internal_Probe: Other MPI error
```
Interestingly, reducing the duration of the tasks (e.g. --expected_us 1), which should produce more conflicting messages, seems to make the run more likely to succeed.
The issue goes away if I use export FI_CXI_RX_MATCH_MODE=software; however, performance then seems to be significantly degraded (I'm not sure whether this is due to the matching mode or something else), up to two orders of magnitude slower than Frontier.
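For reference, the workaround run looks like this (same binary and flags as the reproducer above; whether the slowdown comes from software matching itself or from something else is still unclear):

```shell
# Workaround: force libfabric's CXI provider to match receives in software
# rather than in hardware. Avoids the crash, but with a large performance cost.
export FI_CXI_RX_MATCH_MODE=software
mpirun -n 1632 --ppn 102 ./build/benchmark/strong_scaling_distribution_rate \
    --expected_us 1000 --distribution naive --nodes 16
```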
This may be related to #7427.