There are remaining issues related to spawn when running the mpi4py testsuite. I'm able to reproduce them locally.
First, you need to switch to the testing/ompi-dpm branch, otherwise some of the reproducers below will be skipped as known failures.
cd mpi4py # git repo clone
git fetch && git checkout testing/ompi-dpm
I'm configuring ompi@main as follows:
options=(
--prefix=/home/devel/mpi/openmpi/dev
--without-ofi
--without-ucx
--without-psm2
--without-cuda
--without-rocm
--with-pmix=internal
--with-prrte=internal
--with-libevent=internal
--with-hwloc=internal
--enable-debug
--enable-mem-debug
--disable-man-pages
--disable-sphinx
)
./configure "${options[@]}"
I've enabled oversubscription via both Open MPI and PRTE config files.
$ cat ~/.openmpi/mca-params.conf
rmaps_default_mapping_policy = :oversubscribe
$ cat ~/.prte/mca-params.conf
rmaps_default_mapping_policy = :oversubscribe
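(For what it's worth, the same policy should also be selectable per run via mpiexec's mapping option, something like the line below; the config files above are what I actually used.)
mpiexec --map-by :OVERSUBSCRIBE -n 10 python test/test_spawn.py -v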
Afterwards, try the following:
- I cannot run in singleton mode:
$ python test/test_spawn.py -v
[kw61149:525865] shmem: mmap: an error occurred while determining whether or not /tmp/ompi.kw61149.1000/jf.0/3608084480/shared_mem_cuda_pool.kw61149 could be created.
[kw61149:525865] create_and_attach: unable to create shared memory BTL coordinating structure :: size 134217728
[0@kw61149] Python 3.12.1 (/usr/bin/python)
[0@kw61149] numpy 1.26.3 (/home/dalcinl/.local/lib/python3.12/site-packages/numpy)
[0@kw61149] MPI 3.1 (Open MPI 5.1.0)
[0@kw61149] mpi4py 4.0.0.dev0 (/home/dalcinl/Devel/mpi4py/src/mpi4py)
testArgsBad (__main__.TestSpawnMultipleSelf.testArgsBad) ... ok
testArgsOnlyAtRoot (__main__.TestSpawnMultipleSelf.testArgsOnlyAtRoot) ... ok
testCommSpawn (__main__.TestSpawnMultipleSelf.testCommSpawn) ... ok
testCommSpawnDefaults1 (__main__.TestSpawnMultipleSelf.testCommSpawnDefaults1) ... prte: ../../../../../ompi/3rd-party/openpmix/src/class/pmix_list.c:62: pmix_list_item_destruct: Assertion `0 == item->pmix_list_item_refcount' failed.
ERROR
testCommSpawnDefaults2 (__main__.TestSpawnMultipleSelf.testCommSpawnDefaults2) ... ERROR
...
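For context, these spawn tests boil down to something like the following minimal sketch (my simplification, not the actual test code): a script that spawns copies of itself over COMM_SELF and exchanges a pickled message with the children.
import sys
from mpi4py import MPI

parent = MPI.Comm.Get_parent()
if parent == MPI.COMM_NULL:
    # parent: spawn two copies of this very script as children
    child = MPI.COMM_SELF.Spawn(sys.executable, args=[__file__], maxprocs=2)
    child.bcast("hello", root=MPI.ROOT)  # the single parent rank is the bcast root
    child.Disconnect()
else:
    # child: receive the broadcast from the parent, then disconnect
    msg = parent.bcast(None, root=0)
    parent.Disconnect()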
- The following test fails when using a large number of MPI processes, say 10; you may need more:
mpiexec -n 10 python test/test_spawn.py -v
Sometimes I get a segfault, sometimes a deadlock, and a few times the run completes successfully.
The following narrowed-down test may help figure out the problem:
mpiexec -n 10 python test/test_spawn.py -v -k testArgsOnlyAtRoot
It may run OK many times, but eventually I get a failure and the following output:
testArgsOnlyAtRoot (__main__.TestSpawnSingleSelfMany.testArgsOnlyAtRoot) ... [kw61149:00000] *** An error occurred in Socket closed
This other narrowed-down test also has issues, but it does not always fail (a sketch of what these spawn-under-mpiexec tests exercise follows the traceback below):
mpiexec -n 10 python test/test_spawn.py -v -k testNoArgs
[kw61149:1826801] *** Process received signal ***
[kw61149:1826801] Signal: Segmentation fault (11)
[kw61149:1826801] Signal code: Address not mapped (1)
[kw61149:1826801] Failing at address: 0x180
[kw61149:1826801] [ 0] /lib64/libc.so.6(+0x3e9a0)[0x7fea10eaa9a0]
[kw61149:1826801] [ 1] /home/devel/mpi/openmpi/dev/lib/libmpi.so.0(+0x386b4a)[0x7fea02786b4a]
[kw61149:1826801] [ 2] /home/devel/mpi/openmpi/dev/lib/libmpi.so.0(mca_pml_ob1_recv_frag_callback_match+0x123)[0x7fea02788d32]
[kw61149:1826801] [ 3] /home/devel/mpi/openmpi/dev/lib/libopen-pal.so.0(+0xc7661)[0x7fea02384661]
[kw61149:1826801] [ 4] /home/devel/mpi/openmpi/dev/lib/libevent_core-2.1.so.7(+0x1c645)[0x7fea02ea6645]
[kw61149:1826801] [ 5] /home/devel/mpi/openmpi/dev/lib/libevent_core-2.1.so.7(event_base_loop+0x47f)[0x7fea02ea6ccf]
[kw61149:1826801] [ 6] /home/devel/mpi/openmpi/dev/lib/libopen-pal.so.0(+0x23ef1)[0x7fea022e0ef1]
[kw61149:1826801] [ 7] /home/devel/mpi/openmpi/dev/lib/libopen-pal.so.0(opal_progress+0xa7)[0x7fea022e0faa]
[kw61149:1826801] [ 8] /home/devel/mpi/openmpi/dev/lib/libopen-pal.so.0(ompi_sync_wait_mt+0x1cd)[0x7fea0232ca1a]
[kw61149:1826801] [ 9] /home/devel/mpi/openmpi/dev/lib/libmpi.so.0(+0x6c4bf)[0x7fea0246c4bf]
[kw61149:1826801] [10] /home/devel/mpi/openmpi/dev/lib/libmpi.so.0(ompi_comm_nextcid+0x7a)[0x7fea0246e0cf]
[kw61149:1826801] [11] /home/devel/mpi/openmpi/dev/lib/libmpi.so.0(ompi_dpm_connect_accept+0x3cd6)[0x7fea0247ffca]
[kw61149:1826801] [12] /home/devel/mpi/openmpi/dev/lib/libmpi.so.0(ompi_dpm_dyn_init+0xc6)[0x7fea02489df8]
[kw61149:1826801] [13] /home/devel/mpi/openmpi/dev/lib/libmpi.so.0(ompi_mpi_init+0x750)[0x7fea024abd47]
[kw61149:1826801] [14] /home/devel/mpi/openmpi/dev/lib/libmpi.so.0(PMPI_Init_thread+0xdc)[0x7fea02513c4a]
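For reference, under mpiexec the spawn is collective over a multi-rank communicator rather than COMM_SELF; a minimal sketch of that pattern (my simplification based on the test names, not the actual test code):
import sys
from mpi4py import MPI

comm = MPI.COMM_WORLD
parent = MPI.Comm.Get_parent()
if parent == MPI.COMM_NULL:
    # parents: spawn children collectively over COMM_WORLD;
    # command, args and maxprocs are significant only at the root rank
    child = comm.Spawn(sys.executable, args=[__file__], maxprocs=2, root=0)
    child.Disconnect()
else:
    # children: just disconnect from the parents
    parent.Disconnect()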
- The following test deadlocks when running in 4 or more MPI processes:
mpiexec -n 4 python test/test_dynproc.py -v
It may run occasionally, but most of the time it deadlocks (a sketch of the pattern this test exercises follows the output below).
[kw61149:00000] *** reported by process [3119841281,6]
[kw61149:00000] *** on a NULL communicator
[kw61149:00000] *** Unknown error
[kw61149:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[kw61149:00000] *** and MPI will try to terminate your MPI job as well)
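For context, test_dynproc.py boils down to something like the following port-based connect/accept pattern (a rough sketch of mine, not the actual test code; run with at least 2 processes, the command above uses 4):
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# split COMM_WORLD into a "server" half and a "client" half
color = 0 if rank < size // 2 else 1
local = comm.Split(color, key=rank)

# the server root opens a port and shares its name with everyone
port = MPI.Open_port() if rank == 0 else None
port = comm.bcast(port, root=0)

# the two halves meet through Accept/Connect, then tear the link down
inter = local.Accept(port, root=0) if color == 0 else local.Connect(port, root=0)
inter.Disconnect()
if rank == 0:
    MPI.Close_port(port)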
cc @hppritcha