We are observing a problem with the MPICH comm_idup test.
I found it while running MTT verification on my PRs #3375, #3376, and #3377, and originally thought they were causing this problem, because we were not observing this behavior in the SLURM/PMIx MTT results that we provide to the community.
However, I noticed that I was running the verification MTT on a different cluster (with mlx5 adapters), while we usually use another one (with mlx4). When I launched our regular MTT set, which tests current ompi/v2.x, on this mlx5 cluster, I found that the error occurs there as well: https://mtt.open-mpi.org/index.php?do_redir=2415
And here is the git log for the corresponding MTT directory (which shows that no CID tweak commit is present there):
$ pwd
/hpc/mtr_scrap/users/boriska/scratch/pmi/20170420_105919_24997_119393_vegas10/mpi-install/mkNl/src/ompi.git
$ git log --oneline
90c1d76 Merge pull request #3333 from rhc54/cmr2x/backward
55dea27 Add PMIx commit that disables build of pmi-1 and pmi-2 backward compatiblity support, and update OPAL pmix112 configure.m4 to se
9d7e7a8 Merge pull request #3318 from bwbarrett/v2.x
83b0a37 Merge pull request #3314 from hjelmn/v2.x_osc_rdma
5116232 osc/rdma: fix typo in atomic code
16fde29 build: Fix platform detection on FreeBSD
Looking closer into the problem, I see the following symptoms:
- Some procs are waiting in MPI_Wait here:
Thread 1 (Thread 0x7ffff7faf740 (LWP 16680)):
#0 0x00007fffebdd2172 in mlx5_poll_cq () from /usr/lib64/libmlx5-rdmav2.so
#1 0x00007fffea9571e3 in ibv_poll_cq (wc=0x7fffffff9b60, num_entries=<optimized out>, cq=<optimized out>) at /usr/include/infiniband/ve
#2 poll_device (device=device@entry=0x7a6c30, count=count@entry=0) at btl_openib_component.c:3581
#3 0x00007fffea957d55 in progress_one_device (device=0x7a6c30) at btl_openib_component.c:3714
#4 btl_openib_component_progress () at btl_openib_component.c:3738
#5 0x00007ffff703ed7c in opal_progress () at runtime/opal_progress.c:225
#6 0x00007ffff7b5a13d in ompi_request_wait_completion (req=0x93ced8) at ../ompi/request/request.h:392
#7 ompi_request_default_wait (req_ptr=0x7fffffffcc78, status=0x0) at request/req_wait.c:41
#8 0x00007ffff7b768ca in PMPI_Wait (request=0x7fffffffcc78, status=<optimized out>) at pwait.c:70
#9 0x000000000040277b in main (argc=1, argv=0x7fffffffcdb8) at comm_idup.c:62
- And some of them are hanging in PMPI_Ssend here:
Thread 1 (Thread 0x7ffff7faf740 (LWP 6028)):
#0 0x00007fffebbaca6e in mlx5_poll_cq () from /usr/lib64/libmlx5-rdmav2.so
#1 0x00007fffea9571e3 in ibv_poll_cq (wc=0x7fffffff99d0, num_entries=<optimized out>, cq=<optimized out>) at /usr/include/infiniband/ve
#2 poll_device (device=device@entry=0x7a70c0, count=count@entry=0) at btl_openib_component.c:3581
#3 0x00007fffea957d55 in progress_one_device (device=0x7a70c0) at btl_openib_component.c:3714
#4 btl_openib_component_progress () at btl_openib_component.c:3738
#5 0x00007ffff7041d7c in opal_progress () at runtime/opal_progress.c:225
#6 0x00007ffff7b5be53 in ompi_request_default_test_all (count=<optimized out>, requests=0x99d130, completed=<optimized out>, statuses=<
#7 0x00007fffe98ced4a in NBC_Progress (handle=handle@entry=0x99cf68) at nbc.c:329
#8 0x00007fffe98ce30b in ompi_coll_libnbc_progress () at coll_libnbc_component.c:275
#9 0x00007ffff7041d7c in opal_progress () at runtime/opal_progress.c:225
#10 0x00007fffe9f04e3d in ompi_request_wait_completion (req=<optimized out>) at ../../../../ompi/request/request.h:392
#11 mca_pml_ob1_send (buf=<optimized out>, count=<optimized out>, datatype=<optimized out>, dst=<optimized out>, tag=0, sendmode=<optimi
#12 0x00007ffff7b761c2 in PMPI_Ssend (buf=<optimized out>, count=<optimized out>, type=<optimized out>, dest=<optimized out>, tag=<optim
#13 0x000000000040276a in main (argc=1, argv=0x7fffffffcdb8) at comm_idup.c:61
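For reference, the two backtraces point at comm_idup.c lines 61-62, i.e. an MPI_Ssend on the parent communicator followed by an MPI_Wait on the request returned by MPI_Comm_idup. Below is a minimal sketch of that interleaving (not the actual MPICH test source, which is more involved; ranks, counts, and tags here are only illustrative):

```c
/* Minimal sketch of the MPI_Comm_idup + MPI_Ssend + MPI_Wait pattern
 * implied by the backtraces above; NOT the real MPICH comm_idup test. */
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, buf = 0;
    MPI_Comm newcomm;
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Start a non-blocking communicator duplication ... */
    MPI_Comm_idup(MPI_COMM_WORLD, &newcomm, &req);

    /* ... and overlap it with point-to-point traffic on the parent
     * communicator (roughly what comm_idup.c:61 does). Some ranks never
     * return from this MPI_Ssend, spinning in libnbc/openib progress. */
    if (rank == 0) {
        MPI_Ssend(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    /* Complete the duplication (roughly comm_idup.c:62); the other ranks
     * hang here waiting for the idup request to complete. */
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    MPI_Comm_free(&newcomm);
    MPI_Finalize();
    return 0;
}
```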
Checking the previous days, I see that the cluster with mlx4 adapters doesn't have this problem (no timed-out tests): https://mtt.open-mpi.org/index.php?do_redir=2416