We are observing a problem with the MPICH comm_idup test.
I found it while running MTT verification on my PRs #3375, #3376, and #3377, and originally thought they were causing this problem, because we were not observing this behavior in the SLURM/PMIx MTT results that we provide to the community.
However, I noticed that I was running the verification MTT on a different cluster (with mlx5 adapters), while we usually use another one (with mlx4). When I launched our regular MTT set, which tests current ompi/v2.x, on this mlx5 cluster, I found that the error occurs there as well: https://mtt.open-mpi.org/index.php?do_redir=2415
And here is the git log for the corresponding MTT directory (which shows that no CID tweak commit is present there):
$ pwd
/hpc/mtr_scrap/users/boriska/scratch/pmi/20170420_105919_24997_119393_vegas10/mpi-install/mkNl/src/ompi.git
$ git log --oneline
90c1d76 Merge pull request #3333 from rhc54/cmr2x/backward
55dea27 Add PMIx commit that disables build of pmi-1 and pmi-2 backward compatiblity support, and update OPAL pmix112 configure.m4 to se
9d7e7a8 Merge pull request #3318 from bwbarrett/v2.x
83b0a37 Merge pull request #3314 from hjelmn/v2.x_osc_rdma
5116232 osc/rdma: fix typo in atomic code
16fde29 build: Fix platform detection on FreeBSD
Looking closer into the problem, I see the following symptoms:
- Some procs are waiting in MPI_Wait here:
Thread 1 (Thread 0x7ffff7faf740 (LWP 16680)):
#0 0x00007fffebdd2172 in mlx5_poll_cq () from /usr/lib64/libmlx5-rdmav2.so
#1 0x00007fffea9571e3 in ibv_poll_cq (wc=0x7fffffff9b60, num_entries=<optimized out>, cq=<optimized out>) at /usr/include/infiniband/ve
#2 poll_device (device=device@entry=0x7a6c30, count=count@entry=0) at btl_openib_component.c:3581
#3 0x00007fffea957d55 in progress_one_device (device=0x7a6c30) at btl_openib_component.c:3714
#4 btl_openib_component_progress () at btl_openib_component.c:3738
#5 0x00007ffff703ed7c in opal_progress () at runtime/opal_progress.c:225
#6 0x00007ffff7b5a13d in ompi_request_wait_completion (req=0x93ced8) at ../ompi/request/request.h:392
#7 ompi_request_default_wait (req_ptr=0x7fffffffcc78, status=0x0) at request/req_wait.c:41
#8 0x00007ffff7b768ca in PMPI_Wait (request=0x7fffffffcc78, status=<optimized out>) at pwait.c:70
#9 0x000000000040277b in main (argc=1, argv=0x7fffffffcdb8) at comm_idup.c:62
- And some of them are hanging in PMPI_Ssend here:
Thread 1 (Thread 0x7ffff7faf740 (LWP 6028)):
#0 0x00007fffebbaca6e in mlx5_poll_cq () from /usr/lib64/libmlx5-rdmav2.so
#1 0x00007fffea9571e3 in ibv_poll_cq (wc=0x7fffffff99d0, num_entries=<optimized out>, cq=<optimized out>) at /usr/include/infiniband/ve
#2 poll_device (device=device@entry=0x7a70c0, count=count@entry=0) at btl_openib_component.c:3581
#3 0x00007fffea957d55 in progress_one_device (device=0x7a70c0) at btl_openib_component.c:3714
#4 btl_openib_component_progress () at btl_openib_component.c:3738
#5 0x00007ffff7041d7c in opal_progress () at runtime/opal_progress.c:225
#6 0x00007ffff7b5be53 in ompi_request_default_test_all (count=<optimized out>, requests=0x99d130, completed=<optimized out>, statuses=<
#7 0x00007fffe98ced4a in NBC_Progress (handle=handle@entry=0x99cf68) at nbc.c:329
#8 0x00007fffe98ce30b in ompi_coll_libnbc_progress () at coll_libnbc_component.c:275
#9 0x00007ffff7041d7c in opal_progress () at runtime/opal_progress.c:225
#10 0x00007fffe9f04e3d in ompi_request_wait_completion (req=<optimized out>) at ../../../../ompi/request/request.h:392
#11 mca_pml_ob1_send (buf=<optimized out>, count=<optimized out>, datatype=<optimized out>, dst=<optimized out>, tag=0, sendmode=<optimi
#12 0x00007ffff7b761c2 in PMPI_Ssend (buf=<optimized out>, count=<optimized out>, type=<optimized out>, dest=<optimized out>, tag=<optim
#13 0x000000000040276a in main (argc=1, argv=0x7fffffffcdb8) at comm_idup.c:61
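For reference, the two backtraces point at comm_idup.c lines 61-62, i.e. an MPI_Ssend on the parent communicator followed by an MPI_Wait on the request returned by MPI_Comm_idup. Below is a minimal sketch of that interleaving (not the actual MPICH test source, which is more involved; ranks, counts, and tags here are only illustrative):

```c
/* Minimal sketch of the MPI_Comm_idup + MPI_Ssend + MPI_Wait pattern
 * implied by the backtraces above; NOT the real MPICH comm_idup test. */
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, buf = 0;
    MPI_Comm newcomm;
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Start a non-blocking communicator duplication ... */
    MPI_Comm_idup(MPI_COMM_WORLD, &newcomm, &req);

    /* ... and overlap it with point-to-point traffic on the parent
     * communicator (roughly what comm_idup.c:61 does). Some ranks never
     * return from this MPI_Ssend, spinning in libnbc/openib progress. */
    if (rank == 0) {
        MPI_Ssend(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    /* Complete the duplication (roughly comm_idup.c:62); the other ranks
     * hang here waiting for the idup request to complete. */
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    MPI_Comm_free(&newcomm);
    MPI_Finalize();
    return 0;
}
```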
Checking the previous days, I see that the cluster with mlx4 adapters doesn't have this problem (no timed-out tests): https://mtt.open-mpi.org/index.php?do_redir=2416