http://bgate.mellanox.com/jenkins/job/gh-ompi-master-pr/2638/console -- which is a Jenkins run off #1821 -- shows a problem that we've been seeing in a few Jenkins runs: the rdmacm CPC in the openib BTL hangs during finalize.
Here's the command that is run:
/var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/bin/mpirun -np 8 -bind-to core --report-state-on-timeout --get-stack-traces --timeout 600 -mca btl_openib_receive_queues P,65536,256,192,128:S,128,256,192,128:S,2048,1024,1008,64:S,12288,1024,1008,64:S,65536,1024,1008,64 -mca btl_openib_cpc_include rdmacm -mca pml ob1 -mca btl self,openib -mca btl_if_include mlx4_0:2 /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/examples/hello_c
This warning comes up; I don't know if it's significant:
11:17:09 --------------------------------------------------------------------------
11:17:09 No OpenFabrics connection schemes reported that they were able to be
11:17:09 used on a specific port. As such, the openib BTL (OpenFabrics
11:17:09 support) will be disabled for this port.
11:17:09
11:17:09 Local host: jenkins01
11:17:09 Local device: mlx5_0
11:17:09 Local port: 1
11:17:09 CPCs attempted: rdmacm
11:17:09 --------------------------------------------------------------------------
But then all procs have a backtrace like this during finalize:
11:27:13 Thread 1 (Thread 0x7ffff73e3700 (LWP 10492)):
11:27:13 #0 0x0000003d6980b5bc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
11:27:13 #1 0x00007fffeeb81c6d in rdmacm_endpoint_finalize (endpoint=0x7fbdb0) at connect/btl_openib_connect_rdmacm.c:1229
11:27:13 #2 0x00007fffeeb6cdeb in mca_btl_openib_endpoint_destruct (endpoint=0x7fbdb0) at btl_openib_endpoint.c:368
11:27:13 #3 0x00007fffeeb559b7 in opal_obj_run_destructors (object=0x7fbdb0) at ../../../../opal/class/opal_object.h:460
11:27:13 #4 0x00007fffeeb5ae97 in mca_btl_openib_del_procs (btl=0x748ed0, nprocs=1, procs=0x7fffffffc768, peers=0x802fa0) at btl_openib.c:1328
11:27:13 #5 0x00007fffeefb2159 in mca_bml_r2_del_procs (nprocs=8, procs=0x78bf60) at bml_r2.c:623
11:27:13 #6 0x00007fffee2ba612 in mca_pml_ob1_del_procs (procs=0x78bf60, nprocs=8) at pml_ob1.c:455
11:27:13 #7 0x00007ffff7ca4b94 in ompi_mpi_finalize () at runtime/ompi_mpi_finalize.c:333
11:27:13 #8 0x00007ffff7cd6fe1 in PMPI_Finalize () at pfinalize.c:47
11:27:13 #9 0x0000000000400890 in main (argc=1, argv=0x7fffffffcb38) at hello_c.c:24
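For context, here is a minimal, generic sketch of the pattern the trace points at (this is not the Open MPI source; names like cm_event_thread, endpoint_finalize, and the disconnected flag are hypothetical). The main thread is parked in pthread_cond_wait inside rdmacm_endpoint_finalize, and that wait can only return if another thread, presumably the rdmacm event thread processing the disconnect, signals the condition. If that signal never arrives, finalize blocks forever:

/* Illustration only: NOT the Open MPI code, just the wait pattern
 * visible in the backtrace above. */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

struct endpoint {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    bool            disconnected;   /* set by the CM event thread */
};

/* Stand-in for the rdmacm event loop running in its own thread.
 * In the failing runs, the equivalent of this signal apparently
 * never happens, so the finalize below waits forever. */
static void *cm_event_thread(void *arg)
{
    struct endpoint *ep = arg;
    sleep(1);                        /* pretend a DISCONNECTED event arrives */
    pthread_mutex_lock(&ep->lock);
    ep->disconnected = true;
    pthread_cond_signal(&ep->cond);
    pthread_mutex_unlock(&ep->lock);
    return NULL;
}

/* Called from the main thread during finalize-style teardown. */
static void endpoint_finalize(struct endpoint *ep)
{
    pthread_mutex_lock(&ep->lock);
    while (!ep->disconnected) {
        /* This is the frame the stack trace is stuck in:
         * pthread_cond_wait with nobody left to signal it. */
        pthread_cond_wait(&ep->cond, &ep->lock);
    }
    pthread_mutex_unlock(&ep->lock);
}

int main(void)
{
    struct endpoint ep = {
        .lock = PTHREAD_MUTEX_INITIALIZER,
        .cond = PTHREAD_COND_INITIALIZER,
        .disconnected = false,
    };
    pthread_t tid;

    pthread_create(&tid, NULL, cm_event_thread, &ep);
    endpoint_finalize(&ep);          /* hangs if cm_event_thread never signals */
    pthread_join(tid, NULL);
    puts("finalize completed");
    return 0;
}

(Compile with gcc -pthread; remove the signaling thread and you get the same stuck pthread_cond_wait frame.)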
@jladd-mlnx @artpol84 Can you look into this?
@larrystevenwise @bharatpotnuri Is this happening on the v2.x branch for iWARP?
@hppritcha Is this a v2.0.0 or a v2.0.1 item? IIRC, rdmacm is a non-default CPC for IB, and using it requires adding a per-peer QP. I thought we had this conversation before about a previous rdmacm CPC error: we agreed to push it to v2.0.1 (but then didn't, because it also affected iWARP). Is my memory correct?