@hjelmn hjelmn commented Apr 27, 2016

This commit fixes a race between a thread calling the tcp btl's
add_procs and a thread processing an incoming connection. The race
occurred because the add_procs thread adds a newly created proc object
to the hash table before the object is fully initialized. The
connection thread then attempts to use the object before the endpoints
array on the object has been allocated. The fix is to only add the
proc to the hash table after it has been completely initialized.

Signed-off-by: Nathan Hjelm [email protected]
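
For readers unfamiliar with the pattern, here is a minimal sketch of "initialize fully, then publish" as described above. It is not the actual btl_tcp code: every name in it (sketch_proc_t, proc_table, sketch_proc_create, sketch_proc_lookup, sketch_proc_insert) is hypothetical, and a fixed-size array guarded by a mutex stands in for the real hash table.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

#define TABLE_SIZE 64   /* hypothetical fixed-size table, one proc per slot */

/* Hypothetical stand-in for the per-peer proc object. */
typedef struct sketch_proc {
    int     peer_id;
    void  **endpoints;          /* must be allocated before publication */
    size_t  endpoint_count;
    size_t  endpoint_capacity;
} sketch_proc_t;

static sketch_proc_t  *proc_table[TABLE_SIZE];
static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;

/* Return the published proc for peer_id, or NULL if none is published. */
static sketch_proc_t *sketch_proc_lookup(int peer_id)
{
    pthread_mutex_lock(&table_lock);
    sketch_proc_t *proc = proc_table[(unsigned)peer_id % TABLE_SIZE];
    if (proc != NULL && proc->peer_id != peer_id) {
        proc = NULL;            /* slot is held by a different peer */
    }
    pthread_mutex_unlock(&table_lock);
    return proc;
}

/* Create a proc, finish all initialization, and only then publish it. */
static sketch_proc_t *sketch_proc_create(int peer_id, size_t max_endpoints)
{
    assert(max_endpoints > 0);

    sketch_proc_t *proc = calloc(1, sizeof(*proc));
    if (proc == NULL) {
        return NULL;
    }
    proc->peer_id = peer_id;
    proc->endpoints = calloc(max_endpoints, sizeof(*proc->endpoints));
    if (proc->endpoints == NULL) {
        free(proc);
        return NULL;
    }
    proc->endpoint_capacity = max_endpoints;

    /* Publication is the last step: a concurrent lookup sees either no
     * entry at all or a proc whose endpoints array already exists. */
    pthread_mutex_lock(&table_lock);
    size_t slot = (unsigned)peer_id % TABLE_SIZE;
    sketch_proc_t *existing = proc_table[slot];
    if (existing == NULL) {
        proc_table[slot] = proc;
    }
    pthread_mutex_unlock(&table_lock);

    if (existing != NULL) {
        /* Another thread published first, or the slot collides. */
        free(proc->endpoints);
        free(proc);
        return existing->peer_id == peer_id ? existing : NULL;
    }
    return proc;
}

/* Append an endpoint; safe on any proc obtained from the table. */
static int sketch_proc_insert(sketch_proc_t *proc, void *endpoint)
{
    if (proc->endpoint_count >= proc->endpoint_capacity) {
        return -1;
    }
    proc->endpoints[proc->endpoint_count++] = endpoint;
    return 0;
}
```

In the buggy ordering the table store would happen right after the first calloc, so a lookup from the connection thread could return a proc whose endpoints pointer was still NULL. The sketch leaves out the per-proc lock that would be needed if several threads appended endpoints to the same proc.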


hjelmn commented Apr 27, 2016

@bosilca I was able to reproduce the bug found in Jeff's MTT using the thread-tests-1.1 suite. Here is an example backtrace:

Core was generated by `./overlap'.
Program terminated with signal 11, Segmentation fault.
#0  0x00002adb6ed2b8a0 in mca_btl_tcp_proc_insert (btl_proc=0x2adb880010e0, btl_endpoint=0x3b6eba0) at ../../../../../opal/mca/btl/tcp/btl_tcp_proc.c:416
416     btl_proc->proc_endpoints[btl_proc->proc_endpoint_count++] = btl_endpoint;
Missing separate debuginfos, use: debuginfo-install glib2-2.28.8-4.el6.x86_64 hcoll-3.3.768-1.x86_64 infiniband-diags-1.6.4-2chaos.ch5.2.x86_64 infinipath-psm-3.1c-3chaos.ch5.3.x86_64 libgcc-4.4.7-16.el6.x86_64 libibcm-1.0.5-3.el6.x86_64 libibmad-1.3.11-1.el6.x86_64 libibumad-1.3.9-1.el6.x86_64 libibverbs-1.1.8-4.el6.x86_64 libipathverbs-1.3-3.el6_5.x86_64 libmlx4-1.0.6-7.el6.x86_64 libmlx5-1.0.2-1.el6.x86_64 libmthca-1.0.6-4.el6.x86_64 libnl-1.1.4-2.el6.x86_64 librdmacm-1.0.19.1-1.el6.x86_64 libudev-147-2.63.el6.x86_64 libxml2-2.7.6-20.el6.x86_64 munge-libs-0.5.11-1.ch5.1.1.x86_64 mxm-3.4.3065-1.x86_64 numactl-2.0.9-2.el6.x86_64 opensm-libs-3.3.19-1chaos.ch5.3.x86_64 slurm-2.3.3-1.21chaos.ch5.4.x86_64 tcp_wrappers-libs-7.6-57.el6.x86_64 zlib-1.2.3-29.el6.x86_64
(gdb) bt
#0  0x00002adb6ed2b8a0 in mca_btl_tcp_proc_insert (btl_proc=0x2adb880010e0, btl_endpoint=0x3b6eba0) at ../../../../../opal/mca/btl/tcp/btl_tcp_proc.c:416
#1  0x00002adb6ed20b4c in mca_btl_tcp_add_procs (btl=0xf084b0, nprocs=1, procs=0x7ffebdb7a458, peers=0x7ffebdb7a468, reachable=0x0) at ../../../../../opal/mca/btl/tcp/btl_tcp.c:117
#2  0x00002adb6c4b050a in mca_bml_r2_add_proc (proc=0x2adb88000a70) at ../../../../../ompi/mca/bml/r2/bml_r2.c:396
#3  0x00002adb6c65768b in mca_bml_base_get_endpoint (proc=0x2adb88000a70) at ../../../../../ompi/mca/bml/base/base.h:72
#4  0x00002adb6c658986 in mca_pml_ob1_isend (buf=0x0, count=0, datatype=0x601660, dst=29, tag=-16, sendmode=MCA_PML_BASE_SEND_STANDARD, comm=0x601860, request=0x7ffebdb7a658)
    at ../../../../../ompi/mca/pml/ob1/pml_ob1_isend.c:139
#5  0x00002adb6c4ce854 in ompi_coll_base_sendrecv_zero (dest=29, stag=-16, source=29, rtag=-16, comm=0x601860) at ../../../../ompi/mca/coll/base/coll_base_barrier.c:59
#6  0x00002adb6c4ced91 in ompi_coll_base_barrier_intra_recursivedoubling (comm=0x601860, module=0xf6b2d0) at ../../../../ompi/mca/coll/base/coll_base_barrier.c:228
#7  0x00002adb6c4dea31 in ompi_coll_tuned_barrier_intra_dec_fixed (comm=0x601860, module=0xf6b2d0) at ../../../../../ompi/mca/coll/tuned/coll_tuned_decision_fixed.c:212
#8  0x00002adb6c40a679 in PMPI_Barrier (comm=0x601860) at pbarrier.c:63
#9  0x0000000000400eb0 in main ()

Here is the conflicting thread's backtrace:

(gdb) t 4
[Switching to thread 4 (Thread 0x2adb86d5b700 (LWP 14681))]#0  __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136
136 2:  movl    %edx, %eax
(gdb) bt
#0  __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136
#1  0x00002adb6c9eb5d8 in _L_lock_854 () from /var/lib/perceus/vnfs/compute/rootfs/lib64/libpthread-2.12.so
#2  0x00002adb6c9eb4a7 in __pthread_mutex_lock (mutex=0x2adb88001178) at pthread_mutex_lock.c:61
#3  0x00002adb6ed209c5 in mca_btl_tcp_add_procs (btl=0xf084b0, nprocs=1, procs=0x2adb86d5aa88, peers=0x2adb86d5aa90, reachable=0x0) at ../../../../../opal/mca/btl/tcp/btl_tcp.c:95
#4  0x00002adb6ed2cf85 in mca_btl_tcp_proc_lookup (name=0x2adb86d5ab70) at ../../../../../opal/mca/btl/tcp/btl_tcp_proc.c:758
#5  0x00002adb6ed25eb7 in mca_btl_tcp_component_recv_handler (sd=20, flags=2050, user=0x2adb88000a70) at ../../../../../opal/mca/btl/tcp/btl_tcp_component.c:1301
#6  0x00002adb6ed4035c in event_process_active_single_queue (base=0xcad810, flags=2) at ../../../../../../opal/mca/event/libevent2022/libevent/event.c:1370
#7  event_process_active (base=0xcad810, flags=2) at ../../../../../../opal/mca/event/libevent2022/libevent/event.c:1440
#8  opal_libevent2022_event_base_loop (base=0xcad810, flags=2) at ../../../../../../opal/mca/event/libevent2022/libevent/event.c:1644
#9  0x00002adb6ec738c2 in opal_progress () at ../../opal/runtime/opal_progress.c:171
#10 0x00002adb6c652f6a in opal_condition_wait (c=0x2adb6c9ba600, m=0x2adb6c9ba580) at ../../../../../opal/threads/condition.h:63
#11 0x00002adb6c653527 in ompi_request_wait_completion (req=0xf50300) at ../../../../../ompi/request/request.h:383
#12 0x00002adb6c6548a0 in mca_pml_ob1_recv (addr=0x0, count=0, datatype=0x601660, src=13, tag=34532, comm=0x601860, status=0x0) at ../../../../../ompi/mca/pml/ob1/pml_ob1_irecv.c:123
#13 0x00002adb6c43c940 in PMPI_Recv (buf=0x0, count=0, type=0x601660, source=13, tag=34532, comm=0x601860, status=0x0) at precv.c:79
#14 0x0000000000400c93 in threadfunc ()
#15 0x00002adb6c9e9aa1 in start_thread (arg=0x2adb86d5b700) at pthread_create.c:301
#16 0x00002adb6cce793d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115


hjelmn commented Apr 27, 2016

Since this is passing I will merge now. @bosilca Please review the 2.0.0 PR once it is open.

@hjelmn hjelmn merged commit 936dfe5 into open-mpi:master Apr 27, 2016