MTT pmix/libevent failure #5336

@PeterGottesman

Many MTT runs are failing with SIGABRT. The example I will use is https://mtt.open-mpi.org/index.php?do_redir=2642:

$ mpirun -np 32 --mca orte_startup_timeout 10000 --mca oob tcp --mca btl vader,tcp,self --mca btl_tcp_progress_thread 1 datatype/aint
[mpi006:14144] OPAL ERROR: Error in file pmix2x.c at line 326
<snip>

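I am unable to reproduce the error, but I have the core dump from the above run. The failure appears to occur in the pmix2x event handler (pmix2x_event_hdlr, frame #5 in the backtrace below), which runs on the PMIx progress thread, but I am unsure which event triggers it.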

(gdb) bt
#0  0x0000003f4bc32925 in raise () from /lib64/libc.so.6
#1  0x0000003f4bc34105 in abort () from /lib64/libc.so.6
#2  0x00002aaaab4edaf1 in opal_mutex_unlock (m=0x2aaaab9a5220 <mutex>) at ../../opal/threads/mutex_unix.h:155
#3  0x00002aaaab4eff1c in output (output_id=0, format=0x2aaaab718d80 "OPAL ERROR: %s in file %s at line %d", arglist=0x2aaaac3d8938) at output.c:1007
#4  0x00002aaaab4ee2fd in opal_output (output_id=0, format=0x2aaaab718d80 "OPAL ERROR: %s in file %s at line %d") at output.c:372
#5  0x00002aaaab5ae4f0 in pmix2x_event_hdlr (evhdlr_registration_id=3, status=-147, source=0x2aaab401dfec, info=0x2aaab401e150, ninfo=18446744073709551615, results=0x0, nresults=0,
    cbfunc=0x2aaaab646294 <progress_local_event_hdlr>, cbdata=0x2aaab401df10) at pmix2x.c:326
#6  0x00002aaaab6483b8 in pmix_invoke_local_event_hdlr (chain=0x2aaab401df10) at event/pmix_event_notification.c:770
#7  0x00002aaaab64f553 in check_cached_events (cd=0x6abb50) at event/pmix_event_registration.c:411
#8  0x00002aaaab64d82a in regevents_cbfunc (peer=0x690890, hdr=0x2aaab401dd80, buf=0x2aaaac3d8d00, cbdata=0x2aaab401db20) at event/pmix_event_registration.c:116
#9  0x00002aaaab6b2c16 in pmix_ptl_base_process_msg (fd=-1, flags=4, cbdata=0x2aaab401dca0) at base/ptl_base_sendrecv.c:711
#10 0x00002aaaab554fb3 in event_process_active_single_queue (base=0x690220, activeq=0x68dbf0) at event.c:1370
#11 0x00002aaaab555227 in event_process_active (base=0x690220) at event.c:1440
#12 0x00002aaaab55587a in opal_libevent2022_event_base_loop (base=0x690220, flags=1) at event.c:1644
#13 0x00002aaaab66a82b in progress_engine (obj=0x6901d8) at runtime/pmix_progress_threads.c:109
#14 0x0000003f4c0079d1 in start_thread () from /lib64/libpthread.so.0
#15 0x0000003f4bce8b6d in clone () from /lib64/libc.so.6
#16 0x0000000000000000 in ?? ()
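
A note on frames #1 to #3: abort() is reached from opal_mutex_unlock, which was called from output(). One way an unlock call can fail is when the calling thread does not own an error-checking mutex, in which case POSIX requires pthread_mutex_unlock to return EPERM. Whether opal_mutex_unlock aborts for that reason in this build is an assumption; the trace alone does not confirm it. The sketch below only demonstrates the POSIX behavior. The function name is hypothetical and none of this is OPAL code.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700
#endif
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Hypothetical demo, not OPAL code: create an error-checking mutex and
 * unlock it without having locked it first. Returns the code reported
 * by pthread_mutex_unlock. */
int unlock_unowned_errorcheck_mutex(void)
{
    pthread_mutexattr_t attr;
    pthread_mutex_t mutex;
    int rc;

    rc = pthread_mutexattr_init(&attr);
    assert(rc == 0);
    rc = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    assert(rc == 0);
    rc = pthread_mutex_init(&mutex, &attr);
    assert(rc == 0);
    pthread_mutexattr_destroy(&attr);

    /* The calling thread does not own the mutex, so an error-checking
     * mutex reports an error instead of silently succeeding. */
    rc = pthread_mutex_unlock(&mutex);

    pthread_mutex_destroy(&mutex);
    return rc;
}
```

If a wrapper treats that return code as fatal, the result would be an abort from inside the unlock call, which is the shape seen in frames #1 and #2. Checking which thread holds the output mutex in the core dump could help confirm or rule this out.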

I can also provide other information from the coredump on request.
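
One detail in frame #5 that may be worth checking: ninfo=18446744073709551615 is 2^64 - 1, the value an unsigned 64-bit count (presumably a size_t) takes when it holds the bit pattern of -1. That could point to an uninitialized or underflowed count and not to a real number of info entries, though the backtrace alone does not prove this. A minimal sketch of the arithmetic, using hypothetical helper names that are not part of PMIx or OPAL:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: reinterpret a signed 64-bit count as an unsigned
 * 64-bit one, as happens when a negative value is stored in a 64-bit
 * unsigned variable. */
uint64_t as_unsigned_count(int64_t n)
{
    return (uint64_t)n;
}

/* Returns 1 if the ninfo value printed in frame #5 equals -1 converted
 * to an unsigned 64-bit integer, 0 otherwise. */
int frame5_ninfo_is_minus_one(void)
{
    const uint64_t ninfo_from_frame5 = 18446744073709551615ULL;

    assert(UINT64_MAX == 18446744073709551615ULL);
    return as_unsigned_count(-1) == ninfo_from_frame5;
}
```

If ninfo really is garbage here, any loop over info[0..ninfo) in the handler would read far past the array, so the values of info and ninfo in frames #5 and #6 are a reasonable place to look next.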
