Closed
Labels
RTE (Issue likely is in RTE or PMIx areas), Severity: critical, State: Awaiting merge to release branches, Target: v3.1.x, bug
Description
Many MTT runs are failing with a SIGABRT. The example I will be using is https://mtt.open-mpi.org/index.php?do_redir=2642:
$ mpirun -np 32 --mca orte_startup_timeout 10000 --mca oob tcp --mca btl vader,tcp,self --mca btl_tcp_progress_thread 1 datatype/aint
[mpi006:14144] OPAL ERROR: Error in file pmix2x.c at line 326
<snip>
I am unable to reproduce the error, but I have the coredump from the above run. The issue seems to be in the pmix2x event loop, but I am unsure which event is failing.
(gdb) bt
#0 0x0000003f4bc32925 in raise () from /lib64/libc.so.6
#1 0x0000003f4bc34105 in abort () from /lib64/libc.so.6
#2 0x00002aaaab4edaf1 in opal_mutex_unlock (m=0x2aaaab9a5220 <mutex>) at ../../opal/threads/mutex_unix.h:155
#3 0x00002aaaab4eff1c in output (output_id=0, format=0x2aaaab718d80 "OPAL ERROR: %s in file %s at line %d", arglist=0x2aaaac3d8938) at output.c:1007
#4 0x00002aaaab4ee2fd in opal_output (output_id=0, format=0x2aaaab718d80 "OPAL ERROR: %s in file %s at line %d") at output.c:372
#5 0x00002aaaab5ae4f0 in pmix2x_event_hdlr (evhdlr_registration_id=3, status=-147, source=0x2aaab401dfec, info=0x2aaab401e150, ninfo=18446744073709551615, results=0x0, nresults=0,
cbfunc=0x2aaaab646294 <progress_local_event_hdlr>, cbdata=0x2aaab401df10) at pmix2x.c:326
#6 0x00002aaaab6483b8 in pmix_invoke_local_event_hdlr (chain=0x2aaab401df10) at event/pmix_event_notification.c:770
#7 0x00002aaaab64f553 in check_cached_events (cd=0x6abb50) at event/pmix_event_registration.c:411
#8 0x00002aaaab64d82a in regevents_cbfunc (peer=0x690890, hdr=0x2aaab401dd80, buf=0x2aaaac3d8d00, cbdata=0x2aaab401db20) at event/pmix_event_registration.c:116
#9 0x00002aaaab6b2c16 in pmix_ptl_base_process_msg (fd=-1, flags=4, cbdata=0x2aaab401dca0) at base/ptl_base_sendrecv.c:711
#10 0x00002aaaab554fb3 in event_process_active_single_queue (base=0x690220, activeq=0x68dbf0) at event.c:1370
#11 0x00002aaaab555227 in event_process_active (base=0x690220) at event.c:1440
#12 0x00002aaaab55587a in opal_libevent2022_event_base_loop (base=0x690220, flags=1) at event.c:1644
#13 0x00002aaaab66a82b in progress_engine (obj=0x6901d8) at runtime/pmix_progress_threads.c:109
#14 0x0000003f4c0079d1 in start_thread () from /lib64/libpthread.so.0
#15 0x0000003f4bce8b6d in clone () from /lib64/libc.so.6
#16 0x0000000000000000 in ?? ()
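One detail in frame #5 may be worth checking: `ninfo=18446744073709551615`. That is the largest unsigned 64-bit value, which is what -1 looks like when it is stored in an unsigned 64-bit field. The sketch below is only a decoding aid, not a diagnosis. It assumes `ninfo` is a 64-bit unsigned count on this platform (consistent with the 64-bit addresses in the trace), and the helper name `as_signed64` is made up for illustration.

```python
def as_signed64(value: int) -> int:
    """Reinterpret an unsigned 64-bit integer as a two's-complement signed one."""
    if not 0 <= value < 2**64:
        raise ValueError("value does not fit in 64 bits")
    # Values with the top bit set represent negative numbers.
    return value - 2**64 if value >= 2**63 else value


# Value copied from frame #5 of the backtrace above.
NINFO_FROM_FRAME_5 = 18446744073709551615

if __name__ == "__main__":
    print(as_signed64(NINFO_FROM_FRAME_5))  # prints -1
```

If that reading is right, the handler was invoked with an info count that is effectively -1 rather than a real array length. Whether this is related to the abort in `opal_mutex_unlock` is not something the backtrace alone shows.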
I can also provide other information from the coredump on request.