Create a separate OPAL event base #676
Conversation
Refer to this link for build results (access rights to CI server needed): Build Log
bot:retest
The mlnx Jenkins seems to be out to lunch.
Yeah, not sure what to do. I can't get this branch to fail on mustang (ConnectX-3) with the tests that were failing for the Mellanox Jenkins.
@miked-mellanox @alinask @jladd-mlnx Can you guys dig into what is going on with this one? No one can replicate it or figure out what the problem is. Thanks!
bot:retest |
@miked-mellanox Many thanks. |
bot:retest |
…s if we only have sm,self BTLs enabled, which is a rather unique use-case, so just disable it for now.
Test FAILed.

Test FAILed.
@miked-mellanox I'm afraid I'll need help here, Mike, if we want to close on the last "to-do" from the Jan meeting. This is failing in the mxm MTL, and I have no way of pursuing it.
@rhc54 -
jenkins@jenkins01 /tmp
$ timeout -s SIGKILL 10m /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace-2/ompi_install1/bin/mpirun -np 2 -bind-to core -x MXM_HANDLE_ERRORS=debug -mca btl_openib_if_include mlx5_0:1 -x MXM_RDMA_PORTS=mlx5_0:1 -mca pml cm -mca mtl mxm /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace-2/ompi_install1/examples/ring_c
Process 0 sending 10 to 1, tag 201 (2 processes in ring)
Process 0 sent to 1
Process 0 decremented value: 9
Process 0 decremented value: 8
Process 0 decremented value: 7
Process 0 decremented value: 6
Process 0 decremented value: 5
Process 0 decremented value: 4
Process 0 decremented value: 3
Process 0 decremented value: 2
Process 0 decremented value: 1
Process 0 decremented value: 0
Process 0 exiting
Process 1 exiting
jenkins@jenkins01 /tmp
$ timeout -s SIGKILL 10m /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace-2/ompi_install1/bin/mpirun -np 2 -bind-to core -x MXM_HANDLE_ERRORS=debug -mca btl_openib_if_include mlx5_0:1 -x MXM_RDMA_PORTS=mlx5_0:1 -mca pml cm -mca mtl mxm /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace-2/ompi_install1/examples/ring_c
Process 0 sending 10 to 1, tag 201 (2 processes in ring)
Process 0 sent to 1
Process 0 decremented value: 9
Process 0 decremented value: 8
Process 0 decremented value: 7
Process 0 decremented value: 6
Process 0 decremented value: 5
Process 0 decremented value: 4
Process 0 decremented value: 3
Process 0 decremented value: 2
Process 0 decremented value: 1
Process 0 decremented value: 0
Process 0 exiting
Process 1 exiting
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node jenkins01 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
...
Program terminated with signal 11, Segmentation fault.
#0 0x00007ffff69ea284 in ?? ()
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.132.el6.x86_64 libibverbs-1.1.8mlnx1-OFED.3.0.1.5.4.x86_64 libmlx4-1.0.6mlnx1-OFED.3.0.1.5.4.x86_64 libmlx5-1.0.2mlnx1-OFED.3.0.1.5.3.x86_64 libnl-1.1.4-2.el6.x86_64 libpciaccess-0.13.1-2.el6.x86_64 numactl-2.0.7-8.el6.x86_64
(gdb) bt
#0 0x00007ffff69ea284 in ?? ()
#1 0x00007ffff69d9d20 in ?? ()
#2 0x00007ffff5c37440 in ?? ()
#3 0x00007fffffffffff in ?? ()
#4 0x0000017bf775c580 in ?? ()
#5 0x00007ffff69ededa in ?? ()
#6 0x00007ffff775c560 in orte_finalize () at runtime/orte_finalize.c:95
#7 0x0000000000000001 in ?? ()
#8 0x0000000000679180 in ?? ()
#9 0x00007ffff69d9d90 in ?? ()
#10 0x00007ffff69eaaaf in ?? ()
#11 0x0000000000000000 in ?? ()
(gdb)
Also, dmesg shows the following:
ring_c[26301]: segfault at 7ffff69eb0af ip 00007ffff69eb0af sp 00007ffff69d9d30 error 14 in mca_sec_basic.so[7ffff6ff9000+2000]
ring_c[3092]: segfault at 7ffff69ea284 ip 00007ffff69ea284 sp 00007ffff69d9ce0 error 14 in mca_sec_basic.so[7ffff6ff9000+2000]
ring_c[13567]: segfault at 7ffff69eb0af ip 00007ffff69eb0af sp 00007ffff69d9d30 error 14 in mca_sec_basic.so[7ffff6ff9000+2000]
ring_c[13747]: segfault at 7ffff69eb0af ip 00007ffff69eb0af sp 00007ffff69d9d30 error 14 in mca_sec_basic.so[7ffff6ff9000+2000]
ring_c[13958]: segfault at 7ffff69eb0af ip 00007ffff69eb0af sp 00007ffff69d9d30 error 14 in mca_sec_basic.so[7ffff6ff9000+2000]
ring_c[13962]: segfault at 7ffff69eb0af ip 00007ffff69eb0af sp 00007ffff69d9d30 error 14 in mca_ess_pmi.so[7ffff6bf1000+4000]
ring_c[14098]: segfault at 7ffff69ea284 ip 00007ffff69ea284 sp 00007ffff69d9ce0 error 14 in mca_sec_basic.so[7ffff6ff9000+2000]
Create a separate OPAL event base (opal_progress_event_base) that is progressed only via call to opal_progress. Move the opal_event_base into an async progress thread. Update all the BTLs to use the opal_progress_event_base so they retain their prior behavior - BTL authors may want to evaluate their events to see if any should move to the async thread. Ensure the security credentials are initialized and properly checked before free'ing.
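As a gloss on that last sentence, here is a minimal sketch of the kind of init/free guard it describes; the `sec_cred_t` type and the helper names are hypothetical stand-ins, not the actual OPAL sec framework interface:

```c
/* Hypothetical sketch of "initialize, then check before free'ing";
 * the type and helpers are illustrative, not the real OPAL sec API. */
#include <stdlib.h>
#include <string.h>

typedef struct {
    char *credential;   /* stays NULL until the framework fills it in */
    size_t size;
} sec_cred_t;

static void cred_init(sec_cred_t *cred)
{
    /* Zeroing at init time is what makes the check in cred_release safe */
    memset(cred, 0, sizeof(*cred));
}

static void cred_release(sec_cred_t *cred)
{
    if (NULL == cred || NULL == cred->credential) {
        return;   /* never allocated, or already released */
    }
    free(cred->credential);
    cred->credential = NULL;  /* guard against double-free on a later call */
    cred->size = 0;
}
```

Freeing an uninitialized or already-freed credential during teardown would be consistent with the finalize-time crashes in mca_sec_basic.so shown in the backtraces above.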
Can you enable debug in your build and see a better stack trace? We're totally blocked here -- Mellanox is the only one who is able to reproduce this problem. :-( Can you push this a little further?
I tried a debug build. It failed with a segfault 6 times out of 100 runs. Run command (single node, no MXM) and backtrace:
Create a separate OPAL event base (opal_progress_event_base) that is progressed only via call to opal_progress. Move the opal_event_base into an async progress thread. Update all the BTLs to use the opal_progress_event_base so they retain their prior behavior - BTL authors may want to evaluate their events to see if any should move to the async thread.
This completes the Jan meeting deliverable.
@jsquyres @hjelmn please take a look
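For readers who want a concrete picture of the split described above, here is a rough sketch using raw libevent and pthreads. Only the two base names (opal_progress_event_base, opal_event_base) come from this PR; the init routine and thread function are illustrative stand-ins, not the actual OPAL implementation:

```c
/* Rough sketch of the two-event-base scheme, using raw libevent and
 * pthreads. Everything except the two base names is hypothetical. */
#include <event2/event.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

static struct event_base *opal_progress_event_base; /* polled from opal_progress() */
static struct event_base *opal_event_base;          /* driven by the async thread */
static volatile bool progress_thread_run = true;

/* Async progress thread: blocks in the event loop so events on
 * opal_event_base fire without anyone calling opal_progress(). A real
 * implementation keeps a persistent wakeup event registered so this
 * loop blocks instead of spinning when the base is empty. */
static void *progress_thread_fn(void *arg)
{
    (void) arg;
    while (progress_thread_run) {
        event_base_loop(opal_event_base, EVLOOP_ONCE);
    }
    return NULL;
}

/* Called from inside opal_progress(): a single non-blocking pass over
 * the synchronous base, preserving the BTLs' prior polling behavior. */
static void progress_poll(void)
{
    event_base_loop(opal_progress_event_base, EVLOOP_NONBLOCK);
}

static int sketch_init(pthread_t *tid)
{
    opal_progress_event_base = event_base_new();
    opal_event_base = event_base_new();
    if (NULL == opal_progress_event_base || NULL == opal_event_base) {
        return -1;
    }
    return pthread_create(tid, NULL, progress_thread_fn, NULL);
}
```

The design point is that BTL events stay on the polled base by default, so moving any given event to the async thread becomes an explicit, per-event decision rather than a behavior change imposed on every component at once.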