
Conversation

@rhc54
Contributor

@rhc54 rhc54 commented Jun 28, 2015

Create a separate OPAL event base (opal_progress_event_base) that is progressed only via calls to opal_progress. Move the opal_event_base into an async progress thread. Update all the BTLs to use the opal_progress_event_base so they retain their prior behavior; BTL authors may want to evaluate their events to see if any should move to the async thread.
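To make the intended split concrete, here is a minimal standalone sketch of the pattern described above, written against plain libevent2 and pthreads; the names progress_base, async_base, my_progress, and start are illustrative stand-ins, not the actual OPAL symbols or implementation:

/* Illustrative sketch only: two event bases, one polled from the normal
 * progress path and one serviced by a dedicated thread. Not the OPAL code.
 * Build (roughly): cc sketch.c -levent -levent_pthreads -lpthread */
#include <event2/event.h>
#include <event2/thread.h>
#include <pthread.h>

static struct event_base *progress_base;  /* polled only when progress is explicitly invoked */
static struct event_base *async_base;     /* serviced continuously by the helper thread */
static volatile int async_running = 1;

static void *async_loop(void *arg)
{
    (void)arg;
    while (async_running) {
        /* a real implementation keeps at least one persistent event registered
         * on this base so the loop blocks instead of spinning when idle */
        event_base_loop(async_base, EVLOOP_ONCE);
    }
    return NULL;
}

/* the analogue of calling opal_progress(): poll, never block */
static void my_progress(void)
{
    event_base_loop(progress_base, EVLOOP_NONBLOCK);
}

static int start(pthread_t *tid)
{
    evthread_use_pthreads();            /* the bases are touched from two threads */
    progress_base = event_base_new();   /* BTL-style events stay here: prior behavior kept */
    async_base    = event_base_new();   /* runtime/OOB-style events can move here */
    if (NULL == progress_base || NULL == async_base) {
        return -1;
    }
    return pthread_create(tid, NULL, async_loop, NULL);
}

int main(void)
{
    pthread_t tid;
    if (0 != start(&tid)) {
        return 1;
    }
    my_progress();       /* events on progress_base only run when polled like this */
    async_running = 0;   /* stop the helper thread before exiting */
    pthread_join(tid, NULL);
    return 0;
}

Keeping BTL events on the polled base preserves the existing progression semantics, while the async base picks up whatever traffic is safe to service from a background thread.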

This completes the Jan meeting deliverable

@jsquyres @hjelmn please take a look

@rhc54 rhc54 added this to the Future milestone Jun 28, 2015
@mellanox-github

Refer to this link for build results (access rights to CI server needed):
http://bgate.mellanox.com/job/gh-ompi-master-pr/677/

Build Log
last 50 lines

[...truncated 38775 lines...]
Hello, world, I am 6 of 8, (Open MPI v2.0a1, package: Open MPI jenkins@jenkins01 Distribution, ident: 2.0.0a1, repo rev: dev-1994-g5355927, Unreleased developer copy, 138)
Hello, world, I am 1 of 8, (Open MPI v2.0a1, package: Open MPI jenkins@jenkins01 Distribution, ident: 2.0.0a1, repo rev: dev-1994-g5355927, Unreleased developer copy, 138)
Hello, world, I am 3 of 8, (Open MPI v2.0a1, package: Open MPI jenkins@jenkins01 Distribution, ident: 2.0.0a1, repo rev: dev-1994-g5355927, Unreleased developer copy, 138)
Hello, world, I am 5 of 8, (Open MPI v2.0a1, package: Open MPI jenkins@jenkins01 Distribution, ident: 2.0.0a1, repo rev: dev-1994-g5355927, Unreleased developer copy, 138)
Hello, world, I am 7 of 8, (Open MPI v2.0a1, package: Open MPI jenkins@jenkins01 Distribution, ident: 2.0.0a1, repo rev: dev-1994-g5355927, Unreleased developer copy, 138)
Hello, world, I am 0 of 8, (Open MPI v2.0a1, package: Open MPI jenkins@jenkins01 Distribution, ident: 2.0.0a1, repo rev: dev-1994-g5355927, Unreleased developer copy, 138)
Hello, world, I am 2 of 8, (Open MPI v2.0a1, package: Open MPI jenkins@jenkins01 Distribution, ident: 2.0.0a1, repo rev: dev-1994-g5355927, Unreleased developer copy, 138)
++ ompi_info --param pml all --level 9
++ grep yalla
++ wc -l
+ local val=4
++ ibstat -l
+ for hca_dev in '$(ibstat -l)'
+ '[' -f /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/examples/hello_c ']'
+ local hca=mlx4_0:1
+ mca='-bind-to core -mca btl_openib_if_include mlx4_0:1 -x MXM_RDMA_PORTS=mlx4_0:1'
+ echo 'Running /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/examples/hello_c '
Running /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/examples/hello_c 
+ '[' yes == yes ']'
+ timeout -s SIGKILL 10m mpirun -np 8 -bind-to core -mca btl_openib_if_include mlx4_0:1 -x MXM_RDMA_PORTS=mlx4_0:1 -mca pml ob1 -mca btl self,openib /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/examples/hello_c
Hello, world, I am 2 of 8, (Open MPI v2.0a1, package: Open MPI jenkins@jenkins01 Distribution, ident: 2.0.0a1, repo rev: dev-1994-g5355927, Unreleased developer copy, 138)
Hello, world, I am 0 of 8, (Open MPI v2.0a1, package: Open MPI jenkins@jenkins01 Distribution, ident: 2.0.0a1, repo rev: dev-1994-g5355927, Unreleased developer copy, 138)
Hello, world, I am 4 of 8, (Open MPI v2.0a1, package: Open MPI jenkins@jenkins01 Distribution, ident: 2.0.0a1, repo rev: dev-1994-g5355927, Unreleased developer copy, 138)
Hello, world, I am 6 of 8, (Open MPI v2.0a1, package: Open MPI jenkins@jenkins01 Distribution, ident: 2.0.0a1, repo rev: dev-1994-g5355927, Unreleased developer copy, 138)
Hello, world, I am 3 of 8, (Open MPI v2.0a1, package: Open MPI jenkins@jenkins01 Distribution, ident: 2.0.0a1, repo rev: dev-1994-g5355927, Unreleased developer copy, 138)
Hello, world, I am 5 of 8, (Open MPI v2.0a1, package: Open MPI jenkins@jenkins01 Distribution, ident: 2.0.0a1, repo rev: dev-1994-g5355927, Unreleased developer copy, 138)
Hello, world, I am 7 of 8, (Open MPI v2.0a1, package: Open MPI jenkins@jenkins01 Distribution, ident: 2.0.0a1, repo rev: dev-1994-g5355927, Unreleased developer copy, 138)
Hello, world, I am 1 of 8, (Open MPI v2.0a1, package: Open MPI jenkins@jenkins01 Distribution, ident: 2.0.0a1, repo rev: dev-1994-g5355927, Unreleased developer copy, 138)
+ timeout -s SIGKILL 10m mpirun -np 8 -bind-to core -mca btl_openib_if_include mlx4_0:1 -x MXM_RDMA_PORTS=mlx4_0:1 -mca pml cm -mca mtl mxm /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/examples/hello_c
Hello, world, I am 0 of 8, (Open MPI v2.0a1, package: Open MPI jenkins@jenkins01 Distribution, ident: 2.0.0a1, repo rev: dev-1994-g5355927, Unreleased developer copy, 138)
Hello, world, I am 2 of 8, (Open MPI v2.0a1, package: Open MPI jenkins@jenkins01 Distribution, ident: 2.0.0a1, repo rev: dev-1994-g5355927, Unreleased developer copy, 138)
Hello, world, I am 4 of 8, (Open MPI v2.0a1, package: Open MPI jenkins@jenkins01 Distribution, ident: 2.0.0a1, repo rev: dev-1994-g5355927, Unreleased developer copy, 138)
Hello, world, I am 6 of 8, (Open MPI v2.0a1, package: Open MPI jenkins@jenkins01 Distribution, ident: 2.0.0a1, repo rev: dev-1994-g5355927, Unreleased developer copy, 138)
Hello, world, I am 1 of 8, (Open MPI v2.0a1, package: Open MPI jenkins@jenkins01 Distribution, ident: 2.0.0a1, repo rev: dev-1994-g5355927, Unreleased developer copy, 138)
Hello, world, I am 3 of 8, (Open MPI v2.0a1, package: Open MPI jenkins@jenkins01 Distribution, ident: 2.0.0a1, repo rev: dev-1994-g5355927, Unreleased developer copy, 138)
Hello, world, I am 5 of 8, (Open MPI v2.0a1, package: Open MPI jenkins@jenkins01 Distribution, ident: 2.0.0a1, repo rev: dev-1994-g5355927, Unreleased developer copy, 138)
Hello, world, I am 7 of 8, (Open MPI v2.0a1, package: Open MPI jenkins@jenkins01 Distribution, ident: 2.0.0a1, repo rev: dev-1994-g5355927, Unreleased developer copy, 138)
--------------------------------------------------------------------------
mpirun noticed that process rank 4 with PID 0 on node jenkins01 exited on signal 13 (Broken pipe).
--------------------------------------------------------------------------
Build step 'Execute shell' marked build as failure
[htmlpublisher] Archiving HTML reports...
[htmlpublisher] Archiving at BUILD level /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/cov_build to /var/lib/jenkins/jobs/gh-ompi-master-pr/builds/677/htmlreports/Coverity_Report
Setting commit status on GitHub for https://github.com/open-mpi/ompi/commit/53559271dc6625a35a28e79ab0edc263672fefc6
[BFA] Scanning build for known causes...
[BFA] No failure causes found
[BFA] Done. 0s
Setting status of f530cbe9442526d241b3df75e1771756c0ee6c09 to FAILURE with url http://bgate.mellanox.com:8888/jenkins/job/gh-ompi-master-pr/677/ and message: 'Build finished.'
Using conext: Mellanox

@mellanox-github

Refer to this link for build results (access rights to CI server needed):
http://bgate.mellanox.com/job/gh-ompi-master-pr/680/

Build Log
last 50 lines

[...truncated 38788 lines...]
+ local val=4
++ ibstat -l
+ for hca_dev in '$(ibstat -l)'
+ '[' -f /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/examples/hello_c ']'
+ local hca=mlx4_0:1
+ mca='-bind-to core -mca btl_openib_if_include mlx4_0:1 -x MXM_RDMA_PORTS=mlx4_0:1'
+ echo 'Running /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/examples/hello_c '
Running /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/examples/hello_c 
+ '[' yes == yes ']'
+ timeout -s SIGKILL 10m mpirun -np 8 -bind-to core -mca btl_openib_if_include mlx4_0:1 -x MXM_RDMA_PORTS=mlx4_0:1 -mca pml ob1 -mca btl self,openib /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/examples/hello_c
Hello, world, I am 7 of 8, (Open MPI v2.0a1, package: Open MPI jenkins@jenkins01 Distribution, ident: 2.0.0a1, repo rev: dev-1998-ga80c037, Unreleased developer copy, 138)
Hello, world, I am 2 of 8, (Open MPI v2.0a1, package: Open MPI jenkins@jenkins01 Distribution, ident: 2.0.0a1, repo rev: dev-1998-ga80c037, Unreleased developer copy, 138)
Hello, world, I am 1 of 8, (Open MPI v2.0a1, package: Open MPI jenkins@jenkins01 Distribution, ident: 2.0.0a1, repo rev: dev-1998-ga80c037, Unreleased developer copy, 138)
Hello, world, I am 3 of 8, (Open MPI v2.0a1, package: Open MPI jenkins@jenkins01 Distribution, ident: 2.0.0a1, repo rev: dev-1998-ga80c037, Unreleased developer copy, 138)
Hello, world, I am 5 of 8, (Open MPI v2.0a1, package: Open MPI jenkins@jenkins01 Distribution, ident: 2.0.0a1, repo rev: dev-1998-ga80c037, Unreleased developer copy, 138)
Hello, world, I am 0 of 8, (Open MPI v2.0a1, package: Open MPI jenkins@jenkins01 Distribution, ident: 2.0.0a1, repo rev: dev-1998-ga80c037, Unreleased developer copy, 138)
Hello, world, I am 4 of 8, (Open MPI v2.0a1, package: Open MPI jenkins@jenkins01 Distribution, ident: 2.0.0a1, repo rev: dev-1998-ga80c037, Unreleased developer copy, 138)
Hello, world, I am 6 of 8, (Open MPI v2.0a1, package: Open MPI jenkins@jenkins01 Distribution, ident: 2.0.0a1, repo rev: dev-1998-ga80c037, Unreleased developer copy, 138)
+ timeout -s SIGKILL 10m mpirun -np 8 -bind-to core -mca btl_openib_if_include mlx4_0:1 -x MXM_RDMA_PORTS=mlx4_0:1 -mca pml cm -mca mtl mxm /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/examples/hello_c
Hello, world, I am 0 of 8, (Open MPI v2.0a1, package: Open MPI jenkins@jenkins01 Distribution, ident: 2.0.0a1, repo rev: dev-1998-ga80c037, Unreleased developer copy, 138)
Hello, world, I am 2 of 8, (Open MPI v2.0a1, package: Open MPI jenkins@jenkins01 Distribution, ident: 2.0.0a1, repo rev: dev-1998-ga80c037, Unreleased developer copy, 138)
Hello, world, I am 4 of 8, (Open MPI v2.0a1, package: Open MPI jenkins@jenkins01 Distribution, ident: 2.0.0a1, repo rev: dev-1998-ga80c037, Unreleased developer copy, 138)
Hello, world, I am 6 of 8, (Open MPI v2.0a1, package: Open MPI jenkins@jenkins01 Distribution, ident: 2.0.0a1, repo rev: dev-1998-ga80c037, Unreleased developer copy, 138)
Hello, world, I am 5 of 8, (Open MPI v2.0a1, package: Open MPI jenkins@jenkins01 Distribution, ident: 2.0.0a1, repo rev: dev-1998-ga80c037, Unreleased developer copy, 138)
Hello, world, I am 7 of 8, (Open MPI v2.0a1, package: Open MPI jenkins@jenkins01 Distribution, ident: 2.0.0a1, repo rev: dev-1998-ga80c037, Unreleased developer copy, 138)
Hello, world, I am 3 of 8, (Open MPI v2.0a1, package: Open MPI jenkins@jenkins01 Distribution, ident: 2.0.0a1, repo rev: dev-1998-ga80c037, Unreleased developer copy, 138)
Hello, world, I am 1 of 8, (Open MPI v2.0a1, package: Open MPI jenkins@jenkins01 Distribution, ident: 2.0.0a1, repo rev: dev-1998-ga80c037, Unreleased developer copy, 138)
+ '[' 4 -gt 0 ']'
+ timeout -s SIGKILL 10m mpirun -np 8 -bind-to core -mca btl_openib_if_include mlx4_0:1 -x MXM_RDMA_PORTS=mlx4_0:1 -mca pml yalla /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/examples/hello_c
Hello, world, I am 6 of 8, (Open MPI v2.0a1, package: Open MPI jenkins@jenkins01 Distribution, ident: 2.0.0a1, repo rev: dev-1998-ga80c037, Unreleased developer copy, 138)
Hello, world, I am 5 of 8, (Open MPI v2.0a1, package: Open MPI jenkins@jenkins01 Distribution, ident: 2.0.0a1, repo rev: dev-1998-ga80c037, Unreleased developer copy, 138)
Hello, world, I am 7 of 8, (Open MPI v2.0a1, package: Open MPI jenkins@jenkins01 Distribution, ident: 2.0.0a1, repo rev: dev-1998-ga80c037, Unreleased developer copy, 138)
Hello, world, I am 0 of 8, (Open MPI v2.0a1, package: Open MPI jenkins@jenkins01 Distribution, ident: 2.0.0a1, repo rev: dev-1998-ga80c037, Unreleased developer copy, 138)
Hello, world, I am 3 of 8, (Open MPI v2.0a1, package: Open MPI jenkins@jenkins01 Distribution, ident: 2.0.0a1, repo rev: dev-1998-ga80c037, Unreleased developer copy, 138)
Hello, world, I am 1 of 8, (Open MPI v2.0a1, package: Open MPI jenkins@jenkins01 Distribution, ident: 2.0.0a1, repo rev: dev-1998-ga80c037, Unreleased developer copy, 138)
Hello, world, I am 2 of 8, (Open MPI v2.0a1, package: Open MPI jenkins@jenkins01 Distribution, ident: 2.0.0a1, repo rev: dev-1998-ga80c037, Unreleased developer copy, 138)
Hello, world, I am 4 of 8, (Open MPI v2.0a1, package: Open MPI jenkins@jenkins01 Distribution, ident: 2.0.0a1, repo rev: dev-1998-ga80c037, Unreleased developer copy, 138)
--------------------------------------------------------------------------
mpirun noticed that process rank 7 with PID 0 on node jenkins01 exited on signal 13 (Broken pipe).
--------------------------------------------------------------------------
Build step 'Execute shell' marked build as failure
[htmlpublisher] Archiving HTML reports...
[htmlpublisher] Archiving at BUILD level /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/cov_build to /var/lib/jenkins/jobs/gh-ompi-master-pr/builds/680/htmlreports/Coverity_Report
Setting commit status on GitHub for https://github.com/open-mpi/ompi/commit/a80c037a97eb00da1b79261bc74ed446395e51f2
[BFA] Scanning build for known causes...
[BFA] No failure causes found
[BFA] Done. 0s
Setting status of 5e01773aa57a1673a7c7a6bacfb570d7e123ef66 to FAILURE with url http://bgate.mellanox.com:8888/jenkins/job/gh-ompi-master-pr/680/ and message: 'Build finished.'
Using conext: Mellanox

@mellanox-github

Refer to this link for build results (access rights to CI server needed):
http://bgate.mellanox.com/job/gh-ompi-master-pr/681/

Build Log
last 50 lines

[...truncated 38728 lines...]
make[2]: Entering directory `/scrap/jenkins/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/examples'
shmemcc -g oshmem_max_reduction.c -o oshmem_max_reduction
make[2]: Leaving directory `/scrap/jenkins/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/examples'
make[2]: Entering directory `/scrap/jenkins/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/examples'
shmemcc -g oshmem_strided_puts.c -o oshmem_strided_puts
make[2]: Leaving directory `/scrap/jenkins/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/examples'
make[2]: Entering directory `/scrap/jenkins/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/examples'
shmemcc -g oshmem_symmetric_data.c -o oshmem_symmetric_data
make[2]: Leaving directory `/scrap/jenkins/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/examples'
make[2]: Entering directory `/scrap/jenkins/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/examples'
shmemfort -g hello_oshmemfh.f90 -o hello_oshmemfh
make[2]: Leaving directory `/scrap/jenkins/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/examples'
make[2]: Entering directory `/scrap/jenkins/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/examples'
shmemfort -g ring_oshmemfh.f90 -o ring_oshmemfh
make[2]: Leaving directory `/scrap/jenkins/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/examples'
make[1]: Leaving directory `/scrap/jenkins/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/examples'
make: Leaving directory `/scrap/jenkins/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/examples'
+ for exe in hello_c ring_c
+ exe_path=/var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/examples/hello_c
+ PATH=/var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/bin:/hpc/local/bin::/usr/local/bin:/bin:/usr/bin:/usr/sbin:/hpc/local/bin:/hpc/local/bin/:/hpc/local/bin/:/sbin:/usr/sbin:/bin:/usr/bin:/usr/local/sbin:/opt/ibutils/bin
+ LD_LIBRARY_PATH=/var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/lib:
+ mpi_runner 8 /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/examples/hello_c
+ local np=8
+ local exe_path=/var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/examples/hello_c
+ local exe_args=
+ local 'common_mca=-bind-to core'
+ local 'mca=-bind-to core'
+ '[' yes == yes ']'
+ timeout -s SIGKILL 10m mpirun -np 8 -bind-to core -mca pml ob1 -mca btl self,tcp /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/examples/hello_c
Hello, world, I am 2 of 8, (Open MPI v2.0a1, package: Open MPI jenkins@jenkins01 Distribution, ident: 2.0.0a1, repo rev: dev-1998-g4701b31, Unreleased developer copy, 138)
Hello, world, I am 4 of 8, (Open MPI v2.0a1, package: Open MPI jenkins@jenkins01 Distribution, ident: 2.0.0a1, repo rev: dev-1998-g4701b31, Unreleased developer copy, 138)
Hello, world, I am 6 of 8, (Open MPI v2.0a1, package: Open MPI jenkins@jenkins01 Distribution, ident: 2.0.0a1, repo rev: dev-1998-g4701b31, Unreleased developer copy, 138)
Hello, world, I am 1 of 8, (Open MPI v2.0a1, package: Open MPI jenkins@jenkins01 Distribution, ident: 2.0.0a1, repo rev: dev-1998-g4701b31, Unreleased developer copy, 138)
Hello, world, I am 3 of 8, (Open MPI v2.0a1, package: Open MPI jenkins@jenkins01 Distribution, ident: 2.0.0a1, repo rev: dev-1998-g4701b31, Unreleased developer copy, 138)
Hello, world, I am 5 of 8, (Open MPI v2.0a1, package: Open MPI jenkins@jenkins01 Distribution, ident: 2.0.0a1, repo rev: dev-1998-g4701b31, Unreleased developer copy, 138)
Hello, world, I am 7 of 8, (Open MPI v2.0a1, package: Open MPI jenkins@jenkins01 Distribution, ident: 2.0.0a1, repo rev: dev-1998-g4701b31, Unreleased developer copy, 138)
Hello, world, I am 0 of 8, (Open MPI v2.0a1, package: Open MPI jenkins@jenkins01 Distribution, ident: 2.0.0a1, repo rev: dev-1998-g4701b31, Unreleased developer copy, 138)
--------------------------------------------------------------------------
mpirun noticed that process rank 7 with PID 0 on node jenkins01 exited on signal 13 (Broken pipe).
--------------------------------------------------------------------------
Build step 'Execute shell' marked build as failure
[htmlpublisher] Archiving HTML reports...
[htmlpublisher] Archiving at BUILD level /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/cov_build to /var/lib/jenkins/jobs/gh-ompi-master-pr/builds/681/htmlreports/Coverity_Report
Setting commit status on GitHub for https://github.com/open-mpi/ompi/commit/4701b31d8907352f03dbbe9d0ba8fcca66f2c3d3
[BFA] Scanning build for known causes...
[BFA] No failure causes found
[BFA] Done. 0s
Setting status of 352b0d60f385c75537d7ee14e029c963c5dc04ab to FAILURE with url http://bgate.mellanox.com:8888/jenkins/job/gh-ompi-master-pr/681/ and message: 'Build finished.'
Using conext: Mellanox

@hjelmn
Member

hjelmn commented Jul 1, 2015

:bot:retest:

@hppritcha
Member

The Mellanox Jenkins seems to be out to lunch.

@hjelmn
Member

hjelmn commented Jul 1, 2015

Yeah, not sure what to do. I can't get this branch to fail on mustang (ConnectX 3) with the tests that were failing for the Mellanox Jenkins.

@jsquyres
Member

jsquyres commented Jul 1, 2015

@miked-mellanox @alinask @jladd-mlnx Can you guys dig into what is going on with this one? No one can replicate / figure out what the problem is. Thanks!

@mike-dubman
Member

  • The good thing: the status context is correct now.
  • I tried it manually, and it works now.
  • The time of failure correlates with Jenkins machine NFS issues; I will retry it now.

@mike-dubman
Member

bot:retest

@jsquyres
Member

jsquyres commented Jul 1, 2015

@miked-mellanox Many thanks.

@rhc54
Contributor Author

rhc54 commented Jul 7, 2015

bot:retest

…s if we only have sm,self BTLs enabled, which is a rather unique use-case, so just disable it for now.
@lanl-ompi
Contributor

Test FAILed.

1 similar comment
@lanl-ompi
Contributor

Test FAILed.

@rhc54
Contributor Author

rhc54 commented Jul 7, 2015

@miked-mellanox I'm afraid I'll need help here, Mike, if we want to close on the last "to-do" from the Jan meeting. This is failing in the mxm MTL, and I have no way of pursuing it.

@mike-dubman
Member

@rhc54 -

  • I tried the command line manually; it fails with SIGPIPE or SEGV occasionally (with np=8 and np=2).
  • I'm trying to extract a meaningful stack trace; so far I have got this:
jenkins@jenkins01 /tmp
$timeout -s SIGKILL 10m /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace-2/ompi_install1/bin/mpirun -np 2 -bind-to core -x MXM_HANDLE_ERRORS=debug -mca btl_openib_if_include mlx5_0:1 -x MXM_RDMA_PORTS=mlx5_0:1 -mca pml cm -mca mtl mxm /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace-2/ompi_install1/examples/ring_c
Process 0 sending 10 to 1, tag 201 (2 processes in ring)
Process 0 sent to 1
Process 0 decremented value: 9
Process 0 decremented value: 8
Process 0 decremented value: 7
Process 0 decremented value: 6
Process 0 decremented value: 5
Process 0 decremented value: 4
Process 0 decremented value: 3
Process 0 decremented value: 2
Process 0 decremented value: 1
Process 0 decremented value: 0
Process 0 exiting
Process 1 exiting
jenkins@jenkins01 /tmp
$timeout -s SIGKILL 10m /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace-2/ompi_install1/bin/mpirun -np 2 -bind-to core -x MXM_HANDLE_ERRORS=debug -mca btl_openib_if_include mlx5_0:1 -x MXM_RDMA_PORTS=mlx5_0:1 -mca pml cm -mca mtl mxm /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace-2/ompi_install1/examples/ring_c
Process 0 sending 10 to 1, tag 201 (2 processes in ring)
Process 0 sent to 1
Process 0 decremented value: 9
Process 0 decremented value: 8
Process 0 decremented value: 7
Process 0 decremented value: 6
Process 0 decremented value: 5
Process 0 decremented value: 4
Process 0 decremented value: 3
Process 0 decremented value: 2
Process 0 decremented value: 1
Process 0 decremented value: 0
Process 0 exiting
Process 1 exiting
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node jenkins01 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
...
Program terminated with signal 11, Segmentation fault.
#0  0x00007ffff69ea284 in ?? ()
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.132.el6.x86_64 libibverbs-1.1.8mlnx1-OFED.3.0.1.5.4.x86_64 libmlx4-1.0.6mlnx1-OFED.3.0.1.5.4.x86_64 libmlx5-1.0.2mlnx1-OFED.3.0.1.5.3.x86_64 libnl-1.1.4-2.el6.x86_64 libpciaccess-0.13.1-2.el6.x86_64 numactl-2.0.7-8.el6.x86_64
(gdb) bt
#0  0x00007ffff69ea284 in ?? ()
#1  0x00007ffff69d9d20 in ?? ()
#2  0x00007ffff5c37440 in ?? ()
#3  0x00007fffffffffff in ?? ()
#4  0x0000017bf775c580 in ?? ()
#5  0x00007ffff69ededa in ?? ()
#6  0x00007ffff775c560 in orte_finalize () at runtime/orte_finalize.c:95
#7  0x0000000000000001 in ?? ()
#8  0x0000000000679180 in ?? ()
#9  0x00007ffff69d9d90 in ?? ()
#10 0x00007ffff69eaaaf in ?? ()
#11 0x0000000000000000 in ?? ()
(gdb)
  • Is there a way in OMPI to print a stack trace when mpirun detects a failure and prints this?
mpirun noticed that process rank 1 with PID 0 on node jenkins01
  • Why is the PID zero?

@mike-dubman
Member

Also, dmesg shows the following:

ring_c[26301]: segfault at 7ffff69eb0af ip 00007ffff69eb0af sp 00007ffff69d9d30 error 14 in mca_sec_basic.so[7ffff6ff9000+2000]
ring_c[3092]: segfault at 7ffff69ea284 ip 00007ffff69ea284 sp 00007ffff69d9ce0 error 14 in mca_sec_basic.so[7ffff6ff9000+2000]
ring_c[13567]: segfault at 7ffff69eb0af ip 00007ffff69eb0af sp 00007ffff69d9d30 error 14 in mca_sec_basic.so[7ffff6ff9000+2000]
ring_c[13747]: segfault at 7ffff69eb0af ip 00007ffff69eb0af sp 00007ffff69d9d30 error 14 in mca_sec_basic.so[7ffff6ff9000+2000]
ring_c[13958]: segfault at 7ffff69eb0af ip 00007ffff69eb0af sp 00007ffff69d9d30 error 14 in mca_sec_basic.so[7ffff6ff9000+2000]
ring_c[13962]: segfault at 7ffff69eb0af ip 00007ffff69eb0af sp 00007ffff69d9d30 error 14 in mca_ess_pmi.so[7ffff6bf1000+4000]

ring_c[14098]: segfault at 7ffff69ea284 ip 00007ffff69ea284 sp 00007ffff69d9ce0 error 14 in mca_sec_basic.so[7ffff6ff9000+2000]

…essed only via call to opal_progress. Move the opal_event_base into an async progress thread. Update all the BTLs to use the opal_progress_event_base so they retain their prior behavior - BTL authors may want to evaluate their events to see if any should move to the async thread.

Ensure the security credentials are initialized and properly checked before free'ing
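The second commit title above describes a guard-before-free fix. Purely as a hedged illustration of that kind of defensive init/check-before-free pattern, with hypothetical type and field names that do not reflect the actual sec framework API:

/* Hypothetical illustration of an init/check-before-free pattern; the type
 * and field names are made up and are not the actual opal sec structures. */
#include <stdlib.h>
#include <string.h>

typedef struct {
    char  *method;      /* which module produced the credential (hypothetical) */
    char  *credential;  /* opaque payload (hypothetical) */
    size_t size;
} my_cred_t;

static void my_cred_init(my_cred_t *cred)
{
    /* start from a known state so a later free of an unset field is safe */
    memset(cred, 0, sizeof(*cred));
}

static void my_cred_free(my_cred_t *cred)
{
    if (NULL == cred) {
        return;
    }
    if (NULL != cred->method) {
        free(cred->method);
        cred->method = NULL;
    }
    if (NULL != cred->credential) {
        free(cred->credential);
        cred->credential = NULL;
    }
    cred->size = 0;
}

int main(void)
{
    my_cred_t cred;
    my_cred_init(&cred);   /* safe even if the fields are never filled in */
    my_cred_free(&cred);   /* the NULL checks make freeing an unfilled credential a no-op */
    return 0;
}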
@jsquyres
Member

jsquyres commented Jul 7, 2015

Can you enable debug in your build and see a better stack trace?

We're totally blocked here -- Mellanox is the only one who is able to reproduce this problem. :-( Can you push this a little further?

@bureddy
Member

bureddy commented Jul 7, 2015

I tried a debug build. It failed with a segfault 6 times out of 100.
Configure:
./configure --enable-debug --prefix=$PWD/ompi/install

Run command (single node, no MXM):
$mpirun -np 2 -bind-to core -mca pml ob1 -mca btl self,tcp ./hello_c

Backtrace:
[mir4:12483] *** Process received signal ***
[mir4:12483] Signal: Segmentation fault (11)
[mir4:12483] Signal code: Address not mapped (1)
[mir4:12483] Failing at address: 0x7fac643328d8
[mir4:12483] [ 0] /lib64/libpthread.so.0[0x35e000f710]
[mir4:12483] [ 1] /hpc/home/USERS/devendar11/ompi/install/lib/libopen-pal.so.0(opal_libevent2022_evmap_io_active+0x3b)[0x7fac64bdf79b]
[mir4:12483] [ 2] /hpc/home/USERS/devendar11/ompi/install/lib/libopen-pal.so.0(+0x9f05e)[0x7fac64be405e]
[mir4:12483] [ 3] /hpc/home/USERS/devendar11/ompi/install/lib/libopen-pal.so.0(opal_libevent2022_event_base_loop+0x1f6)[0x7fac64bddf36]
[mir4:12483] [ 4] /hpc/home/USERS/devendar11/ompi/install/lib/libopen-pal.so.0(+0x3b1e3)[0x7fac64b801e3]
[mir4:12483] [ 5] /lib64/libpthread.so.0[0x35e00079d1]
[mir4:12483] [ 6] /lib64/libc.so.6(clone+0x6d)[0x35dfce886d]
[mir4:12483] *** End of error message ***
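Going by the symbols alone, frames 1-4 place the fault inside the libevent dispatch running on the spawned progress thread, at an address that is no longer mapped; one scenario that produces exactly that picture is the event base (or an event registered on it) being torn down while the thread is still looping. That is only a guess from this trace. As a hedged sketch of a teardown ordering that rules out that particular race (placeholder names, not the actual OMPI shutdown path; it assumes evthread_use_pthreads() was called so loopbreak can wake a blocked loop):

/* Hedged sketch of teardown ordering for an async event-loop thread;
 * placeholder names, not the actual OMPI shutdown code. */
#include <event2/event.h>
#include <event2/thread.h>
#include <pthread.h>

static struct event_base *async_base;
static volatile int async_running = 1;

static void *async_loop(void *arg)
{
    (void)arg;
    while (async_running) {
        event_base_loop(async_base, EVLOOP_ONCE);
    }
    return NULL;
}

static void stop_async_thread(pthread_t tid)
{
    async_running = 0;                 /* 1. ask the loop to stop                */
    event_base_loopbreak(async_base);  /* 2. wake it if it is blocked in dispatch */
    pthread_join(tid, NULL);           /* 3. wait until it has really exited      */
    event_base_free(async_base);       /* 4. only now is it safe to free the base */
    async_base = NULL;
}

int main(void)
{
    pthread_t tid;
    evthread_use_pthreads();
    async_base = event_base_new();
    if (NULL == async_base || 0 != pthread_create(&tid, NULL, async_loop, NULL)) {
        return 1;
    }
    /* ... normal execution ... */
    stop_async_thread(tid);
    return 0;
}

Freeing the base (or destroying events registered on it) before the join completes would leave the looping thread dispatching from released memory, which matches the "address not mapped" failure above, but confirming that would need a symbolized trace from the crash itself.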

@rhc54 rhc54 closed this Jul 8, 2015
jsquyres pushed a commit to jsquyres/ompi that referenced this pull request Nov 10, 2015
…mutex-plus-mutex-static-initializer

v2.0.0: init finalize mutex plus mutex static initializer