This repository was archived by the owner on Sep 30, 2022. It is now read-only.

Conversation

@jsquyres
Member

@jsquyres jsquyres commented Sep 6, 2016

From open-mpi/ompi#2050:

We commonly see messages on the users list where a peer has hung up because it has crashed. Instead of having just a BTL_ERROR message, make this a real opal_show_help() message that tells the user that the peer unexpectedly hung up, and they should look into why that peer hung up.

Signed-off-by: Jeff Squyres [email protected]
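As a rough illustration of the idea described above (not the actual patch), the sketch below detects a peer hang-up and turns it into a multi-line, user-facing explanation rather than a terse error line. Everything in it is an assumption made for illustration: the helper names `format_peer_hangup_help`, `read_indicates_hangup`, and `demo_detect_hangup` are hypothetical, the message wording is invented, and a pipe stands in for the TCP connection. The real change routes the message through `opal_show_help()` and a help text file, which this standalone sketch does not use.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not part of Open MPI): builds a user-facing
 * explanation of a peer hang-up. The wording is illustrative only. */
static int format_peer_hangup_help(char *buf, size_t len,
                                   const char *local_host,
                                   const char *peer_host)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len,
                    "A peer process unexpectedly closed its connection.\n"
                    "  Local host: %s\n"
                    "  Peer host:  %s\n"
                    "The peer may have crashed; check why it exited.\n",
                    local_host, peer_host);
}

/* read() returning 0 on a pipe or stream socket means the other end
 * closed the connection; a negative value is an error, not a hang-up. */
static int read_indicates_hangup(ssize_t nread)
{
    return nread == 0;
}

/* Hypothetical demo: simulate a peer that goes away by closing the
 * write end of a pipe, then read from the other end.
 * Returns 1 if a hang-up was detected, 0 if not, -1 on setup failure. */
static int demo_detect_hangup(char *msg, size_t len)
{
    int fds[2];
    char byte;
    ssize_t n;

    if (pipe(fds) != 0) {
        return -1;
    }
    close(fds[1]);               /* the "peer" disappears */
    n = read(fds[0], &byte, 1);  /* sees end-of-file */
    close(fds[0]);

    if (!read_indicates_hangup(n)) {
        return 0;
    }
    format_peer_hangup_help(msg, len, "localhost", "peerhost");
    return 1;
}
```

Calling `demo_detect_hangup(msg, sizeof(msg))` with a buffer of a few hundred bytes fills `msg` with the explanatory text. The design point is the same as in the PR: a zero-length read is an expected runtime condition, so it deserves a message that tells the user what to investigate.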

There's a second commit on this PR that disentangles two help messages that appear to have been accidentally entangled.

It looks like one help message was accidentally pasted in the middle
of another.  Disentangle the two messages from each other, and
slightly tweak the one message to say that the job may also crash (in
addition to hanging).

Signed-off-by: Jeff Squyres <[email protected]>

(cherry picked from commit open-mpi/ompi@95c6f6c)
We commonly see messages on the users list where a peer has hung up
because it has crashed.  Instead of having just a BTL_ERROR message,
make this a real opal_show_help() message that tells the user that the
peer unexpectedly hung up, and they should look into *why* that peer
hung up.

Signed-off-by: Jeff Squyres <[email protected]>

(cherry picked from commit open-mpi/ompi@1953e34)
@jsquyres jsquyres added the bug label Sep 6, 2016
@jsquyres jsquyres added this to the v2.0.2 milestone Sep 6, 2016
@jsquyres
Member Author

jsquyres commented Sep 6, 2016

@bosilca If you could verify this PR for v2.x, that'd be great. Thanks.

@bosilca
Member

bosilca commented Sep 6, 2016

👍

@mellanox-github

Test FAILed.
See http://bgate.mellanox.com/jenkins/job/gh-ompi-release-pr/2156/ for details.

@jsquyres
Member Author

jsquyres commented Sep 6, 2016

@rhc54 Getting a PMIx bind failure on v2.x in Mellanox Jenkins (unrelated to this PR):

10:16:10 + taskset -c 10,11 timeout -s SIGSEGV 15m /var/lib/jenkins/jobs/gh-ompi-release-pr/workspace/ompi_install1/bin/mpirun -np 8 -bind-to none -mca btl_openib_if_include mlx4_0:1 -x MXM_RDMA_PORTS=mlx4_0:1 -x UCX_NET_DEVICES=mlx4_0:1 -x UCX_TLS=rc,cm -mca pml ob1 -mca btl self,openib taskset -c 10,11 /var/lib/jenkins/jobs/gh-ompi-release-pr/workspace/ompi_install1/examples/hello_c
10:16:10 [jenkins03:19613] [[30694,0],0] ORTE_ERROR_LOG: Error in file orted/pmix/pmix_server.c at line 254
10:16:10 [jenkins03:19613] [[30694,0],0] ORTE_ERROR_LOG: Error in file ess_hnp_module.c at line 656
10:16:10 --------------------------------------------------------------------------
10:16:10 It looks like orte_init failed for some reason; your parallel process is
10:16:10 likely to abort.  There are many reasons that a parallel process can
10:16:10 fail during orte_init; some of which are due to configuration or
10:16:10 environment problems.  This failure appears to be an internal failure;
10:16:10 here's some additional information (which may only be relevant to an
10:16:10 Open MPI developer):
10:16:10 
10:16:10   pmix server init failed
10:16:10   --> Returned value Error (-1) instead of ORTE_SUCCESS
10:16:10 --------------------------------------------------------------------------
10:16:10 src/server/pmix_server_listener.c:92 bind() failed

@jsquyres
Member Author

jsquyres commented Sep 6, 2016

@hppritcha Good to go

@jsquyres
Member Author

bot:mellanox:retest

@hppritcha
Member

The mlnx Jenkins failure doesn't have anything to do with this; merging.

@hppritcha hppritcha merged commit 1529f93 into open-mpi:v2.x Sep 12, 2016
@mellanox-github

Test FAILed.
See http://bgate.mellanox.com/jenkins/job/gh-ompi-release-pr/2167/ for details.
