TCP: unexpected process identifier in connect_ack #6240

@abouteiller

At job startup, the program deadlocks with the output shown under "Details of the problem". A quick investigation shows that the opal_proc name is valid but does not match the process identifier received over the socket: the jobid is the same, but the rank is different (and is itself a valid rank).
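
To make the mismatch concrete, here is a minimal sketch in C of the kind of comparison that fails. It is not the Open MPI implementation: `proc_name_t`, `proc_name_equal`, and `check_connect_ack` are hypothetical names invented for illustration, and the real check in btl_tcp_endpoint.c may be structured differently. The sketch only assumes that a process name is a (jobid, rank) pair and that both parts must match.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for a process name: a job identifier plus a rank.
 * This is NOT the real Open MPI type; it exists only for this sketch. */
typedef struct {
    uint32_t jobid;
    uint32_t rank;
} proc_name_t;

/* Two names are equal only when both the jobid and the rank agree. */
static bool proc_name_equal(proc_name_t a, proc_name_t b)
{
    return a.jobid == b.jobid && a.rank == b.rank;
}

/* Hypothetical check: compare the peer this endpoint expects with the
 * identifier that arrived over the socket. Returns false on mismatch. */
static bool check_connect_ack(proc_name_t expected, proc_name_t received)
{
    if (!proc_name_equal(expected, received)) {
        fprintf(stderr,
                "received unexpected process identifier [%u,%u] (expected [%u,%u])\n",
                (unsigned)received.jobid, (unsigned)received.rank,
                (unsigned)expected.jobid, (unsigned)expected.rank);
        return false;
    }
    return true;
}
```

With illustrative values, `check_connect_ack((proc_name_t){1, 2}, (proc_name_t){1, 3})` returns false even though both names are individually valid and share a jobid, which is the situation described above.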

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

3a4a1f93 (HEAD -> master, origin/master, origin/HEAD) Merge pull request #6239 from hppritcha/topic/swat_orte_shutdown.... Pritchard  2 hours ago

Please describe the system on which you are running

  • Operating system/version: CentOS7
  • Computer hardware: x86_64
  • Network type: TCP

Details of the problem

 salloc -N4 -Ccauchy /home/bouteill/ompi/master.debug/bin/mpirun -mca btl tcp,self IMB-MPI1 pingpong
salloc: Granted job allocation 245932
salloc: Waiting for resource configuration
salloc: Nodes c[00-03] are ready for job
#------------------------------------------------------------
#    Intel(R) MPI Benchmarks 2019 Update 1, MPI-1 part
#------------------------------------------------------------
# Date                  : Fri Jan  4 18:52:06 2019
# Machine               : x86_64
# System                : Linux
# Release               : 3.10.0-514.26.1.el7.x86_64
# Version               : #1 SMP Wed Jun 28 15:10:01 CDT 2017
# MPI Version           : 3.1
# MPI Thread Environment:
[...]

# PingPong
[c01][[19494,1],8][../../../../../master/opal/mca/btl/tcp/btl_tcp_endpoint.c:630:mca_btl_tcp_endpoint_recv_connect_ack] received unexpected process identifier [[19494,1],13]

The same run with options -mca btl openib,vader,self completes successfully.
