
PRTE detected a mismatch in versions between two processes. #13093

Description

@Youpen-y

I am writing a simple test of MPI_Send and MPI_Recv and running it on two machines (master and slave1) that have different operating systems.
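
A minimal sketch of such a test (illustrative only; the actual program may differ):

#include <mpi.h>
#include <stdio.h>

/* Minimal MPI_Send/MPI_Recv test: rank 0 sends an integer to rank 1. */
int main(int argc, char **argv)
{
    int rank, value = 42;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}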
After installing Open MPI 5.0.6, I run:

mpirun --display-map -np 2 --mca plm_base_verbose 30 --mca oob_base_verbose 10 -hostfile mpi_hosts ./openmpi_test

and get the following output. I have no idea what PRTE is, or why the two versions differ (3.0.7 vs. 3.0.8). Does anyone have advice on how to solve this? What do I need to do?

[master:1636264] mca: base: component_find: searching NULL for plm components
[master:1636264] mca: base: find_dyn_components: checking NULL for plm components
[master:1636264] pmix:mca: base: components_register: registering framework plm components
[master:1636264] pmix:mca: base: components_register: found loaded component slurm
[master:1636264] pmix:mca: base: components_register: component slurm register function successful
[master:1636264] pmix:mca: base: components_register: found loaded component ssh
[master:1636264] pmix:mca: base: components_register: component ssh register function successful
[master:1636264] mca: base: components_open: opening plm components
[master:1636264] mca: base: components_open: found loaded component slurm
[master:1636264] mca: base: components_open: component slurm open function successful
[master:1636264] mca: base: components_open: found loaded component ssh
[master:1636264] mca: base: components_open: component ssh open function successful
[master:1636264] mca:base:select: Auto-selecting plm components
[master:1636264] mca:base:select:(  plm) Querying component [slurm]
[master:1636264] mca:base:select:(  plm) Querying component [ssh]
[master:1636264] [[INVALID],UNDEFINED] plm:ssh_lookup on agent ssh : rsh path NULL
[master:1636264] mca:base:select:(  plm) Query of component [ssh] set priority to 10
[master:1636264] mca:base:select:(  plm) Selected component [ssh]
[master:1636264] mca: base: close: component slurm closed
[master:1636264] mca: base: close: unloading component slurm
[master:1636264] mca: base: component_find: searching NULL for oob components
[master:1636264] mca: base: find_dyn_components: checking NULL for oob components
[master:1636264] pmix:mca: base: components_register: registering framework oob components
[master:1636264] pmix:mca: base: components_register: found loaded component tcp
[master:1636264] pmix:mca: base: components_register: component tcp register function successful
[master:1636264] mca: base: components_open: opening oob components
[master:1636264] mca: base: components_open: found loaded component tcp
[master:1636264] mca: base: components_open: component tcp open function successful
[master:1636264] mca:oob:select: checking available component tcp
[master:1636264] mca:oob:select: Querying component [tcp]
[master:1636264] oob:tcp: component_available called
[master:1636264] [prterun-master-1636264@0,0] oob:tcp:init adding 10.90.50.196 to our list of V4 connections
[master:1636264] [prterun-master-1636264@0,0] oob:tcp:init adding 10.170.133.119 to our list of V4 connections
[master:1636264] [prterun-master-1636264@0,0] oob:tcp:init adding 192.168.103.1 to our list of V4 connections
[master:1636264] [prterun-master-1636264@0,0] oob:tcp:init adding 172.18.0.1 to our list of V4 connections
[master:1636264] [prterun-master-1636264@0,0] oob:tcp:init adding 172.19.0.1 to our list of V4 connections
[master:1636264] [prterun-master-1636264@0,0] TCP STARTUP
[master:1636264] [prterun-master-1636264@0,0] attempting to bind to IPv4 port 0
[master:1636264] [prterun-master-1636264@0,0] assigned IPv4 port 59189
[master:1636264] mca:oob:select: Adding component to end
[master:1636264] mca:oob:select: Found 1 active transports
[master:1636264] [prterun-master-1636264@0,0] plm:ssh_setup on agent ssh : rsh path NULL
[master:1636264] [prterun-master-1636264@0,0] plm:base:receive start comm
[master:1636264] [prterun-master-1636264@0,0] plm:base:setup_vm
[master:1636264] [prterun-master-1636264@0,0] plm:base:setup_vm creating map
[master:1636264] [prterun-master-1636264@0,0] setup:vm: working unmanaged allocation
[master:1636264] [prterun-master-1636264@0,0] using hostfile /home/yyp/jiajia/apps/mpitest/openmpi/mpi_hosts
[master:1636264] [prterun-master-1636264@0,0] checking node master
[master:1636264] [prterun-master-1636264@0,0] ignoring myself
[master:1636264] [prterun-master-1636264@0,0] checking node slave1
[master:1636264] [prterun-master-1636264@0,0] plm:base:setup_vm add new daemon [prterun-master-1636264@0,1]
[master:1636264] [prterun-master-1636264@0,0] plm:base:setup_vm assigning new daemon [prterun-master-1636264@0,1] to node slave1
[master:1636264] [prterun-master-1636264@0,0] plm:ssh: launching vm
[master:1636264] [prterun-master-1636264@0,0] plm:ssh: local shell: 5 (sh)
[master:1636264] [prterun-master-1636264@0,0] plm:ssh: assuming same remote shell as local shell
[master:1636264] [prterun-master-1636264@0,0] plm:ssh: remote shell: 5 (sh)
[master:1636264] [prterun-master-1636264@0,0] plm:ssh: final template argv:
        /usr/bin/ssh <template> ( test ! -r ./.profile || . ./.profile; PRTE_PREFIX=/usr;export PRTE_PREFIX;LD_LIBRARY_PATH=/usr/lib:/usr/lib:$LD_LIBRARY_PATH;export LD_LIBRARY_PATH;DYLD_LIBRARY_PATH=/usr/lib:/usr/lib:$DYLD_LIBRARY_PATH;export DYLD_LIBRARY_PATH;/usr/bin/prted --prtemca ess "env" --prtemca ess_base_nspace "prterun-master-1636264@0" --prtemca ess_base_vpid "<template>" --prtemca ess_base_num_procs "2" --prtemca prte_hnp_uri "[email protected];tcp://10.90.50.196,10.170.133.119,192.168.103.1,172.18.0.1,172.19.0.1:59189:24,24,24,16,16" --prtemca PREFIXES "errmgr,ess,filem,grpcomm,iof,odls,oob,plm,prtebacktrace,prtedl,prteinstalldirs,prtereachable,ras,rmaps,rtc,schizo,state,hwloc,if,reachable" --prtemca plm_base_verbose "30" --prtemca oob_base_verbose "10" --prtemca pmix_session_server "1" --prtemca plm "ssh" --tree-spawn --prtemca prte_parent_uri "[email protected];tcp://10.90.50.196,10.170.133.119,192.168.103.1,172.18.0.1,172.19.0.1:59189:24,24,24,16,16" )
[master:1636264] [prterun-master-1636264@0,0] plm:ssh:launch daemon 0 not a child of mine
[master:1636264] [prterun-master-1636264@0,0] plm:ssh: adding node slave1 to launch list
[master:1636264] [prterun-master-1636264@0,0] plm:ssh: activating launch event
[master:1636264] [prterun-master-1636264@0,0] plm:ssh: recording launch of daemon [prterun-master-1636264@0,1]
[master:1636264] [prterun-master-1636264@0,0] plm:ssh: executing: (/usr/bin/ssh) [/usr/bin/ssh slave1 ( test ! -r ./.profile || . ./.profile; PRTE_PREFIX=/usr;export PRTE_PREFIX;LD_LIBRARY_PATH=/usr/lib:/usr/lib:$LD_LIBRARY_PATH;export LD_LIBRARY_PATH;DYLD_LIBRARY_PATH=/usr/lib:/usr/lib:$DYLD_LIBRARY_PATH;export DYLD_LIBRARY_PATH;/usr/bin/prted --prtemca ess "env" --prtemca ess_base_nspace "prterun-master-1636264@0" --prtemca ess_base_vpid 1 --prtemca ess_base_num_procs "2" --prtemca prte_hnp_uri "[email protected];tcp://10.90.50.196,10.170.133.119,192.168.103.1,172.18.0.1,172.19.0.1:59189:24,24,24,16,16" --prtemca PREFIXES "errmgr,ess,filem,grpcomm,iof,odls,oob,plm,prtebacktrace,prtedl,prteinstalldirs,prtereachable,ras,rmaps,rtc,schizo,state,hwloc,if,reachable" --prtemca plm_base_verbose "30" --prtemca oob_base_verbose "10" --prtemca pmix_session_server "1" --prtemca plm "ssh" --tree-spawn --prtemca prte_parent_uri "[email protected];tcp://10.90.50.196,10.170.133.119,192.168.103.1,172.18.0.1,172.19.0.1:59189:24,24,24,16,16" )]
[master:1636264] [prterun-master-1636264@0,0] prte_oob_tcp_listen_thread: incoming connection: (18, 0) 10.170.133.32:41687
[master:1636264] [prterun-master-1636264@0,0] connection_handler: working connection (18, 0) 10.170.133.32:41687
[master:1636264] [prterun-master-1636264@0,0] accept_connection: 10.170.133.32:41687
[master:1636264] [prterun-master-1636264@0,0]:tcp:recv:handler called
[master:1636264] [prterun-master-1636264@0,0] RECV CONNECT ACK FROM UNKNOWN ON SOCKET 18
[master:1636264] [prterun-master-1636264@0,0] waiting for connect ack from UNKNOWN
[master:1636264] [prterun-master-1636264@0,0] connect ack received from UNKNOWN
[master:1636264] [prterun-master-1636264@0,0] connect-ack recvd from UNKNOWN
[master:1636264] [prterun-master-1636264@0,0] prte_oob_tcp_recv_connect: connection from new peer
[master:1636264] [prterun-master-1636264@0,0] connect-ack header from [prterun-master-1636264@0,1] is okay
[master:1636264] [prterun-master-1636264@0,0] waiting for connect ack from [prterun-master-1636264@0,1]
[master:1636264] [prterun-master-1636264@0,0] connect ack received from [prterun-master-1636264@0,1]
[master:1636264] [prterun-master-1636264@0,0] tcp_peer_close for [prterun-master-1636264@0,1] sd -1 state FAILED
[master:1636264] [prterun-master-1636264@0,0] tcp:lost connection called for peer [prterun-master-1636264@0,1]
[master:1636264] [prterun-master-1636264@0,0] plm:base:receive stop comm
[master:1636264] mca: base: close: component ssh closed
[master:1636264] mca: base: close: unloading component ssh
--------------------------------------------------------------------------
PRTE detected a mismatch in versions between two processes.  This
[master:1636264] [prterun-master-1636264@0,0] TCP SHUTDOWN
typically means that you executed "mpirun" (or "mpiexec") from one
version of PRTE on on node, but your default path on one of the
other nodes upon which you launched found a different version of Open
MPI.

PRTE only supports running exactly the same version between all
processes in a single job.

This will almost certainly cause unpredictable behavior, and may end
up aborting your job.

  Local host:             master
  Local process name:     [prterun-master-1636264@0,0]
  Local PRTE version: 3.0.7
  Peer host:              Unknown
  Peer process name:      [prterun-master-1636264@0,1]
  Peer PRTE version:  3.0.8
--------------------------------------------------------------------------
--------------------------------------------------------------------------
PRTE has lost communication with a remote daemon.

  HNP daemon   : [prterun-master-1636264@0,0] on node master
  Remote daemon: [prterun-master-1636264@0,1] on node slave1

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------
[master:1636264] [prterun-master-1636264@0,0] TCP SHUTDOWN done
[master:1636264] mca: base: close: component tcp closed
[master:1636264] mca: base: close: unloading component tcp
make: *** [Makefile:13: run] Error 250
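
For reference, one way to confirm which Open MPI library each node actually resolves — which is what the version-mismatch message above points at — is to print the library version string from a one-process run on each machine. A hedged sketch (the name version_check.c is illustrative, not from the original report):

#include <mpi.h>
#include <stdio.h>

/* Hypothetical diagnostic sketch (version_check.c): prints the local MPI
 * library version. Run separately on each node, e.g. "mpirun -np 1
 * ./version_check", to see whether master and slave1 resolve different
 * Open MPI/PRTE builds. */
int main(int argc, char **argv)
{
    char version[MPI_MAX_LIBRARY_VERSION_STRING];
    char host[MPI_MAX_PROCESSOR_NAME];
    int vlen, hlen;

    MPI_Init(&argc, &argv);
    MPI_Get_library_version(version, &vlen);
    MPI_Get_processor_name(host, &hlen);
    printf("%s: %s\n", host, version);
    MPI_Finalize();
    return 0;
}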
