orte_init failed for some reason #12027

@rohithkrn

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

4.1.2

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

apt install openmpi-bin libopenmpi-dev

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

Please describe the system on which you are running

  • Operating system/version: Ubuntu 22.04
  • Computer hardware: aws g5.48xlarge
  • Network type:

Details of the problem

I launch MPI Python processes with mpirun. In rank 1, I start another Python subprocess using subprocess.check_call(cmd). The subprocess fails at startup with:

It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  getting local rank failed
  --> Returned value No permission (-17) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is                                                                                                          
likely to abort.  There are many reasons that a parallel process can                                                                                                              
fail during orte_init; some of which are due to configuration or                                                                                                                  
environment problems.  This failure appears to be an internal failure;                                                                                                            
here's some additional information (which may only be relevant to an                                                                                                              
Open MPI developer):

  orte_ess_init failed
  --> Returned value No permission (-17) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_mpi_init: ompi_rte_init failed
  --> Returned "No permission" (-17) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)

It works fine if I don't launch a subprocess from my Python program.

Command I am using

mpirun -n 4 --allow-run-as-root python test.py
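
One thing that may be worth checking, as an unconfirmed hypothesis rather than a diagnosis: subprocess.check_call passes the parent's environment to the child by default, so the child inherits the variables that mpirun exported to rank 1 (for example OMPI_COMM_WORLD_RANK). If the child also initializes MPI, it may try to join the running job under rank 1's identity. The sketch below launches the child with those variables removed. The prefix list and the helper names env_without_mpi and run_child are assumptions for illustration and are not part of Open MPI or of the original test.py.

```python
import os
import subprocess
import sys

# Assumption: variables exported by mpirun to each rank start with these
# prefixes. Adjust the tuple if your launcher uses other names.
MPI_ENV_PREFIXES = ("OMPI_", "PMIX_")


def env_without_mpi(env=None):
    """Return a copy of env (default: os.environ) without MPI launcher variables."""
    source = os.environ if env is None else env
    return {
        key: value
        for key, value in source.items()
        if not key.startswith(MPI_ENV_PREFIXES)
    }


def run_child(cmd):
    """Run cmd like subprocess.check_call, but with the filtered environment."""
    return subprocess.check_call(cmd, env=env_without_mpi())


if __name__ == "__main__":
    # Hypothetical child command; replace with the real cmd from test.py.
    run_child([sys.executable, "-c", "print('child started')"])
```

If the child then starts cleanly under the same mpirun command, that would point to the inherited environment as the trigger. If it still fails, the cause lies elsewhere.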
