
Memory leak in mpirun when attaching a debugger using MPIR causes host to run out of memory  #5454

@James-A-Clark

Description

Background information

When attaching a debugger at startup using the MPIR interface, a memory leak in the mpirun process (roughly 1 GB every 10 seconds) causes the system to run out of memory. This makes debugging MPI programs impractical.

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

openmpi-3.1.1 (distribution tarball)

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

$ tar -xjf openmpi-3.1.1.tar.bz2
$ cd openmpi-3.1.1
$ ./configure --enable-mpirun-prefix-by-default --without-openib --with-psm=no --with-scif=no
$ make

Please describe the system on which you are running

  • Operating system/version: Ubuntu 16.04 and Red Hat 7.0
  • Compiler: GCC 6.0, GCC 7.3, PGI 16.5 and PGI 17.3 all exhibit the same behaviour
  • Computer hardware: x86-64
  • Network type: n/a, reproduces on a single node

Details of the problem

The general steps to reproduce are:

  • start mpirun under a debugger
  • set MPIR_being_debugged=1
  • break at MPIR_Breakpoint
  • attach to the user program (hello_c) and keep it paused
  • resume the mpirun process; the memory leak then starts

The expected behaviour is that the child process can be held at any point, indefinitely, without the mpirun process leaking memory.
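
For context, the steps above exercise the standard MPIR process-acquisition symbols that mpirun exports. The following is a minimal C sketch of those declarations; it follows the published MPIR interface rather than Open MPI's exact source, so treat the names and comments as a reference, not a copy of the code.

/* Minimal sketch of the MPIR process-acquisition interface driven by the
 * GDB scripts; declarations follow the public MPIR specification. */
typedef struct {
    char *host_name;        /* host the MPI process runs on */
    char *executable_name;  /* path of the MPI executable */
    int   pid;              /* PID of the MPI process to attach to */
} MPIR_PROCDESC;

extern MPIR_PROCDESC *MPIR_proctable;      /* filled in by mpirun at launch */
extern int            MPIR_proctable_size;

extern volatile int   MPIR_being_debugged; /* the debugger sets this to 1
                                              before launching mpirun */

void MPIR_Breakpoint(void);                /* mpirun calls this once the
                                              proctable is ready; the debugger
                                              breaks here, attaches to the
                                              listed PIDs (hello_c), and then
                                              resumes mpirun */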

A reproducer using GDB scripts is attached (gdb-only.tar.gz) and can be run with these commands:

make
gdb -x mpirun.gdb --args mpirun -np 1 ./hello_c

The reproducer only holds the child process for 30 seconds to prevent system instability, but this can be modified in the hello.gdb file.

Here is a snapshot from top showing mpirun's memory usage:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                               
 7452 jamcla02  20   0 2256364 1.889g  14628 R 127.3  9.9   0:37.42 mpirun    

Possible source of the leak

The leak appears to come from the following line (opal/mca/pmix/pmix2x/pmix/src/event/pmix_event_notification.c:968):

PMIX_INFO_CREATE(chain->info, chain->nallocated);

This part of the code was modified between versions 3.0 and 3.1.
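
If that line is indeed the culprit, the suspected pattern would be an info array allocated with PMIX_INFO_CREATE that is never released with a matching PMIX_INFO_FREE while notifications are processed. The following is a simplified, hypothetical illustration of that pattern, not the actual pmix_event_notification.c code; the chain struct and the handler function are placeholders.

#include <pmix.h>

/* Hypothetical stand-in for the notification chain object; the field names
 * mirror the snippet above, but the struct itself is a placeholder. */
struct chain {
    pmix_info_t *info;
    size_t       nallocated;
};

static void handle_notification(struct chain *chain)
{
    chain->nallocated = 4;

    /* Allocates a fresh pmix_info_t array on every call ... */
    PMIX_INFO_CREATE(chain->info, chain->nallocated);

    /* ... so if the previous array is never released with a matching
     * PMIX_INFO_FREE(chain->info, chain->nallocated), every notification
     * processed while the job is held leaks the whole array, which would
     * add up quickly while the child process stays paused. */
}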

Previous behaviour

This appears to be a regression: the same test with Open MPI 3.0 does not exhibit the issue.
