Background information
When attempting to start up and attach a debugger using the MPIR interface, a memory leak in the mpirun process (roughly 1 GB every 10 seconds) causes the system to run out of memory. This prevents debugging MPI programs.
What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)
openmpi-3.1.1 (distribution tarball)
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
$ tar -xjf openmpi-3.1.1.tar.bz2
$ ./configure --enable-mpirun-prefix-by-default --without-openib --with-psm=no --with-scif=no
$ make
Please describe the system on which you are running
- Operating system/version: Ubuntu 16.04 and Redhat 7.0
- Compiler: GCC 6.0, GCC 7.3, PGI 16.5 and PGI 17.3 all exhibit the same behaviour
- Computer hardware: x86-64
- Network type: n/a, reproduces on a single node
Details of the problem
The general steps to reproduce are to:
- start mpirun under a debugger
- set MPIR_being_debugged=1
- break at MPIR_Breakpoint
- attach to the user program (hello_c) and keep it paused
- then resume the mpirun process
- the memory leak starts
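The steps above can be sketched as a GDB command file (illustrative names; the scripts in the attached reproducer may differ):

```
# mpirun.gdb -- sketch of driving mpirun under GDB
set pagination off
break main
run
# tell mpirun that a tool is attached
set var MPIR_being_debugged = 1
# trap the MPIR rendezvous point
break MPIR_Breakpoint
continue
# mpirun stops in MPIR_Breakpoint once the children are launched;
# attach a second GDB to hello_c here and keep it paused, then:
continue
# ...the mpirun process now starts leaking memory
```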
The expected behaviour is that the child process can be held at any point indefinitely without causing the mpirun process to leak memory.
A reproducer using GDB scripts is attached here (gdb-only.tar.gz) and can be started with this command:
make
gdb -x mpirun.gdb --args mpirun -np 1 ./hello_c
The reproducer only holds the child process for 30 seconds to prevent system instability, but this can be modified in the hello.gdb file.
Here is a snapshot from top showing mpirun's memory usage:
  PID USER     PR NI    VIRT    RES   SHR S  %CPU %MEM   TIME+ COMMAND
 7452 jamcla02 20  0 2256364 1.889g 14628 R 127.3  9.9 0:37.42 mpirun
Possible source of the leak
It seems that the leak could be coming from the following line
opal/mca/pmix/pmix2x/pmix/src/event/pmix_event_notification.c:968
PMIX_INFO_CREATE(chain->info, chain->nallocated);
There has been some modification around this part of the code from version 3.0 to 3.1.
Previous behaviour
This seems to be a regression, because the same test with Open MPI 3.0 does not exhibit the issue.