Background information
Using Open MPI direct GPU communication with CUDA, memory growth is observed throughout the duration of a run.
After tracing the issue with the Open MPI logs (--mca mpi_common_cuda_verbose 10), there appears to be a mismatch between the calls to cuIpcOpenMemHandle and cuIpcCloseMemHandle (cuIpcCloseMemHandle is rarely called).
This behavior seems largely unaffected by various MCA options, as if OPAL were losing track of some allocations.
A similar memory growth pattern is observed on HPE Cray EX235a nodes.
What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)
$ ompi_info 
                 Package: Debian OpenMPI
                Open MPI: 4.1.4
  Open MPI repo revision: v4.1.4
   Open MPI release date: May 26, 2022
                Open RTE: 4.1.4
  Open RTE repo revision: v4.1.4
   Open RTE release date: May 26, 2022
                    OPAL: 4.1.4
      OPAL repo revision: v4.1.4
       OPAL release date: May 26, 2022
                 MPI API: 3.1.0
            Ident string: 4.1.4
                  Prefix: /usr
 Configured architecture: x86_64-pc-linux-gnu
          Configure host: hostname
           Configured by: username
           Configured on: Wed Oct 12 11:52:34 UTC 2022
          Configure host: hostname
  Configure command line: '--build=x86_64-linux-gnu' '--prefix=/usr'
                          '--includedir=${prefix}/include'
                          '--mandir=${prefix}/share/man'
                          '--infodir=${prefix}/share/info'
                          '--sysconfdir=/etc' '--localstatedir=/var'
                          '--disable-option-checking'
                          '--disable-silent-rules'
                          '--libdir=${prefix}/lib/x86_64-linux-gnu'
                          '--runstatedir=/run' '--disable-maintainer-mode'
                          '--disable-dependency-tracking'
                          '--disable-silent-rules'
                          '--disable-wrapper-runpath'
                          '--with-package-string=Debian OpenMPI'
                          '--with-verbs' '--with-libfabric' '--with-psm'
                          '--with-psm2' '--with-ucx'
                          '--with-pmix=/usr/lib/x86_64-linux-gnu/pmix2'
                          '--with-jdk-dir=/usr/lib/jvm/default-java'
                          '--enable-mpi-java'
                          '--enable-opal-btl-usnic-unit-tests'
                          '--with-libevent=external' '--with-hwloc=external'
                          '--disable-silent-rules' '--enable-mpi-cxx'
                          '--enable-ipv6' '--with-devel-headers'
                          '--with-slurm' '--with-cuda=/usr/lib/cuda'
                          '--with-sge' '--without-tm'
                          '--sysconfdir=/etc/openmpi'
                          '--libdir=${prefix}/lib/x86_64-linux-gnu/openmpi/lib'
                          '--includedir=${prefix}/lib/x86_64-linux-gnu/openmpi/include'
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
apt package on Debian (kernel 6.1.99-1).
Please describe the system on which you are running
- Operating system/version: Linux dgx 6.1.0-23-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.99-1 (2024-07-15) x86_64 GNU/Linux
- Computer hardware: Nvidia DGX workstation, CPU : epyc 7742 64c, GPUs : 4x A100-SXM4-40Gb
- Network type: None
Details of the problem
The memory growth is observed during a section of the code that uses non-blocking direct GPU communications (Isend, Irecv) on CUDA memory.
Throughout the duration of a run, the GPU memory usage keeps growing until the application crashes with CUDA_OUT_OF_MEMORY.
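The communication pattern boils down to something like the minimal sketch below. This is not the actual application: the file name (ipc_repro.c), buffer sizes, iteration count, and the simple two-rank exchange are placeholders, and it assumes a CUDA-aware Open MPI build with the ranks sharing a node so that the smcuda BTL takes the CUDA IPC path.

/* ipc_repro.c -- minimal sketch, not the actual application.
 * Build: mpicc ipc_repro.c -o ipc_repro -I"$CUDA_HOME"/include -L"$CUDA_HOME"/lib64 -lcudart
 * Run:   mpirun --mca mpi_common_cuda_verbose 10 -n 2 ./ipc_repro
 */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Spread the ranks over the available GPUs. */
    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    cudaSetDevice(rank % (ndev > 0 ? ndev : 1));

    int peer = (rank + 1) % size;

    /* Repeated non-blocking exchanges on freshly allocated device buffers
     * of varying (mostly small) sizes, mimicking the pattern under which
     * the growth is observed. */
    for (int iter = 0; iter < 10000; ++iter) {
        size_t count = 1024 + (size_t)(iter % 64) * 1024;  /* placeholder sizes */
        double *d_send = NULL, *d_recv = NULL;
        cudaMalloc((void **)&d_send, count * sizeof(double));
        cudaMalloc((void **)&d_recv, count * sizeof(double));
        cudaMemset(d_send, 0, count * sizeof(double));

        MPI_Request reqs[2];
        MPI_Irecv(d_recv, (int)count, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Isend(d_send, (int)count, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[1]);
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

        /* GPU memory reported by the driver keeps growing across iterations
         * even though the buffers are freed here. */
        cudaFree(d_send);
        cudaFree(d_recv);
    }

    MPI_Finalize();
    return 0;
}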
Typically, the evolution of the memory usage of the code, when graphed, looks like this:

When tracing the issue I stumbled on this old post on the NVIDIA forum, https://forums.developer.nvidia.com/t/memory-increase-in-gpu-aware-non-blocking-mpi-communications/275634/4, which pointed toward the cuIpc handling within Open MPI.
Running the same test with --mca mpi_common_cuda_verbose 10, I traced the calls to cuIpcOpenMemHandle and cuIpcCloseMemHandle to follow the evolution of the memory usage, which matches the observed memory growth.
I tried running the following test cases:
mpirun --mca mpi_common_cuda_verbose 10 \
    -n 4 <application> \
    2> out10_ompi_default

mpirun --mca mpi_common_cuda_verbose 10 --mca mpool_rgpusm_rcache_empty_cache 1 \
    -n 4 <application> \
    2> out10_ompi_empty_cache

mpirun --mca mpi_common_cuda_verbose 10 --mca mpool_rgpusm_rcache_size_limit 100000 \
    -n 4 <application> \
    2> out10_ompi_szlim100000

mpirun --mca mpi_common_cuda_verbose 10 --mca mpool_rgpusm_rcache_empty_cache 1 --mca mpool_rgpusm_rcache_size_limit 100000 \
    -n 4 <application> \
    2> out10_ompi_empty_cache_szlim100000
If we plot the memory evolution traced from the calls to cuIpcOpenMemHandle and cuIpcCloseMemHandle, we get the following:

The large communications at the beginning of the run are indeed freed correctly; however, the smaller communications do not appear to be freed until the call to MPI_Finalize.
Lastly, if we set --mca btl_smcuda_use_cuda_ipc 0, no memory leak is observed, which confirms that the issue lies in the CUDA IPC path.
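To quantify the mismatch between opens and closes in these logs, the following rough tally sketch can be used (ipc_tally.c is a hypothetical helper; it only assumes that the verbose log lines contain the literal function names, and it does not try to parse sizes or addresses):

/* ipc_tally.c -- count cuIpcOpenMemHandle vs. cuIpcCloseMemHandle mentions.
 * Build: cc ipc_tally.c -o ipc_tally
 * Usage: ./ipc_tally out10_ompi_default
 */
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s <logfile>\n", argv[0]);
        return 1;
    }
    FILE *f = fopen(argv[1], "r");
    if (!f) {
        perror(argv[1]);
        return 1;
    }

    char line[4096];
    long opens = 0, closes = 0;
    while (fgets(line, sizeof line, f)) {
        if (strstr(line, "cuIpcOpenMemHandle"))
            ++opens;
        if (strstr(line, "cuIpcCloseMemHandle"))
            ++closes;
    }
    fclose(f);

    printf("cuIpcOpenMemHandle : %ld\n", opens);
    printf("cuIpcCloseMemHandle: %ld\n", closes);
    return 0;
}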
So far this behavior has been reproduced with:
- openmpi 4.1.4 (Debian)
- ucx-1.15.0.tar.gz + openmpi-4.1.6.tar.gz
- ucx-1.16.0.tar.gz + openmpi-4.1.6.tar.gz
- ucx-1.17.0.tar.gz + openmpi-4.1.6.tar.gz
I'm also looking for hotfixes, since this issue is likely to impact us on many supercomputers.