
With clang 21.1.3 as wrapper compiler for CUDA-aware openmpi-5.0.8 and compile options -fopenmp and -fopenmp-targets=nvptx64-nvidia-cuda one gets the message "The call to cuMemHostRegister(0x7fb38c000000, 134217728, 0) failed." for every program that uses OpenMPI #13431

@bschulz81

Description

Hi,
[Update: the problem also appears for open-mpi-5.0.8.]

I am using CUDA-aware open-mpi-5.0.8.
I want to write applications that use OpenMPI to send data to different nodes with several GPU devices.
Once the data is there, I want to run parallel loops on the GPUs with teams of threads and OpenMP (yes, without the I). Because clang recently updated its OpenACC and OpenMP libraries, I want to use clang 21.1.3 for this.
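
To illustrate the pattern I am after, here is a minimal sketch (not my actual code; the array size and loop body are arbitrary placeholders) of an MPI rank offloading a loop to its GPU with OpenMP:

 #include <mpi.h>
 #include <omp.h>
 #include <vector>

 int main(int argc, char** argv)
 {
     MPI_Init(&argc, &argv);

     int rank;
     MPI_Comm_rank(MPI_COMM_WORLD, &rank);

     // per-rank data that MPI would normally have distributed
     const int n = 1024;
     std::vector<double> v(n, 1.0);
     double* p = v.data();

     // run the loop on this rank's GPU with teams of threads
     #pragma omp target teams distribute parallel for map(tofrom: p[0:n])
     for (int i = 0; i < n; ++i)
         p[i] *= 2.0;

     MPI_Finalize();
     return 0;
 }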

So I install open-mpi-5.0.8 and clang 21.1.3 and set

export OMPI_CXX=clang++

then

mpicxx --show

will yield

clang++ -Wl,-rpath -Wl,/usr/lib64 -Wl,--enable-new-dtags -lmpi

Then I compile a simple OpenMPI hello world application, which should be able to use OpenMPI and OpenMP together, with OpenMP used for the loops on the device:

mpicxx ./main.cpp -std=c++23 -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda

My real-world library implements a Strassen algorithm that sends the data to several nodes, fills buffers with device pointers on the nodes, and then locally computes the multiplications of the submatrices there: https://github.com/bschulz81/AcceleratedLinearAlgebra/tree/main

Unfortunately, with clang, an error message always appears. At first I thought this was due to my code, so I reduced the test application further and further until it did not call OpenMP at all.

This code here:


 #include <omp.h>   // pulled in because the program is built with -fopenmp, but never called
 #include <stdio.h>
 #include <mpi.h>

 int main(int argc, char** argv)
 {
     int process_Rank, size_Of_Cluster;

     MPI_Init(&argc, &argv);
     MPI_Comm_size(MPI_COMM_WORLD, &size_Of_Cluster);
     MPI_Comm_rank(MPI_COMM_WORLD, &process_Rank);

     printf("Hello World from process %d of %d\n", process_Rank, size_Of_Cluster);

     MPI_Finalize();
     return 0;
 }

just prints a simple hello world. It has no parallel for loops on the device at all. Nevertheless, if it is compiled with

mpicxx ./main.cpp -std=c++23 -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -o gpu_compiler_test

and run with

mpirun -np 8 ./gpu_compiler_test

it will yield

The call to cuMemHostRegister(0x7fb38c000000, 134217728, 0) failed.
Host: localhost
cuMemHostRegister return value: 1
Registration cache: ��u
��[]A\�D
Hello World from process 5 of 8
Hello World from process 4 of 8
Hello World from process 0 of 8
Hello World from process 3 of 8
Hello World from process 1 of 8
Hello World from process 2 of 8
Hello World from process 7 of 8
Hello World from process 6 of 8

If compiled with gcc as the wrapper compiler and the options

-fopenmp -foffload=nvptx-none -fno-stack-protector

the strange error message does not appear.

Note that in the code, I can still open buffers with omp_target_alloc on the GPUs, use MPI_Send to send something to the nodes, and then do OpenMP loops on the device (a minimal sketch of that pattern follows the quoted message below). But the error message is irritating nevertheless, since one starts to wonder whether something is wrong with one's own code when one reads messages like

The call to cuMemHostRegister(0x7fb38c000000, 134217728, 0) failed.
Host: localhost
cuMemHostRegister return value: 1
Registration cache: ��u
��[]A\�D
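
For reference, here is a minimal sketch of the working pattern mentioned above, assuming the program is launched with at least two ranks and that the CUDA-aware transport accepts the device pointers directly (buffer size and loop bodies are placeholders):

 #include <mpi.h>
 #include <omp.h>

 int main(int argc, char** argv)
 {
     MPI_Init(&argc, &argv);

     int rank;
     MPI_Comm_rank(MPI_COMM_WORLD, &rank);   // assumes at least 2 ranks

     const int n = 1024;
     const int dev = omp_get_default_device();

     // allocate a buffer directly on this rank's GPU
     double* dbuf = static_cast<double*>(omp_target_alloc(n * sizeof(double), dev));

     if (rank == 0) {
         // fill the device buffer on rank 0
         #pragma omp target teams distribute parallel for is_device_ptr(dbuf) device(dev)
         for (int i = 0; i < n; ++i)
             dbuf[i] = i;

         // CUDA-aware OpenMPI takes the device pointer directly
         MPI_Send(dbuf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
     } else if (rank == 1) {
         MPI_Recv(dbuf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

         // then do an OpenMP loop on the received device data
         #pragma omp target teams distribute parallel for is_device_ptr(dbuf) device(dev)
         for (int i = 0; i < n; ++i)
             dbuf[i] *= 2.0;
     }

     omp_target_free(dbuf, dev);
     MPI_Finalize();
     return 0;
 }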

But this appears even if one does not call the OpenMP device runtime at all. It suffices to compile a program that uses OpenMPI with clang and

-fopenmp -fopenmp-targets=nvptx64-nvidia-cuda

and then this message appears.

Below are the hello world and a CMakeLists.txt that sets the same options, calls clang directly, and links to MPI.
But as shown above, the error message also appears if one goes the official route and uses clang as the wrapper compiler with mpicxx.

main.cpp
CMakeLists.txt
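
For context, the attached CMakeLists.txt does roughly the following; this is a sketch of the setup described above, not the attached file verbatim, and the target name is a placeholder:

 cmake_minimum_required(VERSION 3.20)
 set(CMAKE_CXX_COMPILER clang++)   # call clang directly instead of the mpicxx wrapper
 project(gpu_compiler_test CXX)
 set(CMAKE_CXX_STANDARD 23)

 find_package(MPI REQUIRED)

 add_executable(gpu_compiler_test main.cpp)
 target_compile_options(gpu_compiler_test PRIVATE -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda)
 target_link_options(gpu_compiler_test PRIVATE -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda)
 target_link_libraries(gpu_compiler_test PRIVATE MPI::MPI_CXX)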
