Hi
[Update: the problem also appears with open-mpi-5.0.8]
I am using CUDA-aware open-mpi-5.0.8.
I want to write applications that use OpenMPI to send data to different nodes that each have several GPU devices.
Once the data is there, I want to run parallel loops on the GPUs with teams of threads and OpenMP (yes, without the I). Because clang has recently updated its OpenACC and OpenMP libraries, I want to use clang 21.1.3 for this.
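For context, the pattern I am aiming for looks roughly like this (a minimal sketch with placeholder sizes and names, not my actual library code): each rank receives its data over MPI and then runs an OpenMP target loop on its local GPU.
#include <mpi.h>
#include <omp.h>
#include <vector>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n = 1 << 20;
    std::vector<double> data(n, 1.0);

    // Rank 0 distributes the data, the other ranks receive their copy.
    if (rank == 0) {
        for (int dest = 1; dest < size; ++dest)
            MPI_Send(data.data(), n, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
    } else {
        MPI_Recv(data.data(), n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    // Each rank then runs a parallel loop with teams of threads on its GPU.
    double* p = data.data();
    #pragma omp target teams distribute parallel for map(tofrom: p[0:n])
    for (int i = 0; i < n; ++i)
        p[i] *= 2.0;

    MPI_Finalize();
    return 0;
}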
So I installed open-mpi-5.0.8 and clang 21.1.3 and set
export OMPI_CXX=clang++
then
mpicxx --show
will yield
clang++ -Wl,-rpath -Wl,/usr/lib64 -Wl,--enable-new-dtags -lmpi
Then I compile a simple OpenMPI hello-world application, which should be able to use OpenMPI and OpenMP together, with OpenMP used for the loops on the device:
mpicxx ./main.cpp -std=c++23 -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda
My real-world library implements a Strassen algorithm which sends the data to several nodes, fills buffers with device pointers on the nodes, and then locally computes the multiplications of the submatrices there: https://github.com/bschulz81/AcceleratedLinearAlgebra/tree/main
Unfortunately, with clang, an error message always appears. At first I thought this was due to my code, so I reduced the test application further and further until it does not call OpenMP at all.
This code here:
#include <omp.h>
#include <iostream>
#include <stdio.h>
#include <mpi.h>

// Note: <omp.h> is included, but no OpenMP runtime calls or target regions are used.
int main(int argc, char** argv)
{
    int process_Rank, size_Of_Cluster;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size_Of_Cluster);
    MPI_Comm_rank(MPI_COMM_WORLD, &process_Rank);

    printf("Hello World from process %d of %d\n", process_Rank, size_Of_Cluster);

    MPI_Finalize();
    return 0;
}
does just a simple hello world. It has no parallel for loops on the device at all. Nevertheless, if compiled with
mpicxx ./main.cpp -std=c++23 -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -o gpu_compiler_test
and run with
mpirun -np 8 ./gpu_compiler_test
it will yield
The call to cuMemHostRegister(0x7fb38c000000, 134217728, 0) failed.
Host: localhost
cuMemHostRegister return value: 1
Registration cache: ��u
��[]A\�D
Hello World from process 5 of 8
Hello World from process 4 of 8
Hello World from process 0 of 8
Hello World from process 3 of 8
Hello World from process 1 of 8
Hello World from process 2 of 8
Hello World from process 7 of 8
Hello World from process 6 of 8
If compiled with gcc as the wrapper compiler and the options
-fopenmp -foffload=nvptx-none -fno-stack-protector
the strange error message does not appear.
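For comparison, the full gcc invocation would look something like this (assuming the same mpicxx wrapper setup, with OMPI_CXX now pointing at g++; the output name is only an example):
export OMPI_CXX=g++
mpicxx ./main.cpp -std=c++23 -fopenmp -foffload=nvptx-none -fno-stack-protector -o gpu_compiler_test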
Note that in the code I can still allocate buffers with omp_target_alloc on the GPUs, use MPI_Send to send data to the nodes, and then run OpenMP loops on the device (a minimal sketch of this pattern follows further below). But the error message is irritating nevertheless, since one starts to wonder whether something is wrong with one's own code when reading messages like
The call to cuMemHostRegister(0x7fb38c000000, 134217728, 0) failed.
Host: localhost
cuMemHostRegister return value: 1
Registration cache: ��u
��[]A\�D
But this appears even if one does not call the OpenMP runtime on the device at all. It suffices to compile a program that uses OpenMPI with clang and
-fopenmp -fopenmp-targets=nvptx64-nvidia-cuda
and the message appears.
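For completeness, the working pattern mentioned above (device buffers via omp_target_alloc, host-side MPI traffic, then a loop on the device) looks roughly like this. This is a minimal sketch under my own simplifications, not the code from the library:
#include <mpi.h>
#include <omp.h>
#include <vector>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n = 1024;
    std::vector<double> host(n, double(rank));

    // Plain host-side MPI traffic between rank 0 and rank 1 (if present).
    if (size > 1) {
        if (rank == 0)
            MPI_Send(host.data(), n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(host.data(), n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    // Allocate a buffer directly on the default device and fill it from the host copy.
    const int dev = omp_get_default_device();
    double* dbuf = static_cast<double*>(omp_target_alloc(n * sizeof(double), dev));
    omp_target_memcpy(dbuf, host.data(), n * sizeof(double),
                      0, 0, dev, omp_get_initial_device());

    // Run a parallel loop on the device using the device pointer.
    #pragma omp target teams distribute parallel for is_device_ptr(dbuf)
    for (int i = 0; i < n; ++i)
        dbuf[i] += 1.0;

    omp_target_free(dbuf, dev);
    MPI_Finalize();
    return 0;
}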
Below is the hello world with a CMakeLists.txt that sets the same options, calls clang directly, and links to MPI.
But as shown above, the error message also appears if one goes the official route and uses clang as the wrapper compiler with mpicxx.