
clang has problems with CUDA-aware OpenMPI: "The call to cuMemHostRegister(0x7fb38c000000, 134217728, 0) failed." appears for any code with an nvptx offload target that uses CUDA-aware Open MPI (Message Passing Interface), even if OpenMP is not used in the code #162586

@bschulz81

Description

Hi there, I have written a library that uses OpenMP on device together with the Message Passing Interface: https://github.com/bschulz81/AcceleratedLinearAlgebra/tree/main

I generally want to support many compilers and keep everything standards-compliant.

Because gcc appears to have problems with the recent kernel 6.17, which has improved memory debugging support, I switched to clang 21.1.3.

With its newest fixes for the mapping of nested structs, I am pleased that my library now seems to run smoothly on the GPU, with the exception of one problem:

Unfortunately, for code that runs on device with OpenMP and uses a CUDA-aware Open MPI (OpenMPI), I receive a strange error message from CUDA at startup when compiling with clang 21.1.3.

At first, I thought this was due to my code.

But it actually turns out to have something to do with the initialization of clang's OpenMP CUDA runtime and Open MPI. It suffices to compile the following program, which does not even use any OpenMP directive or CUDA kernel:

#include <omp.h>
#include <stdio.h>
#include <mpi.h>

int main(int argc, char** argv)
{
    int process_Rank, size_Of_Cluster;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size_Of_Cluster);
    MPI_Comm_rank(MPI_COMM_WORLD, &process_Rank);

    printf("Hello World from process %d of %d\n", process_Rank, size_Of_Cluster);

    MPI_Finalize();
    return 0;
}

with the following CMakeLists.txt (note the -fopenmp-targets=nvptx64-nvidia-cuda compile flag and the linking against MPI and OpenMP):

cmake_minimum_required(VERSION 3.10)
set(CMAKE_EXPORT_COMPILE_COMMANDS ON)
set(CMAKE_CXX_STANDARD 23)

project(gpu_compiler_test VERSION 1.0)
find_package(OpenMP REQUIRED)

# set project binary folder:
set(PROJECT_BINARY_DIR ${CMAKE_CURRENT_SOURCE_DIR}/build/)

# set project source folder:
include_directories(${CMAKE_CURRENT_SOURCE_DIR})

# add executable target name (if project is built as executable)
add_executable(gpu_compiler_test ${SOURCES} ${CMAKE_CURRENT_SOURCE_DIR}/main.cpp)

# Clang compiler flags
set(CMAKE_CXX_COMPILER clang++ CACHE STRING "C++ compiler" FORCE)
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -std=c++23 -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -Wall")

if(CMAKE_CXX_COMPILER_ID STREQUAL "Clang")
    message(STATUS "Configuring for Clang compiler")
    target_link_libraries(gpu_compiler_test PRIVATE rt m c stdc++ mpi OpenMP::OpenMP_CXX)
endif()

if(CMAKE_CXX_COMPILER_ID STREQUAL "GNU")
    message(STATUS "Configuring for GNU compiler (gcc)")
    target_link_libraries(gpu_compiler_test PRIVATE rt m c stdc++ mpi OpenMP::OpenMP_CXX)
endif()
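
For reference, outside of CMake the effective compile line should be roughly the following (a sketch; the exact MPI include and link flags depend on the local installation and can be queried through the Open MPI wrapper options mpicxx --showme:compile and mpicxx --showme:link):

clang++ -std=c++23 -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -Wall \
    $(mpicxx --showme:compile) main.cpp $(mpicxx --showme:link) -o gpu_compiler_test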


If one runs this with

mpirun -np 8 ./gpu_compiler_test

one gets output that begins with this strange error message (for reference, a return value of 1 from the CUDA driver API is CUDA_ERROR_INVALID_VALUE):

The call to cuMemHostRegister(0x7fb38c000000, 134217728, 0) failed.
Host: localhost
cuMemHostRegister return value: 1
Registration cache: ��u
��[]A\�D

Hello World from process 5 of 8
Hello World from process 4 of 8
Hello World from process 0 of 8
Hello World from process 3 of 8
Hello World from process 1 of 8
Hello World from process 2 of 8
Hello World from process 7 of 8
Hello World from process 6 of 8

Obviously, in the above example, no user code with OpenMP that could launch a CUDA kernel was even written.

But the code is completely legal. If one replaces the compiler settings in the CMakeLists.txt

set(CMAKE_CXX_COMPILER clang++ CACHE STRING "C++ compiler" FORCE)
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -std=c++23 -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -Wall")

with these lines (which switch to gcc)

set(CMAKE_CXX_COMPILER "g++" CACHE STRING "C++ compiler" FORCE)
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fopenmp -foffload=nvptx-none -fno-stack-protector -Wall")

then gcc produces no warning and the output is

Hello World from process 1 of 8
Hello World from process 6 of 8
Hello World from process 7 of 8
Hello World from process 5 of 8
Hello World from process 3 of 8
Hello World from process 4 of 8
Hello World from process 0 of 8
Hello World from process 2 of 8

without the strange error message from CUDA.

From my tests, this strange cuMemHostRegister output at the beginning does not seem to interfere with the code that runs later.

So one can then write

#pragma omp target teams distribute

loops that accept buffers received via MPI_Recv, and these appear to work...
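
To illustrate the pattern, here is a minimal sketch (not taken from the library; the buffer size, ranks, and tag are hypothetical):

#include <mpi.h>
#include <omp.h>
#include <stdio.h>
#include <vector>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1024;                  // hypothetical buffer size
    std::vector<double> buf(n, 1.0);

    if (rank == 0)
    {
        // Rank 0 sends a host buffer to rank 1.
        MPI_Send(buf.data(), n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    }
    else if (rank == 1)
    {
        // Rank 1 receives into a host buffer ...
        MPI_Recv(buf.data(), n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        // ... and hands that buffer to an OpenMP target region.
        double* p = buf.data();
        #pragma omp target teams distribute parallel for map(tofrom: p[0:n])
        for (int i = 0; i < n; ++i)
            p[i] *= 2.0;

        printf("buf[0] after the target region: %f\n", p[0]);
    }

    MPI_Finalize();
    return 0;
}

Built with the same flags as above and launched with mpirun -np 2, the target region appears to run correctly on rank 1 despite the startup message, matching the observation above.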

Of course Open MPI selects gcc as its default compiler, but one is allowed to switch to clang.
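
For completeness, the Open MPI compiler wrapper can be pointed at clang through the standard OMPI_CXX environment variable, e.g.:

OMPI_CXX=clang++ mpicxx --showme:command   # should now print clang++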

Currently, with the new Linux kernel, I ran into offloading and memory problems with gcc that clang does not show; also, the CUDA code generated from clang's OpenMP seems to be faster than gcc's.

It could of course also be that I missed some configure option for the Message Passing Interface with clang, but the above problem appears to occur when using clang with the nvptx64-nvidia-cuda offload target together with CUDA-aware Open MPI. Without that target, the code runs without the error.

Attachments: main.cpp, CMakeLists.txt
