
clang has problems with CUDA-aware OpenMPI: "The call to cuMemHostRegister(0x7fb38c000000, 134217728, 0) failed." appears for any code with an nvptx offload target that uses CUDA-aware Open MPI (Message Passing Interface), even if OpenMP is not used in the code #162586

@bschulz81

Description

Hi there, I have written a library that uses OpenMP on device together with the Message Passing Interface: https://github.com/bschulz81/AcceleratedLinearAlgebra/tree/main

I generally want to support many compilers and keep everything standards-compliant.

Because gcc appears to have problems with the recent kernel 6.17, which has improved memory debugging support, I switched to clang 21.1.3.

With its newest fixes for the mapping of nested structs, I am pleased that my library now seems to run smoothly on the GPU, with the exception of one problem:

Unfortunately, for code that runs on device with OpenMP and uses a CUDA-aware Open MPI (OpenMPI), I receive a strange error message from CUDA at startup when compiling with clang 21.1.3.

At first, I thought this was due to my code.

But it actually turns out to have something to do with the initialization of clang's OpenMP CUDA runtime and Open MPI. It suffices to compile the following program, which does not even use any OpenMP directive or CUDA kernel:

#include <omp.h>
#include <stdio.h>
#include <mpi.h>

int main(int argc, char** argv)
{
    int process_Rank, size_Of_Cluster;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size_Of_Cluster);
    MPI_Comm_rank(MPI_COMM_WORLD, &process_Rank);

    printf("Hello World from process %d of %d\n", process_Rank, size_Of_Cluster);

    MPI_Finalize();
    return 0;
}

with the following CMakeLists.txt (note the -fopenmp-targets=nvptx64-nvidia-cuda compile flag and the linking against MPI and OpenMP):

cmake_minimum_required(VERSION 3.10)
set(CMAKE_EXPORT_COMPILE_COMMANDS ON)
set(CMAKE_CXX_STANDARD 23)

project(gpu_compiler_test VERSION 1.0)
find_package(OpenMP REQUIRED)

# set project binary folder:
set(PROJECT_BINARY_DIR ${CMAKE_CURRENT_SOURCE_DIR}/build/)

# set project source folder:
include_directories(${CMAKE_CURRENT_SOURCE_DIR})

# add executable target name (if project is built as executable)
add_executable(gpu_compiler_test ${SOURCES} ${CMAKE_CURRENT_SOURCE_DIR}/main.cpp)

# Clang compiler flags
set(CMAKE_CXX_COMPILER clang++ CACHE STRING "C++ compiler" FORCE)
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -std=c++23 -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -Wall")

if(CMAKE_CXX_COMPILER_ID STREQUAL "Clang")
    message(STATUS "Configuring for Clang compiler")
    target_link_libraries(gpu_compiler_test PRIVATE rt m c stdc++ mpi OpenMP::OpenMP_CXX)
endif()

if(CMAKE_CXX_COMPILER_ID STREQUAL "GNU")
    message(STATUS "Configuring for GNU compiler (gcc)")
    target_link_libraries(gpu_compiler_test PRIVATE rt m c stdc++ mpi OpenMP::OpenMP_CXX)
endif()
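
For reference, outside of CMake the effective compile line should be roughly the following (a sketch; the exact MPI include and link flags depend on the local installation and can be queried through the Open MPI wrapper options mpicxx --showme:compile and mpicxx --showme:link):

clang++ -std=c++23 -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -Wall \
    $(mpicxx --showme:compile) main.cpp $(mpicxx --showme:link) -o gpu_compiler_test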


If one runs this with

mpirun -np 8 ./gpu_compiler_test

one gets output that begins with this strange error message (for reference, a return value of 1 from the CUDA driver API is CUDA_ERROR_INVALID_VALUE):

The call to cuMemHostRegister(0x7fb38c000000, 134217728, 0) failed.
Host: localhost
cuMemHostRegister return value: 1
Registration cache: ��u
��[]A\�D

Hello World from process 5 of 8
Hello World from process 4 of 8
Hello World from process 0 of 8
Hello World from process 3 of 8
Hello World from process 1 of 8
Hello World from process 2 of 8
Hello World from process 7 of 8
Hello World from process 6 of 8

Obviously, in the above example, no user code with OpenMP that could launch a CUDA kernel was even written.

But the code is completely legal. If one replaces the compiler settings in the CMakeLists.txt

set(CMAKE_CXX_COMPILER clang++ CACHE STRING "C++ compiler" FORCE)
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -std=c++23 -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -Wall")

with these lines (which switch to gcc)

set(CMAKE_CXX_COMPILER "g++" CACHE STRING "C++ compiler" FORCE)
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fopenmp -foffload=nvptx-none -fno-stack-protector -Wall")

then gcc produces no warning and the output is

Hello World from process 1 of 8
Hello World from process 6 of 8
Hello World from process 7 of 8
Hello World from process 5 of 8
Hello World from process 3 of 8
Hello World from process 4 of 8
Hello World from process 0 of 8
Hello World from process 2 of 8

without the strange error message from CUDA.

From my tests, this strange cuMemHostRegister output at the beginning does not seem to interfere with the code that runs later.

So one can then write

#pragma omp target teams distribute

loops that accept buffers received via MPI_Recv, and these appear to work...
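
To illustrate the pattern, here is a minimal sketch (not taken from the library; the buffer size, ranks, and tag are hypothetical):

#include <mpi.h>
#include <omp.h>
#include <stdio.h>
#include <vector>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1024;                  // hypothetical buffer size
    std::vector<double> buf(n, 1.0);

    if (rank == 0)
    {
        // Rank 0 sends a host buffer to rank 1.
        MPI_Send(buf.data(), n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    }
    else if (rank == 1)
    {
        // Rank 1 receives into a host buffer ...
        MPI_Recv(buf.data(), n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        // ... and hands that buffer to an OpenMP target region.
        double* p = buf.data();
        #pragma omp target teams distribute parallel for map(tofrom: p[0:n])
        for (int i = 0; i < n; ++i)
            p[i] *= 2.0;

        printf("buf[0] after the target region: %f\n", p[0]);
    }

    MPI_Finalize();
    return 0;
}

Built with the same flags as above and launched with mpirun -np 2, the target region appears to run correctly on rank 1 despite the startup message, matching the observation above.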

Of course Open MPI selects gcc as its default compiler, but one is allowed to switch to clang.
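
For completeness, the Open MPI compiler wrapper can be pointed at clang through the standard OMPI_CXX environment variable, e.g.:

OMPI_CXX=clang++ mpicxx --showme:command   # should now print clang++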

Currently, with the new Linux kernel, I ran into offloading and memory problems with gcc that clang does not show; also, the CUDA code generated from clang's OpenMP seems to be faster than gcc's.

It could of course also be that I missed some configure option for the Message Passing Interface with clang, but the above problem appears to occur when using clang with the nvptx64-nvidia-cuda offload target together with CUDA-aware Open MPI. Without that target, the code runs without the error.

Attachments: main.cpp, CMakeLists.txt
