Hi
[Update: the problem also appears with open-mpi-5.0.8]
I am using CUDA-aware open-mpi-5.0.8.
I want to write applications that use OpenMPI to send data to different nodes that each have several GPU devices.
Once the data is there, I want to run parallel loops on the GPUs with teams of threads and OpenMP (yes, without the I). Because clang has recently updated its OpenACC and OpenMP libraries, I want to use clang 21.1.3 for this.
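For context, the pattern I am aiming for looks roughly like this (a minimal sketch with placeholder sizes and names, not my actual library code): each rank receives its data over MPI and then runs an OpenMP target loop on its local GPU.
#include <mpi.h>
#include <omp.h>
#include <vector>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n = 1 << 20;
    std::vector<double> data(n, 1.0);

    // Rank 0 distributes the data, the other ranks receive their copy.
    if (rank == 0) {
        for (int dest = 1; dest < size; ++dest)
            MPI_Send(data.data(), n, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
    } else {
        MPI_Recv(data.data(), n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    // Each rank then runs a parallel loop with teams of threads on its GPU.
    double* p = data.data();
    #pragma omp target teams distribute parallel for map(tofrom: p[0:n])
    for (int i = 0; i < n; ++i)
        p[i] *= 2.0;

    MPI_Finalize();
    return 0;
}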
So I installed open-mpi-5.0.8 and clang 21.1.3 and set
export OMPI_CXX=clang++
then
mpicxx --show
will yield
clang++ -Wl,-rpath -Wl,/usr/lib64 -Wl,--enable-new-dtags -lmpi
Then I compile a simple OpenMPI hello-world application, which should be able to use OpenMPI and OpenMP together, with OpenMP used for the loops on the device:
mpicxx ./main.cpp -std=c++23 -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda
My real-world library implements a Strassen algorithm which sends the data to several nodes, fills buffers with device pointers on the nodes, and then locally computes the multiplications of the submatrices there: https://github.com/bschulz81/AcceleratedLinearAlgebra/tree/main
Unfortunately, with clang, an error message always appears. At first I thought this was due to my code, so I reduced the test application further and further until it does not call OpenMP at all.
This code here:
#include <omp.h>
#include <iostream>
#include <stdio.h>
#include <mpi.h>

// Note: <omp.h> is included, but no OpenMP runtime calls or target regions are used.
int main(int argc, char** argv)
{
    int process_Rank, size_Of_Cluster;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size_Of_Cluster);
    MPI_Comm_rank(MPI_COMM_WORLD, &process_Rank);

    printf("Hello World from process %d of %d\n", process_Rank, size_Of_Cluster);

    MPI_Finalize();
    return 0;
}
does just a simple hello world. It has no parallel for loops on the device at all. Nevertheless, if compiled with
mpicxx ./main.cpp -std=c++23 -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -o gpu_compiler_test
and run with
mpirun -np 8 ./gpu_compiler_test
it will yield
The call to cuMemHostRegister(0x7fb38c000000, 134217728, 0) failed.
Host: localhost
cuMemHostRegister return value: 1
Registration cache: ��u
��[]A\�D
Hello World from process 5 of 8
Hello World from process 4 of 8
Hello World from process 0 of 8
Hello World from process 3 of 8
Hello World from process 1 of 8
Hello World from process 2 of 8
Hello World from process 7 of 8
Hello World from process 6 of 8
If compiled with gcc as the wrapper compiler and the options
-fopenmp -foffload=nvptx-none -fno-stack-protector
the strange error message does not appear.
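For comparison, the full gcc invocation would look something like this (assuming the same mpicxx wrapper setup, with OMPI_CXX now pointing at g++; the output name is only an example):
export OMPI_CXX=g++
mpicxx ./main.cpp -std=c++23 -fopenmp -foffload=nvptx-none -fno-stack-protector -o gpu_compiler_test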
Note that in the code I can still allocate buffers with omp_target_alloc on the GPUs, use MPI_Send to send data to the nodes, and then run OpenMP loops on the device (a minimal sketch of this pattern follows further below). But the error message is irritating nevertheless, since one starts to wonder whether something is wrong with one's own code when reading messages like
The call to cuMemHostRegister(0x7fb38c000000, 134217728, 0) failed.
Host: localhost
cuMemHostRegister return value: 1
Registration cache: ��u
��[]A\�D
But this appears even if one does not call the OpenMP runtime on the device at all. It suffices to compile a program that uses OpenMPI with clang and
-fopenmp -fopenmp-targets=nvptx64-nvidia-cuda
and the message appears.
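For completeness, the working pattern mentioned above (device buffers via omp_target_alloc, host-side MPI traffic, then a loop on the device) looks roughly like this. This is a minimal sketch under my own simplifications, not the code from the library:
#include <mpi.h>
#include <omp.h>
#include <vector>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n = 1024;
    std::vector<double> host(n, double(rank));

    // Plain host-side MPI traffic between rank 0 and rank 1 (if present).
    if (size > 1) {
        if (rank == 0)
            MPI_Send(host.data(), n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(host.data(), n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    // Allocate a buffer directly on the default device and fill it from the host copy.
    const int dev = omp_get_default_device();
    double* dbuf = static_cast<double*>(omp_target_alloc(n * sizeof(double), dev));
    omp_target_memcpy(dbuf, host.data(), n * sizeof(double),
                      0, 0, dev, omp_get_initial_device());

    // Run a parallel loop on the device using the device pointer.
    #pragma omp target teams distribute parallel for is_device_ptr(dbuf)
    for (int i = 0; i < n; ++i)
        dbuf[i] += 1.0;

    omp_target_free(dbuf, dev);
    MPI_Finalize();
    return 0;
}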
Below is the hello world with a CMakeLists.txt that sets the same options, calls clang directly, and links to MPI.
But as shown above, the error message also appears if one goes the official route and uses clang as the wrapper compiler with mpicxx.