Segfault and CUDA context errors in cuda_copy_md.c (UCX v1.18.1) #10838
I'm trying to train a ResNet-18 model in a distributed setup across multiple nodes with 1 GPU each. I'm using PyTorch v2.6 built from source with MPI support, and CUDA-aware Open MPI with UCX. I'm running into a segmentation fault when using UCX with CUDA. The application crashes when running on 2 nodes (1 GPU each), and the logs show repeated warnings and errors coming from `cuda_copy_md.c`:

```
[gpu15:85054:0:85088] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x7fc62b3b0c00)
[1756790188.787634] [gpu15:85054:0] cuda_copy_md.c:355 UCX WARN cuda_cuCtxSetFlags_func(CU_CTX_SYNC_MEMOPS) failed: invalid device context
[1756790199.681397] [gpu35:46469:0] cuda_copy_md.c:355 UCX WARN cuda_cuCtxSetFlags_func(CU_CTX_SYNC_MEMOPS) failed: invalid device context
```

It looks like UCX is calling `cuCtxSetFlags(CU_CTX_SYNC_MEMOPS)` and `cuMemGetAddressRange()`, but both fail with "invalid device context", leading to the segfault.

Environment:

- UCX version: v1.18.1
- CUDA version: v12.4
- Open MPI version: v5.0.8

Command used to run:

```
mpirun -np 2 --hostfile ~/host2 --npernode 1 --mca pml ucx -x LD_LIBRARY_PATH=/home/skakde/miniconda3/lib:$LD_LIBRARY_PATH -x OMPI_MCA_coll_hcoll_enable=0 -x UCX_TLS=rc,cuda_copy,cuda_ipc bash -c ". /home/skakde/miniconda3/bin/activate && conda activate torch_env && python /home/skakde/resnet18.py"
```

Has anyone seen this before? Could this be related to CUDA context management (e.g. UCX not picking up the right context), or is it a known issue in this branch? Any suggestions for debugging or workarounds would be appreciated. Thanks!
Hi,
Using multiple GPUs within a single process is not supported in UCX v1.18.1; support was added in v1.19.0.
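
Until an upgrade to 1.19.0 is possible, one common way to guarantee that each rank only ever sees a single device is to pin the GPU via `CUDA_VISIBLE_DEVICES` in a small launcher script. This is a sketch, not something verified against your setup: the wrapper name `bind_gpu.sh` is hypothetical, and it assumes Open MPI's per-rank `OMPI_COMM_WORLD_LOCAL_RANK` environment variable is available.

```shell
#!/bin/bash
# bind_gpu.sh (hypothetical name): restrict each MPI rank to one GPU.
# Open MPI exports OMPI_COMM_WORLD_LOCAL_RANK for every local rank;
# we map local rank N to GPU N, falling back to GPU 0 if unset.
export CUDA_VISIBLE_DEVICES="${OMPI_COMM_WORLD_LOCAL_RANK:-0}"
# Replace this process with the actual command (e.g. python resnet18.py),
# so the training script inherits the restricted device list.
exec "$@"
```

It would be invoked by inserting the wrapper before the payload command, e.g. `mpirun -np 2 --npernode 1 ... ./bind_gpu.sh python /home/skakde/resnet18.py`, so each process can only create a CUDA context on its own device.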