Segfault and CUDA context errors in cuda_copy_md.c (UCX v1.18.1) #10838
I'm trying to train a ResNet-18 model in a distributed setup across multiple nodes with 1 GPU each. I'm using PyTorch v2.6 built from source with MPI support, and CUDA-aware Open MPI with UCX. I'm running into a segmentation fault when using UCX with CUDA. The application crashes when running on 2 nodes (1 GPU each), and the logs show repeated warnings and errors coming from `cuda_copy_md.c`:

```
[gpu15:85054:0:85088] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x7fc62b3b0c00)
[1756790188.787634] [gpu15:85054:0] cuda_copy_md.c:355 UCX WARN cuda_cuCtxSetFlags_func(CU_CTX_SYNC_MEMOPS) failed: invalid device context
[1756790199.681397] [gpu35:46469:0] cuda_copy_md.c:355 UCX WARN cuda_cuCtxSetFlags_func(CU_CTX_SYNC_MEMOPS) failed: invalid device context
```

It looks like UCX is calling `cuCtxSetFlags(CU_CTX_SYNC_MEMOPS)` and `cuMemGetAddressRange()`, but both fail with "invalid device context", leading to the segfault.

Environment:

- UCX version: v1.18.1
- CUDA version: v12.4
- Open MPI version: v5.0.8

Command used to run:

```
mpirun -np 2 --hostfile ~/host2 --npernode 1 --mca pml ucx -x LD_LIBRARY_PATH=/home/skakde/miniconda3/lib:$LD_LIBRARY_PATH -x OMPI_MCA_coll_hcoll_enable=0 -x UCX_TLS=rc,cuda_copy,cuda_ipc bash -c ". /home/skakde/miniconda3/bin/activate && conda activate torch_env && python /home/skakde/resnet18.py"
```

Has anyone seen this before? Could this be related to CUDA context management (e.g. UCX not picking up the right context), or is it a known issue in this branch? Any suggestions for debugging or workarounds would be appreciated. Thanks!
Hi,
Using multiple GPUs within a single process is not supported in UCX v1.18.1; support was added in v1.19.0.
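
Until an upgrade to 1.19.0 is possible, one common way to guarantee that each rank only ever sees a single device is to pin the GPU via `CUDA_VISIBLE_DEVICES` in a small launcher script. This is a sketch, not something verified against your setup: the wrapper name `bind_gpu.sh` is hypothetical, and it assumes Open MPI's per-rank `OMPI_COMM_WORLD_LOCAL_RANK` environment variable is available.

```shell
#!/bin/bash
# bind_gpu.sh (hypothetical name): restrict each MPI rank to one GPU.
# Open MPI exports OMPI_COMM_WORLD_LOCAL_RANK for every local rank;
# we map local rank N to GPU N, falling back to GPU 0 if unset.
export CUDA_VISIBLE_DEVICES="${OMPI_COMM_WORLD_LOCAL_RANK:-0}"
# Replace this process with the actual command (e.g. python resnet18.py),
# so the training script inherits the restricted device list.
exec "$@"
```

It would be invoked by inserting the wrapper before the payload command, e.g. `mpirun -np 2 --npernode 1 ... ./bind_gpu.sh python /home/skakde/resnet18.py`, so each process can only create a CUDA context on its own device.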