Summary
segger segment crashes with a segmentation fault on systems running NVIDIA driver 590.x (CUDA 13.1). The crash is caused by the UCX communication library, which is loaded transitively through cugraph at import time. UCX calls cuCtxGetDevice_v2 in the system's libcuda.so.1 (CUDA 13.1 driver) before a CUDA context is initialized, causing a segfault before any segger code actually runs.
This issue is not fixable by adjusting the CUDA toolkit or conda environment — libcuda.so.1 is always the system-global driver library. It is also not a transient problem: cuspatial has been archived (July 2025, read-only) and will never receive CUDA 13 builds, meaning segger cannot run on any system with a CUDA 13.x driver without changes to its dependencies.
Environment
- OS: Linux (RHEL-based), x86_64
- GPU: 2× NVIDIA RTX A4000 (16 GB each)
- NVIDIA Driver: 590.48.01 (CUDA 13.1)
- Python: 3.11.15 (conda-forge)
- segger: 0.1.0 (installed from dpeerlab/segger main branch)
- PyTorch: 2.5.0+cu121 (works correctly — torch.cuda.is_available() returns True)
- RAPIDS: 25.4.x (cudf-cu12, cuml-cu12, cugraph-cu12, cuspatial-cu12)
- UCX: ucx-py-cu12 0.43.0, libucx-cu12 1.18.1
Reproducing the issue
segger segment -i /path/to/ist/data/ -o /path/to/output/
Immediately crashes with:
[photon:912752:0:912752] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
==== backtrace (tid: 912752) ====
0 .../libucs.so(ucs_handle_error+0x294)
...
4 /lib64/libcuda.so.1(+0x31a708)
5 /lib64/libcuda.so.1(cuCtxGetDevice_v2+0x20)
6 .../libffi.so.8(+0x702a)
...
Segmentation fault (core dumped)
Root cause analysis
The crash occurs in the NVIDIA driver's cuCtxGetDevice_v2 function, called via ctypes/libffi by the UCX library during Python module import. The import chain is:
segger segment
→ segger.cli.segment
→ segger.data.ISTDataModule
→ segger.data.utils.anndata
→ segger.data.utils.neighbors (line 8: `import cugraph`)
→ cugraph.__init__
→ cugraph.structure.graph_primtypes_wrapper
→ cugraph.dask.__init__
→ cugraph.dask.comms.comms
→ raft_dask.common.comms
→ UCX (libucp.so / libucs.so)
→ libcuda.so.1 cuCtxGetDevice_v2 ← SEGFAULT
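The diagnosis that the process dies from signal 11 (rather than raising a Python exception) can be confirmed independently of segger by running the offending import in a throwaway subprocess and inspecting the return code. A hedged sketch; the deliberate NULL-pointer read stands in for `import cugraph` so the snippet is runnable on any machine:

```python
import signal
import subprocess
import sys

def dies_by_segfault(code: str) -> bool:
    """Run `code` in a fresh interpreter and report whether the child was
    killed by SIGSEGV. subprocess reports death-by-signal-N as returncode -N."""
    proc = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,  # silence the child's stderr backtrace
    )
    return proc.returncode == -signal.SIGSEGV

# On an affected machine, the real check would be:
#     dies_by_segfault("import cugraph")
# Here a deliberate NULL-pointer read stands in so the snippet runs anywhere:
print(dies_by_segfault("import ctypes; ctypes.string_at(0)"))
```

This distinguishes the two failure modes seen below: a segfaulting import yields returncode -11, while an ImportError (e.g. after uninstalling UCX) yields a normal nonzero exit.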
Why it happens
- libcuda.so.1 is always system-global. It is provided by the NVIDIA kernel module and cannot be installed per-environment via conda or pip. On this system it is the CUDA 13.1 driver.
- UCX probes the CUDA driver at import time by calling cuCtxGetDevice_v2 before any CUDA context has been created. On the CUDA 13.1 driver, this results in a segfault instead of a graceful error return.
- RAPIDS cu12 packages ship UCX libraries compiled against CUDA 12.x, creating a mismatch with the CUDA 13.1 system driver.
- PyTorch handles this correctly — torch.cuda.is_available() works fine with the same driver, demonstrating that the CUDA 13.1 driver is functional and backward-compatible for well-behaved clients.
- cuspatial is archived and will never have CUDA 13 builds. The cuspatial repository was archived by RAPIDS on July 28, 2025. The cuspatial-cu13 entry on PyPI is a zero-version placeholder. This means segger's dependency on cuspatial is a permanent blocker for CUDA 13.x systems — not a temporary gap that will be filled by a future release.
- UCX is not needed by segger. UCX provides multi-node multi-GPU communication for Dask distributed workloads. Segger runs single-node and does not use Dask distributed, yet UCX is loaded unconditionally because cugraph imports its dask submodule at package init time.
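The contrast with PyTorch suggests a safe probing pattern: cuDriverGetVersion is documented by the CUDA driver API as callable before cuInit and touches no context, so it can be used as a preflight check even on drivers where context-dependent calls like cuCtxGetDevice crash. A sketch (returns None on machines without an NVIDIA driver):

```python
import ctypes

def driver_cuda_version():
    """Return the (major, minor) CUDA version supported by the installed
    NVIDIA driver, or None if libcuda.so.1 is absent or the call fails.
    cuDriverGetVersion needs no cuInit and no CUDA context, so it is safe
    even where context-dependent driver calls segfault."""
    try:
        libcuda = ctypes.CDLL("libcuda.so.1")
    except OSError:
        return None
    version = ctypes.c_int(0)
    if libcuda.cuDriverGetVersion(ctypes.byref(version)) != 0:
        return None
    # Version encoding per CUDA docs: e.g. 13010 -> (13, 1)
    return version.value // 1000, (version.value % 1000) // 10

print(driver_cuda_version())
```

On the system above this would report (13, 1), flagging the driver/cu12-package mismatch before any RAPIDS import is attempted.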
What was tried (and failed)
| Attempt | Result |
| --- | --- |
| Downgrade conda cuda-toolkit to 12.1 | Same segfault — the toolkit is irrelevant, libcuda.so.1 is always system-global |
| export UCX_MEMTYPE_CACHE=n; export UCX_TLS=tcp,self | Same segfault — crash happens before UCX config is read |
| unset LD_LIBRARY_PATH | Same segfault |
| Install RAPIDS via conda (mamba install -c rapidsai) | Same segfault — conda UCX also calls into system libcuda.so.1 |
| CUDA_VISIBLE_DEVICES="" | No segfault, but then no GPU is available for computation |
| Uninstall UCX packages (ucx-py-cu12, libucx-cu12, etc.) | No segfault, but import cugraph fails with ImportError: libucp.so.0, because cugraph unconditionally imports its Dask/distributed submodule, which requires UCX |
Possible solutions
- The most durable fix would be to replace cuspatial entirely, though replicating its spatial primitives would likely be substantial work.
- Alternatively, replace cugraph (or defer its import until a GPU graph is actually needed) so that UCX is never loaded; this seems fragile, since the unconditional dask import is a cugraph internal that could change in any release.
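Short of replacing either library, a guarded import in segger.data.utils.neighbors would at least turn the post-UCX-uninstall ImportError into a clean CPU fallback. This is a hypothetical sketch, not segger's actual code, and a try/except cannot catch the segfault itself; it only helps once the UCX packages are absent (or once cugraph defers its dask import upstream):

```python
# Hypothetical guarded import for segger.data.utils.neighbors (a sketch,
# not segger's current code). With the UCX packages uninstalled, the
# `import cugraph` raises ImportError instead of segfaulting, which this
# catches; with UCX present and a mismatched driver, the segfault still
# happens before Python can intervene.
try:
    import cugraph  # transitively loads UCX when its packages are installed
    HAS_CUGRAPH = True
except ImportError:
    cugraph = None
    HAS_CUGRAPH = False

def neighbor_backend() -> str:
    """Name the nearest-neighbor backend the pipeline should use."""
    return "cugraph" if HAS_CUGRAPH else "cpu"
```

A CPU path (e.g. scipy/sklearn nearest-neighbor search) behind this switch would let single-node runs proceed on CUDA 13.x systems while keeping the GPU path where it still works.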