Add RAPIDS Doctor Check for cuGraph-PyG and pylibwholegraph #418
alexbarghi-nv wants to merge 15 commits into rapidsai:main from
Conversation
This commit introduces new entry points for smoke checks in both the `pylibwholegraph` and `cugraph-pyg` projects. The entry points are defined in their respective `pyproject.toml` files, allowing for easier integration and testing of smoke checks.

Changes:
- Added `pylibwholegraph_smoke_check` entry point in `python/pylibwholegraph/pyproject.toml`
- Added `cugraph_pyg_smoke_check` entry point in `python/cugraph-pyg/pyproject.toml`
This commit modifies the entry points for smoke checks in the `pyproject.toml` files of both `pylibwholegraph` and `cugraph-pyg` to enhance testing capabilities.

Changes:
- Updated `pylibwholegraph_smoke_check` entry point in `python/pylibwholegraph/pyproject.toml`
- Updated `cugraph_pyg_smoke_check` entry point in `python/cugraph-pyg/pyproject.toml`
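A sketch of what the entry-point registration described above could look like in `pyproject.toml`. The group name `rapids_doctor.checks` and the module paths are assumptions for illustration; the actual names come from the diff, which is not shown here:

```toml
# Hypothetical sketch only -- group name and module paths are assumptions,
# not taken from the actual PR diff.
[project.entry-points."rapids_doctor.checks"]
pylibwholegraph_smoke_check = "pylibwholegraph._doctor:pylibwholegraph_smoke_check"
```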
Greptile Summary: Adds a RAPIDS Doctor check for cuGraph-PyG and pylibwholegraph.
Confidence Score: 4/5
Important Files Changed
Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[rapids doctor invokes entrypoint] --> B{which check?}
B --> C[cugraph_pyg_smoke_check]
B --> D[pylibwholegraph_smoke_check]
C --> C1[import cugraph_pyg + submodules]
C1 --> C2{ImportError?}
C2 -- yes --> C_FAIL[raise ImportError]
C2 -- no --> C3[check __version__]
C3 --> C4[import_optional torch]
C4 --> C5{MissingModule or no CUDA?}
C5 -- yes --> C_WARN[warnings.warn PyTorch needed]
C5 -- no --> C6[save env vars]
C6 --> C7[set distributed env vars]
C7 --> C8[init_process_group nccl]
C8 --> C9[GraphStore put/get edge_index]
C9 --> C10[assert shape]
C10 --> C11[finally: restore env vars\n+ destroy_process_group if initialized]
D --> D1[import pylibwholegraph]
D1 --> D2{ImportError?}
D2 -- yes --> D_FAIL[raise ImportError]
D2 -- no --> D3[check __version__]
D3 --> D4[import torch\nassert cuda.is_available]
D4 --> D5{ImportError or AssertionError?}
D5 -- yes --> D_WARN[warnings.warn PyTorch/CUDA needed]
D5 -- no --> D_OK[check passes]
Last reviewed commit: 02c66e4
    # Ensure core submodules load (touches pylibwholegraph, torch-geometric, etc.)
    import cugraph_pyg.data
    import cugraph_pyg.tensor
Redundant submodule imports
import cugraph_pyg on line 17 already executes cugraph_pyg/__init__.py, which unconditionally imports cugraph_pyg.data, cugraph_pyg.loader, cugraph_pyg.sampler, and cugraph_pyg.tensor. The explicit imports on lines 30–31 are therefore no-ops (Python returns the already-cached modules), and the comment is misleading about what they accomplish.
If the intent is to surface per-submodule import errors, consider wrapping each in its own try/except for clarity:
Current:

    # Ensure core submodules load (touches pylibwholegraph, torch-geometric, etc.)
    import cugraph_pyg.data
    import cugraph_pyg.tensor

Suggested:

    # Validate that core submodules can be explicitly imported.
    try:
        import cugraph_pyg.data
        import cugraph_pyg.tensor
    except ImportError as e:
        raise ImportError(f"cugraph-pyg submodule could not be imported: {e}") from e
Otherwise, remove these lines and the misleading comment entirely.
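The caching behavior the comment relies on can be shown with a stdlib package, so the demo runs without `cugraph_pyg` installed: a second `import` of an already-imported submodule is served from `sys.modules` and executes no module code.

```python
# Demonstrates Python's module caching: repeated imports are no-ops that
# return the object already stored in sys.modules.
import sys
import xml.etree.ElementTree  # first import: executes module code, caches it

cached = sys.modules["xml.etree.ElementTree"]

import xml.etree.ElementTree  # second import: returns the cached module

assert sys.modules["xml.etree.ElementTree"] is cached
```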
jacobtomlinson left a comment
Thanks for opening this, this is exactly the kind of thing we need.
The most common failure points we see in testing are import failures, memory allocation and kernel launch errors.
Is there some very small thing that could be added which allocates some memory and launches a kernel?
We have a bit of a problem there. Because of DLFW and other packaging constraints, we can't have PyTorch as a hard dependency, but in order to actually launch a kernel, we need PyTorch installed.

That feels like a good thing to check with RAPIDS doctor. You could add a check for `torch`.

Are you positive this won't break DLFW? They say the packages must be importable without PyTorch installed.
I'm not suggesting changing how you handle importing, just proposing adding a check for RAPIDS doctor that warns users that `torch` is missing. You could add another check along the lines of:

    def cugraph_pyg_torch_check(**kwargs):
        """
        A quick check to ensure pytorch is importable.
        """
        try:
            import torch
            # Maybe make a small GPU tensor and do an operation here just to exercise torch a bit?
        except ImportError as e:
            raise ImportError(
                "cugraph-pyg depends on pytorch but torch failed to import. "
                "Tip: install with ..."
            ) from e

Having a check that does this shouldn't affect DLFW, but would be useful for users who have installed cugraph-pyg and forgot pytorch.
    try:
        import torch

        assert torch.cuda.is_available()

    except ImportError:
AssertionError not caught when CUDA is unavailable
When torch is installed but CUDA is not available, import torch succeeds, then assert torch.cuda.is_available() raises AssertionError. The except ImportError clause does not catch AssertionError, so the exception propagates unhandled to the caller — causing the check to hard-fail instead of emitting the intended warning.
Current:

    try:
        import torch
        assert torch.cuda.is_available()
    except ImportError:

Suggested:

    try:
        import torch
        if not torch.cuda.is_available():
            raise ImportError("torch.cuda is not available")
    except (ImportError, AssertionError):
Or more directly: replace the assert with an explicit if/raise ImportError so the single except ImportError handles both the missing-package and no-CUDA cases.
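The underlying issue, and the fix, can be verified in isolation: `AssertionError` is not a subclass of `ImportError`, so the original `except` clause never sees it, while an explicit `raise ImportError` keeps both failure modes on one handler. The `check` function below is a hypothetical stand-in for the smoke check:

```python
# AssertionError is not an ImportError, so "except ImportError" won't catch it.
assert not issubclass(AssertionError, ImportError)

def check(cuda_available):
    """Hypothetical stand-in for the smoke check, using the explicit-raise fix."""
    try:
        if not cuda_available:
            # explicit raise: one except clause now covers both
            # "torch missing" and "no CUDA" failure modes
            raise ImportError("torch.cuda is not available")
    except ImportError:
        return "warned"  # the check emits a warning instead of hard-failing
    return "ok"
```

With the fix, a missing-CUDA environment degrades to a warning rather than an unhandled exception.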
    addr = os.environ.get("MASTER_ADDR", "")
    port = os.environ.get("MASTER_PORT", "")
    local_rank = os.environ.get("LOCAL_RANK", "")
    world_size = os.environ.get("WORLD_SIZE", "")
    local_world_size = os.environ.get("LOCAL_WORLD_SIZE", "")
    rank = os.environ.get("RANK", "")
Env-var restoration sets empty string instead of deleting
os.environ.get("KEY", "") returns "" for both "not set" and "set to empty string", so the saved value cannot distinguish between those two states. After the finally block, variables that were originally absent end up set to "" in the environment — which is semantically different from not existing, and could still interfere with downstream distributed code.
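The ambiguity is easy to demonstrate with a scratch variable (`DEMO_VAR` here is hypothetical, used only for the demo):

```python
import os

# With "" as the default, "unset" and "set to empty string" look identical;
# with None as the default, they are distinguishable.
os.environ.pop("DEMO_VAR", None)                # ensure it is unset
assert os.environ.get("DEMO_VAR", "") == ""     # ambiguous: looks "set to empty"
assert os.environ.get("DEMO_VAR") is None       # unambiguous: it is absent

os.environ["DEMO_VAR"] = ""
assert os.environ.get("DEMO_VAR") == ""         # genuinely set to ""
os.environ.pop("DEMO_VAR", None)                # clean up
```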
Use None as the sentinel:
    addr = os.environ.get("MASTER_ADDR")
    port = os.environ.get("MASTER_PORT")
    local_rank = os.environ.get("LOCAL_RANK")
    world_size = os.environ.get("WORLD_SIZE")
    local_world_size = os.environ.get("LOCAL_WORLD_SIZE")
    rank = os.environ.get("RANK")

Then restore with:
    for key, val in [
        ("MASTER_ADDR", addr), ("MASTER_PORT", port),
        ("LOCAL_RANK", local_rank), ("WORLD_SIZE", world_size),
        ("LOCAL_WORLD_SIZE", local_world_size), ("RANK", rank),
    ]:
        if val is None:
            os.environ.pop(key, None)
        else:
            os.environ[key] = val

    try:
        os.environ["MASTER_ADDR"] = "localhost"
        os.environ["MASTER_PORT"] = "29505"
        os.environ["LOCAL_RANK"] = "0"
        os.environ["WORLD_SIZE"] = "1"
        os.environ["LOCAL_WORLD_SIZE"] = "1"
        os.environ["RANK"] = "0"
        torch.distributed.init_process_group("nccl")

        graph_store = GraphStore()
        graph_store.put_edge_index(
            torch.tensor([[0, 1], [1, 2]]),
            ("person", "knows", "person"),
            "coo",
            False,
            (3, 3),
        )
        edge_index = graph_store.get_edge_index(
            ("person", "knows", "person"), "coo"
        )
        assert edge_index.shape == torch.Size([2, 2])
    finally:
        os.environ["MASTER_ADDR"] = addr
        os.environ["MASTER_PORT"] = port
        os.environ["LOCAL_RANK"] = local_rank
        os.environ["WORLD_SIZE"] = world_size
        os.environ["LOCAL_WORLD_SIZE"] = local_world_size
        os.environ["RANK"] = rank
        torch.distributed.destroy_process_group()
destroy_process_group called even if init_process_group failed
If torch.distributed.init_process_group("nccl") raises (e.g., NCCL not found, GPU not reachable), the finally block unconditionally calls torch.distributed.destroy_process_group(). PyTorch will raise RuntimeError: Default process group has not been initialized from inside the finally, which suppresses the original exception and makes diagnosis much harder.
Guard the destroy call:
initialized = False
try:
torch.distributed.init_process_group("nccl")
initialized = True
# ... rest of the check ...
finally:
# restore env vars ...
if initialized:
torch.distributed.destroy_process_group()