
Add RAPIDS Doctor Check for cuGraph-PyG and pylibwholegraph #418

Open
alexbarghi-nv wants to merge 15 commits into rapidsai:main from alexbarghi-nv:feature/rapids-doctor-cugraph-pyg-pylibwholegraph

Conversation

@alexbarghi-nv (Member)

Adds a RAPIDS Doctor check for cuGraph-PyG and pylibwholegraph.

This commit introduces new entry points for smoke checks in both the `pylibwholegraph` and `cugraph-pyg` projects. The entry points are defined in their respective `pyproject.toml` files, allowing for easier integration and testing of smoke checks.

Changes:
- Added `pylibwholegraph_smoke_check` entry point in `python/pylibwholegraph/pyproject.toml`
- Added `cugraph_pyg_smoke_check` entry point in `python/cugraph-pyg/pyproject.toml`

A follow-up commit updates these entry points in the `pyproject.toml` files of both `pylibwholegraph` and `cugraph-pyg` to enhance testing capabilities.

Changes:
- Updated `pylibwholegraph_smoke_check` entry point in `python/pylibwholegraph/pyproject.toml`
- Updated `cugraph_pyg_smoke_check` entry point in `python/cugraph-pyg/pyproject.toml`
@alexbarghi-nv alexbarghi-nv requested review from a team as code owners March 3, 2026 23:34
@alexbarghi-nv alexbarghi-nv requested a review from msarahan March 3, 2026 23:34
@alexbarghi-nv alexbarghi-nv self-assigned this Mar 3, 2026
@alexbarghi-nv alexbarghi-nv added feature request New feature or request non-breaking Introduces a non-breaking change labels Mar 3, 2026

greptile-apps bot commented Mar 3, 2026

Greptile Summary

Adds rapids doctor smoke check entrypoints for cugraph-pyg and pylibwholegraph. The cuGraph-PyG check imports the package and its core submodules, verifies __version__, and — when PyTorch with CUDA is available — runs a short distributed GraphStore round-trip with proper env-var save/restore and NCCL process-group lifecycle management. The pylibwholegraph check is lighter: import + version check + optional PyTorch/CUDA probe. Entry points are wired up in each package's pyproject.toml.

  • Most previous review feedback has been addressed: env-var restoration now uses None sentinel, initialized flag guards destroy_process_group, and AssertionError is caught in the pylibwholegraph CUDA probe.
  • The bare assert edge_index.shape == torch.Size([2, 2]) in the cuGraph-PyG check (line 82) is still present and silently becomes a no-op under python -O.
  • assert torch.cuda.is_available() in the pylibwholegraph check (line 32) has the same -O stripping issue: the CUDA warning is silently skipped.

Confidence Score: 4/5

  • Safe to merge; remaining issues are minor style concerns around bare asserts under -O.
  • The core logic is sound — env vars are restored, process group is only destroyed when successfully initialized, and error paths are handled. The only remaining open issue is bare assert statements that are stripped under Python's -O flag, which is unlikely to affect typical RAPIDS Doctor usage but is still a best-practice gap.
  • python/pylibwholegraph/pylibwholegraph/_doctor_check.py (bare assert at line 32), python/cugraph-pyg/cugraph_pyg/_doctor_check.py (bare assert at line 82)
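
The -O concern is easy to demonstrate: a bare assert compiles to nothing when __debug__ is False, while an explicit raise survives. A minimal sketch of the fix (the check_shape helper is illustrative, not code from the PR):

```python
def check_shape(shape, expected):
    # An explicit raise survives `python -O`; a bare
    # `assert shape == expected` would be compiled out entirely.
    if tuple(shape) != tuple(expected):
        raise RuntimeError(
            f"smoke check failed: got shape {tuple(shape)}, expected {tuple(expected)}"
        )

check_shape((2, 2), (2, 2))  # passes silently
```

Using RuntimeError (rather than AssertionError) also keeps the failure visible to callers that deliberately swallow assertion errors.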

Important Files Changed

Filename Overview
python/cugraph-pyg/cugraph_pyg/_doctor_check.py Adds RAPIDS Doctor smoke check for cuGraph-PyG with proper env var save/restore, NCCL process group lifecycle guarding via initialized flag, and optional distributed GraphStore smoke test. One bare assert remains at line 82 (previously flagged).
python/pylibwholegraph/pylibwholegraph/_doctor_check.py Adds RAPIDS Doctor smoke check for pylibwholegraph. The assert torch.cuda.is_available() pattern is silently stripped under -O, causing the CUDA warning to be skipped without notice.
python/cugraph-pyg/pyproject.toml Registers cugraph_pyg_smoke_check as a rapids_doctor_check entry point. No issues.
python/pylibwholegraph/pyproject.toml Registers pylibwholegraph_smoke_check as a rapids_doctor_check entry point. No issues.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[rapids doctor invokes entrypoint] --> B{which check?}
    B --> C[cugraph_pyg_smoke_check]
    B --> D[pylibwholegraph_smoke_check]

    C --> C1[import cugraph_pyg + submodules]
    C1 --> C2{ImportError?}
    C2 -- yes --> C_FAIL[raise ImportError]
    C2 -- no --> C3[check __version__]
    C3 --> C4[import_optional torch]
    C4 --> C5{MissingModule or no CUDA?}
    C5 -- yes --> C_WARN[warnings.warn PyTorch needed]
    C5 -- no --> C6[save env vars]
    C6 --> C7[set distributed env vars]
    C7 --> C8[init_process_group nccl]
    C8 --> C9[GraphStore put/get edge_index]
    C9 --> C10[assert shape]
    C10 --> C11[finally: restore env vars\n+ destroy_process_group if initialized]

    D --> D1[import pylibwholegraph]
    D1 --> D2{ImportError?}
    D2 -- yes --> D_FAIL[raise ImportError]
    D2 -- no --> D3[check __version__]
    D3 --> D4[import torch\nassert cuda.is_available]
    D4 --> D5{ImportError or AssertionError?}
    D5 -- yes --> D_WARN[warnings.warn PyTorch/CUDA needed]
    D5 -- no --> D_OK[check passes]

Last reviewed commit: 02c66e4

Comment on lines +29 to +31
# Ensure core submodules load (touches pylibwholegraph, torch-geometric, etc.)
import cugraph_pyg.data
import cugraph_pyg.tensor

Redundant submodule imports

import cugraph_pyg on line 17 already executes cugraph_pyg/__init__.py, which unconditionally imports cugraph_pyg.data, cugraph_pyg.loader, cugraph_pyg.sampler, and cugraph_pyg.tensor. The explicit imports on lines 30–31 are therefore no-ops (Python returns the already-cached modules), and the comment is misleading about what they accomplish.

If the intent is to surface per-submodule import errors, consider wrapping them in a try/except for clarity:

Suggested change:

    - # Ensure core submodules load (touches pylibwholegraph, torch-geometric, etc.)
    - import cugraph_pyg.data
    - import cugraph_pyg.tensor
    + # Validate that core submodules can be explicitly imported.
    + try:
    +     import cugraph_pyg.data
    +     import cugraph_pyg.tensor
    + except ImportError as e:
    +     raise ImportError(f"cugraph-pyg submodule could not be imported: {e}") from e

Otherwise, remove these lines and the misleading comment entirely.
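
The caching behavior this comment relies on can be seen with any stdlib package (using `email` here as a stand-in for `cugraph_pyg`):

```python
import sys

# Importing a submodule executes it once and caches it in sys.modules;
# a later `import` of the same name just rebinds the cached object,
# which is why the explicit re-imports in the check are no-ops.
import email.message

first = sys.modules["email.message"]
import email.message  # no re-execution: fetched from sys.modules

assert sys.modules["email.message"] is first
```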

@linhu-nv (Contributor) left a comment

LGTM. thx

@jacobtomlinson (Member) left a comment

Thanks for opening this, this is exactly the kind of thing we need.

The most common failure points we see in testing are import failures, memory allocation and kernel launch errors.

Is there some very small thing that could be added which allocates some memory and launches a kernel?

@alexbarghi-nv (Member, Author)

> Thanks for opening this, this is exactly the kind of thing we need.
>
> The most common failure points we see in testing are import failures, memory allocation and kernel launch errors.
>
> Is there some very small thing that could be added which allocates some memory and launches a kernel?

We have a bit of a problem there. Because of DLFW and other packaging constraints, we can't have PyTorch as a hard dependency, but in order to actually launch a kernel, we need PyTorch installed.

@jacobtomlinson (Member)

That feels like a good thing to check with rapids doctor though no? If pytorch isn't a hard dependency that sounds like a big footgun for users. It must be possible to accidentally construct an environment that is missing pytorch?

You could add a check for import torch and raise a helpful error if it's missing "You have cugraph_pyg installed but you are missing pytorch, make sure you install pytorch with [link to steps here]".

@alexbarghi-nv (Member, Author)

> That feels like a good thing to check with rapids doctor though no? If pytorch isn't a hard dependency that sounds like a big footgun for users. It must be possible to accidentally construct an environment that is missing pytorch?
>
> You could add a check for import torch and raise a helpful error if it's missing "You have cugraph_pyg installed but you are missing pytorch, make sure you install pytorch with [link to steps here]".

Are you positive this won't break DLFW? They say the packages must be importable without PyTorch installed.

@jacobtomlinson (Member)

I'm not suggesting changing how you handle importing, just proposing adding a check for RAPIDS doctor that warns users that pytorch is missing.

You could add another check along the lines of

def cugraph_pyg_torch_check(**kwargs):
    """
    A quick check to ensure pytorch is importable.
    """
    try:
        import torch
        # Maybe make a small GPU tensor and do an operation here just to exercise torch a bit?
    except ImportError as e:
        raise ImportError(
            "cugraph-pyg depends on pytorch but torch failed to import. "
            "Tip: install with ..."
        ) from e

Having a check that does this shouldn't affect DLFW, but would be useful for users who have installed cugraph-pyg and forgot pytorch.

Comment on lines +29 to +34
    try:
        import torch

        assert torch.cuda.is_available()

    except ImportError:

AssertionError not caught when CUDA is unavailable

When torch is installed but CUDA is not available, import torch succeeds, then assert torch.cuda.is_available() raises AssertionError. The except ImportError clause does not catch AssertionError, so the exception propagates unhandled to the caller — causing the check to hard-fail instead of emitting the intended warning.

Suggested change:

    - try:
    -     import torch
    -     assert torch.cuda.is_available()
    - except ImportError:
    + try:
    +     import torch
    +     if not torch.cuda.is_available():
    +         raise ImportError("torch.cuda is not available")
    + except (ImportError, AssertionError):

Or more directly: replace the assert with an explicit if/raise ImportError so the single except ImportError handles both the missing-package and no-CUDA cases.
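
A sketch of that more direct variant. The probe_torch_cuda name and its module_name parameter are illustrative, not code from the PR; the parameter exists only so the missing-package path can be exercised without torch installed:

```python
import importlib
import warnings

def probe_torch_cuda(module_name="torch"):
    # Raising ImportError for the no-CUDA case lets a single
    # `except ImportError` cover both failure modes.
    try:
        torch = importlib.import_module(module_name)
        if not torch.cuda.is_available():
            raise ImportError(f"{module_name} is installed but CUDA is not available")
    except ImportError as e:
        warnings.warn(f"PyTorch with CUDA is needed for the full check: {e}")
        return False
    return True
```

Because ModuleNotFoundError subclasses ImportError, the single handler also covers the package-missing case without any extra clause.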

Comment on lines +49 to +54
    addr = os.environ.get("MASTER_ADDR", "")
    port = os.environ.get("MASTER_PORT", "")
    local_rank = os.environ.get("LOCAL_RANK", "")
    world_size = os.environ.get("WORLD_SIZE", "")
    local_world_size = os.environ.get("LOCAL_WORLD_SIZE", "")
    rank = os.environ.get("RANK", "")

Env-var restoration sets empty string instead of deleting

os.environ.get("KEY", "") returns "" for both "not set" and "set to empty string", so the saved value cannot distinguish between those two states. After the finally block, variables that were originally absent end up set to "" in the environment — which is semantically different from not existing, and could still interfere with downstream distributed code.

Use None as the sentinel:

        addr = os.environ.get("MASTER_ADDR")
        port = os.environ.get("MASTER_PORT")
        local_rank = os.environ.get("LOCAL_RANK")
        world_size = os.environ.get("WORLD_SIZE")
        local_world_size = os.environ.get("LOCAL_WORLD_SIZE")
        rank = os.environ.get("RANK")

Then restore with:

        for key, val in [
            ("MASTER_ADDR", addr), ("MASTER_PORT", port),
            ("LOCAL_RANK", local_rank), ("WORLD_SIZE", world_size),
            ("LOCAL_WORLD_SIZE", local_world_size), ("RANK", rank),
        ]:
            if val is None:
                os.environ.pop(key, None)
            else:
                os.environ[key] = val

Comment on lines +56 to +84
    try:
        os.environ["MASTER_ADDR"] = "localhost"
        os.environ["MASTER_PORT"] = "29505"
        os.environ["LOCAL_RANK"] = "0"
        os.environ["WORLD_SIZE"] = "1"
        os.environ["LOCAL_WORLD_SIZE"] = "1"
        os.environ["RANK"] = "0"
        torch.distributed.init_process_group("nccl")

        graph_store = GraphStore()
        graph_store.put_edge_index(
            torch.tensor([[0, 1], [1, 2]]),
            ("person", "knows", "person"),
            "coo",
            False,
            (3, 3),
        )
        edge_index = graph_store.get_edge_index(
            ("person", "knows", "person"), "coo"
        )
        assert edge_index.shape == torch.Size([2, 2])
    finally:
        os.environ["MASTER_ADDR"] = addr
        os.environ["MASTER_PORT"] = port
        os.environ["LOCAL_RANK"] = local_rank
        os.environ["WORLD_SIZE"] = world_size
        os.environ["LOCAL_WORLD_SIZE"] = local_world_size
        os.environ["RANK"] = rank
        torch.distributed.destroy_process_group()

destroy_process_group called even if init_process_group failed

If torch.distributed.init_process_group("nccl") raises (e.g., NCCL not found, GPU not reachable), the finally block unconditionally calls torch.distributed.destroy_process_group(). PyTorch will raise RuntimeError: Default process group has not been initialized from inside the finally, which suppresses the original exception and makes diagnosis much harder.

Guard the destroy call:

        initialized = False
        try:
            torch.distributed.init_process_group("nccl")
            initialized = True
            # ... rest of the check ...
        finally:
            # restore env vars ...
            if initialized:
                torch.distributed.destroy_process_group()
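
Both fixes from these comments (None-sentinel restoration, and cleanup that only runs when setup succeeded) can be folded into one context manager. The scoped_env helper below is a hypothetical sketch, not code from the PR:

```python
import os
from contextlib import contextmanager

@contextmanager
def scoped_env(**overrides):
    # Save with a None sentinel so "unset" and "set to empty" stay distinct.
    saved = {key: os.environ.get(key) for key in overrides}
    try:
        os.environ.update(overrides)
        yield
    finally:
        for key, old in saved.items():
            if old is None:
                os.environ.pop(key, None)  # was unset: delete, don't set to ""
            else:
                os.environ[key] = old

# In the check, the distributed block would then read roughly:
# with scoped_env(MASTER_ADDR="localhost", MASTER_PORT="29505", RANK="0"):
#     torch.distributed.init_process_group("nccl")
#     ...  # destroy_process_group still needs its own `initialized` guard
```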
