Add RAPIDS Doctor Check for cuGraph-PyG and pylibwholegraph #418
alexbarghi-nv wants to merge 15 commits into rapidsai:main from
Conversation
This commit introduces new entry points for smoke checks in both the `pylibwholegraph` and `cugraph-pyg` projects. The entry points are defined in their respective `pyproject.toml` files, allowing for easier integration and testing of smoke checks.

Changes:
- Added `pylibwholegraph_smoke_check` entry point in `python/pylibwholegraph/pyproject.toml`
- Added `cugraph_pyg_smoke_check` entry point in `python/cugraph-pyg/pyproject.toml`
This commit modifies the entry points for smoke checks in the `pyproject.toml` files of both `pylibwholegraph` and `cugraph-pyg` to enhance testing capabilities.

Changes:
- Updated `pylibwholegraph_smoke_check` entry point in `python/pylibwholegraph/pyproject.toml`
- Updated `cugraph_pyg_smoke_check` entry point in `python/cugraph-pyg/pyproject.toml`
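A sketch of what the entry-point registration described above could look like in `pyproject.toml`. The group name `rapids_doctor.checks` and the module paths are assumptions for illustration; the actual names come from the diff, which is not shown here:

```toml
# Hypothetical sketch only -- group name and module paths are assumptions,
# not taken from the actual PR diff.
[project.entry-points."rapids_doctor.checks"]
pylibwholegraph_smoke_check = "pylibwholegraph._doctor:pylibwholegraph_smoke_check"
```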
Greptile Summary: Adds a RAPIDS Doctor check for cuGraph-PyG and pylibwholegraph.
Confidence Score: 4/5
Important Files Changed
Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[rapids doctor invokes entrypoint] --> B{which check?}
B --> C[cugraph_pyg_smoke_check]
B --> D[pylibwholegraph_smoke_check]
C --> C1[import cugraph_pyg + submodules]
C1 --> C2{ImportError?}
C2 -- yes --> C_FAIL[raise ImportError]
C2 -- no --> C3[check __version__]
C3 --> C4[import_optional torch]
C4 --> C5{MissingModule or no CUDA?}
C5 -- yes --> C_WARN[warnings.warn PyTorch needed]
C5 -- no --> C6[save env vars]
C6 --> C7[set distributed env vars]
C7 --> C8[init_process_group nccl]
C8 --> C9[GraphStore put/get edge_index]
C9 --> C10[assert shape]
C10 --> C11[finally: restore env vars\n+ destroy_process_group if initialized]
D --> D1[import pylibwholegraph]
D1 --> D2{ImportError?}
D2 -- yes --> D_FAIL[raise ImportError]
D2 -- no --> D3[check __version__]
D3 --> D4[import torch\nassert cuda.is_available]
D4 --> D5{ImportError or AssertionError?}
D5 -- yes --> D_WARN[warnings.warn PyTorch/CUDA needed]
D5 -- no --> D_OK[check passes]
Last reviewed commit: 02c66e4
    # Ensure core submodules load (touches pylibwholegraph, torch-geometric, etc.)
    import cugraph_pyg.data
    import cugraph_pyg.tensor
Redundant submodule imports
import cugraph_pyg on line 17 already executes cugraph_pyg/__init__.py, which unconditionally imports cugraph_pyg.data, cugraph_pyg.loader, cugraph_pyg.sampler, and cugraph_pyg.tensor. The explicit imports on lines 30–31 are therefore no-ops (Python returns the already-cached modules), and the comment is misleading about what they accomplish.
If the intent is to surface per-submodule import errors, consider wrapping each in its own try/except for clarity:
Current:

    # Ensure core submodules load (touches pylibwholegraph, torch-geometric, etc.)
    import cugraph_pyg.data
    import cugraph_pyg.tensor

Suggested:

    # Validate that core submodules can be explicitly imported.
    try:
        import cugraph_pyg.data
        import cugraph_pyg.tensor
    except ImportError as e:
        raise ImportError(f"cugraph-pyg submodule could not be imported: {e}") from e
Otherwise, remove these lines and the misleading comment entirely.
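The caching behavior the comment relies on can be shown with a stdlib package, so the demo runs without `cugraph_pyg` installed: a second `import` of an already-imported submodule is served from `sys.modules` and executes no module code.

```python
# Demonstrates Python's module caching: repeated imports are no-ops that
# return the object already stored in sys.modules.
import sys
import xml.etree.ElementTree  # first import: executes module code, caches it

cached = sys.modules["xml.etree.ElementTree"]

import xml.etree.ElementTree  # second import: returns the cached module

assert sys.modules["xml.etree.ElementTree"] is cached
```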
jacobtomlinson left a comment
Thanks for opening this, this is exactly the kind of thing we need.
The most common failure points we see in testing are import failures, memory allocation and kernel launch errors.
Is there some very small thing that could be added which allocates some memory and launches a kernel?
We have a bit of a problem there. Because of DLFW and other packaging constraints, we can't have PyTorch as a hard dependency, but in order to actually launch a kernel, we need PyTorch installed.

That feels like a good thing to check with RAPIDS doctor. You could add a check for `torch`.

Are you positive this won't break DLFW? They say the packages must be importable without PyTorch installed.
I'm not suggesting changing how you handle importing, just proposing adding a check for RAPIDS doctor that warns users that `torch` is missing. You could add another check along the lines of:

    def cugraph_pyg_torch_check(**kwargs):
        """
        A quick check to ensure pytorch is importable.
        """
        try:
            import torch
            # Maybe make a small GPU tensor and do an operation here just to exercise torch a bit?
        except ImportError as e:
            raise ImportError(
                "cugraph-pyg depends on pytorch but torch failed to import. "
                "Tip: install with ..."
            ) from e

Having a check that does this shouldn't affect DLFW, but would be useful for users who have installed cugraph-pyg and forgot pytorch.
    try:
        import torch

        assert torch.cuda.is_available()

    except ImportError:
AssertionError not caught when CUDA is unavailable
When torch is installed but CUDA is not available, import torch succeeds, then assert torch.cuda.is_available() raises AssertionError. The except ImportError clause does not catch AssertionError, so the exception propagates unhandled to the caller — causing the check to hard-fail instead of emitting the intended warning.
Current:

    try:
        import torch
        assert torch.cuda.is_available()
    except ImportError:

Suggested:

    try:
        import torch
        if not torch.cuda.is_available():
            raise ImportError("torch.cuda is not available")
    except (ImportError, AssertionError):
Or more directly: replace the assert with an explicit if/raise ImportError so the single except ImportError handles both the missing-package and no-CUDA cases.
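The underlying issue, and the fix, can be verified in isolation: `AssertionError` is not a subclass of `ImportError`, so the original `except` clause never sees it, while an explicit `raise ImportError` keeps both failure modes on one handler. The `check` function below is a hypothetical stand-in for the smoke check:

```python
# AssertionError is not an ImportError, so "except ImportError" won't catch it.
assert not issubclass(AssertionError, ImportError)

def check(cuda_available):
    """Hypothetical stand-in for the smoke check, using the explicit-raise fix."""
    try:
        if not cuda_available:
            # explicit raise: one except clause now covers both
            # "torch missing" and "no CUDA" failure modes
            raise ImportError("torch.cuda is not available")
    except ImportError:
        return "warned"  # the check emits a warning instead of hard-failing
    return "ok"
```

With the fix, a missing-CUDA environment degrades to a warning rather than an unhandled exception.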
    addr = os.environ.get("MASTER_ADDR", "")
    port = os.environ.get("MASTER_PORT", "")
    local_rank = os.environ.get("LOCAL_RANK", "")
    world_size = os.environ.get("WORLD_SIZE", "")
    local_world_size = os.environ.get("LOCAL_WORLD_SIZE", "")
    rank = os.environ.get("RANK", "")
Env-var restoration sets empty string instead of deleting
os.environ.get("KEY", "") returns "" for both "not set" and "set to empty string", so the saved value cannot distinguish between those two states. After the finally block, variables that were originally absent end up set to "" in the environment — which is semantically different from not existing, and could still interfere with downstream distributed code.
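The ambiguity is easy to demonstrate with a scratch variable (`DEMO_VAR` here is hypothetical, used only for the demo):

```python
import os

# With "" as the default, "unset" and "set to empty string" look identical;
# with None as the default, they are distinguishable.
os.environ.pop("DEMO_VAR", None)                # ensure it is unset
assert os.environ.get("DEMO_VAR", "") == ""     # ambiguous: looks "set to empty"
assert os.environ.get("DEMO_VAR") is None       # unambiguous: it is absent

os.environ["DEMO_VAR"] = ""
assert os.environ.get("DEMO_VAR") == ""         # genuinely set to ""
os.environ.pop("DEMO_VAR", None)                # clean up
```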
Use None as the sentinel:
    addr = os.environ.get("MASTER_ADDR")
    port = os.environ.get("MASTER_PORT")
    local_rank = os.environ.get("LOCAL_RANK")
    world_size = os.environ.get("WORLD_SIZE")
    local_world_size = os.environ.get("LOCAL_WORLD_SIZE")
    rank = os.environ.get("RANK")

Then restore with:
    for key, val in [
        ("MASTER_ADDR", addr), ("MASTER_PORT", port),
        ("LOCAL_RANK", local_rank), ("WORLD_SIZE", world_size),
        ("LOCAL_WORLD_SIZE", local_world_size), ("RANK", rank),
    ]:
        if val is None:
            os.environ.pop(key, None)
        else:
            os.environ[key] = val

    try:
        os.environ["MASTER_ADDR"] = "localhost"
        os.environ["MASTER_PORT"] = "29505"
        os.environ["LOCAL_RANK"] = "0"
        os.environ["WORLD_SIZE"] = "1"
        os.environ["LOCAL_WORLD_SIZE"] = "1"
        os.environ["RANK"] = "0"
        torch.distributed.init_process_group("nccl")

        graph_store = GraphStore()
        graph_store.put_edge_index(
            torch.tensor([[0, 1], [1, 2]]),
            ("person", "knows", "person"),
            "coo",
            False,
            (3, 3),
        )
        edge_index = graph_store.get_edge_index(
            ("person", "knows", "person"), "coo"
        )
        assert edge_index.shape == torch.Size([2, 2])
    finally:
        os.environ["MASTER_ADDR"] = addr
        os.environ["MASTER_PORT"] = port
        os.environ["LOCAL_RANK"] = local_rank
        os.environ["WORLD_SIZE"] = world_size
        os.environ["LOCAL_WORLD_SIZE"] = local_world_size
        os.environ["RANK"] = rank
        torch.distributed.destroy_process_group()
destroy_process_group called even if init_process_group failed
If torch.distributed.init_process_group("nccl") raises (e.g., NCCL not found, GPU not reachable), the finally block unconditionally calls torch.distributed.destroy_process_group(). PyTorch will raise RuntimeError: Default process group has not been initialized from inside the finally, which suppresses the original exception and makes diagnosis much harder.
Guard the destroy call:
initialized = False
try:
torch.distributed.init_process_group("nccl")
initialized = True
# ... rest of the check ...
finally:
# restore env vars ...
if initialized:
torch.distributed.destroy_process_group()