gcn_dist_mnmg.py fails with seed_vertex_label_offsets error on multi-GPU (H100, SNMG) #412

@mmsalehid

I am trying to run the gcn_dist_mnmg.py example from the cuGraph-GNN repository on a single node equipped with 4 H100 SXM GPUs.

When I launch it with torchrun, all ranks fail with the following error:

[rank0]: RuntimeError: non-success value returned from cugraph_homogeneous_uniform_neighbor_sample: CUGRAPH_UNKNOWN_ERROR cuGraph failure at file=/home/coder/cugraph/cpp/src/sampling/sampling_post_processing_impl.cuh line=233: Invalid input arguments: if seed_vertex_label_offsets is valid, (*seed_vertex_label_offsets).size() (size of the offset array) should be num_labels + 1.

I tested the example with both ogbn-products and ogbn-arxiv, and the same error occurs in each case.
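
For readers unfamiliar with the check named in the error message, here is a minimal sketch of the size invariant it describes. It assumes the offsets array is a CSR-style prefix sum over per-label seed counts; that layout is my assumption, and the helper names `build_label_offsets` and `offsets_are_valid` and the seed counts are hypothetical. This is plain Python, not cuGraph code, and it only illustrates the invariant rather than the cause of the failure.

```python
from itertools import accumulate


def build_label_offsets(seed_counts_per_label):
    """Return CSR-style offsets (assumed layout): the seeds of label i
    occupy positions offsets[i] up to, but not including, offsets[i + 1]."""
    return [0] + list(accumulate(seed_counts_per_label))


def offsets_are_valid(offsets, num_labels):
    """Apply the size rule quoted in the error: len(offsets) == num_labels + 1."""
    return len(offsets) == num_labels + 1


# Hypothetical seed counts for 4 labels.
seed_counts = [3, 0, 2, 5]
offsets = build_label_offsets(seed_counts)

print(offsets)                                                  # [0, 3, 3, 5, 10]
print(offsets_are_valid(offsets, num_labels=len(seed_counts)))  # True
print(offsets_are_valid(offsets, num_labels=3))                 # False
```

Under that assumption, the check fails whenever the number of labels the sampler expects differs from the number of labels the offsets array was built for, as in the last line above.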
