Clean up examples and improve cuGraph docs #10489
Merged
Commits (25):

- `fba52f9` Remove examples/multi_gpu since all deprecated, point to CuGraph (puririshi98)
- `f3e1dbf` torch geometric distributed is deprecated, we really want to avoid pe… (puririshi98)
- `8d566a8` [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
- `3676ea6` Update CHANGELOG.md (puririshi98)
- `4fe5c60` [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
- `21af4de` Update README.md (puririshi98)
- `4abec94` Merge branch 'master' into cleanup-cugraph-examples (puririshi98)
- `2b87e54` Merge branch 'master' into cleanup-cugraph-examples (puririshi98)
- `7c41c0b` Merge branch 'master' into cleanup-cugraph-examples (puririshi98)
- `3016cf2` Merge branch 'master' into cleanup-cugraph-examples (puririshi98)
- `1ee4cfe` Merge branch 'master' into cleanup-cugraph-examples (puririshi98)
- `1d3b06e` Dont delte kuzu (akihironitta)
- `46ce709` Dont delte graphlearn (akihironitta)
- `a347b39` Keep distributed/pyg/README.md, guide useres to cuGraph (akihironitta)
- `60cf58a` Keep examples/multi_gpu/ (akihironitta)
- `90f8304` Keep distributed/README.md (akihironitta)
- `cc883da` Update examples/README.md (akihironitta)
- `4225477` Update examples/README.md (akihironitta)
- `394de16` Update examples/README.md (akihironitta)
- `19c2905` Add a link to cugraph examples in installation page (akihironitta)
- `d4c5389` Update CHANGELOG.md (akihironitta)
- `84ed804` Fix docs build workflow (akihironitta)
- `7817688` Merge branch 'cleanup-cugraph-examples' of https://github.com/pyg-tea… (akihironitta)
- `5414259` Merge branch 'master' into cleanup-cugraph-examples (akihironitta)
- `441693b` Fix again (akihironitta)
# Distributed Training with PyG

**[`torch_geometric.distributed`](https://github.com/pyg-team/pytorch_geometric/tree/master/torch_geometric/distributed)** (deprecated) implements a scalable solution for distributed GNN training, built exclusively upon PyTorch and PyG.

The current implementation can be deployed on a cluster of arbitrary size using multiple CPUs.
A PyG-native GPU implementation is under development and will be released soon.

The solution is designed to effortlessly distribute the training of large-scale graph neural networks across multiple nodes, thanks to the integration of [Distributed Data Parallelism (DDP)](https://pytorch.org/docs/stable/notes/ddp.html) for model training and [Remote Procedure Call (RPC)](https://pytorch.org/docs/stable/rpc.html) for efficient sampling and fetching of non-local features.
The design includes a number of custom classes: (1) `DistNeighborSampler`, which implements CPU sampling algorithms and feature extraction from local and remote data while keeping a consistent data structure at the output, (2) an integrated `DistLoader`, which ensures safe opening and closing of RPC connections between the samplers, and (3) a METIS-based `Partitioner`, among others.
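To make the local-versus-remote sampling path concrete, here is a toy, stdlib-only sketch. It is *not* the `torch_geometric.distributed` API; the partition dictionaries and the `remote_neighbors` function merely stand in for the partitioned graph stores and the RPC calls described above:

```python
# Toy illustration of the DistNeighborSampler idea: serve neighbors from the
# local partition when possible, otherwise ask the partition that owns the
# node (a plain function call here stands in for a real RPC).

# Two partitions of a small graph: adjacency lists keyed by global node id.
PARTITIONS = {
    0: {0: [1, 4], 1: [0, 2]},       # partition 0 owns nodes 0 and 1
    1: {2: [1, 3], 3: [2], 4: [0]},  # partition 1 owns nodes 2, 3, 4
}
NODE_MAP = {0: 0, 1: 0, 2: 1, 3: 1, 4: 1}  # global node id -> owning partition

def remote_neighbors(part_id, node):
    """Stand-in for an RPC call to the sampler running on another machine."""
    return PARTITIONS[part_id][node]

def sample_neighbors(local_part, node):
    owner = NODE_MAP[node]
    if owner == local_part:
        return PARTITIONS[local_part][node]   # local lookup, no communication
    return remote_neighbors(owner, node)      # would cross the network via RPC

# A worker holding partition 0 samples a local and a "remote" node:
print(sample_neighbors(0, 1))  # [0, 2] -- served locally
print(sample_neighbors(0, 3))  # [2]    -- fetched from partition 1
```

The real sampler batches such lookups and returns them in the same output format regardless of where the data lived, which is what lets the training loop stay oblivious to the partitioning.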
## Example for Node-level Distributed Training on OGB Datasets

The example provided in [`node_ogb_cpu.py`](./node_ogb_cpu.py) performs distributed training with multiple CPU nodes using [OGB](https://ogb.stanford.edu/) datasets and a [`GraphSAGE`](https://pytorch-geometric.readthedocs.io/en/latest/generated/torch_geometric.nn.models.GraphSAGE.html) model.
The example can run on both homogeneous (`ogbn-products`) and heterogeneous data (`ogbn-mag`).
With minor modifications, it can be extended to train on `ogbn-papers100M` or any other dataset.

To run the example, follow the steps below.
### Requirements

- [`torch-geometric>=2.5.0`](https://github.com/pyg-team/pytorch_geometric) and [`pyg-lib>=0.4.0`](https://github.com/pyg-team/pyg-lib)
- Password-less SSH needs to be set up on all the nodes that you are using (see the [Linux SSH manual](https://linuxize.com/post/how-to-setup-passwordless-ssh-login)).
- All nodes need to have a consistent environment installed; in particular, the `torch` and `pyg-lib` versions must be the same.
  You might want to consider using Docker containers.
- *[Optional]* In some cases, the Linux firewall might be blocking TCP connections.
  Ensure that the firewall settings allow all nodes to communicate (see the [Linux firewall manual](https://ubuntu.com/server/docs/security-firewall)).
  For this example, TCP ports `11111`, `11112` and `11113` should be open (*i.e.* `sudo ufw allow 11111`).
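Before launching a multi-hour training run, it can save time to confirm from a worker that the master's ports are actually reachable. This small check is an optional suggestion, not part of the example scripts; the host below is a placeholder:

```python
# Optional pre-flight check: can we open a TCP connection to each port the
# example expects on the master node?
import socket

def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Replace "127.0.0.1" with the master node's IP before running on a worker:
for p in (11111, 11112, 11113):
    print(p, port_open("127.0.0.1", p))
```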
### Step 1: Prepare and Partition the Data

In distributed training, each node in the cluster holds a partition of the graph.
Before training starts, we partition the dataset into multiple partitions, each of which corresponds to a specific training node.

Here, we use `ogbn-products` and partition it into two partitions (by default) via the [`partition_graph.py`](./partition_graph.py) script:

```bash
python partition_graph.py --dataset=ogbn-products --root_dir=../../../data --num_partitions=2
```
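The quantity METIS optimizes is the *edge cut*: the number of edges whose endpoints land in different partitions, since each such edge later costs an RPC round trip during sampling. The toy split below only illustrates what a cut edge is; it is deliberately naive and is not what `partition_graph.py` does:

```python
# Illustration of the "edge cut" that METIS minimizes. A naive round-robin
# assignment (unlike METIS) scatters neighbors across partitions, producing
# many cut edges and hence many remote fetches at training time.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0), (1, 3)]
num_parts = 2

# Naive split: node i goes to partition i % num_parts.
part = {n: n % num_parts for n in range(5)}

# Count edges whose endpoints live in different partitions.
cut = sum(1 for u, v in edges if part[u] != part[v])
print(f"cut edges: {cut} of {len(edges)}")  # cut edges: 4 of 6
```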
**Caution:** Partitioning with METIS is non-deterministic!
All nodes should be able to access the same partition data.
Therefore, generate the partitions on one node and copy the data to all members of the cluster, or place the folder in a shared location.

The generated partitions will have the folder structure below:

```
data
├─ dataset
│  ├─ ogbn-mag
│  └─ ogbn-products
└─ partitions
   ├─ ogbn-mag
   └─ ogbn-products
      ├─ ogbn-products-partitions
      │  ├─ part_0
      │  ├─ part_1
      │  ├─ META.json
      │  ├─ node_map.pt
      │  └─ edge_map.pt
      ├─ ogbn-products-label
      │  └─ label.pt
      ├─ ogbn-products-test-partitions
      │  ├─ partition0.pt
      │  └─ partition1.pt
      └─ ogbn-products-train-partitions
         ├─ partition0.pt
         └─ partition1.pt
```
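A quick way to confirm that the copied partition folder is intact on every node is to read its `META.json`. The sketch below fabricates a folder in a temp directory purely for illustration, and the field names shown (`num_parts`, `is_hetero`) are assumptions; check the file actually written by `partition_graph.py` for the exact schema:

```python
# Sketch: verify a partition folder by reading META.json. The directory and
# the JSON fields here are illustrative stand-ins, not the real output.
import json
import tempfile
from pathlib import Path

# Fabricate a minimal partition folder so the snippet is self-contained:
root = Path(tempfile.mkdtemp()) / "ogbn-products-partitions"
root.mkdir(parents=True)
(root / "META.json").write_text(json.dumps({"num_parts": 2, "is_hetero": False}))

# On a real cluster, point `root` at the copied partition folder instead.
meta = json.loads((root / "META.json").read_text())
assert meta["num_parts"] == 2  # must match --num_partitions on every node
print(meta)
```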
### Step 2: Run the Example on Each Training Node

To run the example, you can execute the commands on each node manually or use the provided launch script.

#### Option A: Manual Execution

Change `master_addr` to the IP of `node#0`.
Make sure that the correct `node_rank` is provided, with the master node assigned rank `0`.
The `dataset_root_dir` should point to the head directory where your partition is placed, *i.e.* `../../data/partitions/ogbn-products/2-parts`:

```bash
# Node 0:
python node_ogb_cpu.py \
  --dataset=ogbn-products \
  --dataset_root_dir=<partition folder directory> \
  --num_nodes=2 \
  --node_rank=0 \
  --master_addr=<master ip>

# Node 1:
python node_ogb_cpu.py \
  --dataset=ogbn-products \
  --dataset_root_dir=<partition folder directory> \
  --num_nodes=2 \
  --node_rank=1 \
  --master_addr=<master ip>
```
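Since the two commands above differ only in `--node_rank`, a small helper can generate them and keep the shared flags in one place. This helper is hypothetical (not part of the repository), and the IP shown is an assumed example value:

```python
# Hypothetical helper: build the per-node command lines from one template so
# that only --node_rank varies between machines.
NUM_NODES = 2
MASTER_ADDR = "192.168.0.10"  # assumed example IP of node#0

def command_for(rank: int) -> str:
    """Return the node_ogb_cpu.py invocation for a given node rank."""
    return (
        "python node_ogb_cpu.py "
        "--dataset=ogbn-products "
        "--dataset_root_dir=../../data/partitions/ogbn-products/2-parts "
        f"--num_nodes={NUM_NODES} "
        f"--node_rank={rank} "
        f"--master_addr={MASTER_ADDR}"
    )

for r in range(NUM_NODES):
    print(command_for(r))
```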
In some configurations, the network interface used for multi-node communication may differ from the default one.
In this case, the interface to be used for multi-node communication needs to be specified to Gloo.

Assume that `$MASTER_ADDR` is set to the IP of `node#0`.

On `node#0`:

```bash
export TP_SOCKET_IFNAME=$(ip addr | grep "$MASTER_ADDR" | awk '{print $NF}')
export GLOO_SOCKET_IFNAME=$TP_SOCKET_IFNAME
```

On the other nodes:

```bash
export TP_SOCKET_IFNAME=$(ip route get $MASTER_ADDR | grep -oP '(?<=dev )[^ ]+')
export GLOO_SOCKET_IFNAME=$TP_SOCKET_IFNAME
```
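If `ip route` is unavailable, the same question ("which local address routes to the master?") can be answered from Python using a trick that works on most systems: "connect" a UDP socket to the master, which selects a route without sending any packets. The resulting local address can then be mapped to an interface name for `GLOO_SOCKET_IFNAME`/`TP_SOCKET_IFNAME`:

```python
# Ask the kernel which local address would be used to reach the master.
# Connecting a UDP socket selects a route but sends no packets.
import socket

def local_addr_towards(master_addr: str, port: int = 11111) -> str:
    """Return the local IP address used on the route to master_addr."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        s.connect((master_addr, port))  # route selection only, no traffic
        return s.getsockname()[0]
    finally:
        s.close()

print(local_addr_towards("127.0.0.1"))  # 127.0.0.1 (loopback route)
```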
#### Option B: Launch Script

There are two ways to run the distributed example for multiple nodes with a single script in one terminal:

1. [`launch.py`](./launch.py):
   ```bash
   python launch.py \
     --workspace {workspace}/pytorch_geometric \
     --num_nodes 2 \
     --dataset_root_dir {dataset_dir}/mag/2-parts \
     --dataset ogbn-mag \
     --batch_size 1024 \
     --learning_rate 0.0004 \
     --part_config {dataset_dir}/mag/2-parts/ogbn-mag-partitions/META.json \
     --ip_config {workspace}/pytorch_geometric/ip_config.yaml \
     'cd /home/user_xxx; source {conda_envs}/bin/activate; cd {workspace}/pytorch_geometric; {conda_envs}/bin/python
     {workspace}/pytorch_geometric/examples/pyg/node_ogb_cpu.py --dataset=ogbn-mag --logging --progress_bar --ddp_port=11111'
   ```
1. [`run_dist.sh`](./run_dist.sh): All parameter settings are contained in the `run_dist.sh` script, so you just need to run:
   ```bash
   ./run_dist.sh
   ```

> **Deprecated:** `torch_geometric.distributed` is deprecated.
> Please refer to [NVIDIA cuGraph-GNN](https://pytorch-geometric.readthedocs.io/en/latest/install/installation.html#accelerating-pyg-with-nvidia-cugraph-gnn) for scalable distributed GNN training with NVIDIA GPUs.