diff --git a/CHANGELOG.md b/CHANGELOG.md
index 709cced4b680..40dcb0214efb 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -90,6 +90,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/).
 
 ### Changed
 
+- Cleaned up the examples folder with regard to multi-GPU scaling ([#10489](https://github.com/pyg-team/pytorch_geometric/pull/10489))
 - Added `edge_attr` in `CuGraphGATConv` ([#10383](https://github.com/pyg-team/pytorch_geometric/pull/10383))
 - Adapt `dgcnn_classification` example to work with `ModelNet` and `MedShapeNet` Datasets ([#9823](https://github.com/pyg-team/pytorch_geometric/pull/9823))
 - Chained exceptions explicitly instead of implicitly ([#10242](https://github.com/pyg-team/pytorch_geometric/pull/10242))
diff --git a/examples/README.md b/examples/README.md
index 2efce7068990..3b8793596da1 100644
--- a/examples/README.md
+++ b/examples/README.md
@@ -3,6 +3,8 @@
 This folder contains a plethora of examples covering different GNN use-cases.
 This readme highlights some key examples.
 
+Note: We recommend the [NVIDIA PyG Container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pyg/tags) for best results and easiest setup with NVIDIA GPUs.
+
 A great and simple example to start with is [`gcn.py`](./gcn.py), showing a user how to train a [`GCN`](https://pytorch-geometric.readthedocs.io/en/latest/generated/torch_geometric.nn.models.GCN.html) model for node-level prediction on small-scale homogeneous data.
 
 For a simple GNN based link prediction example, see [`link_pred.py`](./link_pred.py).
@@ -11,8 +13,6 @@ For an improved GNN based link prediction approach using Attract-Repel embedding
 
 To see an example for doing link prediction with an advanced Graph Transformer called [`LPFormer`](https://arxiv.org/abs/2310.11009), see \[`lpformer.py`\].
 
-For examples on [Open Graph Benchmark](https://ogb.stanford.edu/) datasets, see the `ogbn_*.py` examples:
-
 - [`ogbn_train.py`](./ogbn_train.py) is an example for training a GNN on the large-scale `ogbn-papers100m` dataset, containing approximately ~1.6B edges or the medium scale `ogbn-products` dataset, ~62M edges.
   - Uses SGFormer (a kind of GraphTransformer) by default.
   - [SGFormer Paper](https://arxiv.org/pdf/2306.10759)
@@ -26,10 +26,20 @@ For an example on [Relational Deep Learning](https://arxiv.org/abs/2312.04615) w
 
 For examples on using `torch.compile`, see the examples under [`examples/compile`](./compile).
 
-For examples on scaling PyG up via multi-GPUs, see the examples under [`examples/multi_gpu`](./multi_gpu).
-
 For examples on working with heterogeneous data, see the examples under [`examples/hetero`](./hetero).
 
 For examples on co-training LLMs with GNNs, see the examples under [`examples/llm`](./llm).
 
 - [Kumo.ai x NVIDIA GNN+LLM Webinar](https://www.youtube.com/watch?v=uRIA8e7Y_vs)
+
+We recommend consulting the [PyTorch documentation](https://docs.pytorch.org/tutorials/beginner/dist_overview.html) for examples of setting up model-parallel GNNs.
+
+### Scale to Trillions of Edges with cuGraph
+
+[cuGraph](https://github.com/rapidsai/cugraph) is a collection of packages focused on GPU-accelerated graph analytics including support for property graphs and scaling up to thousands of GPUs. cuGraph supports the creation and manipulation of graphs followed by the execution of scalable fast graph algorithms. It is part of the [RAPIDS](https://rapids.ai) accelerated data science framework.
+
+[cuGraph GNN](https://github.com/rapidsai/cugraph-gnn) is a collection of GPU-accelerated plugins that support PyTorch and PyG natively through the _cuGraph-PyG_ and _WholeGraph_ subprojects. cuGraph GNN is built on top of cuGraph, leveraging its low-level [pylibcugraph](https://github.com/rapidsai/cugraph/python/pylibcugraph) API and C++ primitives for sampling and other GNN operations ([libcugraph](https://github.com/rapidsai/cugraph/python/libcugraph)). It also includes the `libwholegraph` and `pylibwholegraph` libraries for high-performance distributed edgelist and embedding storage. Users have the option of working with these lower-level libraries directly, or through the higher-level API in cuGraph-PyG that directly implements the `GraphStore`, `FeatureStore`, `NodeLoader`, and `LinkLoader` interfaces.
+
+Complete documentation on RAPIDS graph packages, including `cugraph`, `cugraph-pyg`, `pylibwholegraph`, and `pylibcugraph`, is available on the [RAPIDS docs pages](https://docs.rapids.ai/api/cugraph/nightly/graph_support).
+
+See [`rapidsai/cugraph-gnn/tree/branch-25.12/python/cugraph-pyg/cugraph_pyg/examples` on GitHub](https://github.com/rapidsai/cugraph-gnn/tree/branch-25.12/python/cugraph-pyg/cugraph_pyg/examples) for fully scalable PyG example workflows.
diff --git a/examples/distributed/README.md b/examples/distributed/README.md
deleted file mode 100644
index 3d7c4e8948e0..000000000000
--- a/examples/distributed/README.md
+++ /dev/null
@@ -1,8 +0,0 @@
-# Examples for Distributed Graph Learning
-
-This directory contains examples for distributed graph learning.
-The examples are organized into two subdirectories:
-
-1. [`pyg`](./pyg): Distributed training via PyG's own `torch_geometric.distributed` package (deprecated).
-1. [`graphlearn_for_pytorch`](./graphlearn_for_pytorch): Distributed training via the external [GraphLearn-for-PyTorch (GLT)](https://github.com/alibaba/graphlearn-for-pytorch) package.
-1. [`kuzu`](./kuzu): Remote backend via the [Kùzu](https://kuzudb.com/) graph database.
diff --git a/examples/distributed/graphlearn_for_pytorch/README.md b/examples/distributed/graphlearn_for_pytorch/README.md
deleted file mode 100644
index 85dd2d6d7be4..000000000000
--- a/examples/distributed/graphlearn_for_pytorch/README.md
+++ /dev/null
@@ -1,99 +0,0 @@
-# Using GraphLearn-for-PyTorch (GLT) for Distributed Training with PyG
-
-**[GraphLearn-for-PyTorch (GLT)](https://github.com/alibaba/graphlearn-for-pytorch)** is a graph learning library for PyTorch that makes distributed GNN training easy and efficient.
-GLT leverages GPUs to accelerate graph sampling and utilizes UVA and GPU caches to reduce the data conversion and transferring costs during graph sampling and model training.
-Most of the APIs of GLT are compatible with PyG, so PyG users only need to modify a few lines of their PyG code to train their model with GLT.
-
-## Requirements
-
-- `python >= 3.6`
-- `torch >= 1.12`
-- `graphlearn-torch`
-
-## Distributed (Multi-Node) Example
-
-This example shows how to leverage [GraphLearn-for-PyTorch (GLT)](https://github.com/alibaba/graphlearn-for-pytorch) to train PyG models in a distributed scenario with GPUs. The dataset in this example is `ogbn-products` from the [Open Graph Benchmark](https://ogb.stanford.edu/), but you can also train on `ogbn-papers100M` with only minor modifications.
-
-To run this example, you can run the example as described below or directly make use of our [`launch.py`](launch.py) script.
-The training results will be generated and saved in `dist_sage_sup.txt`.
-
-### Running the Example
-
-#### Step 1: Prepare and partition the data
-
-Here, we use `ogbn-products` and partition it into two partitions:
-
-```bash
-python partition_ogbn_dataset.py --dataset=ogbn-products --root_dir=../../../data/ogbn-products --num_partitions=2
-```
-
-#### Step 2: Run the example in each training node
-
-For example, running the example in two nodes each with two GPUs:
-
-```bash
-# Node 0:
-CUDA_VISIBLE_DEVICES=0,1 python dist_train_sage_supervised.py \
-  --num_nodes=2 --node_rank=0 --master_addr=localhost \
-  --dataset=ogbn-products --dataset_root_dir=../../../data/ogbn-products \
-  --in_channel=100 --out_channel=47
-
-# Node 1:
-CUDA_VISIBLE_DEVICES=2,3 python dist_train_sage_supervised.py \
-  --num_nodes=2 --node_rank=1 --master_addr=localhost \
-  --dataset=ogbn-products --dataset_root_dir=../../../data/ogbn-products \
-  --in_channel=100 --out_channel=47
-```
-
-**Notes:**
-
-1. You should change the `master_addr` to the IP of `node#0`.
-1. Since there is randomness during data partitioning, please ensure all nodes are using the same partitioned data when running `dist_train_sage_supervised.py`.
-
-### Using the `launch.py` Script
-
-#### Step 1: Setup a distributed file system
-
-**Note**: You may skip this step if you already set up folder(s) synchronized across machines.
-
-To perform distributed sampling, files and codes need to be accessed across multiple machines.
-A distributed file system (*i.e.*, [NFS](https://wiki.archlinux.org/index.php/NFS), [SSHFS](https://www.digitalocean.com/community/tutorials/how-to-use-sshfs-to-mount-remote-file-systems-over-ssh), [Ceph](https://docs.ceph.com/en/latest/install), ...) exempts you from synchnonizing files such as partition information.
-
-#### Step 2: Prepare and partition the data
-
-In distributed training (under the worker mode), each node in the cluster holds a partition of the graph.
-Thus, before the training starts, we partition the `ogbn-products` dataset into multiple partitions, each of which corresponds to a specific training worker.
-
-The partitioning occurs in three steps:
-
-1. Run the partition algorithm to assign nodes to partitions.
-1. Construct the partitioned graph structure based on the node assignment.
-1. Split the node features and edge features into partitions.
-
-GLT supports caching graph topology and frequently accessed features in GPU to accelerate GPU sampling and feature collection.
-For feature caching, we adopt a pre-sampling-based approach to determine the hotness of nodes, and cache features for nodes with higher hotness while loading the graph.
-The uncached features are stored in pinned memory for efficient access via UVA.
-
-For further information about partitioning, please refer to the [official tutorial](https://github.com/alibaba/graphlearn-for-pytorch/blob/main/docs/tutorial/dist.md).
-
-Here, we use `ogbn-products` and partition it into two partitions:
-
-```bash
-python partition_ogbn_dataset.py --dataset=ogbn-products --root_dir=../../../data/ogbn-products --num_partitions=2
-```
-
-#### Step 3: Set up the configure file
-
-An example configuration file in given via [`dist_train_sage_sup_config.yml`](dist_train_sage_sup_config.yml).
-
-#### Step 4: Launch the distributed training
-
-```bash
-pip install paramiko
-pip install click
-apt install tmux
-python launch.py --config=dist_train_sage_sup_config.yml --master_addr=0.0.0.0 --master_port=11234
-```
-
-Here, `master_addr` is for the master RPC address, and `master_port` is for PyTorch's process group initialization across training processes.
-Note that you should change the `master_addr` to the IP of `node#0`.
diff --git a/examples/distributed/graphlearn_for_pytorch/dist_train_sage_sup_config.yml b/examples/distributed/graphlearn_for_pytorch/dist_train_sage_sup_config.yml
deleted file mode 100644
index 633be1a7a181..000000000000
--- a/examples/distributed/graphlearn_for_pytorch/dist_train_sage_sup_config.yml
+++ /dev/null
@@ -1,38 +0,0 @@
-# IP addresses for all nodes.
-# Note: The first 3 params are expected to form usernames@nodes:ports.
-nodes:
-  - 0.0.0.0
-  - 1.1.1.1
-
-# SSH ports for each node:
-ports: [22, 22]
-
-# Username for remote IPs:
-usernames:
-  - your_username_for_node_0
-  - your_username_for_node_1
-
-# Path to Python with GLT environment for each node:
-python_bins:
-  - /path/to/python
-  - /path/to/python
-
-# The dataset name, e.g., ogbn-products, ogbn-papers100M.
-# Note: make sure the name of dataset_root_dir is the same as the dataset name.
-dataset: ogbn-products
-
-# `in_channel` and `out_channel` of the dataset, e.g.,:
-# - ogbn-products: in_channel=100, out_channel=47
-# - ogbn-papers100M: in_channel=128, out_channel=172
-in_channel: 100
-out_channel: 47
-
-# Path to the pytorch_geometric directory:
-dst_paths:
-  - /path/to/pytorch_geometric
-  - /path/to/pytorch_geometric
-
-# Setup visible CUDA devices for each node:
-visible_devices:
-  - 0,1,2,3
-  - 0,1,2,3
diff --git a/examples/distributed/graphlearn_for_pytorch/dist_train_sage_supervised.py b/examples/distributed/graphlearn_for_pytorch/dist_train_sage_supervised.py
deleted file mode 100644
index d348e7d4b6cb..000000000000
--- a/examples/distributed/graphlearn_for_pytorch/dist_train_sage_supervised.py
+++ /dev/null
@@ -1,314 +0,0 @@
-import argparse
-import os.path as osp
-import time
-
-import graphlearn_torch as glt
-import torch
-import torch.distributed
-import torch.nn.functional as F
-from ogb.nodeproppred import Evaluator
-from torch import Tensor
-from torch.nn.parallel import DistributedDataParallel
-
-from torch_geometric.io import fs
-from torch_geometric.nn import GraphSAGE
-
-
-@torch.no_grad()
-def test(model, test_loader, dataset_name):
-    evaluator = Evaluator(name=dataset_name)
-    model.eval()
-    xs = []
-    y_true = []
-    for i, batch in enumerate(test_loader):
-        if i == 0:
-            device = batch.x.device
-        x = model(batch.x, batch.edge_index)[:batch.batch_size]
-        xs.append(x.cpu())
-        y_true.append(batch.y[:batch.batch_size].cpu())
-
-    xs = [t.to(device) for t in xs]
-    y_true = [t.to(device) for t in y_true]
-    y_pred = torch.cat(xs, dim=0).argmax(dim=-1, keepdim=True)
-    y_true = torch.cat(y_true, dim=0).unsqueeze(-1)
-    test_acc = evaluator.eval({
-        'y_true': y_true,
-        'y_pred': y_pred,
-    })['acc']
-    return test_acc
-
-
-def run_training_proc(
-    local_proc_rank: int,
-    num_nodes: int,
-    node_rank: int,
-    num_training_procs_per_node: int,
-    dataset_name: str,
-    in_channels: int,
-    out_channels: int,
-    dataset: glt.distributed.DistDataset,
-    train_idx: Tensor,
-    test_idx: Tensor,
-    epochs: int,
-    batch_size: int,
-    master_addr: str,
-    training_pg_master_port: int,
-    train_loader_master_port: int,
-    test_loader_master_port: int,
-):
-    # Initialize graphlearn_torch distributed worker group context:
-    glt.distributed.init_worker_group(
-        world_size=num_nodes * num_training_procs_per_node,
-        rank=node_rank * num_training_procs_per_node + local_proc_rank,
-        group_name='distributed-sage-supervised-trainer')
-
-    current_ctx = glt.distributed.get_context()
-    current_device = torch.device(local_proc_rank % torch.cuda.device_count())
-
-    # Initialize training process group of PyTorch:
-    torch.distributed.init_process_group(
-        backend='nccl',  # or choose 'gloo' if 'nccl' is not supported.
-        rank=current_ctx.rank,
-        world_size=current_ctx.world_size,
-        init_method=f'tcp://{master_addr}:{training_pg_master_port}',
-    )
-
-    # Create distributed neighbor loader for training.
-    # We replace PyG's NeighborLoader with GLT's DistNeighborLoader.
-    # GLT parameters for sampling are quite similar to PyG.
-    # We only need to configure additional network and device parameters:
-    train_idx = train_idx.split(
-        train_idx.size(0) // num_training_procs_per_node)[local_proc_rank]
-    train_loader = glt.distributed.DistNeighborLoader(
-        data=dataset,
-        num_neighbors=[15, 10, 5],
-        input_nodes=train_idx,
-        batch_size=batch_size,
-        shuffle=True,
-        collect_features=True,
-        to_device=current_device,
-        worker_options=glt.distributed.MpDistSamplingWorkerOptions(
-            num_workers=1,
-            worker_devices=[current_device],
-            worker_concurrency=4,
-            master_addr=master_addr,
-            master_port=train_loader_master_port,
-            channel_size='1GB',
-            pin_memory=True,
-        ),
-    )
-
-    # Create distributed neighbor loader for testing.
-    test_idx = test_idx.split(test_idx.size(0) //
-                              num_training_procs_per_node)[local_proc_rank]
-    test_loader = glt.distributed.DistNeighborLoader(
-        data=dataset,
-        num_neighbors=[15, 10, 5],
-        input_nodes=test_idx,
-        batch_size=batch_size,
-        shuffle=False,
-        collect_features=True,
-        to_device=current_device,
-        worker_options=glt.distributed.MpDistSamplingWorkerOptions(
-            num_workers=2,
-            worker_devices=[
-                torch.device('cuda', i % torch.cuda.device_count())
-                for i in range(2)
-            ],
-            worker_concurrency=4,
-            master_addr=master_addr,
-            master_port=test_loader_master_port,
-            channel_size='2GB',
-            pin_memory=True,
-        ),
-    )
-
-    # Define the model and optimizer.
-    torch.cuda.set_device(current_device)
-    model = GraphSAGE(
-        in_channels=in_channels,
-        hidden_channels=256,
-        num_layers=3,
-        out_channels=out_channels,
-    ).to(current_device)
-    model = DistributedDataParallel(model, device_ids=[current_device.index])
-
-    optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
-
-    # Train and test:
-    f = open('dist_sage_sup.txt', 'a+')
-    for epoch in range(0, epochs):
-        model.train()
-        start = time.time()
-        for batch in train_loader:
-            optimizer.zero_grad()
-            out = model(batch.x, batch.edge_index)[:batch.batch_size]
-            loss = F.cross_entropy(out, batch.y[:batch.batch_size].long())
-            loss.backward()
-            optimizer.step()
-        f.write(f'-- [Trainer {current_ctx.rank}] Epoch: {epoch:03d}, '
-                f'Loss: {loss:.4f}, Epoch Time: {time.time() - start}\n')
-
-        torch.cuda.synchronize()
-        torch.distributed.barrier()
-
-        if epoch == 0 or epoch > (epochs // 2):
-            test_acc = test(model, test_loader, dataset_name)
-            f.write(f'-- [Trainer {current_ctx.rank}] '
-                    f'Test Acc: {test_acc:.4f}\n')
-            torch.cuda.synchronize()
-            torch.distributed.barrier()
-
-
-if __name__ == '__main__':
-    parser = argparse.ArgumentParser()
-    parser.add_argument(
-        '--dataset',
-        type=str,
-        default='ogbn-products',
-        help='The name of the dataset',
-    )
-    parser.add_argument(
-        '--in_channel',
-        type=int,
-        default=100,
-        help='Number of input features of the dataset',
-    )
-    parser.add_argument(
-        '--out_channel',
-        type=int,
-        default=47,
-        help='Number of classes of the dataset',
-    )
-    parser.add_argument(
-        '--num_dataset_partitions',
-        type=int,
-        default=2,
-        help='The number of partitions',
-    )
-    parser.add_argument(
-        '--dataset_root_dir',
-        type=str,
-        default='../../../data/products',
-        help='The root directory (relative path) of the partitioned dataset',
-    )
-    parser.add_argument(
-        '--num_nodes',
-        type=int,
-        default=2,
-        help='Number of distributed nodes',
-    )
-    parser.add_argument(
-        '--node_rank',
-        type=int,
-        default=0,
-        help='The current node rank',
-    )
-    parser.add_argument(
-        '--num_training_procs',
-        type=int,
-        default=2,
-        help='The number of training processes per node',
-    )
-    parser.add_argument(
-        '--epochs',
-        type=int,
-        default=10,
-        help='The number of training epochs',
-    )
-    parser.add_argument(
-        '--batch_size',
-        type=int,
-        default=512,
-        help='The batch size for the training and testing data loaders',
-    )
-    parser.add_argument(
-        '--master_addr',
-        type=str,
-        default='localhost',
-        help='The master address for RPC initialization',
-    )
-    parser.add_argument(
-        '--training_pg_master_port',
-        type=int,
-        default=11111,
-        help="The port used for PyTorch's process group initialization",
-    )
-    parser.add_argument(
-        '--train_loader_master_port',
-        type=int,
-        default=11112,
-        help='The port used for RPC initialization for training',
-    )
-    parser.add_argument(
-        '--test_loader_master_port',
-        type=int,
-        default=11113,
-        help='The port used for RPC initialization for testing',
-    )
-    args = parser.parse_args()
-
-    # Record configuration information for debugging
-    f = open('dist_sage_sup.txt', 'a+')
-    f.write('--- Distributed training example of supervised SAGE ---\n')
-    f.write(f'* dataset: {args.dataset}\n')
-    f.write(f'* dataset root dir: {args.dataset_root_dir}\n')
-    f.write(f'* number of dataset partitions: {args.num_dataset_partitions}\n')
-    f.write(f'* total nodes: {args.num_nodes}\n')
-    f.write(f'* node rank: {args.node_rank}\n')
-    f.write(f'* number of training processes per node: '
-            f'{args.num_training_procs}\n')
-    f.write(f'* epochs: {args.epochs}\n')
-    f.write(f'* batch size: {args.batch_size}\n')
-    f.write(f'* master addr: {args.master_addr}\n')
-    f.write(f'* training process group master port: '
-            f'{args.training_pg_master_port}\n')
-    f.write(f'* training loader master port: '
-            f'{args.train_loader_master_port}\n')
-    f.write(f'* testing loader master port: {args.test_loader_master_port}\n')
-
-    f.write('--- Loading data partition ...\n')
-    root_dir = osp.join(osp.dirname(osp.realpath(__file__)),
-                        args.dataset_root_dir)
-    data_pidx = args.node_rank % args.num_dataset_partitions
-    dataset = glt.distributed.DistDataset()
-
-    label_file = osp.join(root_dir, f'{args.dataset}-label', 'label.pt')
-    dataset.load(
-        root_dir=osp.join(root_dir, f'{args.dataset}-partitions'),
-        partition_idx=data_pidx,
-        graph_mode='ZERO_COPY',
-        whole_node_label_file=label_file,
-    )
-    train_file = osp.join(root_dir, f'{args.dataset}-train-partitions',
-                          f'partition{data_pidx}.pt')
-    train_idx = fs.torch_load(train_file)
-    test_file = osp.join(root_dir, f'{args.dataset}-test-partitions',
-                         f'partition{data_pidx}.pt')
-    test_idx = fs.torch_load(test_file)
-    train_idx.share_memory_()
-    test_idx.share_memory_()
-
-    f.write('--- Launching training processes ...\n')
-    torch.multiprocessing.spawn(
-        run_training_proc,
-        args=(
-            args.num_nodes,
-            args.node_rank,
-            args.num_training_procs,
-            args.dataset,
-            args.in_channel,
-            args.out_channel,
-            dataset,
-            train_idx,
-            test_idx,
-            args.epochs,
-            args.batch_size,
-            args.master_addr,
-            args.training_pg_master_port,
-            args.train_loader_master_port,
-            args.test_loader_master_port,
-        ),
-        nprocs=args.num_training_procs,
-        join=True,
-    )
diff --git a/examples/distributed/graphlearn_for_pytorch/launch.py b/examples/distributed/graphlearn_for_pytorch/launch.py
deleted file mode 100644
index 6b467dcd2bcf..000000000000
--- a/examples/distributed/graphlearn_for_pytorch/launch.py
+++ /dev/null
@@ -1,95 +0,0 @@
-import argparse
-
-import click
-import paramiko
-import yaml
-
-if __name__ == '__main__':
-    parser = argparse.ArgumentParser()
-    parser.add_argument(
-        '--config',
-        type=str,
-        default='dist_train_sage_sup_config.yml',
-        help='The path to the configuration file',
-    )
-    parser.add_argument(
-        '--epochs',
-        type=int,
-        default=10,
-        help='The number of training epochs',
-    )
-    parser.add_argument(
-        '--batch_size',
-        type=int,
-        default=512,
-        help='The batch size for the training and testing data loaders',
-    )
-    parser.add_argument(
-        '--master_addr',
-        type=str,
-        default='0.0.0.0',
-        help='Master IP address for synchronization across all training nodes',
-    )
-    parser.add_argument(
-        '--master_port',
-        type=str,
-        default='11345',
-        help='The port for synchronization across all training nodes',
-    )
-    args = parser.parse_args()
-
-    config = open(args.config)
-    config = yaml.safe_load(config)
-    dataset = config['dataset']
-    ip_list = config['nodes']
-    port_list = config['ports']
-    username_list = config['usernames']
-    dst_path_list = config['dst_paths']
-    node_ranks = list(range(len(ip_list)))
-    num_nodes = len(node_ranks)
-    visible_devices = config['visible_devices']
-    python_bins = config['python_bins']
-    num_cores = len(str(visible_devices[0]).split(','))
-    in_channel = str(config['in_channel'])
-    out_channel = str(config['out_channel'])
-
-    dataset_path = '../../../data/'
-    passwd_dict = {}
-    for username, ip in zip(username_list, ip_list):
-        passwd_dict[ip + username] = click.prompt(
-            f'Password for {username}@{ip}', hide_input=True)
-    for username, ip, port, dst, noderk, device, pythonbin in zip(
-            username_list,
-            ip_list,
-            port_list,
-            dst_path_list,
-            node_ranks,
-            visible_devices,
-            python_bins,
-    ):
-        trans = paramiko.Transport((ip, port))
-        trans.connect(username=username, password=passwd_dict[ip + username])
-        ssh = paramiko.SSHClient()
-        ssh._transport = trans
-
-        to_dist_dir = 'cd ' + dst + \
-            '/examples/distributed/graphlearn_for_pytorch/ '
-        exec_example = "tmux new -d 'CUDA_VISIBLE_DEVICES=" + str(device) + \
-            " " + pythonbin + " dist_train_sage_supervised.py --dataset=" + \
-            dataset + " --dataset_root_dir=" + dataset_path + dataset + \
-            " --in_channel=" + in_channel + " --out_channel=" + out_channel + \
-            " --node_rank=" + str(noderk) + " --num_dataset_partitions=" + \
-            str(num_nodes) + " --num_nodes=" + str(num_nodes) + \
-            " --num_training_procs=" + str(num_cores) + " --master_addr=" + \
-            args.master_addr + " --training_pg_master_port=" + \
-            args.master_port + " --train_loader_master_port=" + \
-            str(int(args.master_port) + 1) + " --test_loader_master_port=" + \
-            str(int(args.master_port) + 2) + " --batch_size=" + \
-            str(args.batch_size) + " --epochs=" + str(args.epochs)
-
-        print(to_dist_dir + ' && ' + exec_example + " '")
-        stdin, stdout, stderr = ssh.exec_command(
-            to_dist_dir + ' && ' + exec_example + " '", bufsize=1)
-        print(stdout.read().decode())
-        print(stderr.read().decode())
-        ssh.close()
diff --git a/examples/distributed/graphlearn_for_pytorch/partition_ogbn_dataset.py b/examples/distributed/graphlearn_for_pytorch/partition_ogbn_dataset.py
deleted file mode 100644
index 02347a026709..000000000000
--- a/examples/distributed/graphlearn_for_pytorch/partition_ogbn_dataset.py
+++ /dev/null
@@ -1,145 +0,0 @@
-import argparse
-import ast
-import os.path as osp
-
-import graphlearn_torch as glt
-import torch
-from ogb.nodeproppred import PygNodePropPredDataset
-
-
-def partition_dataset(
-    ogbn_dataset: str,
-    root_dir: str,
-    num_partitions: int,
-    num_nbrs: glt.NumNeighbors,
-    chunk_size: int,
-    cache_ratio: float,
-):
-    ###########################################################################
-    # In distributed training (under the worker mode), each node in the cluster
-    # holds a partition of the graph. Thus before the training starts, we
-    # partition the dataset into multiple partitions, each of which corresponds
-    # to a specific training worker.
-    # The partitioning occurs in three steps:
-    #   1. Run a partition algorithm to assign nodes to partitions.
-    #   2. Construct partition graph structure based on the node assignment.
-    #   3. Split the node features and edge features based on the partition
-    # result.
-    ###########################################################################
-
-    print(f'-- Loading {ogbn_dataset} ...')
-    dataset = PygNodePropPredDataset(ogbn_dataset, root_dir)
-    data = dataset[0]
-    print(f'* node count: {data.num_nodes}')
-    print(f'* edge count: {data.num_edges}')
-    split_idx = dataset.get_idx_split()
-
-    print('-- Saving label ...')
-    label_dir = osp.join(root_dir, f'{ogbn_dataset}-label')
-    glt.utils.ensure_dir(label_dir)
-    torch.save(data.y.squeeze(), osp.join(label_dir, 'label.pt'))
-
-    print('-- Partitioning training idx ...')
-    train_idx = split_idx['train']
-    train_idx = train_idx.split(train_idx.size(0) // num_partitions)
-    train_idx_partitions_dir = osp.join(
-        root_dir,
-        f'{ogbn_dataset}-train-partitions',
-    )
-    glt.utils.ensure_dir(train_idx_partitions_dir)
-    for pidx in range(num_partitions):
-        torch.save(
-            train_idx[pidx],
-            osp.join(train_idx_partitions_dir, f'partition{pidx}.pt'),
-        )
-
-    print('-- Partitioning test idx ...')
-    test_idx = split_idx['test']
-    test_idx = test_idx.split(test_idx.size(0) // num_partitions)
-    test_idx_partitions_dir = osp.join(
-        root_dir,
-        f'{ogbn_dataset}-test-partitions',
-    )
-    glt.utils.ensure_dir(test_idx_partitions_dir)
-    for pidx in range(num_partitions):
-        torch.save(
-            test_idx[pidx],
-            osp.join(test_idx_partitions_dir, f'partition{pidx}.pt'),
-        )
-
-    print('-- Initializing graph ...')
-    csr_topo = glt.data.Topology(edge_index=data.edge_index,
-                                 input_layout='COO')
-    graph = glt.data.Graph(csr_topo, mode='ZERO_COPY')
-
-    print('-- Sampling hotness ...')
-    glt_sampler = glt.sampler.NeighborSampler(graph, num_nbrs)
-    node_probs = []
-    for pidx in range(num_partitions):
-        seeds = train_idx[pidx]
-        prob = glt_sampler.sample_prob(seeds, data.num_nodes)
-        node_probs.append(prob.cpu())
-
-    print('-- Partitioning graph and features ...')
-    partitions_dir = osp.join(root_dir, f'{ogbn_dataset}-partitions')
-    freq_partitioner = glt.partition.FrequencyPartitioner(
-        output_dir=partitions_dir,
-        num_parts=num_partitions,
-        num_nodes=data.num_nodes,
-        edge_index=data.edge_index,
-        probs=node_probs,
-        node_feat=data.x,
-        chunk_size=chunk_size,
-        cache_ratio=cache_ratio,
-    )
-    freq_partitioner.partition()
-
-
-if __name__ == '__main__':
-    parser = argparse.ArgumentParser()
-    parser.add_argument(
-        '--dataset',
-        type=str,
-        default='ogbn-products',
-        help='The name of the dataset',
-    )
-    parser.add_argument(
-        '--num_partitions',
-        type=int,
-        default=2,
-        help='The Number of partitions',
-    )
-    parser.add_argument(
-        '--root_dir',
-        type=str,
-        default='../../../data/ogbn-products',
-        help='The root directory (relative path) of the partitioned dataset',
-    )
-    parser.add_argument(
-        '--num_nbrs',
-        type=ast.literal_eval,
-        default='[15,10,5]',
-        help='The number of neighbors to sample hotness for feature caching',
-    )
-    parser.add_argument(
-        '--chunk_size',
-        type=int,
-        default=10000,
-        help='The chunk size for feature partitioning',
-    )
-    parser.add_argument(
-        '--cache_ratio',
-        type=float,
-        default=0.2,
-        help='The proportion to cache features per partition',
-    )
-    args = parser.parse_args()
-
-    partition_dataset(
-        ogbn_dataset=args.dataset,
-        root_dir=osp.join(osp.dirname(osp.realpath(__file__)), args.root_dir),
-        num_partitions=args.num_partitions,
-        num_nbrs=args.num_nbrs,
-        chunk_size=args.chunk_size,
-        cache_ratio=args.cache_ratio,
-    )
diff --git a/examples/distributed/kuzu/README.md b/examples/distributed/kuzu/README.md
deleted file mode 100644
index 298baf8f9493..000000000000
--- a/examples/distributed/kuzu/README.md
+++ /dev/null
@@ -1,38 +0,0 @@
-# Using Kùzu as a Remote Backend for PyG
-
-[Kùzu](https://kuzudb.com/) is an in-process property graph database management system built for query speed and scalability.
-It provides an integration with PyG via the [remote backend interface](https://pytorch-geometric.readthedocs.io/en/latest/advanced/remote.html) of PyG.
-The Python API of Kùzu outputs a [`torch_geometric.data.FeatureStore`](https://pytorch-geometric.readthedocs.io/en/latest/generated/torch_geometric.data.FeatureStore.html) and a [`torch_geometric.data.GraphStore`](https://pytorch-geometric.readthedocs.io/en/latest/generated/torch_geometric.data.GraphStore.html) that can be plugged directly into existing familiar PyG interfaces such as [`NeighborLoader`](https://pytorch-geometric.readthedocs.io/en/latest/_modules/torch_geometric/loader/neighbor_loader.html) and enables training GNNs directly on graphs stored in Kùzu.
-This is particularly useful if you would like to train on graphs that don't fit in your CPU memory.
-
-## Installation
-
-You can install Kùzu as follows:
-
-```bash
-pip install kuzu
-```
-
-## Usage
-
-The API and design documentation of Kùzu can be found at [https://kuzudb.com/docs/](https://kuzudb.com/docs/).
-
-## Examples
-
-We provide the following examples to showcase the usage of Kùzu remote backend within PyG:
-
-### PubMed
-
-<a target="_blank" href="https://colab.research.google.com/drive/12fOSqPm1HQTz_m9caRW7E_92vaeD9xq6">
-  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
-</a>
-
-The PubMed example is hosted on [Google Colab](https://colab.research.google.com/drive/12fOSqPm1HQTz_m9caRW7E_92vaeD9xq6).
-In this example, we work on a small dataset for demonstrative purposes.
-The [PubMed](https://pytorch-geometric.readthedocs.io/en/latest/generated/torch_geometric.datasets.Planetoid.html) dataset consists of 19,717 papers as nodes and 88,648 citation relationships between them.
-
-### `papers_100M`
-
-This example shows how to use the remote backend feature of Kùzu to work with a large graph of papers and citations on a single machine.
-The data used in this example is `ogbn-papers100M` from the [Open Graph Benchmark](https://ogb.stanford.edu/).
-The dataset contains approximately 111 million nodes and 1.6 billion edges.
diff --git a/examples/distributed/kuzu/papers_100M/README.md b/examples/distributed/kuzu/papers_100M/README.md
deleted file mode 100644
index 7e30a81a7e9f..000000000000
--- a/examples/distributed/kuzu/papers_100M/README.md
+++ /dev/null
@@ -1,16 +0,0 @@
-# `papers_100M` Example
-
-This example shows how to use the remote backend feature of [Kùzu](https://kuzudb.com) to work with a large graph of papers and citations on a single machine.
-The data used in this example is `ogbn-papers100M` from the [Open Graph Benchmark](https://ogb.stanford.edu/).
-The dataset contains approximately 111 million nodes and 1.6 billion edges.
-
-## Prepare the data
-
-1. Download the dataset from [`http://snap.stanford.edu/ogb/data/nodeproppred/papers100M-bin.zip`](http://snap.stanford.edu/ogb/data/nodeproppred/papers100M-bin.zip) and put the `*.zip` file into this directory.
-1. Run `python prepare_data.py`.
-   The script will automatically extract the data and convert it to the format that Kùzu can read.
-   A Kùzu database instance is then created under `papers_100M` and the data is loaded into it.
-
-## Train a Model
-
-Afterwards, run `python train.py` to train a three-layer [`GraphSAGE`](https://pytorch-geometric.readthedocs.io/en/latest/generated/torch_geometric.nn.models.GraphSAGE.html) model on this dataset.
diff --git a/examples/distributed/kuzu/papers_100M/prepare_data.py b/examples/distributed/kuzu/papers_100M/prepare_data.py
deleted file mode 100644
index a4892a6df895..000000000000
--- a/examples/distributed/kuzu/papers_100M/prepare_data.py
+++ /dev/null
@@ -1,54 +0,0 @@
-from multiprocessing import cpu_count
-from os import path
-from zipfile import ZipFile
-
-import kuzu
-import numpy as np
-from tqdm import tqdm
-
-with ZipFile("papers100M-bin.zip", 'r') as papers100M_zip:
-    print('Extracting papers100M-bin.zip...')
-    papers100M_zip.extractall()
-
-with ZipFile("papers100M-bin/raw/data.npz", 'r') as data_zip:
-    print('Extracting data.npz...')
-    data_zip.extractall()
-
-with ZipFile("papers100M-bin/raw/node-label.npz", 'r') as node_label_zip:
-    print('Extracting node-label.npz...')
-    node_label_zip.extractall()
-
-print("Converting edge_index to CSV...")
-edge_index = np.load('edge_index.npy', mmap_mode='r')
-csvfile = open('edge_index.csv', 'w')
-csvfile.write('src,dst\n')
-for i in tqdm(range(edge_index.shape[1])):
-    csvfile.write(str(edge_index[0, i]) + ',' + str(edge_index[1, i]) + '\n')
-csvfile.close()
-
-print("Generating IDs for nodes...")
-node_year = np.load('node_year.npy', mmap_mode='r')
-length = node_year.shape[0]
-ids = np.arange(length)
-np.save('ids.npy', ids)
-
-ids_path = path.abspath(path.join('.', 'ids.npy'))
-edge_index_path = path.abspath(path.join('.', 'edge_index.csv'))
-node_label_path = path.abspath(path.join('.', 'node_label.npy'))
-node_feature_path = path.abspath(path.join('.', 'node_feat.npy'))
-node_year_path = path.abspath(path.join('.', 'node_year.npy'))
-
-print("Creating Kùzu database...")
-db = kuzu.Database('papers100M')
-conn = kuzu.Connection(db, num_threads=cpu_count())
-print("Creating Kùzu tables...")
-conn.execute(
-    "CREATE NODE TABLE paper(id INT64, x FLOAT[128], year INT64, y FLOAT, "
-    "PRIMARY KEY (id));")
-conn.execute("CREATE REL TABLE cites(FROM paper TO paper, MANY_MANY);")
-print("Copying nodes to Kùzu tables...")
-conn.execute('COPY paper FROM ("%s",  "%s",  "%s", "%s") BY COLUMN;' %
-             (ids_path, node_feature_path, node_year_path, node_label_path))
-print("Copying edges to Kùzu tables...")
-conn.execute('COPY cites FROM "%s";' % (edge_index_path))
-print("All done!")
diff --git a/examples/distributed/kuzu/papers_100M/train.py b/examples/distributed/kuzu/papers_100M/train.py
deleted file mode 100644
index d4a1049e96cf..000000000000
--- a/examples/distributed/kuzu/papers_100M/train.py
+++ /dev/null
@@ -1,123 +0,0 @@
-import multiprocessing as mp
-import os.path as osp
-
-import kuzu
-import pandas as pd
-import torch
-import torch.nn.functional as F
-from tqdm import tqdm
-
-from torch_geometric.loader import NeighborLoader
-from torch_geometric.nn import MLP, BatchNorm, SAGEConv
-
-NUM_EPOCHS = 1
-LOADER_BATCH_SIZE = 1024
-
-print('Batch size:', LOADER_BATCH_SIZE)
-print('Number of epochs:', NUM_EPOCHS)
-
-device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
-print('Using device:', device)
-
-# Load the train set:
-train_path = osp.join('.', 'papers100M-bin', 'split', 'time', 'train.csv.gz')
-train_df = pd.read_csv(
-    osp.abspath(train_path),
-    compression='gzip',
-    header=None,
-)
-input_nodes = torch.tensor(train_df[0].values, dtype=torch.long)
-
-########################################################################
-# The below code sets up the remote backend of Kùzu for PyG.
-# Please refer to: https://kuzudb.com/docs/client-apis/python-api/overview.html
-# for how to use the Python API of Kùzu.
-########################################################################
-
-# The buffer pool size of Kùzu is set to 40GB. You can change it to a smaller
-# value if you have less memory.
-KUZU_BM_SIZE = 40 * 1024**3
-
-# Create Kùzu database:
-db = kuzu.Database(osp.abspath(osp.join('.', 'papers100M')), KUZU_BM_SIZE)
-
-# Get remote backend for PyG:
-feature_store, graph_store = db.get_torch_geometric_remote_backend(
-    mp.cpu_count())
-
-# Plug the graph store and feature store into the `NeighborLoader`.
-# Note that `filter_per_worker` is set to `False`. This is because the Kùzu
-# database is already using multi-threading to scan the features in parallel
-# and the database object is not fork-safe.
-loader = NeighborLoader(
-    data=(feature_store, graph_store),
-    num_neighbors={('paper', 'cites', 'paper'): [12, 12, 12]},
-    batch_size=LOADER_BATCH_SIZE,
-    input_nodes=('paper', input_nodes),
-    num_workers=4,
-    filter_per_worker=False,
-)
-
-
-class GraphSAGE(torch.nn.Module):
-    def __init__(self, in_channels, hidden_channels, out_channels, num_layers,
-                 dropout=0.2):
-        super().__init__()
-
-        self.convs = torch.nn.ModuleList()
-        self.norms = torch.nn.ModuleList()
-
-        self.convs.append(SAGEConv(in_channels, hidden_channels))
-        self.norms.append(BatchNorm(hidden_channels))
-        for _ in range(1, num_layers):
-            self.convs.append(SAGEConv(hidden_channels, hidden_channels))
-            self.norms.append(BatchNorm(hidden_channels))
-
-        self.mlp = MLP(
-            in_channels=in_channels + num_layers * hidden_channels,
-            hidden_channels=2 * out_channels,
-            out_channels=out_channels,
-            num_layers=2,
-            norm='batch_norm',
-            act='leaky_relu',
-        )
-
-        self.dropout = dropout
-
-    def forward(self, x, edge_index):
-        x = F.dropout(x, p=self.dropout, training=self.training)
-        xs = [x]
-        for conv, norm in zip(self.convs, self.norms):
-            x = conv(x, edge_index)
-            x = norm(x)
-            x = x.relu()
-            x = F.dropout(x, p=self.dropout, training=self.training)
-            xs.append(x)
-        return self.mlp(torch.cat(xs, dim=-1))
-
-
-model = GraphSAGE(in_channels=128, hidden_channels=1024, out_channels=172,
-                  num_layers=3).to(device)
-optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)
-
-for epoch in range(1, NUM_EPOCHS + 1):
-    total_loss = total_examples = 0
-    for batch in tqdm(loader):
-        batch = batch.to(device)
-        batch_size = batch['paper'].batch_size
-
-        optimizer.zero_grad()
-        out = model(
-            batch['paper'].x,
-            batch['paper', 'cites', 'paper'].edge_index,
-        )[:batch_size]
-        y = batch['paper'].y[:batch_size].long().view(-1)
-        loss = F.cross_entropy(out, y)
-
-        loss.backward()
-        optimizer.step()
-
-        total_loss += float(loss) * y.numel()
-        total_examples += y.numel()
-
-    print(f'Epoch: {epoch:02d}, Loss: {total_loss / total_examples:.4f}')
diff --git a/examples/distributed/pyg/README.md b/examples/distributed/pyg/README.md
deleted file mode 100644
index 890643359d08..000000000000
--- a/examples/distributed/pyg/README.md
+++ /dev/null
@@ -1,138 +0,0 @@
-# Distributed Training with PyG
-
-**[`torch_geometric.distributed`](https://github.com/pyg-team/pytorch_geometric/tree/master/torch_geometric/distributed)** (deprecated) implements a scalable solution for distributed GNN training, built exclusively upon PyTorch and PyG.
-
-The current application can be deployed on a cluster of arbitrary size using multiple CPUs.
-A PyG-native GPU application is under development and will be released soon.
-
-The solution is designed to effortlessly distribute the training of large-scale graph neural networks across multiple nodes, thanks to the integration of [Distributed Data Parallelism (DDP)](https://pytorch.org/docs/stable/notes/ddp.html) for model training and [Remote Procedure Call (RPC)](https://pytorch.org/docs/stable/rpc.html) for efficient sampling and fetching of non-local features.
-The design includes a number of custom classes, *e.g.*, (1) `DistNeighborSampler`, which implements CPU sampling algorithms and feature extraction from local and remote data while keeping a consistent data structure at the output, (2) an integrated `DistLoader`, which ensures safe opening and closing of RPC connections between the samplers, and (3) a METIS-based `Partitioner`, among others.
-
-## Example for Node-level Distributed Training on OGB Datasets
-
-The example provided in [`node_ogb_cpu.py`](./node_ogb_cpu.py) performs distributed training with multiple CPU nodes using [OGB](https://ogb.stanford.edu/) datasets and a [`GraphSAGE`](https://pytorch-geometric.readthedocs.io/en/latest/generated/torch_geometric.nn.models.GraphSAGE.html) model.
-The example can run on both homogeneous (`ogbn-products`) and heterogeneous data (`ogbn-mag`).
-With minor modifications, the example can be extended to train on `ogbn-papers100m` or any other dataset.
-
-To run the example, please refer to the steps below.
-
-### Requirements
-
-- [`torch-geometric>=2.5.0`](https://github.com/pyg-team/pytorch_geometric) and [`pyg-lib>=0.4.0`](https://github.com/pyg-team/pyg-lib)
-- Password-less SSH needs to be set up on all the nodes that you are using (see the [Linux SSH manual](https://linuxize.com/post/how-to-setup-passwordless-ssh-login)).
-- All nodes need to have consistent environments installed; specifically, the `torch` and `pyg-lib` versions must be the same.
-  You might want to consider using docker containers.
-- *[Optional]* In some cases, the Linux firewall might be blocking TCP connections.
-  Ensure that firewall settings allow for all nodes to communicate (see the [Linux firewall manual](https://ubuntu.com/server/docs/security-firewall)).
-  For this example TCP ports `11111`, `11112` and `11113` should be open (*e.g.* `sudo ufw allow 11111`).
-
-### Step 1: Prepare and Partition the Data
-
-In distributed training, each node in the cluster holds a partition of the graph.
-Before the training starts, we partition the dataset into multiple partitions, each of which corresponds to a specific training node.
-
-Here, we use `ogbn-products` and partition it into two partitions (by default) via the [`partition_graph.py`](./partition_graph.py) script:
-
-```bash
-python partition_graph.py --dataset=ogbn-products --root_dir=../../../data --num_partitions=2
-```
-
-**Caution:** Partitioning with METIS is non-deterministic!
-All nodes should be able to access the same partition data.
-Therefore, generate the partitions on one node and copy the data to all members of the cluster, or place the folder into a shared location.
-
-The generated partition will have a folder structure as below:
-
-```
-data
-├─ dataset
-│  ├─ ogbn-mag
-│  └─ ogbn-products
-└─ partitions
-   ├─ ogbn-mag
-   └─ ogbn-products
-      ├─ ogbn-products-partitions
-      │  ├─ part_0
-      │  ├─ part_1
-      │  ├─ META.json
-      │  ├─ node_map.pt
-      │  └─ edge_map.pt
-      ├─ ogbn-products-label
-      │  └─ label.pt
-      ├─ ogbn-products-test-partitions
-      │  ├─ partition0.pt
-      │  └─ partition1.pt
-      └─ ogbn-products-train-partitions
-         ├─ partition0.pt
-         └─ partition1.pt
-```
-
-### Step 2: Run the Example in Each Training Node
-
-To run the example, you can execute the commands in each node or use the provided launch script.
-
-#### Option A: Manual Execution
-
-You should change the `master_addr` to the IP of `node#0`.
-Make sure that the correct `node_rank` is provided, with the master node assigned to rank `0`.
-The `dataset_root_dir` should point to the head directory where your partition is placed, *e.g.* `../../data/partitions/ogbn-products/2-parts`:
-
-```bash
-# Node 0:
-python node_ogb_cpu.py \
-  --dataset=ogbn-products \
-  --dataset_root_dir=<partition folder directory> \
-  --num_nodes=2 \
-  --node_rank=0 \
-  --master_addr=<master ip>
-
-# Node 1:
-python node_ogb_cpu.py \
-  --dataset=ogbn-products \
-  --dataset_root_dir=<partition folder directory> \
-  --num_nodes=2 \
-  --node_rank=1 \
-  --master_addr=<master ip>
-```
-
-In some configurations, the network interface used for multi-node communication may differ from the default one.
-In this case, the interface used for multi-node communication needs to be specified to Gloo.
-
-The commands below assume that `$MASTER_ADDR` is set to the IP of `node#0`.
-
-On the `node#0`:
-
-```bash
-export TP_SOCKET_IFNAME=$(ip addr | grep "$MASTER_ADDR" | awk '{print $NF}')
-export GLOO_SOCKET_IFNAME=$TP_SOCKET_IFNAME
-```
-
-On the other nodes:
-
-```bash
-export TP_SOCKET_IFNAME=$(ip route get $MASTER_ADDR | grep -oP '(?<=dev )[^ ]+')
-export GLOO_SOCKET_IFNAME=$TP_SOCKET_IFNAME
-```
-
-#### Option B: Launch Script
-
-There exist two methods to run the distributed example with one script in one terminal for multiple nodes:
-
-1. [`launch.py`](./launch.py):
-   ```bash
-   python launch.py
-     --workspace {workspace}/pytorch_geometric
-     --num_nodes 2
-     --dataset_root_dir {dataset_dir}/mag/2-parts
-     --dataset ogbn-mag
-     --batch_size 1024
-     --learning_rate 0.0004
-     --part_config {dataset_dir}/mag/2-parts/ogbn-mag-partitions/META.json
-     --ip_config {workspace}/pytorch_geometric/ip_config.yaml
-    'cd /home/user_xxx; source {conda_envs}/bin/activate; cd {workspace}/pytorch_geometric; {conda_envs}/bin/python
-     {workspace}/pytorch_geometric/examples/pyg/node_ogb_cpu.py --dataset=ogbn-mag --logging --progress_bar --ddp_port=11111'
-   ```
-1. [`run_dist.sh`](./run_dist.sh): All parameter settings are contained in the `run_dist.sh` script, so you just need to run it with:
-   ```bash
-   ./run_dist.sh
-   ```
diff --git a/examples/distributed/pyg/launch.py b/examples/distributed/pyg/launch.py
deleted file mode 100644
index fee6d6750b77..000000000000
--- a/examples/distributed/pyg/launch.py
+++ /dev/null
@@ -1,430 +0,0 @@
-import argparse
-import logging
-import multiprocessing
-import os
-import queue
-import re
-import signal
-import subprocess
-import sys
-import time
-from functools import partial
-from threading import Thread
-from typing import Optional
-
-
-def clean_runs(get_all_remote_pids, conn):
-    """This process cleans up the remaining remote training tasks."""
-    print("Cleanup runs")
-    signal.signal(signal.SIGINT, signal.SIG_IGN)
-    data = conn.recv()
-
-    # If the launch process exits normally, don't do anything:
-    if data == "exit":
-        sys.exit(0)
-    else:
-        remote_pids = get_all_remote_pids()
-        for (ip, port), pids in remote_pids.items():
-            kill_proc(ip, port, pids)
-    print("Cleanup exits")
-
-
-def kill_proc(ip, port, pids):
-    """SSH to remote nodes and kill the specified processes."""
-    curr_pid = os.getpid()
-    killed_pids = []
-    pids.sort()
-    for pid in pids:
-        assert curr_pid != pid
-        print(f"Kill process {pid} on {ip}:{port}", flush=True)
-        kill_cmd = ("ssh -o StrictHostKeyChecking=no -p " + str(port) + " " +
-                    ip + f" 'kill {pid}'")
-        subprocess.run(kill_cmd, shell=True)
-        killed_pids.append(pid)
-    for _ in range(3):
-        killed_pids = get_pids_to_kill(ip, port, killed_pids)
-        if len(killed_pids) == 0:
-            break
-        else:
-            killed_pids.sort()
-            for pid in killed_pids:
-                print(f"Kill process {pid} on {ip}:{port}", flush=True)
-                kill_cmd = ("ssh -o StrictHostKeyChecking=no -p " + str(port) +
-                            " " + ip + f" 'kill -9 {pid}'")
-                subprocess.run(kill_cmd, shell=True)
-
-
-def get_pids_to_kill(ip, port, killed_pids):
-    """Get the process IDs that we want to kill but are still alive."""
-    killed_pids = [str(pid) for pid in killed_pids]
-    killed_pids = ",".join(killed_pids)
-    ps_cmd = ("ssh -o StrictHostKeyChecking=no -p " + str(port) + " " + ip +
-              f" 'ps -p {killed_pids} -h'")
-    res = subprocess.run(ps_cmd, shell=True, stdout=subprocess.PIPE)
-    pids = []
-    for p in res.stdout.decode("utf-8").split("\n"):
-        ps = p.split()
-        if len(ps) > 0:
-            pids.append(int(ps[0]))
-    return pids
-
-
-def remote_execute(
-    cmd: str,
-    state_q: queue.Queue,
-    ip: str,
-    port: int,
-    username: Optional[str] = None,
-) -> Thread:
-    """Execute command line on remote machine via ssh.
-
-    Args:
-        cmd: User-defined command (udf) to execute on the remote host.
-        state_q: A queue collecting Thread exit states.
-        ip: The ip-address of the host to run the command on.
-        port: Port number that the host is listening on.
-        username: If given, this will specify a username to use when issuing
-            commands over SSH. Useful when your infra requires you to
-            explicitly specify a username to avoid permission issues.
-
-    Returns:
-        thread: The thread that runs the command on the remote host.
-            Returns when the command completes on the remote host.
-    """
-    ip_prefix = ""
-    if username is not None:
-        ip_prefix += f"{username}@"
-
-    # Construct ssh command that executes `cmd` on the remote host
-    ssh_cmd = (f"ssh -o StrictHostKeyChecking=no -p {port} {ip_prefix}{ip} "
-               f"'{cmd}'")
-
-    print(f"----- ssh_cmd={ssh_cmd} ")
-
-    # thread func to run the job
-    def run(ssh_cmd, state_q):
-        try:
-            subprocess.check_call(ssh_cmd, shell=True)
-            state_q.put(0)
-        except subprocess.CalledProcessError as err:
-            print(f"Called process error {err}")
-            state_q.put(err.returncode)
-        except Exception:
-            state_q.put(-1)
-
-    thread = Thread(
-        target=run,
-        args=(
-            ssh_cmd,
-            state_q,
-        ),
-    )
-    thread.daemon = True
-    thread.start()
-    # Sleep for a while in case SSH is rejected by peer due to busy connection:
-    time.sleep(0.2)
-    return thread
-
-
-def get_remote_pids(ip, port, cmd_regex):
-    """Get the process IDs that run the command in the remote machine."""
-    pids = []
-    curr_pid = os.getpid()
-    # We want to get the Python processes. However, we may get some SSH
-    # processes, so we should filter them out:
-    ps_cmd = (f"ssh -o StrictHostKeyChecking=no -p {port} {ip} "
-              f"'ps -aux | grep python | grep -v StrictHostKeyChecking'")
-    res = subprocess.run(ps_cmd, shell=True, stdout=subprocess.PIPE)
-    for p in res.stdout.decode("utf-8").split("\n"):
-        ps = p.split()
-        if len(ps) < 2:
-            continue
-        # We only get the processes that run the specified command:
-        res = re.search(cmd_regex, p)
-        if res is not None and int(ps[1]) != curr_pid:
-            pids.append(ps[1])
-
-    pid_str = ",".join([str(pid) for pid in pids])
-    ps_cmd = (f"ssh -o StrictHostKeyChecking=no -p {port} {ip} "
-              f" 'pgrep -P {pid_str}'")
-    res = subprocess.run(ps_cmd, shell=True, stdout=subprocess.PIPE)
-    pids1 = res.stdout.decode("utf-8").split("\n")
-    all_pids = []
-    for pid in set(pids + pids1):
-        if pid == "" or int(pid) == curr_pid:
-            continue
-        all_pids.append(int(pid))
-    all_pids.sort()
-    return all_pids
-
-
-def get_all_remote_pids(hosts, ssh_port, udf_command):
-    """Get all remote processes."""
-    remote_pids = {}
-    for host in hosts:
-        ip, _ = host
-        # When creating training processes in remote machines, we may insert
-        # some arguments in the commands. We need to use regular expressions to
-        # match the modified command.
-        cmds = udf_command.split()
-        new_udf_command = " .*".join(cmds)
-        pids = get_remote_pids(ip, ssh_port, new_udf_command)
-        remote_pids[(ip, ssh_port)] = pids
-    return remote_pids
-
-
-def wrap_cmd_w_envvars(cmd: str, env_vars: str) -> str:
-    """Wraps a CLI command with desired environment variables.
-
-    Example:
-        >>> cmd = "ls && pwd"
-        >>> env_vars = "VAR1=value1 VAR2=value2"
-        >>> wrap_cmd_w_envvars(cmd, env_vars)
-        "(export VAR1=value1 VAR2=value2; ls && pwd)"
-    """
-    if env_vars == "":
-        return f"({cmd})"
-    else:
-        return f"(export {env_vars}; {cmd})"
-
-
-def wrap_cmd_w_extra_envvars(cmd: str, env_vars: list) -> str:
-    """Wraps a CLI command with extra environment variables.
-
-    Example:
-        >>> cmd = "ls && pwd"
-        >>> env_vars = ["VAR1=value1", "VAR2=value2"]
-        >>> wrap_cmd_w_extra_envvars(cmd, env_vars)
-        "(export VAR1=value1 VAR2=value2; ls && pwd)"
-    """
-    env_vars = " ".join(env_vars)
-    return wrap_cmd_w_envvars(cmd, env_vars)
-
-
-def get_available_port(ip):
-    """Get available port with specified ip."""
-    import socket
-
-    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
-    for port in range(1234, 65535):
-        try:
-            sock.connect((ip, port))
-        except Exception:
-            return port
-    raise RuntimeError(f"Failed to get available port for ip~{ip}")
-
-
-def submit_all_jobs(args, udf_command, dry_run=False):
-    if dry_run:
-        print("Dry run mode, no jobs will be launched")
-
-    servers_cmd = []
-    hosts = []
-    thread_list = []
-
-    # Get the IP addresses of the cluster:
-    ip_config = os.path.join(args.workspace, args.ip_config)
-    with open(ip_config) as f:
-        for line in f:
-            result = line.strip().split()
-            if len(result) == 2:
-                ip = result[0]
-                port = int(result[1])
-                hosts.append((ip, port))
-            elif len(result) == 1:
-                ip = result[0]
-                port = get_available_port(ip)
-                hosts.append((ip, port))
-            else:
-                raise RuntimeError("Format error of 'ip_config'")
-
-    state_q = queue.Queue()
-
-    master_ip, _ = hosts[0]
-    for i in range(len(hosts)):
-        ip, _ = hosts[i]
-        server_env_vars_cur = ""
-        cmd = wrap_cmd_w_envvars(udf_command, server_env_vars_cur)
-        cmd = (wrap_cmd_w_extra_envvars(cmd, args.extra_envs)
-               if len(args.extra_envs) > 0 else cmd)
-
-        cmd = cmd[:-1]
-        cmd += " --logging"
-        cmd += f" --dataset_root_dir={args.dataset_root_dir}"
-        cmd += f" --dataset={args.dataset}"
-        cmd += f" --num_nodes={args.num_nodes}"
-        cmd += f" --num_neighbors={args.num_neighbors}"
-        cmd += f" --node_rank={i}"
-        cmd += f" --master_addr={master_ip}"
-        cmd += f" --num_epochs={args.num_epochs}"
-        cmd += f" --batch_size={args.batch_size}"
-        cmd += f" --num_workers={args.num_workers}"
-        cmd += f" --concurrency={args.concurrency}"
-        cmd += f" --ddp_port={args.ddp_port})"
-        servers_cmd.append(cmd)
-
-        if not dry_run:
-            thread_list.append(
-                remote_execute(cmd, state_q, ip, args.ssh_port,
-                               username=args.ssh_username))
-
-    # Start a cleanup process dedicated for cleaning up remote training jobs:
-    conn1, conn2 = multiprocessing.Pipe()
-    func = partial(get_all_remote_pids, hosts, args.ssh_port, udf_command)
-    process = multiprocessing.Process(target=clean_runs, args=(func, conn1))
-    process.start()
-
-    def signal_handler(signal, frame):
-        logging.info("Stop launcher")
-        # We need to tell the cleanup process to kill remote training jobs:
-        conn2.send("cleanup")
-        sys.exit(0)
-
-    signal.signal(signal.SIGINT, signal_handler)
-
-    err = 0
-    for thread in thread_list:
-        thread.join()
-        err_code = state_q.get()
-        if err_code != 0:
-            err = err_code  # Record the most recent non-zero error code.
-
-    # The training processes completed.
-    # We tell the cleanup process to exit.
-    conn2.send("exit")
-    process.join()
-    if err != 0:
-        print("Task failed")
-        sys.exit(-1)
-    print("=== fully done ! === ")
-
-
-def main():
-    parser = argparse.ArgumentParser(description="Launch a distributed job")
-    parser.add_argument(
-        "--ssh_port",
-        type=int,
-        default=22,
-        help="SSH port",
-    )
-    parser.add_argument(
-        "--ssh_username",
-        type=str,
-        default="",
-        help=("When issuing commands (via ssh) to the cluster, use the "
-              "provided username in the ssh cmd. For example, if you provide "
-              "--ssh_username=bob, then the ssh command will be like "
-              "'ssh bob@1.2.3.4 CMD'"),
-    )
-    parser.add_argument(
-        "--workspace",
-        type=str,
-        required=True,
-        help="Path of user directory of distributed tasks",
-    )
-    parser.add_argument(
-        "--dataset",
-        type=str,
-        default="ogbn-products",
-        help="The name of the dataset",
-    )
-    parser.add_argument(
-        "--dataset_root_dir",
-        type=str,
-        default='../../data/products',
-        help="The root directory (relative path) of partitioned dataset",
-    )
-    parser.add_argument(
-        "--num_nodes",
-        type=int,
-        default=2,
-        help="Number of distributed nodes",
-    )
-    parser.add_argument(
-        "--num_neighbors",
-        type=str,
-        default="15,10,5",
-        help="Number of node neighbors sampled at each layer",
-    )
-    parser.add_argument(
-        "--node_rank",
-        type=int,
-        default=0,
-        help="The current node rank",
-    )
-    parser.add_argument(
-        "--num_training_procs",
-        type=int,
-        default=2,
-        help="The number of training processes per node",
-    )
-    parser.add_argument(
-        "--master_addr",
-        type=str,
-        default='localhost',
-        help="The master address for RPC initialization",
-    )
-    parser.add_argument(
-        "--num_epochs",
-        type=int,
-        default=100,
-        help="The number of training epochs",
-    )
-    parser.add_argument(
-        "--batch_size",
-        type=int,
-        default=1024,
-        help="Batch size for training and testing",
-    )
-    parser.add_argument(
-        "--num_workers",
-        type=int,
-        default=2,
-        help="Number of sampler sub-processes",
-    )
-    parser.add_argument(
-        "--concurrency",
-        type=int,
-        default=2,
-        help="Number of maximum concurrent RPC for each sampler",
-    )
-    parser.add_argument(
-        "--learning_rate",
-        type=float,
-        default=0.0004,
-        help="Learning rate",
-    )
-    parser.add_argument(
-        '--ddp_port',
-        type=int,
-        default=11111,
-        help="Port used for PyTorch's DDP communication",
-    )
-    parser.add_argument(
-        "--ip_config",
-        required=True,
-        type=str,
-        help="File (in workspace) of IP configuration for server processes",
-    )
-    parser.add_argument(
-        "--extra_envs",
-        nargs="+",
-        type=str,
-        default=[],
-        help=("Extra environment variables to set. For example, you can set "
-              "the 'LD_LIBRARY_PATH' by adding: --extra_envs "
-              "LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH"),
-    )
-    args, udf_command = parser.parse_known_args()
-
-    udf_command = str(udf_command[0])
-    if "python" not in udf_command:
-        raise RuntimeError("The launch script only supports a Python "
-                           "executable file")
-    submit_all_jobs(args, udf_command)
-
-
-if __name__ == "__main__":
-    fmt = "%(asctime)s %(levelname)s %(message)s"
-    logging.basicConfig(format=fmt, level=logging.INFO)
-    main()
diff --git a/examples/distributed/pyg/node_ogb_cpu.py b/examples/distributed/pyg/node_ogb_cpu.py
deleted file mode 100644
index 7f210b1cb33a..000000000000
--- a/examples/distributed/pyg/node_ogb_cpu.py
+++ /dev/null
@@ -1,472 +0,0 @@
-import argparse
-import os.path as osp
-import time
-from contextlib import nullcontext
-
-import torch
-import torch.distributed
-import torch.nn.functional as F
-from torch.nn.parallel import DistributedDataParallel
-from tqdm import tqdm
-
-from torch_geometric.data import HeteroData
-from torch_geometric.distributed import (
-    DistContext,
-    DistNeighborLoader,
-    LocalFeatureStore,
-    LocalGraphStore,
-)
-from torch_geometric.io import fs
-from torch_geometric.nn import GraphSAGE, to_hetero
-
-
-@torch.no_grad()
-def test(
-    model,
-    loader,
-    dist_context,
-    device,
-    epoch,
-    logfile=None,
-    num_loader_threads=10,
-    progress_bar=True,
-):
-    def test_homo(batch):
-        out = model(batch.x, batch.edge_index)[:batch.batch_size]
-        y_pred = out.argmax(dim=-1)
-        y_true = batch.y[:batch.batch_size]
-        return y_pred, y_true
-
-    def test_hetero(batch):
-        batch_size = batch['paper'].batch_size
-        out = model(batch.x_dict, batch.edge_index_dict)
-        out = out['paper'][:batch_size]
-        y_pred = out.argmax(dim=-1)
-        y_true = batch['paper'].y[:batch_size]
-        return y_pred, y_true
-
-    model.eval()
-    total_examples = total_correct = 0
-
-    if loader.num_workers > 0:
-        context = loader.enable_multithreading(num_loader_threads)
-    else:
-        context = nullcontext()
-
-    with context:
-        if progress_bar:
-            loader = tqdm(loader, desc=f'[Node {dist_context.rank}] Test')
-
-        start_time = batch_time = time.time()
-        for i, batch in enumerate(loader):
-            batch = batch.to(device)
-
-            if isinstance(batch, HeteroData):
-                y_pred, y_true = test_hetero(batch)
-            else:
-                y_pred, y_true = test_homo(batch)
-
-            total_correct += int((y_pred == y_true).sum())
-            total_examples += y_pred.size(0)
-            batch_acc = int((y_pred == y_true).sum()) / y_pred.size(0)
-
-            result = (f'[Node {dist_context.rank}] Test: epoch={epoch}, '
-                      f'it={i}, acc={batch_acc:.4f}, '
-                      f'time={(time.time() - batch_time):.4f}')
-            batch_time = time.time()
-
-            if logfile:
-                log = open(logfile, 'a+')
-                log.write(f'{result}\n')
-                log.close()
-
-            if not progress_bar:
-                print(result)
-
-    total_acc = total_correct / total_examples
-    print(f'[Node {dist_context.rank}] Test epoch {epoch} END: '
-          f'acc={total_acc:.4f}, time={(time.time() - start_time):.2f}')
-    torch.distributed.barrier()
-
-
-def train(
-    model,
-    loader,
-    optimizer,
-    dist_context,
-    device,
-    epoch,
-    logfile=None,
-    num_loader_threads=10,
-    progress_bar=True,
-):
-    def train_homo(batch):
-        out = model(batch.x, batch.edge_index)[:batch.batch_size]
-        loss = F.cross_entropy(out, batch.y[:batch.batch_size])
-        return loss, batch.batch_size
-
-    def train_hetero(batch):
-        batch_size = batch['paper'].batch_size
-        out = model(batch.x_dict, batch.edge_index_dict)
-        out = out['paper'][:batch_size]
-        target = batch['paper'].y[:batch_size]
-        loss = F.cross_entropy(out, target)
-        return loss, batch_size
-
-    model.train()
-    total_loss = total_examples = 0
-
-    if loader.num_workers > 0:
-        context = loader.enable_multithreading(num_loader_threads)
-    else:
-        context = nullcontext()
-
-    with context:
-        if progress_bar:
-            loader = tqdm(loader, desc=f'[Node {dist_context.rank}] Train')
-
-        start_time = batch_time = time.time()
-        for i, batch in enumerate(loader):
-            batch = batch.to(device)
-            optimizer.zero_grad()
-
-            if isinstance(batch, HeteroData):
-                loss, batch_size = train_hetero(batch)
-            else:
-                loss, batch_size = train_homo(batch)
-
-            loss.backward()
-            optimizer.step()
-
-            total_loss += float(loss) * batch_size
-            total_examples += batch_size
-
-            result = (f'[Node {dist_context.rank}] Train: epoch={epoch}, '
-                      f'it={i}, loss={loss:.4f}, '
-                      f'time={(time.time() - batch_time):.4f}')
-            batch_time = time.time()
-
-            if logfile:
-                log = open(logfile, 'a+')
-                log.write(f'{result}\n')
-                log.close()
-
-            if not progress_bar:
-                print(result)
-
-    print(f'[Node {dist_context.rank}] Train epoch {epoch} END: '
-          f'loss={total_loss/total_examples:.4f}, '
-          f'time={(time.time() - start_time):.2f}')
-    torch.distributed.barrier()
-
-
-def run_proc(
-    local_proc_rank: int,
-    num_nodes: int,
-    node_rank: int,
-    dataset: str,
-    dataset_root_dir: str,
-    master_addr: str,
-    ddp_port: int,
-    train_loader_port: int,
-    test_loader_port: int,
-    num_epochs: int,
-    batch_size: int,
-    num_neighbors: str,
-    async_sampling: bool,
-    concurrency: int,
-    num_workers: int,
-    num_loader_threads: int,
-    progress_bar: bool,
-    logfile: str,
-):
-    is_hetero = dataset == 'ogbn-mag'
-
-    print('--- Loading data partition files ...')
-    root_dir = osp.join(osp.dirname(osp.realpath(__file__)), dataset_root_dir)
-    node_label_file = osp.join(root_dir, f'{dataset}-label', 'label.pt')
-    train_idx = fs.torch_load(
-        osp.join(
-            root_dir,
-            f'{dataset}-train-partitions',
-            f'partition{node_rank}.pt',
-        ))
-    test_idx = fs.torch_load(
-        osp.join(
-            root_dir,
-            f'{dataset}-test-partitions',
-            f'partition{node_rank}.pt',
-        ))
-
-    if is_hetero:
-        train_idx = ('paper', train_idx)
-        test_idx = ('paper', test_idx)
-
-    # Load partition into local graph store:
-    graph = LocalGraphStore.from_partition(
-        osp.join(root_dir, f'{dataset}-partitions'), node_rank)
-    # Load partition into local feature store:
-    feature = LocalFeatureStore.from_partition(
-        osp.join(root_dir, f'{dataset}-partitions'), node_rank)
-    feature.labels = fs.torch_load(node_label_file)
-    partition_data = (feature, graph)
-    print(f'Partition metadata: {graph.meta}')
-
-    # Initialize distributed context:
-    current_ctx = DistContext(
-        world_size=num_nodes,
-        rank=node_rank,
-        global_world_size=num_nodes,
-        global_rank=node_rank,
-        group_name='distributed-ogb-sage',
-    )
-    current_device = torch.device('cpu')
-
-    print('--- Initialize DDP training group ...')
-    torch.distributed.init_process_group(
-        backend='gloo',
-        rank=current_ctx.rank,
-        world_size=current_ctx.world_size,
-        init_method=f'tcp://{master_addr}:{ddp_port}',
-    )
-
-    print('--- Initialize distributed loaders ...')
-    num_neighbors = [int(i) for i in num_neighbors.split(',')]
-    # Create distributed neighbor loader for training:
-    train_loader = DistNeighborLoader(
-        data=partition_data,
-        input_nodes=train_idx,
-        current_ctx=current_ctx,
-        device=current_device,
-        num_neighbors=num_neighbors,
-        shuffle=True,
-        drop_last=True,
-        persistent_workers=num_workers > 0,
-        batch_size=batch_size,
-        num_workers=num_workers,
-        master_addr=master_addr,
-        master_port=train_loader_port,
-        concurrency=concurrency,
-        async_sampling=async_sampling,
-    )
-    # Create distributed neighbor loader for testing:
-    test_loader = DistNeighborLoader(
-        data=partition_data,
-        input_nodes=test_idx,
-        current_ctx=current_ctx,
-        device=current_device,
-        num_neighbors=num_neighbors,
-        shuffle=False,
-        drop_last=False,
-        persistent_workers=num_workers > 0,
-        batch_size=batch_size,
-        num_workers=num_workers,
-        master_addr=master_addr,
-        master_port=test_loader_port,
-        concurrency=concurrency,
-        async_sampling=async_sampling,
-    )
-
-    print('--- Initialize model ...')
-    model = GraphSAGE(
-        in_channels=128 if is_hetero else 100,  # num_features
-        hidden_channels=256,
-        num_layers=len(num_neighbors),
-        out_channels=349 if is_hetero else 47,  # num_classes in dataset
-    ).to(current_device)
-
-    if is_hetero:  # Turn model into a heterogeneous variant:
-        metadata = [
-            graph.meta['node_types'],
-            [tuple(e) for e in graph.meta['edge_types']],
-        ]
-        model = to_hetero(model, metadata).to(current_device)
-        torch.distributed.barrier()
-
-    # Enable DDP:
-    model = DistributedDataParallel(model, find_unused_parameters=is_hetero)
-    optimizer = torch.optim.Adam(model.parameters(), lr=0.0004)
-    torch.distributed.barrier()
-
-    # Train and test:
-    print(f'--- Start training for {num_epochs} epochs ...')
-    for epoch in range(1, num_epochs + 1):
-        print(f'Train epoch {epoch}/{num_epochs}:')
-        train(
-            model,
-            train_loader,
-            optimizer,
-            current_ctx,
-            current_device,
-            epoch,
-            logfile,
-            num_loader_threads,
-            progress_bar,
-        )
-
-        if epoch % 5 == 0:
-            print(f'Test epoch {epoch}/{num_epochs}:')
-            test(
-                model,
-                test_loader,
-                current_ctx,
-                current_device,
-                epoch,
-                logfile,
-                num_loader_threads,
-                progress_bar,
-            )
-    print(f'--- [Node {current_ctx.rank}] Closing ---')
-    torch.distributed.destroy_process_group()
-
-
-if __name__ == '__main__':
-    parser = argparse.ArgumentParser(
-        description='Arguments for distributed training')
-
-    parser.add_argument(
-        '--dataset',
-        type=str,
-        default='ogbn-products',
-        choices=['ogbn-products', 'ogbn-mag'],
-        help='Name of the dataset: (ogbn-products, ogbn-mag)',
-    )
-    parser.add_argument(
-        '--dataset_root_dir',
-        type=str,
-        default='../../../data/partitions/ogbn-products/2-parts',
-        help='The root directory (relative path) of partitioned dataset',
-    )
-    parser.add_argument(
-        '--num_nodes',
-        type=int,
-        default=2,
-        help='Number of distributed nodes',
-    )
-    parser.add_argument(
-        '--num_neighbors',
-        type=str,
-        default='15,10,5',
-        help='Number of node neighbors sampled at each layer',
-    )
-    parser.add_argument(
-        '--node_rank',
-        type=int,
-        default=0,
-        help='The current node rank',
-    )
-    parser.add_argument(
-        '--num_epochs',
-        type=int,
-        default=100,
-        help='The number of training epochs',
-    )
-    parser.add_argument(
-        '--batch_size',
-        type=int,
-        default=1024,
-        help='Batch size for training and testing',
-    )
-    parser.add_argument(
-        '--num_workers',
-        type=int,
-        default=4,
-        help='Number of sampler sub-processes',
-    )
-    parser.add_argument(
-        '--num_loader_threads',
-        type=int,
-        default=10,
-        help='Number of threads used for each sampler sub-process',
-    )
-    parser.add_argument(
-        '--concurrency',
-        type=int,
-        default=4,
-        help='Number of maximum concurrent RPC for each sampler',
-    )
-    parser.add_argument(
-        '--async_sampling',
-        type=bool,
-        default=True,
-        help='Whether sampler processes RPC requests asynchronously',
-    )
-    parser.add_argument(
-        '--master_addr',
-        type=str,
-        default='localhost',
-        help='The master address for RPC initialization',
-    )
-    parser.add_argument(
-        '--ddp_port',
-        type=int,
-        default=11111,
-        help="The port used for PyTorch's DDP communication",
-    )
-    parser.add_argument(
-        '--train_loader_port',
-        type=int,
-        default=11112,
-        help='The port used for RPC communication across training samplers',
-    )
-    parser.add_argument(
-        '--test_loader_port',
-        type=int,
-        default=11113,
-        help='The port used for RPC communication across test samplers',
-    )
-    parser.add_argument('--logging', action='store_true')
-    parser.add_argument('--progress_bar', action='store_true')
-
-    args = parser.parse_args()
-
-    print('--- Distributed training example on OGB ---')
-    print(f'* total nodes: {args.num_nodes}')
-    print(f'* node rank: {args.node_rank}')
-    print(f'* dataset: {args.dataset}')
-    print(f'* dataset root dir: {args.dataset_root_dir}')
-    print(f'* epochs: {args.num_epochs}')
-    print(f'* batch size: {args.batch_size}')
-    print(f'* number of sampler workers: {args.num_workers}')
-    print(f'* master addr: {args.master_addr}')
-    print(f'* training process group master port: {args.ddp_port}')
-    print(f'* training loader master port: {args.train_loader_port}')
-    print(f'* testing loader master port: {args.test_loader_port}')
-    print(f'* RPC asynchronous processing: {args.async_sampling}')
-    print(f'* RPC concurrency: {args.concurrency}')
-    print(f'* loader multithreading: {args.num_loader_threads}')
-    print(f'* logging enabled: {args.logging}')
-    print(f'* progress bars enabled: {args.progress_bar}')
-
-    if args.logging:
-        logfile = f'dist_cpu-node{args.node_rank}.txt'
-        with open(logfile, 'a+') as log:
-            log.write(f'\n--- Inputs: {str(args)}')
-    else:
-        logfile = None
-
-    print('--- Launching training processes ...')
-    torch.multiprocessing.spawn(
-        run_proc,
-        args=(
-            args.num_nodes,
-            args.node_rank,
-            args.dataset,
-            args.dataset_root_dir,
-            args.master_addr,
-            args.ddp_port,
-            args.train_loader_port,
-            args.test_loader_port,
-            args.num_epochs,
-            args.batch_size,
-            args.num_neighbors,
-            args.async_sampling,
-            args.concurrency,
-            args.num_workers,
-            args.num_loader_threads,
-            args.progress_bar,
-            logfile,
-        ),
-        join=True,
-    )
-    print('--- Finished training processes ...')
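Note on the `--async_sampling` option in the removed `node_ogb_cpu.py` above: `argparse` applies `type=bool` by calling `bool()` on the raw string, so any non-empty value, including `False`, parses as `True`. The sketch below reproduces that behaviour and shows one possible alternative. It assumes Python 3.9+ for `argparse.BooleanOptionalAction`; the `build_parser` helper is hypothetical and not part of this PR.

```python
import argparse


def build_parser(use_bool_type: bool) -> argparse.ArgumentParser:
    """Hypothetical helper that builds a parser with a single boolean option."""
    parser = argparse.ArgumentParser()
    if use_bool_type:
        # Mirrors the removed script: the string is passed to bool(), and
        # bool('False') is True because the string is non-empty.
        parser.add_argument('--async_sampling', type=bool, default=True)
    else:
        # Alternative (Python 3.9+): registers both `--async_sampling` and
        # `--no-async_sampling`, so the flag can actually be switched off.
        parser.add_argument(
            '--async_sampling',
            action=argparse.BooleanOptionalAction,
            default=True,
        )
    return parser


old = build_parser(use_bool_type=True)
new = build_parser(use_bool_type=False)

old_args = old.parse_args(['--async_sampling', 'False'])
new_args = new.parse_args(['--no-async_sampling'])

print(old_args.async_sampling)  # True, although 'False' was passed
print(new_args.async_sampling)  # False
```

With the second variant the default stays `True` when neither flag is given, so existing invocations that omit the option would behave the same.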
diff --git a/examples/distributed/pyg/partition_graph.py b/examples/distributed/pyg/partition_graph.py
deleted file mode 100644
index 07a902c72694..000000000000
--- a/examples/distributed/pyg/partition_graph.py
+++ /dev/null
@@ -1,173 +0,0 @@
-import argparse
-import os
-import os.path as osp
-
-import torch
-from ogb.nodeproppred import PygNodePropPredDataset
-
-import torch_geometric.transforms as T
-from torch_geometric.datasets import OGB_MAG, MovieLens, Reddit
-from torch_geometric.distributed import Partitioner
-from torch_geometric.utils import mask_to_index
-
-
-def partition_dataset(
-    dataset_name: str,
-    root_dir: str,
-    num_parts: int,
-    recursive: bool = False,
-    use_sparse_tensor: bool = False,
-):
-    if not osp.isabs(root_dir):
-        path = osp.dirname(osp.realpath(__file__))
-        root_dir = osp.join(path, root_dir)
-
-    dataset_dir = osp.join(root_dir, 'dataset', dataset_name)
-    dataset = get_dataset(dataset_name, dataset_dir, use_sparse_tensor)
-    data = dataset[0]
-
-    save_dir = osp.join(root_dir, 'partitions', dataset_name,
-                        f'{num_parts}-parts')
-
-    partitions_dir = osp.join(save_dir, f'{dataset_name}-partitions')
-    partitioner = Partitioner(data, num_parts, partitions_dir, recursive)
-    partitioner.generate_partition()
-
-    print('-- Saving label ...')
-    label_dir = osp.join(save_dir, f'{dataset_name}-label')
-    os.makedirs(label_dir, exist_ok=True)
-
-    if dataset_name == 'ogbn-mag':
-        split_data = data['paper']
-        label = split_data.y
-    else:
-        split_data = data
-        if dataset_name == 'ogbn-products':
-            label = split_data.y.squeeze()
-        elif dataset_name == 'Reddit':
-            label = split_data.y
-        elif dataset_name == 'MovieLens':
-            label = split_data[data.edge_types[0]].edge_label
-
-    torch.save(label, osp.join(label_dir, 'label.pt'))
-
-    split_idx = get_idx_split(dataset, dataset_name, split_data)
-
-    if dataset_name == 'MovieLens':
-        save_link_partitions(split_idx, data, dataset_name, num_parts,
-                             save_dir)
-    else:
-        save_partitions(split_idx, dataset_name, num_parts, save_dir)
-
-
-def get_dataset(name, dataset_dir, use_sparse_tensor=False):
-    transforms = []
-    if use_sparse_tensor:
-        transforms = [T.ToSparseTensor(remove_edge_index=False)]
-
-    if name == 'ogbn-mag':
-        transforms = [T.ToUndirected(merge=True)] + transforms
-        return OGB_MAG(
-            root=dataset_dir,
-            preprocess='metapath2vec',
-            transform=T.Compose(transforms),
-        )
-
-    elif name == 'ogbn-products':
-        transforms = [T.RemoveDuplicatedEdges()] + transforms
-        return PygNodePropPredDataset(
-            'ogbn-products',
-            root=dataset_dir,
-            transform=T.Compose(transforms),
-        )
-
-    elif name == 'MovieLens':
-        transforms = [T.ToUndirected(merge=True)] + transforms
-        return MovieLens(
-            root=dataset_dir,
-            model_name='all-MiniLM-L6-v2',
-            transform=T.Compose(transforms),
-        )
-
-    elif name == 'Reddit':
-        return Reddit(
-            root=dataset_dir,
-            transform=T.Compose(transforms),
-        )
-
-
-def get_idx_split(dataset, dataset_name, split_data):
-    if dataset_name == 'ogbn-mag' or dataset_name == 'Reddit':
-        train_idx = mask_to_index(split_data.train_mask)
-        test_idx = mask_to_index(split_data.test_mask)
-        val_idx = mask_to_index(split_data.val_mask)
-
-    elif dataset_name == 'ogbn-products':
-        split_idx = dataset.get_idx_split()
-        train_idx = split_idx['train']
-        test_idx = split_idx['test']
-        val_idx = split_idx['valid']
-
-    elif dataset_name == 'MovieLens':
-        # Perform a 80/10/10 temporal link-level split:
-        perm = torch.argsort(dataset[0][('user', 'rates', 'movie')].time)
-        train_idx = perm[:int(0.8 * perm.size(0))]
-        val_idx = perm[int(0.8 * perm.size(0)):int(0.9 * perm.size(0))]
-        test_idx = perm[int(0.9 * perm.size(0)):]
-
-    return {'train': train_idx, 'val': val_idx, 'test': test_idx}
-
-
-def save_partitions(split_idx, dataset_name, num_parts, save_dir):
-    for key, idx in split_idx.items():
-        print(f'-- Partitioning {key} indices ...')
-        idx = idx.split(idx.size(0) // num_parts)
-
-        part_dir = osp.join(save_dir, f'{dataset_name}-{key}-partitions')
-        os.makedirs(part_dir, exist_ok=True)
-        for i in range(num_parts):
-            torch.save(idx[i], osp.join(part_dir, f'partition{i}.pt'))
-
-
-def save_link_partitions(split_idx, data, dataset_name, num_parts, save_dir):
-    edge_type = data.edge_types[0]
-
-    for key, idx in split_idx.items():
-        print(f'-- Partitioning {key} indices ...')
-        edge_index = data[edge_type].edge_index[:, idx]
-        edge_index = edge_index.split(edge_index.size(1) // num_parts, dim=1)
-
-        label = data[edge_type].edge_label[idx]
-        label = label.split(label.size(0) // num_parts)
-
-        edge_time = data[edge_type].time[idx]
-        edge_time = edge_time.split(edge_time.size(0) // num_parts)
-
-        part_dir = osp.join(save_dir, f'{dataset_name}-{key}-partitions')
-        os.makedirs(part_dir, exist_ok=True)
-        for i in range(num_parts):
-            partition = {
-                'edge_label_index': edge_index[i],
-                'edge_label': label[i],
-                'edge_label_time': edge_time[i] - 1,
-            }
-            torch.save(partition, osp.join(part_dir, f'partition{i}.pt'))
-
-
-if __name__ == '__main__':
-    parser = argparse.ArgumentParser()
-    add = parser.add_argument
-
-    add('--dataset', type=str,
-        choices=['ogbn-mag', 'ogbn-products', 'MovieLens',
-                 'Reddit'], default='ogbn-products')
-    add('--root_dir', default='../../../data', type=str)
-    add('--num_partitions', type=int, default=2)
-    add('--recursive', action='store_true')
-    # TODO (kgajdamo) Add support for arguments below:
-    # add('--use-sparse-tensor', action='store_true')
-    # add('--bf16', action='store_true')
-    args = parser.parse_args()
-
-    partition_dataset(args.dataset, args.root_dir, args.num_partitions,
-                      args.recursive)
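Note on `save_partitions` in the removed `partition_graph.py` above: splitting by a chunk size of `idx.size(0) // num_parts` yields one extra, shorter chunk whenever the index count is not divisible by `num_parts`, and the `for i in range(num_parts)` loop never writes that chunk. The sketch below illustrates the arithmetic in plain Python. `split_by_size` is a hypothetical stand-in for `Tensor.split(size)`, and `split_into_parts` is a hypothetical helper that follows the sizing rule of `torch.tensor_split` with an integer argument; neither is part of this PR.

```python
from typing import List, Sequence


def split_by_size(idx: Sequence[int], size: int) -> List[List[int]]:
    # Stand-in for `Tensor.split(size)`: fixed-size chunks, last may be shorter.
    return [list(idx[i:i + size]) for i in range(0, len(idx), size)]


def split_into_parts(idx: Sequence[int], num_parts: int) -> List[List[int]]:
    # Always returns exactly `num_parts` chunks; the first
    # `len(idx) % num_parts` chunks receive one extra element.
    base, extra = divmod(len(idx), num_parts)
    parts, start = [], 0
    for i in range(num_parts):
        stop = start + base + (1 if i < extra else 0)
        parts.append(list(idx[start:stop]))
        start = stop
    return parts


idx = list(range(10))
num_parts = 3

by_size = split_by_size(idx, len(idx) // num_parts)
saved = by_size[:num_parts]  # what a `for i in range(num_parts)` loop writes
print(len(by_size), sum(len(p) for p in saved))  # 4 9 (one index is dropped)

by_parts = split_into_parts(idx, num_parts)
print(len(by_parts), sum(len(p) for p in by_parts))  # 3 10
```

For the default `--num_partitions 2`, at most one index per split is affected, so the effect is small, but it grows with the number of partitions.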
diff --git a/examples/distributed/pyg/run_dist.sh b/examples/distributed/pyg/run_dist.sh
deleted file mode 100755
index aeadcf43f9c0..000000000000
--- a/examples/distributed/pyg/run_dist.sh
+++ /dev/null
@@ -1,53 +0,0 @@
-#!/bin/bash
-
-PYG_WORKSPACE=$PWD
-USER=user
-CONDA_ENV=pygenv
-CONDA_DIR="/home/${USER}/anaconda3"
-PY_EXEC="${CONDA_DIR}/envs/${CONDA_ENV}/bin/python"
-EXEC_SCRIPT="${PYG_WORKSPACE}/node_ogb_cpu.py"
-CMD="cd ${PYG_WORKSPACE}; ${PY_EXEC} ${EXEC_SCRIPT}"
-
-# Node number:
-NUM_NODES=2
-
-# Dataset name:
-DATASET=ogbn-products
-
-# Dataset folder:
-DATASET_ROOT_DIR="../../../data/partitions/${DATASET}/${NUM_NODES}-parts"
-
-# Number of epochs:
-NUM_EPOCHS=10
-
-# The batch size:
-BATCH_SIZE=1024
-
-# Fanout per layer:
-NUM_NEIGHBORS="5,5,5"
-
-# Number of workers for sampling:
-NUM_WORKERS=2
-CONCURRENCY=4
-
-# DDP Port
-DDP_PORT=11111
-
-# IP configuration path:
-IP_CONFIG=${PYG_WORKSPACE}/ip_config.yaml
-
-# Folder and filename to place logs:
-logdir="logs"
-mkdir -p "logs"
-logname=log_${DATASET}_${NUM_NODES}_$RANDOM
-echo "stdout stored in ${PYG_WORKSPACE}/${logdir}/${logname}"
-set -x
-
-# stdout and stderr are stored in `${logdir}/${logname}`.
-python launch.py --workspace ${PYG_WORKSPACE} --ip_config ${IP_CONFIG} --ssh_username ${USER} --num_nodes ${NUM_NODES} --num_neighbors ${NUM_NEIGHBORS} --dataset_root_dir ${DATASET_ROOT_DIR} --dataset ${DATASET}  --num_epochs ${NUM_EPOCHS} --batch_size ${BATCH_SIZE} --num_workers ${NUM_WORKERS} --concurrency ${CONCURRENCY} --ddp_port ${DDP_PORT} "${CMD}" |& tee ${logdir}/${logname} &
-pid=$!
-echo "started launch.py: ${pid}"
-# kill processes at script exit (Ctrl + C)
-trap "kill -2 $pid" SIGINT
-wait $pid
-set +x
diff --git a/examples/distributed/pyg/temporal_link_movielens_cpu.py b/examples/distributed/pyg/temporal_link_movielens_cpu.py
deleted file mode 100644
index d51950c65536..000000000000
--- a/examples/distributed/pyg/temporal_link_movielens_cpu.py
+++ /dev/null
@@ -1,515 +0,0 @@
-import argparse
-import os.path as osp
-import time
-from contextlib import nullcontext
-
-import torch
-import torch.distributed
-import torch.nn.functional as F
-from torch.nn import Linear
-from torch.nn.parallel import DistributedDataParallel
-from tqdm import tqdm
-
-from torch_geometric.distributed import (
-    DistContext,
-    DistLinkNeighborLoader,
-    LocalFeatureStore,
-    LocalGraphStore,
-)
-from torch_geometric.io import fs
-from torch_geometric.nn import SAGEConv, to_hetero
-
-
-class GNNEncoder(torch.nn.Module):
-    def __init__(self, hidden_channels, out_channels):
-        super().__init__()
-        self.conv1 = SAGEConv((-1, -1), hidden_channels)
-        self.conv2 = SAGEConv((-1, -1), out_channels)
-
-    def forward(self, x, edge_index):
-        x = self.conv1(x, edge_index).relu()
-        x = self.conv2(x, edge_index)
-        return x
-
-
-class EdgeDecoder(torch.nn.Module):
-    def __init__(self, hidden_channels):
-        super().__init__()
-        self.lin1 = Linear(2 * hidden_channels, hidden_channels)
-        self.lin2 = Linear(hidden_channels, 1)
-
-    def forward(self, z_dict, edge_label_index):
-        row, col = edge_label_index
-        z = torch.cat([z_dict['user'][row], z_dict['movie'][col]], dim=-1)
-
-        z = self.lin1(z).relu()
-        z = self.lin2(z)
-        return z.view(-1)
-
-
-class Model(torch.nn.Module):
-    def __init__(self, hidden_channels, metadata):
-        super().__init__()
-        self.encoder = GNNEncoder(hidden_channels, hidden_channels)
-        self.encoder = to_hetero(self.encoder, metadata, aggr='sum')
-        self.decoder = EdgeDecoder(hidden_channels)
-
-    def forward(self, x_dict, edge_index_dict, edge_label_index):
-        z_dict = self.encoder(x_dict, edge_index_dict)
-        return self.decoder(z_dict, edge_label_index)
-
-
-@torch.no_grad()
-def test(
-    model,
-    loader,
-    dist_context,
-    device,
-    epoch,
-    logfile=None,
-    num_loader_threads=10,
-    progress_bar=True,
-):
-    model.eval()
-    preds, targets = [], []
-
-    if loader.num_workers > 0:
-        context = loader.enable_multithreading(num_loader_threads)
-    else:
-        context = nullcontext()
-
-    with context:
-        if progress_bar:
-            loader = tqdm(loader, desc=f'[Node {dist_context.rank}] Test')
-
-        start_time = batch_time = time.time()
-        for i, batch in enumerate(loader):
-            batch = batch.to(device)
-
-            pred = model(
-                batch.x_dict,
-                batch.edge_index_dict,
-                batch['user', 'movie'].edge_label_index,
-            ).clamp(min=0, max=5)
-            target = batch['user', 'movie'].edge_label.float()
-            preds.append(pred)
-            targets.append(target)
-
-            rmse = (pred - target).pow(2).mean().sqrt()
-
-            result = (f'[Node {dist_context.rank}] Test: epoch={epoch}, '
-                      f'it={i}, rmse={rmse:.4f}, '
-                      f'time={(time.time() - batch_time):.4f}')
-            batch_time = time.time()
-
-            if logfile:
-                log = open(logfile, 'a+')
-                log.write(f'{result}\n')
-                log.close()
-
-            if not progress_bar:
-                print(result)
-
-    pred = torch.cat(preds, dim=0)
-    target = torch.cat(targets, dim=0)
-    total_rmse = (pred - target).pow(2).mean().sqrt()
-    print(f'[Node {dist_context.rank}] Test epoch {epoch} END: '
-          f'rmse={total_rmse:.4f}, time={(time.time() - start_time):.2f}')
-    torch.distributed.barrier()
-
-
-def train(
-    model,
-    loader,
-    optimizer,
-    dist_context,
-    device,
-    epoch,
-    logfile=None,
-    num_loader_threads=10,
-    progress_bar=True,
-):
-    model.train()
-    total_loss = total_examples = 0
-
-    if loader.num_workers > 0:
-        context = loader.enable_multithreading(num_loader_threads)
-    else:
-        context = nullcontext()
-
-    with context:
-        if progress_bar:
-            loader = tqdm(loader, desc=f'[Node {dist_context.rank}] Train')
-
-        start_time = batch_time = time.time()
-        for i, batch in enumerate(loader):
-            batch = batch.to(device)
-            optimizer.zero_grad()
-
-            pred = model(
-                batch.x_dict,
-                batch.edge_index_dict,
-                batch['user', 'movie'].edge_label_index,
-            )
-            target = batch['user', 'movie'].edge_label.float()
-
-            loss = F.mse_loss(pred, target)
-            loss.backward()
-            optimizer.step()
-
-            total_loss += float(loss) * pred.size(0)
-            total_examples += pred.size(0)
-
-            result = (f'[Node {dist_context.rank}] Train: epoch={epoch}, '
-                      f'it={i}, loss={loss:.4f}, '
-                      f'time={(time.time() - batch_time):.4f}')
-            batch_time = time.time()
-
-            if logfile:
-                log = open(logfile, 'a+')
-                log.write(f'{result}\n')
-                log.close()
-
-            if not progress_bar:
-                print(result)
-
-    torch.distributed.barrier()
-    print(f'[Node {dist_context.rank}] Train epoch {epoch} END: '
-          f'loss={total_loss/total_examples:.4f}, '
-          f'time={(time.time() - start_time):.2f}')
-
-
-def run_proc(
-    local_proc_rank: int,
-    num_nodes: int,
-    node_rank: int,
-    dataset: str,
-    dataset_root_dir: str,
-    master_addr: str,
-    ddp_port: int,
-    train_loader_port: int,
-    test_loader_port: int,
-    num_epochs: int,
-    batch_size: int,
-    num_neighbors: str,
-    async_sampling: bool,
-    concurrency: int,
-    num_workers: int,
-    num_loader_threads: int,
-    progress_bar: bool,
-    logfile: str,
-):
-    print('--- Loading data partition files ...')
-    root_dir = osp.join(osp.dirname(osp.realpath(__file__)), dataset_root_dir)
-    edge_label_file = osp.join(root_dir, f'{dataset}-label', 'label.pt')
-    train_data = fs.torch_load(
-        osp.join(
-            root_dir,
-            f'{dataset}-train-partitions',
-            f'partition{node_rank}.pt',
-        ))
-    test_data = fs.torch_load(
-        osp.join(
-            root_dir,
-            f'{dataset}-test-partitions',
-            f'partition{node_rank}.pt',
-        ))
-
-    train_edge_label_index = train_data['edge_label_index']
-    train_edge_label = train_data['edge_label']
-    train_edge_label_time = train_data['edge_label_time']
-
-    test_edge_label_index = test_data['edge_label_index']
-    test_edge_label = test_data['edge_label']
-    test_edge_label_time = test_data['edge_label_time']
-
-    # Load partition into local graph store:
-    graph = LocalGraphStore.from_partition(
-        osp.join(root_dir, f'{dataset}-partitions'), node_rank)
-    # Load partition into local feature store:
-    feature = LocalFeatureStore.from_partition(
-        osp.join(root_dir, f'{dataset}-partitions'), node_rank)
-    feature.labels = fs.torch_load(edge_label_file)
-    partition_data = (feature, graph)
-
-    # Add identity user node features for message passing:
-    x = torch.eye(
-        feature._global_id['user'].size(0),
-        feature._feat[('movie', 'x')].size(1),
-    )
-    feature.put_tensor(x, group_name='user', attr_name='x')
-
-    # Initialize distributed context:
-    current_ctx = DistContext(
-        world_size=num_nodes,
-        rank=node_rank,
-        global_world_size=num_nodes,
-        global_rank=node_rank,
-        group_name='distributed-temporal-link-movielens',
-    )
-    current_device = torch.device('cpu')
-
-    print('--- Initialize DDP training group ...')
-    torch.distributed.init_process_group(
-        backend='gloo',
-        rank=current_ctx.rank,
-        world_size=current_ctx.world_size,
-        init_method=f'tcp://{master_addr}:{ddp_port}',
-    )
-
-    print('--- Initialize distributed loaders ...')
-    num_neighbors = [int(i) for i in num_neighbors.split(',')]
-    # Create distributed neighbor loader for training:
-    train_loader = DistLinkNeighborLoader(
-        data=partition_data,
-        edge_label_index=((('user', 'rates', 'movie')),
-                          train_edge_label_index),
-        edge_label=train_edge_label,
-        edge_label_time=train_edge_label_time,
-        disjoint=True,
-        time_attr='edge_time',
-        temporal_strategy='last',
-        current_ctx=current_ctx,
-        device=current_device,
-        num_neighbors=num_neighbors,
-        shuffle=True,
-        drop_last=True,
-        persistent_workers=num_workers > 0,
-        batch_size=batch_size,
-        num_workers=num_workers,
-        master_addr=master_addr,
-        master_port=train_loader_port,
-        concurrency=concurrency,
-        async_sampling=async_sampling,
-    )
-    # Create distributed neighbor loader for testing:
-    test_loader = DistLinkNeighborLoader(
-        data=partition_data,
-        edge_label_index=((('user', 'rates', 'movie')), test_edge_label_index),
-        edge_label=test_edge_label,
-        edge_label_time=test_edge_label_time,
-        disjoint=True,
-        time_attr='edge_time',
-        temporal_strategy='last',
-        current_ctx=current_ctx,
-        device=current_device,
-        num_neighbors=num_neighbors,
-        shuffle=False,
-        drop_last=False,
-        persistent_workers=num_workers > 0,
-        batch_size=batch_size,
-        num_workers=num_workers,
-        master_addr=master_addr,
-        master_port=test_loader_port,
-        concurrency=concurrency,
-        async_sampling=async_sampling,
-    )
-
-    print('--- Initialize model ...')
-    node_types = graph.meta['node_types']
-    edge_types = [tuple(e) for e in graph.meta['edge_types']]
-    metadata = (node_types, edge_types)
-    model = Model(hidden_channels=32, metadata=metadata).to(current_device)
-
-    @torch.no_grad()
-    def init_params():  # Init parameters via forwarding a single batch:
-        with train_loader as iterator:
-            batch = next(iter(iterator))
-            batch = batch.to(current_device)
-            model(
-                batch.x_dict,
-                batch.edge_index_dict,
-                batch['user', 'movie'].edge_label_index,
-            )
-
-    print('--- Initialize parameters of the model ...')
-    init_params()
-    torch.distributed.barrier()
-
-    # Enable DDP:
-    model = DistributedDataParallel(model, find_unused_parameters=False)
-    optimizer = torch.optim.Adam(model.parameters(), lr=0.0004)
-    torch.distributed.barrier()
-
-    # Train and test:
-    print(f'--- Start training for {num_epochs} epochs ...')
-    for epoch in range(1, num_epochs + 1):
-        print(f'Train Epoch {epoch}/{num_epochs}')
-        train(
-            model,
-            train_loader,
-            optimizer,
-            current_ctx,
-            current_device,
-            epoch,
-            logfile,
-            num_loader_threads,
-            progress_bar,
-        )
-
-        if epoch % 5 == 0:
-            print(f'Test Epoch {epoch}/{num_epochs}')
-            test(
-                model,
-                test_loader,
-                current_ctx,
-                current_device,
-                epoch,
-                logfile,
-                num_loader_threads,
-                progress_bar,
-            )
-    print(f'--- [Node {current_ctx.rank}] Closing ---')
-    torch.distributed.destroy_process_group()
-
-
-if __name__ == '__main__':
-    parser = argparse.ArgumentParser(
-        description='Arguments for distributed training')
-
-    parser.add_argument(
-        '--dataset',
-        type=str,
-        default='MovieLens',
-        choices=['MovieLens'],
-        help='Name of the dataset: (MovieLens)',
-    )
-    parser.add_argument(
-        '--dataset_root_dir',
-        type=str,
-        default='../../../data/partitions/MovieLens/2-parts',
-        help='The root directory (relative path) of partitioned dataset',
-    )
-    parser.add_argument(
-        '--num_nodes',
-        type=int,
-        default=2,
-        help='Number of distributed nodes',
-    )
-    parser.add_argument(
-        '--num_neighbors',
-        type=str,
-        default='20,10',
-        help='Number of node neighbors sampled at each layer',
-    )
-    parser.add_argument(
-        '--node_rank',
-        type=int,
-        default=0,
-        help='The current node rank',
-    )
-    parser.add_argument(
-        '--num_epochs',
-        type=int,
-        default=100,
-        help='The number of training epochs',
-    )
-    parser.add_argument(
-        '--batch_size',
-        type=int,
-        default=1024,
-        help='Batch size for training and testing',
-    )
-    parser.add_argument(
-        '--num_workers',
-        type=int,
-        default=4,
-        help='Number of sampler sub-processes',
-    )
-    parser.add_argument(
-        '--num_loader_threads',
-        type=int,
-        default=10,
-        help='Number of threads used for each sampler sub-process',
-    )
-    parser.add_argument(
-        '--concurrency',
-        type=int,
-        default=1,
-        help='Number of max concurrent RPC for each sampler',
-    )
-    parser.add_argument(
-        '--async_sampling',
-        type=bool,
-        default=True,
-        help='Whether sampler processes RPC requests asynchronously',
-    )
-    parser.add_argument(
-        '--master_addr',
-        type=str,
-        default='localhost',
-        help='The master address for RPC initialization',
-    )
-    parser.add_argument(
-        '--ddp_port',
-        type=int,
-        default=11111,
-        help='The port used for PyTorch\'s DDP communication.',
-    )
-    parser.add_argument(
-        '--train_loader_port',
-        type=int,
-        default=11112,
-        help='The port used for RPC communication across training samplers',
-    )
-    parser.add_argument(
-        '--test_loader_port',
-        type=int,
-        default=11113,
-        help='The port used for RPC communication across test samplers',
-    )
-    parser.add_argument('--logging', action='store_true')
-    parser.add_argument('--progress_bar', action='store_true')
-
-    args = parser.parse_args()
-
-    print('--- Distributed training example on MovieLens ---')
-    print(f'* total nodes: {args.num_nodes}')
-    print(f'* node rank: {args.node_rank}')
-    print(f'* dataset: {args.dataset}')
-    print(f'* dataset root dir: {args.dataset_root_dir}')
-    print(f'* epochs: {args.num_epochs}')
-    print(f'* batch size: {args.batch_size}')
-    print(f'* number of sampler workers: {args.num_workers}')
-    print(f'* master addr: {args.master_addr}')
-    print(f'* training process group master port: {args.ddp_port}')
-    print(f'* training loader master port: {args.train_loader_port}')
-    print(f'* testing loader master port: {args.test_loader_port}')
-    print(f'* RPC asynchronous processing: {args.async_sampling}')
-    print(f'* RPC concurrency: {args.concurrency}')
-    print(f'* loader multithreading: {args.num_loader_threads}')
-    print(f'* logging enabled: {args.logging}')
-    print(f'* progress bars enabled: {args.progress_bar}')
-
-    if args.logging:
-        logfile = f'dist_cpu-link_temporal{args.node_rank}.txt'
-        with open(logfile, 'a+') as log:
-            log.write(f'\n--- Inputs: {str(args)}')
-    else:
-        logfile = None
-
-    print('--- Launching training processes ...')
-    torch.multiprocessing.spawn(
-        run_proc,
-        args=(
-            args.num_nodes,
-            args.node_rank,
-            args.dataset,
-            args.dataset_root_dir,
-            args.master_addr,
-            args.ddp_port,
-            args.train_loader_port,
-            args.test_loader_port,
-            args.num_epochs,
-            args.batch_size,
-            args.num_neighbors,
-            args.async_sampling,
-            args.concurrency,
-            args.num_workers,
-            args.num_loader_threads,
-            args.progress_bar,
-            logfile,
-        ),
-        join=True,
-    )
-    print('--- Finished training processes ...')
diff --git a/examples/multi_gpu/README.md b/examples/multi_gpu/README.md
deleted file mode 100644
index e35a338b4252..000000000000
--- a/examples/multi_gpu/README.md
+++ /dev/null
@@ -1,40 +0,0 @@
-# Examples for Distributed Training
-
-## Examples with NVIDIA GPUs
-
-Note: We recommend the [NVIDIA PyG Container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pyg/tags) for best results and easiest setup with NVIDIA GPUs.
-
-### Examples with cuGraph
-
-[cuGraph](https://github.com/rapidsai/cugraph) is a collection of packages focused on GPU-accelerated graph analytics including support for property graphs and scaling up to thousands of GPUs. cuGraph supports the creation and manipulation of graphs followed by the execution of scalable fast graph algorithms. It is part of the [RAPIDS](https://rapids.ai) accelerated data science framework.
-
-[cuGraph GNN](https://github.com/rapidsai/cugraph-gnn) is a collection of GPU-accelerated plugins that support PyTorch and PyG natively through the _cuGraph-PyG_ and _WholeGraph_ subprojects. cuGraph GNN is built on top of cuGraph, leveraging its low-level [pylibcugraph](https://github.com/rapidsai/cugraph/python/pylibcugraph) API and C++ primitives for sampling and other GNN operations ([libcugraph](https://github.com/rapidai/cugraph/python/libcugraph)). It also includes the `libwholegraph` and `pylibwholegraph` libraries for high-performance distributed edgelist and embedding storage. Users have the option of working with these lower-level libraries directly, or through the higher-level API in cuGraph-PyG that directly implements the `GraphStore`, `FeatureStore`, `NodeLoader`, and `LinkLoader` interfaces.
-
-Complete documentation on RAPIDS graph packages, including `cugraph`, `cugraph-pyg`, `pylibwholegraph`, and `pylibcugraph` is available on the [RAPIDS docs pages](https://docs.rapids.ai/api/cugraph/nightly/graph_support).
-
-| Example                                                                        | Scalability | Description                                                                                                                                       |
-| ------------------------------------------------------------------------------ | ----------- | ------------------------------------------------------------------------------------------------------------------------------------------------- |
-| [`ogbn_train_cugraph.py`](./ogbn_train_cugraph.py)                             | single-node | Single Node Multi GPU Example for `ogbn_train.py` using [CuGraph](https://www.nvidia.com/en-us/on-demand/session/gtc24-s61197/).                  |
-| [`papers100m_gcn_cugraph_multinode.py`](./papers100m_gcn_cugraph_multinode.py) | multi-node  | Example for training GNNs on a homogeneous graph on multiple nodes using [CuGraph](https://www.nvidia.com/en-us/on-demand/session/gtc24-s61197/). |
-
-### Examples with Pure PyTorch
-
-| Example                                                                            | Scalability | Description                                                                                                                                                                                                                                                                                                  |
-| ---------------------------------------------------------------------------------- | ----------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
-| [`distributed_batching.py`](./distributed_batching.py)                             | single-node | Example for training GNNs on multiple graphs. (deprecated in favor of [`ogbn_train_cugraph.py`](./ogbn_train_cugraph.py))                                                                                                                                                                                    |
-| [`distributed_sampling.py`](./distributed_sampling.py)                             | single-node | Example for training GNNs on a homogeneous graph with neighbor sampling. (deprecated in favor of [`ogbn_train_cugraph.py`](./ogbn_train_cugraph.py))                                                                                                                                                         |
-| [`distributed_sampling_multinode.py`](./distributed_sampling_multinode.py)         | multi-node  | Example for training GNNs on a homogeneous graph with neighbor sampling on multiple nodes. (deprecated in favor of [`papers100m_gcn_cugraph_multinode.py`](./papers100m_gcn_cugraph_multinode.py))                                                                                                           |
-| [`distributed_sampling_multinode.sbatch`](./distributed_sampling_multinode.sbatch) | multi-node  | Example for submitting a training job to a Slurm cluster using [`distributed_sampling_multi_node.py`](./distributed_sampling_multinode.py).                                                                                                                                                                  |
-| [`papers100m_gcn.py`](./papers100m_gcn.py)                                         | single-node | Example for training GNNs on the `ogbn-papers100M` homogeneous graph w/ ~1.6B edges. (deprecated in favor of [`ogbn_train_cugraph.py`](./ogbn_train_cugraph.py))                                                                                                                                             |
-| [`papers100m_gcn_multinode.py`](./papers100m_gcn_multinode.py)                     | multi-node  | Example for training GNNs on a homogeneous graph on multiple nodes. (deprecated in favor of [`papers100m_gcn_cugraph_multinode.py`](./papers100m_gcn_cugraph_multinode.py))                                                                                                                                  |
-| [`pcqm4m_ogb.py`](./pcqm4m_ogb.py)                                                 | single-node | Example for training GNNs for a graph-level regression task.                                                                                                                                                                                                                                                 |
-| [`mag240m_graphsage.py`](./mag240m_graphsage.py)                                   | single-node | Example for training GNNs on a large heterogeneous graph.                                                                                                                                                                                                                                                    |
-| [`taobao.py`](./taobao.py)                                                         | single-node | Example for training link prediction GNNs on a heterogeneous graph. (deprecated in favor of [taobao_mnmg.py](https://github.com/rapidsai/cugraph-gnn/blob/branch-25.04/python/cugraph-pyg/cugraph_pyg/examples/taobao_mnmg.py) with [CuGraph](https://www.nvidia.com/en-us/on-demand/session/gtc24-s61197/)) |
-| [`model_parallel.py`](./model_parallel.py)                                         | single-node | Example for model parallelism by manually placing layers on each GPU.                                                                                                                                                                                                                                        |
-| [`data_parallel.py`](./data_parallel.py)                                           | single-node | Example for training GNNs on multiple graphs. Note that [`torch_geometric.nn.DataParallel`](https://pytorch-geometric.readthedocs.io/en/latest/modules/nn.html#torch_geometric.nn.data_parallel.DataParallel) is deprecated and [discouraged](https://github.com/pytorch/pytorch/issues/65936).              |
-
-## Examples with Intel GPUs (XPUs)
-
-| Example                                                        | Scalability            | Description                                                              |
-| -------------------------------------------------------------- | ---------------------- | ------------------------------------------------------------------------ |
-| [`distributed_sampling_xpu.py`](./distributed_sampling_xpu.py) | single-node, multi-gpu | Example for training GNNs on a homogeneous graph with neighbor sampling. |
diff --git a/examples/multi_gpu/data_parallel.py b/examples/multi_gpu/data_parallel.py
deleted file mode 100644
index e78fbafd1016..000000000000
--- a/examples/multi_gpu/data_parallel.py
+++ /dev/null
@@ -1,67 +0,0 @@
-import os.path as osp
-
-import torch
-import torch.nn.functional as F
-from torch.nn import Linear, ReLU, Sequential
-
-import torch_geometric.transforms as T
-from torch_geometric.datasets import MNISTSuperpixels
-from torch_geometric.loader import DataListLoader
-from torch_geometric.nn import (
-    DataParallel,
-    NNConv,
-    SplineConv,
-    global_mean_pool,
-)
-from torch_geometric.typing import WITH_TORCH_SPLINE_CONV
-
-path = osp.join(osp.dirname(osp.realpath(__file__)), '../../data', 'MNIST')
-dataset = MNISTSuperpixels(path, transform=T.Cartesian()).shuffle()
-loader = DataListLoader(dataset, batch_size=1024, shuffle=True)
-
-
-class Net(torch.nn.Module):
-    def __init__(self):
-        super().__init__()
-        if WITH_TORCH_SPLINE_CONV:
-            self.conv1 = SplineConv(dataset.num_features, 32, dim=2,
-                                    kernel_size=5)
-            self.conv2 = SplineConv(32, 64, dim=2, kernel_size=5)
-        else:
-            nn1 = Sequential(Linear(2, 25), ReLU(),
-                             Linear(25, dataset.num_features * 32))
-            self.conv1 = NNConv(dataset.num_features, 32, nn1, aggr='mean')
-
-            nn2 = Sequential(Linear(2, 25), ReLU(), Linear(25, 32 * 64))
-            self.conv2 = NNConv(32, 64, nn2, aggr='mean')
-
-        self.lin1 = torch.nn.Linear(64, 128)
-        self.lin2 = torch.nn.Linear(128, dataset.num_classes)
-
-    def forward(self, data):
-        print(f'Inside model - num graphs: {data.num_graphs}, '
-              f'device: {data.batch.device}')
-
-        x, edge_index, edge_attr = data.x, data.edge_index, data.edge_attr
-        x = F.elu(self.conv1(x, edge_index, edge_attr))
-        x = F.elu(self.conv2(x, edge_index, edge_attr))
-        x = global_mean_pool(x, data.batch)
-        x = F.elu(self.lin1(x))
-        return F.log_softmax(self.lin2(x), dim=1)
-
-
-model = Net()
-print(f"Let's use {torch.cuda.device_count()} GPUs!")
-model = DataParallel(model)
-device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
-model = model.to(device)
-optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
-
-for data_list in loader:
-    optimizer.zero_grad()
-    output = model(data_list)
-    print(f'Outside model - num graphs: {output.size(0)}')
-    y = torch.cat([data.y for data in data_list]).to(output.device)
-    loss = F.nll_loss(output, y)
-    loss.backward()
-    optimizer.step()
diff --git a/examples/multi_gpu/distributed_batching.py b/examples/multi_gpu/distributed_batching.py
deleted file mode 100644
index b242499d3e76..000000000000
--- a/examples/multi_gpu/distributed_batching.py
+++ /dev/null
@@ -1,173 +0,0 @@
-import os
-import os.path as osp
-
-import torch
-import torch.distributed as dist
-import torch.multiprocessing as mp
-import torch.nn.functional as F
-from ogb.graphproppred import Evaluator, PygGraphPropPredDataset
-from ogb.graphproppred.mol_encoder import AtomEncoder, BondEncoder
-from torch.nn import BatchNorm1d as BatchNorm
-from torch.nn import Linear, ReLU, Sequential
-from torch.nn.parallel import DistributedDataParallel
-from torch.utils.data.distributed import DistributedSampler
-from torch_sparse import SparseTensor
-
-import torch_geometric.transforms as T
-from torch_geometric.loader import DataLoader
-from torch_geometric.nn import GINEConv, global_mean_pool
-
-
-class GIN(torch.nn.Module):
-    def __init__(
-        self,
-        hidden_channels: int,
-        out_channels: int,
-        num_layers: int = 3,
-        dropout: float = 0.5,
-    ) -> None:
-        super().__init__()
-        self.dropout = dropout
-        self.atom_encoder = AtomEncoder(hidden_channels)
-        self.bond_encoder = BondEncoder(hidden_channels)
-        self.convs = torch.nn.ModuleList()
-        for _ in range(num_layers):
-            nn = Sequential(
-                Linear(hidden_channels, 2 * hidden_channels),
-                BatchNorm(2 * hidden_channels),
-                ReLU(),
-                Linear(2 * hidden_channels, hidden_channels),
-                BatchNorm(hidden_channels),
-                ReLU(),
-            )
-            self.convs.append(GINEConv(nn, train_eps=True))
-
-        self.lin = Linear(hidden_channels, out_channels)
-
-    def forward(
-        self,
-        x: torch.Tensor,
-        adj_t: SparseTensor,
-        batch: torch.Tensor,
-    ) -> torch.Tensor:
-        x = self.atom_encoder(x)
-        edge_attr = adj_t.coo()[2]
-        adj_t = adj_t.set_value(self.bond_encoder(edge_attr), layout='coo')
-
-        for conv in self.convs:
-            x = conv(x, adj_t)
-            x = F.dropout(x, p=self.dropout, training=self.training)
-
-        x = global_mean_pool(x, batch)
-        x = self.lin(x)
-        return x
-
-
-def run(rank: int, world_size: int, dataset_name: str, root: str) -> None:
-    os.environ['MASTER_ADDR'] = 'localhost'
-    os.environ['MASTER_PORT'] = '12355'
-    dist.init_process_group('nccl', rank=rank, world_size=world_size)
-
-    dataset = PygGraphPropPredDataset(
-        dataset_name,
-        root=root,
-        pre_transform=T.ToSparseTensor(attr='edge_attr'),
-    )
-    split_idx = dataset.get_idx_split()
-    evaluator = Evaluator(dataset_name)
-
-    train_dataset = dataset[split_idx['train']]
-    train_loader = DataLoader(
-        train_dataset,
-        batch_size=128,
-        sampler=DistributedSampler(
-            train_dataset,
-            shuffle=True,
-            drop_last=True,
-        ),
-    )
-
-    torch.manual_seed(12345)
-    model = GIN(128, dataset.num_tasks, num_layers=3, dropout=0.5).to(rank)
-    model = DistributedDataParallel(model, device_ids=[rank])
-    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
-    criterion = torch.nn.BCEWithLogitsLoss()
-
-    if rank == 0:
-        val_loader = DataLoader(dataset[split_idx['valid']], batch_size=256)
-        test_loader = DataLoader(dataset[split_idx['test']], batch_size=256)
-
-    for epoch in range(1, 51):
-        model.train()
-        train_loader.sampler.set_epoch(epoch)
-        total_loss = torch.zeros(2, device=rank)
-        for data in train_loader:
-            data = data.to(rank)
-            logits = model(data.x, data.adj_t, data.batch)
-            loss = criterion(logits, data.y.to(torch.float))
-            loss.backward()
-            optimizer.step()
-            optimizer.zero_grad()
-
-            with torch.no_grad():
-                total_loss[0] += loss * logits.size(0)
-                total_loss[1] += data.num_graphs
-
-        dist.all_reduce(total_loss, op=dist.ReduceOp.AVG)
-        train_loss = total_loss[0] / total_loss[1]
-
-        if rank == 0:  # We evaluate on a single GPU for now.
-            model.eval()
-
-            y_pred, y_true = [], []
-            for data in val_loader:
-                data = data.to(rank)
-                with torch.no_grad():
-                    y_pred.append(model.module(data.x, data.adj_t, data.batch))
-                    y_true.append(data.y)
-            val_rocauc = evaluator.eval({
-                'y_pred': torch.cat(y_pred, dim=0),
-                'y_true': torch.cat(y_true, dim=0),
-            })['rocauc']
-
-            y_pred, y_true = [], []
-            for data in test_loader:
-                data = data.to(rank)
-                with torch.no_grad():
-                    y_pred.append(model.module(data.x, data.adj_t, data.batch))
-                    y_true.append(data.y)
-            test_rocauc = evaluator.eval({
-                'y_pred': torch.cat(y_pred, dim=0),
-                'y_true': torch.cat(y_true, dim=0),
-            })['rocauc']
-
-            print(f'Epoch: {epoch:03d}, '
-                  f'Loss: {train_loss:.4f}, '
-                  f'Val: {val_rocauc:.4f}, '
-                  f'Test: {test_rocauc:.4f}')
-
-        dist.barrier()
-
-    dist.destroy_process_group()
-
-
-if __name__ == '__main__':
-    dataset_name = 'ogbg-molhiv'
-    root = osp.join(
-        osp.dirname(__file__),
-        '..',
-        '..',
-        'data',
-        'OGB',
-    )
-    # Download and process the dataset on main process.
-    PygGraphPropPredDataset(
-        dataset_name,
-        root,
-        pre_transform=T.ToSparseTensor(attr='edge_attr'),
-    )
-
-    world_size = torch.cuda.device_count()
-    print('Let\'s use', world_size, 'GPUs!')
-    args = (world_size, dataset_name, root)
-    mp.spawn(run, args=args, nprocs=world_size, join=True)
diff --git a/examples/multi_gpu/distributed_sampling.py b/examples/multi_gpu/distributed_sampling.py
deleted file mode 100644
index baa3e16ab5f1..000000000000
--- a/examples/multi_gpu/distributed_sampling.py
+++ /dev/null
@@ -1,151 +0,0 @@
-import os
-import os.path as osp
-from math import ceil
-
-import torch
-import torch.distributed as dist
-import torch.multiprocessing as mp
-import torch.nn.functional as F
-from torch import Tensor
-from torch.nn.parallel import DistributedDataParallel
-from tqdm import tqdm
-
-from torch_geometric.datasets import Reddit
-from torch_geometric.loader import NeighborLoader
-from torch_geometric.nn import SAGEConv
-
-
-class SAGE(torch.nn.Module):
-    def __init__(
-        self,
-        in_channels: int,
-        hidden_channels: int,
-        out_channels: int,
-        num_layers: int = 2,
-    ) -> None:
-        super().__init__()
-        self.convs = torch.nn.ModuleList()
-        self.convs.append(SAGEConv(in_channels, hidden_channels))
-        for _ in range(num_layers - 2):
-            self.convs.append(SAGEConv(hidden_channels, hidden_channels))
-        self.convs.append(SAGEConv(hidden_channels, out_channels))
-
-    def forward(self, x: Tensor, edge_index: Tensor) -> Tensor:
-        for i, conv in enumerate(self.convs):
-            x = conv(x, edge_index)
-            if i < len(self.convs) - 1:
-                x = x.relu()
-                x = F.dropout(x, p=0.5, training=self.training)
-        return x
-
-
-@torch.no_grad()
-def test(
-    loader: NeighborLoader,
-    model: DistributedDataParallel,
-    rank: int,
-) -> Tensor:
-    model.eval()
-    total_correct = torch.tensor(0, dtype=torch.long, device=rank)
-    total_examples = 0
-    for batch in loader:
-        out = model(batch.x, batch.edge_index.to(rank))
-        pred = out[:batch.batch_size].argmax(dim=-1)
-        y = batch.y[:batch.batch_size].to(rank)
-        total_correct += (pred == y).sum()
-        total_examples += batch.batch_size
-
-    return total_correct / total_examples
-
-
-def run(rank: int, world_size: int, dataset: Reddit) -> None:
-    os.environ['MASTER_ADDR'] = 'localhost'
-    os.environ['MASTER_PORT'] = '12355'
-    dist.init_process_group('nccl', rank=rank, world_size=world_size)
-
-    data = dataset[0]
-    data = data.to(rank, 'x', 'y')  # Move to device for faster feature fetch.
-
-    # Split indices into `world_size` many chunks:
-    train_idx = data.train_mask.nonzero(as_tuple=False).view(-1)
-    train_idx = train_idx.split(ceil(train_idx.size(0) / world_size))[rank]
-    val_idx = data.val_mask.nonzero(as_tuple=False).view(-1)
-    val_idx = val_idx.split(ceil(val_idx.size(0) / world_size))[rank]
-    test_idx = data.test_mask.nonzero(as_tuple=False).view(-1)
-    test_idx = test_idx.split(ceil(test_idx.size(0) / world_size))[rank]
-
-    kwargs = dict(
-        data=data,
-        batch_size=1024,
-        num_neighbors=[25, 10],
-        drop_last=True,
-        num_workers=4,
-        persistent_workers=True,
-    )
-    train_loader = NeighborLoader(
-        input_nodes=train_idx,
-        shuffle=True,
-        **kwargs,
-    )
-    val_loader = NeighborLoader(
-        input_nodes=val_idx,
-        shuffle=False,
-        **kwargs,
-    )
-    test_loader = NeighborLoader(
-        input_nodes=test_idx,
-        shuffle=False,
-        **kwargs,
-    )
-
-    torch.manual_seed(12345)
-    model = SAGE(dataset.num_features, 256, dataset.num_classes).to(rank)
-    model = DistributedDataParallel(model, device_ids=[rank])
-    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
-
-    for epoch in range(1, 21):
-        model.train()
-        for batch in tqdm(
-                train_loader,
-                desc=f'Epoch {epoch:02d}',
-                disable=rank != 0,
-        ):
-            out = model(batch.x, batch.edge_index.to(rank))[:batch.batch_size]
-            loss = F.cross_entropy(out, batch.y[:batch.batch_size])
-            loss.backward()
-            optimizer.step()
-            optimizer.zero_grad()
-
-        if rank == 0:
-            print(f'Epoch {epoch:02d}: Train loss: {loss:.4f}')
-
-        if epoch % 5 == 0:
-            train_acc = test(train_loader, model, rank)
-            val_acc = test(val_loader, model, rank)
-            test_acc = test(test_loader, model, rank)
-
-            if world_size > 1:
-                dist.all_reduce(train_acc, op=dist.ReduceOp.AVG)
-                dist.all_reduce(val_acc, op=dist.ReduceOp.AVG)
-                dist.all_reduce(test_acc, op=dist.ReduceOp.AVG)
-
-            if rank == 0:
-                print(f'Train acc: {train_acc:.4f}, '
-                      f'Val acc: {val_acc:.4f}, '
-                      f'Test acc: {test_acc:.4f}')
-
-    dist.destroy_process_group()
-
-
-if __name__ == '__main__':
-    path = osp.join(
-        osp.dirname(__file__),
-        '..',
-        '..',
-        'data',
-        'Reddit',
-    )
-    dataset = Reddit(path)
-    world_size = torch.cuda.device_count()
-    print("Let's use", world_size, "GPUs!")
-    mp.spawn(run, args=(world_size, dataset), nprocs=world_size, join=True)
diff --git a/examples/multi_gpu/distributed_sampling_multinode.py b/examples/multi_gpu/distributed_sampling_multinode.py
deleted file mode 100644
index b83131082dc0..000000000000
--- a/examples/multi_gpu/distributed_sampling_multinode.py
+++ /dev/null
@@ -1,161 +0,0 @@
-import copy
-import os
-from math import ceil
-
-import torch
-import torch.distributed as dist
-import torch.nn.functional as F
-from torch import Tensor
-from torch.nn.parallel import DistributedDataParallel
-from tqdm import tqdm
-
-from torch_geometric.datasets import Reddit
-from torch_geometric.loader import NeighborLoader
-from torch_geometric.nn import SAGEConv
-
-
-class SAGE(torch.nn.Module):
-    def __init__(
-        self,
-        in_channels: int,
-        hidden_channels: int,
-        out_channels: int,
-        num_layers: int = 2,
-    ):
-        super().__init__()
-        self.convs = torch.nn.ModuleList()
-        self.convs.append(SAGEConv(in_channels, hidden_channels))
-        for _ in range(num_layers - 2):
-            self.convs.append(SAGEConv(hidden_channels, hidden_channels))
-        self.convs.append(SAGEConv(hidden_channels, out_channels))
-
-    def forward(self, x: Tensor, edge_index: Tensor) -> Tensor:
-        for i, conv in enumerate(self.convs):
-            x = conv(x, edge_index)
-            if i < len(self.convs) - 1:
-                x = x.relu_()
-                x = F.dropout(x, p=0.5, training=self.training)
-        return x
-
-    @torch.no_grad()
-    def inference(
-        self,
-        x_all: Tensor,
-        device: torch.device,
-        subgraph_loader: NeighborLoader,
-    ) -> Tensor:
-        pbar = tqdm(total=len(subgraph_loader) * len(self.convs))
-        pbar.set_description('Evaluating')
-
-        # Compute representations of nodes layer by layer, using *all*
-        # available edges. This leads to faster computation in contrast to
-        # immediately computing the final representations of each batch:
-        for i, conv in enumerate(self.convs):
-            xs = []
-            for batch in subgraph_loader:
-                x = x_all[batch.node_id.to(x_all.device)].to(device)
-                x = conv(x, batch.edge_index.to(device))
-                x = x[:batch.batch_size]
-                if i < len(self.convs) - 1:
-                    x = x.relu_()
-                xs.append(x.cpu())
-                pbar.update(1)
-            x_all = torch.cat(xs, dim=0)
-
-        pbar.close()
-        return x_all
-
-
-def run(world_size: int, rank: int, local_rank: int):
-    # Will query the runtime environment for `MASTER_ADDR` and `MASTER_PORT`.
-    # Make sure those are set!
-    dist.init_process_group('nccl', world_size=world_size, rank=rank)
-
-    # Download and unzip only with one process ...
-    if rank == 0:
-        dataset = Reddit('data/Reddit')
-    dist.barrier()
-    # ... and then read from all the other processes:
-    if rank != 0:
-        dataset = Reddit('data/Reddit')
-    dist.barrier()
-
-    data = dataset[0]
-
-    # Move to device for faster feature fetch.
-    data = data.to(local_rank, 'x', 'y')
-
-    # Split training indices into `world_size` many chunks:
-    train_idx = data.train_mask.nonzero(as_tuple=False).view(-1)
-    train_idx = train_idx.split(ceil(train_idx.size(0) / world_size))[rank]
-
-    kwargs = dict(batch_size=1024, num_workers=4, persistent_workers=True)
-    train_loader = NeighborLoader(
-        data,
-        input_nodes=train_idx,
-        num_neighbors=[25, 10],
-        shuffle=True,
-        drop_last=True,
-        **kwargs,
-    )
-
-    if rank == 0:  # Create single-hop evaluation neighbor loader:
-        subgraph_loader = NeighborLoader(
-            copy.copy(data),
-            num_neighbors=[-1],
-            shuffle=False,
-            **kwargs,
-        )
-        # No need to maintain these features during evaluation:
-        del subgraph_loader.data.x, subgraph_loader.data.y
-        # Add global node index information:
-        subgraph_loader.data.node_id = torch.arange(data.num_nodes)
-
-    torch.manual_seed(12345)
-    model = SAGE(dataset.num_features, 256, dataset.num_classes).to(local_rank)
-    model = DistributedDataParallel(model, device_ids=[local_rank])
-    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
-
-    for epoch in range(1, 21):
-        model.train()
-        for batch in train_loader:
-            optimizer.zero_grad()
-            out = model(batch.x,
-                        batch.edge_index.to(local_rank))[:batch.batch_size]
-            loss = F.cross_entropy(out, batch.y[:batch.batch_size])
-            loss.backward()
-            optimizer.step()
-
-        dist.barrier()
-
-        if rank == 0:
-            print(f'Epoch: {epoch:02d}, Loss: {loss:.4f}')
-
-        if rank == 0 and epoch % 5 == 0:  # We evaluate on a single GPU for now
-            model.eval()
-            with torch.no_grad():
-                out = model.module.inference(
-                    data.x,
-                    local_rank,
-                    subgraph_loader,
-                )
-            res = out.argmax(dim=-1) == data.y.to(out.device)
-            acc1 = int(res[data.train_mask].sum()) / int(data.train_mask.sum())
-            acc2 = int(res[data.val_mask].sum()) / int(data.val_mask.sum())
-            acc3 = int(res[data.test_mask].sum()) / int(data.test_mask.sum())
-            print(f'Train: {acc1:.4f}, Val: {acc2:.4f}, Test: {acc3:.4f}')
-
-        dist.barrier()
-
-    dist.destroy_process_group()
-
-
-if __name__ == '__main__':
-    # Get the world size from the WORLD_SIZE variable or directly from SLURM:
-    world_size = int(
-        os.environ.get('WORLD_SIZE', os.environ.get('SLURM_NTASKS')))
-    # Likewise for RANK and LOCAL_RANK:
-    rank = int(os.environ.get('RANK', os.environ.get('SLURM_PROCID')))
-    local_rank = int(
-        os.environ.get('LOCAL_RANK', os.environ.get('SLURM_LOCALID')))
-    run(world_size, rank, local_rank)
diff --git a/examples/multi_gpu/distributed_sampling_multinode.sbatch b/examples/multi_gpu/distributed_sampling_multinode.sbatch
deleted file mode 100644
index 5fc8d4b1a15a..000000000000
--- a/examples/multi_gpu/distributed_sampling_multinode.sbatch
+++ /dev/null
@@ -1,25 +0,0 @@
-#!/bin/bash
-#SBATCH --job-name=pyg-multinode-tutorial # identifier for the job listings
-#SBATCH --output=pyg-multinode.log        # outputfile
-#SBATCH --partition=gpucloud              # ADJUST this to your system
-#SBATCH -N 2                              # number of nodes you want to use
-#SBATCH --ntasks=4                        # number of processes to be run
-#SBATCH --gpus-per-task=1                 # every process wants one GPU!
-#SBATCH --gpu-bind=none                   # NCCL can't deal with task-binding...
-## Now you can add more stuff for your convenience
-#SBATCH --cpus-per-task=8                 # make sure more cpu-cores are available to each process to spawn workers (default=1 and this is a hard limit)
-#SBATCH --mem=100G                        # total amount of memory available per node (TensorFlow needs, or needed, at least <GPU-memory> per GPU)
-#SBATCH --export=ALL                      # use your shell environment (PATHs, ...)
-
-# Thanks for shell-ideas to https://github.com/PrincetonUniversity/multi_gpu_training
-export MASTER_PORT=$(expr 10000 + $(echo -n $SLURM_JOBID | tail -c 4))
-export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
-echo "MASTER_ADDR:MASTER_PORT="${MASTER_ADDR}:${MASTER_PORT}
-
-echo "###########################################################################"
-echo "We recommend you set up your environment here (conda/spack/pip/modulefiles)"
-echo "then remove --export=ALL (allows running the sbatch from any shell)"
-echo "###########################################################################"
-
-# use --output=0 so that only the first task logs to the file!
-srun --output=0 python distributed_sampling_multinode.py
diff --git a/examples/multi_gpu/distributed_sampling_xpu.py b/examples/multi_gpu/distributed_sampling_xpu.py
deleted file mode 100644
index aa0d4b4f02dc..000000000000
--- a/examples/multi_gpu/distributed_sampling_xpu.py
+++ /dev/null
@@ -1,217 +0,0 @@
-"""Distributed GAT training, targeting XPU devices.
-PVC has 2 tiles, each reports itself as a separate
-device. DDP approach allows us to employ both tiles.
-
-Additional requirements:
-    IPEX (intel_extension_for_pytorch)
-    oneCCL (oneccl_bindings_for_pytorch)
-
-    We need to import both these modules, as they extend
-    torch module with XPU/oneCCL related functionality.
-
-Run with:
-    mpirun -np 2 python distributed_sampling_xpu.py
-"""
-
-import copy
-import os
-import os.path as osp
-from typing import Any, Tuple, Union
-
-import intel_extension_for_pytorch  # noqa
-import oneccl_bindings_for_pytorch  # noqa
-import torch
-import torch.distributed as dist
-import torch.nn.functional as F
-from ogb.nodeproppred import Evaluator, PygNodePropPredDataset
-from torch import Tensor
-from torch.nn import Linear as Lin
-from torch.nn.parallel import DistributedDataParallel as DDP
-from tqdm import tqdm
-
-from torch_geometric.loader import NeighborLoader
-from torch_geometric.nn import GATConv
-from torch_geometric.profile import get_stats_summary, profileit
-
-
-class GAT(torch.nn.Module):
-    def __init__(
-        self,
-        in_channels: int,
-        hidden_channels: int,
-        out_channels: int,
-        num_layers: int,
-        heads: int,
-    ):
-        super().__init__()
-
-        self.num_layers = num_layers
-
-        self.convs = torch.nn.ModuleList()
-        self.convs.append(GATConv(dataset.num_features, hidden_channels,
-                                  heads))
-        for _ in range(num_layers - 2):
-            self.convs.append(
-                GATConv(heads * hidden_channels, hidden_channels, heads))
-        self.convs.append(
-            GATConv(heads * hidden_channels, out_channels, heads,
-                    concat=False))
-
-        self.skips = torch.nn.ModuleList()
-        self.skips.append(Lin(dataset.num_features, hidden_channels * heads))
-        for _ in range(num_layers - 2):
-            self.skips.append(
-                Lin(hidden_channels * heads, hidden_channels * heads))
-        self.skips.append(Lin(hidden_channels * heads, out_channels))
-
-    def forward(self, x: Tensor, edge_index: Tensor) -> Tensor:
-        for i, (conv, skip) in enumerate(zip(self.convs, self.skips)):
-            x = conv(x, edge_index) + skip(x)
-            if i != self.num_layers - 1:
-                x = F.elu(x)
-                x = F.dropout(x, p=0.5, training=self.training)
-        return x
-
-    def inference(
-        self,
-        x_all: Tensor,
-        device: Union[str, torch.device],
-        subgraph_loader: NeighborLoader,
-    ) -> Tensor:
-        pbar = tqdm(total=x_all.size(0) * self.num_layers)
-        pbar.set_description("Evaluating")
-
-        # Compute representations of nodes layer by layer, using *all*
-        # available edges. This leads to faster computation in contrast to
-        # immediately computing the final representations of each batch.
-        for i in range(self.num_layers):
-            xs = []
-            for batch in subgraph_loader:
-                x = x_all[batch.n_id].to(device)
-                edge_index = batch.edge_index.to(device)
-                x = self.convs[i](x, edge_index) + self.skips[i](x)
-                x = x[:batch.batch_size]
-                if i != self.num_layers - 1:
-                    x = F.elu(x)
-                xs.append(x.cpu())
-
-                pbar.update(batch.batch_size)
-
-            x_all = torch.cat(xs, dim=0)
-
-        pbar.close()
-
-        return x_all
-
-
-@profileit('xpu')
-def train_step(model: Any, optimizer: Any, x: Tensor, edge_index: Tensor,
-               y: Tensor, bs: int) -> float:
-    optimizer.zero_grad()
-    out = model(x, edge_index)[:bs]
-    loss = F.cross_entropy(out, y[:bs].squeeze())
-    loss.backward()
-    optimizer.step()
-    return float(loss)
-
-
-def run(rank: int, world_size: int, dataset: PygNodePropPredDataset):
-    device = f"xpu:{rank}"
-
-    split_idx = dataset.get_idx_split()
-    split_idx["train"] = (split_idx["train"].split(
-        split_idx["train"].size(0) // world_size, dim=0)[rank].clone())
-    data = dataset[0].to(device, "x", "y")
-
-    kwargs = dict(batch_size=1024, num_workers=0, pin_memory=True)
-    train_loader = NeighborLoader(data, input_nodes=split_idx["train"],
-                                  num_neighbors=[10, 10, 5], **kwargs)
-
-    if rank == 0:
-        subgraph_loader = NeighborLoader(copy.copy(data), num_neighbors=[-1],
-                                         **kwargs)
-        evaluator = Evaluator(name="ogbn-products")
-
-    torch.manual_seed(12345)
-    model = GAT(dataset.num_features, 128, dataset.num_classes, num_layers=3,
-                heads=4).to(device)
-    model = DDP(model, device_ids=[device])
-    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
-
-    for epoch in range(1, 21):
-        epoch_stats = []
-        model.train()
-        for batch in train_loader:
-            batch = batch.to(device)
-            loss, stats = train_step(model, optimizer, batch.x,
-                                     batch.edge_index, batch.y,
-                                     batch.batch_size)
-            epoch_stats.append(stats)
-
-        dist.barrier()
-
-        if rank == 0:
-            print(f"Epoch: {epoch:02d}, Loss: {loss:.4f}")
-
-        print(f"Epoch: {epoch:02d}, Rank: {rank}, "
-              f"Stats: {get_stats_summary(epoch_stats)}")
-
-        if rank == 0 and epoch % 5 == 0:  # Evaluation on a single GPU
-            model.eval()
-            with torch.no_grad():
-                out = model.module.inference(data.x, device, subgraph_loader)
-
-            y_true = data.y.to(out.device)
-            y_pred = out.argmax(dim=-1, keepdim=True)
-
-            train_acc = evaluator.eval({
-                "y_true": y_true[split_idx["train"]],
-                "y_pred": y_pred[split_idx["train"]],
-            })["acc"]
-            val_acc = evaluator.eval({
-                "y_true": y_true[split_idx["valid"]],
-                "y_pred": y_pred[split_idx["valid"]],
-            })["acc"]
-            test_acc = evaluator.eval({
-                "y_true": y_true[split_idx["test"]],
-                "y_pred": y_pred[split_idx["test"]],
-            })["acc"]
-
-            print(f"Train: {train_acc:.4f}, Val: {val_acc:.4f}, "
-                  f"Test: {test_acc:.4f}")
-
-        dist.barrier()
-
-    dist.destroy_process_group()
-
-
-def get_dist_params() -> Tuple[int, int, str]:
-    master_addr = "127.0.0.1"
-    master_port = "29500"
-    os.environ["MASTER_ADDR"] = master_addr
-    os.environ["MASTER_PORT"] = master_port
-
-    mpi_rank = int(os.environ.get("PMI_RANK", -1))
-    mpi_world_size = int(os.environ.get("PMI_SIZE", -1))
-    rank = mpi_rank if mpi_world_size > 0 else int(os.environ.get("RANK", 0))
-    world_size = (mpi_world_size if mpi_world_size > 0 else int(
-        os.environ.get("WORLD_SIZE", 1)))
-
-    os.environ["RANK"] = str(rank)
-    os.environ["WORLD_SIZE"] = str(world_size)
-
-    init_method = f"tcp://{master_addr}:{master_port}"
-
-    return rank, world_size, init_method
-
-
-if __name__ == "__main__":
-    rank, world_size, init_method = get_dist_params()
-    dist.init_process_group(backend="ccl", init_method=init_method,
-                            world_size=world_size, rank=rank)
-
-    path = osp.join(osp.dirname(osp.realpath(__file__)), "../../data",
-                    "ogbn-products")
-    dataset = PygNodePropPredDataset("ogbn-products", path)
-
-    run(rank, world_size, dataset)
diff --git a/examples/multi_gpu/mag240m_graphsage.py b/examples/multi_gpu/mag240m_graphsage.py
deleted file mode 100644
index 8f6880fdbe04..000000000000
--- a/examples/multi_gpu/mag240m_graphsage.py
+++ /dev/null
@@ -1,293 +0,0 @@
-import argparse
-import os
-
-import torch
-import torch.distributed as dist
-import torch.multiprocessing as mp
-import torch.nn.functional as F
-from ogb.lsc import MAG240MDataset
-from torch.nn.parallel import DistributedDataParallel
-from torchmetrics import Accuracy
-from tqdm import tqdm
-
-from torch_geometric.loader import NeighborLoader
-from torch_geometric.nn import BatchNorm, HeteroConv, SAGEConv
-
-
-def common_step(batch, model):
-    batch_size = batch['paper'].batch_size
-    x_dict = model(batch.x_dict, batch.edge_index_dict)
-    y_hat = x_dict['paper'][:batch_size]
-    y = batch['paper'].y[:batch_size].to(torch.long)
-    return y_hat, y
-
-
-def training_step(batch, acc, model):
-    y_hat, y = common_step(batch, model)
-    train_loss = F.cross_entropy(y_hat, y)
-    acc(y_hat, y)
-    return train_loss
-
-
-def validation_step(batch, acc, model):
-    y_hat, y = common_step(batch, model)
-    acc(y_hat, y)
-
-
-class HeteroSAGEConv(torch.nn.Module):
-    def __init__(self, in_channels, out_channels, dropout, node_types,
-                 edge_types, is_output_layer=False):
-        super().__init__()
-        self.conv = HeteroConv({
-            edge_type: SAGEConv(in_channels, out_channels)
-            for edge_type in edge_types
-        })
-        if not is_output_layer:
-            self.dropout = torch.nn.Dropout(dropout)
-            self.norm_dict = torch.nn.ModuleDict({
-                node_type:
-                BatchNorm(out_channels)
-                for node_type in node_types
-            })
-
-        self.is_output_layer = is_output_layer
-
-    def forward(self, x_dict, edge_index_dict):
-        x_dict = self.conv(x_dict, edge_index_dict)
-        if not self.is_output_layer:
-            for node_type, x in x_dict.items():
-                x = self.dropout(x.relu())
-                x = self.norm_dict[node_type](x)
-                x_dict[node_type] = x
-        return x_dict
-
-
-class HeteroGraphSAGE(torch.nn.Module):
-    def __init__(self, in_channels, hidden_channels, num_layers, out_channels,
-                 dropout, node_types, edge_types):
-        super().__init__()
-
-        self.convs = torch.nn.ModuleList()
-        for i in range(num_layers):
-            # Since authors and institution do not come with features, we learn
-            # them via the GNN. However, this also means we need to exclude
-            # them as source types in the first two iterations:
-            if i == 0:
-                edge_types_of_layer = [
-                    edge_type for edge_type in edge_types
-                    if edge_type[0] == 'paper'
-                ]
-            elif i == 1:
-                edge_types_of_layer = [
-                    edge_type for edge_type in edge_types
-                    if edge_type[0] != 'institution'
-                ]
-            else:
-                edge_types_of_layer = edge_types
-
-            conv = HeteroSAGEConv(
-                in_channels if i == 0 else hidden_channels,
-                out_channels if i == num_layers - 1 else hidden_channels,
-                dropout=dropout,
-                node_types=node_types,
-                edge_types=edge_types_of_layer,
-                is_output_layer=i == num_layers - 1,
-            )
-            self.convs.append(conv)
-
-    def forward(self, x_dict, edge_index_dict):
-        for conv in self.convs:
-            x_dict = conv(x_dict, edge_index_dict)
-        return x_dict
-
-
-def run(
-    rank,
-    data,
-    num_devices,
-    num_epochs,
-    num_steps_per_epoch,
-    log_every_n_steps,
-    batch_size,
-    num_neighbors,
-    hidden_channels,
-    dropout,
-    num_val_steps,
-    lr,
-):
-    if num_devices > 1:
-        if rank == 0:
-            print("Setting up distributed...")
-        os.environ['MASTER_ADDR'] = 'localhost'
-        os.environ['MASTER_PORT'] = '12355'
-        dist.init_process_group('nccl', rank=rank, world_size=num_devices)
-
-    acc = Accuracy(task='multiclass', num_classes=data.num_classes)
-    model = HeteroGraphSAGE(
-        in_channels=-1,
-        hidden_channels=hidden_channels,
-        num_layers=len(num_neighbors),
-        out_channels=data.num_classes,
-        dropout=dropout,
-        node_types=data.node_types,
-        edge_types=data.edge_types,
-    )
-
-    train_idx = data['paper'].train_mask.nonzero(as_tuple=False).view(-1)
-    val_idx = data['paper'].val_mask.nonzero(as_tuple=False).view(-1)
-    if num_devices > 1:  # Split indices into `num_devices` many chunks:
-        train_idx = train_idx.split(train_idx.size(0) // num_devices)[rank]
-        val_idx = val_idx.split(val_idx.size(0) // num_devices)[rank]
-
-    # Delete unused tensors to not sample:
-    del data['paper'].train_mask
-    del data['paper'].val_mask
-    del data['paper'].test_mask
-    del data['paper'].year
-
-    kwargs = dict(
-        batch_size=batch_size,
-        num_workers=16,
-        persistent_workers=True,
-        num_neighbors=num_neighbors,
-        drop_last=True,
-    )
-
-    train_loader = NeighborLoader(
-        data,
-        input_nodes=('paper', train_idx),
-        shuffle=True,
-        **kwargs,
-    )
-    val_loader = NeighborLoader(data, input_nodes=('paper', val_idx), **kwargs)
-
-    if num_devices > 0:
-        model = model.to(rank)
-        acc = acc.to(rank)
-    if num_devices > 1:
-        model = DistributedDataParallel(model, device_ids=[rank])
-    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
-
-    for epoch in range(1, num_epochs + 1):
-        model.train()
-        for i, batch in enumerate(tqdm(train_loader)):
-            if num_steps_per_epoch >= 0 and i >= num_steps_per_epoch:
-                break
-
-            if num_devices > 0:
-                batch = batch.to(rank, 'x', 'y', 'edge_index')
-                # Features loaded in as 16 bits, train in 32 bits:
-                batch['paper'].x = batch['paper'].x.to(torch.float32)
-
-            optimizer.zero_grad()
-            loss = training_step(batch, acc, model)
-            loss.backward()
-            optimizer.step()
-
-            if i % log_every_n_steps == 0:
-                if rank == 0:
-                    print(f"Epoch: {epoch:02d}, Step: {i:d}, "
-                          f"Loss: {loss:.4f}, "
-                          f"Train Acc: {acc.compute():.4f}")
-
-        if num_devices > 1:
-            dist.barrier()
-
-        if rank == 0:
-            print(f"Epoch: {epoch:02d}, Loss: {loss:.4f}, "
-                  f"Train Acc :{acc.compute():.4f}")
-        acc.reset()
-
-        model.eval()
-        with torch.no_grad():
-            for i, batch in enumerate(tqdm(val_loader)):
-                if num_val_steps >= 0 and i >= num_val_steps:
-                    break
-
-                if num_devices > 0:
-                    batch = batch.to(rank, 'x', 'y', 'edge_index')
-                    batch['paper'].x = batch['paper'].x.to(torch.float32)
-
-                validation_step(batch, acc, model)
-
-            if num_devices > 1:
-                dist.barrier()
-
-            if rank == 0:
-                print(f"Val Acc: {acc.compute():.4f}")
-            acc.reset()
-
-    model.eval()
-
-    if num_devices > 1:
-        dist.destroy_process_group()
-
-
-if __name__ == '__main__':
-    parser = argparse.ArgumentParser()
-    parser.add_argument("--hidden_channels", type=int, default=1024)
-    parser.add_argument("--batch_size", type=int, default=1024)
-    parser.add_argument("--dropout", type=float, default=0.5)
-    parser.add_argument("--lr", type=float, default=0.001)
-    parser.add_argument("--num_epochs", type=int, default=20)
-    parser.add_argument("--num_steps_per_epoch", type=int, default=-1)
-    parser.add_argument("--log_every_n_steps", type=int, default=100)
-    parser.add_argument("--num_val_steps", type=int, default=-1, help=50)
-    parser.add_argument("--num_neighbors", type=str, default="25-15")
-    parser.add_argument("--num_devices", type=int, default=1)
-    args = parser.parse_args()
-
-    args.num_neighbors = [int(i) for i in args.num_neighbors.split('-')]
-
-    import warnings
-    warnings.simplefilter("ignore")
-
-    if not torch.cuda.is_available():
-        args.num_devices = 0
-    elif args.num_devices > torch.cuda.device_count():
-        args.num_devices = torch.cuda.device_count()
-
-    dataset = MAG240MDataset()
-    data = dataset.to_pyg_hetero_data()
-
-    if args.num_devices > 1:
-        print("Let's use", args.num_devices, "GPUs!")
-        from torch.multiprocessing.spawn import ProcessExitedException
-        try:
-            mp.spawn(
-                run,
-                args=(
-                    data,
-                    args.num_devices,
-                    args.num_epochs,
-                    args.num_steps_per_epoch,
-                    args.log_every_n_steps,
-                    args.batch_size,
-                    args.num_neighbors,
-                    args.hidden_channels,
-                    args.dropout,
-                    args.num_val_steps,
-                    args.lr,
-                ),
-                nprocs=args.num_devices,
-                join=True,
-            )
-        except ProcessExitedException as e:
-            print("torch.multiprocessing.spawn.ProcessExitedException:", e)
-            print("Exceptions/SIGBUS/Errors may be caused by a lack of RAM")
-
-    else:
-        run(
-            0,
-            data,
-            args.num_devices,
-            args.num_epochs,
-            args.num_steps_per_epoch,
-            args.log_every_n_steps,
-            args.batch_size,
-            args.num_neighbors,
-            args.hidden_channels,
-            args.dropout,
-            args.num_val_steps,
-            args.lr,
-        )
diff --git a/examples/multi_gpu/model_parallel.py b/examples/multi_gpu/model_parallel.py
deleted file mode 100644
index b4596195501e..000000000000
--- a/examples/multi_gpu/model_parallel.py
+++ /dev/null
@@ -1,78 +0,0 @@
-import os.path as osp
-
-import torch
-import torch.nn.functional as F
-
-from torch_geometric.datasets import Planetoid
-from torch_geometric.nn import GCNConv
-from torch_geometric.transforms import NormalizeFeatures
-
-if torch.cuda.device_count() < 2:
-    quit('This example requires multiple GPUs')
-
-path = osp.dirname(osp.realpath(__file__))
-path = osp.join(path, '..', '..', 'data', 'Planetoid')
-dataset = Planetoid(root=path, name='Cora', transform=NormalizeFeatures())
-data = dataset[0].to('cuda:0')
-
-
-class GCN(torch.nn.Module):
-    def __init__(self, in_channels, out_channels, device1, device2):
-        super().__init__()
-        self.device1 = device1
-        self.device2 = device2
-
-        self.conv1 = GCNConv(in_channels, 16).to(device1)
-        self.conv2 = GCNConv(16, out_channels).to(device2)
-
-    def forward(self, x, edge_index):
-        x = F.dropout(x, p=0.5, training=self.training)
-        x = self.conv1(x, edge_index).relu()
-        # Move data to the second device:
-        x, edge_index = x.to(self.device2), edge_index.to(self.device2)
-        x = F.dropout(x, p=0.5, training=self.training)
-        x = self.conv2(x, edge_index)
-        return x
-
-
-model = GCN(
-    dataset.num_features,
-    dataset.num_classes,
-    device1='cuda:0',
-    device2='cuda:1',
-)
-optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)
-
-
-def train():
-    model.train()
-    optimizer.zero_grad()
-    out = model(data.x, data.edge_index).to('cuda:0')
-    loss = F.cross_entropy(out[data.train_mask], data.y[data.train_mask])
-    loss.backward()
-    optimizer.step()
-    return float(loss)
-
-
-@torch.no_grad()
-def test():
-    model.eval()
-    out = model(data.x, data.edge_index)
-    pred = out.argmax(dim=-1).to('cuda:0')
-
-    accs = []
-    for mask in [data.train_mask, data.val_mask, data.test_mask]:
-        accs.append(int((pred[mask] == data.y[mask]).sum()) / int(mask.sum()))
-    return accs
-
-
-best_val_acc = test_acc = 0
-times = []
-for epoch in range(1, 201):
-    loss = train()
-    train_acc, val_acc, tmp_test_acc = test()
-    if val_acc > best_val_acc:
-        best_val_acc = val_acc
-        test_acc = tmp_test_acc
-    print(f'Epoch: {epoch:03d}, Loss: {loss:.4f}, Train: {train_acc:.4f}, '
-          f'Val: {val_acc:.4f}, Test: {test_acc:.4f}')
diff --git a/examples/multi_gpu/ogbn_train_cugraph.py b/examples/multi_gpu/ogbn_train_cugraph.py
deleted file mode 100644
index 231ed67bf52e..000000000000
--- a/examples/multi_gpu/ogbn_train_cugraph.py
+++ /dev/null
@@ -1,392 +0,0 @@
-"""Single-node, multi-GPU example."""
-
-import argparse
-import os
-import os.path as osp
-import time
-
-import cupy
-import rmm
-import torch
-import torch.distributed as dist
-import torch.multiprocessing as mp
-import torch.nn.functional as F
-from cugraph.gnn import (
-    cugraph_comms_create_unique_id,
-    cugraph_comms_init,
-    cugraph_comms_shutdown,
-)
-from ogb.nodeproppred import PygNodePropPredDataset
-from torch.nn.parallel import DistributedDataParallel
-
-import torch_geometric
-from torch_geometric import seed_everything
-from torch_geometric.utils import (
-    add_self_loops,
-    remove_self_loops,
-    to_undirected,
-)
-
-# Allow computation on objects that are larger than GPU memory
-# https://docs.rapids.ai/api/cudf/stable/developer_guide/library_design/#spilling-to-host-memory
-os.environ['CUDF_SPILL'] = '1'
-
-# Ensures that a CUDA context is not created on import of rapids.
-# Allows pytorch to create the context instead
-os.environ['RAPIDS_NO_INITIALIZE'] = '1'
-
-
-def arg_parse():
-    parser = argparse.ArgumentParser(
-        formatter_class=argparse.ArgumentDefaultsHelpFormatter, )
-    parser.add_argument(
-        '--dataset',
-        type=str,
-        default='ogbn-arxiv',
-        choices=['ogbn-papers100M', 'ogbn-products', 'ogbn-arxiv'],
-        help='Dataset name.',
-    )
-    parser.add_argument(
-        '--dataset_dir',
-        type=str,
-        default='/workspace/data',
-        help='Root directory of dataset.',
-    )
-    parser.add_argument(
-        "--dataset_subdir",
-        type=str,
-        default="ogbn-arxiv",
-        help="directory of dataset.",
-    )
-    parser.add_argument('--hidden_channels', type=int, default=256)
-    parser.add_argument('--num_layers', type=int, default=3)
-    parser.add_argument('--lr', type=float, default=0.001)
-    parser.add_argument('--wd', type=float, default=0.000)
-    parser.add_argument('-e', '--epochs', type=int, default=50)
-    parser.add_argument('-b', '--batch_size', type=int, default=1024)
-    parser.add_argument('--fan_out', type=int, default=10)
-    parser.add_argument('--warmup_steps', type=int, default=20)
-    parser.add_argument('--dropout', type=float, default=0.5)
-    parser.add_argument(
-        '--use_directed_graph',
-        action='store_true',
-        help='Whether or not to use directed graph',
-    )
-    parser.add_argument(
-        '--add_self_loop',
-        action='store_true',
-        help='Whether or not to add self loop',
-    )
-    parser.add_argument(
-        "--model",
-        type=str,
-        default='GCN',
-        choices=[
-            'SAGE',
-            'GAT',
-            'GCN',
-            # TODO: Uncomment when we add support for disjoint sampling
-            # 'SGFormer',
-        ],
-        help="Model used for training, default GCN",
-    )
-    parser.add_argument(
-        "--num_heads",
-        type=int,
-        default=1,
-        help="If using GATConv or GT, number of attention heads to use",
-    )
-    parser.add_argument(
-        '--num_devices',
-        type=int,
-        default=-1,
-        help='How many GPUs to use. Defaults to all available GPUs',
-    )
-    parser.add_argument(
-        '--verbose',
-        action='store_true',
-        help='Whether or not to generate statistical report',
-    )
-    args = parser.parse_args()
-
-    return args
-
-
-def evaluate(rank, loader, model):
-    with torch.no_grad():
-        total_correct = total_examples = 0
-        for batch in loader:
-            batch = batch.to(rank)
-            batch_size = batch.batch_size
-
-            batch.y = batch.y.to(torch.long)
-            out = model(batch.x, batch.edge_index)[:batch_size]
-
-            pred = out.argmax(dim=-1)
-            y = batch.y[:batch_size].view(-1).to(torch.long)
-
-            total_correct += (pred == y).sum()
-            total_examples += y.size(0)
-
-        acc = total_correct.item() / total_examples
-    return acc
-
-
-def init_pytorch_worker(rank, world_size, cugraph_id):
-
-    rmm.reinitialize(
-        devices=rank,
-        managed_memory=True,
-        pool_allocator=True,
-    )
-
-    cupy.cuda.Device(rank).use()
-    from rmm.allocators.cupy import rmm_cupy_allocator
-
-    cupy.cuda.set_allocator(rmm_cupy_allocator)
-
-    import cudf
-    cudf.set_option("spill", True)
-    torch.cuda.set_device(rank)
-
-    os.environ['MASTER_ADDR'] = 'localhost'
-    os.environ['MASTER_PORT'] = '12355'
-    dist.init_process_group('nccl', rank=rank, world_size=world_size)
-
-    cugraph_comms_init(rank=rank, world_size=world_size, uid=cugraph_id,
-                       device=rank)
-
-
-def run_train(rank, args, data, world_size, cugraph_id, model, split_idx,
-              num_classes, wall_clock_start):
-
-    epochs = args.epochs
-    batch_size = args.batch_size
-    fan_out = args.fan_out
-    num_layers = args.num_layers
-
-    init_pytorch_worker(
-        rank,
-        world_size,
-        cugraph_id,
-    )
-
-    model = model.to(rank)
-    model = DistributedDataParallel(model, device_ids=[rank])
-    optimizer = torch.optim.Adam(model.parameters(), lr=args.lr,
-                                 weight_decay=args.wd)
-
-    kwargs = dict(
-        num_neighbors=[fan_out] * num_layers,
-        batch_size=batch_size,
-    )
-    from cugraph_pyg.data import GraphStore, TensorDictFeatureStore
-    from cugraph_pyg.loader import NeighborLoader
-
-    graph_store = GraphStore(is_multi_gpu=True)
-    ixr = torch.tensor_split(data.edge_index, world_size, dim=1)[rank]
-    graph_store[dict(
-        edge_type=('node', 'rel', 'node'),
-        layout='coo',
-        is_sorted=False,
-        size=(data.num_nodes, data.num_nodes),
-    )] = ixr
-
-    feature_store = TensorDictFeatureStore()
-    feature_store['node', 'x', None] = data.x
-    feature_store['node', 'y', None] = data.y
-
-    dist.barrier()
-
-    ix_train = torch.tensor_split(split_idx['train'], world_size)[rank].cuda()
-    train_loader = NeighborLoader(
-        (feature_store, graph_store),
-        input_nodes=ix_train,
-        shuffle=True,
-        drop_last=True,
-        **kwargs,
-    )
-
-    ix_val = torch.tensor_split(split_idx['valid'], world_size)[rank].cuda()
-    val_loader = NeighborLoader(
-        (feature_store, graph_store),
-        input_nodes=ix_val,
-        drop_last=True,
-        **kwargs,
-    )
-
-    ix_test = torch.tensor_split(split_idx['test'], world_size)[rank].cuda()
-    test_loader = NeighborLoader(
-        (feature_store, graph_store),
-        input_nodes=ix_test,
-        drop_last=True,
-        local_seeds_per_call=80000,
-        **kwargs,
-    )
-
-    dist.barrier()
-
-    warmup_steps = args.warmup_steps
-    dist.barrier()
-    torch.cuda.synchronize()
-
-    if rank == 0:
-        prep_time = time.perf_counter() - wall_clock_start
-        print("Total time before training begins (prep_time) =", prep_time,
-              "seconds")
-        print("Beginning training...")
-
-    val_accs = []
-    times = []
-    train_times = []
-    inference_times = []
-    best_val = 0.
-    start = time.perf_counter()
-    for epoch in range(1, epochs + 1):
-        train_start = time.perf_counter()
-        total_loss = 0
-        i = 0
-        for i, batch in enumerate(train_loader):
-            if i == warmup_steps:
-                torch.cuda.synchronize()
-            batch = batch.to(rank)
-            batch_size = batch.batch_size
-            batch.y = batch.y.to(torch.long)
-            optimizer.zero_grad()
-            out = model(batch.x, batch.edge_index)
-            loss = F.cross_entropy(out[:batch_size], batch.y[:batch_size])
-            loss.backward()
-            optimizer.step()
-            total_loss += loss.detach()
-        train_end = time.perf_counter()
-        train_times.append(train_end - train_start)
-        nb = i + 1.0
-        total_loss /= nb
-        dist.barrier()
-        torch.cuda.synchronize()
-
-        inference_start = time.perf_counter()
-        train_acc = evaluate(rank, train_loader, model)
-        dist.barrier()
-        val_acc = evaluate(rank, val_loader, model)
-        dist.barrier()
-
-        inference_times.append(time.perf_counter() - inference_start)
-        val_accs.append(val_acc)
-        if rank == 0:
-            print(f'Epoch {epoch:02d}, Loss: {total_loss:.4f}, Approx. Train:'
-                  f' {train_acc:.4f} Time: {train_end - train_start:.4f}s')
-            print(f'Train: {train_acc:.4f}, Val: {val_acc:.4f}, ')
-
-        times.append(time.perf_counter() - train_start)
-        if val_acc > best_val:
-            best_val = val_acc
-
-    print(f'Total time used for rank: {rank:02d} is '
-          f'{time.perf_counter()-start:.4f}')
-    if rank == 0:
-        val_acc = torch.tensor(val_accs)
-        print('============================')
-        print("Average Epoch Time on training: {:.4f}".format(
-            torch.tensor(train_times).mean()))
-        print("Average Epoch Time on inference: {:.4f}".format(
-            torch.tensor(inference_times).mean()))
-        print(f"Average Epoch Time: {torch.tensor(times).mean():.4f}")
-        print(f"Median time per epoch: {torch.tensor(times).median():.4f}s")
-        print(f'Final Validation: {val_acc.mean():.4f} ± {val_acc.std():.4f}')
-        print(f"Best validation accuracy: {best_val:.4f}")
-
-    if rank == 0:
-        print("Testing...")
-    final_test_acc = evaluate(rank, test_loader, model)
-    dist.barrier()
-    if rank == 0:
-        print(f'Test Accuracy: {final_test_acc:.4f} for rank: {rank:02d}')
-    if rank == 0:
-        total_time = time.perf_counter() - wall_clock_start
-        print(f"Total Training Runtime: {total_time - prep_time}s")
-        print(f"Total Program Runtime: {total_time}s")
-
-    cugraph_comms_shutdown()
-    dist.destroy_process_group()
-
-
-if __name__ == '__main__':
-
-    args = arg_parse()
-    seed_everything(123)
-    wall_clock_start = time.perf_counter()
-
-    root = osp.join(args.dataset_dir, args.dataset_subdir)
-    dataset = PygNodePropPredDataset(name=args.dataset, root=root)
-    split_idx = dataset.get_idx_split()
-    data = dataset[0]
-    if not args.use_directed_graph:
-        data.edge_index = to_undirected(data.edge_index, reduce="mean")
-    if args.add_self_loop:
-        data.edge_index, _ = remove_self_loops(data.edge_index)
-        data.edge_index, _ = add_self_loops(data.edge_index,
-                                            num_nodes=data.num_nodes)
-    data.y = data.y.reshape(-1)
-
-    print(f"Training {args.dataset} with {args.model} model.")
-    if args.model == "GAT":
-        model = torch_geometric.nn.models.GAT(dataset.num_features,
-                                              args.hidden_channels,
-                                              args.num_layers,
-                                              dataset.num_classes,
-                                              heads=args.num_heads)
-    elif args.model == "GCN":
-        model = torch_geometric.nn.models.GCN(
-            dataset.num_features,
-            args.hidden_channels,
-            args.num_layers,
-            dataset.num_classes,
-        )
-    elif args.model == "SAGE":
-        model = torch_geometric.nn.models.GraphSAGE(
-            dataset.num_features,
-            args.hidden_channels,
-            args.num_layers,
-            dataset.num_classes,
-        )
-    elif args.model == 'SGFormer':
-        # TODO add support for this with disjoint sampling
-        model = torch_geometric.nn.models.SGFormer(
-            in_channels=dataset.num_features,
-            hidden_channels=args.hidden_channels,
-            out_channels=dataset.num_classes,
-            trans_num_heads=args.num_heads,
-            trans_dropout=args.dropout,
-            gnn_num_layers=args.num_layers,
-            gnn_dropout=args.dropout,
-        )
-    else:
-        raise ValueError(f'Unsupported model type: {args.model}')
-
-    print("Data =", data)
-    if args.num_devices < 1:
-        world_size = torch.cuda.device_count()
-    elif args.num_devices <= torch.cuda.device_count():
-        world_size = args.num_devices
-    else:
-        world_size = torch.cuda.device_count()
-    print('Let\'s use', world_size, 'GPUs!')
-
-    # Create the uid needed for cuGraph comms
-    cugraph_id = cugraph_comms_create_unique_id()
-
-    if world_size > 1:
-        mp.spawn(
-            run_train,
-            args=(args, data, world_size, cugraph_id, model, split_idx,
-                  dataset.num_classes, wall_clock_start),
-            nprocs=world_size,
-            join=True,
-        )
-    else:
-        run_train(0, args, data, world_size, cugraph_id, model, split_idx,
-                  dataset.num_classes, wall_clock_start)
-
-    total_time = round(time.perf_counter() - wall_clock_start, 2)
-    print("Total Program Runtime (total_time) =", total_time, "seconds")
diff --git a/examples/multi_gpu/papers100m_gcn.py b/examples/multi_gpu/papers100m_gcn.py
deleted file mode 100644
index abf1b92ec15a..000000000000
--- a/examples/multi_gpu/papers100m_gcn.py
+++ /dev/null
@@ -1,202 +0,0 @@
-import argparse
-import os
-import tempfile
-import time
-
-import torch
-import torch.distributed as dist
-import torch.multiprocessing as mp
-import torch.nn.functional as F
-from ogb.nodeproppred import PygNodePropPredDataset
-from torch.nn.parallel import DistributedDataParallel
-from torchmetrics import Accuracy
-
-import torch_geometric
-from torch_geometric.loader import NeighborLoader
-
-
-def get_num_workers(world_size):
-    num_work = None
-    if hasattr(os, "sched_getaffinity"):
-        try:
-            num_work = len(os.sched_getaffinity(0)) / (2 * world_size)
-        except Exception:
-            pass
-    if num_work is None:
-        num_work = os.cpu_count() / (2 * world_size)
-    return int(num_work)
-
-
-def run_train(rank, data, world_size, model, epochs, batch_size, fan_out,
-              split_idx, num_classes, wall_clock_start, tempdir=None,
-              num_layers=3):
-
-    # init pytorch worker
-    os.environ['MASTER_ADDR'] = 'localhost'
-    os.environ['MASTER_PORT'] = '12355'
-    dist.init_process_group('nccl', rank=rank, world_size=world_size)
-
-    if world_size > 1:
-        split_idx['train'] = split_idx['train'].split(
-            split_idx['train'].size(0) // world_size, dim=0)[rank].clone()
-        split_idx['valid'] = split_idx['valid'].split(
-            split_idx['valid'].size(0) // world_size, dim=0)[rank].clone()
-        split_idx['test'] = split_idx['test'].split(
-            split_idx['test'].size(0) // world_size, dim=0)[rank].clone()
-    model = model.to(rank)
-    model = DistributedDataParallel(model, device_ids=[rank])
-    optimizer = torch.optim.Adam(model.parameters(), lr=0.01,
-                                 weight_decay=0.0005)
-
-    kwargs = dict(
-        num_neighbors=[fan_out] * num_layers,
-        batch_size=batch_size,
-    )
-    num_work = get_num_workers(world_size)
-    train_loader = NeighborLoader(data, input_nodes=split_idx['train'],
-                                  num_workers=num_work, shuffle=True,
-                                  drop_last=True, **kwargs)
-    val_loader = NeighborLoader(data, input_nodes=split_idx['valid'],
-                                num_workers=num_work, **kwargs)
-    test_loader = NeighborLoader(data, input_nodes=split_idx['test'],
-                                 num_workers=num_work, **kwargs)
-
-    eval_steps = 1000
-    warmup_steps = 20
-    acc = Accuracy(task="multiclass", num_classes=num_classes).to(rank)
-    dist.barrier()
-    torch.cuda.synchronize()
-    if rank == 0:
-        prep_time = round(time.perf_counter() - wall_clock_start, 2)
-        print("Total time before training begins (prep_time) =", prep_time,
-              "seconds")
-        print("Beginning training...")
-    for epoch in range(epochs):
-        for i, batch in enumerate(train_loader):
-            if i == warmup_steps:
-                torch.cuda.synchronize()
-                start = time.time()
-            batch = batch.to(rank)
-            batch_size = batch.num_sampled_nodes[0]
-            batch.y = batch.y.to(torch.long)
-            optimizer.zero_grad()
-            out = model(batch.x, batch.edge_index)
-            loss = F.cross_entropy(out[:batch_size], batch.y[:batch_size])
-            loss.backward()
-            optimizer.step()
-            if rank == 0 and i % 10 == 0:
-                print("Epoch: " + str(epoch) + ", Iteration: " + str(i) +
-                      ", Loss: " + str(loss))
-        nb = i + 1.0
-        dist.barrier()
-        torch.cuda.synchronize()
-        if rank == 0:
-            print("Average Training Iteration Time:",
-                  (time.time() - start) / (nb - warmup_steps), "s/iter")
-        with torch.no_grad():
-            for i, batch in enumerate(val_loader):
-                if i >= eval_steps:
-                    break
-
-                batch = batch.to(rank)
-                batch_size = batch.num_sampled_nodes[0]
-
-                batch.y = batch.y.to(torch.long)
-                out = model(batch.x, batch.edge_index)
-                acc_i = acc(  # noqa
-                    out[:batch_size].softmax(dim=-1), batch.y[:batch_size])
-            acc_sum = acc.compute()
-            if rank == 0:
-                print(f"Validation Accuracy: {acc_sum * 100.0:.4f}%", )
-        dist.barrier()
-        acc.reset()
-
-    with torch.no_grad():
-        for batch in test_loader:
-            batch = batch.to(rank)
-            batch_size = batch.num_sampled_nodes[0]
-
-            batch.y = batch.y.to(torch.long)
-            out = model(batch.x, batch.edge_index)
-            acc_i = acc(  # noqa
-                out[:batch_size].softmax(dim=-1), batch.y[:batch_size])
-        acc_sum = acc.compute()
-        if rank == 0:
-            print(f"Test Accuracy: {acc_sum * 100.0:.4f}%", )
-    dist.barrier()
-    acc.reset()
-    if rank == 0:
-        total_time = round(time.perf_counter() - wall_clock_start, 2)
-        print("Total Program Runtime (total_time) =", total_time, "seconds")
-        print("total_time - prep_time =", total_time - prep_time, "seconds")
-
-
-if __name__ == '__main__':
-
-    parser = argparse.ArgumentParser()
-    parser.add_argument('--hidden_channels', type=int, default=256)
-    parser.add_argument('--num_layers', type=int, default=2)
-    parser.add_argument('--lr', type=float, default=0.001)
-    parser.add_argument('--epochs', type=int, default=20)
-    parser.add_argument('--batch_size', type=int, default=1024)
-    parser.add_argument('--fan_out', type=int, default=30)
-    parser.add_argument(
-        "--use_gat_conv",
-        action='store_true',
-        help="Whether or not to use GATConv. (Defaults to using GCNConv)",
-    )
-    parser.add_argument(
-        "--n_gat_conv_heads",
-        type=int,
-        default=4,
-        help="If using GATConv, number of attention heads to use",
-    )
-    parser.add_argument(
-        "--n_devices", type=int, default=-1,
-        help="1-8 to use that many GPUs. Defaults to all available GPUs")
-
-    args = parser.parse_args()
-    wall_clock_start = time.perf_counter()
-    if args.n_devices == -1:
-        world_size = torch.cuda.device_count()
-    else:
-        world_size = args.n_devices
-    import psutil
-    gb_ram_needed = 190 + 200 * world_size
-    if (psutil.virtual_memory().total / (1024**3)) < gb_ram_needed:
-        print("Warning: may not have enough RAM to use this many GPUs.")
-        print("Consider upgrading RAM or using fewer GPUs if an error occurs.")
-        print("Estimated RAM Needed: ~" + str(gb_ram_needed))
-    print('Let\'s use', world_size, 'GPUs!')
-    dataset = PygNodePropPredDataset(name='ogbn-papers100M',
-                                     root='/datasets/ogb_datasets')
-    split_idx = dataset.get_idx_split()
-    data = dataset[0]
-    data.y = data.y.reshape(-1)
-    if args.use_gat_conv:
-        model = torch_geometric.nn.models.GAT(dataset.num_features,
-                                              args.hidden_channels,
-                                              args.num_layers,
-                                              dataset.num_classes,
-                                              heads=args.n_gat_conv_heads)
-    else:
-        model = torch_geometric.nn.models.GCN(
-            dataset.num_features,
-            args.hidden_channels,
-            args.num_layers,
-            dataset.num_classes,
-        )
-
-    print("Data =", data)
-    with tempfile.TemporaryDirectory() as tempdir:
-        if world_size > 1:
-            mp.spawn(
-                run_train,
-                args=(data, world_size, model, args.epochs, args.batch_size,
-                      args.fan_out, split_idx, dataset.num_classes,
-                      wall_clock_start, tempdir, args.num_layers),
-                nprocs=world_size, join=True)
-        else:
-            run_train(0, data, world_size, model, args.epochs, args.batch_size,
-                      args.fan_out, split_idx, dataset.num_classes,
-                      wall_clock_start, tempdir, args.num_layers)
diff --git a/examples/multi_gpu/papers100m_gcn_cugraph_multinode.py b/examples/multi_gpu/papers100m_gcn_cugraph_multinode.py
deleted file mode 100644
index 5fdc280483df..000000000000
--- a/examples/multi_gpu/papers100m_gcn_cugraph_multinode.py
+++ /dev/null
@@ -1,376 +0,0 @@
-# Multi-node, multi-GPU example with WholeGraph feature storage.
-# It is recommended that you download the dataset before running.
-
-# To run, use sbatch
-# (e.g. sbatch -N2 -p <partition> -A <account> -J <job name>)
-# with the script shown below:
-#
-# head_node_addr=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
-#
-# (yes || true) | srun -l \
-#        --container-image <container image> \
-#        --container-mounts "$(pwd):/workspace","/raid:/raid" \
-#          torchrun \
-#          --nnodes 2 \
-#          --nproc-per-node 8 \
-#          --rdzv-backend c10d \
-#          --rdzv-id 62 \
-#          --rdzv-endpoint $head_node_addr:29505 \
-#          /workspace/papers100m_gcn_cugraph_multinode.py \
-#            --epochs 1 \
-#            --dataset ogbn-papers100M \
-#            --root /workspace/datasets
-
-import argparse
-import json
-import os
-import os.path as osp
-import time
-from datetime import timedelta
-
-import torch
-import torch.distributed as dist
-import torch.nn.functional as F
-from cugraph.gnn import (
-    cugraph_comms_create_unique_id,
-    cugraph_comms_init,
-    cugraph_comms_shutdown,
-)
-from ogb.nodeproppred import PygNodePropPredDataset
-from pylibwholegraph.torch.initialize import finalize as wm_finalize
-from pylibwholegraph.torch.initialize import init as wm_init
-from torch.nn.parallel import DistributedDataParallel
-
-import torch_geometric
-from torch_geometric.io import fs
-
-# Allow computation on objects that are larger than GPU memory
-# https://docs.rapids.ai/api/cudf/stable/developer_guide/library_design/#spilling-to-host-memory
-os.environ['CUDF_SPILL'] = '1'
-
-# Ensures that a CUDA context is not created on import of rapids.
-# Allows pytorch to create the context instead
-os.environ['RAPIDS_NO_INITIALIZE'] = '1'
-
-
-def init_pytorch_worker(global_rank, local_rank, world_size, cugraph_id):
-    import rmm
-
-    rmm.reinitialize(
-        devices=local_rank,
-        managed_memory=True,
-        pool_allocator=True,
-    )
-
-    import cupy
-
-    cupy.cuda.Device(local_rank).use()
-    from rmm.allocators.cupy import rmm_cupy_allocator
-
-    cupy.cuda.set_allocator(rmm_cupy_allocator)
-
-    from cugraph.testing.mg_utils import enable_spilling
-
-    enable_spilling()
-
-    torch.cuda.set_device(local_rank)
-
-    cugraph_comms_init(rank=global_rank, world_size=world_size, uid=cugraph_id,
-                       device=local_rank)
-
-    wm_init(global_rank, world_size, local_rank, torch.cuda.device_count())
-
-
-def partition_data(dataset, split_idx, edge_path, feature_path, label_path,
-                   meta_path):
-    data = dataset[0]
-
-    os.makedirs(edge_path, exist_ok=True)
-    for (r, e) in enumerate(data.edge_index.tensor_split(world_size, dim=1)):
-        rank_path = osp.join(edge_path, f'rank={r}.pt')
-        torch.save(
-            e.clone(),
-            rank_path,
-        )
-
-    os.makedirs(feature_path, exist_ok=True)
-    for (r, f) in enumerate(torch.tensor_split(data.x, world_size)):
-        rank_path = osp.join(feature_path, f'rank={r}_x.pt')
-        torch.save(
-            f.clone(),
-            rank_path,
-        )
-    for (r, f) in enumerate(torch.tensor_split(data.y, world_size)):
-        rank_path = osp.join(feature_path, f'rank={r}_y.pt')
-        torch.save(
-            f.clone(),
-            rank_path,
-        )
-
-    os.makedirs(label_path, exist_ok=True)
-    for (d, i) in split_idx.items():
-        i_parts = torch.tensor_split(i, world_size)
-        for r, i_part in enumerate(i_parts):
-            rank_path = osp.join(label_path, f'rank={r}')
-            os.makedirs(rank_path, exist_ok=True)
-            torch.save(i_part, osp.join(rank_path, f'{d}.pt'))
-
-    meta = dict(
-        num_classes=int(dataset.num_classes),
-        num_features=int(dataset.num_features),
-        num_nodes=int(data.num_nodes),
-    )
-    with open(meta_path, 'w') as f:
-        json.dump(meta, f)
-
-
-def load_partitioned_data(rank, edge_path, feature_path, label_path, meta_path,
-                          wg_mem_type):
-    from cugraph_pyg.data import GraphStore, WholeFeatureStore
-
-    graph_store = GraphStore(is_multi_gpu=True)
-    feature_store = WholeFeatureStore(memory_type=wg_mem_type)
-
-    with open(meta_path) as f:
-        meta = json.load(f)
-
-    split_idx = {}
-    for split in ['train', 'test', 'valid']:
-        path = osp.join(label_path, f'rank={rank}', f'{split}.pt')
-        split_idx[split] = fs.torch_load(path)
-
-    path = osp.join(feature_path, f'rank={rank}_x.pt')
-    feature_store['node', 'x', None] = fs.torch_load(path)
-    path = osp.join(feature_path, f'rank={rank}_y.pt')
-    feature_store['node', 'y', None] = fs.torch_load(path)
-
-    eix = fs.torch_load(osp.join(edge_path, f'rank={rank}.pt'))
-    graph_store[dict(
-        edge_type=('node', 'rel', 'node'),
-        layout='coo',
-        is_sorted=False,
-        size=(meta['num_nodes'], meta['num_nodes']),
-    )] = eix
-
-    return (feature_store, graph_store), split_idx, meta
-
-
-def run(global_rank, data, split_idx, world_size, device, model, epochs,
-        batch_size, fan_out, num_classes, wall_clock_start, num_layers=3):
-
-    optimizer = torch.optim.Adam(model.parameters(), lr=0.01,
-                                 weight_decay=0.0005)
-
-    kwargs = dict(
-        num_neighbors=[fan_out] * num_layers,
-        batch_size=batch_size,
-    )
-    from cugraph_pyg.loader import NeighborLoader
-
-    ix_train = split_idx['train'].cuda()
-    train_loader = NeighborLoader(
-        data,
-        input_nodes=ix_train,
-        shuffle=True,
-        drop_last=True,
-        **kwargs,
-    )
-
-    ix_val = split_idx['valid'].cuda()
-    val_loader = NeighborLoader(
-        data,
-        input_nodes=ix_val,
-        shuffle=True,
-        drop_last=True,
-        **kwargs,
-    )
-
-    ix_test = split_idx['test'].cuda()
-    test_loader = NeighborLoader(
-        data,
-        input_nodes=ix_test,
-        shuffle=True,
-        drop_last=True,
-        local_seeds_per_call=80000,
-        **kwargs,
-    )
-
-    dist.barrier()
-
-    eval_steps = 1000
-    warmup_steps = 20
-    dist.barrier()
-    torch.cuda.synchronize()
-
-    if global_rank == 0:
-        prep_time = time.perf_counter() - wall_clock_start
-        print(f"Preparation time: {prep_time:.2f}s")
-
-    for epoch in range(epochs):
-        for i, batch in enumerate(train_loader):
-            if i == warmup_steps:
-                torch.cuda.synchronize()
-                start = time.time()
-
-            batch = batch.to(device)
-            batch_size = batch.batch_size
-
-            batch.y = batch.y.view(-1).to(torch.long)
-            optimizer.zero_grad()
-            out = model(batch.x, batch.edge_index)
-            loss = F.cross_entropy(out[:batch_size], batch.y[:batch_size])
-            loss.backward()
-            optimizer.step()
-            if global_rank == 0 and i % 10 == 0:
-                print(f"Epoch: {epoch:02d}, Iteration: {i}, Loss: {loss:.4f}")
-        nb = i + 1.0
-
-        if global_rank == 0:
-            print(f"Avg Training Iteration Time: "
-                  f"{(time.time() - start) / (nb - warmup_steps):.4f} s/iter")
-
-        with torch.no_grad():
-            total_correct = total_examples = 0
-            for i, batch in enumerate(val_loader):
-                if i >= eval_steps:
-                    break
-
-                batch = batch.to(device)
-                batch_size = batch.batch_size
-
-                batch.y = batch.y.to(torch.long)
-                out = model(batch.x, batch.edge_index)[:batch_size]
-
-                pred = out.argmax(dim=-1)
-                y = batch.y[:batch_size].view(-1).to(torch.long)
-
-                total_correct += int((pred == y).sum())
-                total_examples += y.size(0)
-
-            acc_val = total_correct / total_examples
-            if global_rank == 0:
-                print(f"Validation Accuracy: {acc_val * 100:.2f}%", )
-
-        torch.cuda.synchronize()
-
-    with torch.no_grad():
-        total_correct = total_examples = 0
-        for batch in test_loader:
-            batch = batch.to(device)
-            batch_size = batch.batch_size
-
-            batch.y = batch.y.to(torch.long)
-            out = model(batch.x, batch.edge_index)[:batch_size]
-
-            pred = out.argmax(dim=-1)
-            y = batch.y[:batch_size].view(-1).to(torch.long)
-
-            total_correct += int((pred == y).sum())
-            total_examples += y.size(0)
-
-        acc_test = total_correct / total_examples
-        if global_rank == 0:
-            print(f"Test Accuracy: {acc_test * 100:.2f}%", )
-
-    if global_rank == 0:
-        total_time = time.perf_counter() - wall_clock_start
-        print(f"Total Training Runtime: {total_time - prep_time}s")
-        print(f"Total Program Runtime: {total_time}s")
-
-    wm_finalize()
-    cugraph_comms_shutdown()
-
-
-def parse_args():
-    parser = argparse.ArgumentParser()
-    parser.add_argument('--hidden_channels', type=int, default=256)
-    parser.add_argument('--num_layers', type=int, default=2)
-    parser.add_argument('--lr', type=float, default=0.001)
-    parser.add_argument('--epochs', type=int, default=4)
-    parser.add_argument('--batch_size', type=int, default=1024)
-    parser.add_argument('--fan_out', type=int, default=30)
-    parser.add_argument('--dataset', type=str, default='ogbn-papers100M')
-    parser.add_argument('--root', type=str, default='dataset')
-    parser.add_argument('--skip_partition', action='store_true')
-    parser.add_argument('--wg_mem_type', type=str, default='distributed')
-
-    return parser.parse_args()
-
-
-if __name__ == '__main__':
-    args = parse_args()
-    wall_clock_start = time.perf_counter()
-
-    # Set a very high timeout so that PyTorch does not crash while
-    # partitioning the data.
-    dist.init_process_group('nccl', timeout=timedelta(minutes=60))
-    world_size = dist.get_world_size()
-    assert dist.is_initialized()
-
-    global_rank = dist.get_rank()
-    local_rank = int(os.environ['LOCAL_RANK'])
-    device = torch.device(local_rank)
-
-    print(
-        f'Global rank: {global_rank},',
-        f'Local Rank: {local_rank},',
-        f'World size: {world_size}',
-    )
-
-    # Create the uid needed for cuGraph comms
-    if global_rank == 0:
-        cugraph_id = [cugraph_comms_create_unique_id()]
-    else:
-        cugraph_id = [None]
-    dist.broadcast_object_list(cugraph_id, src=0, device=device)
-    cugraph_id = cugraph_id[0]
-
-    init_pytorch_worker(global_rank, local_rank, world_size, cugraph_id)
-
-    edge_path = osp.join(args.root, f'{args.dataset}_eix_part')
-    feature_path = osp.join(args.root, f'{args.dataset}_fea_part')
-    label_path = osp.join(args.root, f'{args.dataset}_label_part')
-    meta_path = osp.join(args.root, f'{args.dataset}_meta.json')
-
-    # We partition the data to avoid loading it in every worker, which will
-    # waste memory and can lead to an out of memory exception.
-    # cugraph_pyg.GraphStore and cugraph_pyg.WholeFeatureStore are always
-    # constructed from partitions of the edge index and features, respectively,
-    # so this works well.
-    if not args.skip_partition and global_rank == 0:
-        print("Partitioning the data into equal size parts per worker")
-        dataset = PygNodePropPredDataset(name=args.dataset, root=args.root)
-        split_idx = dataset.get_idx_split()
-
-        partition_data(
-            dataset,
-            split_idx,
-            meta_path=meta_path,
-            label_path=label_path,
-            feature_path=feature_path,
-            edge_path=edge_path,
-        )
-
-    dist.barrier()
-    print("Loading partitioned data")
-    data, split_idx, meta = load_partitioned_data(
-        rank=global_rank,
-        edge_path=edge_path,
-        feature_path=feature_path,
-        label_path=label_path,
-        meta_path=meta_path,
-        wg_mem_type=args.wg_mem_type,
-    )
-    dist.barrier()
-
-    model = torch_geometric.nn.models.GCN(
-        meta['num_features'],
-        args.hidden_channels,
-        args.num_layers,
-        meta['num_classes'],
-    ).to(device)
-    model = DistributedDataParallel(model, device_ids=[local_rank])
-
-    run(global_rank, data, split_idx, world_size, device, model, args.epochs,
-        args.batch_size, args.fan_out, meta['num_classes'], wall_clock_start,
-        args.num_layers)
diff --git a/examples/multi_gpu/papers100m_gcn_multinode.py b/examples/multi_gpu/papers100m_gcn_multinode.py
deleted file mode 100644
index af434b4d2ef7..000000000000
--- a/examples/multi_gpu/papers100m_gcn_multinode.py
+++ /dev/null
@@ -1,151 +0,0 @@
-"""Multi-node multi-GPU example on ogbn-papers100m.
-
-Example way to run using srun:
-srun -l -N<num_nodes> --ntasks-per-node=<ngpu_per_node> \
---container-name=cont --container-image=<image_url> \
---container-mounts=/ogb-papers100m/:/workspace/dataset \
-python3 path_to_script.py
-"""
-import os
-import time
-from typing import Optional
-
-import torch
-import torch.distributed as dist
-import torch.nn.functional as F
-from ogb.nodeproppred import PygNodePropPredDataset
-from torch.nn.parallel import DistributedDataParallel
-from torchmetrics import Accuracy
-
-from torch_geometric.loader import NeighborLoader
-from torch_geometric.nn import GCN
-
-
-def get_num_workers() -> int:
-    num_workers = None
-    if hasattr(os, "sched_getaffinity"):
-        try:
-            num_workers = len(os.sched_getaffinity(0)) // 2
-        except Exception:
-            pass
-    if num_workers is None:
-        num_workers = os.cpu_count() // 2
-    return num_workers
-
-
-def run(world_size, data, split_idx, model, acc, wall_clock_start):
-    local_id = int(os.environ['LOCAL_RANK'])
-    rank = torch.distributed.get_rank()
-    torch.cuda.set_device(local_id)
-    device = torch.device(local_id)
-    if rank == 0:
-        print(f'Using {world_size} GPUs...')
-
-    split_idx['train'] = split_idx['train'].split(
-        split_idx['train'].size(0) // world_size, dim=0)[rank].clone()
-    split_idx['valid'] = split_idx['valid'].split(
-        split_idx['valid'].size(0) // world_size, dim=0)[rank].clone()
-    split_idx['test'] = split_idx['test'].split(
-        split_idx['test'].size(0) // world_size, dim=0)[rank].clone()
-
-    model = DistributedDataParallel(model.to(device), device_ids=[local_id])
-    optimizer = torch.optim.Adam(model.parameters(), lr=0.001,
-                                 weight_decay=5e-4)
-
-    kwargs = dict(
-        data=data,
-        batch_size=1024,
-        num_workers=get_num_workers(),
-        num_neighbors=[30, 30],
-    )
-
-    train_loader = NeighborLoader(
-        input_nodes=split_idx['train'],
-        shuffle=True,
-        drop_last=True,
-        **kwargs,
-    )
-    val_loader = NeighborLoader(input_nodes=split_idx['valid'], **kwargs)
-    test_loader = NeighborLoader(input_nodes=split_idx['test'], **kwargs)
-
-    val_steps = 1000
-    warmup_steps = 100
-    acc = acc.to(device)
-    dist.barrier()
-    torch.cuda.synchronize()
-    if rank == 0:
-        prep_time = round(time.perf_counter() - wall_clock_start, 2)
-        print("Total time before training begins (prep_time)=", prep_time,
-              "seconds")
-        print("Beginning training...")
-
-    for epoch in range(1, 21):
-        model.train()
-        for i, batch in enumerate(train_loader):
-            if i == warmup_steps:
-                torch.cuda.synchronize()
-                start = time.time()
-            batch = batch.to(device)
-            optimizer.zero_grad()
-            y = batch.y[:batch.batch_size].view(-1).to(torch.long)
-            out = model(batch.x, batch.edge_index)[:batch.batch_size]
-            loss = F.cross_entropy(out, y)
-            loss.backward()
-            optimizer.step()
-
-            if rank == 0 and i % 10 == 0:
-                print(f'Epoch: {epoch:02d}, Iteration: {i}, Loss: {loss:.4f}')
-
-        dist.barrier()
-        torch.cuda.synchronize()
-        if rank == 0:
-            sec_per_iter = (time.time() - start) / (i + 1 - warmup_steps)
-            print(f"Avg Training Iteration Time: {sec_per_iter:.6f} s/iter")
-
-        @torch.no_grad()
-        def test(loader: NeighborLoader, num_steps: Optional[int] = None):
-            model.eval()
-            for j, batch in enumerate(loader):
-                if num_steps is not None and j >= num_steps:
-                    break
-                batch = batch.to(device)
-                out = model(batch.x, batch.edge_index)[:batch.batch_size]
-                y = batch.y[:batch.batch_size].view(-1).to(torch.long)
-                acc(out, y)
-            acc_sum = acc.compute()
-            return acc_sum
-
-        eval_acc = test(val_loader, num_steps=val_steps)
-        if rank == 0:
-            print(f"Val Accuracy: {100.0 * eval_acc:.4f}%")
-
-        acc.reset()
-        dist.barrier()
-
-    test_acc = test(test_loader)
-    if rank == 0:
-        print(f"Test Accuracy: {100.0 * test_acc:.4f}%")
-
-    dist.barrier()
-    acc.reset()
-    torch.cuda.synchronize()
-
-    if rank == 0:
-        total_time = round(time.perf_counter() - wall_clock_start, 2)
-        print("Total Program Runtime (total_time) =", total_time, "seconds")
-        print("total_time - prep_time =", total_time - prep_time, "seconds")
-
-
-if __name__ == '__main__':
-    wall_clock_start = time.perf_counter()
-    # Setup multi-node:
-    torch.distributed.init_process_group("nccl")
-    nprocs = dist.get_world_size()
-    assert dist.is_initialized(), "Distributed cluster not initialized"
-    dataset = PygNodePropPredDataset(name='ogbn-papers100M')
-    split_idx = dataset.get_idx_split()
-    model = GCN(dataset.num_features, 256, 2, dataset.num_classes)
-    acc = Accuracy(task="multiclass", num_classes=dataset.num_classes)
-    data = dataset[0]
-    data.y = data.y.reshape(-1)
-    run(nprocs, data, split_idx, model, acc, wall_clock_start)
diff --git a/examples/multi_gpu/pcqm4m_ogb.py b/examples/multi_gpu/pcqm4m_ogb.py
deleted file mode 100644
index 10e190251699..000000000000
--- a/examples/multi_gpu/pcqm4m_ogb.py
+++ /dev/null
@@ -1,647 +0,0 @@
-# Code adapted from OGB.
-# https://github.com/snap-stanford/ogb/tree/master/examples/lsc/pcqm4m-v2
-import argparse
-import math
-import os
-
-import torch
-import torch.distributed as dist
-import torch.multiprocessing as mp
-import torch.nn.functional as F
-import torch.optim as optim
-from ogb.graphproppred.mol_encoder import AtomEncoder, BondEncoder
-from torch.nn.parallel import DistributedDataParallel
-from torch.optim.lr_scheduler import StepLR
-from torch.utils.tensorboard import SummaryWriter
-from tqdm.auto import tqdm
-
-from torch_geometric.data import Data
-from torch_geometric.datasets import PCQM4Mv2
-from torch_geometric.io import fs
-from torch_geometric.loader import DataLoader
-from torch_geometric.nn import (
-    GlobalAttention,
-    MessagePassing,
-    Set2Set,
-    global_add_pool,
-    global_max_pool,
-    global_mean_pool,
-)
-from torch_geometric.utils import degree
-
-try:
-    from ogb.lsc import PCQM4Mv2Evaluator, PygPCQM4Mv2Dataset
-except ImportError as e:
-    raise ImportError(
-        "`PygPCQM4Mv2Dataset` requires rdkit (`pip install rdkit`)") from e
-
-from ogb.utils import smiles2graph
-
-
-def ogb_from_smiles_wrapper(smiles, *args, **kwargs):
-    """Returns `torch_geometric.data.Data` object from smiles while
-    `ogb.utils.smiles2graph` returns a dict of np arrays.
-    """
-    data_dict = smiles2graph(smiles, *args, **kwargs)
-    return Data(
-        x=torch.from_numpy(data_dict['node_feat']),
-        edge_index=torch.from_numpy(data_dict['edge_index']),
-        edge_attr=torch.from_numpy(data_dict['edge_feat']),
-        smiles=smiles,
-    )
-
-
-class GINConv(MessagePassing):
-    def __init__(self, emb_dim):
-        r"""GINConv.
-
-        Args:
-            emb_dim (int): node embedding dimensionality
-        """
-        super().__init__(aggr="add")
-        self.mlp = torch.nn.Sequential(
-            torch.nn.Linear(emb_dim, emb_dim),
-            torch.nn.BatchNorm1d(emb_dim),
-            torch.nn.ReLU(),
-            torch.nn.Linear(emb_dim, emb_dim),
-        )
-        self.eps = torch.nn.Parameter(torch.Tensor([0]))
-        self.bond_encoder = BondEncoder(emb_dim=emb_dim)
-
-    def forward(self, x, edge_index, edge_attr):
-        edge_embedding = self.bond_encoder(edge_attr)
-        return self.mlp(
-            (1 + self.eps) * x +
-            self.propagate(edge_index, x=x, edge_attr=edge_embedding))
-
-    def message(self, x_j, edge_attr):
-        return F.relu(x_j + edge_attr)
-
-    def update(self, aggr_out):
-        return aggr_out
-
-
-class GCNConv(MessagePassing):
-    def __init__(self, emb_dim):
-        super().__init__(aggr='add')
-        self.linear = torch.nn.Linear(emb_dim, emb_dim)
-        self.root_emb = torch.nn.Embedding(1, emb_dim)
-        self.bond_encoder = BondEncoder(emb_dim=emb_dim)
-
-    def forward(self, x, edge_index, edge_attr):
-        x = self.linear(x)
-        edge_embedding = self.bond_encoder(edge_attr)
-        row, col = edge_index
-        deg = degree(row, x.size(0), dtype=x.dtype) + 1
-        deg_inv_sqrt = deg.pow(-0.5)
-        deg_inv_sqrt[deg_inv_sqrt == float('inf')] = 0
-        norm = deg_inv_sqrt[row] * deg_inv_sqrt[col]
-        return self.propagate(
-            edge_index, x=x, edge_attr=edge_embedding, norm=norm
-        ) + F.relu(x + self.root_emb.weight) * 1. / deg.view(-1, 1)
-
-    def message(self, x_j, edge_attr, norm):
-        return norm.view(-1, 1) * F.relu(x_j + edge_attr)
-
-    def update(self, aggr_out):
-        return aggr_out
-
-
-class GNNNode(torch.nn.Module):
-    def __init__(self, num_layers, emb_dim, drop_ratio=0.5, JK="last",
-                 residual=False, gnn_type='gin'):
-        r"""GNN Node.
-
-        Args:
-            emb_dim (int): node embedding dimensionality.
-            num_layers (int): number of GNN message passing layers.
-            residual (bool): whether to add residual connection.
-            drop_ratio (float): dropout ratio.
-            JK (str): "last" or "sum" to choose JK concat strat.
-            residual (bool): Whether or not to add the residual
-            gnn_type (str): Type of GNN to use.
-        """
-        super().__init__()
-        if num_layers < 2:
-            raise ValueError("Number of GNN layers must be greater than 1.")
-
-        self.num_layers = num_layers
-        self.drop_ratio = drop_ratio
-        self.JK = JK
-        self.residual = residual
-        self.atom_encoder = AtomEncoder(emb_dim)
-        self.convs = torch.nn.ModuleList()
-        self.batch_norms = torch.nn.ModuleList()
-        for _ in range(num_layers):
-            if gnn_type == 'gin':
-                self.convs.append(GINConv(emb_dim))
-            elif gnn_type == 'gcn':
-                self.convs.append(GCNConv(emb_dim))
-            else:
-                raise ValueError(f'Undefined GNN type called {gnn_type}')
-
-            self.batch_norms.append(torch.nn.BatchNorm1d(emb_dim))
-
-    def forward(self, batched_data):
-        x = batched_data.x
-        edge_index = batched_data.edge_index
-        edge_attr = batched_data.edge_attr
-
-        # compute input node embedding
-        h_list = [self.atom_encoder(x)]
-        for layer in range(self.num_layers):
-            h = self.convs[layer](h_list[layer], edge_index, edge_attr)
-            h = self.batch_norms[layer](h)
-
-            if layer == self.num_layers - 1:
-                # remove relu for the last layer
-                h = F.dropout(h, self.drop_ratio, training=self.training)
-            else:
-                h = F.dropout(F.relu(h), self.drop_ratio,
-                              training=self.training)
-
-            if self.residual:
-                h += h_list[layer]
-
-            h_list.append(h)
-
-        # Different implementations of Jk-concat
-        if self.JK == "last":
-            node_representation = h_list[-1]
-        elif self.JK == "sum":
-            node_representation = 0
-            for layer in range(self.num_layers + 1):
-                node_representation += h_list[layer]
-
-        return node_representation
-
-
-class GNNNodeVirtualNode(torch.nn.Module):
-    """Outputs node representations."""
-    def __init__(self, num_layers, emb_dim, drop_ratio=0.5, JK="last",
-                 residual=False, gnn_type='gin'):
-        super().__init__()
-        if num_layers < 2:
-            raise ValueError("Number of GNN layers must be greater than 1.")
-
-        self.num_layers = num_layers
-        self.drop_ratio = drop_ratio
-        self.JK = JK
-        self.residual = residual
-        self.atom_encoder = AtomEncoder(emb_dim)
-
-        # set the initial virtual node embedding to 0.
-        self.virtualnode_embedding = torch.nn.Embedding(1, emb_dim)
-        torch.nn.init.constant_(self.virtualnode_embedding.weight.data, 0)
-
-        self.convs = torch.nn.ModuleList()
-        self.batch_norms = torch.nn.ModuleList()
-        self.mlp_virtualnode_list = torch.nn.ModuleList()
-        for _ in range(num_layers):
-            if gnn_type == 'gin':
-                self.convs.append(GINConv(emb_dim))
-            elif gnn_type == 'gcn':
-                self.convs.append(GCNConv(emb_dim))
-            else:
-                raise ValueError(f'Undefined GNN type called {gnn_type}')
-
-            self.batch_norms.append(torch.nn.BatchNorm1d(emb_dim))
-
-        for _ in range(num_layers - 1):
-            self.mlp_virtualnode_list.append(
-                torch.nn.Sequential(
-                    torch.nn.Linear(emb_dim, emb_dim),
-                    torch.nn.BatchNorm1d(emb_dim),
-                    torch.nn.ReLU(),
-                    torch.nn.Linear(emb_dim, emb_dim),
-                    torch.nn.BatchNorm1d(emb_dim),
-                    torch.nn.ReLU(),
-                ))
-
-    def forward(self, batched_data):
-        x = batched_data.x
-        edge_index = batched_data.edge_index
-        edge_attr = batched_data.edge_attr
-        batch = batched_data.batch
-
-        # virtual node embeddings for graphs
-        virtualnode_embedding = self.virtualnode_embedding(
-            torch.zeros(batch[-1].item() + 1).to(edge_index.dtype).to(
-                edge_index.device))
-
-        h_list = [self.atom_encoder(x)]
-        for layer in range(self.num_layers):
-            # add message from virtual nodes to graph nodes
-            h_list[layer] = h_list[layer] + virtualnode_embedding[batch]
-
-            # Message passing among graph nodes
-            h = self.convs[layer](h_list[layer], edge_index, edge_attr)
-
-            h = self.batch_norms[layer](h)
-            if layer == self.num_layers - 1:
-                # remove relu for the last layer
-                h = F.dropout(h, self.drop_ratio, training=self.training)
-            else:
-                h = F.dropout(F.relu(h), self.drop_ratio,
-                              training=self.training)
-
-            if self.residual:
-                h = h + h_list[layer]
-
-            h_list.append(h)
-
-            # update the virtual nodes
-            if layer < self.num_layers - 1:
-                # add message from graph nodes to virtual nodes
-                virtualnode_embedding_temp = global_add_pool(
-                    h_list[layer], batch) + virtualnode_embedding
-                # transform virtual nodes using MLP
-
-                if self.residual:
-                    virtualnode_embedding = virtualnode_embedding + F.dropout(
-                        self.mlp_virtualnode_list[layer]
-                        (virtualnode_embedding_temp), self.drop_ratio,
-                        training=self.training)
-                else:
-                    virtualnode_embedding = F.dropout(
-                        self.mlp_virtualnode_list[layer](
-                            virtualnode_embedding_temp), self.drop_ratio,
-                        training=self.training)
-
-        # Different implementations of Jk-concat
-        if self.JK == "last":
-            node_representation = h_list[-1]
-        elif self.JK == "sum":
-            node_representation = 0
-            for layer in range(self.num_layers + 1):
-                node_representation += h_list[layer]
-
-        return node_representation
-
-
-class GNN(torch.nn.Module):
-    def __init__(
-        self,
-        num_tasks=1,
-        num_layers=5,
-        emb_dim=300,
-        gnn_type='gin',
-        virtual_node=True,
-        residual=False,
-        drop_ratio=0,
-        JK="last",
-        graph_pooling="sum",
-    ):
-        r"""GNN.
-
-        Args:
-            num_tasks (int): number of labels to be predicted
-            num_layers (int): number of gnn layers.
-            emb_dim (int): embedding dim to use.
-            gnn_type (str): Type of GNN to use.
-            virtual_node (bool): whether to add virtual node or not.
-            residual (bool): Whether or not to add the residual
-            drop_ratio (float): dropout ratio.
-            JK (str): "last" or "sum" to choose JK concat strat.
-            graph_pooling (str): Graph pooling strat to use.
-        """
-        super().__init__()
-        if num_layers < 2:
-            raise ValueError("Number of GNN layers must be greater than 1.")
-
-        self.num_layers = num_layers
-        self.drop_ratio = drop_ratio
-        self.JK = JK
-        self.emb_dim = emb_dim
-        self.num_tasks = num_tasks
-        self.graph_pooling = graph_pooling
-        if virtual_node:
-            self.gnn_node = GNNNodeVirtualNode(
-                num_layers,
-                emb_dim,
-                JK=JK,
-                drop_ratio=drop_ratio,
-                residual=residual,
-                gnn_type=gnn_type,
-            )
-        else:
-            self.gnn_node = GNNNode(
-                num_layers,
-                emb_dim,
-                JK=JK,
-                drop_ratio=drop_ratio,
-                residual=residual,
-                gnn_type=gnn_type,
-            )
-
-        # Pooling function to generate whole-graph embeddings
-        if self.graph_pooling == "sum":
-            self.pool = global_add_pool
-        elif self.graph_pooling == "mean":
-            self.pool = global_mean_pool
-        elif self.graph_pooling == "max":
-            self.pool = global_max_pool
-        elif self.graph_pooling == "attention":
-            self.pool = GlobalAttention(gate_nn=torch.nn.Sequential(
-                torch.nn.Linear(emb_dim, emb_dim),
-                torch.nn.BatchNorm1d(emb_dim),
-                torch.nn.ReLU(),
-                torch.nn.Linear(emb_dim, 1),
-            ))
-        elif self.graph_pooling == "set2set":
-            self.pool = Set2Set(emb_dim, processing_steps=2)
-        else:
-            raise ValueError("Invalid graph pooling type.")
-
-        if graph_pooling == "set2set":
-            self.graph_pred_linear = torch.nn.Linear(2 * emb_dim, num_tasks)
-        else:
-            self.graph_pred_linear = torch.nn.Linear(emb_dim, num_tasks)
-
-    def forward(self, batched_data):
-        h_node = self.gnn_node(batched_data)
-        h_graph = self.pool(h_node, batched_data.batch)
-        output = self.graph_pred_linear(h_graph)
-        if self.training:
-            return output
-        else:
-            # At inference time, we clamp the value between 0 and 20
-            return torch.clamp(output, min=0, max=20)
-
-
-def train(model, rank, device, loader, optimizer):
-    model.train()
-    reg_criterion = torch.nn.L1Loss()
-    loss_accum = 0.0
-    for step, batch in enumerate(  # noqa: B007
-            tqdm(loader, desc="Training", disable=(rank > 0))):
-        batch = batch.to(device)
-        pred = model(batch).view(-1, )
-        optimizer.zero_grad()
-        loss = reg_criterion(pred, batch.y)
-        loss.backward()
-        optimizer.step()
-        loss_accum += loss.detach().cpu().item()
-    return loss_accum / (step + 1)
-
-
-def eval(model, device, loader, evaluator):
-    model.eval()
-    y_true = []
-    y_pred = []
-    for batch in tqdm(loader, desc="Evaluating"):
-        batch = batch.to(device)
-        with torch.no_grad():
-            pred = model(batch).view(-1, )
-
-        y_true.append(batch.y.view(pred.shape).detach().cpu())
-        y_pred.append(pred.detach().cpu())
-
-    y_true = torch.cat(y_true, dim=0)
-    y_pred = torch.cat(y_pred, dim=0)
-    input_dict = {"y_true": y_true, "y_pred": y_pred}
-    return evaluator.eval(input_dict)["mae"]
-
-
-def test(model, device, loader):
-    model.eval()
-    y_pred = []
-    for batch in tqdm(loader, desc="Testing"):
-        batch = batch.to(device)
-        with torch.no_grad():
-            pred = model(batch).view(-1, )
-
-        y_pred.append(pred.detach().cpu())
-
-    y_pred = torch.cat(y_pred, dim=0)
-    return y_pred
-
-
-def run(rank, dataset, args):
-    num_devices = args.num_devices
-    device = torch.device(
-        "cuda:" + str(rank)) if num_devices > 0 else torch.device("cpu")
-
-    if num_devices > 1:
-        os.environ["MASTER_ADDR"] = "localhost"
-        os.environ["MASTER_PORT"] = "12355"
-        dist.init_process_group("nccl", rank=rank, world_size=num_devices)
-
-    if args.on_disk_dataset:
-        train_idx = torch.arange(len(dataset.indices()))
-    else:
-        split_idx = dataset.get_idx_split()
-        train_idx = split_idx["train"]
-
-    if num_devices > 1:
-        num_splits = math.ceil(train_idx.size(0) / num_devices)
-        train_idx = train_idx.split(num_splits)[rank]
-
-    if args.train_subset:
-        subset_ratio = 0.1
-        n = len(train_idx)
-        subset_idx = torch.randperm(n)[:int(subset_ratio * n)]
-        train_dataset = dataset[train_idx[subset_idx]]
-    else:
-        train_dataset = dataset[train_idx]
-
-    train_loader = DataLoader(
-        train_dataset,
-        batch_size=args.batch_size,
-        shuffle=True,
-        num_workers=args.num_workers,
-    )
-
-    if rank == 0:
-        if args.on_disk_dataset:
-            valid_dataset = PCQM4Mv2(root='on_disk_dataset/', split="val",
-                                     from_smiles_func=ogb_from_smiles_wrapper)
-            test_dev_dataset = PCQM4Mv2(
-                root='on_disk_dataset/', split="test",
-                from_smiles_func=ogb_from_smiles_wrapper)
-            test_challenge_dataset = PCQM4Mv2(
-                root='on_disk_dataset/', split="holdout",
-                from_smiles_func=ogb_from_smiles_wrapper)
-        else:
-            valid_dataset = dataset[split_idx["valid"]]
-            test_dev_dataset = dataset[split_idx["test-dev"]]
-            test_challenge_dataset = dataset[split_idx["test-challenge"]]
-
-        valid_loader = DataLoader(
-            valid_dataset,
-            batch_size=args.batch_size,
-            shuffle=False,
-            num_workers=args.num_workers,
-        )
-        if args.save_test_dir != '':
-            testdev_loader = DataLoader(
-                test_dev_dataset,
-                batch_size=args.batch_size,
-                shuffle=False,
-                num_workers=args.num_workers,
-            )
-            testchallenge_loader = DataLoader(
-                test_challenge_dataset,
-                batch_size=args.batch_size,
-                shuffle=False,
-                num_workers=args.num_workers,
-            )
-
-        if args.checkpoint_dir != '':
-            os.makedirs(args.checkpoint_dir, exist_ok=True)
-
-        evaluator = PCQM4Mv2Evaluator()
-
-    gnn_type, _, virtual_node = args.gnn.partition('-')
-    model = GNN(
-        gnn_type=gnn_type,
-        virtual_node=virtual_node,
-        num_layers=args.num_layers,
-        emb_dim=args.emb_dim,
-        drop_ratio=args.drop_ratio,
-        graph_pooling=args.graph_pooling,
-    )
-    if num_devices > 0:
-        model = model.to(rank)
-    if num_devices > 1:
-        model = DistributedDataParallel(model, device_ids=[rank])
-
-    optimizer = optim.Adam(model.parameters(), lr=0.001)
-
-    if args.log_dir != '':
-        writer = SummaryWriter(log_dir=args.log_dir)
-
-    best_valid_mae = 1000
-
-    if args.train_subset:
-        scheduler = StepLR(optimizer, step_size=300, gamma=0.25)
-        args.epochs = 1000
-    else:
-        scheduler = StepLR(optimizer, step_size=30, gamma=0.25)
-
-    current_epoch = 1
-
-    checkpoint_path = os.path.join(args.checkpoint_dir, 'checkpoint.pt')
-    if os.path.isfile(checkpoint_path):
-        checkpoint = fs.torch_load(checkpoint_path)
-        current_epoch = checkpoint['epoch'] + 1
-        model.load_state_dict(checkpoint['model_state_dict'])
-        optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
-        scheduler.load_state_dict(checkpoint['scheduler_state_dict'])
-        best_valid_mae = checkpoint['best_val_mae']
-        print(f"Found checkpoint, resume training at epoch {current_epoch}")
-
-    for epoch in range(current_epoch, args.epochs + 1):
-        train_mae = train(model, rank, device, train_loader, optimizer)
-
-        if num_devices > 1:
-            dist.barrier()
-
-        if rank == 0:
-            valid_mae = eval(
-                model.module if isinstance(model, DistributedDataParallel) else
-                model, device, valid_loader, evaluator)
-
-            print(f"Epoch {epoch:02d}, "
-                  f"Train MAE: {train_mae:.4f}, "
-                  f"Val MAE: {valid_mae:.4f}")
-
-            if args.log_dir != '':
-                writer.add_scalar('valid/mae', valid_mae, epoch)
-                writer.add_scalar('train/mae', train_mae, epoch)
-
-            if valid_mae < best_valid_mae:
-                best_valid_mae = valid_mae
-                if args.checkpoint_dir != '':
-                    checkpoint = {
-                        'epoch': epoch,
-                        'model_state_dict': model.state_dict(),
-                        'optimizer_state_dict': optimizer.state_dict(),
-                        'scheduler_state_dict': scheduler.state_dict(),
-                        'best_val_mae': best_valid_mae,
-                    }
-                    torch.save(checkpoint, checkpoint_path)
-
-                if args.save_test_dir != '':
-                    test_model = model.module if isinstance(
-                        model, DistributedDataParallel) else model
-
-                    testdev_pred = test(test_model, device, testdev_loader)
-                    evaluator.save_test_submission(
-                        {'y_pred': testdev_pred.cpu().detach().numpy()},
-                        args.save_test_dir,
-                        mode='test-dev',
-                    )
-
-                    testchallenge_pred = test(test_model, device,
-                                              testchallenge_loader)
-                    evaluator.save_test_submission(
-                        {'y_pred': testchallenge_pred.cpu().detach().numpy()},
-                        args.save_test_dir,
-                        mode='test-challenge',
-                    )
-
-            print(f'Best validation MAE so far: {best_valid_mae}')
-
-        if num_devices > 1:
-            dist.barrier()
-
-        scheduler.step()
-
-    if rank == 0 and args.log_dir != '':
-        writer.close()
-
-
-if __name__ == "__main__":
-    parser = argparse.ArgumentParser(
-        description='GNN baselines on PCQM4Mv2 with PyTorch Geometric',
-        formatter_class=argparse.ArgumentDefaultsHelpFormatter)
-    parser.add_argument('--gnn', type=str, default='gin-virtual',
-                        choices=['gin', 'gin-virtual', 'gcn',
-                                 'gcn-virtual'], help='GNN architecture')
-    parser.add_argument('--graph_pooling', type=str, default='sum',
-                        help='graph pooling strategy mean or sum')
-    parser.add_argument('--drop_ratio', type=float, default=0,
-                        help='dropout ratio')
-    parser.add_argument('--num_layers', type=int, default=5,
-                        help='number of GNN message passing layers')
-    parser.add_argument('--emb_dim', type=int, default=600,
-                        help='dimensionality of hidden units in GNNs')
-    parser.add_argument('--train_subset', action='store_true')
-    parser.add_argument('--batch_size', type=int, default=256,
-                        help='input batch size for training')
-    parser.add_argument('--epochs', type=int, default=100,
-                        help='number of epochs to train')
-    parser.add_argument('--num_workers', type=int, default=0,
-                        help='number of workers')
-    parser.add_argument('--log_dir', type=str, default="",
-                        help='tensorboard log directory')
-    parser.add_argument('--checkpoint_dir', type=str, default='',
-                        help='directory to save checkpoint')
-    parser.add_argument('--save_test_dir', type=str, default='',
-                        help='directory to save test submission file')
-    parser.add_argument('--num_devices', type=int, default=1,
-                        help="Number of GPUs, if 0 runs on the CPU")
-    parser.add_argument('--on_disk_dataset', action='store_true')
-    args = parser.parse_args()
-
-    available_gpus = torch.cuda.device_count() if torch.cuda.is_available(
-    ) else 0
-    if args.num_devices > available_gpus:
-        if available_gpus == 0:
-            print("No GPUs available, running w/ CPU...")
-        else:
-            raise ValueError(f"Cannot train with {args.num_devices} GPUs: "
-                             f"available GPUs count {available_gpus}")
-
-    # automatic dataloading and splitting
-    if args.on_disk_dataset:
-        dataset = PCQM4Mv2(root='on_disk_dataset/', split='train',
-                           from_smiles_func=ogb_from_smiles_wrapper)
-    else:
-        dataset = PygPCQM4Mv2Dataset(root='dataset/')
-
-    if args.num_devices > 1:
-        mp.spawn(run, args=(dataset, args), nprocs=args.num_devices, join=True)
-    else:
-        run(0, dataset, args)
diff --git a/examples/multi_gpu/taobao.py b/examples/multi_gpu/taobao.py
deleted file mode 100644
index 830840e629e2..000000000000
--- a/examples/multi_gpu/taobao.py
+++ /dev/null
@@ -1,285 +0,0 @@
-# A multi-GPU implementation of unsupervised bipartite GraphSAGE
-# using the Alibaba Taobao dataset.
-import argparse
-import os
-import os.path as osp
-
-import torch
-import torch.distributed as dist
-import torch.multiprocessing as mp
-import torch.nn.functional as F
-import tqdm
-from sklearn.metrics import roc_auc_score
-from torch.nn import Embedding, Linear
-from torch.nn.parallel import DistributedDataParallel
-
-import torch_geometric.transforms as T
-from torch_geometric.datasets import Taobao
-from torch_geometric.loader import LinkNeighborLoader
-from torch_geometric.nn import SAGEConv
-from torch_geometric.utils.convert import to_scipy_sparse_matrix
-
-
-class ItemGNNEncoder(torch.nn.Module):
-    def __init__(self, hidden_channels, out_channels):
-        super().__init__()
-        self.conv1 = SAGEConv(-1, hidden_channels)
-        self.conv2 = SAGEConv(hidden_channels, hidden_channels)
-        self.lin = Linear(hidden_channels, out_channels)
-
-    def forward(self, x, edge_index):
-        x = self.conv1(x, edge_index).relu()
-        x = self.conv2(x, edge_index).relu()
-        return self.lin(x)
-
-
-class UserGNNEncoder(torch.nn.Module):
-    def __init__(self, hidden_channels, out_channels):
-        super().__init__()
-        self.conv1 = SAGEConv((-1, -1), hidden_channels)
-        self.conv2 = SAGEConv((-1, -1), hidden_channels)
-        self.conv3 = SAGEConv((-1, -1), hidden_channels)
-        self.lin = Linear(hidden_channels, out_channels)
-
-    def forward(self, x_dict, edge_index_dict):
-        item_x = self.conv1(
-            x_dict['item'],
-            edge_index_dict[('item', 'to', 'item')],
-        ).relu()
-
-        user_x = self.conv2(
-            (x_dict['item'], x_dict['user']),
-            edge_index_dict[('item', 'rev_to', 'user')],
-        ).relu()
-
-        user_x = self.conv3(
-            (item_x, user_x),
-            edge_index_dict[('item', 'rev_to', 'user')],
-        ).relu()
-
-        return self.lin(user_x)
-
-
-class EdgeDecoder(torch.nn.Module):
-    def __init__(self, hidden_channels):
-        super().__init__()
-        self.lin1 = Linear(2 * hidden_channels, hidden_channels)
-        self.lin2 = Linear(hidden_channels, 1)
-
-    def forward(self, z_src, z_dst, edge_label_index):
-        row, col = edge_label_index
-        z = torch.cat([z_src[row], z_dst[col]], dim=-1)
-
-        z = self.lin1(z).relu()
-        z = self.lin2(z)
-        return z.view(-1)
-
-
-class Model(torch.nn.Module):
-    def __init__(self, num_users, num_items, hidden_channels, out_channels):
-        super().__init__()
-        self.user_emb = Embedding(num_users, hidden_channels)
-        self.item_emb = Embedding(num_items, hidden_channels)
-        self.item_encoder = ItemGNNEncoder(hidden_channels, out_channels)
-        self.user_encoder = UserGNNEncoder(hidden_channels, out_channels)
-        self.decoder = EdgeDecoder(out_channels)
-
-    def forward(self, x_dict, edge_index_dict, edge_label_index):
-        z_dict = {}
-        x_dict['user'] = self.user_emb(x_dict['user'])
-        x_dict['item'] = self.item_emb(x_dict['item'])
-        z_dict['item'] = self.item_encoder(
-            x_dict['item'],
-            edge_index_dict[('item', 'to', 'item')],
-        )
-        z_dict['user'] = self.user_encoder(x_dict, edge_index_dict)
-
-        return self.decoder(z_dict['user'], z_dict['item'], edge_label_index)
-
-
-def run_train(rank, data, train_data, val_data, test_data, args, world_size):
-    if rank == 0:
-        print("Setting up Data Loaders...")
-    train_edge_label_idx = train_data[('user', 'to', 'item')].edge_label_index
-    train_edge_label_idx = train_edge_label_idx.split(
-        train_edge_label_idx.size(1) // world_size, dim=1)[rank].clone()
-    train_loader = LinkNeighborLoader(
-        data=train_data,
-        num_neighbors=[8, 4],
-        edge_label_index=(('user', 'to', 'item'), train_edge_label_idx),
-        neg_sampling='binary',
-        batch_size=args.batch_size,
-        shuffle=True,
-        num_workers=args.num_workers,
-        drop_last=True,
-    )
-
-    val_loader = LinkNeighborLoader(
-        data=val_data,
-        num_neighbors=[8, 4],
-        edge_label_index=(
-            ('user', 'to', 'item'),
-            val_data[('user', 'to', 'item')].edge_label_index,
-        ),
-        edge_label=val_data[('user', 'to', 'item')].edge_label,
-        batch_size=args.batch_size,
-        shuffle=False,
-        num_workers=args.num_workers,
-    )
-
-    test_loader = LinkNeighborLoader(
-        data=test_data,
-        num_neighbors=[8, 4],
-        edge_label_index=(
-            ('user', 'to', 'item'),
-            test_data[('user', 'to', 'item')].edge_label_index,
-        ),
-        edge_label=test_data[('user', 'to', 'item')].edge_label,
-        batch_size=args.batch_size,
-        shuffle=False,
-        num_workers=args.num_workers,
-    )
-
-    def train():
-        model.train()
-
-        total_loss = total_examples = 0
-        for batch in tqdm.tqdm(train_loader, disable=rank != 0):
-            batch = batch.to(rank)
-            optimizer.zero_grad()
-
-            pred = model(
-                batch.x_dict,
-                batch.edge_index_dict,
-                batch['user', 'item'].edge_label_index,
-            )
-            loss = F.binary_cross_entropy_with_logits(
-                pred, batch['user', 'item'].edge_label)
-
-            loss.backward()
-            optimizer.step()
-            total_loss += float(loss)
-            total_examples += pred.numel()
-
-        return total_loss / total_examples
-
-    @torch.no_grad()
-    def test(loader):
-        model.eval()
-        preds, targets = [], []
-        for batch in tqdm.tqdm(loader, disable=rank != 0):
-            batch = batch.to(rank)
-
-            pred = model(
-                batch.x_dict,
-                batch.edge_index_dict,
-                batch['user', 'item'].edge_label_index,
-            ).sigmoid().view(-1).cpu()
-            target = batch['user', 'item'].edge_label.long().cpu()
-
-            preds.append(pred)
-            targets.append(target)
-
-        pred = torch.cat(preds, dim=0).numpy()
-        target = torch.cat(targets, dim=0).numpy()
-
-        return roc_auc_score(target, pred)
-
-    os.environ['MASTER_ADDR'] = 'localhost'
-    os.environ['MASTER_PORT'] = '12355'
-    dist.init_process_group('nccl', rank=rank, world_size=world_size)
-    model = Model(
-        num_users=data['user'].num_nodes,
-        num_items=data['item'].num_nodes,
-        hidden_channels=64,
-        out_channels=64,
-    ).to(rank)
-    # Initialize lazy modules
-    for batch in train_loader:
-        batch = batch.to(rank)
-        _ = model(
-            batch.x_dict,
-            batch.edge_index_dict,
-            batch['user', 'item'].edge_label_index,
-        )
-        break
-    model = DistributedDataParallel(model, device_ids=[rank])
-    optimizer = torch.optim.Adam(model.parameters(), lr=args.lr)
-    best_val_auc = 0
-    for epoch in range(1, args.epochs):
-        print("Train")
-        loss = train()
-        if rank == 0:
-            print("Val")
-            val_auc = test(val_loader)
-            best_val_auc = max(best_val_auc, val_auc)
-        if rank == 0:
-            print(f'Epoch: {epoch:02d}, Loss: {loss:.4f}, '
-                  f'Val AUC: {val_auc:.4f}')
-    if rank == 0:
-        print("Test")
-        test_auc = test(test_loader)
-        print(f'Total {args.epochs:02d} epochs: Final Loss: {loss:.4f}, '
-              f'Best Val AUC: {best_val_auc:.4f}, '
-              f'Test AUC: {test_auc:.4f}')
-
-
-if __name__ == '__main__':
-    parser = argparse.ArgumentParser()
-    parser.add_argument('--num_workers', type=int, default=16,
-                        help="Number of workers per dataloader")
-    parser.add_argument('--lr', type=float, default=0.001)
-    parser.add_argument('--epochs', type=int, default=21)
-    parser.add_argument('--batch_size', type=int, default=2048)
-    parser.add_argument(
-        '--dataset_root_dir', type=str,
-        default=osp.join(osp.dirname(osp.realpath(__file__)),
-                         '../../data/Taobao'))
-    args = parser.parse_args()
-
-    def pre_transform(data):
-        # Compute sparsified item<>item relationships through users:
-        print('Computing item<>item relationships...')
-        mat = to_scipy_sparse_matrix(data['user', 'item'].edge_index).tocsr()
-        mat = mat[:data['user'].num_nodes, :data['item'].num_nodes]
-        comat = mat.T @ mat
-        comat.setdiag(0)
-        comat = comat >= 3.
-        comat = comat.tocoo()
-        row = torch.from_numpy(comat.row).to(torch.long)
-        col = torch.from_numpy(comat.col).to(torch.long)
-        data['item', 'item'].edge_index = torch.stack([row, col], dim=0)
-        return data
-
-    dataset = Taobao(args.dataset_root_dir, pre_transform=pre_transform)
-    data = dataset[0]
-
-    data['user'].x = torch.arange(0, data['user'].num_nodes)
-    data['item'].x = torch.arange(0, data['item'].num_nodes)
-
-    # Only consider user<>item relationships for simplicity:
-    del data['category']
-    del data['item', 'category']
-    del data['user', 'item'].time
-    del data['user', 'item'].behavior
-
-    # Add a reverse ('item', 'rev_to', 'user') relation for message passing:
-    data = T.ToUndirected()(data)
-
-    # Perform a link-level split into training, validation, and test edges:
-    print('Computing data splits...')
-    train_data, val_data, test_data = T.RandomLinkSplit(
-        num_val=0.1,
-        num_test=0.1,
-        neg_sampling_ratio=1.0,
-        add_negative_train_samples=False,
-        edge_types=[('user', 'to', 'item')],
-        rev_edge_types=[('item', 'rev_to', 'user')],
-    )(data)
-    print('Done!')
-
-    world_size = torch.cuda.device_count()
-    print('Let\'s use', world_size, 'GPUs!')
-    mp.spawn(run_train,
-             args=(data, train_data, val_data, test_data, args, world_size),
-             nprocs=world_size, join=True)
diff --git a/examples/ogbn_train_cugraph.py b/examples/ogbn_train_cugraph.py
deleted file mode 100644
index a3045b736ab3..000000000000
--- a/examples/ogbn_train_cugraph.py
+++ /dev/null
@@ -1,354 +0,0 @@
-import argparse
-import os
-import os.path as osp
-import time
-
-import cupy
-import psutil
-import rmm
-import torch
-import torch.distributed as dist
-from rmm.allocators.cupy import rmm_cupy_allocator
-from rmm.allocators.torch import rmm_torch_allocator
-
-# Must change allocators immediately upon import
-# or else other imports will cause memory to be
-# allocated and prevent changing the allocator
-# rmm.reinitialize() provides an easy way to initialize RMM
-# with specific memory resource options across multiple devices.
-# See help(rmm.reinitialize) for full details.
-rmm.reinitialize(devices=[0], pool_allocator=True, managed_memory=True)
-cupy.cuda.set_allocator(rmm_cupy_allocator)
-torch.cuda.memory.change_current_allocator(rmm_torch_allocator)
-
-import cudf  # noqa
-import cugraph_pyg  # noqa
-import torch.nn.functional as F  # noqa
-from cugraph_pyg.loader import NeighborLoader  # noqa
-from ogb.nodeproppred import PygNodePropPredDataset  # noqa
-
-import torch_geometric  # noqa
-
-# Enable cudf spilling to save GPU memory
-cudf.set_option("spill", True)
-
-
-# ---------------- Distributed helpers ----------------
-def safe_get_rank():
-    return dist.get_rank() if dist.is_initialized() else 0
-
-
-def safe_get_world_size():
-    return dist.get_world_size() if dist.is_initialized() else 1
-
-
-def init_distributed():
-    """Initialize distributed training if environment variables are set.
-    Fallback to single-GPU mode otherwise.
-    """
-    # Already initialized: nothing to do
-    if dist.is_available() and dist.is_initialized():
-        return
-
-    # Default env vars for single-GPU / single-process fallback
-    default_env = {
-        "RANK": "0",
-        "LOCAL_RANK": "0",
-        "WORLD_SIZE": "1",
-        "LOCAL_WORLD_SIZE": "1",
-        "MASTER_ADDR": "127.0.0.1",
-        "MASTER_PORT": "29500"
-    }
-
-    # Update environment only if keys are missing
-    for k, v in default_env.items():
-        os.environ.setdefault(k, v)
-
-    # Set CUDA device
-    if torch.cuda.is_available():
-        local_rank = int(os.environ["LOCAL_RANK"])
-        torch.cuda.set_device(local_rank)
-
-    # Initialize distributed only if world_size > 1
-    world_size = int(os.environ["WORLD_SIZE"])
-    if world_size > 1:
-        dist.init_process_group(backend="nccl", init_method="env://")
-        rank = os.environ['RANK']
-        print(f"Initialized distributed: rank {rank}, world_size {world_size}")
-    else:
-        print("Running in single-GPU / single-process mode")
-
-    if not dist.is_initialized():
-        dist.init_process_group(backend="nccl", init_method="env://", rank=0,
-                                world_size=1)
-
-
-# ------------------------------------------------------
-
-
-def arg_parse():
-    parser = argparse.ArgumentParser(
-        formatter_class=argparse.ArgumentDefaultsHelpFormatter, )
-    parser.add_argument(
-        '--dataset',
-        type=str,
-        default='ogbn-arxiv',
-        choices=['ogbn-papers100M', 'ogbn-products', 'ogbn-arxiv'],
-        help='Dataset name.',
-    )
-    parser.add_argument(
-        '--dataset_dir',
-        type=str,
-        default='/workspace/data',
-        help='Root directory of dataset.',
-    )
-    parser.add_argument(
-        "--dataset_subdir",
-        type=str,
-        default="ogbn-arxiv",
-        help="directory of dataset.",
-    )
-    parser.add_argument('-e', '--epochs', type=int, default=50)
-    parser.add_argument('--num_layers', type=int, default=3)
-    parser.add_argument('-b', '--batch_size', type=int, default=1024)
-    parser.add_argument('--fan_out', type=int, default=10)
-    parser.add_argument('--hidden_channels', type=int, default=256)
-    parser.add_argument('--lr', type=float, default=0.003)
-    parser.add_argument('--wd', type=float, default=0.0,
-                        help='weight decay for the optimizer')
-    parser.add_argument('--dropout', type=float, default=0.5)
-    parser.add_argument('--num_workers', type=int, default=12)
-    parser.add_argument(
-        '--use_directed_graph',
-        action='store_true',
-        help='Whether or not to use directed graph',
-    )
-    parser.add_argument(
-        '--add_self_loop',
-        action='store_true',
-        help='Whether or not to add self loop',
-    )
-    parser.add_argument(
-        "--model",
-        type=str,
-        default='SAGE',
-        choices=[
-            'SAGE',
-            'GAT',
-            'GCN',
-            # TODO: Uncomment when we add support for disjoint sampling
-            # 'SGFormer',
-        ],
-        help="Model used for training, default SAGE",
-    )
-    parser.add_argument(
-        "--num_heads",
-        type=int,
-        default=1,
-        help="If using GATConv or GT, number of attention heads to use",
-    )
-    parser.add_argument('--tempdir_root', type=str, default=None)
-    args = parser.parse_args()
-    return args
-
-
-def create_loader(
-    input_nodes,
-    stage_name,
-    data,
-    num_neighbors,
-    replace,
-    batch_size,
-    shuffle=False,
-):
-    if safe_get_rank() == 0:
-        print(f'Creating {stage_name} loader...')
-
-    return NeighborLoader(
-        data,
-        num_neighbors=num_neighbors,
-        input_nodes=input_nodes,
-        replace=replace,
-        batch_size=batch_size,
-        shuffle=shuffle,
-    )
-
-
-def train(model, train_loader, optimizer):
-    model.train()
-
-    total_loss = total_correct = total_examples = 0
-    for batch in train_loader:
-        batch = batch.cuda()
-        optimizer.zero_grad()
-        out = model(batch.x, batch.edge_index)[:batch.batch_size]
-        y = batch.y[:batch.batch_size].view(-1).to(torch.long)
-        loss = F.cross_entropy(out, y)
-        loss.backward()
-        optimizer.step()
-
-        total_loss += loss * y.size(0)
-        total_correct += out.argmax(dim=-1).eq(y).sum()
-        total_examples += y.size(0)
-
-    return total_loss.item() / total_examples, total_correct.item(
-    ) / total_examples
-
-
-@torch.no_grad()
-def test(model, loader):
-    model.eval()
-
-    total_correct = total_examples = 0
-    for batch in loader:
-        batch = batch.cuda()
-        out = model(batch.x, batch.edge_index)[:batch.batch_size]
-        y = batch.y[:batch.batch_size].view(-1).to(torch.long)
-
-        total_correct += out.argmax(dim=-1).eq(y).sum()
-        total_examples += y.size(0)
-
-    return total_correct.item() / total_examples
-
-
-if __name__ == '__main__':
-    # init DDP if needed
-    init_distributed()
-
-    args = arg_parse()
-    torch_geometric.seed_everything(123)
-
-    if "papers" in str(args.dataset) and (psutil.virtual_memory().total /
-                                          (1024**3)) < 390:
-        if safe_get_rank() == 0:
-            print("Warning: may not have enough RAM to use this many GPUs.")
-            print("Consider upgrading RAM if an error occurs.")
-            print("Estimated RAM Needed: ~390GB.")
-
-    wall_clock_start = time.perf_counter()
-
-    root = osp.join(args.dataset_dir, args.dataset_subdir)
-
-    if safe_get_rank() == 0:
-        print('The root is: ', root)
-
-    dataset = PygNodePropPredDataset(name=args.dataset, root=root)
-    split_idx = dataset.get_idx_split()
-
-    data = dataset[0]
-    if not args.use_directed_graph:
-        data.edge_index = torch_geometric.utils.to_undirected(
-            data.edge_index, reduce="mean")
-    if args.add_self_loop:
-        data.edge_index, _ = torch_geometric.utils.remove_self_loops(
-            data.edge_index)
-        data.edge_index, _ = torch_geometric.utils.add_self_loops(
-            data.edge_index, num_nodes=data.num_nodes)
-
-    graph_store = cugraph_pyg.data.GraphStore()
-    graph_store[dict(
-        edge_type=('node', 'rel', 'node'),
-        layout='coo',
-        is_sorted=False,
-        size=(data.num_nodes, data.num_nodes),
-    )] = data.edge_index
-
-    feature_store = cugraph_pyg.data.FeatureStore()
-    feature_store['node', 'x', None] = data.x
-    feature_store['node', 'y', None] = data.y
-
-    data = (feature_store, graph_store)
-
-    if safe_get_rank() == 0:
-        print(f"Training {args.dataset} with {args.model} model.")
-
-    if args.model == "GAT":
-        model = torch_geometric.nn.models.GAT(dataset.num_features,
-                                              args.hidden_channels,
-                                              args.num_layers,
-                                              dataset.num_classes,
-                                              heads=args.num_heads).cuda()
-    elif args.model == "GCN":
-        model = torch_geometric.nn.models.GCN(dataset.num_features,
-                                              args.hidden_channels,
-                                              args.num_layers,
-                                              dataset.num_classes).cuda()
-    elif args.model == "SAGE":
-        model = torch_geometric.nn.models.GraphSAGE(
-            dataset.num_features, args.hidden_channels, args.num_layers,
-            dataset.num_classes).cuda()
-    elif args.model == 'SGFormer':
-        # TODO add support for this with disjoint sampling
-        model = torch_geometric.nn.models.SGFormer(
-            in_channels=dataset.num_features,
-            hidden_channels=args.hidden_channels,
-            out_channels=dataset.num_classes,
-            trans_num_heads=args.num_heads,
-            trans_dropout=args.dropout,
-            gnn_num_layers=args.num_layers,
-            gnn_dropout=args.dropout,
-        ).cuda()
-    else:
-        raise ValueError(f'Unsupported model type: {args.model}')
-
-    optimizer = torch.optim.Adam(model.parameters(), lr=args.lr,
-                                 weight_decay=args.wd)
-
-    loader_kwargs = dict(
-        data=data,
-        num_neighbors=[args.fan_out] * args.num_layers,
-        replace=False,
-        batch_size=args.batch_size,
-    )
-
-    train_loader = create_loader(split_idx['train'], 'train', **loader_kwargs,
-                                 shuffle=True)
-    val_loader = create_loader(split_idx['valid'], 'val', **loader_kwargs)
-    test_loader = create_loader(split_idx['test'], 'test', **loader_kwargs)
-
-    if dist.is_initialized():
-        dist.barrier()  # sync before training
-
-    if safe_get_rank() == 0:
-        prep_time = round(time.perf_counter() - wall_clock_start, 2)
-        print("Total time before training begins (prep_time) =", prep_time,
-              "seconds")
-        print("Beginning training...")
-
-    val_accs, times, train_times, inference_times = [], [], [], []
-    best_val = 0.
-    start = time.perf_counter()
-    for epoch in range(1, args.epochs + 1):
-        train_start = time.perf_counter()
-        loss, train_acc = train(model, train_loader, optimizer)
-        train_end = time.perf_counter()
-        train_times.append(train_end - train_start)
-        inference_start = time.perf_counter()
-        train_acc = test(model, train_loader)
-        val_acc = test(model, val_loader)
-        inference_times.append(time.perf_counter() - inference_start)
-        val_accs.append(val_acc)
-
-        if safe_get_rank() == 0:
-            print(f'Epoch {epoch:02d}, Loss: {loss:.4f}, '
-                  f'Train: {train_acc:.4f}, Val: {val_acc:.4f}, '
-                  f'Time: {train_end - train_start:.4f}s')
-
-        times.append(time.perf_counter() - train_start)
-        best_val = max(best_val, val_acc)
-
-    if safe_get_rank() == 0:
-        print(f"Total time used: {time.perf_counter()-start:.4f}")
-        print("Final Validation: {:.4f} ± {:.4f}".format(
-            torch.tensor(val_accs).mean(),
-            torch.tensor(val_accs).std()))
-        print(f"Best validation accuracy: {best_val:.4f}")
-        print("Testing...")
-        final_test_acc = test(model, test_loader)
-        print(f'Test Accuracy: {final_test_acc:.4f}')
-        total_time = round(time.perf_counter() - wall_clock_start, 2)
-        print("Total Program Runtime (total_time) =", total_time, "seconds")
-
-    if dist.is_initialized():
-        dist.barrier()
-        dist.destroy_process_group()