This repository contains the Python code for the paper:
PipeTGL: (Near) Zero Bubble Memory-based Temporal Graph Neural Network Training via Pipeline Optimization
In this paper, we propose a pipeline-parallel approach for multi-GPU M-TGNN training that addresses both inter-minibatch memory dependencies and intra-minibatch task dependencies, based on a runtime-analysis DAG for M-TGNNs. We further improve pipeline efficiency through better scheduling, finer-grained operation reorganization, and communication optimizations tailored to the training characteristics of M-TGNNs. These enhancements substantially reduce the GPU waiting and idle time caused by memory dependencies and frequent communication, yielding zero pipeline bubbles under common training configurations. Extensive evaluations show that PipeTGL achieves a 1.27x to 4.74x speedup over other baselines while also improving the accuracy of multi-GPU M-TGNN training.
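For intuition, below is a minimal, illustrative PyTorch sketch (not PipeTGL's actual code) of the inter-minibatch memory dependency described above: each minibatch reads node memory that the previous minibatch wrote, so naive multi-GPU execution serializes on these updates. The class and shapes are hypothetical, loosely following the GRU-based memory update used by TGN-style models.

```python
# A minimal, illustrative sketch (NOT PipeTGL's code) of the
# inter-minibatch memory dependency in M-TGNN training: minibatch i+1
# reads node memory that minibatch i writes, which is the dependency
# the pipeline schedule must respect.
import torch
import torch.nn as nn

class ToyNodeMemory(nn.Module):
    def __init__(self, num_nodes: int, dim: int):
        super().__init__()
        # Persistent per-node state, read and written across minibatches.
        self.register_buffer("memory", torch.zeros(num_nodes, dim))
        self.updater = nn.GRUCell(dim, dim)  # GRU-style update, as in TGN

    def read(self, nodes: torch.Tensor) -> torch.Tensor:
        return self.memory[nodes]

    @torch.no_grad()
    def write(self, nodes: torch.Tensor, messages: torch.Tensor) -> None:
        # The write for minibatch i must finish before minibatch i+1
        # reads overlapping nodes.
        self.memory[nodes] = self.updater(messages, self.memory[nodes])

mem = ToyNodeMemory(num_nodes=100, dim=16)
for step in range(2):                     # two consecutive minibatches
    nodes = torch.tensor([0, 1, 2])       # overlapping nodes across batches
    h = mem.read(nodes)                   # forward pass reads current memory
    mem.write(nodes, torch.randn(3, 16))  # memory update after the batch
```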
Our development environment:
- Ubuntu 22.04
- g++ 9.4
- CUDA 12.4
- cmake 3.23
- torch >= 1.10
- dgl >= 1.1.1 (CUDA version)
- tqdm >= 4.66.2
- scikit-learn >= 1.3.0
- pandas >= 2.2.1
- numpy >= 1.26.4
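A quick, illustrative way to check that your environment matches the versions listed above (package names are the standard PyPI ones):

```python
# Illustrative environment check against the versions listed above.
import torch, dgl, tqdm, sklearn, pandas, numpy

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("dgl:", dgl.__version__)
print("tqdm:", tqdm.__version__, "| scikit-learn:", sklearn.__version__)
print("pandas:", pandas.__version__, "| numpy:", numpy.__version__)
```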
```bash
git clone https://github.com/magicmoo/PipeTGL.git
```
PipeTGL is implemented on top of GNNFlow, so install GNNFlow in your Python environment before running PipeTGL:
```bash
git clone https://github.com/jasperzhong/GNNFlow.git
cd GNNFlow
python setup.py install
```
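To verify the install, you can try importing the package; this assumes GNNFlow installs under the package name `gnnflow` (the repo's top-level Python package):

```python
# Illustrative sanity check; the package name "gnnflow" is an assumption.
import gnnflow
print("GNNFlow imported from:", gnnflow.__file__)
```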
Then download the datasets:

```bash
cd scripts/ && ./download_data.sh
```
Train the TGN model on the `<DatasetName>` dataset for `<NumberOfEpochs>` epochs on `<NumberOfGpus>` GPUs within a single machine:
```bash
./scripts/run_single_mechine.sh <NumberOfEpochs> <DatasetName> <NumberOfGpus>

# Example: train TGN on the REDDIT dataset for 50 epochs on 4 GPUs within a single machine
./scripts/run_single_mechine.sh 50 REDDIT 4
```
Train the TGN model on the `<DatasetName>` dataset for `<NumberOfEpochs>` epochs on `<NumberOfGpus>` GPUs per machine across `<NumberOfNodes>` machines in total:
```bash
./scripts/run_multi_mechines.sh <NumberOfEpochs> <DatasetName> <NumberOfGpus> <NumberOfNodes>

# Example: train TGN on the GDELT dataset for 100 epochs on 4 GPUs per machine across 2 machines
./scripts/run_multi_mechines.sh 100 GDELT 4 2
```