PyAGC Logo

Bridging Academia and Industry for Attributed Graph Clustering


Paper | Documentation | PyPI | Benchmark Results

PyAGC is a production-ready, modular library and comprehensive benchmark for Attributed Graph Clustering (AGC), built on PyTorch and PyTorch Geometric. It unifies 20+ state-of-the-art algorithms under a principled Encode-Cluster-Optimize (ECO) framework, provides mini-batch implementations that scale to 111 million nodes on a single 32GB GPU, and introduces a holistic evaluation protocol spanning supervised, unsupervised, and efficiency metrics across 12 diverse datasets.

Battle-tested in high-stakes industrial workflows at Ant Group (Fraud Detection, Anti-Money Laundering, User Profiling), PyAGC offers the community a robust, reproducible, and scalable platform to advance AGC research towards realistic deployment.

PyAGC Framework Overview

Why PyAGC?

Current AGC evaluation suffers from four critical limitations that PyAGC is designed to address:

| Problem | Status Quo | PyAGC Solution |
|---|---|---|
| The Cora-fication of Datasets | Over-reliance on small, homophilous citation networks | 12 datasets spanning 5 orders of magnitude, including industrial graphs with tabular features and low homophily |
| The Scalability Bottleneck | Full-batch training limits methods to ~10⁵ nodes | Mini-batch implementations enabling training on 111M+ nodes with a single 32GB GPU |
| The Supervised Metric Paradox | Unsupervised methods evaluated only with supervised metrics | Holistic evaluation with unsupervised structural metrics (Modularity, Conductance) + efficiency profiling |
| The Reproducibility Gap | Scattered codebases with hard-coded parameters | Unified, configuration-driven framework with strict YAML-based experiment management |

Key Features

  • 📊 Diverse Dataset Collection — 12 graphs from 2.7K to 111M nodes across Citation, Social, E-commerce, and Web domains, featuring both textual and tabular attributes with varying homophily levels.

  • 🧩 Unified Algorithm Framework — 20+ SOTA methods organized under the Encode-Cluster-Optimize taxonomy with modular, interchangeable encoders, cluster heads, and optimization strategies.

  • 📏 Holistic Evaluation Protocol — Supervised metrics (ACC, NMI, ARI, F1), unsupervised structural metrics (Modularity, Conductance), and comprehensive efficiency profiling (time, memory).

  • 🚀 Production-Grade Scalability — GPU-accelerated KMeans (via PyTorch + Triton) and neighbor-sampling-based mini-batch training that scales deep clustering to 111M nodes on a single 32GB V100 GPU.

  • 🛠️ Developer-Friendly Design — Plug-and-play components, YAML-driven configuration, and clean abstractions that make prototyping new methods as easy as swapping a single config line.


Project Structure

PyAGC/
├── pyagc/                          # Core library
│   ├── encoders/                   # GNN backbones (GCN, GAT, SAGE, GIN, Transformers)
│   ├── clusters/                   # Cluster heads (KMeans, DEC, DMoN, MinCut, Neuromap, ...)
│   ├── models/                     # Full model implementations (20+ methods)
│   ├── data/                       # Unified dataset loaders
│   ├── metrics/                    # Supervised + unsupervised metrics
│   ├── transforms/                 # Graph augmentations (edge drop, feature mask)
│   └── utils/                      # Checkpointing, logging, misc utilities
├── benchmark/                      # Reproducible experiments
│   ├── <Method>/                   # Per-method directory
│   │   ├── main.py                 # Entry point
│   │   ├── train.conf.yaml         # Hyperparameter configuration
│   │   └── logs/                   # Experiment logs per dataset
│   ├── data/                       # Cached datasets
│   └── results/                    # Aggregated benchmark results
├── tests/                          # Unit tests
└── docs/                           # Documentation (Sphinx → ReadTheDocs)

Installation

From PyPI (Recommended)

pip install pyagc

From Source

git clone https://github.com/Cloudy1225/PyAGC.git
cd PyAGC
pip install -e .

Prerequisites

  • Python >= 3.10
  • PyTorch >= 2.6.0
  • PyTorch Geometric >= 2.7.0

Quick Start

import torch
from torch_geometric.data import Data
from pyagc.data import get_dataset
from pyagc.encoders import GCN
from pyagc.models import DGI
from pyagc.clusters import KMeansClusterHead
from pyagc.metrics import label_metrics, structure_metrics

# 1. Load dataset
x, edge_index, y = get_dataset('Cora', root='data/')
data = Data(x=x, edge_index=edge_index, y=y)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# 2. Build model (Encode + Optimize)
encoder = GCN(in_channels=data.num_features, hidden_channels=512, num_layers=1)
model = DGI(hidden_channels=512, encoder=encoder).to(device)

# 3. Train encoder
data = data.to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
for epoch in range(200):
    loss = model.train_full(data, optimizer, epoch, verbose=(epoch % 50 == 0))

# 4. Cluster (Cluster projection)
model.eval()
with torch.no_grad():
    z = model.infer_full(data)

n_clusters = int(y.max().item()) + 1
kmeans = KMeansClusterHead(n_clusters=n_clusters)
clusters = kmeans.fit_predict(z)

# 5. Evaluate — supervised + unsupervised
sup = label_metrics(y, clusters, metrics=['ACC', 'NMI', 'ARI', 'F1'])
unsup = structure_metrics(edge_index, clusters, metrics=['Modularity', 'Conductance'])
print(f"ACC: {sup['ACC']:.4f} | NMI: {sup['NMI']:.4f} | ARI: {sup['ARI']:.4f}")
print(f"Modularity: {unsup['Modularity']:.4f} | Conductance: {unsup['Conductance']:.4f}")

The ECO Framework

PyAGC organizes the landscape of AGC algorithms under a unified Encode-Cluster-Optimize (ECO) framework:

                    ┌────────────────────────────────────────────────────┐
                    │              Encode-Cluster-Optimize               │
                    │                                                    │
  (A, X) ──────►    │  ┌──────────┐    ┌───────────┐    ┌────────────┐   │ ──────► Clusters
                    │  │ Encoder  │───►│ Cluster   │◄──►│ Optimizer  │   │
                    │  │   (E)    │    │ Head (C)  │    │    (O)     │   │
                    │  └──────────┘    └───────────┘    └────────────┘   │
                    └────────────────────────────────────────────────────┘
| Module | Options | Examples |
|---|---|---|
| Encoder | Parametric | GCN, GAT, GraphSAGE, GIN, SGFormer, Polynormer |
| Encoder | Non-Parametric | Fixed graph filters, adaptive smoothing, Markov diffusion |
| Cluster | Differentiable | Softmax pooling (DMoN, MinCut, Neuromap), prototype-based (DEC, DinkNet) |
| Cluster | Discrete (post-hoc) | KMeans, spectral clustering, subspace clustering |
| Optimizer | Joint | End-to-end: self-supervised + clustering-specific loss |
| Optimizer | Decoupled | Pre-train encoder → apply discrete clustering |

This decomposition enables plug-and-play experimentation — swap a GCN encoder for a GAT within DAEGC by changing one line in the config file.


Benchmark

Datasets

Our benchmark curates 12 datasets spanning 5 orders of magnitude in scale, diverse domains, feature modalities, and homophily levels:

| Scale | Dataset | Domain | #Nodes | #Edges | Avg. Deg. | #Feat. | Feat. Type | #Clusters | $\mathcal{H}_e$ | $\mathcal{H}_n$ |
|---|---|---|---|---|---|---|---|---|---|---|
| Tiny | Cora | Citation | 2,708 | 10,556 | 3.9 | 1,433 | Textual | 7 | 0.81 | 0.83 |
| Tiny | Photo | Co-purchase | 7,650 | 238,162 | 31.1 | 745 | Textual | 8 | 0.83 | 0.84 |
| Small | Physics | Co-author | 34,493 | 495,924 | 14.4 | 8,415 | Textual | 5 | 0.93 | 0.92 |
| Small | HM | Co-purchase | 46,563 | 21,461,990 | 460.9 | 120 | Tabular | 21 | 0.16 | 0.35 |
| Small | Flickr | Social | 89,250 | 899,756 | 10.1 | 500 | Textual | 7 | 0.32 | 0.32 |
| Medium | ArXiv | Citation | 169,343 | 1,166,243 | 6.9 | 128 | Textual | 40 | 0.65 | 0.64 |
| Medium | Reddit | Social | 232,965 | 23,213,838 | 99.6 | 602 | Textual | 41 | 0.78 | 0.81 |
| Medium | MAG | Citation | 736,389 | 10,792,672 | 14.7 | 128 | Textual | 349 | 0.30 | 0.31 |
| Large | Pokec | Social | 1,632,803 | 44,603,928 | 27.3 | 56 | Tabular | 183 | 0.43 | 0.39 |
| Large | Products | Co-purchase | 2,449,029 | 61,859,140 | 25.4 | 100 | Textual | 47 | 0.81 | 0.82 |
| Large | WebTopic | Web | 2,890,331 | 24,754,822 | 8.6 | 528 | Tabular | 28 | 0.22 | 0.24 |
| Massive | Papers100M | Citation | 111,059,956 | 1,615,685,872 | 14.5 | 128 | Textual | 172 | 0.57 | 0.50 |

Key diversity dimensions:

  • Scale: 5 orders of magnitude (2.7K → 111M nodes)
  • Attributes: textual (bag-of-words, embeddings) and tabular (categorical + numerical)
  • Structure: high-homophily (Physics, $\mathcal{H}_e$=0.93) to heterophilous (HM, $\mathcal{H}_e$=0.16)
  • Domain: citation, co-purchase, co-author, social networks, web graphs
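
The homophily measures in the table can be computed directly from the edge list and labels. Below is a minimal NumPy sketch of the standard edge-homophily and node-homophily definitions (illustrative only; the exact formulas used in the paper may be class-insensitive variants):

```python
import numpy as np

def edge_homophily(edge_index, y):
    """H_e: fraction of edges whose endpoints share a label."""
    src, dst = edge_index
    return float(np.mean(y[src] == y[dst]))

def node_homophily(edge_index, y):
    """H_n: average per-node fraction of same-label neighbors."""
    src, dst = edge_index
    same = (y[src] == y[dst]).astype(float)
    num = np.bincount(src, weights=same, minlength=len(y))
    deg = np.bincount(src, minlength=len(y))
    mask = deg > 0  # ignore isolated nodes
    return float(np.mean(num[mask] / deg[mask]))

# Toy 4-node path graph, edges stored in both directions
edge_index = np.array([[0, 1, 1, 2, 2, 3],
                       [1, 0, 2, 1, 3, 2]])
y = np.array([0, 0, 1, 1])
print(edge_homophily(edge_index, y))  # 4 of the 6 directed edges connect same-label endpoints
print(node_homophily(edge_index, y))
```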

Algorithms

Traditional Methods

| Method | Venue | Encoder | Clusterer | Optimization |
|---|---|---|---|---|
| KMeans | — | None (raw features) | Discrete (KMeans) | Decoupled |
| Node2Vec | KDD'16 | Random Walk | Discrete (KMeans) | Decoupled |

Non-Parametric Methods

| Method | Venue | Encoder | Clusterer | Optimization |
|---|---|---|---|---|
| SSGC | ICLR'21 | Adaptive Filter | Discrete (KMeans) | Decoupled |
| SAGSC | AAAI'23 | Fixed Filter | Discrete (Subspace) | Decoupled |
| MS2CAG | KDD'25 | Fixed Filter | Discrete (SNEM) | Decoupled |

Deep Decoupled Methods

| Method | Venue | Encoder | Clusterer | Core Objective |
|---|---|---|---|---|
| GAE | NeurIPS-W'16 | GCN | KMeans | Graph Reconstruction |
| DGI | ICLR'19 | GCN | KMeans | Mutual Info Maximization |
| CCASSG | NeurIPS'21 | GCN | KMeans | Redundancy Reduction |
| S3GC | NeurIPS'22 | GCN | KMeans | Contrastive (Random Walk) |
| NS4GC | TKDE'24 | GCN | KMeans | Contrastive (Node Similarity) |
| MAGI | KDD'24 | GNN | KMeans | Contrastive (Modularity) |

Deep Joint Methods

| Method | Venue | Encoder | Clusterer | Core Objective |
|---|---|---|---|---|
| DAEGC | IJCAI'19 | GAT | Prototype (DEC) | Reconstruction + KL Div. |
| MinCut | ICML'20 | GCN | Softmax | Cut Minimization |
| DMoN | JMLR'23 | GCN | Softmax | Modularity Maximization |
| DinkNet | ICML'23 | GCN | Prototype | Dilation + Shrink Loss |
| Neuromap | NeurIPS'24 | GCN | Softmax | Map Equation |

Evaluation Protocol

We advocate a holistic evaluation protocol that goes beyond supervised metrics alone, avoiding the supervised metric paradox described above:

Supervised Alignment Metrics

Measure agreement with ground-truth labels (when available):

  • ACC — Clustering Accuracy (with optimal Hungarian matching)
  • NMI — Normalized Mutual Information
  • ARI — Adjusted Rand Index
  • Macro-F1 — Macro-averaged F1 Score
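
Clustering accuracy is not a plain accuracy: predicted cluster IDs must first be matched to label IDs. A minimal sketch of the Hungarian-matching step using `scipy.optimize.linear_sum_assignment` (illustrative, not PyAGC's internal implementation, which is exposed via `label_metrics`):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    """ACC: best accuracy over all one-to-one cluster-to-label mappings."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    k = int(max(y_true.max(), y_pred.max())) + 1
    # Contingency matrix: count[i, j] = how often cluster i carries label j
    count = np.zeros((k, k), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        count[p, t] += 1
    # Hungarian algorithm picks the mapping maximizing total matches
    row, col = linear_sum_assignment(count, maximize=True)
    return count[row, col].sum() / len(y_true)

# Cluster IDs are a permutation of the labels → perfect accuracy
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([2, 2, 0, 0, 1, 1])
print(clustering_accuracy(y_true, y_pred))  # 1.0
```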

Unsupervised Structural Metrics

Assess intrinsic cluster quality without labels — critical for real-world deployment:

  • Modularity — density of within-cluster edges vs. random expectation (↑ better)
  • Conductance — fraction of edge volume pointing outside clusters (↓ better)
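
Both structural metrics need only the edge list and a hard partition. A minimal NumPy sketch under the standard definitions (Newman modularity; conductance averaged over clusters). The aggregation in `structure_metrics` may differ, so treat this as illustrative:

```python
import numpy as np

def modularity(edge_index, clusters):
    """Newman modularity Q; edges are stored in both directions."""
    src, dst = edge_index
    n, m2 = len(clusters), len(src)        # m2 = 2m (directed edge count)
    deg = np.bincount(src, minlength=n)
    q = 0.0
    for c in np.unique(clusters):
        mask = clusters == c
        within = np.sum(mask[src] & mask[dst])   # within-edges, counted twice
        d_c = deg[mask].sum()
        q += within / m2 - (d_c / m2) ** 2
    return q

def conductance(edge_index, clusters):
    """Mean over clusters of cut(C) / min(vol(C), vol(rest)); lower is better."""
    src, dst = edge_index
    n = len(clusters)
    deg = np.bincount(src, minlength=n)
    vol_total = deg.sum()
    phis = []
    for c in np.unique(clusters):
        mask = clusters == c
        cut = np.sum(mask[src] & ~mask[dst])     # boundary edges leaving C
        vol = deg[mask].sum()
        denom = min(vol, vol_total - vol)
        if denom > 0:
            phis.append(cut / denom)
    return float(np.mean(phis))

# Two disconnected triangles, perfectly clustered
edge_index = np.array([[0, 1, 1, 2, 0, 2, 3, 4, 4, 5, 3, 5],
                       [1, 0, 2, 1, 2, 0, 4, 3, 5, 4, 5, 3]])
clusters = np.array([0, 0, 0, 1, 1, 1])
print(modularity(edge_index, clusters))   # 0.5
print(conductance(edge_index, clusters))  # 0.0
```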

Efficiency Profiling

  • Training time, inference latency, and peak GPU memory consumption
from pyagc.metrics import label_metrics, structure_metrics

# Supervised
sup = label_metrics(y_true, y_pred, metrics=['ACC', 'NMI', 'ARI', 'F1'])

# Unsupervised
unsup = structure_metrics(edge_index, y_pred, metrics=['Modularity', 'Conductance'])
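
The benchmark scripts record the efficiency numbers during training; the idea can be sketched stand-alone with only the standard library. On GPU you would instead bracket the run with `torch.cuda.reset_peak_memory_stats()` and read `torch.cuda.max_memory_allocated()`; the `profile` helper below is a hypothetical illustration, not a PyAGC utility:

```python
import time
import tracemalloc

def profile(fn, *args, **kwargs):
    """Run fn, returning (result, elapsed_seconds, peak_python_bytes)."""
    tracemalloc.start()
    t0 = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - t0
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, elapsed, peak

# Example: profile a dummy "training step"
def dummy_step():
    return sum(i * i for i in range(100_000))

out, secs, peak_bytes = profile(dummy_step)
print(f"{secs:.4f}s, peak {peak_bytes / 1024:.1f} KiB")
```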

Benchmark Results

Full results with all metrics are available in benchmark/results/ and our paper.

📋 Complete benchmark results including ACC, ARI, F1, Modularity, Conductance, training time, and GPU memory are available in the Structured Results and Unstructured Results.

Reproducibility

All experiments are fully reproducible via configuration files:

# Reproduce exact benchmark results (5 seeds)
cd benchmark/DMoN
for seed in 0 1 2 3 4; do
    python main.py --config train.conf.yaml --dataset Cora --seed $seed
done

Each run produces a timestamped log file in benchmark/<Method>/logs/<Dataset>/ containing:

  • All hyperparameters
  • Training loss curves
  • Final metric values (supervised + unsupervised)
  • Runtime and memory statistics

Usage

Running Benchmarks

Each algorithm has a self-contained directory with main.py and a YAML configuration:

# Run DMoN on Cora
cd benchmark/DMoN
python main.py --config train.conf.yaml --dataset Cora

# Run DAEGC on Reddit (mini-batch)
cd benchmark/DAEGC
python main.py --config train.conf.yaml --dataset Reddit

Results are automatically logged to benchmark/<Method>/logs/<Dataset>/.

Custom Experiments

PyAGC's modular design makes it easy to compose new methods:

from pyagc.encoders import GCN, GAT
from pyagc.models import DMoN

# Swap GCN → GAT in DMoN by changing one line
encoder = GAT(in_channels=1433, hidden_channels=256, num_layers=2)
model = DMoN(encoder=encoder, n_features=256, n_clusters=7)

Or simply modify the YAML config:

encoder:
  type: GAT            # Changed from GCN
  hidden_channels: 256
  num_layers: 2
cluster:
  type: DMoN
  n_clusters: 7

Scaling to Large Graphs

PyAGC enables training on massive graphs via mini-batch neighbor sampling:

from torch_geometric.loader import NeighborLoader

# Create mini-batch loader
loader = NeighborLoader(
    data,
    num_neighbors=[15, 10],
    batch_size=1024,
    shuffle=True,
)

# Mini-batch training loop
for batch in loader:
    batch = batch.to(device)
    loss = model.train_mini_batch(batch, optimizer)

Scalability highlight: Complex models (e.g., DAEGC) can be trained on Papers100M (111M nodes, 1.6B edges) on a single 32GB V100 GPU in under 2 hours.


Extending PyAGC

Adding a New Encoder

from pyagc.encoders import GCN

# Use any PyG-compatible encoder
encoder = GCN(
    in_channels=128,
    hidden_channels=256,
    num_layers=3,
    dropout=0.1
)

# Plug into any model
model = DMoN(encoder=encoder, n_clusters=7)

Adding a New Cluster Head

# pyagc/clusters/my_cluster_head.py
from pyagc.clusters import BaseClusterHead

class MyClusterHead(BaseClusterHead):
    def __init__(self, n_clusters, in_channels):
        super().__init__(n_clusters)
        # Define learnable parameters
        ...

    def forward(self, *args, **kwargs):
        # Return clustering loss
        ...
        return loss

    def cluster(self, z, soft=True):
        # Return soft assignment matrix P of shape [N, K]
        ...
        return p
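
For concreteness, here is a toy softmax-style head written in plain PyTorch. It is hypothetical and not part of PyAGC: it projects embeddings to K logits, uses an entropy-based anti-collapse regularizer as its loss, and returns soft assignments. A real head would implement a principled objective such as modularity maximization or cut minimization.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftmaxClusterHead(nn.Module):
    """Toy differentiable head: linear projection + softmax assignments."""

    def __init__(self, n_clusters, in_channels):
        super().__init__()
        self.proj = nn.Linear(in_channels, n_clusters)

    def forward(self, z):
        # Anti-collapse regularizer: loss is lowest when cluster sizes balance
        p = F.softmax(self.proj(z), dim=-1)
        mean_p = p.mean(dim=0)
        return (mean_p * torch.log(mean_p.clamp_min(1e-12))).sum()

    def cluster(self, z, soft=True):
        p = F.softmax(self.proj(z), dim=-1)  # [N, K] soft assignments
        return p if soft else p.argmax(dim=-1)

head = SoftmaxClusterHead(n_clusters=4, in_channels=16)
z = torch.randn(10, 16)
p = head.cluster(z)   # [10, 4], rows sum to 1
loss = head(z)        # differentiable scalar
```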

Adding a New Model

# pyagc/models/my_model.py
from pyagc.models import BaseModel

class MyModel(BaseModel):
    def __init__(self, encoder, cluster_head, lambda_=1.0):
        super().__init__()
        self.encoder = encoder
        self.cluster_head = cluster_head
        self.lambda_ = lambda_  # weight of the clustering loss

    def forward(self, data):
        z = self.encoder(data.x, data.edge_index)
        return z

    def loss(self, data):
        z = self.forward(data)
        rep_loss = ...       # Representation learning loss
        clust_loss = self.cluster_head(z, data.edge_index)
        return rep_loss + self.lambda_ * clust_loss

FAQ

Q: How do I run experiments on my own graph?
1. Format your graph as a PyTorch Geometric `Data` object with `x` (node features), `edge_index` (edge list), and optionally `y` (labels for evaluation).
2. Use any model from `pyagc.models` with your chosen encoder and cluster head.

Q: Can I use PyAGC without ground-truth labels?
Absolutely — this is the core use case PyAGC is designed for. Use unsupervised structural metrics (Modularity, Conductance) via `pyagc.metrics.structure_metrics` to evaluate cluster quality without any labels.

Q: How does mini-batch training work for graph clustering?
We use neighbor sampling (via PyTorch Geometric's `NeighborLoader`) to create computational subgraphs. The encoder processes these subgraphs, and losses are approximated over mini-batches. This decouples GPU memory from graph size, enabling training on graphs with 100M+ nodes on a single GPU.

Q: What GPU do I need?
All benchmark experiments were conducted on a single NVIDIA Tesla V100 (32GB). For small/medium datasets, a GPU with 8–16GB is sufficient. For Papers100M, we recommend at least 32GB of GPU memory.

Citation

If you find PyAGC useful in your research, please cite our paper:

@article{liu2026bridging,
  title={Bridging Academia and Industry: A Comprehensive Benchmark for Attributed Graph Clustering},
  author={Yunhui Liu and Pengyu Qiu and Yu Xing and Yongchao Liu and Peng Du and Chuntao Hong and Jiajun Zheng and Tao Zheng and Tieke He},
  year={2026},
  eprint={2602.08519},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
}

Contributing

We welcome contributions! Please see our contributing guidelines:

  1. Bug Reports: Open an issue with a minimal reproducible example.
  2. New Methods: Submit a PR adding your method under the ECO framework with a main.py, train.conf.yaml, and unit tests.
  3. New Datasets: Submit a PR with a data loader and dataset description.
  4. Documentation: Improvements to docs, tutorials, and examples are always appreciated.

License

PyAGC is released under the MIT License.


Acknowledgements

PyAGC is built upon the excellent open-source ecosystem, most notably PyTorch and PyTorch Geometric.

We thank Ant Group for supporting the industrial validation of this benchmark.


GitHub · PyPI · Documentation · Paper
Made with ❤️ for the Graph ML Community
