PyAGC is a production-ready, modular library and comprehensive benchmark for Attributed Graph Clustering (AGC), built on PyTorch and PyTorch Geometric. It unifies 20+ state-of-the-art algorithms under a principled Encode-Cluster-Optimize (ECO) framework, provides mini-batch implementations that scale to 111 million nodes on a single 32GB GPU, and introduces a holistic evaluation protocol spanning supervised, unsupervised, and efficiency metrics across 12 diverse datasets.
Battle-tested in high-stakes industrial workflows at Ant Group (Fraud Detection, Anti-Money Laundering, User Profiling), PyAGC offers the community a robust, reproducible, and scalable platform to advance AGC research towards realistic deployment.
- Why PyAGC?
- Key Features
- Project Structure
- Installation
- Quick Start
- The ECO Framework
- Benchmark
- Usage
- Extending PyAGC
- FAQ
- Citation
- Contributing
- License
- Acknowledgements
Current AGC evaluation suffers from four critical limitations that PyAGC is designed to address:
| Problem | Status Quo | PyAGC Solution |
|---|---|---|
| The Cora-fication of Datasets | Over-reliance on small, homophilous citation networks | 12 datasets spanning 5 orders of magnitude, including industrial graphs with tabular features and low homophily |
| The Scalability Bottleneck | Full-batch training limits methods to ~10⁵ nodes | Mini-batch implementations enabling training on 111M+ nodes with a single 32GB GPU |
| The Supervised Metric Paradox | Unsupervised methods evaluated only with supervised metrics | Holistic evaluation with unsupervised structural metrics (Modularity, Conductance) + efficiency profiling |
| The Reproducibility Gap | Scattered codebases with hard-coded parameters | Unified, configuration-driven framework with strict YAML-based experiment management |
- 📊 Diverse Dataset Collection — 12 graphs from 2.7K to 111M nodes across Citation, Social, E-commerce, and Web domains, featuring both textual and tabular attributes with varying homophily levels.
- 🧩 Unified Algorithm Framework — 20+ SOTA methods organized under the Encode-Cluster-Optimize taxonomy with modular, interchangeable encoders, cluster heads, and optimization strategies.
- 📏 Holistic Evaluation Protocol — Supervised metrics (ACC, NMI, ARI, F1), unsupervised structural metrics (Modularity, Conductance), and comprehensive efficiency profiling (time, memory).
- 🚀 Production-Grade Scalability — GPU-accelerated KMeans (via PyTorch + Triton) and neighbor-sampling-based mini-batch training that scales deep clustering to 111M nodes on a single 32GB V100 GPU (see the sketch after this list).
- 🛠️ Developer-Friendly Design — Plug-and-play components, YAML-driven configuration, and clean abstractions that make prototyping new methods as easy as swapping a single config line.
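On the scalability point above: the library ships a GPU-accelerated KMeans built on PyTorch + Triton. The snippet below is not that implementation, just a minimal pure-PyTorch Lloyd's iteration that illustrates the idea of keeping embeddings, assignments, and centroids on the GPU throughout:

```python
import torch

def kmeans_gpu(z: torch.Tensor, k: int, iters: int = 50) -> torch.Tensor:
    """Minimal Lloyd's algorithm on GPU tensors (illustrative sketch only)."""
    # Initialize centroids from k randomly chosen embeddings
    centroids = z[torch.randperm(z.size(0), device=z.device)[:k]].clone()
    for _ in range(iters):
        # Assign every node to its nearest centroid, entirely on-device
        assign = torch.cdist(z, centroids).argmin(dim=1)
        # Recompute each centroid as the mean of its assigned embeddings
        for c in range(k):
            mask = assign == c
            if mask.any():
                centroids[c] = z[mask].mean(dim=0)
    return assign
```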
PyAGC/
├── pyagc/ # Core library
│ ├── encoders/ # GNN backbones (GCN, GAT, SAGE, GIN, Transformers)
│ ├── clusters/ # Cluster heads (KMeans, DEC, DMoN, MinCut, Neuromap, ...)
│ ├── models/ # Full model implementations (20+ methods)
│ ├── data/ # Unified dataset loaders
│ ├── metrics/ # Supervised + unsupervised metrics
│ ├── transforms/ # Graph augmentations (edge drop, feature mask)
│ └── utils/ # Checkpointing, logging, misc utilities
├── benchmark/ # Reproducible experiments
│ ├── <Method>/ # Per-method directory
│ │ ├── main.py # Entry point
│ │ ├── train.conf.yaml # Hyperparameter configuration
│ │ └── logs/ # Experiment logs per dataset
│ ├── data/ # Cached datasets
│ └── results/ # Aggregated benchmark results
├── tests/ # Unit tests
└── docs/ # Documentation (Sphinx → ReadTheDocs)
pip install pyagc

Or install from source:

git clone https://github.com/Cloudy1225/PyAGC.git
cd PyAGC
pip install -e .

Requirements:

- Python >= 3.10
- PyTorch >= 2.6.0
- PyTorch Geometric >= 2.7.0
import torch
from torch_geometric.data import Data
from pyagc.data import get_dataset
from pyagc.encoders import GCN
from pyagc.models import DGI
from pyagc.clusters import KMeansClusterHead
from pyagc.metrics import label_metrics, structure_metrics
# 1. Load dataset
x, edge_index, y = get_dataset('Cora', root='data/')
data = Data(x=x, edge_index=edge_index, y=y)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# 2. Build model (Encode + Optimize)
encoder = GCN(in_channels=data.num_features, hidden_channels=512, num_layers=1)
model = DGI(hidden_channels=512, encoder=encoder).to(device)
# 3. Train encoder
data = data.to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
for epoch in range(200):
loss = model.train_full(data, optimizer, epoch, verbose=(epoch % 50 == 0))
# 4. Cluster (Cluster projection)
model.eval()
with torch.no_grad():
z = model.infer_full(data)
n_clusters = int(y.max().item()) + 1
kmeans = KMeansClusterHead(n_clusters=n_clusters)
clusters = kmeans.fit_predict(z)
# 5. Evaluate — supervised + unsupervised
sup = label_metrics(y, clusters, metrics=['ACC', 'NMI', 'ARI', 'F1'])
unsup = structure_metrics(edge_index, clusters, metrics=['Modularity', 'Conductance'])
print(f"ACC: {sup['ACC']:.4f} | NMI: {sup['NMI']:.4f} | ARI: {sup['ARI']:.4f}")
print(f"Modularity: {unsup['Modularity']:.4f} | Conductance: {unsup['Conductance']:.4f}")PyAGC organizes the landscape of AGC algorithms under a unified Encode-Cluster-Optimize (ECO) framework:
┌────────────────────────────────────────────────────┐
│ Encode-Cluster-Optimize │
│ │
(A, X) ──────► │ ┌──────────┐ ┌───────────┐ ┌────────────┐ │ ──────► Clusters
│ │ Encoder │───►│ Cluster │◄──►│ Optimizer │ │
│ │ (E) │ │ Head (C) │ │ (O) │ │
│ └──────────┘ └───────────┘ └────────────┘ │
└────────────────────────────────────────────────────┘
| Module | Options | Examples |
|---|---|---|
| Encoder | Parametric | GCN, GAT, GraphSAGE, GIN, SGFormer, Polynormer |
| | Non-Parametric | Fixed graph filters, adaptive smoothing, Markov diffusion |
| Cluster | Differentiable | Softmax pooling (DMoN, MinCut, Neuromap), Prototype-based (DEC, DinkNet) |
| | Discrete (Post-hoc) | KMeans, Spectral Clustering, Subspace Clustering |
| Optimizer | Joint | End-to-end: self-supervised + clustering-specific loss |
| | Decoupled | Pre-train encoder → Apply discrete clustering |
This decomposition enables plug-and-play experimentation — swap a GCN encoder for a GAT within DAEGC by changing one line in the config file.
Our benchmark curates 12 datasets spanning 5 orders of magnitude in scale, diverse domains, feature modalities, and homophily levels:
| Scale | Dataset | Domain | #Nodes | #Edges | Avg. Deg. | #Feat. | Feat. Type | #Clusters | $\mathcal{H}_e$ | $\mathcal{H}_n$ |
|---|---|---|---|---|---|---|---|---|---|---|
| Tiny | Cora | Citation | 2,708 | 10,556 | 3.9 | 1,433 | Textual | 7 | 0.81 | 0.83 |
| Tiny | Photo | Co-purchase | 7,650 | 238,162 | 31.1 | 745 | Textual | 8 | 0.83 | 0.84 |
| Small | Physics | Co-author | 34,493 | 495,924 | 14.4 | 8,415 | Textual | 5 | 0.93 | 0.92 |
| Small | HM | Co-purchase | 46,563 | 21,461,990 | 460.9 | 120 | Tabular | 21 | 0.16 | 0.35 |
| Small | Flickr | Social | 89,250 | 899,756 | 10.1 | 500 | Textual | 7 | 0.32 | 0.32 |
| Medium | ArXiv | Citation | 169,343 | 1,166,243 | 6.9 | 128 | Textual | 40 | 0.65 | 0.64 |
| Medium | Social | 232,965 | 23,213,838 | 99.6 | 602 | Textual | 41 | 0.78 | 0.81 | |
| Medium | MAG | Citation | 736,389 | 10,792,672 | 14.7 | 128 | Textual | 349 | 0.30 | 0.31 |
| Large | Pokec | Social | 1,632,803 | 44,603,928 | 27.3 | 56 | Tabular | 183 | 0.43 | 0.39 |
| Large | Products | Co-purchase | 2,449,029 | 61,859,140 | 25.4 | 100 | Textual | 47 | 0.81 | 0.82 |
| Large | WebTopic | Web | 2,890,331 | 24,754,822 | 8.6 | 528 | Tabular | 28 | 0.22 | 0.24 |
| Massive | Papers100M | Citation | 111,059,956 | 1,615,685,872 | 14.5 | 128 | Textual | 172 | 0.57 | 0.50 |
Key diversity dimensions:
- Scale: 5 orders of magnitude (2.7K → 111M nodes)
- Attributes: textual (bag-of-words, embeddings) and tabular (categorical + numerical)
- Structure: high-homophily (Physics, $\mathcal{H}_e = 0.93$) to heterophilous (HM, $\mathcal{H}_e = 0.16$); the standard edge-homophily definition is sketched below
- Domain: citation, co-purchase, co-author, social networks, web graphs
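For reference, the edge homophily $\mathcal{H}_e$ reported above follows the standard definition (fraction of edges whose endpoints share a label); the second homophily column in the table is assumed here to be the usual node-level average $\mathcal{H}_n$:

$$
\mathcal{H}_e = \frac{\bigl|\{(u,v)\in E : y_u = y_v\}\bigr|}{|E|},
\qquad
\mathcal{H}_n = \frac{1}{|V|}\sum_{v\in V}\frac{\bigl|\{u\in \mathcal{N}(v) : y_u = y_v\}\bigr|}{|\mathcal{N}(v)|}.
$$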
| Method | Venue | Encoder | Clusterer | Optimization |
|---|---|---|---|---|
| KMeans | — | None (raw features) | Discrete (KMeans) | Decoupled |
| Node2Vec | KDD'16 | Random Walk | Discrete (KMeans) | Decoupled |
| Method | Venue | Encoder | Clusterer | Optimization |
|---|---|---|---|---|
| SSGC | ICLR'21 | Adaptive Filter | Discrete (KMeans) | Decoupled |
| SAGSC | AAAI'23 | Fixed Filter | Discrete (Subspace) | Decoupled |
| MS2CAG | KDD'25 | Fixed Filter | Discrete (SNEM) | Decoupled |
| Method | Venue | Encoder | Clusterer | Core Objective |
|---|---|---|---|---|
| GAE | NeurIPS-W'16 | GCN | KMeans | Graph Reconstruction |
| DGI | ICLR'19 | GCN | KMeans | Mutual Info Maximization |
| CCASSG | NeurIPS'21 | GCN | KMeans | Redundancy Reduction |
| S3GC | NeurIPS'22 | GCN | KMeans | Contrastive (Random Walk) |
| NS4GC | TKDE'24 | GCN | KMeans | Contrastive (Node Similarity) |
| MAGI | KDD'24 | GNN | KMeans | Contrastive (Modularity) |
| Method | Venue | Encoder | Clusterer | Core Objective |
|---|---|---|---|---|
| DAEGC | IJCAI'19 | GAT | Prototype (DEC) | Reconstruction + KL Div. |
| MinCut | ICML'20 | GCN | Softmax | Cut Minimization |
| DMoN | JMLR'23 | GCN | Softmax | Modularity Maximization |
| DinkNet | ICML'23 | GCN | Prototype | Dilation + Shrink Loss |
| Neuromap | NeurIPS'24 | GCN | Softmax | Map Equation |
We advocate for a holistic evaluation that addresses the supervised metric paradox by looking beyond label-based metrics alone:
Measure agreement with ground-truth labels (when available):
- ACC — Clustering Accuracy (with optimal Hungarian matching; see the sketch after this list)
- NMI — Normalized Mutual Information
- ARI — Adjusted Rand Index
- Macro-F1 — Macro-averaged F1 Score
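`pyagc.metrics.label_metrics` computes ACC for you; as a reference, the sketch below shows the usual Hungarian-matching computation behind it (a minimal version using `scipy.optimize.linear_sum_assignment`, not the library's exact code):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """ACC under the optimal one-to-one matching of clusters to labels."""
    # Contingency matrix: rows = predicted clusters, columns = true labels
    n = max(y_pred.max(), y_true.max()) + 1
    w = np.zeros((n, n), dtype=np.int64)
    for p, t in zip(y_pred, y_true):
        w[p, t] += 1
    # Hungarian algorithm maximizes total agreement (minimizes w.max() - w)
    rows, cols = linear_sum_assignment(w.max() - w)
    return w[rows, cols].sum() / y_pred.size
```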
Assess intrinsic cluster quality without labels — critical for real-world deployment:
- Modularity — density of within-cluster edges vs. random expectation (↑ better)
- Conductance — fraction of edge volume pointing outside clusters (↓ better); standard definitions of both are sketched below
- Efficiency — training time, inference latency, and peak GPU memory consumption
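For reference, the two structural metrics follow their standard definitions, shown below for a partition with cluster assignments $c_i$ over clusters $S_1, \dots, S_K$ (the exact per-cluster aggregation used by `structure_metrics` may differ slightly):

$$
Q = \frac{1}{2m}\sum_{i,j}\Bigl(A_{ij} - \frac{d_i d_j}{2m}\Bigr)\,\delta(c_i, c_j),
\qquad
\phi(S_k) = \frac{\operatorname{cut}(S_k, \bar{S}_k)}{\operatorname{vol}(S_k)},
$$

where $m$ is the number of edges, $d_i$ the degree of node $i$, $\operatorname{cut}(S_k, \bar{S}_k)$ the number of edges leaving cluster $S_k$, and $\operatorname{vol}(S_k)$ its total degree.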
from pyagc.metrics import label_metrics, structure_metrics
# Supervised
sup = label_metrics(y_true, y_pred, metrics=['ACC', 'NMI', 'ARI', 'F1'])
# Unsupervised
unsup = structure_metrics(edge_index, y_pred, metrics=['Modularity', 'Conductance'])

Full results with all metrics are available in benchmark/results/ and our paper.
📋 Complete benchmark results including ACC, ARI, F1, Modularity, Conductance, training time, and GPU memory are available in the Structured Results and Unstructured Results.
All experiments are fully reproducible via configuration files:
# Reproduce exact benchmark results
cd benchmark/DMoN
python main.py --config train.conf.yaml --dataset Cora --seed 0
python main.py --config train.conf.yaml --dataset Cora --seed 1
python main.py --config train.conf.yaml --dataset Cora --seed 2
python main.py --config train.conf.yaml --dataset Cora --seed 3
python main.py --config train.conf.yaml --dataset Cora --seed 4

Each run produces a timestamped log file in benchmark/<Method>/logs/<Dataset>/ containing:
- All hyperparameters
- Training loss curves
- Final metric values (supervised + unsupervised)
- Runtime and memory statistics
Each algorithm has a self-contained directory with main.py and a YAML configuration:
# Run DMoN on Cora
cd benchmark/DMoN
python main.py --config train.conf.yaml --dataset Cora
# Run DAEGC on Reddit (mini-batch)
cd benchmark/DAEGC
python main.py --config train.conf.yaml --dataset Reddit

Results are automatically logged to benchmark/<Method>/logs/<Dataset>/.
PyAGC's modular design makes it easy to compose new methods:
from pyagc.encoders import GCN, GAT
from pyagc.models import DMoN
# Swap GCN → GAT in DMoN by changing one line
encoder = GAT(in_channels=1433, hidden_channels=256, num_layers=2)
model = DMoN(encoder=encoder, n_features=256, n_clusters=7)

Or simply modify the YAML config:
encoder:
type: GAT # Changed from GCN
hidden_channels: 256
num_layers: 2
cluster:
type: DMoN
  n_clusters: 7

PyAGC enables training on massive graphs via mini-batch neighbor sampling:
from torch_geometric.loader import NeighborLoader
# Create mini-batch loader
loader = NeighborLoader(
data,
num_neighbors=[15, 10],
batch_size=1024,
shuffle=True,
)
# Mini-batch training loop
for batch in loader:
batch = batch.to(device)
    loss = model.train_mini_batch(batch, optimizer)

Scalability highlight: Complex models (e.g., DAEGC) can be trained on Papers100M (111M nodes, 1.6B edges) on a single 32GB V100 GPU in under 2 hours.
from pyagc.encoders import GCN
# Use any PyG-compatible encoder
encoder = GCN(
in_channels=128,
hidden_channels=256,
num_layers=3,
dropout=0.1
)
# Plug into any model
model = DMoN(encoder=encoder, n_clusters=7)

# pyagc/clusters/my_cluster_head.py
from pyagc.clusters import BaseClusterHead
class MyClusterHead(BaseClusterHead):
def __init__(self, n_clusters, in_channels):
super().__init__(n_clusters)
# Define learnable parameters
...
def forward(self, *args, **kwargs):
# Return clustering loss
...
return loss
def cluster(self, z, soft=True):
# Return soft assignment matrix P of shape [N, K]
...
        return p
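A hypothetical usage of the head sketched above, following the constructor and method signatures defined there (`z` and `edge_index` are node embeddings and the edge list from any encoder/graph):

```python
# Illustrative only: plugging the custom head into a clustering pipeline
head = MyClusterHead(n_clusters=7, in_channels=256)
loss = head(z, edge_index)       # forward(...) returns the clustering loss
p = head.cluster(z, soft=True)   # soft assignment matrix of shape [N, K]
clusters = p.argmax(dim=1)       # hard cluster labels
```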
# pyagc/models/my_model.py
from pyagc.models import BaseModel
class MyModel(BaseModel):
def __init__(self, encoder, cluster_head):
super().__init__()
self.encoder = encoder
self.cluster_head = cluster_head
def forward(self, data):
z = self.encoder(data.x, data.edge_index)
return z
def loss(self, data):
z = self.forward(data)
rep_loss = ... # Representation learning loss
clust_loss = self.cluster_head(z, data.edge_index)
        return rep_loss + self.lambda_ * clust_loss
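A hypothetical training loop for `MyModel`, assuming `self.lambda_` is set in `__init__` and that `encoder`, `cluster_head`, `data`, and `device` are defined as in the earlier examples:

```python
# Illustrative only: joint (end-to-end) training of the custom model above
model = MyModel(encoder, cluster_head).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
data = data.to(device)

for epoch in range(200):
    model.train()
    optimizer.zero_grad()
    loss = model.loss(data)   # representation loss + weighted clustering loss
    loss.backward()
    optimizer.step()
```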
Q: How do I run experiments on my own graph?

1. Format your graph as a PyTorch Geometric `Data` object with `x` (node features), `edge_index` (edge list), and optionally `y` (labels for evaluation).
2. Use any model from `pyagc.models` with your chosen encoder and cluster head. A minimal sketch is shown below.
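A minimal end-to-end sketch mirroring the Quick Start API above, with random tensors as stand-ins for your own features and edge list (replace them with your data; the number of clusters is an arbitrary choice here):

```python
import torch
from torch_geometric.data import Data
from pyagc.encoders import GCN
from pyagc.models import DGI
from pyagc.clusters import KMeansClusterHead

# Stand-ins for your own graph: node features and a COO edge list
x = torch.randn(1000, 64)                        # [num_nodes, num_features]
edge_index = torch.randint(0, 1000, (2, 5000))   # [2, num_edges]
data = Data(x=x, edge_index=edge_index)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
encoder = GCN(in_channels=64, hidden_channels=512, num_layers=1)
model = DGI(hidden_channels=512, encoder=encoder).to(device)
data = data.to(device)

optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
for epoch in range(200):
    model.train_full(data, optimizer, epoch, verbose=False)

model.eval()
with torch.no_grad():
    z = model.infer_full(data)
clusters = KMeansClusterHead(n_clusters=10).fit_predict(z)  # pick K for your graph
```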
Q: Can I use PyAGC without ground-truth labels?

Absolutely — this is the core use case PyAGC is designed for. Use unsupervised structural metrics (Modularity, Conductance) via `pyagc.metrics.structure_metrics` to evaluate cluster quality without any labels.

Q: How does mini-batch training work for graph clustering?
We use neighbor sampling (via PyTorch Geometric's `NeighborLoader`) to create computational subgraphs. The encoder processes these subgraphs, and losses are approximated over mini-batches. This decouples GPU memory from graph size, enabling training on graphs with 100M+ nodes on a single GPU.

Q: What GPU do I need?
All benchmark experiments were conducted on a single NVIDIA Tesla V100 (32GB). For small/medium datasets, a GPU with 8–16GB is sufficient. For Papers100M, we recommend at least 32GB GPU memory.

If you find PyAGC useful in your research, please cite our paper:
@article{liu2026bridging,
title={Bridging Academia and Industry: A Comprehensive Benchmark for Attributed Graph Clustering},
author={Yunhui Liu and Pengyu Qiu and Yu Xing and Yongchao Liu and Peng Du and Chuntao Hong and Jiajun Zheng and Tao Zheng and Tieke He},
year={2026},
eprint={2602.08519},
archivePrefix={arXiv},
primaryClass={cs.LG}
}

We welcome contributions! Please see our contributing guidelines:
- Bug Reports: Open an issue with a minimal reproducible example.
- New Methods: Submit a PR adding your method under the ECO framework with a main.py, train.conf.yaml, and unit tests.
- New Datasets: Submit a PR with a data loader and dataset description.
- Documentation: Improvements to docs, tutorials, and examples are always appreciated.
PyAGC is released under the MIT License.
PyAGC is built upon the excellent open-source ecosystem of PyTorch and PyTorch Geometric.
We thank Ant Group for supporting the industrial validation of this benchmark.
GitHub · PyPI · Documentation · Paper
Made with ❤️ for the Graph ML Community
