Add RelSC benchmark datasets and tests #10630
Open
MarcusVukojevic wants to merge 4 commits into pyg-team:master
This PR introduces the RelSC-H (homogeneous) and RelSC-M (multi-relational) datasets, which together form a new benchmark for graph-level regression. Unlike most existing benchmarks, which focus on molecules or citation networks, RelSC provides large, directed program graphs extracted from Java source code, with execution-time cost as the prediction target.
Key Features:
RelSC-H: A homogeneous variant providing rich node features on flow-augmented Abstract Syntax Trees (ASTs).
RelSC-M: A multi-relational variant that preserves semantic relationships by categorizing nodes into 7 semantic groups with up to 49 unique relation types.
Domain: Software Engineering / Performance Prediction.
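One way to read the "up to 49" figure for RelSC-M, assuming a relation type is identified by an ordered (source group, target group) pair on a directed edge: 7 groups give at most 7 × 7 = 49 such pairs. The PR text does not state this definition or list the group names, so both the pairing rule and the `group_i` placeholders below are assumptions made for illustration.

```python
from itertools import product

# Placeholder names: the PR text does not list the seven semantic groups.
NODE_GROUPS = [f"group_{i}" for i in range(7)]

# Assumption: a relation type is an ordered (source, target) pair of groups,
# so (a, b) and (b, a) count separately because edges are directed.
relation_types = list(product(NODE_GROUPS, repeat=2))

print(len(relation_types))  # 49
```

Under this reading, a given graph would use only the pairs that actually occur among its edges, which is consistent with the "up to" wording.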
Academic Context:
The associated paper, "A Benchmark Dataset for Graph Regression with Homogeneous and Multi-Relational Variants", is currently in the final stage of review at the Journal of Data-centric Machine Learning Research (DMLR).
Preprint: arXiv:2505.23875
Resources & Reproducibility:
Official Project Page: https://github.com/MarcusVukojevic/graph_regression_datasets
The project page includes comprehensive tutorials, scripts to reproduce paper results, and tools to build custom versions of the dataset from source code.
Implementation Details:
Both variants are implemented in a single relsc.py file to share data loading and download logic.
Unit tests are included in test/datasets/test_relsc.py, using generated dummy data to ensure CI passes without requiring large Zenodo downloads.
CHANGELOG.md and datasets/__init__.py have been updated.
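The two implementation points above (one module with shared loading logic, and tests that run on generated dummy data instead of downloading) can be sketched in plain Python. This is a simplified stand-in, not the code in relsc.py: the class names, the JSON file layout, and the `write_dummy` helper are all hypothetical, and the real datasets would build on PyG's dataset classes rather than this minimal base.

```python
import json
import tempfile
from pathlib import Path


class _RelSCBase:
    """Hypothetical shared loader; not the class defined in the PR."""

    variant = None  # set by each subclass

    def __init__(self, root):
        path = Path(root) / f"{self.variant}.json"
        if not path.exists():
            self.download(path)
        self.graphs = json.loads(path.read_text())

    def download(self, path):
        # A real dataset would fetch its archive here; this sketch refuses,
        # so a test can prove it never needs the network.
        raise RuntimeError(f"no local copy at {path}")

    def __len__(self):
        return len(self.graphs)


class RelSCH(_RelSCBase):
    variant = "relsc_h"


class RelSCM(_RelSCBase):
    variant = "relsc_m"


def write_dummy(root, variant, num_graphs=3):
    """Write tiny fake graphs so loading never reaches download()."""
    graphs = [{"edges": [[0, 1]], "y": float(i)} for i in range(num_graphs)]
    Path(root, f"{variant}.json").write_text(json.dumps(graphs))


with tempfile.TemporaryDirectory() as tmp:
    write_dummy(tmp, "relsc_h")
    write_dummy(tmp, "relsc_m", num_graphs=2)
    sizes = (len(RelSCH(tmp)), len(RelSCM(tmp)))
```

Because the dummy files already exist when each class is constructed, `download` is never called, which is the property that lets such tests pass in CI without large downloads.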
Checklist:
[x] I have updated the CHANGELOG.md
[x] I have added unit tests
[x] I have updated torch_geometric/datasets/__init__.py
[x] Documentation follows the PyG style guide and includes dataset statistics.