Add from_relbench utility to convert RelBench databases to HeteroData #10628

Open
AJamal27891 wants to merge 3 commits into pyg-team:master from AJamal27891:pr-10353-part1-relbench-base

@AJamal27891
Contributor

Description

This PR is Part 1 of 4 in splitting the monolithic Warehouse Intelligence system (#10353) into modular, composable pieces, as requested by the core maintainers.

This PR introduces the from_relbench utility to torch_geometric.utils.relbench. It allows users to convert complex, multi-table databases from the RelBench (Relational Deep Learning Benchmark) environment directly into PyG's native HeteroData format.
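
To make the conversion concrete, here is a minimal, dependency-free sketch of the general idea, not the code in this PR: each table becomes a node type whose rows are indexed by primary key, and each foreign-key column becomes an edge type pointing at the referenced table. The `Table` class, the `tables_to_hetero` helper, and the `f2p_`/`rev_f2p_` edge names are hypothetical and chosen only for illustration; the actual from_relbench signature and output layout may differ.

```python
from dataclasses import dataclass, field


@dataclass
class Table:
    """Hypothetical stand-in for one relational table."""
    rows: list                                 # one dict per row
    pkey: str                                  # primary-key column name
    fkeys: dict = field(default_factory=dict)  # fkey column -> referenced table


def tables_to_hetero(tables):
    """Map tables to node types and foreign keys to edge types."""
    # The row position within each table serves as the node index.
    index = {
        name: {row[t.pkey]: i for i, row in enumerate(t.rows)}
        for name, t in tables.items()
    }
    num_nodes = {name: len(t.rows) for name, t in tables.items()}
    edges = {}
    for name, t in tables.items():
        for col, target in t.fkeys.items():
            src, dst = [], []
            for i, row in enumerate(t.rows):
                key = row.get(col)
                if key is None or key not in index[target]:
                    continue  # skip missing or dangling references
                src.append(i)
                dst.append(index[target][key])
            edges[(name, f"f2p_{col}", target)] = (src, dst)
            # Reverse edges let messages flow from the referenced row back.
            edges[(target, f"rev_f2p_{col}", name)] = (dst, src)
    return num_nodes, edges


# Made-up toy data: two drivers and three standings rows.
tables = {
    "drivers": Table(rows=[{"driverId": 10}, {"driverId": 20}], pkey="driverId"),
    "standings": Table(
        rows=[
            {"id": 1, "driverId": 20, "points": 8.0},
            {"id": 2, "driverId": 10, "points": 0.0},
            {"id": 3, "driverId": None, "points": 1.0},
        ],
        pkey="id",
        fkeys={"driverId": "drivers"},
    ),
}
num_nodes, edges = tables_to_hetero(tables)
```

In this toy setup the third standings row has no driver, so it becomes a node with no outgoing edge. A real converter would also need to turn column values into feature tensors, which this sketch leaves out.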

Addressing Assessor Feedback:
This specifically addresses @wsad1's feedback from #10353: "why do we need a RelBenchDataset?"
Based on that guidance, I have dropped the custom RelBenchDataset wrapper class. To align with PyG's stateless data processing philosophy, this PR introduces only a pure utility function, which decouples PyG's dataset classes from RelBench's internal state.

Proposed Changes

  • Added torch_geometric/utils/relbench.py housing the from_relbench conversion utility.
  • Added exhaustive unit tests in test/utils/test_relbench.py that use dummy fallback flags, so CI runs quickly without downloading large databases.
  • Updated CHANGELOG.md.

(Note: Parts 2, 3, and 4—covering the Warehouse Transforms, SAGEConv multi-task models, and LLM G-Retriever integrations—will follow in subsequent linked PRs once this foundational data layer is approved.)


Ref: #10353
Partially Closes #9839

@codecov

codecov bot commented Mar 4, 2026

Codecov Report

❌ Patch coverage is 97.77778% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 84.21%. Comparing base (c211214) to head (9aa7a26).
⚠️ Report is 185 commits behind head on master.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| torch_geometric/utils/relbench.py | 97.72% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #10628      +/-   ##
==========================================
- Coverage   86.11%   84.21%   -1.91%     
==========================================
  Files         496      511      +15     
  Lines       33655    36058    +2403     
==========================================
+ Hits        28981    30365    +1384     
- Misses       4674     5693    +1019     
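
The percentages in the diff follow directly from the hit and miss counts. A minimal sketch to check them (the `coverage_pct` helper is a hypothetical name, not part of Codecov):

```python
def coverage_pct(hits, misses):
    """Line coverage as a percentage of all tracked lines."""
    return 100.0 * hits / (hits + misses)


base = coverage_pct(hits=28981, misses=4674)  # master column
head = coverage_pct(hits=30365, misses=5693)  # #10628 column
print(f"base={base:.2f}% head={head:.2f}%")   # prints: base=86.11% head=84.21%
```

The diff shows 2403 more tracked lines on the head side, far more than a patch with a single missing line at 97.78% coverage would add. That suggests the project-level drop comes mostly from the stale comparison base flagged in the warning above, although the report alone does not confirm it.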

@AJamal27891 AJamal27891 force-pushed the pr-10353-part1-relbench-base branch 2 times, most recently from e1cb0cb to f8c76ea Compare March 4, 2026 16:19
Contributor

@puririshi98 puririshi98 left a comment

can you add an example of training a GNN+LLM system on this based on the code from examples/llm/txt2kg_rag.py

@AJamal27891
Contributor Author

> can you add an example of training a GNN+LLM system on this based on the code from examples/llm/txt2kg_rag.py

This PR includes examples/relbench_example.py, a lightweight hetero GNN example that demonstrates from_relbench end-to-end:
from_relbench → HeteroData → SAGEConv + to_hetero() → node-level regression (championship points, MAE 9.2 → 2.1 over 30 epochs, <30s on CPU).
Adding GRetriever end-to-end here would require bridging heterogeneous and homogeneous graphs: from_relbench produces HeteroData with multiple node and edge types, whereas GRetriever expects homogeneous input (x, edge_index, batch). That bridging involves projecting all node types to a common dimension, concatenating them, and remapping edges. This is model-level code that belongs in a dedicated PR, not in the data utility PR.
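
For readers unfamiliar with that bridging step, here is a rough, dependency-free sketch of the three operations named above. It is an illustration under stated assumptions, not the planned implementation: `to_homogeneous_sketch` is a hypothetical helper, and zero-padding stands in for the learned per-type projection a real model would use.

```python
def to_homogeneous_sketch(x_dict, edge_index_dict, dim):
    """Stack typed nodes into one node list and remap typed edges."""
    node_types = sorted(x_dict)
    offsets, x, node_type = {}, [], []
    for type_id, ntype in enumerate(node_types):
        offsets[ntype] = len(x)  # global index of this type's first node
        for feat in x_dict[ntype]:
            # Zero-pad or truncate to `dim`. ASSUMPTION: a real model would
            # apply a learned linear layer per node type here instead.
            x.append((list(feat) + [0.0] * dim)[:dim])
            node_type.append(type_id)
    src_all, dst_all = [], []
    for (src_type, _, dst_type), (src, dst) in edge_index_dict.items():
        # Shift per-type indices into the shared global index space.
        src_all += [i + offsets[src_type] for i in src]
        dst_all += [j + offsets[dst_type] for j in dst]
    return x, [src_all, dst_all], node_type


# Made-up toy graph: 2 drivers (1 feature each), 2 standings rows (3 features).
x_dict = {
    "drivers": [[1.0], [2.0]],
    "standings": [[0.5, 0.1, 0.2], [0.3, 0.4, 0.6]],
}
edge_index_dict = {("standings", "f2p_driverId", "drivers"): ([0, 1], [1, 0])}
x, edge_index, node_type = to_homogeneous_sketch(x_dict, edge_index_dict, dim=2)
```

Here the two standings nodes land at global indices 2 and 3, so the edge list becomes `[[2, 3], [1, 0]]`. The `node_type` vector keeps the original type of each node, which a downstream model can use if it still needs type information.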
The full GNN+LLM integration with GRetriever (using to_homogeneous) will land in a follow-up PR as part of the split from #10353:

| PR | Scope | Status |
| --- | --- | --- |
| #10628 (this) | from_relbench utility + tests + hetero GNN example | ✅ Ready |
| PR 2 | to_hetero_edges bridging utility | Planned |
| PR 3 | GNN+LLM models + GRetriever example (w/ to_homogeneous) | Planned |
| PR 4 | End-to-end RAG pipeline | Planned |

The end-to-end code already exists in #10353. This split keeps each PR focused and independently reviewable, as suggested by @wsad1.

@AJamal27891
Contributor Author

AJamal27891 commented Mar 9, 2026

CI update: The pytest failures were caused by tabulate 0.10.0 (unrelated to this PR). Resolved in #10634 by @rusty1s. Branch rebased on latest master.

@AJamal27891 AJamal27891 requested a review from puririshi98 March 9, 2026 06:26
@AJamal27891 AJamal27891 force-pushed the pr-10353-part1-relbench-base branch from 4148f88 to fe3e67c Compare March 10, 2026 08:54
Contributor

@puririshi98 puririshi98 left a comment

lgtm can you just share a log of running the example to convergence

@AJamal27891
Contributor Author

> lgtm can you just share a log of running the example to convergence

Loading RelBench rel-f1 dataset...
Done in 0.26 seconds.
Graph: 9 node types, 26 edge types

Training 30 epochs on "standings" point prediction...
Target stats (train): mean=6.25, std=13.49

Epoch: 001, Loss: 0.8340, Train MAE: 7.29, Val MAE: 7.36, Test MAE: 7.31 points
Epoch: 005, Loss: 0.2883, Train MAE: 5.02, Val MAE: 4.90, Test MAE: 4.89 points
Epoch: 010, Loss: 0.1753, Train MAE: 2.94, Val MAE: 2.91, Test MAE: 2.90 points
Epoch: 015, Loss: 0.1473, Train MAE: 2.78, Val MAE: 2.76, Test MAE: 2.72 points
Epoch: 020, Loss: 0.1134, Train MAE: 2.56, Val MAE: 2.50, Test MAE: 2.49 points
Epoch: 025, Loss: 0.1037, Train MAE: 2.44, Val MAE: 2.37, Test MAE: 2.37 points
Epoch: 030, Loss: 0.0890, Train MAE: 2.22, Val MAE: 2.19, Test MAE: 2.18 points

Final — Train MAE: 2.22, Val MAE: 2.19, Test MAE: 2.18 points
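
The log shows a small loss next to MAE values reported in points, together with the target mean and standard deviation. That pattern is consistent with training on standardized targets and undoing the scaling before computing MAE, but the example script is not shown in this thread, so treat that as an assumption. A minimal sketch of such a conversion, with hypothetical helper names and made-up predictions:

```python
def standardize(values, mean, std):
    """Scale raw targets to zero mean and unit variance."""
    return [(v - mean) / std for v in values]


def mae_in_points(pred_norm, target, mean, std):
    """Undo the scaling on predictions, then compute MAE in raw points."""
    pred = [p * std + mean for p in pred_norm]
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(target)


mean, std = 6.25, 13.49            # train target stats from the log
target = [0.0, 8.0, 25.0]          # made-up ground-truth points
pred_norm = standardize([1.0, 6.0, 22.0], mean, std)  # made-up model outputs
mae = mae_in_points(pred_norm, target, mean, std)
print(f"MAE: {mae:.2f} points")
```

With these made-up values the absolute errors are 1, 2, and 3 points, so the MAE is about 2 points regardless of the scale the model was trained on.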

Contributor

@puririshi98 puririshi98 left a comment

lgtm, @akihironitta @wsad1 to merge

Development

Successfully merging this pull request may close these issues.

Integrating GNNs and LLMs for Enhanced Data Warehouse Understanding and Lineage Analysis