This section summarizes the main datasets and evaluation resources that are suitable for experiments in scientific citation recommendation, paper retrieval, and document representation learning with SciBERT.
The resources below serve two different purposes:

- **Training corpora**, used to construct citation pairs, candidate sets, and retrieval experiments
- **Evaluation benchmarks**, used to measure the quality of learned scientific document embeddings

PeerRead is a special case: it is more relevant for peer-review and acceptance-related experiments than for citation ranking alone.
## Recommended experimental usage
### Training resources

| Goal | Recommended resource | Alternative |
|---|---|---|
| Large-scale citation recommendation training | S2ORC | OpenAlex |
| Fully open and reproducible dataset pipeline | OpenAlex | OpenCitations combined with an external metadata source |
| Small pilot experiment in a focused domain | ACL Anthology | OpenAlex subset |
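To make the "citation pairs" construction concrete, here is a minimal sketch of how OpenAlex-style work records (each work has an `id` and a `referenced_works` list in the OpenAlex schema) could be turned into positive pairs plus randomly sampled easy negatives. The toy records and the `build_citation_pairs` helper are illustrative assumptions, not part of any official tooling:

```python
import random

def build_citation_pairs(works, num_negatives=1, seed=0):
    """Build (citing, cited) positive pairs and random negative pairs
    from OpenAlex-style records with `id` and `referenced_works` fields."""
    rng = random.Random(seed)
    ids = [w["id"] for w in works]
    positives, negatives = [], []
    for w in works:
        cited = w.get("referenced_works", [])
        cited_set = set(cited)
        for c in cited:
            positives.append((w["id"], c))
        # Papers this work does not cite serve as easy negatives.
        candidates = [i for i in ids if i != w["id"] and i not in cited_set]
        for _ in cited:
            if candidates:
                for neg_id in rng.sample(candidates, min(num_negatives, len(candidates))):
                    negatives.append((w["id"], neg_id))
    return positives, negatives

# Invented toy records mimicking the OpenAlex works schema.
works = [
    {"id": "W1", "referenced_works": ["W2"]},
    {"id": "W2", "referenced_works": []},
    {"id": "W3", "referenced_works": ["W1", "W2"]},
]
pos, neg = build_citation_pairs(works)
```

In a real pipeline the same logic would run over records fetched from the OpenAlex works endpoint or extracted from an S2ORC dump; harder negatives (e.g. nearest neighbors that are not cited) usually improve training over purely random sampling.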
### Evaluation resources

| Goal | Recommended resource | Alternative |
|---|---|---|
| Main evaluation of scientific document embeddings | SciRepEval | SciDocs |
| Comparison with older representation learning work | SciDocs | — |
| Review-aware or decision-related experiments | PeerRead | — |
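Benchmarks such as SciRepEval and SciDocs score embeddings through downstream tasks, but during development it is common to sanity-check representations with a simple nearest-neighbor retrieval metric. The sketch below computes Recall@k over cosine similarity in plain Python; the metric definition is standard, while the toy vectors standing in for SciBERT embeddings are invented:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def recall_at_k(query_vec, candidates, relevant_ids, k=2):
    """candidates: {doc_id: vector}; relevant_ids: gold cited doc ids."""
    ranked = sorted(candidates, key=lambda d: cosine(query_vec, candidates[d]),
                    reverse=True)
    hits = sum(1 for d in ranked[:k] if d in relevant_ids)
    return hits / len(relevant_ids)

# Invented toy embeddings standing in for SciBERT document vectors.
query = [1.0, 0.0]
cands = {"A": [0.9, 0.1], "B": [0.0, 1.0], "C": [0.7, 0.7]}
r = recall_at_k(query, cands, {"A"}, k=1)
```

The same loop, applied to held-out citation pairs, gives a quick proxy signal between full benchmark runs.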
## Resource selection guide

| Use case | Recommended resource |
|---|---|
| Best large-scale training corpus | S2ORC |
| Best open scholarly metadata source | OpenAlex |
| Best open citation graph | OpenCitations |
| Best small domain-specific corpus for NLP | ACL Anthology |
| Best benchmark for embedding evaluation | SciRepEval |
| Best legacy benchmark for comparison | SciDocs |
| Best optional dataset for review-related extensions | PeerRead |
## Suggested setup for this project

For a SciBERT-based citation recommendation pipeline, a practical setup is:

- **S2ORC or OpenAlex** to build positive citation pairs and candidate documents
- **SciRepEval** as the primary benchmark for evaluating learned representations
- **SciDocs** only as a secondary benchmark for comparison with prior work
- **ACL Anthology** for smaller-scale prototyping or early-stage experiments
- **PeerRead** only if the project is extended toward peer-review or acceptance prediction tasks
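The steps above can be sketched end to end. To keep the sketch runnable without model downloads, the `embed` function below is a deliberately trivial bag-of-words stand-in for a SciBERT encoder (in the real pipeline it would be a forward pass through a model such as `allenai/scibert_scivocab_uncased` via Hugging Face `transformers`); the example titles are invented:

```python
import math
from collections import Counter

def embed(text):
    # Toy stand-in for SciBERT: a bag-of-words count vector.
    # The real pipeline would encode the text with a transformer instead.
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two Counter term-count vectors."""
    dot = sum(u[t] * v[t] for t in u)
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def recommend(query_title, corpus, top_k=2):
    """Rank candidate papers by embedding similarity to the query."""
    q = embed(query_title)
    scored = sorted(corpus, key=lambda t: cosine(q, embed(t)), reverse=True)
    return scored[:top_k]

# Invented example titles; in practice these come from S2ORC or OpenAlex.
corpus = [
    "Citation recommendation with transformer embeddings",
    "A survey of graph neural networks",
    "Pretrained language models for scientific text",
]
top = recommend("Transformer embeddings for citation recommendation",
                corpus, top_k=1)
```

Swapping the toy `embed` for SciBERT and the title list for an S2ORC/OpenAlex candidate pool yields the intended pipeline; SciRepEval and SciDocs then evaluate the resulting embeddings.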
## About

MetaGraphSci is a semi-supervised model for scientific document classification that combines metadata and citation graphs with contrastive learning and pseudo-labeling for settings with limited supervision.