U-RxnDiagram-15k (formerly RxnCaption-15k) is a dataset of ~15,000 chemical reaction diagram images extracted from scientific PDFs, each annotated with:
- Bounding boxes for all chemical entities (structures, text, identifiers, supplements)
- Complete reaction graphs linking reactants → conditions → products
HuggingFace: songjhPKU/U-RxnDiagram-15k
| Type | Description |
|---|---|
single |
Single linear reaction (A → B) |
multiple |
Multiple parallel or sequential reactions |
tree |
Tree-structured multi-step synthesis |
graph |
Complex reaction network |
Both the ground-truth and prediction files use this schema:
{
"licenses": [],
"info": {},
"categories": [
{"id": 1, "name": "structure"},
{"id": 2, "name": "text"},
{"id": 3, "name": "identifier"},
{"id": 4, "name": "supplement"}
],
"images": [
{
"id": 0,
"file_name": "10.1002_foo_figure_0.png",
"diagram_type": "single",
"bboxes": [
{"id": 0, "category_id": 1, "bbox": [x, y, w, h]},
{"id": 1, "category_id": 2, "bbox": [x, y, w, h], "text": "NaOH"},
{"id": 2, "category_id": 1, "bbox": [x, y, w, h]}
],
"reactions": [
{"reactants": [0], "conditions": [1], "products": [2]}
]
}
]
}| category_id | Meaning |
|---|---|
| 1 | Chemical structure (molecule, complex, etc.) |
| 2 | Text label (reagent, solvent, temperature, yield, ...) |
| 3 | Compound identifier (number, letter, code) |
| 4 | Supplementary annotation (ignored in evaluation) |
[x, y, width, height] in pixels (COCO-style, 0-indexed top-left origin).
from datasets import load_dataset
ds = load_dataset("songjhPKU/U-RxnDiagram-15k")Or with HuggingFace CLI:
huggingface-cli download songjhPKU/U-RxnDiagram-15k --repo-type dataset --local-dir data/U-RxnDiagram-15kThe paper evaluates on two held-out test sets:
| Benchmark | Source | # Images |
|---|---|---|
| PDF-400 | Internal PDF corpus | 400 |
| RxnScribe | RxnScribe dataset | varies |
Raw PDF → cropped reaction images → YOLO detection → BIVP annotation → Qwen training JSONL
See scripts/prepare_data.sh for the full automated pipeline.