Reformulating Reaction Diagram Parsing as Visual Prompt Guided Captioning
CVPR 2026: Given a chemical reaction diagram from a scientific paper, RxnCaption identifies all molecular structures, text labels, and identifiers, then organises them into structured reaction graphs (reactants → conditions → products).
- **[02/21/2026]** Our paper is accepted by CVPR 2026!
- **State-of-the-art** on both the RxnScribe-test and our U-RxnDiagram-15k benchmarks
- **Novel BIVP strategy**: the Bounding-box Index Visual Prompt turns detection into a structured captioning task
- **15k annotated diagrams**: the largest reaction diagram dataset, covering 4 topology types (single / multiple / tree / graph)
- **Plug-and-play**: one script runs the full pipeline (detection → annotation → VL inference)
- **Comprehensive evaluation**: Hard / Soft / Hybrid metrics with visualization reports
**Results on RxnScribe-test:**

| Method | Strategy | Hard F1 | Soft F1 |
|---|---|---|---|
| RxnScribe | BROS | 74.0 | 83.8 |
| RxnIm | BROS | 73.2 | 76.9 |
| Gemini-2.5-Pro | BIVP | 49.8 | 76.1 |
| RxnCaption-VL (Ours) | BIVP | 75.5 | 88.2 |
**Results on U-RxnDiagram-15k:**

| Method | Strategy | Hard F1 | Soft F1 |
|---|---|---|---|
| RxnScribe | BROS | 34.9 | 45.9 |
| RxnIm | BROS | 37.4 | 40.5 |
| Gemini-2.5-Pro | BIVP | 40.4 | 66.6 |
| RxnCaption-VL (Ours) | BIVP | 55.5 | 67.6 |
```
RxnCaption/
├── README.md / README_zh.md
├── LICENSE                 # CC-BY-NC-4.0
├── requirements.txt
│
├── molyolo/                # Module 1: Molecular structure detector (YOLOv10)
│   ├── predict.py
│   └── weights/MolYOLO.pt  # (download separately)
│
├── rxncaption/             # Modules 2-3: Core pipeline
│   ├── annotate.py         # BIVP: bboxes + reading-order indices
│   ├── inference.py        # VL model inference (prompt templates)
│   └── evaluate.py         # Hard / Soft / Hybrid evaluation
│
├── tools/                  # Data processing utilities
│   ├── generate_mapdict.py
│   ├── transform_yolo_detections.py
│   ├── convert_to_qwen_format.py
│   ├── transform_jsonl_to_json.py
│   └── transform_prediction_to_gtformat.py
│
├── scripts/                # Shell pipelines
│   ├── run_inference.sh    # End-to-end inference
│   ├── run_eval.sh         # Evaluation
│   └── prepare_data.sh     # Training data preparation
│
├── demo/                   # Quick demo with sample images
│   ├── run_demo.sh
│   └── run_demo_slurm.sh
│
└── docs/
    ├── DATA.md             # Dataset documentation
    └── TRAINING.md         # Training guide
```
```bash
git clone https://github.com/songjhPKU/RxnCaption
cd RxnCaption
pip install -r requirements.txt

# Install the bundled ultralytics (YOLOv10) fork
pip install -e molyolo/

# MolYOLO detector checkpoint
mkdir -p molyolo/weights
wget -O molyolo/weights/MolYOLO.pt \
    https://github.com/songjhPKU/MolYOLO/raw/main/weights/MolYOLO.pt
```
```bash
# RxnCaption-VL model, two options:

# Option A: Auto-download from HuggingFace (default)
#   The scripts use "songjhPKU/RxnCaption-VL" by default;
#   swift will download it automatically on first run.

# Option B: Use a local copy (recommended for most users)
huggingface-cli download songjhPKU/RxnCaption-VL --local-dir /path/to/RxnCaption-VL
# Then pass the local path via --model:
#   bash scripts/run_inference.sh --model /path/to/RxnCaption-VL ...
```

```bash
bash scripts/run_inference.sh \
    --image_dir /path/to/reaction_images \
    --output_dir ./outputs \
    --gpu_num 1
```

This runs the full pipeline:
- MolYOLO detects molecular structures → per-image JSON bboxes
- BIVP annotates images with blue boxes + numeric labels
- RxnCaption-VL reads the annotated images and predicts reaction graphs
- Post-processing converts the output to evaluation format
Want to try it out quickly? Use the bundled demo script:
```bash
# 1. Put a few reaction images into demo/sample_images/
# 2. Run:
bash demo/run_demo.sh

# With evaluation (if you have ground truth):
GT_FILE=demo/sample_gt.json bash demo/run_demo.sh

# With a local model checkpoint:
MODEL=/path/to/RxnCaption-VL bash demo/run_demo.sh
```

See demo/README.md for full details.
A fine-tuned YOLOv10 model detects all relevant entities (molecules, text, identifiers) in each reaction diagram image.
```bash
python molyolo/predict.py \
    --img_dir /path/to/images \
    --weights molyolo/weights/MolYOLO.pt \
    --output_dir outputs/molyolo \
    --output_name run01 \
    --conf 0.5 \
    --gpu_num 4 \
    --visual_prompt
```

The Bounding-box Index Visual Prompt (BIVP) module draws blue bounding boxes and reading-order numeric labels onto each image, turning raw detections into a visual prompt for the VL model.
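For intuition, the indexing half of BIVP can be sketched in a few lines of Python. This is an illustrative reading-order rule (group boxes into rows by vertical center, then number top-to-bottom, left-to-right); the actual heuristic in `rxncaption/annotate.py` may differ, and the drawing of the blue boxes is omitted here.

```python
# Illustrative sketch of BIVP-style reading-order indexing (assumption:
# boxes whose vertical centers lie within row_tol pixels form one row;
# the repo's real ordering rule may differ).
def reading_order(bboxes, row_tol=20):
    """Map each (x1, y1, x2, y2) box to a 1-based reading-order index."""
    if not bboxes:
        return {}
    # Sort by vertical center, then group consecutive boxes into rows.
    boxes = sorted(bboxes, key=lambda b: (b[1] + b[3]) / 2)
    rows, current = [], [boxes[0]]
    for b in boxes[1:]:
        prev_cy = (current[-1][1] + current[-1][3]) / 2
        if abs((b[1] + b[3]) / 2 - prev_cy) <= row_tol:
            current.append(b)      # same row
        else:
            rows.append(current)   # start a new row
            current = [b]
    rows.append(current)
    # Within each row, order left to right; indices are 1-based.
    ordered = [b for row in rows for b in sorted(row, key=lambda b: b[0])]
    return {b: i for i, b in enumerate(ordered, start=1)}
```

Each index is then stamped next to its box, so the VL model can refer to structures by number instead of emitting coordinates.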
```bash
python rxncaption/annotate.py \
    --image_root_dir /path/to/images \
    --det_json_root_dir outputs/molyolo/run01/json \
    --middle_root_dir outputs/annotated \
    --confidence_threshold 0.5
```

The fine-tuned Qwen2.5-VL-7B model reads each annotated image and outputs a structured JSON reaction list.
```bash
swift infer \
    --model songjhPKU/RxnCaption-VL \
    --model_type qwen2_5_vl \
    --infer_backend pt \
    --val_dataset outputs/eval_input.jsonl \
    --result_path outputs/infer_output.jsonl \
    --max_batch_size 1 \
    --max_new_tokens 16384
```

Example output:
```json
[
  {
    "reactants": [{"structure": 1}, {"text": "H₂O"}],
    "conditions": [{"text": "Δ, 2h"}],
    "products": [{"structure": 2}]
  }
]
```

Three evaluation modes reflect different levels of matching strictness:
| Mode | What is matched |
|---|---|
| Hard | All role members (molecules + text) must match with IoU ≥ 0.5 |
| Soft | Only molecule members are compared |
| Hybrid | Molecules matched by IoU; text compared as unordered bag |
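To make the matching rules concrete, here is a minimal sketch of the two primitives the modes build on: box IoU and a Soft-style molecule check. This is an illustration under standard IoU conventions, not the evaluator in `rxncaption/evaluate.py`.

```python
# Illustrative matching primitives (assumption: axis-aligned (x1, y1, x2, y2)
# boxes and the standard IoU definition; not the repo's exact evaluator).
def iou(a, b):
    """Intersection-over-union of two boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def soft_match(pred_mols, gt_mols, thr=0.5):
    """Soft-style check: every GT molecule box is covered by some prediction."""
    return all(any(iou(p, g) >= thr for p in pred_mols) for g in gt_mols)
```

Hard mode would additionally require the text members to match; Hybrid compares text as an unordered bag while keeping the IoU test for molecules.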
```bash
bash scripts/run_eval.sh \
    --gt_file data/ground_truth.json \
    --raw_pred_file outputs/raw_prediction.json \
    --mapdict data/mapdict_from_yolo_to_gt.json \
    --image_dir data/images \
    --output_dir results/ \
    --mode all
```

U-RxnDiagram-15k contains ~15,000 reaction diagram images from scientific PDFs with full annotation across 4 topology types.
```python
from datasets import load_dataset

ds = load_dataset("songjhPKU/U-RxnDiagram-15k")
```

See docs/DATA.md for the complete schema and download instructions.
See docs/TRAINING.md for the full training guide.
Short version:
```bash
# 1. Prepare data
bash scripts/prepare_data.sh \
    --raw_gt_json data/ground_truth_ocr.json \
    --yolo_det_dir data/det_json/ \
    --image_dir data/annotated_images/ \
    --output_dir data/processed/

# 2. Train (8 GPUs, full fine-tuning)
swift sft \
    --model Qwen/Qwen2.5-VL-7B-Instruct \
    --model_type qwen2_5_vl \
    --dataset data/processed/train.jsonl \
    --val_dataset data/processed/val.jsonl \
    --output_dir outputs/train/
    # ... see docs/TRAINING.md for full args
```

If you find this work helpful, please cite:
```bibtex
@misc{song2026rxncaptionreformulatingreactiondiagram,
      title={RxnCaption: Reformulating Reaction Diagram Parsing as Visual Prompt Guided Captioning},
      author={Jiahe Song and Chuang Wang and Bowen Jiang and Yinfan Wang and Hao Zheng and Xingjian Wei and Chengjin Liu and Rui Nie and Junyuan Gao and Jiaxing Sun and Yubin Wang and Lijun Wu and Zhenhua Huang and Jiang Wu and Qian Yu and Conghui He},
      year={2026},
      eprint={2511.02384},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2511.02384},
}
```

This project is licensed under the CC BY-NC 4.0 license; see LICENSE for details.
This research is supported by Shanghai AI Laboratory.

