RxnCaption Banner

RxnCaption

Reformulating Reaction Diagram Parsing as Visual Prompt Guided Captioning


🇨🇳 Chinese documentation: README_zh.md


CVPR 2026 — Given a chemical reaction diagram from a scientific paper, RxnCaption identifies all molecular structures, text labels, and identifiers, then organises them into structured reaction graphs (reactants → conditions → products).

RxnCaption Pipeline

🔥 News

  • 🚀 [02/21/2026] Our paper is accepted by CVPR 2026!

✨ Highlights

  • 🏆 State-of-the-art on both RxnScribe-test and our U-RxnDiagram-15k benchmark
  • 🔬 Novel BIVP strategy — Bounding-box Index Visual Prompt turns detection into a structured captioning task
  • 🧪 15k annotated diagrams — the largest reaction diagram dataset, with 4 topology types (single / multiple / tree / graph)
  • ⚡ Plug-and-play — one script runs the full pipeline: detection → annotation → VL inference
  • 📊 Comprehensive evaluation — Hard / Soft / Hybrid metrics with visualization reports

📊 Main Results

RxnScribe-test

| Method                | Strategy | Hard F1 | Soft F1 |
|-----------------------|----------|---------|---------|
| RxnScribe             | BROS     | 74.0    | 83.8    |
| RxnIm                 | BROS     | 73.2    | 76.9    |
| Gemini-2.5-Pro        | BIVP     | 49.8    | 76.1    |
| RxnCaption-VL (Ours)  | BIVP     | 75.5    | 88.2    |

U-RxnDiagram-15k-test

| Method                | Strategy | Hard F1 | Soft F1 |
|-----------------------|----------|---------|---------|
| RxnScribe             | BROS     | 34.9    | 45.9    |
| RxnIm                 | BROS     | 37.4    | 40.5    |
| Gemini-2.5-Pro        | BIVP     | 40.4    | 66.6    |
| RxnCaption-VL (Ours)  | BIVP     | 55.5    | 67.6    |

πŸ—οΈ Repository Structure

RxnCaption/
β”œβ”€β”€ README.md / README_zh.md
β”œβ”€β”€ LICENSE                    # CC-BY-NC-4.0
β”œβ”€β”€ requirements.txt
β”‚
β”œβ”€β”€ molyolo/                   # Module 1 β€” Molecular structure detector (YOLOv10)
β”‚   β”œβ”€β”€ predict.py
β”‚   └── weights/MolYOLO.pt     # (download separately)
β”‚
β”œβ”€β”€ rxncaption/                # Module 2/3 β€” Core pipeline
β”‚   β”œβ”€β”€ annotate.py            # BIVP: bboxes + reading-order indices
β”‚   β”œβ”€β”€ inference.py           # VL model inference (prompt templates)
β”‚   └── evaluate.py            # Hard / Soft / Hybrid evaluation
β”‚
β”œβ”€β”€ tools/                     # Data processing utilities
β”‚   β”œβ”€β”€ generate_mapdict.py
β”‚   β”œβ”€β”€ transform_yolo_detections.py
β”‚   β”œβ”€β”€ convert_to_qwen_format.py
β”‚   β”œβ”€β”€ transform_jsonl_to_json.py
β”‚   └── transform_prediction_to_gtformat.py
β”‚
β”œβ”€β”€ scripts/                   # Shell pipelines
β”‚   β”œβ”€β”€ run_inference.sh       # End-to-end inference
β”‚   β”œβ”€β”€ run_eval.sh            # Evaluation
β”‚   └── prepare_data.sh        # Training data preparation
β”‚
β”œβ”€β”€ demo/                      # Quick demo with sample images
β”‚   β”œβ”€β”€ run_demo.sh
β”‚   └── run_demo_slurm.sh
β”‚
└── docs/
    β”œβ”€β”€ DATA.md                # Dataset documentation
    └── TRAINING.md            # Training guide

⚡ Quick Start

Installation

git clone https://github.com/songjhPKU/RxnCaption
cd RxnCaption
pip install -r requirements.txt

# Install the bundled ultralytics (YOLOv10) fork
pip install -e molyolo/

Download Weights

# MolYOLO detector checkpoint
mkdir -p molyolo/weights
wget -O molyolo/weights/MolYOLO.pt \
    https://github.com/songjhPKU/MolYOLO/raw/main/weights/MolYOLO.pt

# RxnCaption-VL model β€” two options:
# Option A: Auto-download from HuggingFace (default)
#   The scripts use "songjhPKU/RxnCaption-VL" by default.
#   swift will download it automatically on first run.

# Option B: Use a local copy (recommended for most users)
huggingface-cli download songjhPKU/RxnCaption-VL --local-dir /path/to/RxnCaption-VL
#   Then pass the local path via --model:
#   bash scripts/run_inference.sh --model /path/to/RxnCaption-VL ...

Run Inference on Your Images

bash scripts/run_inference.sh \
    --image_dir  /path/to/reaction_images \
    --output_dir ./outputs \
    --gpu_num    1

This runs the full pipeline:

  1. MolYOLO detects molecular structures → per-image JSON bboxes
  2. BIVP annotates images with blue boxes + numeric labels
  3. RxnCaption-VL reads the annotated images and predicts reaction graphs
  4. Post-processing converts the output to evaluation format
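The four stages above can be sketched as an ordered list of commands. This is illustrative only — flag sets are abridged, the post-processing arguments are omitted, and scripts/run_inference.sh remains the authoritative orchestration:

```python
# Illustrative driver for the four pipeline stages. Paths are examples;
# see scripts/run_inference.sh for the real flags and ordering.
def build_pipeline(image_dir: str, output_dir: str) -> list[list[str]]:
    det_dir = f"{output_dir}/molyolo"    # stage 1 output: per-image JSON bboxes
    ann_dir = f"{output_dir}/annotated"  # stage 2 output: BIVP-annotated images
    return [
        # 1. MolYOLO detection
        ["python", "molyolo/predict.py",
         "--img_dir", image_dir, "--output_dir", det_dir],
        # 2. BIVP annotation (blue boxes + numeric labels)
        ["python", "rxncaption/annotate.py",
         "--image_root_dir", image_dir,
         "--det_json_root_dir", f"{det_dir}/json",
         "--middle_root_dir", ann_dir],
        # 3. VL inference over the annotated images
        ["swift", "infer", "--model", "songjhPKU/RxnCaption-VL",
         "--val_dataset", f"{output_dir}/eval_input.jsonl",
         "--result_path", f"{output_dir}/infer_output.jsonl"],
        # 4. Post-processing to evaluation format (arguments omitted)
        ["python", "tools/transform_prediction_to_gtformat.py"],
    ]
```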

Quick Demo

Want to try it out quickly? Use the bundled demo script:

# 1. Put a few reaction images into demo/sample_images/
# 2. Run:
bash demo/run_demo.sh

# With evaluation (if you have ground truth):
GT_FILE=demo/sample_gt.json bash demo/run_demo.sh

# With a local model checkpoint:
MODEL=/path/to/RxnCaption-VL bash demo/run_demo.sh

See demo/README.md for full details.


🔬 Pipeline Details

Step 1 — MolYOLO Detection

A fine-tuned YOLOv10 model detects all relevant entities (molecules, text, identifiers) in each reaction diagram image.

python molyolo/predict.py \
    --img_dir       /path/to/images \
    --weights       molyolo/weights/MolYOLO.pt \
    --output_dir    outputs/molyolo \
    --output_name   run01 \
    --conf          0.5 \
    --gpu_num       4 \
    --visual_prompt

Step 2 — BIVP Annotation

The Bounding-box Index Visual Prompt (BIVP) module draws blue bounding boxes and reading-order numeric labels onto each image, turning raw detections into a visual prompt for the VL model.

python rxncaption/annotate.py \
    --image_root_dir    /path/to/images \
    --det_json_root_dir outputs/molyolo/run01/json \
    --middle_root_dir   outputs/annotated \
    --confidence_threshold 0.5
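A key ingredient of BIVP is that the numeric labels follow reading order. As a minimal sketch — assuming top-to-bottom, then left-to-right ordering with a row tolerance; the actual ordering logic lives in rxncaption/annotate.py — the indexing could look like:

```python
# Sketch of reading-order indexing for BIVP labels (assumed ordering:
# top-to-bottom rows, left-to-right within a row; not the repo's exact code).
def reading_order(boxes, row_tol=20):
    """Assign 1-based indices to (x1, y1, x2, y2) boxes in reading order."""
    rows = []
    for box in sorted(boxes, key=lambda b: b[1]):  # scan by top edge
        for row in rows:
            if abs(row[0][1] - box[1]) <= row_tol:  # same visual row
                row.append(box)
                break
        else:
            rows.append([box])  # start a new row
    ordered = [b for row in rows for b in sorted(row, key=lambda b: b[0])]
    return {box: i for i, box in enumerate(ordered, start=1)}
```

The returned mapping is what gets drawn next to each blue box, and it is the index the VL model later uses to refer to molecules (e.g. {"structure": 1}).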

Step 3 — RxnCaption-VL Inference

The fine-tuned Qwen2.5-VL-7B model reads each annotated image and outputs a structured JSON reaction list.

swift infer \
    --model           songjhPKU/RxnCaption-VL \
    --model_type      qwen2_5_vl \
    --infer_backend   pt \
    --val_dataset     outputs/eval_input.jsonl \
    --result_path     outputs/infer_output.jsonl \
    --max_batch_size  1 \
    --max_new_tokens  16384

Example output:

[
  {
    "reactants":  [{"structure": 1}, {"text": "H₂O"}],
    "conditions": [{"text": "Δ, 2h"}],
    "products":   [{"structure": 2}]
  }
]
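Each role member is either {"structure": N}, referring to BIVP box index N, or {"text": ...}. A small consumer of this schema (a sketch; field handling beyond the example above is an assumption):

```python
import json

# Render each predicted reaction as "reactants -> conditions -> products",
# resolving {"structure": N} members to their BIVP box index.
def reaction_summary(raw: str) -> list[str]:
    def member(m):
        return f"box#{m['structure']}" if "structure" in m else m["text"]
    out = []
    for rxn in json.loads(raw):
        parts = [
            " + ".join(member(m) for m in rxn.get("reactants", [])),
            ", ".join(member(m) for m in rxn.get("conditions", [])),
            " + ".join(member(m) for m in rxn.get("products", [])),
        ]
        out.append(" -> ".join(parts))
    return out
```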

Step 4 — Evaluation

Three evaluation modes reflect different levels of matching strictness:

| Mode   | What is matched                                              |
|--------|--------------------------------------------------------------|
| Hard   | All role members (molecules + text) must match with IoU ≥ 0.5 |
| Soft   | Only molecule members are compared                           |
| Hybrid | Molecules matched by IoU; text compared as unordered bag     |

bash scripts/run_eval.sh \
    --gt_file        data/ground_truth.json \
    --raw_pred_file  outputs/raw_prediction.json \
    --mapdict        data/mapdict_from_yolo_to_gt.json \
    --image_dir      data/images \
    --output_dir     results/ \
    --mode           all
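The IoU ≥ 0.5 box criterion behind the Hard and Hybrid modes can be illustrated as follows (a sketch only; the full matcher in rxncaption/evaluate.py also aligns roles and text members):

```python
# Intersection-over-union of two (x1, y1, x2, y2) boxes, and the
# IoU >= 0.5 acceptance test used when matching molecule boxes.
def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])  # intersection corners
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def hard_match(pred_box, gt_box, thresh=0.5):
    return iou(pred_box, gt_box) >= thresh
```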

πŸ—ƒοΈ Dataset

U-RxnDiagram-15k contains ~15,000 reaction diagram images from scientific PDFs with full annotation across 4 topology types.

from datasets import load_dataset
ds = load_dataset("songjhPKU/U-RxnDiagram-15k")

See docs/DATA.md for the complete schema and download instructions.


πŸ‹οΈ Training

See docs/TRAINING.md for the full training guide.

Short version:

# 1. Prepare data
bash scripts/prepare_data.sh \
    --raw_gt_json  data/ground_truth_ocr.json \
    --yolo_det_dir data/det_json/ \
    --image_dir    data/annotated_images/ \
    --output_dir   data/processed/

# 2. Train (8 GPUs, full fine-tuning)
swift sft \
    --model Qwen/Qwen2.5-VL-7B-Instruct \
    --model_type qwen2_5_vl \
    --dataset data/processed/train.jsonl \
    --val_dataset data/processed/val.jsonl \
    --output_dir outputs/train/ \
    # ... see docs/TRAINING.md for full args

🤝 Citation

If you find this work helpful, please cite:

@misc{song2026rxncaptionreformulatingreactiondiagram,
      title={RxnCaption: Reformulating Reaction Diagram Parsing as Visual Prompt Guided Captioning}, 
      author={Jiahe Song and Chuang Wang and Bowen Jiang and Yinfan Wang and Hao Zheng and Xingjian Wei and Chengjin Liu and Rui Nie and Junyuan Gao and Jiaxing Sun and Yubin Wang and Lijun Wu and Zhenhua Huang and Jiang Wu and Qian Yu and Conghui He},
      year={2026},
      eprint={2511.02384},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2511.02384}, 
}

📜 License

This project is licensed under the CC BY-NC 4.0 license — see LICENSE for details.

πŸ™ Acknowledgements

This research is supported by Shanghai AI Laboratory.
