📖Paper | 🤗Datasets | 🤗Models Weights (Huggingface) | 🤖Models Weights (ModelScope)
🌈Large Language Models (LLMs) have demonstrated remarkable performance across various domains, yet their capabilities in molecular reasoning remain insufficiently explored. Current approaches tend to rely heavily on general-purpose prompting, which lacks domain-specific molecular semantics, while fine-tuning strategies often face challenges with interpretability and reasoning depth. To address these issues, we introduce MolReasoner, a two-stage framework designed to transition LLMs from memorization toward chemical reasoning. First, we propose Mol-SFT, which initializes the model's reasoning abilities via synthetic Chain-of-Thought (CoT) samples generated by GPT-4o and verified for chemical accuracy. Subsequently, Mol-RL applies reinforcement learning with specialized reward functions designed explicitly to align chemical structures with linguistic descriptions, thereby enhancing molecular reasoning capabilities. Our approach notably enhances interpretability, improving the model's molecular understanding and enabling better generalization. Extensive experiments demonstrate that MolReasoner outperforms existing methods, marking a significant shift from memorization-based outputs to robust chemical reasoning.
- 🚀 [08/05/2025] We release our MolReasoner's Paper.
- 🚀 [08/05/2025] We upload our checkpoints of MolReasoner to Huggingface.
- 🚀 [08/04/2025] We upload our checkpoints of MolReasoner to ModelScope.
- 🚀 [08/04/2025] We upload our training datasets of MolReasoner to Huggingface.
- 🚀 [08/04/2025] We release MolReasoner repository and our training, inference and evaluation code.
# 1. Install LLaMA-Factory from https://github.com/hiyouga/LLaMA-Factory repository.
# 2. Install Verl from https://github.com/volcengine/verl repository.
# 3. Install additional dependencies for both environments
pip3 install deepspeed
pip install --force-reinstall psutil==5.9.8
pip install -U "ray[data,train,tune,serve,default]"
pip install EFGs
pip install swanlab
pip install --upgrade boto3 botocore
pip install rdkit tensorboard
pip install python-Levenshtein
pip install selfies
pip install nltk
# 4. Configure NLTK data path
cp -r verl/nltk_data /root/nltk_data
# 5. Download the SciBERT model
# Download the SciBERT model from Hugging Face:
# https://huggingface.co/allenai/scibert_scivocab_uncased or
# https://huggingface.co/Sihangli/3D-MoLM
# 6. Download the QWEN 2.5-7B-Instruct model
# Download the QWEN 2.5-7B-Instruct model from Hugging Face:
# https://huggingface.co/Qwen/Qwen2.5-7B-Instruct
Please download the data from Hugging Face and place the corresponding SFT data under the LLaMA-Factory/data directory. Then, store the data according to the information in LLaMA-Factory/data/dataset_info.json as follows:
{
"text_based_de_novo_molecule_generation_train": {
"file_name": "text_based_de_novo_molecule_generation_train.json"
},
"text_based_de_novo_molecule_generation_test": {
"file_name": "text_based_de_novo_molecule_generation_test.json"
},
"molecule_captioning_train": {
"file_name": "molecule_captioning_train.json"
},
"molecule_captioning_test": {
"file_name": "molecule_captioning_test.json"
}
}
# 1. Molecule Captioning
bash LLaMA-Factory/train_molecule_captioning.sh
# 2. Text-based De Novo Molecule Generation
bash LLaMA-Factory/train_text_based_de_novo_molecule_generation.sh
Please remember to update the base model paths (Qwen2.5-7B-Instruct) in the following YAML files:
- LLaMA-Factory/examples/train_full/train_molecule_captioning/sft.yml
- LLaMA-Factory/examples/train_full/train_text_guided_molecule_generation/sft.yml
Make sure to modify the paths to the downloaded model and adjust the save paths as needed.
Please download the GRPO data from Hugging Face.
# 1. Molecule Captioning
bash verl/examples/grpo_trainer/grpo_train_molecule_captioning.sh
# 2. Text-based De Novo Molecule Generation
bash verl/examples/grpo_trainer/grpo_train_text_based_de_novo_molecule_generation.sh
Please make sure to follow the notes provided in the two shell scripts and update the paths accordingly.
Additionally, during Molecule Captioning training, make sure to replace the line `primary_path = 'xxxx/scibert_scivocab_uncased'` in the file `verl/verl/utils/reward_score/chembl_mol2desc.py` with your own model path.
For convenience, in the code, `desc2mol` refers to `text_guided_molecule_generation`, and `mol2desc` refers to `molecule_captioning`.
Please refer to verl/scripts/model_merger.sh to merge the trained actor model with the Qwen2.5-7B-Instruct model.
Please refer to LLaMA-Factory/infer_molecule_captioning.sh and LLaMA-Factory/infer_text_based_de_novo_molecule_generation.sh. Make sure to replace the paths to the merged model and update the output paths for inference.
We have provided scripts to evaluate both tasks, located at:
- verl/examples/data_preprocess/molecule/molecule_captioning/eval_molecule_captioning/eval_metrics.py
- verl/examples/data_preprocess/molecule/text_guided_molecule_generation/eval_text_guided_molecule_generation/eval_metrics.py
Additionally, to demonstrate its effectiveness, we include several baseline results as well as the metrics from our method in the following directories:
- verl/examples/data_preprocess/molecule/molecule_captioning/eval_molecule_captioning/saved_results
- verl/examples/data_preprocess/molecule/text_guided_molecule_generation/eval_text_guided_molecule_generation/saved_results
To perform the metric evaluation, we also provide several baseline evaluation scripts. The results they produce match exactly those presented in our paper.
python verl/examples/data_preprocess/molecule/molecule_captioning/eval_molecule_captioning/grpo_eval.py # MolReasoner Molecule Captioning Evaluation
python verl/examples/data_preprocess/molecule/molecule_captioning/eval_molecule_captioning/eval_mol_instruct.py # Mol-Instruction Molecule Captioning Evaluation
python verl/examples/data_preprocess/molecule/molecule_captioning/eval_molecule_captioning/eval_qwen2_5_7b.py # Qwen2.5-7B Molecule Captioning Evaluation
python verl/examples/data_preprocess/molecule/molecule_captioning/eval_molecule_captioning/eval_llama3_70b.py # Llama3-70B Molecule Captioning Evaluation
python verl/examples/data_preprocess/molecule/text_guided_molecule_generation/eval_text_guided_molecule_generation/grpo_eval.py # MolReasoner Text-based De Novo Molecule Generation Evaluation
MolReasoner outperforms all closed-source and open-source baselines across BLEU-2/4, METEOR, and ROUGE metrics, establishing a new state-of-the-art in the molecule captioning task.
| Method (Size) | BLEU-2 ↑ | BLEU-4 ↑ | METEOR ↑ | ROUGE-1 ↑ | ROUGE-2 ↑ | ROUGE-L ↑ |
|---|---|---|---|---|---|---|
| Closed-Source Models | ||||||
| GPT-4o (–) | 0.1194 | 0.0433 | 0.1651 | 0.2315 | 0.0738 | 0.1792 |
| GPT-4o-mini (–) | 0.1080 | 0.0400 | 0.1545 | 0.2310 | 0.0723 | 0.1776 |
| Open-Source Models | ||||||
| Qwen2.5-7B-Instruct (7B) | 0.0792 | 0.0258 | 0.2132 | 0.2091 | 0.0601 | 0.1483 |
| DeepSeek-R1-Distill-Qwen-7B (7B) | 0.1173 | 0.0469 | 0.1544 | 0.2209 | 0.0749 | 0.1693 |
| Llama3.1-8B-Instruct (8B) | 0.1670 | 0.0769 | 0.2164 | 0.2806 | 0.1182 | 0.2250 |
| Qwen3-8B (8B) | 0.0974 | 0.0289 | 0.1733 | 0.2067 | 0.0501 | 0.1567 |
| Llama3.1-70B-Instruct (70B) | 0.1466 | 0.0658 | 0.1832 | 0.2736 | 0.1072 | 0.2203 |
| Qwen2.5-72B-Instruct (72B) | 0.1519 | 0.0647 | 0.1949 | 0.2729 | 0.0948 | 0.2067 |
| Mol-Instruction (7B) | 0.0956 | 0.0667 | 0.1891 | 0.2801 | 0.1823 | 0.2582 |
| MolReasoner (Ours) (7B) | 0.4383 | 0.3220 | 0.4754 | 0.5530 | 0.3662 | 0.4821 |
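The BLEU-2/4 columns above are built on modified n-gram precision between generated and reference captions. A simplified sketch of that core quantity (full BLEU, as computed by `eval_metrics.py` via NLTK, additionally applies a brevity penalty and a geometric mean over n-gram orders; the `ngram_precision` helper here is illustrative only):

```python
from collections import Counter

def ngram_precision(candidate, reference, n):
    """Clipped n-gram precision: candidate n-grams matched in the reference,
    divided by the total number of candidate n-grams."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    if not cand:
        return 0.0
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / sum(cand.values())
```

For instance, `ngram_precision("the molecule is an acid".split(), "the molecule is a base".split(), 2)` rewards only the bigrams shared with the reference.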
MolReasoner surpasses both closed-source and open-source baselines across all metrics, achieving state-of-the-art performance in this molecule generation task.
| Method (Size) | BLEU ↑ | Exact ↑ | Levenshtein ↓ | RDK FTS ↑ | MACCS FTS ↑ | MORGAN FTS ↑ | Frag-J ↑ | Frag-R ↑ | FG-Match ↑ | VALIDITY ↑ |
|---|---|---|---|---|---|---|---|---|---|---|
| Closed-Source Models | ||||||||||
| GPT-4o (–) | 0.1949 | 0.0045 | 49.3545 | 0.0926 | 0.2066 | 0.0836 | 0.1296 | 0.1777 | 0.3753 | 0.2916 |
| GPT-4o-mini (–) | 0.0522 | 0.0058 | 49.1371 | 0.0863 | 0.2032 | 0.0883 | 0.0987 | 0.1324 | 0.3898 | 0.1946 |
| Open-Source Models | ||||||||||
| Qwen2.5-7B-Instruct (7B) | 0.0002 | 0.0024 | 40.0076 | 0.0776 | 0.1585 | 0.0520 | 0.0773 | 0.1037 | 0.3601 | 0.2395 |
| DeepSeek-R1-Distill-Qwen-7B (7B) | 0.0000 | 0.0018 | 50.6957 | 0.0619 | 0.1327 | 0.0461 | 0.1101 | 0.1428 | 0.3847 | 0.0697 |
| Llama3.1-8B-Instruct (8B) | 0.0094 | 0.0027 | 40.2092 | 0.0556 | 0.1470 | 0.0470 | 0.0701 | 0.0918 | 0.3587 | 0.2319 |
| Qwen3-8B (8B) | 0.0000 | 0.0036 | 28.2564 | 0.3692 | 0.4733 | 0.3059 | 0.3406 | 0.3566 | 0.5280 | 0.0118 |
| Llama3.1-70B-Instruct (70B) | 0.0787 | 0.0055 | 44.1626 | 0.0824 | 0.2323 | 0.0785 | 0.1398 | 0.1963 | 0.3574 | 0.4641 |
| Qwen2.5-72B-Instruct (72B) | 0.0000 | 0.0048 | 18.0588 | 0.1584 | 0.3456 | 0.1432 | 0.1696 | 0.2300 | 0.3436 | 0.1134 |
| Mol-Instruction (7B) | 0.3049 | 0.0470 | 39.4268 | 0.2914 | 0.4427 | 0.2524 | 0.3333 | 0.4092 | 0.4324 | 0.9994 |
| MolReasoner (Ours) (7B) | 0.7841 | 0.0758 | 26.9255 | 0.4373 | 0.6759 | 0.3627 | 0.5213 | 0.6414 | 0.5390 | 0.9679 |
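The Levenshtein column above (lower is better) is the edit distance between the generated and reference SMILES strings. Our evaluation scripts use the `python-Levenshtein` package installed earlier; a pure-Python sketch of the same dynamic-programming recurrence, for illustration:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between strings a and b, counting insertions,
    deletions, and substitutions, via the classic DP recurrence."""
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,            # delete ca
                            curr[j - 1] + 1,        # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute
        prev = curr
    return prev[-1]
```

For example, `levenshtein("CCO", "CCN")` is 1: ethanol's SMILES differs from ethylamine's by a single substitution.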
While MolReasoner makes significant strides toward interpretable and effective molecular reasoning—bridging CoT-style supervision with chemistry-aware reinforcement learning—there remains ample room for exploration. In future work, we plan to:
- Broaden Task Coverage: Extend MolReasoner to additional molecular tasks (e.g. property prediction, retrosynthesis planning) and richer input modalities (e.g. 3D conformers, reaction schemes).
- Enhance Reward Design: Incorporate experimentally grounded metrics—such as synthetic accessibility, reaction feasibility, and bioactivity scores—into our multi-level rewards to further align model outputs with real-world chemistry.
- Scale and Efficiency: Investigate more efficient training strategies (e.g. off-policy RL, distillation) and adapt MolReasoner to larger LLM backbones without prohibitive compute costs.
- Robustness & Fairness: Evaluate model performance on out-of-distribution and negatively biased datasets, and develop techniques to mitigate hallucinations and semantic drift in generated reasoning chains.
We hope MolReasoner not only serves as a strong baseline but also inspires the community to push the boundaries of molecular LLMs—fostering new ideas, benchmarks, and open-source collaborations aimed at truly autonomous chemical reasoning.
@misc{zhao2025molreasonereffectiveinterpretablereasoning,
title={MolReasoner: Toward Effective and Interpretable Reasoning for Molecular LLMs},
author={Guojiang Zhao and Sihang Li and Zixiang Lu and Zheng Cheng and Haitao Lin and Lirong Wu and Hanchen Xia and Hengxing Cai and Wentao Guo and Hongshuai Wang and Mingjun Xu and Siyu Zhu and Guolin Ke and Linfeng Zhang and Zhifeng Gao},
year={2025},
eprint={2508.02066},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2508.02066},
}
We sincerely thank projects LLaMA-Factory, Verl, Mol-Instructions, and Visual-RFT for providing their open-source resources.