A curated collection of papers, datasets, and resources on Scientific Datasets and Large Language Models (LLMs), organized in reference to our survey: "A Survey of Scientific Large Language Models: From Data Foundations to Agent Frontiers"
If you spot any mistakes or have suggestions, feel free to reach out by email: [email protected]
(We also recommend CC'ing [email protected] and [email protected] in case of any delivery problems.)
If you find this repository or our survey helpful in your research, please kindly cite our paper:
@misc{hu2025surveyscientificlargelanguage,
title={A Survey of Scientific Large Language Models: From Data Foundations to Agent Frontiers},
author={Ming Hu and Chenglong Ma and Wei Li and Wanghan Xu and Jiamin Wu and Jucheng Hu and Tianbin Li and Guohang Zhuang and Jiaqi Liu and Yingzhou Lu and Ying Chen and Chaoyang Zhang and Cheng Tan and Jie Ying and Guocheng Wu and Shujian Gao and Pengcheng Chen and Jiashi Lin and Haitao Wu and Lulu Chen and Fengxiang Wang and Yuanyuan Zhang and Xiangyu Zhao and Feilong Tang and Encheng Su and Junzhi Ning and Xinyao Liu and Ye Du and Changkai Ji and Cheng Tang and Huihui Xu and Ziyang Chen and Ziyan Huang and Jiyao Liu and Pengfei Jiang and Yizhou Wang and Chen Tang and Jianyu Wu and Yuchen Ren and Siyuan Yan and Zhonghua Wang and Zhongxing Xu and Shiyan Su and Shangquan Sun and Runkai Zhao and Zhisheng Zhang and Yu Liu and Fudi Wang and Yuanfeng Ji and Yanzhou Su and Hongming Shan and Chunmei Feng and Jiahao Xu and Jiangtao Yan and Wenhao Tang and Diping Song and Lihao Liu and Yanyan Huang and Lequan Yu and Bin Fu and Shujun Wang and Xiaomeng Li and Xiaowei Hu and Yun Gu and Ben Fei and Zhongying Deng and Benyou Wang and Yuewen Cao and Minjie Shen and Haodong Duan and Jie Xu and Yirong Chen and Fang Yan and Hongxia Hao and Jielan Li and Jiajun Du and Yanbo Wang and Imran Razzak and Chi Zhang and Lijun Wu and Conghui He and Zhaohui Lu and Jinhai Huang and Yihao Liu and Fenghua Ling and Yuqiang Li and Aoran Wang and Qihao Zheng and Nanqing Dong and Tianfan Fu and Dongzhan Zhou and Yan Lu and Wenlong Zhang and Jin Ye and Jianfei Cai and Wanli Ouyang and Yu Qiao and Zongyuan Ge and Shixiang Tang and Junjun He and Chunfeng Song and Lei Bai and Bowen Zhou},
year={2025},
eprint={2508.21148},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2508.21148},
}
In addition, "Awesome-Agent-Scientists" highlights the latest advances of AI agents in scientific research, which nicely complements our work.
@article{wei2025ai,
title={From AI for Science to Agentic Science: A Survey on Autonomous Scientific Discovery},
author={Wei, Jiaqi and Yang, Yuejin and Zhang, Xiang and Chen, Yuhan and Zhuang, Xiang and Gao, Zhangyang and Zhou, Dongzhan and Wang, Guangshuai and Gao, Zhiqiang and Cao, Juntai and others},
journal={arXiv preprint arXiv:2508.14111},
year={2025}
}
Cumulative trend of publications on major preprint platforms whose titles or abstracts mention the keyword "language model" or the combination "language model + scientific domain" (e.g., chemistry, physics, multi-omics, medicine, etc.). Left: Results from January 2018 to August 2025, from arXiv and PubMed. For arXiv, the matching includes "language model" in combination with additional science-related keywords; PubMed results are limited to occurrences in titles and abstracts. Both platforms show rapid growth. Right: Results from 2020 to August 2025, from bioRxiv, medRxiv, and ChemRxiv, all based on direct matches of "language model" in titles and abstracts. While the overall volumes are smaller than arXiv and PubMed, all three platforms, especially bioRxiv, show rapid acceleration, reflecting growing interdisciplinary interest in large language models across biomedical, chemical, and computational sciences.
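The counting behind such a trend plot can be sketched roughly as follows. This is a minimal illustration, not the survey's actual pipeline: the `DOMAIN_KEYWORDS` tuple, the `(year, title, abstract)` record layout, the function names, and the toy records are all assumptions made for the example, and real queries would go through each platform's own search interface.

```python
from collections import Counter
from itertools import accumulate

# Hypothetical science-related keywords; the survey's actual list is not given here.
DOMAIN_KEYWORDS = ("chemistry", "physics", "multi-omics", "medicine")


def matches(text, require_domain=False):
    """True if text mentions 'language model', optionally with a domain keyword."""
    lowered = text.lower()
    if "language model" not in lowered:
        return False
    return (not require_domain) or any(k in lowered for k in DOMAIN_KEYWORDS)


def cumulative_counts(records, require_domain=False):
    """Return [(year, cumulative matches)] sorted by year.

    records: iterable of (year, title, abstract) tuples (assumed layout).
    """
    per_year = Counter(
        year
        for year, title, abstract in records
        if matches(f"{title} {abstract}", require_domain)
    )
    years = sorted(per_year)
    totals = accumulate(per_year[y] for y in years)
    return list(zip(years, totals))


# Toy records, made up for illustration only.
records = [
    (2018, "A language model for text", "General NLP."),
    (2019, "Language models in chemistry", "Molecule generation."),
    (2019, "Graph networks", "No match here."),
    (2020, "Protein design", "We use a language model for medicine."),
]

print(cumulative_counts(records))
print(cumulative_counts(records, require_domain=True))
```

Passing `require_domain=True` corresponds to the "language model + scientific domain" combination, while the default corresponds to a direct match of "language model" alone.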
The evolution of Sci-LLMs reveals four paradigm shifts from 2018 to 2025: (1) transfer learning approaches, (2) the scaling era, marked by knowledge integration in larger models, (3) instruction-following capabilities that enable flexible task adaptation, and (4) scientific agents, AI systems capable of autonomously conducting scientific research, from hypothesis generation and experimental design to data analysis and discovery. Note: Model positions reflect their release dates (x-axis) rather than strict paradigm classification. The four paradigms represent evolving trends in Sci-LLM development with overlaps and continuities, not mutually exclusive categories.
Chronological overview of notable Sci-LLMs categorized by six scientific domains, spanning from 2019 through early 2025. Due to the rapid expansion of the field, this figure presents a selective overview.
- Awesome-Scientific-Datasets-and-LLMs
Dataset | Domain | Modality | Purpose | Type | Release | Language | Source | Annotation Pipeline | Human Annotators | Human Tasks | Auto-annotation Tools | Size |
---|---|---|---|---|---|---|---|---|---|---|---|---|
MIRAGE | Agriculture | Biological entity photos | SFT | VQA (multi-image) | 2025.06 | EN | Scientific databases | Semi-automated | N/A | Data generation | GPT-4.1 | 37,512 |
CROP | Agriculture | Academic papers | SFT | Text QA | 2024.09 | EN, ZH | Academic and research resources | Semi-automated | N/A | Data generation | GPT-4 | 211,909 |
ToT-Biology | General Biology | Biomedical QA | SFT, CoT | Text QA with CoT | 2025.01 | EN | Academic and research resources | N/A | N/A | N/A | N/A | 23,000 |
BioASQ10b-factoid | General Biology | Clinical dialogue | SFT | Text QA | 2023.07 | EN | Academic and research resources | Manual | N/A | Data generation and review | N/A | 1.25K |
ReasonMed | Healthcare and Medical Sciences | Clinical dialogue | SFT, CoT | Text QA with CoT | 2025.06 | EN | Comprehensive multi-source integration | Automated | N/A | N/A | Qwen-2.5-72B, DeepSeek-R1-Distill-Llama-70B, HuatuoGPT-o1-70B | 194,925 |
Open-PMC-18M | Healthcare and Medical Sciences | CT, CFP | Pre-training | Image-text | 2025.06 | EN | Academic and research resources | Automated | N/A | N/A | N/A | 25,000,000 |
ReXVQA | Healthcare and Medical Sciences | X-ray | SFT | VQA | 2025.06 | EN | Integration of existing datasets | Semi-automated | 3 | Data review | GPT-4o, ClinicalBERT, MedEmbed | 613,277 |
RexGradient-160K | Healthcare and Medical Sciences | X-ray | Pre-training, SFT | Image-text | 2025.05 | EN | Scientific databases | Manual | N/A | N/A | N/A | 160K |
AlphaMed19K | Healthcare and Medical Sciences | Biomedical QA | SFT, CoT | Text QA | 2025.05 | EN | Integration of existing datasets | Automated | N/A | Data generation and review | N/A | 19,178 |
Derm1M | Healthcare and Medical Sciences | Dermatological images | Pre-training | Image-text | 2025.03 | EN | Social media and forums, Academic and research resources | Automated | N/A | N/A | DenseNet, DINO, GPT-4o, Whisper | 1,029,761 |
MedVideoCap-55K | Healthcare and Medical Sciences | Medical videos | Pre-training, SFT | Video-text | 2025.04 | EN | Web and Internet content | Automated | N/A | Data review | GPT-4o | 55,803 |
medical-o1-reasoning-SFT | Healthcare and Medical Sciences | Clinical dialogue | SFT, CoT | Text QA with CoT | 2025.04 | EN, ZH | Comprehensive multi-source integration | Automated | N/A | N/A | DeepSeek-R1 | 90,200 |
GMAI-Reasoning10K | Healthcare and Medical Sciences | CT, Dermatology, Endoscopy, CFP, Histopathology, MRI, Microscopy, OCT, PET, US, X-ray, etc. | SFT | VQA | 2025.04 | EN | Comprehensive multi-source integration | Semi-automated | N/A | Data review | GPT-4o | 17,004 |
MedReason | Healthcare and Medical Sciences | Clinical dialogue | SFT, CoT | Text QA with CoT | 2025.03 | EN | Comprehensive multi-source integration | Automated | N/A | N/A | N/A | 32,682 |
GEMeX-VQA | Healthcare and Medical Sciences | X-ray | Pre-training, SFT | VQA | 2025.03 | EN | Integration of existing datasets | Semi-automated | N/A | Data review | OpenBioLLM-70B, GPT-4o | 1,601,615 |
MIMIC-Diff-VQA | Healthcare and Medical Sciences | X-ray | SFT | VQA (multi-image) | 2025.02 | EN | Scientific databases | Semi-automated | 3 | Data generation and review | ScispaCy | 630,633 |
ICG-CXR | Healthcare and Medical Sciences | X-ray | SFT | VQA (multi-image) | 2025.03 | EN | Scientific databases | Automated | N/A | Data generation and review | GPT-4 | 11,439 |
VL-Health | Healthcare and Medical Sciences | CT, CFP, MRI, Microscopy, OCT, US, X-ray | Pre-training, SFT | Image-text, VQA | 2025.02 | EN, ZH | Comprehensive multi-source integration | Semi-automated | N/A | Data review | GPT-4o | 1,548,847 |
BIOMEDICA | Healthcare and Medical Sciences | Academic papers | Pre-training | Raw text | 2025.01 | EN | Academic and research resources | Semi-automated | 7 | Data review | N/A | 2,400,000 |
AfriMed-QA v2 | Healthcare and Medical Sciences | Clinical dialogue | SFT | Text QA | 2024.11 | EN | Comprehensive multi-source integration | Semi-automated | N/A | N/A | N/A | 15,275 |
GMAI-VL-5.5M | Healthcare and Medical Sciences | CT, Dermatology, Endoscopy, CFP, Histopathology, MRI, Microscopy, OCT, PET, US, X-ray, etc. | SFT | VQA, Text QA | 2024.11 | EN, ZH | Comprehensive multi-source integration | Semi-automated | 5 | Data review | GPT-4o | 5.5M |
OphVL | Healthcare and Medical Sciences | Ophthalmic Surgical Video | Pre-training | Video-text | 2024.11 | EN | Web and Internet content | Automated | N/A | Data generation and review | SurgicBERTa, GPT-4o | 375,198 |
Bora-v1 | Healthcare and Medical Sciences | Endoscopy, MRI, Microscopy, US | SFT | Video-text | 2024.10 | EN | Integration of existing datasets | Automated | N/A | Data review | N/A | 4,897 |
MedSyn | Healthcare and Medical Sciences | Clinical documentation | Pre-training | Raw text | 2024.08 | RU | Academic and research resources | Automated | N/A | N/A | GPT-4, Medical Knowledge Graph | 41,200 |
RealMedQA | Healthcare and Medical Sciences | Biomedical QA | SFT | Text QA | 2024.08 | EN | Encyclopedias and knowledge bases | Semi-automated | 6 | Data generation and review | GPT-3.5-turbo | 1,200 |
MedTrinity-25M | Healthcare and Medical Sciences | CT, MRI, X-ray, Histopathology, etc. | Pre-training | Image-text, VQA | 2024.08 | EN | Integration of existing datasets, Scientific databases | Automated | N/A | N/A | N/A | 25,000,000 |
MedPix-single | Healthcare and Medical Sciences | CT, MRI, US, X-ray | Pre-training | Image-text | 2024.07 | EN | Scientific databases | Manual | N/A | Data generation | N/A | 59,000 |
BIMCV-R | Healthcare and Medical Sciences | CT | Pre-training | Image-text | 2024.07 | EN | Scientific databases | Semi-automated | 20+ | Data review | GPT-4 | 8,069 |
MIMIC-Ext-MIMIC-CXR-VQA | Healthcare and Medical Sciences | X-ray | Pre-training, SFT | VQA | 2024.07 | EN | Integration of existing datasets | Semi-automated | 4 | Data review | GPT-4 | 377,391 |
EHRXQA | Healthcare and Medical Sciences | X-ray | Pre-training, SFT | VQA | 2024.07 | EN | Integration of existing datasets | Semi-automated | 4 | Data review | GPT-4 | 46,152 |
CheXpertPlus | Healthcare and Medical Sciences | X-ray | Pre-training | Image-text | 2024.06 | EN | Scientific databases | Semi-automated | 10 | Data generation and review | CheXbert, RadGraph | 223,228 |
PubMedVision | Healthcare and Medical Sciences | CT, Endoscopy, CFP, Infrared Reflectance, MRI, Microscopy, OCT, US, X-ray | SFT | VQA | 2024.06 | EN | Academic and research resources | Automated | N/A | N/A | GPT-4, GPT-4V, SentenceBERT | 1,294,092 |
MediQ | Healthcare and Medical Sciences | EHR | SFT | Text QA | 2024.06 | EN | Academic and research resources | Automated | N/A | N/A | GPT-3.5, LLaMA-3 | 2,545 |
HuatuoGPT2-SFT-GPT4-140K | Healthcare and Medical Sciences | Clinical dialogue | SFT | Text QA | 2024.06 | ZH | Other sources | Automated | N/A | Data generation and review | GPT-4 | 140,000 |
Asclepius-Synthetic-Clinical-Notes | Healthcare and Medical Sciences | EHR | SFT | Text QA | 2024.06 | EN | Academic and research resources | Semi-automated | N/A | Data generation | GPT-3.5 | 158,114 |
Know Medical Dialogues | Healthcare and Medical Sciences | Clinical dialogue | SFT | Text QA | 2024.06 | EN | Web and Internet content | Automated | N/A | N/A | N/A | 480 |
Duvel | Healthcare and Medical Sciences | Academic papers | SFT | Classification | 2024.05 | EN | Scientific databases | Semi-automated | N/A | Data generation | ALAMBIC | 6,553 |
SkinCAP | Healthcare and Medical Sciences | Dermatology | Pre-training | Image-text | 2024.05 | EN | Academic and research resources | Semi-automated | N/A | N/A | N/A | 4,000 |
MM-Retinal | Healthcare and Medical Sciences | CFP, FFA, OCT | Pre-training, SFT | Image-text | 2024.05 | EN, ZH | Academic and research resources | Semi-automated | 6 | Data review | N/A | 4,349 |
M3D-Data (caption) | Healthcare and Medical Sciences | CT, Clinical reports | Pre-training, SFT | Image-text, Text QA, VQA | 2024.04 | EN | Scientific databases, Integration of existing datasets | Semi-automated | N/A | Data generation and review | GPT-4V | 120,092 |
M3D-Data (instruction) | Healthcare and Medical Sciences | CT, Clinical reports | SFT | Image-text, Text QA, VQA | 2024.04 | EN | Scientific databases, Integration of existing datasets | Semi-automated | N/A | Data generation and review | GPT-4V | 58,180 |
RadGenome-Chest CT | Healthcare and Medical Sciences | CT | Pre-training, SFT | VQA, Image-text | 2024.04 | EN | Academic and research resources | Semi-automated | N/A | Data review | SAT, GPT-4, GPT-2 | 1,965,000 |
CXR-LLM | Healthcare and Medical Sciences | X-ray | SFT | VQA | 2024.03 | EN | Integration of existing datasets | Semi-automated | N/A | Data generation | GPT-4 | 104,892 |
MedChatZH | Healthcare and Medical Sciences | Clinical dialogue | Pre-training, SFT | Text QA | 2024.03 | ZH | Comprehensive multi-source integration | Semi-automated | N/A | Data generation | N/A | 2,068,823 |
Mental health chatbot dataset | Healthcare and Medical Sciences | Clinical dialogue | SFT | Text QA | 2024.02 | EN | Web and Internet content | Automated | N/A | N/A | N/A | 172 |
StatPearls | Healthcare and Medical Sciences | Academic papers | Pre-training | Raw text | 2024.02 | EN | Scientific databases | Automated | N/A | N/A | N/A | 301,202 |
Quilt-Instruct | Healthcare and Medical Sciences | Histopathology | SFT | VQA | 2024.02 | EN | Web and Internet content | Semi-automated | N/A | Data review | GPT-4-turbo | 107,131 |
SHADR | Healthcare and Medical Sciences | EHR | SFT | Classification | 2024.01 | EN | Scientific databases | Semi-automated | N/A | Data review | GPT-3.5 | 446 |
RJUA-QA | Healthcare and Medical Sciences | Diagnosis report, Clinical dialogue | SFT | Text QA | 2023.12 | ZH | Other sources | Manual | N/A | Data generation and review | N/A | 1,705 |
RP3D-DiagDS | Healthcare and Medical Sciences | CT, MRI, X-ray, US, Fluoroscopy, etc. | Pre-training | Classification | 2023.12 | EN | Scientific databases | Semi-automated | N/A | Data generation and review | Custom crawlers, GPT-4 | 40,936 |
PMC-Inline | Healthcare and Medical Sciences | CT, MRI, PET, US, X-ray | Pre-training | Image-text | 2023.11 | EN | Academic and research resources | Automated | N/A | N/A | N/A | 11,000,000 |
ROCOv2 | Healthcare and Medical Sciences | CT, MRI, PET, US, X-ray | Pre-training, SFT | Image-text | 2023.11 | EN | Academic and research resources | Semi-automated | N/A | N/A | fastText, MedCAT | 80,080 |
PMC-CaseReport | Healthcare and Medical Sciences | X-ray | SFT | Image-text, VQA | 2023.11 | EN | Academic and research resources | Automated | N/A | N/A | N/A | 1,100,000 |
MedMD | Healthcare and Medical Sciences | CT, MRI, PET, US, X-ray | Pre-training, SFT | Image-text, VQA | 2023.11 | EN | Academic and research resources | Semi-automated | 8 | Data review | ChatGPT | 16,000,000 |
Taiyi-Instruction-Data-001 | Healthcare and Medical Sciences | Diagnosis report, Clinical dialogue, EMR, Academic papers, etc. | Pre-training, SFT | Text QA | 2023.11 | EN, ZH | Integration of existing datasets | Automated | N/A | Data review | N/A | 1,114,315 |
MTS-DIALOG | Healthcare and Medical Sciences | Clinical dialogue | Pre-training | Text QA | 2023.11 | EN | Academic and research resources | Semi-automated | 12 | Data generation and review | GPT-4o | 23,977 |
MTS-Dialog | Healthcare and Medical Sciences | Clinical dialogue | Pre-training | Raw text | 2023.11 | EN | Patent databases | Semi-automated | 9 | Data generation and review | OPUS-MT, BART | 1,701 |
Clinical Guidelines | Healthcare and Medical Sciences | Clinical guidelines | Pre-training | Text QA with CoT | 2023.11 | EN | Scientific databases | Semi-automated | N/A | Data review | S2ORC, GROBID | 38,000 |
INSPECT | Healthcare and Medical Sciences | CT | Pre-training | Image-text | 2023.11 | EN | Scientific databases | Semi-automated | N/A | Data review, Data generation | Clinical Longformer | 23,248 |
AeroPath | Healthcare and Medical Sciences | CT | Agent | Segmentation | 2023.11 | EN | Scientific databases | Semi-automated | 2 | Data review | 3D Slicer | 27 (CT scans) |
MORFITT | Healthcare and Medical Sciences | Clinical papers | Pre-training | Classification | 2023.11 | FR | Academic and research resources | Manual | N/A | Data review | N/A | 3,556 |
NoteChat | Healthcare and Medical Sciences | Clinical dialogue | SFT | Text QA | 2023.10 | EN | Integration of existing datasets | Automated | N/A | N/A | N/A | 207,000 |
ChiMed-VL | Healthcare and Medical Sciences | X-ray, CT, MRI, etc. | Pre-training, SFT | Image-text, Text QA | 2023.10 | ZH, EN | Integration of existing datasets | Automated | N/A | N/A | GPT-3.5 | 1,049,455 |
OncQA | Healthcare and Medical Sciences | Diagnosis report | SFT | Text QA | 2023.10 | EN | Other sources | Manual | 6 | Data generation and review | GPT-4 | 156 |
SDOH-NLI | Healthcare and Medical Sciences | Clinical notes | Pre-training | Classification | 2023.10 | EN | Integration of existing datasets | Manual | N/A | Data generation | N/A | 21.1K |
CMtMedQA | Healthcare and Medical Sciences | Clinical dialogue | SFT | Text QA | 2023.08 | ZH | Other sources | Automated | N/A | Data review | N/A | 68,000 |
DISC-Med-SFT | Healthcare and Medical Sciences | Clinical dialogue | SFT | Text QA | 2023.08 | ZH | Integration of existing datasets | Semi-automated | N/A | Data review | GPT-3.5, GPT-4 | 470,000 |
Healix-V1 | Healthcare and Medical Sciences | Clinical dialogue | SFT | Text QA | 2023.07 | EN | Comprehensive multi-source integration | N/A | N/A | N/A | N/A | 796,239 |
Medical Cord19 | Healthcare and Medical Sciences | Academic papers | Pre-training | Raw text | 2023.07 | EN | Academic and research resources | Automated | N/A | N/A | N/A | 250,000 |
Pile-PubMed Central | Healthcare and Medical Sciences | Academic papers | Pre-training | Raw text | 2023.07 | EN | Academic and research resources | Automated | N/A | Data generation | N/A | N/A |
AGCT | Healthcare and Medical Sciences | Biomedical knowledge base | Pre-training | Raw text | 2023.07 | EN, FR | Scientific databases | Automated | N/A | N/A | Custom generation | 421,216 |
Synthetic CSAW 100k Mammograms | Healthcare and Medical Sciences | Mammography | SFT | Image-text | 2023.07 | EN | Scientific databases | Automated | N/A | N/A | Diffusion Model | 100K |
Quilt-1M | Healthcare and Medical Sciences | Histopathology | Pre-training, SFT | Image-text | 2023.06 | EN | Academic and research resources, Web and Internet content, Other sources | Automated | N/A | N/A | N/A | 1,000,000 |
LLaVA-Med | Healthcare and Medical Sciences | CT, Histopathology, MRI, Microscopy, PET, US, X-ray | Pre-training, SFT | VQA, Image-text | 2023.06 | EN | Comprehensive multi-source integration | Automated | N/A | N/A | GPT-4 | 630,000 |
ShenNong-TCM-Dataset | Healthcare and Medical Sciences | Clinical dialogue | SFT, CoT | Text QA | 2023.06 | ZH | Comprehensive multi-source integration | Automated | N/A | Data generation | ChatGPT | 113,000 |
PMC-VQA | Healthcare and Medical Sciences | CT, CFP, Histopathology, MRI, Microscopy, US, X-ray | SFT | VQA | 2023.05 | EN | Academic and research resources | Automated | N/A | N/A | N/A | 226,946 |
ChatMed-Consult-Dataset | Healthcare and Medical Sciences | Clinical dialogue | SFT | Text QA | 2023.05 | ZH | Web and Internet content | Automated | N/A | Data generation | GPT-3.5-Turbo | 549,000 |
QiZhenGPT-20k | Healthcare and Medical Sciences | Clinical dialogue | SFT | Text QA | 2023.05 | ZH | Other sources | Automated | N/A | Data generation | N/A | 20,000 |
Huatuo-26M | Healthcare and Medical Sciences | Biomedical QA | Pre-training, SFT | Text QA | 2023.05 | EN | Encyclopedias and knowledge bases | Semi-automated | N/A | Data review | BERT, T5 | 26,000,000 |
Huatuo26M-Lite | Healthcare and Medical Sciences | Clinical dialogue, Diagnosis report | Pre-training, SFT | Text QA | 2023.05 | ZH | Web and Internet content | Semi-automated | N/A | Data review | ChatGPT | 177,703 |
Visual Med-Alpaca | Healthcare and Medical Sciences | CT, CFP, Histopathology, MRI, Microscopy, US, X-ray | SFT | VQA | 2023.04 | EN | Scientific databases | Automated | N/A | N/A | GPT-3.5 | 54,000 |
MedAlpaca | Healthcare and Medical Sciences | Clinical dialogue, Academic papers | Pre-training, SFT | Raw text, Text QA | 2023.04 | EN | Comprehensive multi-source integration | Automated | N/A | Data generation and review | N/A | 860,076 |
Med-ChatGLM | Healthcare and Medical Sciences | Biomedical knowledge base | SFT | Text QA | 2023.04 | ZH | Integration of existing datasets | Automated | N/A | Data generation | GPT-3.5 | 7,622 |
PMC-OA | Healthcare and Medical Sciences | CT, Dermatology, Endoscopy, Histopathology, Microscopy, MRI, OCT, PET, X-ray | Pre-training | Image-text | 2023.03 | EN | Academic and research resources | Automated | N/A | Data generation and review | ResNet101 (DocFigure), ResNet34 (DETR MedICaT), PMC-CLIP | 1,646,592 |
ChatDoctor | Healthcare and Medical Sciences | Clinical dialogue | SFT | Text QA | 2023.03 | EN | Other sources | Semi-automated | N/A | Data generation and review | N/A | 115,000 |
WikiMedQA | Healthcare and Medical Sciences | Clinical Reports | SFT | Text QA | 2023.03 | EN | Web and Internet content | Semi-automated | N/A | N/A | SentenceBERT, BioLinkBERT | 111,895 |
MIMIC-IV | Healthcare and Medical Sciences | EHR | Pre-training, SFT | Raw text | 2023.01 | EN | Scientific databases | Semi-automated | N/A | N/A | Transformer-DeID | 364,627 |
BioRED | Healthcare and Medical Sciences | Academic papers | Pre-training | Classification | 2022.09 | EN | Scientific databases | Semi-automated | 6 | Data generation and review | PubTator | 500 |
ViHealthQA | Healthcare and Medical Sciences | Biomedical QA | SFT | Text QA | 2022.06 | VI | Social media and forums | Manual | N/A | Data generation | N/A | 10,015 |
MedMCQA | Healthcare and Medical Sciences | Medical exams | SFT | Text QA | 2022.03 | EN | Books and literary works | Automated | N/A | Data generation | N/A | 193,155 |
PMC-Patients-ReCDS | Healthcare and Medical Sciences | Clinical dialogue | SFT | Text QA | 2022.02 | EN | Academic and research resources | Automated | N/A | N/A | N/A | 293,000 |
PMC-Patients | Healthcare and Medical Sciences | Clinical report | Pre-training | Raw text | 2022.02 | EN | Scientific databases | Semi-automated | N/A | Data review | PubMedBERT, BioLinkBERT | 167,000 |
CMCQA | Healthcare and Medical Sciences | Clinical dialogue | SFT | Text QA | 2022.01 | ZH | Web and Internet content | Automated | N/A | Data review | N/A | 1,294,753 |
IMCS-V2 | Healthcare and Medical Sciences | Clinical dialogue | SFT | Text QA | 2022.01 | ZH | Other sources | Manual | N/A | Data generation and review | N/A | 4,116 |
MLEC-QA | Healthcare and Medical Sciences | Biomedical QA | SFT | Raw text | 2021.11 | ZH | Academic and research resources | Semi-automated | N/A | Data generation and review | N/A | 136,236 |
ImageClef-VQA Med 2021 | Healthcare and Medical Sciences | CT, MRI, US, X-ray | SFT | VQA | 2021.09 | EN | Academic and research resources | Automated | N/A | N/A | N/A | 4,500 |
BioLeaflets | Healthcare and Medical Sciences | Package leaflets | Pre-training | Raw text | 2021.09 | EN | Web and Internet content | Semi-automated | N/A | Data generation | Stanza, Amazon Comprehend Medical | 1,067 |
MedGPT-5k-ko | Healthcare and Medical Sciences | Clinical trials, EHR, Medical forum, Medical textbooks | SFT | Classification, Text QA | 2021.06 | ZH | Scientific databases, Books and literary works, Web and Internet content, Comprehensive multi-source integration | Manual | 3 | Data generation and review | N/A | 149,141 |
CBLUE | Healthcare and Medical Sciences | Clinical trials, EHR, Medical forum, Medical textbooks | SFT | Classification, Text QA | 2021.06 | ZH | Scientific databases, Books and literary works, Web and Internet content, Comprehensive multi-source integration | Manual | 3 | Data generation and review | N/A | 149,141 |
MedDG | Healthcare and Medical Sciences | Clinical dialogue | SFT | Text QA | 2021.05 | ZH | Web and Internet content | Automated | N/A | Data generation and review | N/A | 100,000 |
SLAKE | Healthcare and Medical Sciences | CT, MRI, X-ray | SFT | VQA | 2021.02 | EN, ZH | Academic and research resources | Automated | N/A | N/A | N/A | 11,958 |
Chinese-medical-dialogue-data | Healthcare and Medical Sciences | Clinical dialogue | SFT | Text QA | 2021.02 | ZH | Other sources | N/A | N/A | N/A | N/A | 792,099 |
DeepEyeNet | Healthcare and Medical Sciences | CFP, FFA | Pre-training | Image-text | 2021.01 | EN | Scientific databases | Manual | N/A | Data generation | N/A | 15,709 |
AIforCOVID | Healthcare and Medical Sciences | X-ray | Pre-training, SFT | Image-text | 2020.12 | EN | Scientific databases | Manual | N/A | Data generation | N/A | 820 |
MedICaT | Healthcare and Medical Sciences | CT, Endoscopy, Histopathology, MRI, Microscopy, PET, US, X-ray | Pre-training, SFT | Image-text | 2020.10 | EN | Academic and research resources | Semi-automated | 7 | Data generation | ResNet101-DocFigure, ScispaCy | 217,060 |
ImageClef-VQA Med 2020 | Healthcare and Medical Sciences | CT, MRI, US, X-ray | SFT | VQA | 2020.09 | EN | Academic and research resources | Automated | N/A | N/A | N/A | 4,000 |
MedQA | Healthcare and Medical Sciences | Medical exams | SFT | Text QA | 2020.09 | EN, ZH | Scientific databases | Manual | N/A | Data generation and review | N/A | 61,097 |
MedDialog-CN | Healthcare and Medical Sciences | Clinical dialogue | SFT | Text QA | 2020.07 | ZH | Web and Internet content | Automated | N/A | Data review | N/A | 1,100,000 |
MEDIQA-AnS | Healthcare and Medical Sciences | Consumer health QA | SFT | Text QA | 2020.05 | EN, ZH | Web and Internet content | Semi-automated | 2 | Data generation | Custom crawlers | 156 |
MedDialog | Healthcare and Medical Sciences | Clinical dialogue | SFT | Text QA | 2020.04 | EN, ZH | Web and Internet content | Automated | N/A | N/A | Custom crawlers | 14,668,058 |
PathVQA | Healthcare and Medical Sciences | Histopathology | SFT | VQA | 2020.03 | EN | Academic and research resources | Automated | N/A | N/A | N/A | 19,654 |
RetinaRocks | Healthcare and Medical Sciences | CFP | Pre-training, SFT | Image-text | 2019.12 | EN | Other sources | Manual | N/A | Data generation | N/A | 4,000 |
MedQuAD | Healthcare and Medical Sciences | Clinical dialogue | SFT | Text QA | 2019.10 | EN | Web and Internet content | Automated | N/A | N/A | Custom crawlers | 47,441 |
ImageClef-VQA Med 2019 | Healthcare and Medical Sciences | CT, MRI, US, X-ray | SFT | VQA | 2019.09 | EN | Academic and research resources | Automated | N/A | N/A | N/A | 15,292 |
PubMedQA | Healthcare and Medical Sciences | Academic papers | SFT | Text QA | 2019.09 | EN | Web and Internet content | Semi-automated | N/A | Data generation and review | N/A | 212,300 |
PubMedQA instruction | Healthcare and Medical Sciences | Academic papers | SFT | Text QA | 2019.09 | EN | Academic and research resources | Manual | N/A | Data generation | N/A | 1K |
MIMIC-CXR | Healthcare and Medical Sciences | X-ray | Pre-training | Image-text | 2019.08 | EN | Scientific databases | Manual | N/A | Data generation | N/A | 227,835 |
MIMIC-Extract | Healthcare and Medical Sciences | EHR | Pre-training | Text QA | 2019.07 | EN | Scientific databases | Automated | N/A | N/A | N/A | 2,000,000 |
webMedQA | Healthcare and Medical Sciences | Clinical dialogue | SFT | Text QA | 2019.03 | ZH | Web and Internet content | Automated | N/A | Data review | N/A | 63,284 |
VQA-RAD | Healthcare and Medical Sciences | CT, MRI, PET, US, X-ray | SFT | VQA | 2018.11 | EN | Academic and research resources | Manual | N/A | Data generation | N/A | 1,793 |
cMedQA2 | Healthcare and Medical Sciences | Clinical dialogue | SFT | Text QA | 2018.11 | ZH | Web and Internet content | Automated | N/A | Data review | N/A | 108,000 |
ROCO | Healthcare and Medical Sciences | CT, MRI, PET, US, X-ray | Pre-training, SFT | Image-text | 2018.09 | EN | Academic and research resources | Automated | N/A | N/A | N/A | 81,000 |
emrQA | Healthcare and Medical Sciences | Clinical dialogue | SFT | Text QA | 2018.09 | EN | Integration of existing datasets | Semi-automated | N/A | N/A | N/A | 455,000 |
ImageClef-VQA Med 2018 | Healthcare and Medical Sciences | CT, MRI, US, Unknown, X-ray | SFT | VQA | 2018.06 | EN | Academic and research resources | Automated | N/A | N/A | N/A | 6,413 |
LiveQA | Healthcare and Medical Sciences | Consumer health QA | SFT | Text QA | 2018.02 | EN | Scientific databases | Automated | N/A | Data review | N/A | 634 |
LiveQA trec2017 | Healthcare and Medical Sciences | Clinical dialogue | SFT | Text QA | 2017.08 | EN | Academic and research resources | Semi-automated | N/A | Data review | N/A | 634 |
OpenI | Healthcare and Medical Sciences | X-ray | Pre-training | Image-text | 2016.03 | EN | Scientific databases | Manual | N/A | Data generation | N/A | 3,955 |
Retina Image Bank | Healthcare and Medical Sciences | CFP, FFA | Pre-training, SFT | Image-text | 2012.08 | EN | Other sources | Manual | N/A | Data generation | N/A | 30,452 |
William Hoyt ImageText | Healthcare and Medical Sciences | CFP | Pre-training | Image-text | 2004.03 | EN | Scientific databases | Manual | N/A | Data generation | N/A | 856 |
Pima | Healthcare and Medical Sciences | EHR | SFT | Classification | 1988.11 | EN | Scientific databases | Manual | N/A | N/A | N/A | 691 |
COVID-19-Data-Hub | Healthcare and Medical Sciences | Global pandemic data (cases, vaccines, policies, etc.) | Pre-training, RAG | Classification, Regression | 2020.07 | EN | Comprehensive multi-source integration | Automated | N/A | N/A | R package | N/A |
BEACON | Molecular and Cellular Biology | RNA sequence | SFT | Raw text | 2024.06 | EN | Comprehensive multi-source integration | Semi-automated | N/A | Data generation and review | N/A | 870,883 |
SPICE | Molecular and Cellular Biology | SMILES | Pre-training, RAG | Classification, Regression | 2024.03 | EN | Scientific databases | Semi-automated | N/A | Data generation and review | N/A | 113,999 |
PubChemSTM | Molecular and Cellular Biology | SMILES | Pre-training, SFT | Raw text | 2024.01 | EN | Academic and research resources | Semi-automated | N/A | Data generation | SciBERT, spaCy | 281,000 |
SourceData | Molecular and Cellular Biology | Academic papers | Pre-training | VQA | 2023.10 | EN | Academic and research resources | Semi-automated | N/A | Data review | PubMedBERT, BioLinkBERT, GPT-4o | 62,543 |
Mol-Instructions | Molecular and Cellular Biology | Biomolecular instructions | SFT | Text QA | 2023.06 | EN | Comprehensive multi-source integration | Automated | N/A | Data review | GPT-3.5 | 2,043,000 |
PCdes | Molecular and Cellular Biology | SMILES | Pre-training, SFT | Raw text | 2022.12 | EN | Academic and research resources | Automated | N/A | N/A | Custom crawlers | 12,000 |
MoMu | Molecular and Cellular Biology | Graph | Pre-training, SFT | Raw text | 2022.12 | EN | Academic and research resources | Automated | N/A | N/A | OGB | 15,613 |
PEER | Molecular and Cellular Biology | Protein sequence | SFT | Classification, Regression | 2022.10 | EN | Comprehensive multi-source integration | Semi-automated | N/A | Data generation and review | N/A | 329,922 |
BioGPT | Molecular and Cellular Biology, Healthcare and Medical Sciences | Biomedical domain pretraining corpus | Pre-training, SFT | Raw text | 2022.08 | EN | Scientific databases, Academic and research resources | Automated | N/A | N/A | Moses tokenizer, fastBPE | 15M |
DISEASES | Molecular and Cellular Biology, Healthcare and Medical Sciences, Multi-omics | Disease-gene associations | SFT, RAG | Classification | 2015.01 | EN | Academic and research resources, Integration of existing datasets, Scientific databases | Semi-automated | N/A | Data generation and review | NER tagger | 8,336,442 |
BioReason | Molecular and Cellular Biology, Multi-omics | DNA sequence, KEGG pathways, Gene variants | SFT, CoT | Text QA with CoT | 2025.05 | EN | Scientific databases, Academic and research resources | Semi-automated | N/A | N/A | Custom scripts | 87,620 |
GeneChat | Multi-omics | Nucleotide sequence | Pre-training | Text QA | 2025.06 | EN | Scientific databases | N/A | N/A | Data generation | N/A | 47,275 |
Genomics instructions | Multi-omics | Nucleotide sequence | SFT | Text QA | 2025.04 | EN | Academic and research resources | N/A | N/A | Data generation | N/A | 4,954,234 |
scMMGPT data | Multi-omics | scRNA-seq | Pre-training, SFT | scRNA-seq-text | 2025.03 | EN | Academic and research resources | Automated | N/A | N/A | N/A | 467K |
OPI | Multi-omics | Protein | SFT | Text QA | 2025.03 | EN | Scientific databases | Semi-automated | N/A | Data generation | GPT-3.5 | 1,640,000 |
OpenGenome2 | Multi-omics | Nucleotide sequence | Pre-training | Raw text | 2025.02 | EN | Integration of existing datasets | N/A | N/A | N/A | N/A | 8,800B (nucleotides) |
Seq2Func | Multi-omics | Nucleotide sequence | SFT | Text QA | 2025.02 | EN | Scientific databases | Automated | N/A | Data generation | N/A | 297,000 |
DNA2Image | Multi-omics | Nucleotide sequence | SFT | Generation | 2025.02 | EN | Scientific databases | Automated | N/A | Data generation | N/A | 43,200 |
LLaMA-Gene (protein) | Multi-omics | Protein sequence | Pre-training, SFT | Text QA | 2024.12 | EN | Scientific databases | N/A | N/A | Data generation | N/A | 62,918 |
LLaMA-Gene (DNA) | Multi-omics | DNA sequence | Pre-training, SFT | Text QA | 2024.12 | EN | Scientific databases | N/A | N/A | Data generation | N/A | 178,551 |
OpenGenome | Multi-omics | Nucleotide sequence | Pre-training | Raw text | 2024.11 | EN | Integration of existing datasets | N/A | N/A | N/A | N/A | 300B (nucleotides) |
The 1000G | Multi-omics | Nucleotide sequence | Pre-training | Raw text | 2024.10 | EN | Scientific databases | N/A | N/A | N/A | N/A | 20,500B (nucleotides) |
Multispecies dataset | Multi-omics | Nucleotide sequence | Pre-training | Raw text | 2024.10 | EN | Scientific databases | N/A | N/A | N/A | N/A | 174B (nucleotides) |
NT Benchmark | Multi-omics | Nucleotide sequence | SFT | Classification | 2024.10 | EN | Academic and research resources | N/A | N/A | N/A | N/A | 493,242 |
ProteinLMDataset | Multi-omics | Protein sequence | Pre-training | Raw text | 2024.06 | EN | Academic and research resources | Automated | N/A | N/A | N/A | 893,000 |
RNAcentral | Multi-omics | RNA sequence | Pre-training | Raw text | 2024.05 | EN | Scientific databases | N/A | N/A | N/A | N/A | 23M |
RNA-QA | Multi-omics | RNA sequence | SFT | Text QA | 2024.05 | EN | Academic and research resources | Automated | N/A | N/A | GPT-4o | 407,616 |
ProCoT | Multi-omics | Biomedical QA | SFT, CoT | Text QA with CoT | 2024.05 | EN | Scientific databases, Academic and research resources | Semi-automated | N/A | Data generation and review | Embedding-based filtering | 4,967,723 |
UniProtKB/Swiss-Prot | Multi-omics | Protein sequence | Pre-training | Raw text | 2023.11 | EN | Scientific databases | N/A | N/A | N/A | N/A | 570K |
Multi-species genome | Multi-omics | Nucleotide sequence | Pre-training | Raw text | 2023.06 | EN | Integration of existing datasets | N/A | N/A | N/A | N/A | 32.49B (nucleotides) |
Genomic Benchmark | Multi-omics | Nucleotide sequence | SFT | Classification | 2023.05 | EN | Academic and research resources | N/A | N/A | N/A | N/A | 699,116 |
CELLxGENE scRNA-seq Collection | Multi-omics | scRNA-seq | Pre-training | Gene Expression-pretrain | 2023.05 | EN | Scientific databases | N/A | N/A | N/A | N/A | 33M |
Human Pancreas | Multi-omics | scRNA-seq | SFT | Classification | 2023.01 | EN | Academic and research resources | N/A | N/A | N/A | N/A | 10,600 |
scFoundation Dataset | Multi-omics | scRNA-seq | Pre-training | Gene Expression-pretrain | 2022.10 | EN | Scientific databases | N/A | N/A | N/A | N/A | 50M |
Human genome | Multi-omics | Nucleotide sequence | Pre-training | Raw text | 2021.02 | EN | Scientific databases | N/A | N/A | N/A | N/A | 2.75B (nucleotides) |
GPD | Multi-omics | Nucleotide sequence | Pre-training | Raw text | 2021.02 | EN | Scientific databases | N/A | N/A | N/A | N/A | 142,809 |
Myeloid | Multi-omics | scRNA-seq | SFT | Classification | 2021.02 | EN | Academic and research resources | N/A | N/A | N/A | N/A | 9,748 |
Human Cell Atlas Dataset | Multi-omics | scRNA-seq | SFT | Classification | 2021.02 | EN | Academic and research resources | N/A | N/A | N/A | N/A | 84,363 |
GVD | Multi-omics | Nucleotide sequence | Pre-training | Raw text | 2019.07 | EN | Scientific databases | N/A | N/A | N/A | N/A | 13,203B |
Multiple Sclerosis | Multi-omics | scRNA-seq | SFT | Classification | 2019.07 | EN | Academic and research resources | N/A | N/A | N/A | N/A | 7,844 |
PanglaoDB | Multi-omics | scRNA-seq | Pre-training | Gene Expression-pretrain | 2018.11 | EN | Scientific databases | N/A | N/A | N/A | N/A | 1,126,580 |
Zheng68k | Multi-omics | scRNA-seq | SFT | Classification | 2016.07 | EN | Academic and research resources | N/A | N/A | N/A | N/A | 68,450 |
GRCh38/hg38 | Multi-omics | Nucleotide sequence | Pre-training | Raw text | 2013.12 | EN | Scientific databases | N/A | N/A | N/A | N/A | 3.1B (nucleotides) |
Biology-Instructions | Multi-omics | DNA, RNA, Protein sequence | SFT | Text QA | 2024.12 | EN | Academic and research resources | Semi-automated | N/A | Data generation | GPT-4o, Claude-3.5-Sonnet | 3.3M |
TCPA | Multi-omics | Protein sequence | Pre-training | Raw text | 2013.09 | EN | Academic and research resources | N/A | N/A | N/A | N/A | 4,379 |
NCBI-GenBank | Multi-omics | Nucleotide sequence | Pre-training | Raw text | 2012.11 | EN | Scientific databases | N/A | N/A | N/A | N/A | 5,000B (nucleotides) |
GRCh37/hg19 | Multi-omics | Nucleotide sequence | Pre-training | Raw text | 2009.02 | EN | Scientific databases | N/A | N/A | N/A | N/A | 3.1B (nucleotides) |
Neuro-3D | Neuroscience | EEG | Pre-training, SFT | Classification | 2025.03 | EN | Academic and research resources | Semi-automated | N/A | N/A | N/A | 720 |
Things-MEG | Neuroscience | MEG | Pre-training, SFT | Classification | 2023.04 | EN | Academic and research resources | Semi-automated | N/A | N/A | N/A | 22,248 |
Things-EEG2 | Neuroscience | EEG | Pre-training, SFT | Classification | 2022.11 | EN | Academic and research resources | Semi-automated | N/A | N/A | N/A | 16,740 |
SHU | Neuroscience | EEG | Pre-training, SFT | Classification | 2022.08 | EN | Academic and research resources | Semi-automated | N/A | N/A | N/A | 11,988 |
Things-fMRI | Neuroscience | fMRI | Pre-training, SFT | Classification | 2022.07 | EN | Academic and research resources | Semi-automated | N/A | N/A | N/A | 8,740 |
NSD-Imagery | Neuroscience | fMRI | Pre-training, SFT | Classification | 2022.07 | EN | Academic and research resources | Semi-automated | N/A | N/A | N/A | 2,304 |
HMC | Neuroscience | EEG | Pre-training, SFT | Classification | 2022.03 | EN | Academic and research resources | Semi-automated | N/A | N/A | N/A | 154 |
Things-EEG1 | Neuroscience | EEG | Pre-training, SFT | Classification | 2022.01 | EN | Academic and research resources | Semi-automated | N/A | N/A | N/A | 22,248 |
NSD | Neuroscience | fMRI | Pre-training, SFT | Classification | 2021.09 | EN | Academic and research resources | Semi-automated | N/A | N/A | N/A | 70,566 |
ZuCo2 | Neuroscience | EEG | Pre-training, SFT | Text QA | 2019.11 | EN | Academic and research resources | Semi-automated | N/A | N/A | N/A | 739 |
DIR | Neuroscience | fMRI | Pre-training, SFT | Classification | 2019.01 | EN | Academic and research resources | Semi-automated | N/A | N/A | N/A | 6,000 |
Workload | Neuroscience | EEG | Pre-training, SFT | Classification | 2018.12 | EN | Academic and research resources | Semi-automated | N/A | N/A | N/A | 1,080 |
ZuCo1 | Neuroscience | EEG | Pre-training, SFT | Text QA | 2018.11 | EN | Academic and research resources | Semi-automated | N/A | N/A | N/A | 1,107 |
SEED-IV | Neuroscience | EEG | Pre-training, SFT | Classification | 2018.02 | EN | Academic and research resources | Semi-automated | N/A | N/A | N/A | 143,610 |
TUSL | Neuroscience | EEG | Pre-training, SFT | Classification | 2018.01 | EN | Academic and research resources | Semi-automated | N/A | N/A | N/A | 245 |
TUEV | Neuroscience | EEG | Pre-training, SFT | Classification | 2015.12 | EN | Academic and research resources | Semi-automated | N/A | N/A | N/A | 112,237 |
TUAB | Neuroscience | EEG | Pre-training, SFT | Classification | 2015.12 | EN | Academic and research resources | Semi-automated | N/A | N/A | N/A | 409,083 |
SEED | Neuroscience | EEG | Pre-training, SFT | Classification | 2015.05 | EN | Academic and research resources | Semi-automated | N/A | N/A | N/A | 144,851 |
Sleep-EDF | Neuroscience | EEG | Pre-training, SFT | Classification | 2013.10 | EN | Academic and research resources | Semi-automated | N/A | N/A | N/A | 197 |
SHHS | Neuroscience | EEG | Pre-training, SFT | Classification | 1998.01 | EN | Academic and research resources | Semi-automated | N/A | N/A | N/A | 6,441 |
repoDB | Pharmacy, Healthcare and Medical Sciences | Drug-disease relationships, Clinical trials | RAG | Classification, Text QA | 2017.03 | EN | Scientific databases | Automated | N/A | N/A | scripts | 15,648 |
Dataset | Domain | Modality | Purpose | Type | Release | Language | Source | Annotation Pipeline | Human Annotators | Human Tasks | Auto-annotation Tools | Size |
---|---|---|---|---|---|---|---|---|---|---|---|---|
MOSES | Biochemistry | SMILES | Pre-training | Raw text | 2020.07 | EN | Academic and research resources | Automated | N/A | Data generation and review | N/A | 1,936,962 |
ChEMBL | Biochemistry | SMILES | Pre-training | Raw text | 2012.01 | EN | Academic and research resources | Automated | N/A | Data generation and review | N/A | 1,961,462 |
ChemRxivQuest | General Chemistry | Academic papers | Pre-training, SFT | Text QA | 2025.05 | EN | Academic and research resources | Manual | N/A | Data generation and review | N/A | 970 |
ScholarChemQA | General Chemistry | Academic papers | Pre-training, SFT | Text QA | 2025.02 | EN | Academic and research resources | Manual | N/A | Data generation and review | N/A | 40K |
SMolInstruct | General Chemistry | SMILES | SFT | Text QA | 2024.08 | EN | Scientific databases | Semi-automated | N/A | Data generation and review | GPT-4 | 3.3M |
ChemNLP | General Chemistry | Text | Pre-training, SFT | Classification | 2023.01 | EN | Academic and research resources | Manual | N/A | Data generation and review | N/A | 110,342 |
PMO | General Chemistry | SMILES | Pre-training, SFT | Raw text | 2022.05 | EN | Academic and research resources | Automated | N/A | Data generation and review | N/A | 10K |
ZINC | General Chemistry | SMILES | Pre-training | Raw text | 2012.10 | EN | Academic and research resources | Automated | N/A | Data generation and review | N/A | 250K |
DeepProtein | Pharmacy | Protein sequence, SMILES | Pre-training, SFT | Raw text | 2025.05 | EN | Academic and research resources | Automated | N/A | Data generation and review | N/A | 78K |
TrialBench | Pharmacy | SMILES, Disease code | Pre-training, SFT | Raw text | 2024.09 | EN | Academic and research resources | Automated | N/A | Data generation and review | N/A | 470K |
TDC2 | Pharmacy | SMILES, Protein sequence, Genome sequence | Pre-training, SFT | Classification, Regression, Generation | 2024.09 | EN | Academic and research resources | Manual | N/A | Data generation and review | N/A | 3.4B (tokens) |
SBDDBench | Pharmacy | Text, Protein sequence, SMILES | Pre-training, SFT | Protein-ligand | 2022.06 | EN | Academic and research resources | Automated | N/A | Data generation and review | N/A | 5K |
TOP | Pharmacy | SMILES | Pre-training, SFT | Raw text | 2022.02 | EN | Academic and research resources | Automated | N/A | Data generation and review | N/A | 12K |
TDC | Pharmacy | SMILES, Protein sequence, Genome sequence | Pre-training, SFT | Classification, Regression, Generation | 2021.06 | EN | Academic and research resources | Manual | N/A | Data generation and review | N/A | 0.2B (tokens) |
DeepPurpose | Pharmacy | Protein sequence, SMILES | Pre-training, SFT | Raw text | 2020.12 | EN | Academic and research resources | Automated | N/A | Data generation and review | N/A | 5,074 |
DrugBank | Pharmacy | SMILES | Pre-training, SFT | Raw text | 2018.01 | EN | Academic and research resources | Manual | N/A | Data generation and review | N/A | 18K |
DrugCentral | Pharmacy | SMILES | Pre-training, SFT | Raw text | 2017.01 | EN | Academic and research resources | Automated | N/A | Data generation and review | N/A | 4,995 |
USPTO | Synthetic Chemistry | SMILES | Pre-training, SFT | Generation | 2015.07 | EN | Patent databases | Manual | N/A | Data generation and review | N/A | 1,939,253 |
Dataset | Domain | Modality | Purpose | Type | Release | Language | Source | Annotation Pipeline | Human Annotators | Human Tasks | Auto-annotation Tools | Size |
---|---|---|---|---|---|---|---|---|---|---|---|---|
MM-PhyQA | General Physics | High-school exams | SFT, CoT | VQA with CoT | 2024.04 | EN | Web and Internet content | Manual | N/A | Data generation and review | AFL 3.0 | 3,825 |
PIQA | General Physics | Text | SFT | Text QA | 2020.01 | EN | Other sources | Semi-automated | N/A | Data generation and review | AFLite | 19,838 |
Dataset | Domain | Modality | Purpose | Type | Release | Language | Source | Annotation Pipeline | Human Annotators | Human Tasks | Auto-annotation Tools | Size |
---|---|---|---|---|---|---|---|---|---|---|---|---|
AstroLLaVA | Astronomy | General dialog, Astronomical images | SFT | VQA | 2025.04 | EN | Comprehensive multi-source integration | Semi-automated | N/A | Data review | GPT-4 | 29,783 |
AstroPT | Astronomy | Astronomical images | Pre-training | Regression | 2024.05 | EN | Web and Internet content, Scientific databases | Automated | N/A | Data review | DESI Legacy Survey API | 8.6M (tokens) |
Astro-NER | Astronomy | Academic papers | SFT | Text QA | 2024.05 | EN | Academic and research resources | Semi-automated | 4 | Data generation and review | GPT-3.5 | 5,000 |
AstroLLaMA-chat | Astronomy | Academic papers | SFT | Text QA | 2024.01 | EN | Academic and research resources | Manual | N/A | Data review | N/A | 10,356 |
AstroLLaMA | Astronomy | Academic papers | SFT | Text QA | 2023.09 | EN | Academic and research resources, Web and Internet content | Manual | N/A | Data review | N/A | 9.5M |
ATel | Astronomy | Academic papers | SFT | Text QA | 2023.05 | EN | Academic and research resources | Manual | N/A | Data review | N/A | 234 |
AstroBERT | Astronomy | Academic papers | Pre-training | Raw text | 2022.11 | EN | Academic and research resources | Automated | 12 | Data generation and review | N/A | 3.8B (tokens) |
AstroMLab 4 | Astronomy | Academic papers | SFT | Text QA | 2025.05 | EN | Integration of existing datasets | Automated | N/A | Data generation and review | Gemini-1.5-Pro | 250,000 arXiv preprints |
AstroMLab 3 | Astronomy | Academic papers | SFT | Text QA | 2025.04 | EN | Academic and research resources | Automated | N/A | Data generation and review | Gemini-1.5-Pro | 3.3B (tokens) |
AstroMLab 2 | Astronomy | Academic papers | SFT | Text QA | 2024.09 | EN | Academic and research resources | Automated | N/A | Data generation and review | Gemini-1.5-Pro | 10,356 |
Starwhisper-pulsar | Astrophysics | Text, Pulsar diagnostic plots, Pulsar signals | SFT | Classification | 2024.04 | EN | Integration of existing datasets | Manual | N/A | Data generation and review | DeepSeek-VL-7B, InternVL2-40B | 106,674 |
PAPERCLIP | Astrophysics | Synthetic conversation text, Astronomical images | SFT | Image-text | 2024.03 | EN | Academic and research resources | Automated | N/A | Data review | Mixtral-8x7B-Instruct | 31,859 |
Dataset | Domain | Modality | Purpose | Type | Release | Language | Source | Annotation Pipeline | Human Annotators | Human Tasks | Auto-annotation Tools | Size |
---|---|---|---|---|---|---|---|---|---|---|---|---|
ChEBI-20-MM | Materials Science | InChI, IUPAC, SELFIES, Molecular image | SFT | Text QA | 2025.01 | EN | Integration of existing datasets | Manual | N/A | Data generation and review | N/A | 29,706 |
Materials Project Trajectory | Materials Science | CIF | Pre-training | Raw text | 2023.07 | EN | Scientific databases | Manual | N/A | Data generation and review | N/A | 1,580,395 |
DigiMOF | Materials Science | CIF | Pre-training | Raw text | 2023.05 | EN | Academic and research resources | Manual | N/A | Data generation and review | N/A | 15,501 |
Novel Materials Discovery (NOMAD) | Materials Science | CIF | Pre-training | Raw text | 2023.03 | EN | Scientific databases | Manual | N/A | Data generation and review | N/A | 4,341,443 |
MOFX-DB (hMOF) | Materials Science | CIF | Pre-training | Raw text | 2023.02 | EN | Integration of existing datasets | Manual | N/A | Data generation and review | N/A | 160,000 |
MatScholar | Materials Science | Academic papers | Pre-training | Raw text | 2022.07 | EN | Academic and research resources | Manual | N/A | Data generation and review | N/A | 5M |
Pfeiffer et al. Chemical composition | Materials Science | Chemical Composition | Pre-training | Raw text | 2022.03 | EN | Comprehensive multi-source integration | Manual | N/A | Data generation and review | N/A | 14,884 |
Pfeiffer et al. Mechanical Properties | Materials Science | Numerical property | Pre-training | Raw text | 2022.03 | EN | Comprehensive multi-source integration | Manual | N/A | Data generation and review | N/A | 1,278 |
ChEBI-20 | Materials Science | Scientific instruction | SFT | Text QA | 2021.11 | EN | Integration of existing datasets | Manual | N/A | Data generation and review | N/A | 29,709 |
ZINC | Materials Science | SMILES | Pre-training | Raw text | 2020.12 | EN | Scientific databases | Manual | N/A | Data generation and review | N/A | 230M |
JARVIS-DFT | Materials Science | InChI, IUPAC, SELFIES | Pre-training | Raw text | 2020.11 | EN | Integration of existing datasets | Manual | N/A | Data generation and review | N/A | 41,000 |
MOSES | Materials Science | SMILES | SFT | Text QA | 2020.11 | EN | Integration of existing datasets | Manual | N/A | Data generation and review | N/A | 1.6M |
QMOF | Materials Science | CIF | Pre-training | Raw text | 2020.05 | EN | Academic and research resources | Manual | N/A | Data generation and review | N/A | 20,000 |
Warwick Electron Microscopy Datasets | Materials Science | STEM image, TEM image, TEM exit wavefunction | SFT | VQA | 2020.05 | EN | Academic and research resources | Manual | N/A | Data generation and review | N/A | 135,395 |
CoRE MOF 2019 | Materials Science | CIF | Pre-training | Raw text | 2019.12 | EN | Integration of existing datasets | Manual | N/A | Data generation and review | N/A | 14,000 |
Inorganic Crystal Structure Database (ICSD) | Materials Science | CIF | Pre-training | Raw text | 2019.10 | EN | Scientific databases | Manual | N/A | Data generation and review | N/A | 318,901 |
US Patent Office (USPTO) | Materials Science | SMILES | Pre-training | Raw text | 2017.06 | EN | Patent databases | Manual | N/A | Data generation and review | N/A | 2,830,616 |
Open Quantum Materials Database (OQMD) | Materials Science | CIF | Pre-training | Raw text | 2014.11 | EN | Scientific databases | Manual | N/A | Data generation and review | N/A | 1,317,811 |
Materials Project | Materials Science | CIF | Pre-training | Raw text | 2013.07 | EN | Scientific databases | Manual | N/A | Data generation and review | N/A | 577,813 |
Dataset | Domain | Modality | Purpose | Type | Release | Language | Source | Annotation Pipeline | Human Annotators | Human Tasks | Auto-annotation Tools | Size |
---|---|---|---|---|---|---|---|---|---|---|---|---|
WeatherQA | Atmosphere | Remote sensing, Science QA | SFT | VQA | 2024.06 | EN | Scientific databases | Semi-automated | 4 | Data review | GPT-4 | 8,511 |
SeafloorAI | Hydrosphere | Sonar images, Text | SFT | VQA | 2024.11 | EN | Scientific databases | Semi-automated | 4 | Data review | GPT-4 | 7M |
TEOChatlas | Lithosphere | Remote sensing | SFT | VQA | 2025.01 | EN | Scientific databases | Automated | N/A | Data generation | N/A | 554K |
EarthVQA | Lithosphere | Remote sensing, Science QA | SFT | VQA | 2023.12 | EN | Scientific databases | Automated | N/A | Data generation | ArcGIS toolbox | 208K |
Geochat | Lithosphere | Remote sensing, Science QA | SFT | VQA | 2023.11 | EN | Scientific databases | Automated | N/A | Data generation | Vicuna-v1.5 | 306K |
FloodNet | Lithosphere | Remote sensing, Science QA | SFT | VQA | 2021.05 | EN | Scientific databases | Manual | N/A | Data generation and review | N/A | 11K |
GeoSignal | Lithosphere, Hydrosphere, Atmosphere | Remote sensing, Science QA | SFT | Text QA | 2023.06 | EN | Encyclopedias and knowledge bases, Academic and research resources, Scientific databases, Comprehensive multi-source integration | Semi-automated | 10 | Data review | GPT-4 | 39,749 |
GeoLLaVA-8k | Remote Sensing | Remote sensing | SFT | Image-text, VQA | 2025.05 | EN | Academic and research resources | Semi-automated | 35 | Data generation and review | GPT-4o | 81,367 |
EVAttrs-95k | Remote Sensing | Remote sensing, Object property | SFT | Image-text, VQA | 2025.03 | EN | Academic and research resources | Semi-automated | N/A | Data generation and review | Qwen2-VL-72B, GPT-4o | 95.1K |
VersaD | Remote Sensing | Remote sensing | Pre-training | Image-text | 2024.11 | EN | Academic and research resources | Automated | N/A | N/A | Gemini-Vision | 1.4M |
RSVP | Remote Sensing | Remote sensing | SFT | Image-text, VQA | 2024.10 | EN | Integration of existing datasets, Academic and research resources | Automated | N/A | N/A | GPT-4V, DINOv2-ViT L/14, CLIP-ConvNeXt | 3.65M |
FIT-RS | Remote Sensing | Remote sensing, Relation graph, etc. | SFT | Image-text, VQA | 2024.07 | EN | Integration of existing datasets, Academic and research resources | Semi-automated | N/A | Data generation | TinyLLaVA-3.1B, GPT-4, GPT-3.5, CLIP-ViT-L14 | 1,415K |
VRSBench | Remote Sensing | Remote sensing | SFT | Image-text, VQA | 2024.06 | EN | Academic and research resources | Semi-automated | N/A | Data review | GPT-4V | 142,390 |
MMRS-1M | Remote Sensing | Remote sensing, Optical, SAR, Infrared, etc. | SFT | Image-text, VQA | 2024.03 | EN | Integration of existing datasets | Automated | N/A | N/A | N/A | 1.06M |
ChatEarthNet | Remote Sensing | Remote sensing, Optical, Multi-band | SFT | Image-text | 2024.02 | EN | Scientific databases | Semi-automated | N/A | Data review | GPT-3.5, GPT-4V | 173,488 |
LHRS-Align | Remote Sensing | Remote sensing | Pre-training | Image-text | 2024.02 | EN | Scientific databases | Automated | N/A | N/A | Vicuna-v1.5-13B | 1.15M |
LHRS-Instruct | Remote Sensing | Remote sensing | SFT | Image-text, VQA | 2024.02 | EN | Integration of existing datasets, Academic and research resources | Semi-automated | N/A | Data review | Vicuna-v1.5-13B, GPT-4 | 12K |
RS5M | Remote Sensing | Remote sensing | Pre-training | Image-text | 2024.01 | EN | Scientific databases | Automated | N/A | N/A | CLIP | 5.07M |
SkyEye-968k | Remote Sensing | Remote sensing | SFT | Image-text, Video-text, VQA | 2024.01 | EN | Integration of existing datasets | Semi-automated | N/A | Data review | N/A | 968K |
SkyScript | Remote Sensing | Remote sensing | Pre-training | Image-text | 2023.12 | EN | Academic and research resources, Scientific databases | Automated | N/A | N/A | CLIP, Logistic Regression model | 2.6M |
RSICap | Remote Sensing | Remote sensing | SFT | Image-text | 2023.07 | EN | Academic and research resources | Manual | 5 | Data generation, Data review | N/A | 2.5K |
Dataset | Domain | Modality | Purpose | Type | Release | Language | Source | Annotation Pipeline | Human Annotators | Human Tasks | Auto-annotation Tools | Size |
---|---|---|---|---|---|---|---|---|---|---|---|---|
NaturalReasoning | Multidisciplinary (incl. Physics) | Text | SFT | Text QA with CoT | 2025.02 | EN | Web and Internet content, Books and literary works, Academic and research resources | Semi-automated | N/A | Data review | LLaMA-70B | 2.8M |
Nemotron-Science | Multidisciplinary (incl. Physics) | Text with formulae and code | SFT, RLHF | Text QA with CoT | 2025.05 | EN | Social media and forums, Academic and research resources, Books and literary works | Semi-automated | N/A | Data review | DeepSeek-R1 | 2.7M |
Galactica | Multidisciplinary (incl. Chemistry) | Text (incl. formulas, code) | Pre-training | Raw text | 2022.11 | EN | Webpages | Automated | N/A | Data generation and review | Custom crawlers, PDF parsers | 106B (tokens) |
SciBERT | Multidisciplinary (incl. Physics) | Academic papers | Pre-training | Raw text | 2019.09 | EN | Academic and research resources | Automated | N/A | Data generation | Crawlers, text processing tools | 3.3B (tokens) |
ArXivCap | Physics, Biology, etc. | Paper figures | Pre-training | Image-text | 2024.05 | EN | Academic and research resources | Semi-automated | 7 | Data review | PDF parsers | 6.4M |
SCP-116K | Physics, Chemistry, Biology, etc. | Text with formulae | SFT | Text QA with CoT | 2025.01 | EN | Academic and research resources, Books and literary works | Semi-automated | N/A | Data review | PDF parsers, OCR, LaTeX rendering | 116.8K |
MegaScience | Medicine, Physics, Chemistry, Biology | Science textbooks | SFT | Text QA with CoT | 2024.08 | EN | Web and Internet content, Books and literary works, Integration of existing datasets | Semi-automated | N/A | Data review | Llama3.3-70B-Instruct, DeepSeek-V3, BGE-large-en-v1.5 | 651,840 |
Dataset | Domain | Modality | Level | Type | Release | Language | Source | Annotation Pipeline | Human Annotators | Human Tasks | Auto-annotation Tools | Size | Evaluation Type | Metrics |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
SeedBench | Agriculture | Breeding literature | Expert | Text QA | 2025.05 | EN, ZH | Academic and research resources | Semi-automated | N/A | Data generation and review | GPT-4 | 2,264 | MCQ, Open-ended | Acc, F1, ROUGE |
AgEval | Agriculture | Plant stress phenotyping photos and annotations | Expert | VQA | 2025.01 | EN | Scientific databases | N/A | N/A | N/A | N/A | 1,200 | Classification, Regression | F1, NMAE |
AgXQA | Agriculture | Agricultural extension records | Expert | Text QA | 2024.10 | EN | Academic and research resources | Semi-automated | N/A | N/A | N/A | 2,186 | Open-ended | EM, F1 |
Fundus-MMBench | Healthcare and Medical Sciences | CFP | Expert | VQA | 2025.07 | EN | Integration of existing datasets | Manual | N/A | Data review | N/A | 620 | MCQ | Acc |
ReXVQA | Healthcare and Medical Sciences | X-ray | N/A | VQA | 2025.06 | EN | Integration of existing datasets | Semi-automated | 3 | Data review | GPT-4o, ClinicalBERT, MedEmbed | 40,557 | MCQ | Acc |
HealthBench | Healthcare and Medical Sciences | Clinical dialogue, Medical task requests, Medical record summarization, etc. | Expert | Text QA | 2025.05 | EN | Comprehensive multi-source integration | Semi-automated | 262 | Data generation and review | GPT-o1, GPT-4.1 | 5,000 | Open-ended | Customized rubric criterion |
MedAlpaca | Healthcare and Medical Sciences | Biomedical knowledge base | Expert | Text QA | 2025.03 | EN | Web and Internet content | Semi-automated | N/A | Data review | GPT-3.5-Turbo | 374 | MCQ | Acc |
GEMeX-VQA | Healthcare and Medical Sciences | X-ray | N/A | VQA | 2025.03 | EN | Integration of existing datasets | Semi-automated | N/A | Data review | OpenBioLLM-70B, GPT-4o | 3,960 | MCQ, True/False, Open-ended | Acc |
MIMIC-Diff-VQA | Healthcare and Medical Sciences | X-ray | Expert | VQA (multi-image) | 2025.02 | EN | Scientific databases | Semi-automated | 3 | Data generation and review | ScispaCy | 70,070 | MCQ, Open-ended | BLEU, METEOR, ROUGE-L, CIDEr |
MedAgentBench | Healthcare and Medical Sciences | EHR, Lab results, Diagnosis codes, Medication orders | Expert | Text QA | 2025.01 | EN | Academic and research resources | Manual | 2 | Data generation and review | N/A | 300 | Open-ended | Success rate |
MedXpertQA | Healthcare and Medical Sciences | CT, ECG, Histopathology, MRI, US, X-ray, etc. | Expert | VQA, Text QA | 2025.01 | EN | Academic and research resources | Semi-automated | N/A | Data generation and review | GPT-4o, Claude | 4,460 | MCQ | Acc |
OpenMM-Medical | Healthcare and Medical Sciences | CT, Dermatology, Endoscopy, CFP, MRI, Microscopy, X-ray, etc. | N/A | VQA | 2025.01 | EN, ZH | Comprehensive multi-source integration | Semi-automated | N/A | Data review | GPT-4o | 88,996 | MCQ | Acc |
Asclepius | Healthcare and Medical Sciences | CT, Dermatology, CFP, Histopathology, MRI, Microscopy, OCT, X-ray, etc. | N/A | VQA, Image-text | 2024.11 | EN | Comprehensive multi-source integration | Semi-automated | 34 | Data generation and review | ChatGPT, GPT-4V, GPT-4o | 3,232 | MCQ | Acc |
ClinicalBench | Healthcare and Medical Sciences | EHR | N/A | Text QA | 2024.11 | EN | Integration of existing datasets | N/A | N/A | N/A | N/A | N/A | MCQ | F1, AUROC |
WorldMedQA-V | Healthcare and Medical Sciences | Dermatology, Microscopy, X-ray, etc. | N/A | VQA | 2024.10 | EN, JA, ES, HE, PT | Academic and research resources | Semi-automated | N/A | Data review | GPT-4o, Gemini 1.5 Flash, Yi-VL-34B | 568 | MCQ | Acc |
CRAFT-BioQA | Healthcare and Medical Sciences | Biomedical QA | N/A | Text QA | 2024.09 | EN | Academic and research resources | Automated | N/A | N/A | N/A | N/A | MCQ | Acc |
MedTrinity-25M | Healthcare and Medical Sciences | CT, MRI, X-ray, Histopathology, etc. | Expert | Image-text, VQA | 2024.08 | EN | Integration of existing datasets, Scientific databases | Automated | N/A | N/A | N/A | 100,000 | Open-ended | Acc |
GMAI-MMBench | Healthcare and Medical Sciences | CT, Dermatology, Endoscopy, CFP, Histopathology, MRI, Microscopy, OCT, PET, US, X-ray, etc. | N/A | VQA | 2024.08 | EN | Comprehensive multi-source integration | Semi-automated | N/A | Data review | GPT-4o | 26K | MCQ | Acc |
SlideBench | Healthcare and Medical Sciences | Histopathology | N/A | VQA | 2024.11 | EN | Scientific databases | Semi-automated | N/A | Data generation and review | GPT-4o | 16K | MCQ, Open-ended | Acc, BLEU |
Bio-ML | Healthcare and Medical Sciences | Ontology data | Expert | Text QA | 2024.07 | EN | Encyclopedias and knowledge bases | Semi-automated | N/A | Data generation and review | N/A | 25,270 | Retrieval | F1 |
MedBench | Healthcare and Medical Sciences | Diagnosis report, Clinical dialogue, EHR | Expert | Text QA | 2024.06 | ZH | Integration of existing datasets | Manual | N/A | Data generation | N/A | 300,901 | MCQ, Open-ended | BLEU, ROUGE-L, F1, Acc |
ClinicalLab | Healthcare and Medical Sciences | Clinical notes | Expert | Text QA | 2024.06 | EN, ZH | Other sources | Manual | N/A | Data generation and review | GPT-4 | 1,500 | Open-ended | DWR, DIFR, CDR, Acceptability, Acc, BLEU, ROUGE, BERTScore |
AgentClinic-NEJM | Healthcare and Medical Sciences | Clinical dialogue, Diagnosis report, CT, Dermatology, Histopathology, etc. | Expert | VQA | 2024.05 | EN | Academic and research resources, Comprehensive multi-source integration | Automated | N/A | N/A | N/A | 120 | Open-ended | Acc, Patient compliance, Consultation ratings |
AgentClinic-Lang | Healthcare and Medical Sciences | Medical exams | Expert | Text QA | 2024.05 | EN, ES, FA, FR, HI, KO, ZH | Academic and research resources, Comprehensive multi-source integration | Semi-automated | N/A | Data review | GPT-4 | 749 | Open-ended | Acc, Patient compliance, Consultation ratings |
AgentClinic-MedQA | Healthcare and Medical Sciences | Medical exams | Expert | Text QA | 2024.05 | EN | Academic and research resources, Comprehensive multi-source integration | Semi-automated | N/A | Data review | GPT-4 | 215 | Open-ended | Acc, Patient compliance, Consultation ratings |
AgentClinic-MIMIC-IV | Healthcare and Medical Sciences | EHR | Expert | Text QA | 2024.05 | EN | Scientific databases | Semi-automated | N/A | Data review | GPT-4 | 200 | Open-ended | Acc, Patient compliance, Consultation ratings |
AgentClinic-Spec | Healthcare and Medical Sciences | Medical exams | Expert | Text QA | 2024.05 | EN | Integration of existing datasets | Semi-automated | N/A | N/A | GPT-4 | 260 | Open-ended | Acc, Patient compliance, Consultation ratings |
M3D-Bench | Healthcare and Medical Sciences | CT, Clinical reports | Expert | Image-text, Text QA, VQA | 2024.04 | EN | Scientific databases, Integration of existing datasets | Semi-automated | N/A | Data generation and review | GPT-4V | 1,235 | MCQ, Open-ended, Retrieval | Acc, BLEU, ROUGE |
AMOS-MM | Healthcare and Medical Sciences | CT | Expert | Image-text, VQA | 2024.04 | EN, ZH | Integration of existing datasets, Scientific databases | N/A | N/A | N/A | N/A | 2,300 | Open-ended, MCQ | Acc |
CMtMedQA | Healthcare and Medical Sciences | Clinical dialogue | Expert | Text QA | 2024.03 | ZH | Books and literary works | Semi-automated | 6 | Data review | GPT-3.5, CMeKG, RLHF-Label-Tool | 70k | Open-ended | GPT-4 score |
Medbullets | Healthcare and Medical Sciences | Medical exams | N/A | Text QA | 2024.02 | EN | Social media and forums | Automated | N/A | Data review | N/A | 618 | MCQ | ROUGE-L, BERTScore, CTC, G-Eval, BARTScore+ |
RareBench | Healthcare and Medical Sciences | EHR, Medical history, Lab tests | Expert | Text QA | 2024.02 | EN, ZH | Scientific databases, Academic and research resources, Other sources | Manual | N/A | Data generation and review | N/A | 2,185 | Open-ended | Precision, Recall, F1, Median Rank, etc. |
OmniMedVQA | Healthcare and Medical Sciences | CT, Dermatology, Endoscopy, CFP, Histopathology, MRI, Microscopy, OCT, PET, US, X-ray, etc. | N/A | VQA | 2024.02 | EN | Comprehensive multi-source integration | Semi-automated | N/A | Data review | GPT-4 | 127,995 | MCQ | Acc |
MultiMedEval | Healthcare and Medical Sciences | CT, Dermatology, CFP, Histopathology, MRI, Microscopy, OCT, US, X-ray, etc. | N/A | VQA | 2024.02 | EN | Integration of existing datasets | Semi-automated | N/A | Data review | CheXbert, GPT, RadGraph | 60k | MCQ | Acc |
Fhirfly Medical Questions | Healthcare and Medical Sciences | Biomedical QA | N/A | Text QA | 2024.01 | EN | Academic and research resources | Semi-automated | N/A | Data review | N/A | 25,102 | True/False | Acc |
RP3D-DiagDS | Healthcare and Medical Sciences | CT, MRI, X-ray, US, Fluoroscopy, etc. | Expert | Classification | 2023.12 | EN | Scientific databases | Semi-automated | N/A | Data generation and review | Custom crawlers, GPT-4 | 40,936 | True/False | AUROC, AP |
NEJM-AI Benchmarking | Healthcare and Medical Sciences | Medical exams | Expert | Text QA | 2023.11 | EN | Academic and research resources | Automated | N/A | N/A | NLTK, Regex | 858 | MCQ | Acc, BLEU, WER, Cosine |
MORFITT | Healthcare and Medical Sciences | Clinical papers | Expert | Classification | 2023.11 | FR | Academic and research resources | Manual | N/A | Data review | N/A | 1,560 | Classification | Precision, Recall, F1 |
SourceData | Healthcare and Medical Sciences | Gene/protein entities | Expert | Raw text | 2023.10 | EN | Academic and research resources | Manual | N/A | Data generation and review | N/A | 620,000 | NER | Precision, Recall, F1 |
SDOH-NLI | Healthcare and Medical Sciences | Clinical notes | Expert | Classification | 2023.10 | EN | Integration of existing datasets | Manual | N/A | Data generation | N/A | 4.21k | Classification | Precision, Recall, F1 |
HealthsearchQA | Healthcare and Medical Sciences | Consumer health QA | Expert | Text QA | 2023.08 | EN | Web and Internet content | Semi-automated | N/A | Data review | N/A | 3,173 | Open-ended | Factuality, Comprehension, Reasoning, Possible harm and bias |
CMB-Exam | Healthcare and Medical Sciences | Medical exams | Expert | Text QA | 2023.08 | ZH | Web and Internet content | Semi-automated | N/A | Data review | N/A | 280,839 | MCQ | Acc |
CMB-Clin | Healthcare and Medical Sciences | Medical exams | Expert | Text QA | 2023.08 | ZH | Books and literary works | Semi-automated | N/A | Data review | N/A | 208 | Open-ended | Fluency, Relevance, Completeness, Proficiency |
MultiMedBench | Healthcare and Medical Sciences | CT, Dermatology, Histopathology, Microscopy, MRI, X-ray, etc. | N/A | Text QA, VQA | 2023.07 | Mixed | Integration of existing datasets | N/A | N/A | N/A | N/A | 1M | N/A | Acc, ROUGE-L, BLEU, F1-RadGraph, F1 |
GPT-4 BiasBenchmark | Healthcare and Medical Sciences | Clinical trials | Expert | Text QA | 2023.07 | EN | Academic and research resources | Semi-automated | N/A | Data generation and review | GPT-4 | 213 | Open-ended | Acc |
Lavita Medical QA | Healthcare and Medical Sciences | Clinical guidelines | N/A | Text QA | 2023.07 | EN | Academic and research resources | Automated | N/A | N/A | N/A | 11,500 | MCQ | Acc |
BioASQ10b-factoid | Healthcare and Medical Sciences | Clinical dialogue, PubMed snippets | Expert | Text QA | 2023.07 | EN | Scientific databases, Academic and research resources | Manual | N/A | Data generation and review | N/A | 166 | Open-ended | Acc, MRR |
MedNERF | Healthcare and Medical Sciences | Drug Prescription | Expert | Classification | 2023.06 | FR | Other sources | Manual | N/A | Data generation and review | N/A | 100 | NER | F1 |
WikiMedQA | Healthcare and Medical Sciences | Clinical reports | Expert | Text QA | 2023.03 | EN | Web and Internet content | Semi-automated | N/A | N/A | SentenceBERT, BioLinkBERT | 5,893 | MCQ | Acc |
BioASQ | Healthcare and Medical Sciences | Biomedical Documents | Expert | Text QA | 2022.12 | EN | Academic and research resources | Manual | 21 | Data generation and review | N/A | 4,721 | Open-ended | Acc |
BioRED | Healthcare and Medical Sciences | Biomedical papers | N/A | Classification | 2022.09 | EN | Scientific databases | Semi-automated | 6 | Data generation and review | PubTator | 100 | NER | Precision, Recall, F1 |
BioLeaflets | Healthcare and Medical Sciences | Package leaflets | Expert | Raw text | 2021.09 | EN | Web and Internet content | Semi-automated | N/A | Data generation | Stanza, Amazon Comprehend Medical | 134 | Generation | SacreBLEU, ROUGE-L, BERTScore, BLEURT, MoverScore-21 |
CBLUE | Healthcare and Medical Sciences | Clinical trials, EHR, Medical forum, Medical textbooks | N/A | Classification, Text QA | 2021.06 | ZH | Comprehensive multi-source integration | Manual | 3 | Data generation and review | N/A | 46,729 | NER, Open-ended, Retrieval | Acc, F1 |
SLAKE | Healthcare and Medical Sciences | CT, MRI, X-ray | N/A | VQA | 2021.02 | EN, ZH | Academic and research resources | Automated | N/A | N/A | N/A | 2,070 | MCQ, Open-ended | Acc |
MEDIQA-AnS | Healthcare and Medical Sciences | Consumer health QA | Undergraduate | Text QA | 2020.09 | EN | Web and Internet content | Manual | 2 | Data generation | N/A | 708 | Open-ended | ROUGE, BLEU |
RadVisDial (G) | Healthcare and Medical Sciences | X-ray | N/A | VQA | 2020.07 | EN | Integration of existing datasets | Semi-automated | 2 | Data generation | NegBio, CheXpert | 91k | MCQ | Acc |
CORD-19 | Healthcare and Medical Sciences | Academic papers | N/A | Text QA | 2020.03 | EN | Academic and research resources | N/A | N/A | N/A | N/A | 280K | Retrieval, QA | MRR, Acc |
PathVQA | Healthcare and Medical Sciences | Histopathology | Expert | VQA | 2020.03 | EN | Academic and research resources | Automated | N/A | N/A | CoreNLP | 6,012 | MCQ, Open-ended | BLEU, Exact-match, F1 |
MedQuAD | Healthcare and Medical Sciences | Patient educational materials | Undergraduate | Text QA | 2019.11 | EN | Web and Internet content | Semi-automated | 2 | Data generation | MetaMap Lite, UMLS lookup | 47,457 | Open-ended | Acc, F1, MRR |
Pubmed Causal | Healthcare and Medical Sciences | Biomedical papers | N/A | Classification | 2019.11 | EN | Scientific databases | Manual | N/A | Data generation | N/A | 2,446 | Classification | Acc, F1 |
PubMedQA instruction | Healthcare and Medical Sciences | Clinical dialogue | Expert | Text QA | 2019.09 | EN | Academic and research resources | Manual | N/A | Data generation | N/A | 273k | Classification | Acc |
VQA-RAD | Healthcare and Medical Sciences | CT, MRI, PET, US, X-ray | N/A | VQA | 2018.11 | EN | Academic and research resources | Manual | N/A | Data generation | N/A | 451 | MCQ, Open-ended | Acc, BLEU |
Pima | Healthcare and Medical Sciences | EHR | Expert | Classification | 1988.11 | EN | Scientific databases | Manual | N/A | N/A | N/A | 77 | Classification | AUROC |
TOMG-Bench | Molecular and Cellular Biology | Molecule | Expert | Text QA | 2024.12 | EN | Scientific databases | Automated | N/A | N/A | N/A | 45,000 | Open-ended | Success Rate, Similarity, Novelty, Validity |
MoleculeQA | Molecular and Cellular Biology | Molecule | Expert | Text QA | 2024.11 | EN | Scientific databases | Manual | 2 | Data generation and review | N/A | 62,000 | MCQ | Acc |
BEACON | Molecular and Cellular Biology | RNA sequence | N/A | Raw text | 2024.06 | EN | Comprehensive multi-source integration | Semi-automated | N/A | Data generation and review | N/A | 96,283 | Classification, Regression | F1, AUROC, Precision, R², MSE, PCC |
GeneHop | Molecular and Cellular Biology | Multi-hop genomic QA | N/A | Text QA | 2023.04 | EN | Academic and research resources | Manual | N/A | Data generation and review | N/A | 150 | Open-ended | Acc |
PCdes | Molecular and Cellular Biology | SMILES | N/A | Text QA | 2022.12 | EN | Academic and research resources | Automated | N/A | N/A | Custom crawlers | 3,000 | Retrieval | Acc, Recall |
PEER | Molecular and Cellular Biology | Protein sequence | Expert | Classification, Regression | 2022.10 | EN | Comprehensive multi-source integration | Semi-automated | N/A | Data generation and review | N/A | 115,281 | Classification, Regression | Acc, RMSE, Precision, PCC |
BioPreDyn-bench | Molecular and Cellular Biology | Time-series (simulation data) | Expert | Regression | 2015.02 | EN | Academic and research resources | N/A | N/A | N/A | 6 | 6 | Open-ended | NRMSE |
MicroVQA | Molecular and Cellular Biology, Healthcare and Medical Sciences | Microscopy | Expert | VQA | 2025.03 | EN | Academic and research resources | Semi-automated | 12 | Data generation and review | GPT-4o | 1,042 | MCQ | Acc |
DISEASES | Molecular and Cellular Biology, Healthcare and Medical Sciences, Multi-omics | Disease-gene associations | Expert | Classification | 2015.01 | EN | Academic and research resources, Integration of existing datasets, Scientific databases | Semi-automated | N/A | Data generation and review | NER tagger | 8,336,442 | Open-ended, True/False, Retrieval | Precision, Recall, F1, AUROC, AUPRC |
LAB-Bench | Molecular and Cellular Biology, Multi-omics, Neuroscience | Research problems | N/A | Text QA | 2024.07 | EN | Academic and research resources | N/A | N/A | N/A | N/A | 2,457 | MCQ | Acc |
BioProBench | Molecular and Cellular Biology, Multi-omics, Pharmacy, Neuroscience, etc. | Protocol | N/A | Text QA | 2025.05 | EN | Academic and research resources | Semi-automated | N/A | Data review | N/A | 556,171 | Open-ended | Acc, F1, EM, BLEU |
Genome-Bench | Multi-omics | Research problems | N/A | Text QA | 2025.06 | EN | Academic and research resources | N/A | N/A | Data generation and review | GPT-4 Turbo, GPT-4o | 3,332 | MCQ | Acc |
GeneChat-test | Multi-omics | Nucleotide sequence | N/A | Text QA | 2025.06 | EN | Scientific databases | N/A | N/A | Data generation | N/A | N/A | Open-ended | BLEU, METEOR |
GeneChat | Multi-omics | Nucleotide sequence | N/A | Text QA | 2025.06 | EN | Scientific databases | N/A | N/A | Data generation | N/A | 2,973 | Open-ended | BLEU, METEOR |
Genomics instructions | Multi-omics | Nucleotide sequence | N/A | Text QA | 2025.04 | EN | Academic and research resources | N/A | N/A | Data generation | N/A | 403,814 | Classification, Regression | F1, MCC, AUROC, PCC |
BixBench | Multi-omics | Genomics transcriptomics text | Expert | Text QA | 2025.03 | EN | Academic and research resources | Semi-automated | 53 | Data generation and review | Claude 3.5 Sonnet | 296 | Open-ended, MCQ | Acc |
Seq2Func | Multi-omics | Nucleotide sequence | N/A | Text QA | 2025.02 | EN | Scientific databases | Automated | N/A | Data generation | N/A | 33,000 | MCQ | MCC, F1 |
DNA2Image | Multi-omics | Nucleotide sequence | N/A | Generation | 2025.02 | EN | Scientific databases | Automated | N/A | Data generation | N/A | 4,800 | Generation | Invalid percentage, F1 |
DNA Long Bench | Multi-omics | DNA sequence | N/A | Classification, Regression | 2025.01 | EN | Scientific databases, Academic and research resources | Automated | N/A | N/A | N/A | 213,416 | Classification, Regression | SCC, PCC, AUROC |
LLaMA-Gene (protein) | Multi-omics | Protein sequence | N/A | Text QA | 2024.12 | EN | Scientific databases | N/A | N/A | Data generation | N/A | 6,991 | Open-ended | Acc |
LLaMA-Gene (DNA) | Multi-omics | DNA sequence | N/A | Text QA | 2024.12 | EN | Scientific databases | N/A | N/A | Data generation | N/A | 19,839 | Open-ended | Acc |
NT Benchmark | Multi-omics | Nucleotide sequence | N/A | Classification | 2024.10 | EN | Academic and research resources | N/A | N/A | N/A | N/A | 38,822 | MCQ | MCC |
BioinformaticsBench | Multi-omics | Textbook | Undergraduate | Text QA | 2024.06 | EN | Books and literary works, Academic and research resources | Semi-automated | 4 | N/A | GPT-3.5, GPT-4, GPT-4 Turbo | 602 | MCQ, True/False, Open-ended | Acc |
genomics-long-range-benchmark | Multi-omics | Nucleotide sequence | N/A | Classification, Regression | 2024.05 | EN | Academic and research resources | N/A | N/A | N/A | N/A | N/A | Classification, Regression | MCC |
RNA-QA | Multi-omics | RNA sequence | N/A | Text QA | 2024.05 | EN | Academic and research resources | Automated | N/A | N/A | N/A | 121K | Open-ended | Precision, Recall, F1, ROUGE |
BioinfoBench | Multi-omics | RNA sequence | Undergraduate | Text QA | 2023.10 | EN | Other sources | Semi-automated | N/A | Data review | ChatGPT | 200 | MCQ | Acc, Perplexity, Next-token likelihood |
BioCoder | Multi-omics | Codes | Undergraduate | Text QA | 2023.08 | EN | Integration of existing datasets, Academic and research resources | Automated | N/A | N/A | N/A | 2,522 | Open-ended | Acc |
SpeciesClassification | Multi-omics | Nucleotide sequence | N/A | Classification | 2023.06 | EN | Academic and research resources | N/A | N/A | N/A | N/A | 5 species genomes | MCQ | Acc |
GUE Benchmark | Multi-omics | Nucleotide sequence | N/A | Classification | 2023.06 | EN | Academic and research resources | N/A | N/A | N/A | N/A | 80,648 | MCQ | MCC, F1 |
Genomic Benchmark | Multi-omics | Nucleotide sequence | N/A | Classification | 2023.05 | EN | Academic and research resources | N/A | N/A | N/A | N/A | 191,589 | MCQ | Acc, F1 |
GeneTuring | Multi-omics | Biomedical knowledge base | N/A | Text QA | 2023.03 | EN | Academic and research resources | N/A | N/A | Data generation | GPT-2, BioGPT, BioMedLM, GPT-3, ChatGPT, New Bing | 600 | MCQ | Acc |
Human Pancreas | Multi-omics | scRNA-seq | N/A | Classification | 2023.01 | EN | Academic and research resources | N/A | N/A | N/A | N/A | 4,218 | Classification | Acc, Precision, Recall, F1 |
Myeloid | Multi-omics | scRNA-seq | N/A | Classification | 2021.02 | EN | Academic and research resources | N/A | N/A | N/A | N/A | 3,430 | Classification | Acc, Precision, Recall, F1 |
Human Cell Atlas Dataset | Multi-omics | scRNA-seq | N/A | Classification | 2021.02 | EN | Academic and research resources | N/A | N/A | N/A | N/A | 84,363 | Classification | Acc, F1 |
Human enhancers Ensembl | Multi-omics | Nucleotide sequence | N/A | Classification | 2021.01 | EN | Academic and research resources | N/A | N/A | N/A | N/A | 154,842 | MCQ | MCC |
Human regulatory Ensembl | Multi-omics | Nucleotide sequence | N/A | Classification | 2021.01 | EN | Academic and research resources | N/A | N/A | N/A | N/A | 289,061 | MCQ | MCC |
Human ocr Ensembl | Multi-omics | Nucleotide sequence | N/A | Classification | 2021.01 | EN | Academic and research resources | N/A | N/A | N/A | N/A | 174,756 | MCQ | MCC |
Multiple Sclerosis | Multi-omics | scRNA-seq | N/A | Classification | 2019.07 | EN | Academic and research resources | N/A | N/A | N/A | N/A | 13,468 | Classification | Acc, Precision, Recall, F1 |
APARENT | Multi-omics | Nucleotide sequence | N/A | Regression | 2019.06 | EN | Academic and research resources | N/A | N/A | N/A | N/A | 8,000 | Regression | R² |
Human enhancers Cohn | Multi-omics | Nucleotide sequence | N/A | Classification | 2018.02 | EN | Academic and research resources | N/A | N/A | N/A | N/A | 27,791 | MCQ | MCC |
Human non-TATA promoters | Multi-omics | Nucleotide sequence | N/A | Classification | 2017.02 | EN | Academic and research resources | N/A | N/A | N/A | N/A | 36,131 | MCQ | MCC |
Zheng68k | Multi-omics | scRNA-seq | N/A | Classification | 2016.07 | EN | Academic and research resources | N/A | N/A | N/A | N/A | 68,450 | Classification | Acc, F1 |
Drosophila enhancers Stark | Multi-omics | Nucleotide sequence | N/A | Classification | 2014.06 | EN | Academic and research resources | N/A | N/A | N/A | N/A | 6,914 | MCQ | MCC |
COMET | Multi-omics | DNA, RNA, Protein Sequence/Residue | N/A | Classification, Regression | 2024.12 | EN | Academic and research resources | N/A | N/A | N/A | N/A | 1.22M | Classification, Regression | R², PCC, F1, SCC |
AdaBrain-Bench | Neuroscience | EEG | N/A | Classification | 2025.07 | EN | Integration of existing datasets | N/A | N/A | N/A | N/A | N/A | Open-ended | Acc, AUROC, AUPRC, F1, PCC, R² |
FDA Pharmaceuticals FAQ | Pharmacy | FAQ-style text | Expert | Text QA | 2023.03 | EN | Web and Internet content | Automated | N/A | N/A | N/A | 1,681 | MCQ | Acc |
repoDB | Pharmacy, Healthcare and Medical Sciences | Drug-disease relationships, Clinical trial outcomes | Expert | Classification, Text QA | 2017.03 | EN | Scientific databases | Automated | N/A | N/A | scripts | 15,648 | MCQ, Retrieval | AUROC, AUPRC, Acc |
Dataset | Domain | Modality | Level | Type | Release | Language | Source | Annotation Pipeline | Human Annotators | Human Tasks | Auto-annotation Tools | Size | Evaluation Type | Metrics |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
OmniGenBench | Biochemistry, Multi-omics | DNA sequence, RNA sequence, TF binding, etc. | N/A | Classification | 2025.05 | N/A | Integration of existing datasets, Academic and research resources | N/A | N/A | N/A | N/A | N/A | N/A | AUROC, F1, RMSE, R² |
MOSES | Biochemistry | SMILES | Expert | Raw text | 2020.07 | EN | Academic and research resources | Manual | N/A | Data review | N/A | 1,936,962 | Generation | Chemical validity, Drug-likeness |
ChEMBL | Biochemistry | SMILES | Expert | Raw text | 2012.01 | EN | Academic and research resources | Manual | N/A | Data generation and review | N/A | 1.96M | Generation | Chemical validity, Drug-likeness |
ChemRxivQuest | General Chemistry | Academic papers | Expert | Text QA | 2025.05 | EN | Academic and research resources | Manual | N/A | Data generation and review | N/A | 970 | Open-ended | Acc |
ScholarChemQA | General Chemistry | Academic papers | Expert | Text QA | 2025.02 | EN | Academic and research resources | Manual | N/A | Data generation and review | N/A | 40K | MCQ | Acc |
ChemSafetyBench | General Chemistry | Text | Expert | Raw text | 2024.11 | EN | Academic and research resources | Automated | N/A | Data generation and review | N/A | 30K+ | Open-ended | Acc, Recall, Precision, F1, safety/quality score, etc. |
ChemEval | General Chemistry | Text | Expert | Raw text | 2024.09 | EN | Academic and research resources | Automated | N/A | Data generation and review | N/A | unknown (42 tasks) | Open-ended | Acc, BLEU-2, F1, etc. |
ChemNLP | General Chemistry | Text | Secondary school | Text QA | 2023.01 | EN | Academic and research resources | Manual | N/A | Data generation and review | N/A | 27.6K | Classification, NER, Generation | Acc, ROUGE |
ZINC | General Chemistry | SMILES | Expert | Raw text | 2012.10 | EN | Academic and research resources | Manual | N/A | Data review | N/A | 250K | Generation | Chemical validity, Drug-likeness |
PMO | General Chemistry | SMILES | Expert | Raw text | 2022.05 | EN | Academic and research resources | Automated | N/A | Data generation and review | N/A | 10K | Open-ended | target property, Chemical validity, Drug-likeness |
DeepProtein | Pharmacy | Protein sequence, SMILES | Expert | Raw text | 2025.05 | EN | Academic and research resources | Automated | N/A | Data generation and review | N/A | 78K | Classification, Regression | Acc, MAE, F1, AUPRC, AUROC, R², etc. |
TrialBench | Pharmacy | SMILES, Disease code | Expert | Raw text | 2024.09 | EN | Academic and research resources | Automated | N/A | Data generation and review | N/A | 470K | Open-ended | F1, Recall, Precision, MSE, etc. |
TDC2 | Pharmacy | SMILES, Protein sequence, Genome sequence | Expert | Raw text | 2024.09 | EN | Academic and research resources | Manual | N/A | Data generation and review | N/A | 3.4B tokens | Open-ended | F1, Recall, Precision, MSE, etc. |
PCQM4Mv2 | Pharmacy | Molecular graph | N/A | Regression | 2022.11 | EN | Academic and research resources | Automated | N/A | N/A | N/A | 3,746,619 | Regression | MAE |
SBDDBench | Pharmacy | Protein sequence, SMILES | Expert | Protein-ligand | 2022.06 | EN | Academic and research resources | Automated | N/A | Data generation and review | N/A | 5K | Open-ended | binding affinity, Chemical validity, Drug-likeness |
GEOM | Pharmacy | 3D conformation | N/A | Regression | 2022.04 | EN | Academic and research resources | Automated | N/A | N/A | CREST (GFN2-xTB) | 37M conformations | Regression | MAE, RMSD |
TOP | Pharmacy | SMILES | Expert | Raw text | 2022.02 | EN | Academic and research resources | Automated | N/A | Data generation and review | N/A | 12K | Open-ended | F1, Recall, AUPRC, etc. |
TDC | Pharmacy | SMILES, Protein sequence, Genome sequence | Expert | Raw text | 2021.06 | EN | Academic and research resources | Manual | N/A | Data generation and review | N/A | 0.2B tokens | Open-ended | F1, Recall, Precision, MSE, etc. |
DeepPurpose | Pharmacy | Protein sequence, SMILES | Expert | Raw text | 2020.12 | EN | Academic and research resources | Automated | N/A | Data generation and review | N/A | 5,074 | Classification, Regression | MSE, PCC, F1, AUROC, AUPRC, etc. |
DrugBank | Pharmacy | SMILES | Expert | Raw text | 2018.01 | EN | Academic and research resources | Manual | N/A | Data generation and review | N/A | 17.47K | Open-ended | Acc |
USPTO | Synthetic Chemistry | SMILES | Expert | Raw text | 2015.07 | EN | Patent databases | Manual | N/A | Data generation and review | N/A | 1,939,253 | Open-ended | Acc, F1, MSE, etc. |
Dataset | Domain | Modality | Level | Type | Release | Language | Source | Annotation Pipeline | Human Annotators | Human Tasks | Auto-annotation Tools | Size | Evaluation Type | Metrics |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
FSReD | General Physics | Text | Expert | Text QA | 2019.05 | EN | Comprehensive multi-source integration | Automated | N/A | Data review | N/A | 120 | Regression | MSE, Exact Match |
PIQA | General Physics | Text | Primary school | Text QA | 2020.01 | EN | Other sources | Semi-automated | N/A | Data generation and review | N/A | 2,000 | MCQ | Acc |
SRBench | General Physics | Text | N/A | Text QA | 2021.07 | EN | Comprehensive multi-source integration | Semi-automated | N/A | Data generation and review | N/A | 252 | Open-ended | Acc, Simplicity, Exact Match |
PROST | General Physics | Text | N/A | Text QA | 2021.08 | EN | Other sources | Semi-automated | 4 | Data generation | N/A | 18,736 | MCQ | Acc |
MM-PhyQA | Kinematics, Mechanics, Electrostatics and Current Electricity, Thermodynamics, Optics, Magnetism, etc. | Text | High school | VQA with CoT | 2024.04 | EN | Comprehensive multi-source integration | Semi-automated | 8+ | Data generation and review | ChatGPT | 675 | MCQ | Acc, ROUGE |
MVBench | General Physics | Video | N/A | Video QA | 2024.05 | EN | Comprehensive multi-source integration | Automated | 0 | Data review | N/A | 4,000 | MCQ | Acc |
UGPhysics | Mechanics, Thermodynamics, Electromagnetism, Modern Physics | Text (problem statements, equations, reasoning) | Undergraduate | Text QA | 2025.01 | EN, ZH | Academic and research resources | Semi-automated | N/A | Data generation and review | GPT-4o | 11,040 | MCQ, Open-ended, True/False, Retrieval | Acc |
PhysReason | Mechanics, Electromagnetism, Thermodynamics, etc. | Text (problem statements, equations), Diagrams (physics illustrations) | Undergraduate, Graduate, Expert | Text QA, VQA | 2025.02 | EN | Comprehensive multi-source integration | Semi-automated | 4 | Data generation and review | GPT-4 | 1,200 | MCQ | Acc |
TPBench | Cosmology, High Energy Theory, General Relativity, Astrophysics, Electromagnetism, Quantum Mechanics, Mechanics, etc. | Text | N/A | Text QA | 2025.02 | N/A | Other sources | Manual | N/A | Data generation and review | N/A | 57 | Open-ended | Acc, AI-based Holistic Grading |
PHYSICS | Mechanics, Electromagnetism, Thermodynamics, Optics, etc. | Text (problem statements, equations, reasoning), Diagrams (illustrations, charts, experimental setups) | Undergraduate | Text QA, VQA | 2025.03 | EN | Comprehensive multi-source integration | Manual | N/A | Data generation and review | N/A | 1,297 | Open-ended | Acc |
PhysicsArena | Mechanics, Electromagnetism, Thermodynamics, etc. | Text (problem statements, equations, reasoning), Diagrams (illustrations, charts, experimental setups) | Expert | Text QA, VQA | 2025.05 | EN | Comprehensive multi-source integration | Semi-automated | N/A | Data generation and review | N/A | 5,100 | Open-ended | Acc |
PHYBench | Mechanics, Electricity, Thermodynamics, Optics, Modern Physics, etc. | Research problems | Undergraduate | Text QA | 2025.05 | EN | Comprehensive multi-source integration | Semi-automated | 178 | Data generation and review | o1, DeepSeek-R1 | 500 | Open-ended | EED |
PhyX | Mechanics, Quantum Mechanics, Thermodynamics, Electromagnetism, Atomic Physics, etc. | Text (problem statements, equations, reasoning), Diagrams (illustrations, charts, experimental setups) | Undergraduate | Text QA, VQA | 2025.05 | EN | Comprehensive multi-source integration | Semi-automated | N/A | Data generation and review | GPT-4o | 3,000 | MCQ, Open-ended | Acc |
PhysUniBench | Mechanics, Electromagnetism, Optics, Atomic Physics, etc. | Text (problem statements, equations), Diagrams (physics illustrations) | Undergraduate | VQA | 2025.06 | EN, ZH | Comprehensive multi-source integration | Manual | N/A | Data generation and review | N/A | 3,304 | Open-ended, MCQ | Acc |
IntPhys 2 | General Physics | Video, Text (scene parameters, object categories, trajectories, physical attributes) | N/A | Video QA | 2025.06 | N/A | Other sources | Semi-automated | N/A | Data generation and review | N/A | 1,400 | Open-ended | Acc |
MVP-Bench | General Physics | Video | N/A | Video QA | 2025.06 | EN | Encyclopedias and knowledge bases | Semi-automated | N/A | Data generation and review | OpenAI CLIP (ViT-L/14) | 55,000 | Open-ended | Acc |
SeePhys | Mechanics, Electromagnetism, Particle Physics, Optics, Astrophysics, Thermodynamics, Quantum Mechanics, etc. | Text (problem statements, equations), Diagrams (physics illustrations) | Secondary school, Undergraduate, Graduate | VQA | 2025.07 | EN, ZH | Comprehensive multi-source integration | Semi-automated | N/A | Data generation and review | GPT-4o | 2,000 | Open-ended | Acc |
Dataset | Domain | Modality | Level | Type | Release | Language | Source | Annotation Pipeline | Human Annotators | Human Tasks | Auto-annotation Tools | Size | Evaluation Type | Metrics |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Astro-QA | Astronomy | Astronomy Olympiad competitions, Astronomy exams, Online encyclopedias | Undergraduate | Text QA | 2025.06 | Mixed | Comprehensive multi-source integration | Manual | 30+ | Data generation and review | N/A | 3,082 | Open-ended | DGscore, BLEU, ROUGE, chrF |
Astrovisbench | Astronomy | Galaxy images | Expert | VQA | 2025.06 | EN | Comprehensive multi-source integration | Semi-automated | 6 | Data review | GPT-4o, Claude 3.5 Sonnet | 432 | Open-ended | VIscore, Image error level, Expert evaluation |
AstroMLab 1 | Astronomy | Academic papers | Expert | Text QA | 2024.11 | EN | Academic and research resources | Automated | N/A | Data review | Gemini-1.5-Pro | 4,425 | MCQ | Acc |
AstroPT | Astronomy | Astronomical images | Expert | Image-text | 2024.05 | EN | Web and Internet content, Scientific databases | Automated | N/A | Data review | DESI Legacy Survey API | 8.6M | Classification | PCC, Acc |
Astro-NER | Astronomy | Academic papers | Expert | Text QA | 2024.05 | EN | Academic and research resources | Semi-automated | 4 | Data generation and review | GPT-3.5 | 5,000 | Open-ended | Precision, Recall, F1 |
AstroLLaMA | Astronomy | Academic papers | Expert | Text QA | 2023.09 | EN | Academic and research resources, Web and Internet content | Manual | N/A | Data review | N/A | 9.5M | Open-ended | Perplexity, Cosine similarity |
ATel | Astronomy | Academic papers | Expert | Text QA | 2023.05 | EN | Academic and research resources | Manual | N/A | Data review | N/A | 234 | Open-ended | Acc |
PhyE2Es | Astrophysics | Text with formulae | Expert | Raw text | 2025.03 | EN | Scientific databases | Automated | N/A | Data generation and review | OpenLLAMA-2-3B | 8,000 | Regression | Acc, Numerical precision, Formula complexity, Formula depth |
Pathfinder Dataset | Astrophysics | Academic papers, ADS | Expert | Text QA with CoT | 2024.11 | EN | Web and Internet content, Academic and research resources | Automated | 36+ | Data generation and review | text-embedding-3-small | 385,166 | Open-ended | Acc, MRR, Recall, NDCG, relevance score |
Starwhisper-pulsar | Astrophysics | Pulsar diagnostic plots, Pulsar signals | Expert | VQA | 2024.04 | EN | Integration of existing datasets | Manual | N/A | Data generation and review | DeepSeek-VL-7B, InternVL2-40B | 106,674 | Open-ended | Acc, Recall, Precision, F1, etc. |
PAPERCLIP | Astrophysics | Synthetic conversation, Astronomical images | Expert | Text QA | 2024.03 | EN | Academic and research resources, Scientific databases | Automated | N/A | Data generation and review | Mixtral-8x7B-Instruct | 31,859 | Open-ended | Acc |
Dataset | Domain | Modality | Level | Type | Release | Language | Source | Annotation Pipeline | Human Annotators | Human Tasks | Auto-annotation Tools | Size | Evaluation Type | Metrics |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
CheMatAgent | Materials Science | Scientific instruction | Expert | Text QA | 2025.06 | EN | Other sources | Manual | N/A | Data generation and review | N/A | 137 | Open-ended | Acc |
ChEBI-20-MM | Materials Science | InChI, IUPAC, SELFIES, Science QA, Molecular Image | Expert | Text QA | 2025.01 | EN | Integration of existing datasets | Manual | N/A | Data generation and review | N/A | 3,300 | Open-ended | BLEU, ROUGE, METEOR, CIDEr |
LLM4MatBench | Materials Science | CIF, Chemical composition, Numerical property | Expert | Text QA | 2024.10 | EN | Scientific databases | Manual | N/A | Data generation and review | N/A | 1.9M | Open-ended | Acc |
MatText | Materials Science | Chemical composition, Numerical property | Expert | Text QA | 2024.08 | EN | Scientific databases | Manual | N/A | Data generation and review | N/A | 2,000,000 | Open-ended | MAE, AUROC |
MatBookQA | Materials Science | Science QA | Expert | Text QA | 2024.05 | EN | Academic and research resources | Manual | N/A | Data generation and review | N/A | 650 | Open-ended | Acc |
MaSCQA | Materials Science | Science QA | Expert | Text QA | 2023.08 | EN | Academic and research resources | Manual | N/A | Data generation and review | N/A | 650 | Open-ended | Acc |
MatSci-NLP | Materials Science | Chemical composition, Numerical property | Expert | Text QA | 2023.05 | EN | Academic and research resources | Manual | N/A | Data generation and review | N/A | 169,197 | Open-ended | Acc, F1 |
ChEBI-20 | Materials Science | Scientific instruction | Expert | Text QA | 2021.11 | EN | Integration of existing datasets | Manual | N/A | Data generation and review | N/A | 3,301 | Open-ended | BLEU, ROUGE, METEOR, CIDEr |
MOSES | Materials Science | SMILES | Expert | Text QA | 2020.11 | EN | Integration of existing datasets | Manual | N/A | Data generation and review | N/A | 176,000 | Open-ended | Uniqueness, Validity, Frag, Scaff, SNN |
MatBench | Materials Science | CIF, Numerical property, Chemical composition | Expert | Text QA | 2020.09 | EN | Scientific databases | Manual | N/A | Data generation and review | N/A | 408,062 | Open-ended | MAE, AUROC |
GuacaMol | Materials Science | SMILES | Expert | Text QA | 2019.03 | EN | Integration of existing datasets | Manual | N/A | Data generation and review | N/A | 2M | Open-ended | Validity, Uniqueness, Novelty |
MoleculeNet | Materials Science | SMILES | Expert | Text QA | 2017.03 | EN | Integration of existing datasets | Manual | N/A | Data generation and review | N/A | 700,000 | Open-ended | AUROC, AUPRC, RMSE, MAE |
MaCBench | Materials Science, Chemistry | Science QA, AFM Image | Expert | VQA | 2024.11 | EN | Academic and research resources | Manual | N/A | Data generation and review | N/A | 628 | Open-ended | Acc |
MMSci | Materials Science, Chemistry | Science QA | Graduate | VQA | 2024.05 | EN | Academic and research resources | Manual | N/A | Data generation and review | N/A | 742,273 | Open-ended | Acc |
Dataset | Domain | Modality | Level | Type | Release | Language | Source | Annotation Pipeline | Human Annotators | Human Tasks | Auto-annotation Tools | Size | Evaluation Type | Metrics |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ClimaQA | Atmosphere | Textbooks | Expert | Text QA | 2025.03 | EN | Books and literary works | Semi-automated | | Data review | GPT-4o | 3,633 | MCQ, Open-ended | Acc, BLEU, etc. |
WeatherQA | Atmosphere | Remote sensing, Science QA | Expert | VQA (multi-image) | 2024.06 | EN | Scientific databases | Semi-automated | 4 | Data review | | 600 | MCQ, Open-ended | Acc, F1, BLEU, etc. |
ClimateBERT | Atmosphere | Corporate annual reports, Sustainability reports | Secondary school | Text QA | 2022.12 | EN | Web and Internet content | Manual | 4+ | Data review | Prodigy | 320 | MCQ | Acc |
OceanBench | Hydrosphere | Academic papers | Expert | Text QA | 2024.09 | EN | Academic and research resources | Automated | 10+ | Data review | GPT-4, GPT-3.5 | 13,000 | Open-ended | Win Rate |
OmniEarth-Bench | Hydrosphere, Biosphere, Lithosphere, Atmosphere, Cryosphere | Remote sensing, Science QA | Expert | VQA with CoT (multi-image) | 2025.05 | EN | Integration of existing datasets | Manual | 40+ | Data generation and review | | 29,779 | MCQ | Acc, Precision, Recall, F1 |
MSEarth | Hydrosphere, Biosphere, Lithosphere, Atmosphere, Cryosphere | Academic papers | Expert | VQA with CoT | 2025.05 | EN | Academic and research resources | Semi-automated | 20+ | Data review | GPT-4o | 11,500 | MCQ, Open-ended | BLEU, BERTScore, Acc |
EarthSE | Hydrosphere, Biosphere, Lithosphere, Atmosphere, Cryosphere | Academic papers | Expert | Text QA with CoT | 2025.05 | EN | Academic and research resources | Semi-automated | 20+ | Data review | GPT-4o | 10,000 | Open-ended | Acc |
GeoBench | Lithosphere | Science QA | Expert | Text QA | 2023.06 | EN | Web and Internet content, Academic and research resources | Semi-automated | 10+ | Data review | | 2,516 | MCQ, Open-ended | Acc, GPTScore |
XLRS-Bench | Remote Sensing | Remote sensing | N/A | Image-text, VQA | 2025.03 | EN, ZH | Academic and research resources | Semi-automated | 55 | Data generation and review | GPT-4o | 32,389 | MCQ, Open-ended | Acc, IoU, BLEU, etc. |
LRS-VQA | Remote Sensing | Remote sensing | N/A | Image-text, VQA | 2025.03 | EN | Academic and research resources | Automated | N/A | N/A | Qwen2-VL, GPT-4V | 7,333 | Open-ended | Acc |
MME-RealWorld-RS | Remote Sensing | Remote sensing | N/A | Image-text, VQA | 2024.08 | EN, ZH | Academic and research resources | Manual | N/A | Data generation and review | N/A | 3,738 | MCQ | Acc |
VRSBench | Remote Sensing | Remote sensing | N/A | Image-text, VQA | 2024.06 | EN | Academic and research resources | Semi-automated | N/A | Data review | GPT-4V | 62,917 | Open-ended | Acc, IoU, BLEU, etc. |
GeoChat | Remote Sensing | Remote sensing | N/A | Image-text, VQA | 2023.11 | EN | Academic and research resources, Integration of existing datasets | Automated | N/A | N/A | Vicuna | 10K | Open-ended | Acc, IoU, METEOR |
RSIEval | Remote Sensing | Remote sensing | N/A | Image-text, VQA | 2023.07 | EN | Academic and research resources | Manual | 5 | Data generation and review | N/A | 1,036 | Open-ended | Acc, BLEU, ROUGE, etc. |
DIOR-RSVG | Remote Sensing | Remote sensing | N/A | Image-text | 2022.10 | EN | Academic and research resources | Semi-automated | N/A | Data review | N/A | 17,402 | Open-ended | IoU |
NWPU-Captions | Remote Sensing | Remote sensing | N/A | Image-text | 2022.08 | EN | Academic and research resources | Manual | N/A | Data generation | N/A | 31,500 | Open-ended | BLEU, METEOR, etc. |
RSVQA-HRBEN | Remote Sensing | Remote sensing | N/A | Image-text, VQA | 2020.05 | EN | Scientific databases | Automated | N/A | N/A | N/A | 77,232 | Open-ended | Acc |
RSVQA-LRBEN | Remote Sensing | Remote sensing | N/A | Image-text, VQA | 2020.05 | EN | Scientific databases | Automated | N/A | N/A | N/A | 1,066,316 | Open-ended | Acc |
RSICD | Remote Sensing | Remote sensing | N/A | Image-text | 2017.12 | EN | Academic and research resources | Manual | N/A | Data generation | N/A | 10,921 | Open-ended | BLEU, METEOR, CIDEr |
UCM-Captions | Remote Sensing | Remote sensing | N/A | Image-text | 2016.07 | EN | Academic and research resources | Manual | N/A | Data generation | N/A | 2,100 | Open-ended | BLEU, METEOR, CIDEr |
Sydney-Captions | Remote Sensing | Remote sensing | N/A | Image-text | 2016.07 | EN | Academic and research resources | Manual | N/A | Data generation | N/A | 613 | Open-ended | BLEU, METEOR, CIDEr |
Dataset | Scientific Domain | Modality | Type | Release | Language | Source | Annotation Pipeline | Human Annotators | Human Tasks | Auto-annotation Tools | Size | Level | Evaluation Type | Metrics |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
MMMU | Science (Biology, Chemistry, Geography, Math, Physics), Health & Medicine (Basic Medical Science, Clinical Medicine, Diagnostics, Pharmacy, Public Health), Tech & Engineering (Materials, etc.) | Scientific VQA, MRI, CT, X-ray, etc. | VQA | 2023.11 | EN | Comprehensive multi-source integration | Semi-automated | 50 | Data review | Claude, GPT-4, GPT-4V | 11,550 | Expert | MCQ | Acc |
MMMU Pro | Science (Biology, Chemistry, Geography, Math, Physics), Health & Medicine (Basic Medical Science, Clinical Medicine, Diagnostics, Pharmacy, Public Health), Tech & Engineering (Materials, etc.) | Scientific VQA, MRI, CT, X-ray, etc. | VQA | 2023.11 | EN | Comprehensive multi-source integration | Semi-automated | N/A | Data review | Claude, GPT-4, GPT-4V | 5,190 | Expert | MCQ | Acc |
ScienceQA | Biology, Earth Science, Physics, Chemistry, Geography | Scientific query, Scientific instruction, Science textbooks and literature | VQA | 2022.01 | EN | Books and literary works | Manual | 9+ | Data generation and review | ViT, GPT-2 | 21.2k | Primary school, Secondary school | MCQ | Acc |
SciQA | Materials Science, Chemistry, Life Sciences | Scientific query | Text QA | 2023.05 | EN | Academic and research resources | Manual | N/A | Data generation and review | N/A | 2,565 | Expert | Open-ended | Acc |
SciCode | Materials Science, Biology, Chemistry, Physics, Mathematics | Scientific instruction | Text QA | 2024.08 | EN | Academic and research resources | Manual | N/A | Data generation and review | N/A | 338 | Expert | Open-ended | pass@1 |
CURIE | Materials Science, Life Sciences, Physics, Earth Science | Scientific query | VQA | 2025.04 | EN | Integration of existing datasets | Manual | N/A | Data generation and review | N/A | 580 | Expert | Open-ended | Acc |
TheoremQA | Physics, Mathematics | Theorems | Text QA | 2023.12 | EN | Books and literary works, Encyclopedias and knowledge bases | Manual | N/A | Data generation and review | N/A | 800 | Undergraduate, Expert | Open-ended | Acc |
SciBench | Physics, Chemistry | Science QA | Text QA | 2023.09 | EN | Books and literary works | Manual | 7 | Data review | N/A | 695 | Undergraduate | Open-ended | Acc |
JEEBench | Physics, Chemistry | Science Exams | Text QA | 2023.12 | EN | Other sources | Semi-automated | N/A | Data generation and review | N/A | 515 | Expert | MCQ | Acc |
MMLU | Physics (College Physics, Conceptual Physics, High School Physics), Chemistry (College Chemistry, High School Chemistry), Biology (College Biology, High School Biology) | Science QA | Text QA | 2020.09 | EN | Books and literary works | Manual | 7 | Data generation and review | N/A | 15.9k | Secondary School, Undergraduate, Expert | MCQ | Acc |
C-Eval | Chemistry (College Chemistry, High School Chemistry, Middle School Chemistry), Physics (College Physics, High School Physics, Middle School Physics), Biology (High School Biology, Middle School Biology), Medicine (Veterinary Medicine, Basic Medicine, Clinical Medicine, Physician), Earth Science (High School Geography, Middle School Geography) | Exam questions, Chinese educational assessments | Text QA | 2023.05 | ZH | Books and literary works | Manual | 12 | Data generation and review | N/A | 13.9k | Primary school, Secondary school, Undergraduate | MCQ | Acc |
GPQA | Chemistry, Biology, Physics | Graduate-level scientific questions | Text QA | 2023.11 | EN | Other sources | Manual | 8 | Data generation and review | N/A | 448 | Expert | MCQ | Acc |
ArXivQA | Physics (Accelerator Physics, High Energy Physics - Lattice, Mathematical Physics, etc.), Chemistry (Chemical Physics), Biology (Quantitative Biology), Material (Materials Theory) | Scientific figure question-answer | Text QA | 2024.05 | EN, ZH | Other sources | Semi-automated | N/A | Data generation and review | GPT-4V | 249,587 | Expert | MCQ | Acc |
Xiezhi | Agronomy (Crop Science, Veterinary Medicine), Science (Chemistry, Physics), Medicine (Traditional Chinese Medicine) | Professional exams | Text QA | 2024.05 | EN | Comprehensive multi-source integration | Semi-automated | N/A | Data review | ChatGPT, Llama-7B | 250k | Expert | MCQ | Acc |
SuperGPQA | Medicine, Science, Agriculture | Graduate Disciplines QA | Text QA | 2025.02 | EN | Other sources | Semi-automated | 80+ | Data generation and review | N/A | 26.5k | Expert | MCQ | Acc |
BMMR | Health (Medicine, Pharmacy, etc.), Natural Sciences (Physics, Biology, etc.), Agriculture | Image, College-level visual question answering, OCR-based QA | VQA | 2025.07 | EN, ZH | Web and Internet content, Books and literary works, Integration of existing datasets | Semi-automated | N/A | Data generation and review | N/A | 109,449 | Primary school, Undergraduate, Secondary school | MCQ | Acc |
OlympiadBench | Physics, Mathematics | QA from math and physics competitions, Image | Text QA, VQA | 2024.02 | ZH, EN | Other sources | Semi-automated | 14 | Data generation and review | N/A | 8.5k | Expert | Open-ended | Acc |
LLM-SRBench | LSR-Synth (Chemistry, Biology, Physics, Material Science), LSR-Transform | Structured Data | text | 2025.04 | EN | Comprehensive multi-source integration | Fully-automated | 0 | Data generation and review | N/A | 239 | Expert | Open-ended | Exact Match, MSE |
HLE | Biology (Marine Biology, Molecular Biology, Computational Biology, Ecology, etc.), Chemistry (Chemical Engineering, Biochemistry, etc.), Physics (Biophysics), Materials Science | Organic reaction analysis, Molecular text, Chemical equations, Medical question answering, Textbook QA, etc. | Text QA | 2025.01 | EN | Academic and research resources | Manual | nearly 1000 | Data generation and review | GPT-4o | 2,500 | Expert | MCQ, Open-ended | Acc, Calibration Error |
SFE | Astronomy, Chemistry, Life Science, Materials Science, Earth Science | Protein structure, RNA structure, Molecular structure, etc. | VQA | 2025.06 | EN, ZH | Scientific databases, Academic and research resources | Manual | N/A | Data generation and review | GPT-4o | 1,660 | Expert | MCQ, Open-ended | Exact Match, LLM-as-a-Judge score, BERTScore, IoU |
SciEval | Chemistry, Physics, Biology | Text (equations, molecules, chemical reactions, scientific QA, etc.) | Text QA | 2023.08 | EN | Comprehensive multi-source integration | Semi-automated | N/A | Data review | GPT-4 | 18,000 | Undergraduate, Graduate | MCQ, Open-ended, True/False | Acc, BLEU, MSE |
SciKnowEval | Chemistry, Physics, Biology, Materials Science | Textbook QA, Literature QA, SMILES, IUPAC, Equations, etc. | Text QA, Classification, Regression | 2024.06 | EN | Comprehensive multi-source integration, Academic and research resources, Scientific databases, Integration of existing datasets | Semi-automated | N/A | Data review | GPT-4o, GPT-3.5, Claude3, LLaMA, Qwen | 70,203 | Undergraduate, Graduate, Expert | MCQ, True/False, Open-ended | Acc, F1, BLEU, ROUGE, Smith-Waterman, Tanimoto |
AGIEval | Chemistry (GK-chemistry), Physics (GK-physics), Biology (GK-biology), Geography (GK-geography) | Textbook, Literature, SMILES, IUPAC, Equations, etc. | Text QA | 2023.09 | EN, ZH | Academic and research resources, Integration of existing datasets | Semi-automated | N/A | Data review | ChatGPT, GPT-4 | 8,062 | Secondary school, Undergraduate | MCQ, Open-ended | Acc, EM |
ScienceAgentBench | Bioinformatics, Computational chemistry, Geographical information science, Psychology & cognitive neuroscience | Microscopy images, SMILES strings, Geospatial data, EEG, ECG, IMU, etc. | VQA | 2024.10 | EN | Academic and research resources, Integration of existing datasets | Manual | 9 | Data generation and review | N/A | 102 tasks | Expert | Open-ended | VER, SR, CodeBERTScore, GPT-4o Judge |
Models | Domain | Parameters | Base LLM | Modality encoder | Release | Open-source |
---|---|---|---|---|---|---|
Galactica | General Science | 120B | N/A | N/A | 2022.11 | β |
DARWIN | General Science | 7B | LLaMA-7B, Vicuna-7B | N/A | 2023.08 | β |
FORGE | General Science | 26B | GPT-NeoX | N/A | 2023.11 | β |
SciGLM | General Science | 6B / 32B | ChatGLM3 | N/A | 2024.01 | β |
SciDFM | General Science | 18.2B-A5.6B | N/A | N/A | 2024.09 | β |
OmniScience | General Science | 70B | LLaMA-3.1 | N/A | 2025.03 | β |
Intern-S1 | General Science | 241B-A28B/8B | Qwen3-235B-A22B, Qwen3-8B | InternViT-6B, InternViT-300M | 2025.08 | β |
Models | Domain | Parameters | Base LLM | Modality encoder | Release | Open-source |
---|---|---|---|---|---|---|
MechGPT | Mechanics | 13B / 70B | LLaMA-2 | N/A | 2023.10 | β |
Xiwu | High Energy Physics | 7B / 13B | LLaMA, Vicuna, ChatGLM, Grok-1 | N/A | 2024.04 | β |
Poseidon | Partial Differential Equations | 0.02B / 0.2B / 0.6B | scOT | N/A | 2024.05 | β |
L3M | Astrophysics | 0.5B | Qwen2.5-0.5B-Instruct | N/A | 2025.06 | β |
Models | Domain | Parameters | Base LLM | Modality encoder | Release | Open-source |
---|---|---|---|---|---|---|
ChemLLM | Chemistry, Pharmacy | 7B | InternLM2 | N/A | 2024.06 | β |
LLM-RDF | Chemistry, chemical synthesis | N/A | GPT-4 | N/A | 2024.11 | β |
InstructMol | Biochemistry, Chemistry, Pharmacy | 7B | LLaMA | molecular graph encoder | 2024.12 | β |
ChemDFM | Chemistry (molecular design), Chemistry | 13B | LLaMA-2 | N/A | 2025.07 | β |
ChemMLLM | Chemistry (molecular design), Pharmacy | 34B | Lumina-mGPT-34B-512 | VQGAN | 2025.08 | β |
Chemma | Chemistry, Organic Chemistry | 7B | LLaMA-2 | N/A | 2025.07 | β |
Chem3DLLM | Chemistry (Molecular Design), Pharmacy | 7B | Qwen2-7B | ESM-Encoder | 2025.08 | β |
Models | Domain | Parameters | Base LLM | Modality encoder | Release | Open-source |
---|---|---|---|---|---|---|
SMILES-BERT | Materials Science | 30M | BERT-small | N/A | 2019.09 | β |
MolGPT | Materials Science | N/A | N/A | N/A | 2022.05 | β |
MOFormer | Materials Science | N/A | N/A | N/A | 2022.10 | β |
MatBert-bandgap | Materials Science | 110M | MatBERT | N/A | 2023.03 | β |
Regression Transformer | Materials Science | N/A | N/A | N/A | 2023.04 | β |
MolXPT | Materials Science | N/A | GPT-2 | N/A | 2023.05 | β |
xyztransformer | Materials Science | N/A | N/A | N/A | 2023.05 | β |
polyBERT | Materials Science | N/A | DeBERTa | N/A | 2023.07 | β |
GPT-MolBERTa | Materials Science | N/A | RoBERTa | N/A | 2023.10 | β |
ChemRLformer | Materials Science | N/A | N/A | N/A | 2023.10 | β |
CrystaLLM | Materials Science | 70B | LLaMA-2 70B | N/A | 2024.02 | β |
MatText | Materials Science | N/A | BERT | N/A | 2024.06 | β |
ChatMOF | Materials Science | N/A | GPT-4, GPT-3.5-turbo, and GPT-3.5-turbo-16k | N/A | 2024.06 | β |
LHS2RHS | Materials Science | N/A | N/A | N/A | 2024.10 | β |
RHS2LHS | Materials Science | N/A | N/A | N/A | 2024.10 | β |
TGT2CEQ | Materials Science | N/A | N/A | N/A | 2024.10 | β |
CrystaLLM | Materials Science | 200M | GPT-2 | N/A | 2024.12 | β |
molT5-large | Materials Science | 770M | T5-large | N/A | 2024.12 | β |
Qwen2-KG | Materials Science | 72B | Qwen2-72B | N/A | 2025.02 | β |
LLM-Prop | Materials Science | 37M | T5-small | N/A | 2025.06 | β |
Crystal Synthesis LLM | Materials Science | 8B | LLaMA-3-8B | N/A | 2025.07 | β |
Models | Domain | Parameters | Base LLM | Modality encoder | Release | Open-source |
---|---|---|---|---|---|---|
ShizhenGPT | Healthcare and Medical Sciences | 7B / 32B | Qwen2.5 | Qwen2.5-VL vision encoder, Whisper-large-v3 | 2025.08 | β |
ProGen2 | Proteomics | 6.4B / 2.7B / 764M / 151M | N/A | N/A | 2022.06 | β |
BioGPT | Healthcare and Medical Sciences, General Biology | 347M | GPT-2 | N/A | 2022.10 | β |
ESM-2 | Proteomics | 15B / 3B / 650M / 150M / 35M / 8M | N/A | N/A | 2023.03 | β |
OphGLM | Healthcare and Medical Sciences | 6B | ChatGLM-6B | ConvNext | 2023.03 | β |
MedAlpaca | Healthcare and Medical Sciences | 7B / 13B | LLaMA | N/A | 2023.04 | β |
DoctorGLM | Healthcare and Medical Sciences | 6B | ChatGLM-6B | N/A | 2023.04 | β |
PMC-LLaMA | Healthcare and Medical Sciences | 13B | LLaMA | N/A | 2023.04 | β |
scGPT | Multi-omics | 30k / 300k / 3M / 33M | N/A | N/A | 2023.04 | β |
Med-PaLM | Healthcare and Medical Sciences | N/A | PaLM | N/A | 2023.05 | β |
Med-PaLM 2 | Healthcare and Medical Sciences | N/A | PaLM 2 | N/A | 2023.05 | β |
GatorTronS | Healthcare and Medical Sciences | 345M / 3.9B / 8.9B | GPT-3 | N/A | 2023.05 | β |
GatorTronGPT | Healthcare and Medical Sciences | 5B / 20B | GPT-3 | N/A | 2023.05 | β |
HuatuoGPT | Healthcare and Medical Sciences | 7B / 13B | Baichuan-7B, Ziya-LLaMA-13B-Pretrain-v1 | N/A | 2023.05 | β |
BiomedGPT | Healthcare and Medical Sciences | 33M / 93M / 182M | OFA | VQ-GAN | 2023.05 | β |
ClinicalGPT | Healthcare and Medical Sciences | 7B | BLOOM-7B | N/A | 2023.06 | β |
GENA-LM | Molecular and Cell Biology, Multi-omics | 110M / 336M | BERT | N/A | 2023.06 | β |
NYUTron | Healthcare and Medical Sciences, Neuroscience, Pharmacy | 190M | BERT | N/A | 2023.06 | β |
ChatDoctor | Healthcare and Medical Sciences | 7B | LLaMA | N/A | 2023.06 | β |
SoulChat | Neuroscience, Healthcare and Medical Sciences | 6B | ChatGLM-6B | N/A | 2023.07 | β |
DNAGPT | Molecular and Cell Biology, Multi-omics | 3B | GPT | N/A | 2023.07 | β |
Med-Flamingo | Healthcare and Medical Sciences | 9B | OpenFlamingo | OpenFlamingo | 2023.07 | β |
DISC-MedLLM | Healthcare and Medical Sciences | 13B | Baichuan-13B | N/A | 2023.08 | β |
IvyGPT | Healthcare and Medical Sciences | 33B | LLaMA-33B | N/A | 2023.08 | β |
Zhongjing | Healthcare and Medical Sciences | 13B | Ziya-LLaMA-13B-V1 | N/A | 2023.08 | β |
Radiology-Llama2 | Healthcare and Medical Sciences | 7B | LLaMA-2 | N/A | 2023.08 | β |
RadFM | Healthcare and Medical Sciences | 9B | MedLLaMA-13B | 3D ViT | 2023.08 | β |
CPLLM | Healthcare and Medical Sciences | 13B | Llama2-13B | N/A | 2023.09 | β |
DRG-LLaMA | Healthcare and Medical Sciences | 7B | LLaMA-7B | N/A | 2023.09 | β |
MindGPT | Neuroscience, Healthcare and Medical Sciences | 124M | GPT-2 | CLIP-ViT-B/32 | 2023.09 | β |
BioinspiredLLM | General Biology, Molecular and Cell Biology, Proteomics | 13B | LLaMA-2 | N/A | 2023.09 | β |
Qilin-Med | Healthcare and Medical Sciences | 7B | Baichuan-7B | N/A | 2023.10 | β |
CXR-LLaVA | Healthcare and Medical Sciences | 7B | LLaMA-2 | CLIP ViT-L/16 | 2023.10 | β |
InstructProtein | Proteomics | 1.3B | OPT-1.3B | N/A | 2023.10 | β |
ChiMed-GPT | Healthcare and Medical Sciences | 13B | Ziya-13B-v2 | N/A | 2023.11 | β |
HuatuoGPT-II | Healthcare and Medical Sciences | 7B / 13B | Baichuan2-7B-Base, Baichuan2-13B-Base | N/A | 2023.11 | β |
Taiyi-LLM | Healthcare and Medical Sciences | 7B | Qwen-7B-base | N/A | 2023.11 | β |
Meditron | Healthcare and Medical Sciences | 7B / 70B | LLaMA-2 | N/A | 2023.11 | β |
MAIRA-1 | Healthcare and Medical Sciences | 7B | Vicuna-7B | RAD-DINO | 2023.11 | β |
MAIRA-2 | Healthcare and Medical Sciences | 7B | Vicuna-7B-v1.5 | RAD-DINO | 2023.11 | β |
Neuro-GPT | Neuroscience, Healthcare and Medical Sciences | 124M | GPT-2 | EEG Encoder | 2023.11 | β |
PLLaMa | Molecular and Cell Biology, General Biology | 7B / 13B | LLaMA-2 | N/A | 2024.01 | β |
EEG-GPT | Neuroscience | N/A | GPT-3 | EEG Encoder | 2024.01 | β |
BioMistral | Healthcare and Medical Sciences, Molecular and Cell Biology | 7B | Mistral-7B-Instruct-v0.1 | N/A | 2024.02 | β |
MMed-LLaMA 3 | Healthcare and Medical Sciences | 8B | LLaMA 3 | N/A | 2024.02 | β |
ProLLaMA | Proteomics | 7B | LLaMA-2 | N/A | 2024.02 | β |
ProtLLM | Proteomics | 7B | LLaMA-7B | ProtST (protein) | 2024.03 | β |
BrainGPT | Neuroscience | 7B | Mistral-7B | N/A | 2024.03 | β |
Apollo | Healthcare and Medical Sciences | 0.5B / 1.8B / 2B / 6B / 7B | Qwen | N/A | 2024.03 | β |
Med-Gemini | Healthcare and Medical Sciences | N/A | Gemini 1.5 Pro | Custom encoders (multimodal) | 2024.04 | β |
UMBRAE | Neuroscience | 7B | Vicuna-7B | CLIP-ViT/L-14 (vision), Encoder (fMRI) | 2024.04 | β |
SeedLLM | Agronomy | 7B | Qwen2.5 | N/A | 2024.04 | β |
AlphaFold3 | Molecular and Cell Biology, Proteomics, Pharmacy, Neuroscience | N/A | N/A | Input Feature Embedder | 2024.05 | β |
DrugLLM | Pharmacy | 7B | LLaMA 7B | N/A | 2024.05 | β |
LLaVA-Med | Healthcare and Medical Sciences | N/A | Vicuna-7B | CLIP ViT-L/14 | 2024.05 | β |
CareGPT | Healthcare and Medical Sciences | 7B | LLaMA-2 | N/A | 2024.05 | β |
ProtT3 | Proteomics | N/A | Galactica 1.3B | ESM-2 (protein) | 2024.05 | β |
MolecularGPT | Molecular and Cell Biology | N/A | LLaMA | N/A | 2024.06 | β |
HuatuoGPT-Vision | Healthcare and Medical Sciences | 7B / 34B | Qwen2-7B | Qwen Image Encoder (vision) | 2024.06 | β |
NeuroLM | Neuroscience | 254M / 500M / 1.7B | GPT-2 | Encoder (EEG) | 2024.08 | β |
RNAGPT | Molecular and Cell Biology, Multi-omics | 8B | LLaMA-3 | RNA-FM sequence encoder (RNA) | 2024.10 | β |
AgroGPT | Agronomy | 3B / 7B | LLaVA-1.5, Mipha | CLIP-ViT-L/14 (vision), SigLIP | 2024.10 | β |
LLaMA-Gene | Molecular and Cell Biology, Proteomics | 7B | LLaMA-7B | N/A | 2024.11 | β |
GMAI-VL | Healthcare and Medical Sciences | 7B | InternLM | Image Encoder (vision) | 2024.11 | β |
HuatuoGPT-o1 | Healthcare and Medical Sciences | 7B / 8B / 70B / 72B | LLaMA-3.1, Qwen2.5 | N/A | 2024.12 | β |
Evolla | Proteomics | 10B / 80B | LLaMA-3 8B | SaProt (protein) | 2025.01 | β |
UniMind | Neuroscience | 7B | InternLM2.5 | Encoder (EEG) | 2025.01 | β |
NatureLM | Pharmacy, Molecular and Cell Biology, Proteomics, Material | 46.7B | Mixtral 8x7B | N/A | 2025.02 | β |
MindLLM | Neuroscience, Healthcare and Medical Sciences | 7B | Vicuna-7B | Encoder (fMRI) | 2025.02 | β |
MedVLM-R1 | Healthcare and Medical Sciences | 2B | Qwen2-VL | Qwen Image Encoder (vision) | 2025.02 | β |
AlphaGenome | Molecular and Cell Biology, Multi-omics | N/A | N/A | N/A | 2025.05 | β |
ChatNT | Molecular and Cell Biology, Proteomics, Multi-omics | 7B | Vicuna-7B | Nucleotide Transformer v2 (DNA) | 2025.06 | β |
Lingshu | Healthcare and Medical Sciences | 7B / 32B | Qwen | N/A | 2025.06 | β |
PodGPT | Healthcare and Medical Sciences | N/A | Gemma, Mixtral, LLaMA | N/A | 2025.07 | β |
MedGemma | Healthcare and Medical Sciences | 4B / 27B | Gemma 3 | SigLIP Image Encoder (vision) | 2025.07 | β |
Models | Domain | Parameters | Base LLM | Modality encoder | Release | Open-source |
---|---|---|---|---|---|---|
AstroLLaMA-2-7B | Astronomy | 7B | Llama-2 LLM | N/A | 2023.09 | β |
AstroLLaMA-3-8B | Astronomy | 8B | LLaMA-3-8B LLM | N/A | 2024.09 | β |
AstroLLaMA-2-70B | Astronomy | 70B | LLaMA-2-70B LLM | N/A | 2024.09 | β |
AstroSage-LLaMA-3.1-8B | Astronomy | 8B | Llama-3.1-8B LLM | N/A | 2025.04 | β |
AstroLLaVA-7B | Astronomy | 7B | LLaVA 1.5 LLM | CLIP-ViT/L-14 (vision) | 2025.04 | β |
AstroSage-LLaMA-3.1-70B | Astronomy | 70B | Llama-3.1-70B LLM | N/A | 2025.05 | β |
Models | Domain | Parameters | Base LLM | Modality encoder | Release | Open-source |
---|---|---|---|---|---|---|
OceanGPT | Hydrosphere, Biosphere, Lithosphere, Remote Sensing | 7B | LLaMA, Qwen | N/A | 2023.03 | β |
K2 | Lithosphere, Remote Sensing | 7B | LLaMA | N/A | 2023.08 | β |
GeoChat | Remote Sensing, Lithosphere | 7B | Vicuna-v1.5 | N/A | 2023.11 | β |
SkyEyeGPT | Remote Sensing | 7B | N/A | N/A | 2024.01 | β |
TeoChat | Remote Sensing, Lithosphere | 7B | Video-LLaVA | N/A | 2024.10 | β |
EarthMarker | Remote Sensing | 13B | LLaMA-2 | N/A | 2024.11 | β |
EarthDial | Remote Sensing | 4B | Phi-3-mini | N/A | 2024.12 | β |
GeoPixel | Remote Sensing, Lithosphere | 7B | IXC-2.5 | N/A | 2025.01 | β |
EagleVision | Remote Sensing | 1B / 2B / 4B / 7B | Qwen2-VL-72B, GPT-4o | N/A | 2025.03 | β |
ClimateChat | Lithosphere, Climate | 7B | JiuZhou | N/A | 2025.03 | β |
GeoGPT | Lithosphere, Remote Sensing | 70B | Llama3.1-70B, Qwen2.5-72B | N/A | 2025.04 | β |
GeoLLaVA-8K | Remote Sensing, Lithosphere | 7B | LongVA | N/A | 2025.05 | β |