open-sciencelab/Awesome-Scientific-Datasets-and-LLMs

Awesome-Scientific-Datasets-and-LLMs

A curated collection of papers, datasets, and resources on Scientific Datasets and Large Language Models (LLMs), organized to accompany our survey: "A Survey of Scientific Large Language Models: From Data Foundations to Agent Frontiers".

If you spot any mistakes or have suggestions, feel free to reach out by email: [email protected]

(We also recommend CC'ing [email protected] and [email protected] in case the message is not delivered.)

📖 Citation

If you find this repository or our survey helpful in your research, please kindly cite our paper:

@misc{hu2025surveyscientificlargelanguage,
      title={A Survey of Scientific Large Language Models: From Data Foundations to Agent Frontiers}, 
      author={Ming Hu and Chenglong Ma and Wei Li and Wanghan Xu and Jiamin Wu and Jucheng Hu and Tianbin Li and Guohang Zhuang and Jiaqi Liu and Yingzhou Lu and Ying Chen and Chaoyang Zhang and Cheng Tan and Jie Ying and Guocheng Wu and Shujian Gao and Pengcheng Chen and Jiashi Lin and Haitao Wu and Lulu Chen and Fengxiang Wang and Yuanyuan Zhang and Xiangyu Zhao and Feilong Tang and Encheng Su and Junzhi Ning and Xinyao Liu and Ye Du and Changkai Ji and Cheng Tang and Huihui Xu and Ziyang Chen and Ziyan Huang and Jiyao Liu and Pengfei Jiang and Yizhou Wang and Chen Tang and Jianyu Wu and Yuchen Ren and Siyuan Yan and Zhonghua Wang and Zhongxing Xu and Shiyan Su and Shangquan Sun and Runkai Zhao and Zhisheng Zhang and Yu Liu and Fudi Wang and Yuanfeng Ji and Yanzhou Su and Hongming Shan and Chunmei Feng and Jiahao Xu and Jiangtao Yan and Wenhao Tang and Diping Song and Lihao Liu and Yanyan Huang and Lequan Yu and Bin Fu and Shujun Wang and Xiaomeng Li and Xiaowei Hu and Yun Gu and Ben Fei and Zhongying Deng and Benyou Wang and Yuewen Cao and Minjie Shen and Haodong Duan and Jie Xu and Yirong Chen and Fang Yan and Hongxia Hao and Jielan Li and Jiajun Du and Yanbo Wang and Imran Razzak and Chi Zhang and Lijun Wu and Conghui He and Zhaohui Lu and Jinhai Huang and Yihao Liu and Fenghua Ling and Yuqiang Li and Aoran Wang and Qihao Zheng and Nanqing Dong and Tianfan Fu and Dongzhan Zhou and Yan Lu and Wenlong Zhang and Jin Ye and Jianfei Cai and Wanli Ouyang and Yu Qiao and Zongyuan Ge and Shixiang Tang and Junjun He and Chunfeng Song and Lei Bai and Bowen Zhou},
      year={2025},
      eprint={2508.21148},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2508.21148}, 
}

In addition, "Awesome-Agent-Scientists" highlights the latest advances of AI agents in scientific research, which nicely complements our work.

@article{wei2025ai,
  title={From AI for Science to Agentic Science: A Survey on Autonomous Scientific Discovery},
  author={Wei, Jiaqi and Yang, Yuejin and Zhang, Xiang and Chen, Yuhan and Zhuang, Xiang and Gao, Zhangyang and Zhou, Dongzhan and Wang, Guangshuai and Gao, Zhiqiang and Cao, Juntai and others},
  journal={arXiv preprint arXiv:2508.14111},
  year={2025}
}

📈 Trends in Scientific LLM Publications

Figure: Cumulative trend of publications on major preprint platforms whose titles or abstracts mention the keyword “language model” or the combination “language model + scientific domain” (e.g., chemistry, physics, multi-omics, medicine). Left: Results from January 2018 to August 2025, from arXiv and PubMed. For arXiv, the matching includes “language model” in combination with additional science-related keywords; PubMed results are limited to occurrences in titles and abstracts. Both platforms show rapid growth. Right: Results from 2020 to August 2025, from bioRxiv, medRxiv, and ChemRxiv, all based on direct matches of “language model” in titles and abstracts. While the overall volumes are smaller than arXiv and PubMed, all three platforms, especially bioRxiv, show rapid acceleration, reflecting growing interdisciplinary interest in large language models across biomedical, chemical, and computational sciences.
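The keyword matching described in the caption above can be illustrated with a minimal sketch. It is not the survey's actual pipeline: the record fields (`year`, `title`, `abstract`), the `DOMAIN_KEYWORDS` tuple, and the sample records are all assumptions made up for illustration, and no platform API is queried.

```python
from collections import Counter
from itertools import accumulate

# Hypothetical domain keywords; the survey's real keyword list is not reproduced here.
DOMAIN_KEYWORDS = ("chemistry", "physics", "multi-omics", "medicine")


def matches(record, require_domain=False):
    """Return True if the title or abstract mentions "language model".

    With require_domain=True, one of DOMAIN_KEYWORDS must also appear.
    """
    text = f"{record['title']} {record['abstract']}".lower()
    if "language model" not in text:
        return False
    if require_domain:
        return any(keyword in text for keyword in DOMAIN_KEYWORDS)
    return True


def cumulative_counts(records, require_domain=False):
    """Map each year to the cumulative number of matching records up to that year."""
    per_year = Counter(r["year"] for r in records if matches(r, require_domain))
    years = sorted(per_year)
    totals = accumulate(per_year[y] for y in years)
    return dict(zip(years, totals))


# Made-up records, for illustration only.
records = [
    {"year": 2018, "title": "A language model for chemistry", "abstract": ""},
    {"year": 2019, "title": "Graph networks", "abstract": "No relevant keyword."},
    {"year": 2019, "title": "Pretraining", "abstract": "We train a large language model."},
    {"year": 2020, "title": "Language models in medicine", "abstract": ""},
]

print(cumulative_counts(records))
print(cumulative_counts(records, require_domain=True))
```

Plain substring matching is used here for simplicity; a real count would likely need platform-specific search syntax and deduplication across sources.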

🔬 Development of Sci-LLMs

Figure: The evolution of Sci-LLMs reveals four paradigm shifts from 2018 to 2025: (1) transfer learning approaches; (2) the scaling era, marked by knowledge integration in larger models; (3) instruction-following capabilities that enable flexible task adaptation; and (4) the latest paradigm, which introduces scientific agents, i.e., AI systems capable of autonomously conducting scientific research, from hypothesis generation and experimental design to data analysis and discovery. Note: Model positions reflect their release dates (x-axis) rather than strict paradigm classification. The four paradigms represent evolving trends in Sci-LLM development with overlaps and continuities, not mutually exclusive categories.

🕑 Timeline of Sci-LLMs

Figure: Chronological overview of notable Sci-LLMs categorized by six scientific domains, spanning from 2019 through early 2025. Due to the rapid expansion of the field, this figure presents a selective overview.

📑 Table of Contents

🧪 Scientific Pretraining, SFT, Reasoning, and Agent Datasets

🧬 Life Sciences

⬆ Back to Top

Dataset Domain Modality Purpose Type Release Language Source Annotation Pipeline Human Annotators Human Tasks Auto-annotation Tools Size
MIRAGE Agriculture Biological entity photos SFT VQA (multi-image) 2025.06 EN Scientific databases Semi-automated N/A Data generation GPT-4.1 37,512
CROP Agriculture Academic papers SFT Text QA 2024.09 EN, ZH Academic and research resources Semi-automated N/A Data generation GPT-4 211,909
ToT-Biology General Biology Biomedical QA SFT, CoT Text QA with CoT 2025.01 EN Academic and research resources N/A N/A N/A N/A 23,000
BioASQ10b-factoid General Biology Clinical dialogue SFT Text QA 2023.07 EN Academic and research resources Manual N/A Data generation and review N/A 1.25K
ReasonMed Healthcare and Medical Sciences Clinical dialogue SFT, CoT Text QA with CoT 2025.06 EN Comprehensive multi-source integration Automated N/A N/A Qwen-2.5-72B, DeepSeek-R1-Distill-Llama-70B, HuatuoGPT-o1-70B 194,925
Open-PMC-18M Healthcare and Medical Sciences CT, CFP Pre-training Image-text 2025.06 EN Academic and research resources Automated N/A N/A N/A 25,000,000
ReXVQA Healthcare and Medical Sciences X-ray SFT VQA 2025.06 EN Integration of existing datasets Semi-automated 3 Data review GPT-4o, ClinicalBERT, MedEmbed 613,277
RexGradient-160K Healthcare and Medical Sciences X-ray Pre-training, SFT Image-text 2025.05 EN Scientific databases Manual N/A N/A N/A 160K
AlphaMed19K Healthcare and Medical Sciences Biomedical QA SFT, CoT Text QA 2025.05 EN Integration of existing datasets Automated N/A Data generation and review N/A 19,178
Derm1M Healthcare and Medical Sciences Dermatological images Pre-training Image-text 2025.03 EN Social media and forums, Academic and research resources Automated N/A N/A DenseNet, DINO, GPT-4o, Whisper 1,029,761
MedVideoCap-55K Healthcare and Medical Sciences Medical videos Pre-training, SFT Video-text 2025.04 EN Web and Internet content Automated N/A Data review GPT-4o 55,803
medical-o1-reasoning-SFT Healthcare and Medical Sciences Clinical dialogue SFT, CoT Text QA with CoT 2025.04 EN, ZH Comprehensive multi-source integration Automated N/A N/A DeepSeek-R1 90,200
GMAI-Reasoning10K Healthcare and Medical Sciences CT, Dermatology, Endoscopy, CFP, Histopathology, MRI, Microscopy, OCT, PET, US, X-ray, etc. SFT VQA 2025.04 EN Comprehensive multi-source integration Semi-automated N/A Data review GPT-4o 17,004
MedReason Healthcare and Medical Sciences Clinical dialogue SFT, CoT Text QA with CoT 2025.03 EN Comprehensive multi-source integration Automated N/A N/A N/A 32,682
GEMeX-VQA Healthcare and Medical Sciences X-ray Pre-training, SFT VQA 2025.03 EN Integration of existing datasets Semi-automated N/A Data review OpenBioLLM-70B, GPT-4o 1,601,615
MIMIC-Diff-VQA Healthcare and Medical Sciences X-ray SFT VQA (multi-image) 2025.02 EN Scientific databases Semi-automated 3 Data generation and review ScispaCy 630,633
ICG-CXR Healthcare and Medical Sciences X-ray SFT VQA (multi-image) 2025.03 EN Scientific databases Automated N/A Data generation and review GPT-4 11,439
VL-Health Healthcare and Medical Sciences CT, CFP, MRI, Microscopy, OCT, US, X-ray Pre-training, SFT Image-text, VQA 2025.02 EN, ZH Comprehensive multi-source integration Semi-automated N/A Data review GPT-4o 1,548,847
BIOMEDICA Healthcare and Medical Sciences Academic papers Pre-training Raw text 2025.01 EN Academic and research resources Semi-automated 7 Data review N/A 2,400,000
AfriMed-QA v2 Healthcare and Medical Sciences Clinical dialogue SFT Text QA 2024.11 EN Comprehensive multi-source integration Semi-automated N/A N/A N/A 15,275
GMAI-VL-5.5M Healthcare and Medical Sciences CT, Dermatology, Endoscopy, CFP, Histopathology, MRI, Microscopy, OCT, PET, US, X-ray, etc. SFT VQA, Text QA 2024.11 EN, ZH Comprehensive multi-source integration Semi-automated 5 Data review GPT-4o 5.5M
OphVL Healthcare and Medical Sciences Ophthalmic Surgical Video Pre-training Video-text 2024.11 EN Web and Internet content Automated N/A Data generation and review SurgicBERTa, GPT-4o 375,198
Bora-v1 Healthcare and Medical Sciences Endoscopy, MRI, Microscopy, US SFT Video-text 2024.10 EN Integration of existing datasets Automated N/A Data review N/A 4,897
MedSyn Healthcare and Medical Sciences Clinical documentation Pre-training Raw text 2024.08 RU Academic and research resources Automated N/A N/A GPT-4, Medical Knowledge Graph 41,200
RealMedQA Healthcare and Medical Sciences Biomedical QA SFT Text QA 2024.08 EN Encyclopedias and knowledge bases Semi-automated 6 Data generation and review GPT-3.5-turbo 1,200
MedTrinity-25M Healthcare and Medical Sciences CT, MRI, X-ray, Histopathology, etc. Pre-training Image-text, VQA 2024.08 EN Integration of existing datasets, Scientific databases Automated N/A N/A N/A 25,000,000
MedPix-single Healthcare and Medical Sciences CT, MRI, US, X-ray Pre-training Image-text 2024.07 EN Scientific databases Manual N/A Data generation N/A 59,000
BIMCV-R Healthcare and Medical Sciences CT Pre-training Image-text 2024.07 EN Scientific databases Semi-automated 20+ Data review GPT-4 8,069
MIMIC-Ext-MIMIC-CXR-VQA Healthcare and Medical Sciences X-ray Pre-training, SFT VQA 2024.07 EN Integration of existing datasets Semi-automated 4 Data review GPT-4 377,391
EHRXQA Healthcare and Medical Sciences X-ray Pre-training, SFT VQA 2024.07 EN Integration of existing datasets Semi-automated 4 Data review GPT-4 46,152
CheXpertPlus Healthcare and Medical Sciences X-ray Pre-training Image-text 2024.06 EN Scientific databases Semi-automated 10 Data generation and review CheXbert, Radgraph 223,228
PubMedVision Healthcare and Medical Sciences CT, Endoscopy, CFP, Infrared Reflectance, MRI, Microscopy, OCT, US, X-ray SFT VQA 2024.06 EN Academic and research resources Automated N/A N/A GPT-4, GPT-4V, SentenceBERT 1,294,092
MediQ Healthcare and Medical Sciences EHR SFT Text QA 2024.06 EN Academic and research resources Automated N/A N/A GPT-3.5, LLaMA-3 2,545
HuatuoGPT2-SFT-GPT4-140K Healthcare and Medical Sciences Clinical dialogue SFT Text QA 2024.06 ZH Other sources Automated N/A Data generation and review GPT-4 140,000
Asclepius-Synthetic-Clinical-Notes Healthcare and Medical Sciences EHR SFT Text QA 2024.06 EN Academic and research resources Semi-automated N/A Data generation GPT-3.5 158,114
Know Medical Dialogues Healthcare and Medical Sciences Clinical dialogue SFT Text QA 2024.06 EN Web and Internet content Automated N/A N/A N/A 480
Duvel Healthcare and Medical Sciences Academic papers SFT Classification 2024.05 EN Scientific databases Semi-automated N/A Data generation ALAMBIC 6,553
SkinCAP Healthcare and Medical Sciences Dermatology Pre-training Image-text 2024.05 EN Academic and research resources Semi-automated N/A N/A N/A 4,000
MM-Retinal Healthcare and Medical Sciences CFP, FFA, OCT Pre-training, SFT Image-text 2024.05 EN, ZH Academic and research resources Semi-automated 6 Data review N/A 4,349
M3D-Data (caption) Healthcare and Medical Sciences CT, Clinical reports Pre-training, SFT Image-text, Text QA, VQA 2024.04 EN Scientific databases, Integration of existing datasets Semi-automated N/A Data generation and review GPT-4V 120,092
M3D-Data (instruction) Healthcare and Medical Sciences CT, Clinical reports SFT Image-text, Text QA, VQA 2024.04 EN Scientific databases, Integration of existing datasets Semi-automated N/A Data generation and review GPT-4V 58,180
RadGenome-Chest CT Healthcare and Medical Sciences CT Pre-training, SFT VQA, Image-text 2024.04 EN Academic and research resources Semi-automated N/A Data review SAT, GPT-4, GPT-2 1,965,000
CXR-LLM Healthcare and Medical Sciences X-ray SFT VQA 2024.03 EN Integration of existing datasets Semi-automated N/A Data generation GPT-4 104,892
MedChatZH Healthcare and Medical Sciences Clinical dialogue Pre-training, SFT Text QA 2024.03 ZH Comprehensive multi-source integration Semi-automated N/A Data generation N/A 2,068,823
Mental health chatbot dataset Healthcare and Medical Sciences Clinical dialogue SFT Text QA 2024.02 EN Web and Internet content Automated N/A N/A N/A 172
StatPearls Healthcare and Medical Sciences Academic papers Pre-training Raw text 2024.02 EN Scientific databases Automated N/A N/A N/A 301,202
Quilt-Instruct Healthcare and Medical Sciences Histopathology SFT VQA 2024.02 EN Web and Internet content Semi-automated N/A Data review GPT-4-turbo 107,131
SHADR Healthcare and Medical Sciences EHR SFT Classification 2024.01 EN Scientific databases Semi-automated N/A Data review GPT-3.5 446
RJUA-QA Healthcare and Medical Sciences Diagnosis report, Clinical dialogue SFT Text QA 2023.12 ZH Other sources Manual N/A Data generation and review N/A 1,705
RP3D-DiagDS Healthcare and Medical Sciences CT, MRI, X-ray, US, Fluoroscopy, etc. Pre-training Classification 2023.12 EN Scientific databases Semi-automated N/A Data generation and review Custom crawlers, GPT-4 40,936
PMC-Inline Healthcare and Medical Sciences CT, MRI, PET, US, X-ray Pre-training Image-text 2023.11 EN Academic and research resources Automated N/A N/A N/A 11,000,000
ROCOv2 Healthcare and Medical Sciences CT, MRI, PET, US, X-ray Pre-training, SFT Image-text 2023.11 EN Academic and research resources Semi-automated N/A N/A fastText, MedCAT 80,080
PMC-CaseReport Healthcare and Medical Sciences X-ray SFT Image-text, VQA 2023.11 EN Academic and research resources Automated N/A N/A N/A 1,100,000
MedMD Healthcare and Medical Sciences CT, MRI, PET, US, X-ray Pre-training, SFT Image-text, VQA 2023.11 EN Academic and research resources Semi-automated 8 Data review ChatGPT 16,000,000
Taiyi-Instruction-Data-001 Healthcare and Medical Sciences Diagnosis report, Clinical dialogue, EMR, Academic papers, etc. Pre-training, SFT Text QA 2023.11 EN, ZH Integration of existing datasets Automated N/A Data review N/A 1,114,315
MTS-DIALOG Healthcare and Medical Sciences Clinical dialogue Pre-training Text QA 2023.11 EN Academic and research resources Semi-automated 12 Data generation and review GPT-4o 23,977
MTS-Dialog Healthcare and Medical Sciences Clinical dialogue Pre-training Raw text 2023.11 EN Patent databases Semi-automated 9 Data generation and review OPUS-MT, BART 1,701
Clinical Guidelines Healthcare and Medical Sciences Clinical guidelines Pre-training Text QA with CoT 2023.11 EN Scientific databases Semi-automated N/A Data review S2ORC, GROBID 38,000
INSPECT Healthcare and Medical Sciences CT Pre-training Image-text 2023.11 EN Scientific databases Semi-automated N/A Data review, Data generation Clinical Longformer 23,248
AeroPath Healthcare and Medical Sciences CT Agent Segmentation 2023.11 EN Scientific databases Semi-automated 2 Data review 3D Slicer 27 (CT scans)
MORFITT Healthcare and Medical Sciences Clinical papers Pre-training Classification 2023.11 FR Academic and research resources Manual N/A Data review N/A 3,556
NoteChat Healthcare and Medical Sciences Clinical dialogue SFT Text QA 2023.10 EN Integration of existing datasets Automated N/A N/A N/A 207,000
ChiMed-VL Healthcare and Medical Sciences X-ray, CT, MRI, etc. Pre-training, SFT Image-text, Text QA 2023.10 ZH, EN Integration of existing datasets Automated N/A N/A GPT-3.5 1,049,455
OncQA Healthcare and Medical Sciences Diagnosis report SFT Text QA 2023.10 EN Other sources Manual 6 Data generation and review GPT-4 156
SDOH-NLI Healthcare and Medical Sciences Clinical notes Pre-training Classification 2023.10 EN Integration of existing datasets Manual N/A Data generation N/A 21.1K
CMtMedQA Healthcare and Medical Sciences Clinical dialogue SFT Text QA 2023.08 ZH Other sources Automated N/A Data review N/A 68,000
DISC-Med-SFT Healthcare and Medical Sciences Clinical dialogue SFT Text QA 2023.08 ZH Integration of existing datasets Semi-automated N/A Data review GPT-3.5, GPT-4 470,000
Healix-V1 Healthcare and Medical Sciences Clinical dialogue SFT Text QA 2023.07 EN Comprehensive multi-source integration N/A N/A N/A N/A 796,239
Medical Cord19 Healthcare and Medical Sciences Academic papers Pre-training Raw text 2023.07 EN Academic and research resources Automated N/A N/A N/A 250,000
Pile-PubMed Central Healthcare and Medical Sciences Academic papers Pre-training Raw text 2023.07 EN Academic and research resources Automated N/A Data generation N/A N/A
AGCT Healthcare and Medical Sciences Biomedical knowledge base Pre-training Raw text 2023.07 EN, FR Scientific databases Automated N/A N/A Custom generation 421,216
Synthetic CSAW 100k Mammograms Healthcare and Medical Sciences Mammography SFT Image-text 2023.07 EN Scientific databases Automated N/A N/A Diffusion Model 100K
Quilt-1M Healthcare and Medical Sciences Histopathology Pre-training, SFT Image-text 2023.06 EN Academic and research resources, Web and Internet content, Other sources Automated N/A N/A N/A 1,000,000
LLaVA-Med Healthcare and Medical Sciences CT, Histopathology, MRI, Microscopy, PET, US, X-ray Pre-training, SFT VQA, Image-text 2023.06 EN Comprehensive multi-source integration Automated N/A N/A GPT-4 630,000
ShenNong-TCM-Dataset Healthcare and Medical Sciences Clinical dialogue SFT, CoT Text QA 2023.06 ZH Comprehensive multi-source integration Automated N/A Data generation ChatGPT 113,000
PMC-VQA Healthcare and Medical Sciences CT, CFP, Histopathology, MRI, Microscopy, US, X-ray SFT VQA 2023.05 EN Academic and research resources Automated N/A N/A N/A 226,946
ChatMed-Consult-Dataset Healthcare and Medical Sciences Clinical dialogue SFT Text QA 2023.05 ZH Web and Internet content Automated N/A Data generation GPT-3.5-Turbo 549,000
QiZhenGPT-20k Healthcare and Medical Sciences Clinical dialogue SFT Text QA 2023.05 ZH Other sources Automated N/A Data generation N/A 20,000
Huatuo-26M Healthcare and Medical Sciences Biomedical QA Pre-training, SFT Text QA 2023.05 EN Encyclopedias and knowledge bases Semi-automated N/A Data review Bert, T5 26,000,000
Huatuo26M-Lite Healthcare and Medical Sciences Clinical dialogue, Diagnosis report Pre-training, SFT Text QA 2023.05 ZH Web and Internet content Semi-automated N/A Data review ChatGPT 177,703
Visual Med-Alpaca Healthcare and Medical Sciences CT, CFP, Histopathology, MRI, Microscopy, US, X-ray SFT VQA 2023.04 EN Scientific databases Automated N/A N/A GPT-3.5 54,000
MedAlpaca Healthcare and Medical Sciences Clinical dialogue, Academic papers Pre-training, SFT Raw text, Text QA 2023.04 EN Comprehensive multi-source integration Automated N/A Data generation and review N/A 860,076
Med-ChatGLM Healthcare and Medical Sciences Biomedical knowledge base SFT Text QA 2023.04 ZH Integration of existing datasets Automated N/A Data generation GPT-3.5 7,622
PMC-OA Healthcare and Medical Sciences CT, Dermatology, Endoscopy, Histopathology, Microscopy, MRI, OCT, PET, X-ray Pre-training Image-text 2023.03 EN Academic and research resources Automated N/A Data generation and review ResNet101 (DocFigure), ResNet34 (DETR MedICaT), PMC-CLIP 1,646,592
ChatDoctor Healthcare and Medical Sciences Clinical dialogue SFT Text QA 2023.03 EN Other sources Semi-automated N/A Data generation and review N/A 115,000
WikiMedQA Healthcare and Medical Sciences Clinical Reports SFT Text QA 2023.03 EN Web and Internet content Semi-automated N/A N/A SentenceBERT, BioLinkBERT 111,895
MIMIC-IV Healthcare and Medical Sciences EHR Pre-training, SFT Raw text 2023.01 EN Scientific databases Semi-automated N/A N/A Transformer-DeID 364,627
BioRED Healthcare and Medical Sciences Academic papers Pre-training Classification 2022.09 EN Scientific databases Semi-automated 6 Data generation and review PubTator 500
ViHealthQA Healthcare and Medical Sciences Biomedical QA SFT Text QA 2022.06 VI Social media and forums Manual N/A Data generation N/A 10,015
MedMCQA Healthcare and Medical Sciences Medical exams SFT Text QA 2022.03 EN Books and literary works Automated N/A Data generation N/A 193,155
PMC-Patients-ReCDS Healthcare and Medical Sciences Clinical dialogue SFT Text QA 2022.02 EN Academic and research resources Automated N/A N/A N/A 293,000
PMC-Patients Healthcare and Medical Sciences Clinical report Pre-training Raw text 2022.02 EN Scientific databases Semi-automated N/A Data review PubMedBERT, BioLinkBERT 167,000
CMCQA Healthcare and Medical Sciences Clinical dialogue SFT Text QA 2022.01 ZH Web and Internet content Automated N/A Data review N/A 1,294,753
IMCS-V2 Healthcare and Medical Sciences Clinical dialogue SFT Text QA 2022.01 ZH Other sources Manual N/A Data generation and review N/A 4,116
MLEC-QA Healthcare and Medical Sciences Biomedical QA SFT Raw text 2021.11 ZH Academic and research resources Semi-automated N/A Data generation and review N/A 136,236
ImageClef-VQA Med 2021 Healthcare and Medical Sciences CT, MRI, US, X-ray SFT VQA 2021.09 EN Academic and research resources Automated N/A N/A N/A 4,500
BioLeaflets Healthcare and Medical Sciences Package leaflets Pre-training Raw text 2021.09 EN Web and Internet content Semi-automated N/A Data generation Stanza, Amazon Comprehend Medical 1,067
MedGPT-5k-ko Healthcare and Medical Sciences Clinical trials, EHR, Medical forum, Medical textbooks SFT Classification, Text QA 2021.06 ZH Scientific databases, Books and literary works, Web and Internet content, Comprehensive multi-source integration Manual 3 Data generation and review N/A 149,141
CBLUE Healthcare and Medical Sciences Clinical trials, EHR, Medical forum, Medical textbooks SFT Classification, Text QA 2021.06 ZH Scientific databases, Books and literary works, Web and Internet content, Comprehensive multi-source integration Manual 3 Data generation and review N/A 149,141
MedDG Healthcare and Medical Sciences Clinical dialogue SFT Text QA 2021.05 ZH Web and Internet content Automated N/A Data generation and review N/A 100,000
SLAKE Healthcare and Medical Sciences CT, MRI, X-ray SFT VQA 2021.02 EN, ZH Academic and research resources Automated N/A N/A N/A 11,958
Chinese-medical-dialogue-data Healthcare and Medical Sciences Clinical dialogue SFT Text QA 2021.02 ZH Other sources N/A N/A N/A N/A 792,099
DeepEyeNet Healthcare and Medical Sciences CFP, FFA Pre-training Image-text 2021.01 EN Scientific databases Manual N/A Data generation N/A 15,709
AIforCOVID Healthcare and Medical Sciences X-ray Pre-training, SFT Image-text 2020.12 EN Scientific databases Manual N/A Data generation N/A 820
MedICaT Healthcare and Medical Sciences CT, Endoscopy, Histopathology, MRI, Microscopy, PET, US, X-ray Pre-training, SFT Image-text 2020.10 EN Academic and research resources Semi-automated 7 Data generation ResNet101-DocFigure, ScispaCy 217,060
ImageClef-VQA Med 2020 Healthcare and Medical Sciences CT, MRI, US, X-ray SFT VQA 2020.09 EN Academic and research resources Automated N/A N/A N/A 4,000
MedQA Healthcare and Medical Sciences Medical exams SFT Text QA 2020.09 EN, ZH Scientific databases Manual N/A Data generation and review N/A 61,097
MedDialog-CN Healthcare and Medical Sciences Clinical dialogue SFT Text QA 2020.07 ZH Web and Internet content Automated N/A Data review N/A 1,100,000
MEDIQA-AnS Healthcare and Medical Sciences Consumer health QA SFT Text QA 2020.05 EN, ZH Web and Internet content Semi-automated 2 Data generation Custom crawlers 156
MedDialog Healthcare and Medical Sciences Clinical dialogue SFT Text QA 2020.04 EN, ZH Web and Internet content Automated N/A N/A Custom crawlers 14,668,058
PathVQA Healthcare and Medical Sciences Histopathology SFT VQA 2020.03 EN Academic and research resources Automated N/A N/A N/A 19,654
RetinaRocks Healthcare and Medical Sciences CFP Pre-training, SFT Image-text 2019.12 EN Other sources Manual N/A Data generation N/A 4,000
MedQuAD Healthcare and Medical Sciences Clinical dialogue SFT Text QA 2019.10 EN Web and Internet content Automated N/A N/A Custom crawlers 47,441
ImageClef-VQA Med 2019 Healthcare and Medical Sciences CT, MRI, US, X-ray SFT VQA 2019.09 EN Academic and research resources Automated N/A N/A N/A 15,292
PubMedQA Healthcare and Medical Sciences Academic papers SFT Text QA 2019.09 EN Web and Internet content Semi-automated N/A Data generation and review N/A 212,300
PubMedQA instruction Healthcare and Medical Sciences Academic papers SFT Text QA 2019.09 EN Academic and research resources Manual N/A Data generation N/A 1K
MIMIC-CXR Healthcare and Medical Sciences X-ray Pre-training Image-text 2019.08 EN Scientific databases Manual N/A Data generation N/A 227,835
MIMIC-Extract Healthcare and Medical Sciences EHR Pre-training Text QA 2019.07 EN Scientific databases Automated N/A N/A N/A 2,000,000
webMedQA Healthcare and Medical Sciences Clinical dialogue SFT Text QA 2019.03 ZH Web and Internet content Automated N/A Data review N/A 63,284
VQA-RAD Healthcare and Medical Sciences CT, MRI, PET, US, X-ray SFT VQA 2018.11 EN Academic and research resources Manual N/A Data generation N/A 1,793
cMedQA2 Healthcare and Medical Sciences Clinical dialogue SFT Text QA 2018.11 ZH Web and Internet content Automated N/A Data review N/A 108,000
ROCO Healthcare and Medical Sciences CT, MRI, PET, US, X-ray Pre-training, SFT Image-text 2018.09 EN Academic and research resources Automated N/A N/A N/A 81,000
emrQA Healthcare and Medical Sciences Clinical dialogue SFT Text QA 2018.09 EN Integration of existing datasets Semi-automated N/A N/A N/A 455,000
ImageClef-VQA Med 2018 Healthcare and Medical Sciences CT, MRI, US, Unknown, X-ray SFT VQA 2018.06 EN Academic and research resources Automated N/A N/A N/A 6,413
LiveQA Healthcare and Medical Sciences Consumer health QA SFT Text QA 2018.02 EN Scientific databases Automated N/A Data review N/A 634
LiveQA trec2017 Healthcare and Medical Sciences Clinical dialogue SFT Text QA 2017.08 EN Academic and research resources Semi-automated N/A Data review N/A 634
OpenI Healthcare and Medical Sciences X-ray Pre-training Image-text 2016.03 EN Scientific databases Manual N/A Data generation N/A 3,955
Retina Image Bank Healthcare and Medical Sciences CFP, FFA Pre-training, SFT Image-text 2012.08 EN Other sources Manual N/A Data generation N/A 30,452
William Hoyt ImageText Healthcare and Medical Sciences CFP Pre-training Image-text 2004.03 EN Scientific databases Manual N/A Data generation N/A 856
Pima Healthcare and Medical Sciences EHR SFT Classification 1988.11 EN Scientific databases Manual N/A N/A N/A 691
COVID-19-Data-Hub Healthcare and Medical Sciences Global pandemic data (cases, vaccines, policies, etc.) Pre-training, RAG Classification, Regression 2020.07 EN Comprehensive multi-source integration Automated N/A N/A R package N/A
BEACON Molecular and Cellular Biology RNA sequence SFT Raw text 2024.06 EN Comprehensive multi-source integration Semi-automated N/A Data generation and review N/A 870,883
SPICE Molecular and Cellular Biology SMILES Pre-training, RAG Classification, Regression 2024.03 EN Scientific databases Semi-automated N/A Data generation and review N/A 113,999
PubChemSTM Molecular and Cellular Biology SMILES Pre-training, SFT Raw text 2024.01 EN Academic and research resources Semi-automated N/A Data generation SciBERT, spaCy 281,000
SourceData Molecular and Cellular Biology Academic papers Pre-training VQA 2023.10 EN Academic and research resources Semi-automated N/A Data review PubMedBERT, BioLinkBERT, GPT-4o 62,543
Mol-Instructions Molecular and Cellular Biology Biomolecular instructions SFT Text QA 2023.06 EN Comprehensive multi-source integration Automated N/A Data review GPT-3.5 2,043,000
PCdes Molecular and Cellular Biology SMILES Pre-training, SFT Raw text 2022.12 EN Academic and research resources Automated N/A N/A Custom crawlers 12,000
MoMu Molecular and Cellular Biology Graph Pre-training, SFT Raw text 2022.12 EN Academic and research resources Automated N/A N/A OGB 15,613
PEER Molecular and Cellular Biology Protein sequence SFT Classification, Regression 2022.10 EN Comprehensive multi-source integration Semi-automated N/A Data generation and review N/A 329,922
BioGPT Molecular and Cellular Biology, Healthcare and Medical Sciences Biomedical domain pretraining corpus Pre-training, SFT Raw text 2022.08 EN Scientific databases, Academic and research resources Automated N/A N/A Moses tokenizer, fastBPE 15M
DISEASES Molecular and Cellular Biology, Healthcare and Medical Sciences, Multi-omics Disease-gene associations SFT, RAG Classification 2015.01 EN Academic and research resources, Integration of existing datasets, Scientific databases Semi-automated N/A Data generation and review NER tagger 8,336,442
BioReason Molecular and Cellular Biology, Multi-omics DNA sequence, KEGG pathways, Gene variants SFT, CoT Text QA with CoT 2025.05 EN Scientific databases, Academic and research resources Semi-automated N/A N/A Custom scripts 87,620
GeneChat Multi-omics Nucleotide sequence Pre-training Text QA 2025.06 EN Scientific databases N/A N/A Data generation N/A 47,275
Genomics instructions Multi-omics Nucleotide sequence SFT Text QA 2025.04 EN Academic and research resources N/A N/A Data generation N/A 4,954,234
scMMGPT data Multi-omics scRNA-seq Pre-training, SFT scRNA-seq-text 2025.03 EN Academic and research resources Automated N/A N/A N/A 467K
OPI Multi-omics Protein SFT Text QA 2025.03 EN Scientific databases Semi-automated N/A Data generation GPT-3.5 1,640,000
OpenGenome2 Multi-omics Nucleotide sequence Pre-training Raw text 2025.02 EN Integration of existing datasets N/A N/A N/A N/A 8,800B (nucleotides)
Seq2Func Multi-omics Nucleotide sequence SFT Text QA 2025.02 EN Scientific databases Automated N/A Data generation N/A 297,000
DNA2Image Multi-omics Nucleotide sequence SFT Generation 2025.02 EN Scientific databases Automated N/A Data generation N/A 43,200
LLaMA-Gene (protein) Multi-omics Protein sequence Pre-training, SFT Text QA 2024.12 EN Scientific databases N/A N/A Data generation N/A 62,918
LLaMA-Gene (DNA) Multi-omics DNA sequence Pre-training, SFT Text QA 2024.12 EN Scientific databases N/A N/A Data generation N/A 178,551
OpenGenome Multi-omics Nucleotide sequence Pre-training Raw text 2024.11 EN Integration of existing datasets N/A N/A N/A N/A 300B (nucleotides)
The 1000G Multi-omics Nucleotide sequence Pre-training Raw text 2024.10 EN Scientific databases N/A N/A N/A N/A 20,500B (nucleotides)
Multispecies dataset Multi-omics Nucleotide sequence Pre-training Raw text 2024.10 EN Scientific databases N/A N/A N/A N/A 174B (nucleotides)
NT Benchmark Multi-omics Nucleotide sequence SFT Classification 2024.10 EN Academic and research resources N/A N/A N/A N/A 493,242
ProteinLMDataset Multi-omics Protein sequence Pre-training Raw text 2024.06 EN Academic and research resources Automated N/A N/A N/A 893,000
RNAcentral Multi-omics RNA sequence Pre-training Raw text 2024.05 EN Scientific databases N/A N/A N/A N/A 23M
RNA-QA Multi-omics RNA sequence SFT Text QA 2024.05 EN Academic and research resources Automated N/A N/A GPT-4o 407,616
ProCoT Multi-omics Biomedical QA SFT, CoT Text QA with CoT 2024.05 EN Scientific databases, Academic and research resources Semi-automated N/A Data generation and review embedding-based filtering 4,967,723
UniProtKB/Swiss-Prot Multi-omics Protein sequence Pre-training Raw text 2023.11 EN Scientific databases N/A N/A N/A N/A 570K
Multi-species genome Multi-omics Nucleotide sequence Pre-training Raw text 2023.06 EN Integration of existing datasets N/A N/A N/A N/A 32.49B (nucleotides)
Genomic Benchmark Multi-omics Nucleotide sequence SFT Classification 2023.05 EN Academic and research resources N/A N/A N/A N/A 699,116
CELLxGENE scRNA-seq Collection Multi-omics scRNA-seq Pre-training Gene Expression-pretrain 2023.05 EN Scientific databases N/A N/A N/A N/A 33M
Human Pancreas Multi-omics scRNA-seq SFT Classification 2023.01 EN Academic and research resources N/A N/A N/A N/A 10,600
scFoundation Dataset Multi-omics scRNA-seq Pre-training Gene Expression-pretrain 2022.10 EN Scientific databases N/A N/A N/A N/A 50M
Human genome Multi-omics Nucleotide sequence Pre-training Raw text 2021.02 EN Scientific databases N/A N/A N/A N/A 2.75B (nucleotides)
GPD Multi-omics Nucleotide sequence Pre-training Raw text 2021.02 EN Scientific databases N/A N/A N/A N/A 142,809
Myeloid Multi-omics scRNA-seq SFT Classification 2021.02 EN Academic and research resources N/A N/A N/A N/A 9,748
Human Cell Atlas Dataset Multi-omics scRNA-seq SFT Classification 2021.02 EN Academic and research resources N/A N/A N/A N/A 84,363
GVD Multi-omics Nucleotide sequence Pre-training Raw text 2019.07 EN Scientific databases N/A N/A N/A N/A 13,203B
Multiple Sclerosis Multi-omics scRNA-seq SFT Classification 2019.07 EN Academic and research resources N/A N/A N/A N/A 7,844
PanglaoDB Multi-omics scRNA-seq Pre-training Gene Expression-pretrain 2018.11 EN Scientific databases N/A N/A N/A N/A 1,126,580
Zheng68k Multi-omics scRNA-seq SFT Classification 2016.07 EN Academic and research resources N/A N/A N/A N/A 68,450
GRCh38/hg38 Multi-omics Nucleotide sequence Pre-training Raw text 2013.12 EN Scientific databases N/A N/A N/A N/A 3.1B (nucleotides)
Biology-Instructions Multi-omics DNA, RNA, Protein sequence SFT Text QA 2024.12 EN Academic and research resources Semi-automated N/A Data generation GPT-4o, Claude-3.5-sonnet 3.3M
TCPA Multi-omics Protein sequence Pre-training Raw text 2013.09 EN Academic and research resources N/A N/A N/A N/A 4,379
NCBI-GenBank Multi-omics Nucleotide sequence Pre-training Raw text 2012.11 EN Scientific databases N/A N/A N/A N/A 5,000B (nucleotides)
GRCh37/hg19 Multi-omics Nucleotide sequence Pre-training Raw text 2009.02 EN Scientific databases N/A N/A N/A N/A 3.1B (nucleotides)
Neuro-3D Neuroscience EEG Pre-training, SFT Classification 2025.03 EN Academic and research resources Semi-automated N/A N/A N/A 720
Things-MEG Neuroscience MEG Pre-training, SFT Classification 2023.04 EN Academic and research resources Semi-automated N/A N/A N/A 22,248
Things-EEG2 Neuroscience EEG Pre-training, SFT Classification 2022.11 EN Academic and research resources Semi-automated N/A N/A N/A 16,740
SHU Neuroscience EEG Pre-training, SFT Classification 2022.08 EN Academic and research resources Semi-automated N/A N/A N/A 11,988
Things-fMRI Neuroscience fMRI Pre-training, SFT Classification 2022.07 EN Academic and research resources Semi-automated N/A N/A N/A 8,740
NSD-Imagery Neuroscience fMRI Pre-training, SFT Classification 2022.07 EN Academic and research resources Semi-automated N/A N/A N/A 2,304
HMC Neuroscience EEG Pre-training, SFT Classification 2022.03 EN Academic and research resources Semi-automated N/A N/A N/A 154
Things-EEG1 Neuroscience EEG Pre-training, SFT Classification 2022.01 EN Academic and research resources Semi-automated N/A N/A N/A 22,248
NSD Neuroscience fMRI Pre-training, SFT Classification 2021.09 EN Academic and research resources Semi-automated N/A N/A N/A 70,566
ZuCo2 Neuroscience EEG Pre-training, SFT Text QA 2019.11 EN Academic and research resources Semi-automated N/A N/A N/A 739
DIR Neuroscience fMRI Pre-training, SFT Classification 2019.01 EN Academic and research resources Semi-automated N/A N/A N/A 6,000
Workload Neuroscience EEG Pre-training, SFT Classification 2018.12 EN Academic and research resources Semi-automated N/A N/A N/A 1,080
ZuCo1 Neuroscience EEG Pre-training, SFT Text QA 2018.11 EN Academic and research resources Semi-automated N/A N/A N/A 1,107
SEED-IV Neuroscience EEG Pre-training, SFT Classification 2018.02 EN Academic and research resources Semi-automated N/A N/A N/A 143,610
TUSL Neuroscience EEG Pre-training, SFT Classification 2018.01 EN Academic and research resources Semi-automated N/A N/A N/A 245
TUEV Neuroscience EEG Pre-training, SFT Classification 2015.12 EN Academic and research resources Semi-automated N/A N/A N/A 112,237
TUAB Neuroscience EEG Pre-training, SFT Classification 2015.12 EN Academic and research resources Semi-automated N/A N/A N/A 409,083
SEED Neuroscience EEG Pre-training, SFT Classification 2015.05 EN Academic and research resources Semi-automated N/A N/A N/A 144,851
Sleep-EDF Neuroscience EEG Pre-training, SFT Classification 2013.10 EN Academic and research resources Semi-automated N/A N/A N/A 197
SHHS Neuroscience EEG Pre-training, SFT Classification 1998.01 EN Academic and research resources Semi-automated N/A N/A N/A 6,441
repoDB Pharmacy, Healthcare and Medical Sciences Drug-disease relationships, Clinical trials RAG Classification, Text QA 2017.03 EN Scientific databases Automated N/A N/A scripts 15,648

⚗️ Chemistry


Dataset Domain Modality Purpose Type Release Language Source Annotation Pipeline Human Annotators Human Tasks Auto-annotation Tools Size
MOSES Biochemistry SMILES Pre-training Raw text 2020.07 EN Academic and research resources Automated N/A Data generation and review N/A 1,936,962
ChemBL Biochemistry SMILES Pre-training Raw text 2012.01 EN Academic and research resources Automated N/A Data generation and review N/A 1,961,462
ChemRxivQuest General Chemistry Academic papers Pre-training, SFT Text QA 2025.05 EN Academic and research resources Manual N/A Data generation and review N/A 970
ScholarChemQA General Chemistry Academic papers Pre-training, SFT Text QA 2025.02 EN Academic and research resources Manual N/A Data generation and review N/A 40K
SMolInstruct General Chemistry SMILES SFT Text QA 2024.08 EN Scientific databases Semi-automated N/A Data generation and review GPT-4 3.3M
ChemNLP General Chemistry Text Pre-training, SFT Classification 2023.01 EN Academic and research resources Manual N/A Data generation and review N/A 110,342
PMO General Chemistry SMILES Pre-training, SFT Raw text 2022.05 EN Academic and research resources Automated N/A Data generation and review N/A 10K
ZINC General Chemistry SMILES Pre-training Raw text 2012.10 EN Academic and research resources Automated N/A Data generation and review N/A 250K
DeepProtein Pharmacy Protein sequence, SMILES Pre-training, SFT Raw text 2025.05 EN Academic and research resources Automated N/A Data generation and review N/A 78K
TrialBench Pharmacy SMILES, Disease code Pre-training, SFT Raw text 2024.09 EN Academic and research resources Automated N/A Data generation and review N/A 470K
TDC2 Pharmacy SMILES, Protein sequence, Genome sequence Pre-training, SFT Classification, Regression, Generation 2024.09 EN Academic and research resources Manual N/A Data generation and review N/A 3.4B (tokens)
SBDDBench Pharmacy Text, Protein sequence, SMILES Pre-training, SFT Protein-ligand 2022.06 EN Academic and research resources Automated N/A Data generation and review N/A 5K
TOP Pharmacy SMILES Pre-training, SFT Raw text 2022.02 EN Academic and research resources Automated N/A Data generation and review N/A 12K
TDC Pharmacy SMILES, Protein sequence, Genome sequence Pre-training, SFT Classification, Regression, Generation 2021.06 EN Academic and research resources Manual N/A Data generation and review N/A 0.2B (tokens)
DeepPurpose Pharmacy Protein sequence, SMILES Pre-training, SFT Raw text 2020.12 EN Academic and research resources Automated N/A Data generation and review N/A 5,074
DrugBank Pharmacy SMILES Pre-training, SFT Raw text 2018.01 EN Academic and research resources Manual N/A Data generation and review N/A 18K
DrugCentral Pharmacy SMILES Pre-training, SFT Raw text 2017.01 EN Academic and research resources Automated N/A Data generation and review N/A 4,995
USPTO Synthetic Chemistry SMILES Pre-training, SFT Generation 2015.07 EN Patent databases Manual N/A Data generation and review N/A 1,939,253

⚛️ Physics


| Dataset | Domain | Modality | Purpose | Type | Release | Language | Source | Annotation Pipeline | Human Annotators | Human Tasks | Auto-annotation Tools | Size |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MM-PhyQA | General Physics | High-school exams | SFT, CoT | VQA with CoT | 2024.04 | EN | Web and Internet content | Manual | N/A | Data generation and review | AFL 3.0 | 3,825 |
| PIQA | General Physics | Text | SFT | Text QA | 2020.01 | EN | Other sources | Semi-automated | AFLite | Data generation and review | N/A | 19,838 |

🌌 Astronomy


Dataset Domain Modality Purpose Type Release Language Source Annotation Pipeline Human Annotators Human Tasks Auto-annotation Tools Size
AstroLLaVA Astronomy General dialog, Astronomical images SFT VQA 2025.04 EN Comprehensive multi-source integration Semi-automated N/A Data review GPT-4 29,783
AstroPT Astronomy Astronomical images Pre-training Regression 2024.05 EN Web and Internet content, Scientific databases Automated N/A Data review DESI Legacy Survey API 8.6M (tokens)
Astro-NER Astronomy Academic papers SFT Text QA 2024.05 EN Academic and research resources Semi-automated 4 Data generation and review GPT-3.5 5,000
AstroLLaMA-chat Astronomy Academic papers SFT Text QA 2024.01 EN Academic and research resources Manual N/A Data review N/A 10,356
AstroLLaMA Astronomy Academic papers SFT Text QA 2023.09 EN Academic and research resources, Web and Internet content Manual N/A Data review N/A 9.5M
ATel Astronomy Academic papers SFT Text QA 2023.05 EN Academic and research resources Manual N/A Data review N/A 234
AstroBERT Astronomy Academic papers Pre-training Raw text 2022.11 EN Academic and research resources Automated 12 Data generation and review N/A 3.8B (tokens)
AstroMLab 4 Astronomy Academic papers SFT Text QA 2025.05 EN Integration of existing datasets Automated N/A Data generation and review Gemini-1.5-Pro 250,000 arXiv preprints
AstroMLab 3 Astronomy Academic papers SFT Text QA 2025.04 EN Academic and research resources Automated N/A Data generation and review Gemini-1.5-Pro 3.3B (tokens)
AstroMLab 2 Astronomy Academic papers SFT Text QA 2024.09 EN Academic and research resources Automated N/A Data generation and review Gemini-1.5-Pro 10,356
Starwhisper-pulsar Astrophysics Text, pulsar diagnostic plots, pulsar signals SFT Classification 2024.04 EN Integration of existing datasets Manual N/A Data generation and review DeepSeek-VL-7B, InternVL2-40B 106,674
PAPERCLIP Astrophysics synthetic conversation text, Astronomical images SFT Image-text 2024.03 EN Academic and research resources Automated N/A Data review Mixtral-8x7B-Instruct 31,859

🪨 Materials Science


Dataset Domain Modality Purpose Type Release Language Source Annotation Pipeline Human Annotators Human Tasks Auto-annotation Tools Size
ChEBI-20-MM Materials Science InChI, IUPAC, SELFIES, Molecular image SFT Text QA 2025.01 EN Integration of existing datasets Manual N/A Data generation and review N/A 29,706
Materials Project Trajectory Materials Science CIF Pre-training Raw text 2023.07 EN Scientific databases Manual N/A Data generation and review N/A 1,580,395
DigiMOF Materials Science CIF Pre-training Raw text 2023.05 EN Academic and research resources Manual N/A Data generation and review N/A 15,501
Novel Materials Discovery (NOMAD) Materials Science CIF Pre-training Raw text 2023.03 EN Scientific databases Manual N/A Data generation and review N/A 4,341,443
MOFX-DB (hMOF) Materials Science CIF Pre-training Raw text 2023.02 EN Integration of existing datasets Manual N/A Data generation and review N/A 160,000
MatScholar Materials Science Academic papers Pre-training Raw text 2022.07 EN Academic and research resources Manual N/A Data generation and review N/A 5M
Pfeiffer et al. Chemical composition Materials Science Chemical Composition Pre-training Raw text 2022.03 EN Comprehensive multi-source integration Manual N/A Data generation and review N/A 14,884
Pfeiffer et al. Mechanical Properties Materials Science Numerical property Pre-training Raw text 2022.03 EN Comprehensive multi-source integration Manual N/A Data generation and review N/A 1,278
ChEBI-20 Materials Science Scientific instruction SFT Text QA 2021.11 EN Integration of existing datasets Manual N/A Data generation and review N/A 29,709
ZINC Materials Science SMILES Pre-training Raw text 2020.12 EN Scientific databases Manual N/A Data generation and review N/A 230M
JARVIS-DFT Materials Science InChI, IUPAC, SELFIES Pre-training Raw text 2020.11 EN Integration of existing datasets Manual N/A Data generation and review N/A 41,000
MOSES Materials Science SMILES SFT Text QA 2020.11 EN Integration of existing datasets Manual N/A Data generation and review N/A 1.6M
QMOF Materials Science CIF Pre-training Raw text 2020.05 EN Academic and research resources Manual N/A Data generation and review N/A 20,000
Warwick Electron Microscopy Datasets Materials Science STEM image, TEM image, TEM exit wavefunction SFT VQA 2020.05 EN Academic and research resources Manual N/A Data generation and review N/A 135,395
CoRE MOF 2019 Materials Science CIF Pre-training Raw text 2019.12 EN Integration of existing datasets Manual N/A Data generation and review N/A 14,000
Inorganic Crystal Structure Database (ICSD) Materials Science CIF Pre-training Raw text 2019.10 EN Scientific databases Manual N/A Data generation and review N/A 318,901
US Patent Office (USPTO) Materials Science SMILES Pre-training Raw text 2017.06 EN Patent databases Manual N/A Data generation and review N/A 2,830,616
Open Quantum Materials Database (OQMD) Materials Science CIF Pre-training Raw text 2014.11 EN Scientific databases Manual N/A Data generation and review N/A 1,317,811
Materials Project Materials Science CIF Pre-training Raw text 2013.07 EN Scientific databases Manual N/A Data generation and review N/A 577,813

🌍 Earth Science


Dataset Domain Modality Purpose Type Release Language Source Annotation Pipeline Human Annotators Human Tasks Auto-annotation Tools Size
WeatherQA Atmosphere Remote sensing, Science QA SFT VQA 2024.06 EN Scientific databases Semi-automated 4 Data review GPT-4 8,511
SeafloorAI Hydrosphere Sonar images, Text SFT VQA 2024.11 EN Scientific databases Semi-automated 4 Data review GPT-4 7M
TEOChatlas Lithosphere Remote sensing SFT VQA 2025.01 EN Scientific databases Automated N/A Data generation N/A 554K
EarthVQA Lithosphere Remote sensing, Science QA SFT VQA 2023.12 EN Scientific databases Automated N/A Data generation ArcGIS toolbox 208K
Geochat Lithosphere Remote sensing, Science QA SFT VQA 2023.11 EN Scientific databases Automated N/A Data generation Vicuna-v1.5 306K
FloodNet Lithosphere Remote sensing, Science QA SFT VQA 2021.05 EN Scientific databases Manual N/A Data generation and review N/A 11K
GeoSignal Lithosphere, Hydrosphere, Atmosphere Remote sensing, Science QA SFT Text QA 2023.06 EN Encyclopedias and knowledge bases, Academic and research resources, Scientific databases, Comprehensive multi-source integration Semi-automated 10 Data review GPT-4 39,749
GeoLLaVA-8k Remote Sensing Remote sensing SFT Image-text, VQA 2025.05 EN Academic and research resources Semi-automated 35 Data generation and review GPT-4o 81,367
EVAttrs-95k Remote Sensing Remote sensing, Object property SFT Image-text, VQA 2025.03 EN Academic and research resources Semi-automated N/A Data generation and review Qwen2-VL-72B, GPT-4o 95.1K
VersaD Remote Sensing Remote sensing Pre-training Image-text 2024.11 EN Academic and research resources Automated N/A N/A Gemini-Vision 1.4M
RSVP Remote Sensing Remote sensing SFT Image-text, VQA 2024.10 EN Integration of existing datasets, Academic and research resources Automated N/A N/A GPT-4V, DINOv2-ViT L/14, CLIP-ConvNeXt 3.65M
FIT-RS Remote Sensing Remote sensing, Relation graph, etc. SFT Image-text, VQA 2024.07 EN Integration of existing datasets, Academic and research resources Semi-automated N/A Data generation TinyLLaVA-3.1B, GPT-4, GPT-3.5, CLIP-ViT-L14 1,415K
VRSBench Remote Sensing Remote sensing SFT Image-text, VQA 2024.06 EN Academic and research resources Semi-automated N/A Data review GPT-4V 142,390
MMRS-1M Remote Sensing Remote sensing, Optical, SAR, Infrared, etc. SFT Image-text, VQA 2024.03 EN Integration of existing datasets Automated N/A N/A N/A 1.06M
ChatEarthNet Remote Sensing Remote sensing, Optical, Multi-band SFT Image-text 2024.02 EN Scientific databases Semi-automated N/A Data review GPT-3.5, GPT-4V 173,488
LHRS-Align Remote Sensing Remote sensing Pre-training Image-text 2024.02 EN Scientific databases Automated N/A N/A Vicuna-v1.5-13B 1.15M
LHRS-Instruct Remote Sensing Remote sensing SFT Image-text, VQA 2024.02 EN Integration of existing datasets, Academic and research resources Semi-automated N/A Data review Vicuna-v1.5-13B, GPT-4 12K
RS5M Remote Sensing Remote sensing Pre-training Image-text 2024.01 EN Scientific databases Automated N/A N/A CLIP 5.07M
SkyEye-968k Remote Sensing Remote sensing SFT Image-text, Video-text, VQA 2024.01 EN Integration of existing datasets Semi-automated N/A Data review N/A 968K
SkyScript Remote Sensing Remote sensing Pre-training Image-text 2023.12 EN Academic and research resources, Scientific databases Automated N/A N/A CLIP, Logistic Regression model 2.6M
RSICap Remote Sensing Remote sensing SFT Image-text 2023.07 EN Academic and research resources Manual 5 Data generation, Data review N/A 2.5K

🔭 General Science


| Dataset | Domain | Modality | Purpose | Type | Release | Language | Source | Annotation Pipeline | Human Annotators | Human Tasks | Auto-annotation Tools | Size |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| NaturalReasoning | Multidisciplinary (incl. Physics) | Text | SFT | Text QA with CoT | 2025.02 | EN | Web and Internet content, Books and literary works, Academic and research resources | Semi-automated | N/A | Data review | LLaMA-70B | 2.8M |
| Nemotron-Science | Multidisciplinary (incl. Physics) | Text with formulae and code | SFT, RLHF | Text QA with CoT | 2025.05 | EN | Social media and forums, Academic and research resources, Books and literary works | Semi-automated | N/A | Data review | DeepSeek-R1 | 2.7M |
| Galactica | Multidisciplinary (incl. Chemistry) | Text (incl. formulas, code) | Pre-training | Raw text | 2022.11 | EN | Webpages | Fully-automated | N/A | Data generation and review | Custom crawlers, PDF parsers | 106B tokens |
| SciBERT | Multidisciplinary (incl. Physics) | Academic papers | Pre-training | Raw text | 2019.09 | EN | Academic and research resources | Automated | N/A | Data generation | Crawlers, text processing tools | 3.3B (tokens) |
| ArXivCap | Physics, Biology, etc. | Paper figures | Pre-training | Image-text | 2024.05 | EN | Academic and research resources | Semi-automated | 7 | Data review | PDF parsers | 6.4M |
| SCP-116K | Physics, Chemistry, Biology, etc. | Text with formulae | SFT | Text QA with CoT | 2025.01 | EN | Academic and research resources, Books and literary works | Semi-automated | N/A | Data review | PDF parsers, OCR, LaTeX rendering | 116.8K |
| MegaScience | Medicine, Physics, Chemistry, Biology | Science textbooks | SFT | Text QA with CoT | 2024.08 | EN | Web and Internet content, Books and literary works, Integration of existing datasets | Semi-automated | N/A | Data review | Llama3.3-70B-Instruct, DeepSeek-V3, BGE-large-en-v1.5 | 651,840 |

Scientific Evaluation Datasets


🧬 Life Science

Dataset Domain Modality Level Type Release Language Source Annotation Pipeline Human Annotators Human Tasks Auto-annotation Tools Size Evaluation Type Metrics
SeedBench Agriculture Breeding literature Expert Text QA 2025.05 EN, ZH Academic and research resources Semi-automated N/A Data generation and review GPT-4 2,264 MCQ, Open-ended Acc, F1, ROUGE
AgEval Agriculture Plant stress phenotyping photos and annotations Expert VQA 2025.01 EN Scientific databases N/A N/A N/A N/A 1,200 Classification, Regression F1, NMAE
AgXQA Agriculture Agricultural extension records Expert Text QA 2024.10 EN Academic and research resources Semi-automated N/A N/A N/A 2,186 Open-ended EM, F1
Fundus-MMBench Healthcare and Medical Sciences CFP Expert VQA 2025.07 EN Integration of existing datasets Manual N/A Data review N/A 620 MCQ Acc
ReXVQA Healthcare and Medical Sciences X-ray N/A VQA 2025.06 EN Integration of existing datasets Semi-automated 3 Data review GPT-4o, ClinicalBERT, MedEmbed 40,557 MCQ Acc
HealthBench Healthcare and Medical Sciences Clinical dialogue, Medical task requests, Medical record summarization, etc. Expert Text QA 2025.05 EN Comprehensive multi-source integration Semi-automated 262 Data generation and review GPT-o1, GPT-4.1 5,000 Open-ended Customized rubric criterion
MedAlpaca Healthcare and Medical Sciences Biomedical knowledge base Expert Text QA 2025.03 EN Web and Internet content Semi-automated N/A Data review GPT-3.5-Turbo 374 MCQ Acc
GEMeX-VQA Healthcare and Medical Sciences X-ray N/A VQA 2025.03 EN Integration of existing datasets Semi-automated N/A Data review OpenBioLLM-70B, GPT-4o 3,960 MCQ, True/False, Open-ended Acc
MIMIC-Diff-VQA Healthcare and Medical Sciences X-ray Expert VQA (multi-image) 2025.02 EN Scientific databases Semi-automated 3 Data generation and review ScispaCy 70,070 MCQ, Open-ended BLEU, METEOR, ROUGE-L, CIDEr
MedAgentBench Healthcare and Medical Sciences EHR, Lab results, Diagnosis codes, Medication orders Expert Text QA 2025.01 EN Academic and research resources Manual 2 Data generation and review N/A 300 Open-ended Success rate
MedXpertQA Healthcare and Medical Sciences CT, ECG, Histopathology, MRI, US, X-ray, etc. Expert VQA, Text QA 2025.01 EN Academic and research resources Semi-automated N/A Data generation and review GPT-4o, Claude 4,460 MCQ Acc
OpenMM-Medical Healthcare and Medical Sciences CT, Dermatology, Endoscopy, CFP, MRI, Microscopy, X-ray, etc. N/A VQA 2025.01 EN, ZH Comprehensive multi-source integration Semi-automated N/A Data review GPT-4o 88,996 MCQ Acc
Asclepius Healthcare and Medical Sciences CT, Dermatology, CFP, Histopathology, MRI, Microscopy, OCT, X-ray, etc. N/A VQA, Image-text 2024.11 EN Comprehensive multi-source integration Semi-automated 34 Data generation and review ChatGPT, GPT-4V, GPT-4o 3,232 MCQ Acc
ClinicalBench Healthcare and Medical Sciences EHR N/A Text QA 2024.11 EN Integration of existing datasets N/A N/A N/A N/A N/A MCQ F1, AUROC
WorldMedQA-V Healthcare and Medical Sciences Dermatology, Microscopy, X-ray, etc. N/A VQA 2024.10 EN, JA, ES, HE, PT Academic and research resources Semi-automated N/A Data review GPT-4o, Gemini Flash1-5, Yi-VL-34B 568 MCQ Acc
CRAFT-BioQA Healthcare and Medical Sciences Biomedical QA N/A Text QA 2024.09 EN Academic and research resources Automated N/A N/A N/A N/A MCQ Acc
MedTrinity-25M Healthcare and Medical Sciences CT, MRI, X-ray, Histopathology, etc. Expert Image-text, VQA 2024.08 EN Integration of existing datasets, Scientific databases Automated N/A N/A N/A 100,000 Open-ended Acc
GMAI-MMBench Healthcare and Medical Sciences CT, Dermatology, Endoscopy, CFP, Histopathology, MRI, Microscopy, OCT, PET, US, X-ray, etc. N/A VQA 2024.08 EN Comprehensive multi-source integration Semi-automated N/A Data review GPT-4o 26k MCQ Acc
SlideBench Healthcare and Medical Sciences Histopathology N/A VQA 2024.11 EN Scientific databases Semi-automated N/A Data generation and review GPT-4o 16k MCQ, Open-ended Acc, BLEU
Bio-ML Healthcare and Medical Sciences Ontology data Expert Text QA 2024.07 EN Encyclopedias and knowledge bases Semi-automated N/A Data generation and review N/A 25,270 Retrieval F1
MedBench Healthcare and Medical Sciences Diagnosis report, Clinical dialogue, EHR Expert Text QA 2024.06 ZH Integration of existing datasets Manual N/A Data generation N/A 300,901 MCQ, Open-ended BLEU, ROUGE-L, F1, Acc
ClinicalLab Healthcare and Medical Sciences Clinical notes Expert Text QA 2024.06 EN, ZH Other sources Manual N/A Data generation and review GPT-4 1,500 Open-ended DWR, DIFR, CDR, Acceptability, Acc, BLEU, ROUGE, BERTScore
AgentClinic-NEJM Healthcare and Medical Sciences Clinical dialog, Diagnosis report, CT, Dermatology, Histopathology, etc. Expert VQA 2024.05 EN Academic and research resources, Comprehensive multi-source integration Automated N/A N/A N/A 120 Open-ended Acc, Patient compliance, Consultation ratings
AgentClinic-Lang Healthcare and Medical Sciences Medical exams Expert Text QA 2024.05 EN, ES, FA, FR, HI, KO, ZH Academic and research resources, Comprehensive multi-source integration Semi-automated N/A Data review GPT-4 749 Open-ended Acc, Patient compliance, Consultation ratings
AgentClinic-MedQA Healthcare and Medical Sciences Medical exams Expert Text QA 2024.05 EN Academic and research resources, Comprehensive multi-source integration Semi-automated N/A Data review GPT-4 215 Open-ended Acc, Patient compliance, Consultation ratings
AgentClinic-MIMIC-IV Healthcare and Medical Sciences EHR Expert Text QA 2024.05 EN Scientific databases Semi-automated N/A Data review GPT-4 200 Open-ended Acc, Patient compliance, Consultation ratings
AgentClinic-Spec Healthcare and Medical Sciences Medical exams Expert Text QA 2024.05 EN Integration of existing datasets Semi-automated N/A N/A GPT-4 260 Open-ended Acc, Patient compliance, Consultation ratings
M3D-Bench Healthcare and Medical Sciences CT, Clinical reports Expert Image-text, Text QA, VQA 2024.04 EN Scientific databases, Integration of existing datasets Semi-automated N/A Data generation and review GPT-4V 1,235 MCQ, Open-ended, Retrieval Acc, BLEU, ROUGE
AMOS-MM Healthcare and Medical Sciences CT Expert Image-text, VQA 2024.04 EN, ZH Integration of existing datasets, Scientific databases N/A N/A N/A N/A 2,300 Open-ended, MCQ Acc
CMtMedQA Healthcare and Medical Sciences Clinical dialogue Expert Text QA 2024.03 ZH Books and literary works Semi-automated 6 Data review GPT-3.5, CMeKG, RLHF-Label-Tool 70k Open-ended GPT-4 score
Medbullets Healthcare and Medical Sciences Medical exams N/A Text QA 2024.02 EN Social media and forums Automated N/A Data review N/A 618 MCQ ROUGE-L, BERTScore, CTC, G-Eval, BARTScore+
RareBench Healthcare and Medical Sciences EHR, Medical history, Lab tests Expert Text QA 2024.02 EN, ZH Scientific databases, Academic and research resources, Other sources Manual N/A Data generation and review N/A 2,185 Open-ended Precision, Recall, F1, Median Rank, etc.
OmniMedVQA Healthcare and Medical Sciences CT, Dermatology, Endoscopy, CFP, Histopathology, MRI, Microscopy, OCT, PET, US, X-ray, etc. N/A VQA 2024.02 EN Comprehensive multi-source integration Semi-automated N/A Data review GPT-4 127,995 MCQ Acc
MultiMedEval Healthcare and Medical Sciences CT, Dermatology, CFP, Histopathology, MRI, Microscopy, OCT, US, X-ray, etc. N/A VQA 2024.02 EN Integration of existing datasets Semi-automated N/A Data review CheXbert, GPT, RadGraph 60k MCQ Acc
Fhirfly Medical Questions Healthcare and Medical Sciences Biomedical QA N/A Text QA 2024.01 EN Academic and research resources Semi-automated N/A Data review N/A 25,102 True/False Acc
RP3D-DiagDS Healthcare and Medical Sciences CT, MRI, X-ray, US, Fluoroscopy, etc. Expert Classification 2023.12 EN Scientific databases Semi-automated N/A Data generation and review Custom crawlers, GPT-4 40,936 True/False AUROC, AP
NEJM-AI Benchmarking Healthcare and Medical Sciences Medical exams Expert Text QA 2023.11 EN Academic and research resources Automated N/A N/A NLTK, Regex 858 MCQ Acc, BLEU, WER, Cosine
MORFITT Healthcare and Medical Sciences Clinical papers Expert Classification 2023.11 FR Academic and research resources Manual N/A Data review N/A 1,560 Classification Precision, Recall, F1
SourceData Healthcare and Medical Sciences Gene/protein entities Expert Raw text 2023.10 EN Academic and research resources Manual N/A Data generation and review N/A 620,000 NER Precision, Recall, F1
SDOH-NLI Healthcare and Medical Sciences Clinical notes Expert Classification 2023.10 EN Integration of existing datasets Manual N/A Data generation N/A 4.21k Classification Precision, Recall, F1
HealthsearchQA Healthcare and Medical Sciences Consumer health QA Expert Text QA 2023.08 EN Web and Internet content Semi-automated N/A Data review N/A 3,173 Open-ended Factuality, Comprehension, Reasoning, Possible harm and bias
CMB-Exam Healthcare and Medical Sciences Medical exams Expert Text QA 2023.08 ZH Web and Internet content Semi-automated N/A Data review N/A 280,839 MCQ Acc
CMB-Clin Healthcare and Medical Sciences Medical exams Expert Text QA 2023.08 ZH Books and literary works Semi-automated N/A Data review N/A 208 Open-ended Fluency, Relevance, Completeness, Proficiency
MultiMedBench Healthcare and Medical Sciences CT, Dermatology, Histopathology, Microscopy, MRI, X-ray, etc. N/A Text QA, VQA 2023.07 Mixed Integration of existing datasets N/A N/A N/A N/A 1M N/A Acc, ROUGE-L, BLEU, F1-RadGraph, F1
GPT-4 BiasBenchmark Healthcare and Medical Sciences Clinical trials Expert Text QA 2023.07 EN Academic and research resources Semi-automated N/A Data generation and review GPT-4 213 Open-ended Acc
Lavita Medical QA Healthcare and Medical Sciences Clinical guidelines N/A Text QA 2023.07 EN Academic and research resources Automated N/A N/A N/A 11,500 MCQ Acc
BioASQ10b-factoid Healthcare and Medical Sciences Clinical dialogue, PubMed snippets Expert Text QA 2023.07 EN Scientific databases, Academic and research resources Manual N/A Data generation and review N/A 166 Open-ended Acc, MRR
MedNERF Healthcare and Medical Sciences Drug Prescription Expert Classification 2023.06 FR Other sources Manual N/A Data generation and review N/A 100 NER F1
WikiMedQA Healthcare and Medical Sciences Clinical reports Expert Text QA 2023.03 EN Web and Internet content Semi-automated N/A N/A SentenceBERT, BioLinkBERT 5,893 MCQ Acc
BioASQ Healthcare and Medical Sciences Biomedical Documents Expert Text QA 2022.12 EN Academic and research resources Manual 21 Data generation and review N/A 4,721 Open-ended Acc
BioRED Healthcare and Medical Sciences Biomedical papers N/A Classification 2022.09 EN Scientific databases Semi-automated 6 Data generation and review PubTator 100 NER Precision, Recall, F1
BioLeaflets Healthcare and Medical Sciences Package leaflets Expert Raw text 2021.09 EN Web and Internet content Semi-automated N/A Data generation Stanza, Amazon Comprehend Medical 134 Generation SacreBLEU, ROUGE-L, BERTScore, BLEURT, MoverScore-21
CBLUE Healthcare and Medical Sciences Clinical trials, EHR, Medical forum, Medical textbooks N/A Classification, Text QA 2021.06 ZH Comprehensive multi-source integration Manual 3 Data generation and review N/A 46,729 NER, Open-ended, Retrieval Acc, F1
SLAKE Healthcare and Medical Sciences CT, MRI, X-ray N/A VQA 2021.02 EN, ZH Academic and research resources Automated N/A N/A N/A 2,070 MCQ, Open-ended Acc
MEDIQA-AnS Healthcare and Medical Sciences Consumer health QA Undergraduate Text QA 2020.09 EN Web and Internet content Manual 2 Data generation N/A 708 Open-ended ROUGE, BLEU
RadVisDial (G) Healthcare and Medical Sciences X-ray N/A VQA 2020.07 EN Integration of existing datasets Semi-automated 2 Data generation NegBio, CheXpert 91k MCQ Acc
CORD-19 Healthcare and Medical Sciences Academic papers N/A Text QA 2020.03 EN Academic and research resources N/A N/A N/A N/A 280K Retrieval, QA MRR, Acc
PathVQA Healthcare and Medical Sciences Histopathology Expert VQA 2020.03 EN Academic and research resources Automated N/A N/A CoreNLP 6,012 MCQ, Open-ended BLEU, Exact-match, F1
MedQuAD Healthcare and Medical Sciences Patient educational materials Undergraduate Text QA 2019.11 EN Web and Internet content Semi-automated 2 Data generation MetaMap Lite, UMLS lookup 47,457 Open-ended Acc, F1, MRR
Pubmed Causal Healthcare and Medical Sciences Biomedical papers N/A Classification 2019.11 EN Scientific databases Manual N/A Data generation N/A 2,446 Classification Acc, F1
PubMedQA instruction Healthcare and Medical Sciences Clinical dialogue Expert Text QA 2019.09 EN Academic and research resources Manual N/A Data generation N/A 273k Classification Acc
VQA-RAD Healthcare and Medical Sciences CT, MRI, PET, US, X-ray N/A VQA 2018.11 EN Academic and research resources Manual N/A Data generation N/A 451 MCQ, Open-ended Acc, BLEU
Pima Healthcare and Medical Sciences EHR Expert Classification 1988.11 EN Scientific databases Manual N/A N/A N/A 77 Classification AUROC
TOMG-Bench Molecular and Cellular Biology Molecule Expert Text QA 2024.12 EN Scientific databases Automated N/A N/A N/A 45,000 Open-ended Success Rate, Similarity, Novelty, Validity
MoleculeQA Molecular and Cellular Biology Molecule Expert Text QA 2024.11 EN Scientific databases Manual 2 Data generation and review N/A 62,000 MCQ Acc
BEACON Molecular and Cellular Biology RNA sequence N/A Raw text 2024.06 EN Comprehensive multi-source integration Semi-automated N/A Data generation and review N/A 96,283 Classification, Regression F1, AUROC, Precision, R², MSE, PCC
GeneHop Molecular and Cellular Biology Multi-hop genomic QA N/A Text QA 2023.04 EN Academic and research resources Manual N/A Data generation and review N/A 150 Open-ended Acc
PCdes Molecular and Cellular Biology SMILES N/A Text QA 2022.12 EN Academic and research resources Automated N/A N/A Custom crawlers 3,000 Retrieval Acc, Recall
PEER Molecular and Cellular Biology Protein sequence Expert Classification, Regression 2022.10 EN Comprehensive multi-source integration Semi-automated N/A Data generation and review N/A 115,281 Classification, Regression Acc, RMSE, Precision, PCC
BioPreDyn-bench Molecular and Cellular Biology Time-series (simulation data) Expert Regression 2015.02 EN Academic and research resources N/A N/A N/A 6 6 Open-ended NRMSE
MicroVQA Molecular and Cellular Biology, Healthcare and Medical Sciences Microscopy Expert VQA 2025.03 EN Academic and research resources Semi-automated 12 Data generation and review GPT-4o 1042 MCQ Acc
DISEASES Molecular and Cellular Biology, Healthcare and Medical Sciences, Multi-omics Disease-gene associations Expert Classification 2015.01 EN Academic and research resources, Integration of existing datasets, Scientific databases Semi-automated N/A Data generation and review NER tagger 8,336,442 Open-ended, True/False, Retrieval Precision, Recall, F1, AUROC, AUPRC
LAB-Bench Molecular and Cellular Biology, Multi-omics, Neuroscience Research problems N/A Text QA 2024.07 EN Academic and research resources N/A N/A N/A N/A 2,457 MCQ Acc
BioProBench Molecular and Cellular Biology, Multi-omics, Pharmacy, Neuroscience, etc. Protocol N/A Text QA 2025.05 EN Academic and research resources Semi-automated N/A Data review N/A 556,171 Open-ended Acc, F1, EM, BLEU
Genome-Bench Multi-omics Research problems N/A Text QA 2025.06 EN Academic and research resources N/A N/A Data generation and review GPT-4 Turbo, GPT-4o 3,332 MCQ Acc
GeneChat-test Multi-omics Nucleotide sequence N/A Text QA 2025.06 EN Scientific databases N/A N/A Data generation N/A N/A Open-ended BLEU, METEOR
GeneChat Multi-omics Nucleotide sequence N/A Text QA 2025.06 EN Scientific databases N/A N/A Data generation N/A 2,973 Open-ended BLEU, METEOR
Genomics instructions Multi-omics Nucleotide sequence N/A Text QA 2025.04 EN Academic and research resources N/A N/A Data generation N/A 403,814 Classification, Regression F1, MCC, AUROC, PCC
BixBench Multi-omics Genomics transcriptomics text Expert Text QA 2025.03 EN Academic and research resources Semi-automated 53 Data generation and review Claude 3.5 Sonnet 296 Open-ended, MCQ Acc
Seq2Func Multi-omics Nucleotide sequence N/A Text QA 2025.02 EN Scientific databases Automated N/A Data generation N/A 33,000 MCQ MCC, F1
DNA2Image Multi-omics Nucleotide sequence N/A Generation 2025.02 EN Scientific databases Automated N/A Data generation N/A 4,800 Generation Invalid percentage, F1
DNA Long Bench Multi-omics DNA sequence N/A Classification, Regression 2025.01 EN Scientific databases, Academic and research resources Automated N/A N/A N/A 213,416 Classification, Regression SCC, PCC, AUROC
LLaMA-Gene (protein) Multi-omics Protein sequence N/A Text QA 2024.12 EN Scientific databases N/A N/A Data generation N/A 6,991 Open-ended Acc
LLaMA-Gene (DNA) Multi-omics DNA sequence N/A Text QA 2024.12 EN Scientific databases N/A N/A Data generation N/A 19,839 Open-ended Acc
NT Benchmark Multi-omics Nucleotide sequence N/A Classification 2024.10 EN Academic and research resources N/A N/A N/A N/A 38,822 MCQ MCC
BioinformaticsBench Multi-omics Textbook Undergraduate Text QA 2024.06 EN Books and literary works, Academic and research resources Semi-automated 4 N/A GPT-3.5, GPT-4, GPT-4 Turbo 602 MCQ, True/False, Open-ended Acc
genomics-long-range-benchmark Multi-omics Nucleotide sequence N/A Classification, Regression 2024.05 EN Academic and research resources N/A N/A N/A N/A N/A Classification, Regression MCC
RNA-QA Multi-omics RNA sequence N/A Text QA 2024.05 EN Academic and research resources Automated N/A N/A N/A 121K Open-ended Precision, Recall, F1, ROUGE
BioinfoBench Multi-omics RNA sequence Undergraduate Text QA 2023.10 EN Other sources Semi-automated N/A Data review ChatGPT 200 MCQ Acc, Perplexity, Next-token likelihood
BioCoder Multi-omics Codes Undergraduate Text QA 2023.08 EN Integration of existing datasets, Academic and research resources Automated N/A N/A N/A 2522 Open-ended Acc
SpeciesClassification Multi-omics Nucleotide sequence N/A Classification 2023.06 EN Academic and research resources N/A N/A N/A N/A 5 species genomes MCQ Acc
GUE Benchmark Multi-omics Nucleotide sequence N/A Classification 2023.06 EN Academic and research resources N/A N/A N/A N/A 80,648 MCQ MCC, F1
Genomic Benchmark Multi-omics Nucleotide sequence N/A Classification 2023.05 EN Academic and research resources N/A N/A N/A N/A 191,589 MCQ Acc, F1
GeneTuring Multi-omics Biomedical knowledge base N/A Text QA 2023.03 EN Academic and research resources N/A N/A Data generation GPT-2, BioGPT, BioMedLM, GPT-3, ChatGPT, New Bing 600 MCQ Acc
Human Pancreas Multi-omics scRNA-seq N/A Classification 2023.01 EN Academic and research resources N/A N/A N/A N/A 4,218 Classification Acc, Precision, Recall, F1
Myeloid Multi-omics scRNA-seq N/A Classification 2021.02 EN Academic and research resources N/A N/A N/A N/A 3,430 Classification Acc, Precision, Recall, F1
Human Cell Atlas Dataset Multi-omics scRNA-seq N/A Classification 2021.02 EN Academic and research resources N/A N/A N/A N/A 84,363 Classification Acc, F1
Human enhancers Ensembl Multi-omics Nucleotide sequence N/A Classification 2021.01 EN Academic and research resources N/A N/A N/A N/A 154,842 MCQ MCC
Human regulatory Ensembl Multi-omics Nucleotide sequence N/A Classification 2021.01 EN Academic and research resources N/A N/A N/A N/A 289,061 MCQ MCC
Human ocr Ensembl Multi-omics Nucleotide sequence N/A Classification 2021.01 EN Academic and research resources N/A N/A N/A N/A 174,756 MCQ MCC
Multiple Sclerosis Multi-omics scRNA-seq N/A Classification 2019.07 EN Academic and research resources N/A N/A N/A N/A 13,468 Classification Acc, Precision, Recall, F1
APARENT Multi-omics Nucleotide sequence N/A Regression 2019.06 EN Academic and research resources N/A N/A N/A N/A 8,000 Regression R²
Human enhancers Cohn Multi-omics Nucleotide sequence N/A Classification 2018.02 EN Academic and research resources N/A N/A N/A N/A 27,791 MCQ MCC
Human non-TATA promoters Multi-omics Nucleotide sequence N/A Classification 2017.02 EN Academic and research resources N/A N/A N/A N/A 36,131 MCQ MCC
Zheng68k Multi-omics scRNA-seq N/A Classification 2016.07 EN Academic and research resources N/A N/A N/A N/A 68,450 Classification Acc, F1
Drosophila enhancers Stark Multi-omics Nucleotide sequence N/A Classification 2014.06 EN Academic and research resources N/A N/A N/A N/A 6,914 MCQ MCC
COMET Multi-omics DNA, RNA, Protein Sequence/Residue N/A Classification, Regression 2024.12 EN Academic and research resources N/A N/A N/A N/A 1.22M Classification, Regression R², PCC, F1, SCC
AdaBrain-Bench Neuroscience EEG N/A Classification 2025.07 EN Integration of existing datasets N/A N/A N/A N/A N/A Open-ended Acc, AUROC, AUPRC, F1, PCC, R²
FDA Pharmaceuticals FAQ Pharmacy FAQ-style text Expert Text QA 2023.03 EN Web and Internet content Automated N/A N/A N/A 1,681 MCQ Acc
repoDB Pharmacy, Healthcare and Medical Sciences Drug-disease relationships, Clinical trial outcomes Expert Classification, Text QA 2017.03 EN Scientific databases Automated N/A N/A scripts 15,648 MCQ, Retrieval AUROC, AUPRC, Acc

⚗️ Chemistry

⬆ Back to Top

Dataset Domain Modality Level Type Release Language Source Annotation Pipeline Human Annotators Human Tasks Auto-annotation Tools Size Evaluation Type Metrics
OmniGenBench Biochemistry, Multi-omics DNA sequence, RNA sequence, TF binding, etc. N/A Classification 2025.05 N/A Integration of existing datasets, Academic and research resources N/A N/A N/A N/A N/A N/A AUROC, F1, RMSE, R²
MOSES Biochemistry SMILES Expert Raw text 2020.07 EN Academic and research resources Manual N/A Data review N/A 1,936,962 Generation Chemical validity, Drug-likeness
ChEMBL Biochemistry SMILES Expert Raw text 2012.01 EN Academic and research resources Manual N/A Data generation and review N/A 1.96M Generation Chemical validity, Drug-likeness
ChemRxivQuest General Chemistry Academic papers Expert Text QA 2025.05 EN Academic and research resources Manual N/A Data generation and review N/A 970 Open-ended Acc
ScholarChemQA General Chemistry Academic papers Expert Text QA 2025.02 EN Academic and research resources Manual N/A Data generation and review N/A 40K MCQ Acc
ChemSafetyBench General Chemistry Text Expert Raw text 2024.11 EN Academic and research resources Automated N/A Data generation and review N/A 30K+ Open-ended Acc, Recall, Precision, F1, safety/quality score, etc.
ChemEval General Chemistry Text Expert Raw text 2024.09 EN Academic and research resources Automated N/A Data generation and review N/A unknown (42 tasks) Open-ended Acc, BLEU-2, F1, etc.
ChemNLP General Chemistry Text Secondary school Text QA 2023.01 EN Academic and research resources Manual N/A Data generation and review N/A 27.6K Classification, NER, Generation Acc, ROUGE
ZINC General Chemistry SMILES Expert Raw text 2012.10 EN Academic and research resources Manual N/A Data review N/A 250K Generation Chemical validity, Drug-likeness
PMO General Chemistry SMILES Expert Raw text 2022.05 EN Academic and research resources Automated N/A Data generation and review N/A 10K Open-ended target property, Chemical validity, Drug-likeness
DeepProtein Pharmacy Protein sequence, SMILES Expert Raw text 2025.05 EN Academic and research resources Automated N/A Data generation and review N/A 78K Classification, Regression Acc, MAE, F1, AUPRC, AUROC, R², etc.
TrialBench Pharmacy SMILES, Disease code Expert Raw text 2024.09 EN Academic and research resources Automated N/A Data generation and review N/A 470K Open-ended F1, Recall, Precision, MSE, etc.
TDC2 Pharmacy SMILES, Protein sequence, Genome sequence Expert Raw text 2024.09 EN Academic and research resources Manual N/A Data generation and review N/A 3.4B tokens Open-ended F1, Recall, Precision, MSE, etc.
PCQM4Mv2 Pharmacy Molecular graph N/A Regression 2022.11 EN Academic and research resources Automated N/A N/A N/A 3,746,619 Regression MAE
SBDDBench Pharmacy Protein sequence, SMILES Expert Protein-ligand 2022.06 EN Academic and research resources Automated N/A Data generation and review N/A 5K Open-ended binding affinity, Chemical validity, Drug-likeness
GEOM Pharmacy 3D conformation N/A Regression 2022.04 EN Academic and research resources Automated N/A N/A CREST (GFN2-xTB) 37M conformations Regression MAE, RMSD
TOP Pharmacy SMILES Expert Raw text 2022.02 EN Academic and research resources Automated N/A Data generation and review N/A 12K Open-ended F1, Recall, AUPRC, etc.
TDC Pharmacy SMILES, Protein sequence, Genome sequence Expert Raw text 2021.06 EN Academic and research resources Manual N/A Data generation and review N/A 0.2B tokens Open-ended F1, Recall, Precision, MSE, etc.
DeepPurpose Pharmacy Protein sequence, SMILES Expert Raw text 2020.12 EN Academic and research resources Automated N/A Data generation and review N/A 5,074 Classification, Regression MSE, PCC, F1, AUROC, AUPRC, etc.
DrugBank Pharmacy SMILES Expert Raw text 2018.01 EN Academic and research resources Manual N/A Data generation and review N/A 17.47K Open-ended Acc
USPTO Synthetic Chemistry SMILES Expert Raw text 2015.07 EN Patent databases Manual N/A Data generation and review N/A 1,939,253 Open-ended Acc, F1, MSE, etc.

⚛️ Physics

⬆ Back to Top

Dataset Domain Modality Level Type Release Language Source Annotation Pipeline Human Annotators Human Tasks Auto-annotation Tools Size Evaluation Type Metrics
FSReD General Physics Text Expert Text QA 2019.05 EN Comprehensive multi-source integration Automated N/A Data review N/A 120 Regression MSE, Exact Match
PIQA General Physics Text Primary school Text QA 2020.01 EN Other sources Semi-automated N/A Data generation and review N/A 2,000 MCQ Acc
SRBench General Physics Text N/A Text QA 2021.07 EN Comprehensive multi-source integration Semi-automated N/A Data generation and review N/A 252 Open-ended Acc, Simplicity, Exact Match
PROST General Physics Text N/A Text QA 2021.08 EN Other sources Semi-automated 4 Data generation N/A 18,736 MCQ Acc
MM-PhyQA Kinematics, Mechanics, Electrostatics and Current Electricity, Thermodynamics, Optics, Magnetism, etc. Text High school VQA with CoT 2024.04 EN Comprehensive multi-source integration Semi-automated 8+ Data generation and review ChatGPT 675 MCQ Acc, ROUGE
MVBench General Physics Video N/A Video QA 2024.05 EN Comprehensive multi-source integration Automated 0 Data review N/A 4,000 MCQ Acc
UGPhysics Mechanics, Thermodynamics, Electromagnetism, Modern Physics Text (problem statements, equations, reasoning) Undergraduate Text QA 2025.01 EN, ZH Academic and research resources Semi-automated N/A Data generation and review GPT-4o 11,040 MCQ, Open-ended, True/False, Retrieval Acc
PhysReason Mechanics, Electromagnetism, Thermodynamics, etc. Text (problem statements, equations), Diagrams (physics illustrations) Undergraduate, Graduate, Expert Text QA, VQA 2025.02 EN Comprehensive multi-source integration Semi-automated 4 Data generation and review GPT-4 1,200 MCQ Acc
TPBench Cosmology, High Energy Theory, General Relativity, Astrophysics, Electromagnetism, Quantum Mechanics, Mechanics, etc. Text N/A Text QA 2025.02 N/A Other sources Manual N/A Data generation and review N/A 57 Open-ended Acc, AI-based Holistic Grading
PHYSICS Mechanics, Electromagnetism, Thermodynamics, Optics, etc. Text (problem statements, equations, reasoning), Diagrams (illustrations, charts, experimental setups) Undergraduate Text QA, VQA 2025.03 EN Comprehensive multi-source integration Manual N/A Data generation and review N/A 1,297 Open-ended Acc
PhysicsArena Mechanics, Electromagnetism, Thermodynamics, etc. Text (problem statements, equations, reasoning), Diagrams (illustrations, charts, experimental setups) Expert Text QA, VQA 2025.05 EN Comprehensive multi-source integration Semi-automated N/A Data generation and review N/A 5,100 Open-ended Acc
PHYBench Mechanics, Electricity, Thermodynamics, Optics, Modern Physics, etc. Research problems Undergraduate Text QA 2025.05 EN Comprehensive multi-source integration Semi-automated 178 Data generation and review o1, DeepSeek-R1 500 Open-ended EED
PhyX Mechanics, Quantum Mechanics, Thermodynamics, Electromagnetism, Atomic Physics, etc. Text (problem statements, equations, reasoning), Diagrams (illustrations, charts, experimental setups) Undergraduate Text QA, VQA 2025.05 EN Comprehensive multi-source integration Semi-automated N/A Data generation and review GPT-4o 3,000 MCQ, Open-ended Acc
PhysUniBench Mechanics, Electromagnetism, Optics, Atomic Physics, etc. Text (problem statements, equations), Diagrams (physics illustrations) Undergraduate VQA 2025.06 EN, ZH Comprehensive multi-source integration Manual N/A Data generation and review N/A 3,304 Open-ended, MCQ Acc
IntPhys 2 General Physics Video, Text (scene parameters, object categories, trajectories, physical attributes) N/A Video QA 2025.06 N/A Other sources Semi-automated N/A Data generation and review N/A 1,400 Open-ended Acc
MVP-Bench General Physics Video N/A Video QA 2025.06 EN Encyclopedias and knowledge bases Semi-automated N/A Data generation and review OpenAI CLIP (ViT-L/14) 55,000 Open-ended Acc
SeePhys Mechanics, Electromagnetism, Particle Physics, Optics, Astrophysics, Thermodynamics, Quantum Mechanics, etc. Text (problem statements, equations), Diagrams (physics illustrations) Secondary school, Undergraduate, Graduate VQA 2025.07 EN, ZH Comprehensive multi-source integration Semi-automated N/A Data generation and review GPT-4o 2,000 Open-ended Acc

🌌 Astronomy

⬆ Back to Top

Dataset Domain Modality Level Type Release Language Source Annotation Pipeline Human Annotators Human Tasks Auto-annotation Tools Size Evaluation Type Metrics
Astro-QA Astronomy Astronomy Olympiad competitions, Astronomy exams, Online encyclopedias Undergraduate Text QA 2025.06 Mixed Comprehensive multi-source integration Manual 30+ Data generation and review N/A 3,082 Open-ended DGscore, BLEU, ROUGE, chrF
Astrovisbench Astronomy Galaxy images Expert VQA 2025.06 EN Comprehensive multi-source integration Semi-automated 6 Data review GPT-4o, Claude 3.5 Sonnet 432 Open-ended VIscore, Image error level, Expert evaluation
AstroMLab 1 Astronomy Academic papers Expert Text QA 2024.11 EN Academic and research resources Automated N/A Data review Gemini-1.5-Pro 4,425 MCQ Acc
AstroPT Astronomy Astronomical images Expert Image-text 2024.05 EN Web and Internet content, Scientific databases Automated N/A Data review DESI Legacy Survey API 8.6M Classification PCC, Acc
Astro-NER Astronomy Academic papers Expert Text QA 2024.05 EN Academic and research resources Semi-automated 4 Data generation and review GPT-3.5 5,000 Open-ended Precision, Recall, F1
AstroLLaMA Astronomy Academic papers Expert Text QA 2023.09 EN Academic and research resources, Web and Internet content Manual N/A Data review N/A 9.5M Open-ended Perplexity, Cosine similarity
ATel Astronomy Academic papers Expert Text QA 2023.05 EN Academic and research resources Manual N/A Data review N/A 234 Open-ended Acc
PhyE2Es Astrophysics Text with formulae Expert Raw text 2025.03 EN Scientific databases Automated N/A Data generation and review OpenLLAMA-2-3B 8,000 Regression Acc, Numerical precision, Formula complexity, Formula depth
Pathfinder Dataset Astrophysics Academic papers, ADS Expert Text QA with CoT 2024.11 EN Web and Internet content, Academic and research resources Automated 36+ Data generation and review text-embedding-3-small 385,166 Open-ended Acc, MRR, Recall, NDCG, relevance score
Starwhisper-pulsar Astrophysics Pulsar diagnostic plots, Pulsar signals Expert VQA 2024.04 EN Integration of existing datasets Manual N/A Data generation and review DeepSeek-VL-7B, InternVL2-40B 106,674 Open-ended Acc, Recall, Precision, F1, etc.
PAPERCLIP Astrophysics Synthetic conversation, Astronomical images Expert Text QA 2024.03 EN Academic and research resources, Scientific databases Automated N/A Data generation and review Mixtral-8x7B-Instruct 31,859 Open-ended Acc

🪨 Materials Science

⬆ Back to Top

Dataset Domain Modality Level Type Release Language Source Annotation Pipeline Human Annotators Human Tasks Auto-annotation Tools Size Evaluation Type Metrics
CheMatAgent Materials Science Scientific instruction Expert Text QA 2025.06 EN Other sources Manual N/A Data generation and review N/A 137 Open-ended Acc
ChEBI-20-MM Materials Science InChI, IUPAC, SELFIES, Science QA, Molecular Image Expert Text QA 2025.01 EN Integration of existing datasets Manual N/A Data generation and review N/A 3,300 Open-ended BLEU, ROUGE, METEOR, CIDEr
LLM4MatBench Materials Science CIF, Chemical composition, Numerical property Expert Text QA 2024.10 EN Scientific databases Manual N/A Data generation and review N/A 1.9M Open-ended Acc
MatText Materials Science Chemical composition, Numerical property Expert Text QA 2024.08 EN Scientific databases Manual N/A Data generation and review N/A 2,000,000 Open-ended MAE, AUROC
MatBookQA Materials Science Science QA Expert Text QA 2024.05 EN Academic and research resources Manual N/A Data generation and review N/A 650 Open-ended Acc
MaSCQA Materials Science Science QA Expert Text QA 2023.08 EN Academic and research resources Manual N/A Data generation and review N/A 650 Open-ended Acc
MatSci-NLP Materials Science Chemical composition, Numerical property Expert Text QA 2023.05 EN Academic and research resources Manual N/A Data generation and review N/A 169,197 Open-ended Acc, F1
ChEBI-20 Materials Science Scientific instruction Expert Text QA 2021.11 EN Integration of existing datasets Manual N/A Data generation and review N/A 3,301 Open-ended BLEU, ROUGE, METEOR, CIDEr
MOSES Materials Science SMILES Expert Text QA 2020.11 EN Integration of existing datasets Manual N/A Data generation and review N/A 176,000 Open-ended Uniqueness, Validity, Frag, Scaff, SNN
MatBench Materials Science CIF, Numerical property, Chemical composition Expert Text QA 2020.09 EN Scientific databases Manual N/A Data generation and review N/A 408,062 Open-ended MAE, AUROC
GuacaMol Materials Science SMILES Expert Text QA 2019.03 EN Integration of existing datasets Manual N/A Data generation and review N/A 2M Open-ended Validity, Uniqueness, Novelty
MoleculeNet Materials Science SMILES Expert Text QA 2017.03 EN Integration of existing datasets Manual N/A Data generation and review N/A 700,000 Open-ended AUROC, AUPRC, RMSE, MAE
MaCBench Materials Science, Chemistry Science QA, AFM Image Expert VQA 2024.11 EN Academic and research resources Manual N/A Data generation and review N/A 628 Open-ended Acc
MMSci Materials Science, Chemistry Science QA Graduate VQA 2024.05 EN Academic and research resources Manual N/A Data generation and review N/A 742,273 Open-ended Acc

🌍 Earth Science

⬆ Back to Top

Dataset Domain Modality Level Type Release Language Source Annotation Pipeline Human Annotators Human Tasks Auto-annotation Tools Size Evaluation Type Metrics
ClimaQA Atmosphere Textbooks Expert Text QA 2025.03 EN Books and literary works Semi-automated Data review GPT-4o 3,633 MCQ, Open-ended Acc, BLEU, etc.
WeatherQA Atmosphere Remote sensing, Science QA Expert VQA (multi-image) 2024.06 EN Scientific databases Semi-automated 4 Data review 600 MCQ, Open-ended Acc, F1, BLEU, etc.
ClimateBERT Atmosphere Corporate annual reports, Sustainability reports Secondary school Text QA 2022.12 EN Web and Internet content Manual 4+ Data review Prodigy 320 MCQ Acc
OceanBench Hydrosphere Academic papers Expert Text QA 2024.09 EN Academic and research resources Automated 10+ Data review GPT-4, GPT-3.5 13,000 Open-ended Win Rate
OmniEarth-Bench Hydrosphere, Biosphere, Lithosphere, Atmosphere, Cryosphere Remote sensing, Science QA Expert VQA with CoT (multi-image) 2025.05 EN Integration of existing datasets Manual 40+ Data generation and review 29,779 MCQ Acc, Precision, Recall, F1
MSEarth Hydrosphere, Biosphere, Lithosphere, Atmosphere, Cryosphere Academic papers Expert VQA with CoT 2025.05 EN Academic and research resources Semi-automated 20+ Data review GPT-4o 11,500 MCQ, Open-ended BLEU, BERTScore, Acc
EarthSE Hydrosphere, Biosphere, Lithosphere, Atmosphere, Cryosphere Academic papers Expert Text QA with CoT 2025.05 EN Academic and research resources Semi-automated 20+ Data review GPT-4o 10,000 Open-ended Acc
GeoBench Lithosphere Science QA Expert Text QA 2023.06 EN Web and Internet content, Academic and research resources Semi-automated 10+ Data review 2,516 MCQ, Open-ended Acc, GPTScore
XLRS-Bench Remote Sensing Remote sensing N/A Image-text, VQA 2025.03 EN, ZH Academic and research resources Semi-automated 55 Data generation and review GPT-4o 32,389 MCQ, Open-ended Acc, IoU, BLEU, etc.
LRS-VQA Remote Sensing Remote sensing N/A Image-text, VQA 2025.03 EN Academic and research resources Automated N/A N/A Qwen2-VL, GPT-4V 7,333 Open-ended Acc
MME-RealWorld-RS Remote Sensing Remote sensing N/A Image-text, VQA 2024.08 EN, ZH Academic and research resources Manual N/A Data generation and review N/A 3,738 MCQ Acc
VRSBench Remote Sensing Remote sensing N/A Image-text, VQA 2024.06 EN Academic and research resources Semi-automated N/A Data review GPT-4V 62,917 Open-ended Acc, IoU, BLEU, etc.
GeoChat Remote Sensing Remote sensing N/A Image-text, VQA 2023.11 EN Academic and research resources, Integration of existing datasets Automated N/A N/A Vicuna 10K Open-ended Acc, IoU, METEOR
RSIEval Remote Sensing Remote sensing N/A Image-text, VQA 2023.07 EN Academic and research resources Manual 5 Data generation and review N/A 1,036 Open-ended Acc, BLEU, ROUGE, etc.
DIOR-RSVG Remote Sensing Remote sensing N/A Image-text 2022.10 EN Academic and research resources Semi-automated N/A Data review N/A 17,402 Open-ended IoU
NWPU-Captions Remote Sensing Remote sensing N/A Image-text 2022.08 EN Academic and research resources Manual N/A Data generation N/A 31,500 Open-ended BLEU, METEOR, etc.
RSVQA-HRBEN Remote Sensing Remote sensing N/A Image-text, VQA 2020.05 EN Scientific databases Automated N/A N/A N/A 77,232 Open-ended Acc
RSVQA-LRBEN Remote Sensing Remote sensing N/A Image-text, VQA 2020.05 EN Scientific databases Automated N/A N/A N/A 1,066,316 Open-ended Acc
RSICD Remote Sensing Remote sensing N/A Image-text 2017.12 EN Academic and research resources Manual N/A Data generation N/A 10,921 Open-ended BLEU, METEOR, CIDEr
UCM-Captions Remote Sensing Remote sensing N/A Image-text 2016.07 EN Academic and research resources Manual N/A Data generation N/A 2,100 Open-ended BLEU, METEOR, CIDEr
Sydney-Captions Remote Sensing Remote sensing N/A Image-text 2016.07 EN Academic and research resources Manual N/A Data generation N/A 613 Open-ended BLEU, METEOR, CIDEr

🔭 General Science

⬆ Back to Top

Dataset Scientific Domain Modality Type Release Language Source Annotation Pipeline Human Annotators Human Tasks Auto-annotation Tools Size Level Evaluation Type Metrics
MMMU Science (Biology, Chemistry, Geography, Math, Physics), Health & Medicine (Basic Medical Science, Clinical Medicine, Diagnostics, Pharmacy, Public Health), Tech & Engineering (Materials, etc.) Scientific VQA, MRI, CT, X-ray, etc. VQA 2023.11 EN Comprehensive multi-source integration Semi-automated 50 Data review Claude, GPT-4, GPT-4V 11,550 Expert MCQ Acc
MMMU Pro Science (Biology, Chemistry, Geography, Math, Physics), Health & Medicine (Basic Medical Science, Clinical Medicine, Diagnostics, Pharmacy, Public Health), Tech & Engineering (Materials, etc.) Scientific VQA, MRI, CT, X-ray, etc. VQA 2023.11 EN Comprehensive multi-source integration Semi-automated N/A Data review Claude, GPT-4, GPT-4V 5,190 Expert MCQ Acc
ScienceQA Biology, Earth Science, Physics, Chemistry, Geography Scientific query, Scientific instruction, Science textbooks and literature VQA 2022.01 EN Books and literary works Manual 9+ Data generation and review ViT, GPT-2 21.2k Primary school, Secondary school MCQ Acc
SciQA Material Science, Chemistry, Life sciences Scientific query Text QA 2023.05 EN Academic and research resources Manual N/A Data generation and review N/A 2,565 Expert Open-ended Acc
Scicode Material Science, Biology, Chemistry, Physics, Mathematics Scientific Instruction Text QA 2024.08 EN Academic and research resources Manual N/A Data generation and review N/A 338 Expert Open-ended pass@1
CURIE Materials Science, Life Sciences, Physics, Earth Science Scientific query VQA 2025.04 EN Integration of existing datasets Manual N/A Data generation and review N/A 580 Expert Open-ended Acc
TheoremQA Physics, Mathematics Theorems Text QA 2023.12 EN Books and literary works, Encyclopedias and knowledge bases Manual N/A Data generation and review N/A 800 Undergraduate, Expert Open-ended Acc
SciBench Physics, Chemistry Science QA Text QA 2023.09 EN Books and literary works Manual 7 Data review N/A 695 Undergraduate Open-ended Acc
JEEBench Physics, Chemistry Science Exams Text QA 2023.12 EN Other sources Semi-automated N/A Data generation and review N/A 515 Expert MCQ Acc
MMLU Physics (College Physics, Conceptual Physics, High School Physics), Chemistry (College Chemistry, High School Chemistry), Biology (College Biology, High School Biology) Science QA Text QA 2020.09 EN Books and literary works Manual 7 Data generation and review N/A 15.9k Secondary School, Undergraduate, Expert MCQ Acc
C-Eval Chemistry (College Chemistry, High School Chemistry, Middle School Chemistry), Physics (College Physics, High School Physics, Middle School Physics), Biology (High School Biology, Middle School Biology), Medicine (Veterinary Medicine, Basic Medicine, Clinical Medicine, Physician), Earth Science (High School Geography, Middle School Geography) Exam questions, Chinese educational assessments Text QA 2023.05 ZH Books and literary works Manual 12 Data generation and review N/A 13.9k Primary school, Secondary school, Undergraduate MCQ Acc
GPQA Chemistry, Biology, Physics Graduate-level scientific questions Text QA 2023.11 EN Other sources Manual 8 Data generation and review N/A 448 Expert MCQ Acc
ArXivQA Physics (Accelerator Physics, High Energy Physics - Lattice, Mathematical Physics, etc.), Chemistry (Chemical Physics), Biology (Quantitative Biology), Material (Materials Theory) Scientific figure question-answer Text QA 2024.05 EN, ZH Other sources Semi-automated N/A Data generation and review GPT-4V 249,587 Expert MCQ Acc
Xiezhi Agronomy (Crop Science, Veterinary Medicine), Science (Chemistry, Physics), Medicine (Traditional Chinese Medicine) Professional exams Text QA 2024.05 EN Comprehensive multi-source integration Semi-automated N/A Data review ChatGPT, Llama-7B 250k Expert MCQ Acc
SuperGPQA Medicine, Science, Agriculture Graduate Disciplines QA Text QA 2025.02 EN Other sources Semi-automated 80+ Data generation and review N/A 26.5k Expert MCQ Acc
BMMR Health (Medicine, Pharmacy, etc.), Natural Sciences (Physics, Biology, etc.), Agriculture Image, College-level visual question answering, OCR-based QA VQA 2025.07 EN, ZH Web and Internet content, Books and literary works, Integration of existing datasets Semi-automated N/A Data generation and review N/A 109,449 Primary school, Undergraduate, Secondary school MCQ Acc
OlympiadBench Physics, Mathematics QA from math and physics competitions, Image Text QA, VQA 2024.02 ZH, EN Other sources Semi-automated 14 Data generation and review N/A 8.5k Expert Open-ended Acc
LLM-SRBench LSR-Synth (Chemistry, Biology, Physics, Material Science), LSR-Transform Structured Data text 2025.04 EN Comprehensive multi-source integration Fully-automated 0 Data generation and review N/A 239 Expert Open-ended Exact Match, MSE
HLE Biology (Marine Biology, Molecular Biology, Computational Biology, Ecology, etc.), Chemistry (Chemical Engineering, Biochemistry, etc.), Physics (Biophysics), Materials Science Organic reaction analysis, Molecular text, Chemical equations, Medical question answering, Textbook QA, etc. Text QA 2025.01 EN Academic and research resources Manual nearly 1000 Data generation and review GPT-4o 2,500 Expert MCQ, Open-ended Acc, Calibration Error
SFE Astronomy, Chemistry, Life Science, Materials Science, Earth Science Protein structure, RNA structure, Molecular structure, etc. VQA 2025.06 EN, ZH Scientific databases, Academic and research resources Manual N/A Data generation and review GPT-4o 1,660 Expert MCQ, Open-ended Exact Match, LLM-as-a-Judge score, BERTScore, IoU
SciEval Chemistry, Physics, Biology Text (equations, molecules, chemical reactions, scientific QA, etc.) Text QA 2023.08 EN Comprehensive multi-source integration Semi-automated N/A Data review GPT-4 18,000 Undergraduate, Graduate MCQ, Open-ended, True/False Acc, BLEU, MSE
SciKnowEval Chemistry, Physics, Biology, Materials Science Textbook QA, Literature QA, SMILES, IUPAC, Equations, etc. Text QA, Classification, Regression 2024.06 EN Comprehensive multi-source integration, Academic and research resources, Scientific databases, Integration of existing datasets Semi-automated N/A Data review GPT-4o, GPT-3.5, Claude3, LLaMA, Qwen 70,203 Undergraduate, Graduate, Expert MCQ, True/False, Open-ended Acc, F1, BLEU, ROUGE, Smith-Waterman, Tanimoto
AGIEval Chemistry (GK-chemistry), Physics (GK-physics), Biology (GK-biology), Geography (GK-geography) Textbook, Literature, SMILES, IUPAC, Equations, etc. Text QA 2023.09 EN, ZH Academic and research resources, Integration of existing datasets Semi-automated N/A Data review ChatGPT, GPT-4 8,062 Secondary school, Undergraduate MCQ, Open-ended Acc, EM
ScienceAgentBench Bioinformatics, Computational chemistry, Geographical information science, Psychology & cognitive neuroscience Microscopy images, SMILES strings, Geospatial data, EEG, ECG, IMU, etc. VQA 2024.10 EN Academic and research resources, Integration of existing datasets Manual 9 Data generation and review N/A 102 tasks Expert Open-ended VER, SR, CodeBERTScore, GPT-4o Judge

🤖 Scientific Models

⬆ Back to Top

🌐 General-purpose

| Models | Domain | Parameters | Base LLM | Modality encoder | Release | Open-source |
| --- | --- | --- | --- | --- | --- | --- |
| Galactica | General Science | 120B | N/A | N/A | 2022.11 | ✅ |
| DARWIN | General Science | 7B | LLaMA-7B, Vicuna-7B | N/A | 2023.08 | ✅ |
| FORGE | General Science | 26B | GPT-NeoX | N/A | 2023.11 | ✅ |
| SciGLM | General Science | 6B / 32B | ChatGLM3 | N/A | 2024.01 | ✅ |
| SciDFM | General Science | 18.2B-A5.6B | N/A | N/A | 2024.09 | ✅ |
| OmniScience | General Science | 70B | LLaMA-3.1 | N/A | 2025.03 | ❌ |
| Intern-S1 | General Science | 241B-A28B / 8B | Qwen3-235B-A22B, Qwen3-8B | InternViT-6B, InternViT-300M | 2025.08 | ✅ |

⚛️ Physics

⬆ Back to Top

| Models | Domain | Parameters | Base LLM | Modality encoder | Release | Open-source |
| --- | --- | --- | --- | --- | --- | --- |
| MechGPT | Mechanics | 13B / 70B | LLaMA-2 | N/A | 2023.10 | ✅ |
| Xiwu | High Energy Physics | 7B / 13B | LLaMA, Vicuna, ChatGLM, Grok-1 | N/A | 2024.04 | ✅ |
| Poseidon | Partial Differential Equations | 0.02B / 0.2B / 0.6B | scOT | N/A | 2024.05 | ✅ |
| L3M | Astrophysics | 0.5B | Qwen2.5-0.5B-Instruct | N/A | 2025.06 | ❌ |

⚗️ Chemistry

⬆ Back to Top

| Models | Domain | Parameters | Base LLM | Modality encoder | Release | Open-source |
| --- | --- | --- | --- | --- | --- | --- |
| ChemLLM | Chemistry, Pharmacy | 7B | InternLM2 | N/A | 2024.06 | ✅ |
| LLM-RDF | Chemistry, Chemical Synthesis | N/A | GPT-4 | N/A | 2024.11 | ✅ |
| InstructMol | Biochemistry, Chemistry, Pharmacy | 7B | LLaMA | Molecular graph encoder | 2024.12 | ✅ |
| ChemDFM | Chemistry (Molecular Design), Chemistry | 13B | LLaMA-2 | N/A | 2025.07 | ✅ |
| ChemMLLM | Chemistry (Molecular Design), Pharmacy | 34B | Lumina-mGPT-34B-512 | VQGAN | 2025.08 | ✅ |
| Chemma | Chemistry, Organic Chemistry | 7B | LLaMA-2 | N/A | 2025.07 | ✅ |
| Chem3DLLM | Chemistry (Molecular Design), Pharmacy | 7B | Qwen2-7B | ESM-Encoder | 2025.08 | ✅ |

🪨 Materials Science

⬆ Back to Top

| Models | Domain | Parameters | Base LLM | Modality encoder | Release | Open-source |
| --- | --- | --- | --- | --- | --- | --- |
| SMILES-BERT | Materials Science | 30M | BERT-small | N/A | 2019.09 | ✅ |
| MolGPT | Materials Science | N/A | N/A | N/A | 2022.05 | ✅ |
| MOFormer | Materials Science | N/A | N/A | N/A | 2022.10 | ❌ |
| MatBert-bandgap | Materials Science | 110M | MatBERT | N/A | 2023.03 | ✅ |
| Regression Transformer | Materials Science | N/A | N/A | N/A | 2023.04 | ✅ |
| MolXPT | Materials Science | N/A | GPT-2 | N/A | 2023.05 | ✅ |
| xyztransformer | Materials Science | N/A | N/A | N/A | 2023.05 | ✅ |
| polyBERT | Materials Science | N/A | DeBERTa | N/A | 2023.07 | ✅ |
| GPT-MolBERTa | Materials Science | N/A | RoBERTa | N/A | 2023.10 | ❌ |
| ChemRLformer | Materials Science | N/A | N/A | N/A | 2023.10 | ❌ |
| CrystaLLM | Materials Science | 70B | LLaMA-2 70B | N/A | 2024.02 | ✅ |
| MatText | Materials Science | N/A | BERT | N/A | 2024.06 | ✅ |
| ChatMOF | Materials Science | N/A | GPT-4, GPT-3.5-turbo, GPT-3.5-turbo-16k | N/A | 2024.06 | ✅ |
| LHS2RHS | Materials Science | N/A | N/A | N/A | 2024.10 | ❌ |
| RHS2LHS | Materials Science | N/A | N/A | N/A | 2024.10 | ❌ |
| TGT2CEQ | Materials Science | N/A | N/A | N/A | 2024.10 | ❌ |
| CrystaLLM | Materials Science | 200M | GPT-2 | N/A | 2024.12 | ✅ |
| molT5-large | Materials Science | 770M | T5-large | N/A | 2024.12 | ✅ |
| Qwen2-KG | Materials Science | 72B | Qwen2-72B | N/A | 2025.02 | ✅ |
| LLM-Prop | Materials Science | 37M | T5-small | N/A | 2025.06 | ✅ |
| Crystal Synthesis LLM | Materials Science | 8B | LLaMA-3-8B | N/A | 2025.07 | ✅ |

🧬 Life Sciences

⬆ Back to Top

| Models | Domain | Parameters | Base LLM | Modality encoder | Release | Open-source |
| --- | --- | --- | --- | --- | --- | --- |
| ShizhenGPT | Healthcare and Medical Sciences | 7B / 32B | Qwen2.5 | Qwen2.5-VL vision encoder, Whisper-large-v3 | 2025.08 | ✅ |
| ProGen2 | Proteomics | 6.4B / 2.7B / 764M / 151M | N/A | N/A | 2022.06 | ✅ |
| BioGPT | Healthcare and Medical Sciences, General Biology | 347M | GPT-2 | N/A | 2022.10 | ✅ |
| ESM-2 | Proteomics | 15B / 3B / 650M / 150M / 35M / 8M | N/A | N/A | 2023.03 | ✅ |
| OphGLM | Healthcare and Medical Sciences | 6B | ChatGLM-6B | ConvNeXt | 2023.03 | ✅ |
| MedAlpaca | Healthcare and Medical Sciences | 7B / 13B | LLaMA | N/A | 2023.04 | ✅ |
| DoctorGLM | Healthcare and Medical Sciences | 6B | ChatGLM-6B | N/A | 2023.04 | ✅ |
| PMC-LLaMA | Healthcare and Medical Sciences | 13B | LLaMA | N/A | 2023.04 | ✅ |
| scGPT | Multi-omics | 30k / 300k / 3M / 33M | N/A | N/A | 2023.04 | ✅ |
| Med-PaLM | Healthcare and Medical Sciences | N/A | PaLM | N/A | 2023.05 | ❌ |
| Med-PaLM 2 | Healthcare and Medical Sciences | N/A | PaLM 2 | N/A | 2023.05 | ❌ |
| GatorTronS | Healthcare and Medical Sciences | 345M / 3.9B / 8.9B | GPT-3 | N/A | 2023.05 | ✅ |
| GatorTronGPT | Healthcare and Medical Sciences | 5B / 20B | GPT-3 | N/A | 2023.05 | ✅ |
| HuatuoGPT | Healthcare and Medical Sciences | 7B / 13B | Baichuan-7B, Ziya-LLaMA-13B-Pretrain-v1 | N/A | 2023.05 | ✅ |
| BiomedGPT | Healthcare and Medical Sciences | 33M / 93M / 182M | OFA | VQ-GAN | 2023.05 | ✅ |
| ClinicalGPT | Healthcare and Medical Sciences | 7B | BLOOM-7B | N/A | 2023.06 | ✅ |
| GENA-LM | Molecular and Cell Biology, Multi-omics | 110M / 336M | BERT | N/A | 2023.06 | ✅ |
| NYUTron | Healthcare and Medical Sciences, Neuroscience, Pharmacy | 190M | BERT | N/A | 2023.06 | ✅ |
| ChatDoctor | Healthcare and Medical Sciences | 7B | LLaMA | N/A | 2023.06 | ✅ |
| SoulChat | Neuroscience, Healthcare and Medical Sciences | 6B | ChatGLM-6B | N/A | 2023.07 | ✅ |
| DNAGPT | Molecular and Cell Biology, Multi-omics | 3B | GPT | N/A | 2023.07 | ✅ |
| Med-Flamingo | Healthcare and Medical Sciences | 9B | OpenFlamingo | OpenFlamingo | 2023.07 | ✅ |
| DISC-MedLLM | Healthcare and Medical Sciences | 13B | Baichuan-13B | N/A | 2023.08 | ✅ |
| IvyGPT | Healthcare and Medical Sciences | 33B | LLaMA-33B | N/A | 2023.08 | ❌ |
| Zhongjing | Healthcare and Medical Sciences | 13B | Ziya-LLaMA-13B-V1 | N/A | 2023.08 | ✅ |
| Radiology-Llama2 | Healthcare and Medical Sciences | 7B | LLaMA-2 | N/A | 2023.08 | ✅ |
| RadFM | Healthcare and Medical Sciences | 9B | MedLLaMA-13B | 3D ViT | 2023.08 | ✅ |
| CPLLM | Healthcare and Medical Sciences | 13B | Llama2-13B | N/A | 2023.09 | ✅ |
| DRG-LLaMA | Healthcare and Medical Sciences | 7B | LLaMA-7B | N/A | 2023.09 | ✅ |
| MindGPT | Neuroscience, Healthcare and Medical Sciences | 124M | GPT-2 | CLIP-ViT-B/32 | 2023.09 | ❌ |
| BioinspiredLLM | General Biology, Molecular and Cell Biology, Proteomics | 13B | LLaMA-2 | N/A | 2023.09 | ✅ |
| Qilin-Med | Healthcare and Medical Sciences | 7B | Baichuan-7B | N/A | 2023.10 | ❌ |
| CXR-LLaVA | Healthcare and Medical Sciences | 7B | LLaMA-2 | CLIP ViT-L/16 | 2023.10 | ✅ |
| InstructProtein | Proteomics | 1.3B | OPT-1.3B | N/A | 2023.10 | ❌ |
| ChiMed-GPT | Healthcare and Medical Sciences | 13B | Ziya-13B-v2 | N/A | 2023.11 | ✅ |
| HuatuoGPT-II | Healthcare and Medical Sciences | 7B / 13B | Baichuan2-7B-Base, Baichuan2-13B-Base | N/A | 2023.11 | ✅ |
| Taiyi-LLM | Healthcare and Medical Sciences | 7B | Qwen-7B-base | N/A | 2023.11 | ✅ |
| Meditron | Healthcare and Medical Sciences | 7B / 70B | LLaMA-2 | N/A | 2023.11 | ✅ |
| MAIRA-1 | Healthcare and Medical Sciences | 7B | Vicuna-7B | RAD-DINO | 2023.11 | ✅ |
| MAIRA-2 | Healthcare and Medical Sciences | 7B | Vicuna-7B-v1.5 | RAD-DINO | 2023.11 | ✅ |
| Neuro-GPT | Neuroscience, Healthcare and Medical Sciences | 124M | GPT-2 | EEG Encoder | 2023.11 | ✅ |
| PLLaMa | Molecular and Cell Biology, General Biology | 7B / 13B | LLaMA-2 | N/A | 2024.01 | ✅ |
| EEG-GPT | Neuroscience | N/A | GPT-3 | EEG Encoder | 2024.01 | ❌ |
| BioMistral | Healthcare and Medical Sciences, Molecular and Cell Biology | 7B | Mistral-7B-Instruct-v0.1 | N/A | 2024.02 | ✅ |
| MMed-LLaMA 3 | Healthcare and Medical Sciences | 8B | LLaMA 3 | N/A | 2024.02 | ✅ |
| ProLLaMA | Proteomics | 7B | LLaMA-2 | N/A | 2024.02 | ✅ |
| ProtLLM | Proteomics | 7B | LLaMA-7B | ProtST (protein) | 2024.03 | ✅ |
| BrainGPT | Neuroscience | 7B | Mistral-7B | N/A | 2024.03 | ✅ |
| Apollo | Healthcare and Medical Sciences | 0.5B / 1.8B / 2B / 6B / 7B | Qwen | N/A | 2024.03 | ✅ |
| Med-Gemini | Healthcare and Medical Sciences | N/A | Gemini 1.5 Pro | Custom encoders (multimodal) | 2024.04 | ❌ |
| UMBRAE | Neuroscience | 7B | Vicuna-7B | CLIP-ViT/L-14 (vision), Encoder (fMRI) | 2024.04 | ✅ |
| SeedLLM | Agronomy | 7B | Qwen2.5 | N/A | 2024.04 | ❌ |
| AlphaFold3 | Molecular and Cell Biology, Proteomics, Pharmacy, Neuroscience | N/A | N/A | Input Feature Embedder | 2024.05 | ❌ |
| DrugLLM | Pharmacy | 7B | LLaMA-7B | N/A | 2024.05 | ❌ |
| LLaVA-Med | Healthcare and Medical Sciences | N/A | Vicuna-7B | CLIP ViT-L/14 | 2024.05 | ✅ |
| CareGPT | Healthcare and Medical Sciences | 7B | LLaMA-2 | N/A | 2024.05 | ✅ |
| ProtT3 | Proteomics | N/A | Galactica 1.3B | ESM-2 (protein) | 2024.05 | ✅ |
| MolecularGPT | Molecular and Cell Biology | N/A | LLaMA | N/A | 2024.06 | ✅ |
| HuatuoGPT-Vision | Healthcare and Medical Sciences | 7B / 34B | Qwen2-7B | Qwen Image Encoder (vision) | 2024.06 | ✅ |
| NeuroLM | Neuroscience | 254M / 500M / 1.7B | GPT-2 | Encoder (EEG) | 2024.08 | ✅ |
| RNAGPT | Molecular and Cell Biology, Multi-omics | 8B | LLaMA-3 | RNA-FM sequence encoder (RNA) | 2024.10 | ❌ |
| AgroGPT | Agronomy | 3B / 7B | LLaVA-1.5, Mipha | CLIP-ViT-L/14 (vision), SigLIP | 2024.10 | ✅ |
| LLaMA-Gene | Molecular and Cell Biology, Proteomics | 7B | LLaMA-7B | N/A | 2024.11 | ✅ |
| GMAI-VL | Healthcare and Medical Sciences | 7B | InternLM | Image Encoder (vision) | 2024.11 | ✅ |
| HuatuoGPT-o1 | Healthcare and Medical Sciences | 7B / 8B / 70B / 72B | LLaMA-3.1, Qwen2.5 | N/A | 2024.12 | ✅ |
| Evolla | Proteomics | 10B / 80B | LLaMA-3 8B | SaProt (protein) | 2025.01 | ✅ |
| UniMind | Neuroscience | 7B | InternLM2.5 | Encoder (EEG) | 2025.01 | ❌ |
| NatureLM | Pharmacy, Molecular and Cell Biology, Proteomics, Materials Science | 46.7B | Mixtral 8x7B | N/A | 2025.02 | ❌ |
| MindLLM | Neuroscience, Healthcare and Medical Sciences | 7B | Vicuna-7B | Encoder (fMRI) | 2025.02 | ✅ |
| MedVLM-R1 | Healthcare and Medical Sciences | 2B | Qwen2-VL | Qwen Image Encoder (vision) | 2025.02 | ✅ |
| AlphaGenome | Molecular and Cell Biology, Multi-omics | N/A | N/A | N/A | 2025.05 | ✅ |
| ChatNT | Molecular and Cell Biology, Proteomics, Multi-omics | 7B | Vicuna-7B | Nucleotide Transformer v2 (DNA) | 2025.06 | ✅ |
| Lingshu | Healthcare and Medical Sciences | 7B / 32B | Qwen | N/A | 2025.06 | ✅ |
| PodGPT | Healthcare and Medical Sciences | N/A | Gemma, Mixtral, LLaMA | N/A | 2025.07 | ✅ |
| MedGemma | Healthcare and Medical Sciences | 4B / 27B | Gemma 3 | SigLIP Image Encoder (vision) | 2025.07 | ✅ |

🌌 Astronomy

⬆ Back to Top

| Models | Domain | Parameters | Base LLM | Modality encoder | Release | Open-source |
| --- | --- | --- | --- | --- | --- | --- |
| AstroLLaMA-2-7B | Astronomy | 7B | Llama-2 LLM | N/A | 2023.09 | ✅ |
| AstroLLaMA-3-8B | Astronomy | 8B | LLaMA-3-8B LLM | N/A | 2024.09 | ✅ |
| AstroLLaMA-2-70B | Astronomy | 70B | LLaMA-2-70B LLM | N/A | 2024.09 | ✅ |
| AstroSage-LLaMA-3.1-8B | Astronomy | 8B | Llama-3.1-8B LLM | N/A | 2025.04 | ✅ |
| AstroLLaVA-7B | Astronomy | 7B | LLaVA 1.5 LLM | CLIP-ViT/L-14 (vision) | 2025.04 | ✅ |
| AstroSage-LLaMA-3.1-70B | Astronomy | 70B | Llama-3.1-70B LLM | N/A | 2025.05 | ✅ |

🌍 Earth Science

⬆ Back to Top

| Models | Domain | Parameters | Base LLM | Modality encoder | Release | Open-source |
| --- | --- | --- | --- | --- | --- | --- |
| OceanGPT | Hydrosphere, Biosphere, Lithosphere, Remote Sensing | 7B | LLaMA, Qwen | N/A | 2023.03 | ✅ |
| K2 | Lithosphere, Remote Sensing | 7B | LLaMA | N/A | 2023.08 | ✅ |
| GeoChat | Remote Sensing, Lithosphere | 7B | Vicuna-v1.5 | N/A | 2023.11 | ✅ |
| SkyEyeGPT | Remote Sensing | 7B | N/A | N/A | 2024.01 | ✅ |
| TeoChat | Remote Sensing, Lithosphere | 7B | Video-LLaVA | N/A | 2024.10 | ✅ |
| EarthMarker | Remote Sensing | 13B | LLaMA-2 | N/A | 2024.11 | ✅ |
| EarthDial | Remote Sensing | 4B | Phi-3-mini | N/A | 2024.12 | ✅ |
| GeoPixel | Remote Sensing, Lithosphere | 7B | IXC-2.5 | N/A | 2025.01 | ✅ |
| EagleVision | Remote Sensing | 1B / 2B / 4B / 7B | Qwen2-VL-72B, GPT-4o | N/A | 2025.03 | ✅ |
| ClimateChat | Lithosphere, Climate | 7B | JiuZhou | N/A | 2025.03 | ✅ |
| GeoGPT | Lithosphere, Remote Sensing | 70B | Llama3.1-70B, Qwen2.5-72B | N/A | 2025.04 | ✅ |
| GeoLLaVA-8K | Remote Sensing, Lithosphere | 7B | LongVA | N/A | 2025.05 | ✅ |

📅 Star History

⬆ Back to Top

Star History Chart
