
Commit 94e791f

Author: unknown
Committed by: leibai
1 parent 2c8371a

File tree: 16 files changed, +166 -61 lines

github_images/pipeline.png (-943 Bytes)

github_images/teaser.png (54.2 KB)

index.html

Lines changed: 7 additions & 7 deletions
@@ -265,7 +265,7 @@ <h1 class="fade-in-up delay-100 text-5xl md:text-7xl font-bold text-slate-900 le
 </h1>
 
 <p class="fade-in-up delay-200 text-xl text-slate-600 max-w-3xl mx-auto leading-relaxed px-4">
-Scientific General Intelligence (SGI) is defined as an AI system that can autonomously navigate the full, iterative cycle of scientific inquiry—Deliberation, Conception, Action, and Perception—with the versatility and proficiency of a human scientist. SGI-Bench operationalizes this definition via four scientist-aligned task families: deep research, idea generation, AI-assisted experiments (dry/wet), and multimodal experimental reasoning. The benchmark spans 10 disciplines and more than 1,000 expert-curated samples inspired by Science's 125 Big Questions.
+Scientific General Intelligence (SGI) is defined as an AI system that can autonomously navigate the full, iterative cycle of scientific inquiry—Deliberation, Conception, Action, and Perception—with the versatility and proficiency of a human scientist. SGI-Bench operationalizes this definition via four scientist-aligned task families: deep research, idea generation, dry/wet experiments, and multimodal experimental reasoning. The benchmark spans 10 disciplines and more than 1,000 expert-curated samples inspired by Science's 125 Big Questions.
 </p>
 
 <div class="fade-in-up delay-300 mt-12 p-2 glass rounded-2xl mx-auto max-w-full sm:max-w-4xl">
@@ -309,7 +309,7 @@ <h4 class="font-semibold text-slate-900">Conception</h4>
 <div class="flex-shrink-0 w-10 h-10 rounded-full bg-green-50 flex items-center justify-center text-green-600"><i class="fas fa-code"></i></div>
 <div>
 <h4 class="font-semibold text-slate-900">Action</h4>
-<p class="text-sm text-slate-500">AI-assisted experiments: dry (code/simulation) and wet (lab protocol).</p>
+<p class="text-sm text-slate-500">Scientific experiments: dry (code generation) and wet (lab protocol development).</p>
 </div>
 </div>
 <div class="flex gap-4 p-4 glass rounded-xl border-l-4 border-orange-500">
@@ -347,7 +347,7 @@ <h4 class="font-semibold text-slate-900">Raw Corpus</h4>
 <div class="flex-shrink-0 w-10 h-10 rounded-full bg-indigo-50 flex items-center justify-center text-indigo-600"><i class="fas fa-pen-nib"></i></div>
 <div>
 <h4 class="font-semibold text-slate-900">Question Construction</h4>
-<p class="text-sm text-slate-500">100+ graduate/PhD annotators; continuous expert review for scientific value.</p>
+<p class="text-sm text-slate-500">100+ Master's and PhD holders; continuous expert review for scientific value.</p>
 </div>
 </div>
 <div class="flex gap-4 p-4 glass rounded-xl border-l-4 border-amber-500">
@@ -366,7 +366,7 @@ <h4 class="font-semibold text-slate-900">Difficulty Filtering</h4>
 </div>
 </div>
 <p class="text-base md:text-base text-slate-600 leading-relaxed">
-SGI-Bench data is scientist-aligned and high-fidelity: an expert-sourced corpus spanning 10 disciplines (inspired by Science’s 125 Big Questions), questions constructed by 100+ graduate/PhD annotators with continuous scientist-in-the-loop review, multi-stage cleaning (rules + model checks + expert QA) to ensure executability and unique answers, and difficulty filtering that removes items solved by >50% strong LLMs—yielding authentic, challenging, and broadly representative scientific tasks.
+SGI-Bench data is scientist-aligned and high-fidelity: an expert-sourced corpus spanning 10 disciplines (inspired by Science’s 125 Big Questions), questions constructed by 100+ Master's and PhD holders with continuous scientist-in-the-loop review, multi-stage cleaning (rules + model checks + expert QA) to ensure executability and unique answers, and difficulty filtering that removes items solved by >50% strong LLMs—yielding authentic, challenging, and broadly representative scientific tasks.
 </p>
 </div>
 </div>
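The difficulty-filtering rule in this hunk (dropping any item that more than 50% of strong LLMs already solve) reduces to a one-line predicate. A minimal Python sketch, with an assumed `solve_rates` mapping from item id to the fraction of reference models that answered correctly; the data layout is illustrative, not SGI-Bench's actual pipeline:

    # Minimal sketch of the ">50% of strong LLMs" difficulty filter.
    # `solve_rates` is an assumed mapping: item id -> fraction of strong
    # reference models that solved the item in calibration runs.
    def filter_by_difficulty(items, solve_rates, max_solve_rate=0.5):
        """Keep only items that a majority of strong models still fail."""
        return [it for it in items if solve_rates[it["id"]] <= max_solve_rate]

    items = [{"id": "q1"}, {"id": "q2"}]
    solve_rates = {"q1": 0.8, "q2": 0.3}  # toy calibration results
    print(filter_by_difficulty(items, solve_rates))  # -> [{'id': 'q2'}]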
@@ -637,7 +637,7 @@ <h3 class="text-2xl font-bold text-slate-800 flex items-center gap-2">
 </div>
 </div>
 <div class="mt-4 text-slate-500 text-sm leading-relaxed bg-white/50 p-4 rounded-lg border border-slate-100">
-<p><strong class="text-slate-700">Overview Results Across SGI-Bench Tasks:</strong> Aggregated performance across Deep Research, Idea Generation, Dry/Wet Experiment, and Scientific Experimental Reasoning. The scores for Deep Research are based on the exact match metric (the strictest metric). Idea Generation scores are the average of four metrics evaluating ideas. Dry Experiment scores are based on PassAll@5 (the strictest metric). Wet Experiment scores are the average of action sequence similarity and parameter accuracy. Experimental Reasoning scores are based on the multi-choice accuracy metric (the strictest metric). The SGI-Score is the average across these tasks, reflecting the overall capability of an AI model in various scientific research scenarios. Note: The Experimental Reasoning metrics for Qwen3-Max and Qwen3-8B come from their multimodal versions.</p>
+<p><strong class="text-slate-700">Overview Results Across SGI-Bench Tasks:</strong> Aggregated performance across Deep Research, Idea Generation, Dry/Wet Experiment, and Experimental Reasoning. The scores for Deep Research are based on the exact match metric (the strictest metric). Idea Generation scores are the average of four metrics evaluating ideas. Dry Experiment scores are based on PassAll@5 (the strictest metric). Wet Experiment scores are the average of action sequence similarity and parameter accuracy. Experimental Reasoning scores are based on the multi-choice accuracy metric (the strictest metric). The SGI-Score is the average across these tasks, reflecting the overall capability of an AI model in various scientific research scenarios. Note: The Experimental Reasoning metrics for Qwen3-Max and Qwen3-8B come from their multimodal versions.</p>
 </div>
 </div>
 
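The caption in this hunk fully specifies how per-task scores combine, so a short Python sketch can make the aggregation concrete; the function and argument names are illustrative, not taken from the SGI-Bench codebase:

    from statistics import mean

    # Illustrative aggregation matching the caption above: one score per
    # task family in [0, 1], averaged into the SGI-Score.
    def sgi_score(deep_research_em, idea_metrics, dry_passall5,
                  wet_seq_sim, wet_param_acc, reasoning_mc_acc):
        task_scores = {
            "deep_research": deep_research_em,            # exact match
            "idea_generation": mean(idea_metrics),        # mean of four idea metrics
            "dry_experiment": dry_passall5,               # PassAll@5
            "wet_experiment": mean([wet_seq_sim, wet_param_acc]),
            "experimental_reasoning": reasoning_mc_acc,   # multiple-choice accuracy
        }
        return mean(task_scores.values()), task_scores

    overall, per_task = sgi_score(0.15, [0.6, 0.5, 0.4, 0.7], 0.20, 0.55, 0.60, 0.45)
    print(round(overall, 3))  # toy numbers, not reported results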
@@ -646,7 +646,7 @@ <h3 class="text-2xl font-bold text-slate-800 flex items-center gap-2">
 <div class="flex flex-col sm:flex-row justify-between items-end mb-4 gap-4">
 <h3 class="text-2xl font-bold text-slate-800 flex items-center gap-2">
 <span class="w-8 h-8 rounded-lg bg-blue-100 text-blue-600 flex items-center justify-center text-sm"><i class="fas fa-magnifying-glass"></i></span>
-Deep Research
+Scientific Deep Research
 </h3>
 </div>
 
@@ -862,7 +862,7 @@ <h3 class="text-2xl font-bold text-slate-800 flex items-center gap-2">
 <h2 class="text-2xl font-bold text-slate-900 mb-6 text-center">Conclusion</h2>
 <div class="text-slate-600 leading-relaxed space-y-4">
 <p>
-This work advances the study of Scientific General Intelligence (SGI) from both theory and practice. Grounded in the Practical Inquiry Model, we formalize SGI as the capacity to navigate the iterative cycle of Deliberation, Conception, Action, and Perception with the versatility of a human scientist. Building on this principle-grounded definition, we operationalize SGI through SGI-Bench, a comprehensive, scientist-aligned benchmark that instantiates four core task families: Scientific Deep Research, Idea Generation, AI-Assisted Scientific Experiment (dry/wet), and Scientific Experimental Reasoning. Complemented by our agentic evaluation framework and multi-metric protocol, SGI-Bench enables scalable, transparent, and domain-faithful assessment.
+This work advances the study of Scientific General Intelligence (SGI) from both theory and practice. Grounded in the Practical Inquiry Model, we formalize SGI as the capacity to navigate the iterative cycle of Deliberation, Conception, Action, and Perception with the versatility of a human scientist. Building on this principle-grounded definition, we operationalize SGI through SGI-Bench, a comprehensive, scientist-aligned benchmark that instantiates four core task families: Scientific Deep Research, Idea Generation, Dry/Wet Experiment, and Experimental Reasoning. Complemented by our agentic evaluation framework and multi-metric protocol, SGI-Bench enables scalable, transparent, and domain-faithful assessment.
 </p>
 <p>
 Experiments reveal a consistent pattern: in Deep Research, models show step-level alignment but low exact-match accuracy (10--20%), with brittleness in quantitative reasoning; in Idea Generation, hypotheses are fluent but underspecified and infeasible; in Dry Experiment, code is executable but PassAll@k remains low; in Wet Experiment, sequences show omissions and misordering; and in Experimental Reasoning, causal reasoning outperforms comparative, with persistent multimodal challenges. These findings highlight gaps between linguistic fluency and integrated scientific cognition. Moreover, SGI exhibits dynamic capacity: Test-Time Reinforcement Learning with novelty rewards improves idea generation without reference answers.
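The PassAll@k metric cited in this conclusion is stricter than the usual pass@k: a task counts as solved only if at least one of its k samples passes every unit test, which is why executable code can still score low. A rough Python sketch under that reading (the nested-list layout is a hypothetical stand-in for the benchmark's real records):

    # PassAll@k sketch: a sample passes only if it satisfies *all* tests;
    # a task scores 1 if any of its k samples does so.
    def passall_at_k(sample_results):
        """sample_results: k lists of per-test booleans, one list per sample."""
        return any(all(tests) for tests in sample_results)

    def passall_at_k_score(tasks):
        """tasks: one entry of k-sample results per task; returns fraction solved."""
        return sum(passall_at_k(t) for t in tasks) / len(tasks)

    # Toy example: 2 tasks, k=2 samples each, 3 unit tests per sample.
    tasks = [
        [[True, True, True], [True, False, True]],   # solved by its first sample
        [[True, True, False], [False, True, True]],  # runs, but never passes all
    ]
    print(passall_at_k_score(tasks))  # -> 0.5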

md_images/pipeline.png (-4.66 KB)

md_images/teaser.png (-18.5 KB)

paper.pdf (58.2 KB, binary file)

paper/imgs/pipeline.png (-943 Bytes)

paper/imgs/teaser.png (54.2 KB)

paper/main.bib

Lines changed: 77 additions & 8 deletions
@@ -612,17 +612,15 @@ @misc{Llama4_Release2025
 author = {Meta AI},
 year = {2025},
 month = jul,
-howpublished = {\url{https://ai.meta.com/blog/llama-4-multimodal-intelligence/}},
-note = {Official Blog Post}
+howpublished = {\url{https://ai.meta.com/blog/llama-4-multimodal-intelligence/}}
 }
 
 @techreport{GPT5_SystemCard2025,
 title = {GPT-5 System Card},
 author = {OpenAI},
 institution = {OpenAI},
 year = {2025},
-url = {https://cdn.openai.com/gpt-5-system-card.pdf},
-note = {[27]}
+url = {https://cdn.openai.com/gpt-5-system-card.pdf}
 }
 
 @techreport{GPT5.1_Addendum2025,
@@ -631,16 +629,23 @@ @techreport{GPT5.1_Addendum2025
 institution = {OpenAI},
 year = {2025},
 month = nov,
-url = {https://cdn.openai.com/pdf/4173ec8d-1229-47db-96de-06d87147e07e/5_1_system_card.pdf},
-note = {[22]}
+url = {https://cdn.openai.com/pdf/4173ec8d-1229-47db-96de-06d87147e07e/5_1_system_card.pdf}
+}
+
+@techreport{GPT5.2,
+title = {Update to GPT-5 System Card: GPT-5.2},
+author = {OpenAI},
+institution = {OpenAI},
+year = {2025},
+month = dec,
+url = {https://cdn.openai.com/pdf/3a4153c8-c748-4b71-8e31-aecbde944f8d/oai_5_2_system-card.pdf}
 }
 
 @misc{Gemini3_DeepMind2025,
 title = {Gemini 3 Model Card},
 author = {Google DeepMind},
 year = {2025},
-howpublished = {\url{https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-3-Pro-Model-Card.pdf}},
-note = {}
+howpublished = {\url{https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-3-Pro-Model-Card.pdf}}
 }
 
 @article{bai2025intern,
@@ -1070,4 +1075,68 @@ @article{hu2025survey
 author={Hu, Ming and Ma, Chenglong and Li, Wei and Xu, Wanghan and Wu, Jiamin and Hu, Jucheng and Li, Tianbin and Zhuang, Guohang and Liu, Jiaqi and Lu, Yingzhou and others},
 journal={arXiv preprint arXiv:2508.21148},
 year={2025}
+}
+
+@inproceedings{yang2024moose,
+title={MOOSE-Chem: Large Language Models for Rediscovering Unseen Chemistry Scientific Hypotheses},
+author={Yang, Zonglin and Liu, Wanhao and Gao, Ben and Xie, Tong and Li, Yuqiang and Ouyang, Wanli and Poria, Soujanya and Cambria, Erik and Zhou, Dongzhan},
+booktitle={Proceedings of the International Conference on Learning Representations (ICLR)},
+year={2025}
+}
+
+@book{popper2005logic,
+title={The logic of scientific discovery},
+author={Popper, Karl},
+year={2005},
+publisher={Routledge}
+}
+
+@book{popper2014conjectures,
+title={Conjectures and refutations: The growth of scientific knowledge},
+author={Popper, Karl},
+year={2014},
+publisher={Routledge}
+}
+
+@book{bacon1878novum,
+title={Novum organum},
+author={Bacon, Francis},
+year={1878},
+publisher={Clarendon Press}
+}
+
+@article{Wan2025DeepResearchAT,
+title={DeepResearch Arena: The First Exam of LLMs' Research Abilities via Seminar-Grounded Tasks},
+author={Haiyuan Wan and Cheng Yang and Junchi Yu and Meiqi Tu and Jiaxuan Lu and Di Yu and Jianbao Cao and Ben Gao and Jiaqing Xie and Aoran Wang and Wenlong Zhang and Philip Torr and Dongzhan Zhou},
+journal={ArXiv},
+year={2025},
+volume={abs/2509.01396},
+url={https://api.semanticscholar.org/CorpusID:281080495}
+}
+
+@article{Bosse2025DeepRB,
+title={Deep Research Bench: Evaluating AI Web Research Agents},
+author={Nikos I. Bosse and Jon Evans and Robert G. Gambee and Daniel Hnyk and Peter M{\"u}hlbacher and Lawrence Phillips and Dan Schwarz and Jack Wildman},
+journal={ArXiv},
+year={2025},
+volume={abs/2506.06287},
+url={https://api.semanticscholar.org/CorpusID:279251730}
+}
+
+@article{Liu2025BioProBenchCD,
+title={BioProBench: Comprehensive Dataset and Benchmark in Biological Protocol Understanding and Reasoning},
+author={Yuyang Liu and Liuzhenghao Lv and Xiancheng Zhang and Li Yuan and Yonghong Tian},
+journal={ArXiv},
+year={2025},
+volume={abs/2505.07889},
+url={https://api.semanticscholar.org/CorpusID:278534452}
+}
+
+@article{Zhou2024LabSafetyBB,
+title={LabSafety Bench: Benchmarking LLMs on Safety Issues in Scientific Labs},
+author={Yujun Zhou and Jingdong Yang and Kehan Guo and Pin-Yu Chen and Tian Gao and Werner Geyer and Nuno Moniz and Nitesh V. Chawla and Xiangliang Zhang},
+journal={ArXiv},
+year={2024},
+volume={abs/2410.14182},
+url={https://api.semanticscholar.org/CorpusID:273482719}
 }

paper/sections/0-abstract.tex

Lines changed: 1 addition & 1 deletion
@@ -14,7 +14,7 @@
 \textbf{Abstract:}
 
 Despite advances in scientific AI, a coherent framework for Scientific General Intelligence (SGI)—the ability to autonomously conceive, investigate, and reason across scientific domains—remains lacking.
-We present a principled SGI definition grounded in the Practical Inquiry Model (PIM: Deliberation, Conception, Action, Perception) and operationalize it via four scientist-aligned tasks: deep research, idea generation, AI-assisted experiments (dry/wet), and experimental reasoning.
+We present an operational SGI definition grounded in the Practical Inquiry Model (PIM: Deliberation, Conception, Action, Perception) and operationalize it via four scientist-aligned tasks: deep research, idea generation, dry/wet experiments, and experimental reasoning.
 SGI-Bench comprises over 1,000 expert-curated, cross-disciplinary samples inspired by Science's 125 Big Questions, enabling systematic evaluation of state-of-the-art LLMs. Results reveal gaps: low exact match (10--20\%) in deep research despite step-level alignment; ideas lacking feasibility and detail; high code executability but low execution result accuracy in dry experiments; low sequence fidelity in wet protocols; and persistent multimodal comparative-reasoning challenges.
 We further introduce Test-Time Reinforcement Learning (TTRL), which optimizes retrieval-augmented novelty rewards at inference, enhancing hypothesis novelty without reference answers.
 Together, our PIM-grounded definition, workflow-centric benchmark, and empirical insights establish a foundation for AI systems that genuinely participate in scientific discovery.
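The abstract's TTRL optimizes a retrieval-augmented novelty reward with no reference answers. A hedged Python sketch of one plausible reward shape, scoring a hypothesis by its distance from retrieved prior work; `embed` and `retrieve` here are hypothetical stand-ins for the paper's actual encoder and retriever, not its implementation:

    import numpy as np

    def embed(text: str) -> np.ndarray:
        """Toy per-text encoder (stand-in for a real embedding model)."""
        rng = np.random.default_rng(abs(hash(text)) % (2**32))
        v = rng.standard_normal(64)
        return v / np.linalg.norm(v)

    def retrieve(hypothesis: str, corpus: list[str], top_k: int = 3) -> list[str]:
        """Return the top_k corpus entries most similar to the hypothesis."""
        h = embed(hypothesis)
        return sorted(corpus, key=lambda d: -float(embed(d) @ h))[:top_k]

    def novelty_reward(hypothesis: str, corpus: list[str]) -> float:
        """Higher when the hypothesis sits far from its nearest retrieved neighbors."""
        h = embed(hypothesis)
        sims = [float(embed(d) @ h) for d in retrieve(hypothesis, corpus)]
        return 1.0 - max(sims)  # 1 minus max cosine similarity to prior work

    corpus = ["prior result A", "prior result B", "prior result C", "prior result D"]
    print(round(novelty_reward("a genuinely new mechanism", corpus), 3))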
