Scientific General Intelligence (SGI) is defined as an AI system that can autonomously navigate the full, iterative cycle of scientific inquiry—Deliberation, Conception, Action, and Perception—with the versatility and proficiency of a human scientist. SGI-Bench operationalizes this definition via four scientist-aligned task families: deep research, idea generation, dry/wet experiments, and multimodal experimental reasoning. The benchmark spans 10 disciplines and more than 1,000 expert-curated samples inspired by Science's 125 Big Questions.
SGI-Bench data is scientist-aligned and high-fidelity: an expert-sourced corpus spanning 10 disciplines (inspired by Science’s 125 Big Questions), questions constructed by 100+ Master's and PhD holders with continuous scientist-in-the-loop review, multi-stage cleaning (rules + model checks + expert QA) to ensure executability and unique answers, and difficulty filtering that removes items solved by >50% strong LLMs—yielding authentic, challenging, and broadly representative scientific tasks.
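As a rough illustration of the difficulty-filtering step, here is a minimal sketch assuming each candidate item carries per-model pass/fail verdicts from the strong-LLM panel (the data layout and function name are hypothetical, not the benchmark's released pipeline):

# Minimal sketch of the difficulty filter: items solved by more than
# 50% of a panel of strong LLMs are considered too easy and dropped.
from typing import Dict, List

def filter_by_difficulty(
    items: List[dict],
    results: Dict[str, Dict[str, bool]],  # item_id -> {model_name: solved?}
    threshold: float = 0.5,
) -> List[dict]:
    """Keep only items whose panel solve rate is at most `threshold`."""
    kept = []
    for item in items:
        verdicts = results[item["id"]]
        solve_rate = sum(verdicts.values()) / len(verdicts)
        if solve_rate <= threshold:  # solved by >50% of models -> dropped
            kept.append(item)
    return kept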
<p><strong class="text-slate-700">Overview Results Across SGI-Bench Tasks:</strong> Aggregated performance across Deep Research, Idea Generation, Dry/Wet Experiment, and Experimental Reasoning. The scores for Deep Research are based on the exact match metric (the strictest metric). Idea Generation scores are the average of four metrics evaluating ideas. Dry Experiment scores are based on PassAll@5 (the strictest metric). Wet Experiment scores are the average of action sequence similarity and parameter accuracy. Experimental Reasoning scores are based on the multi-choice accuracy metric (the strictest metric). The SGI-Score is the average across these tasks, reflecting the overall capability of an AI model in various scientific research scenarios. Note: The Experimental Reasoning metrics for Qwen3-Max and Qwen3-8B come from their multimodal versions.</p>
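For concreteness, a minimal sketch of this aggregation, assuming each per-task score is already normalized to [0, 1] (the field names are hypothetical; the benchmark's released scoring code may differ):

# Minimal sketch of the SGI-Score aggregation described above.
from statistics import mean

def sgi_score(task_scores: dict) -> float:
    """Average the five task-level scores into a single SGI-Score."""
    deep_research = task_scores["deep_research_exact_match"]
    idea_generation = mean(task_scores["idea_generation_metrics"])  # four idea metrics
    dry_experiment = task_scores["dry_experiment_pass_all_at_5"]
    wet_experiment = mean([
        task_scores["wet_action_sequence_similarity"],
        task_scores["wet_parameter_accuracy"],
    ])
    reasoning = task_scores["experimental_reasoning_accuracy"]
    return mean([deep_research, idea_generation, dry_experiment,
                 wet_experiment, reasoning])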
This work advances the study of Scientific General Intelligence (SGI) from both theory and practice. Grounded in the Practical Inquiry Model, we formalize SGI as the capacity to navigate the iterative cycle of Deliberation, Conception, Action, and Perception with the versatility of a human scientist. Building on this principle-grounded definition, we operationalize SGI through SGI-Bench, a comprehensive, scientist-aligned benchmark that instantiates four core task families: Scientific Deep Research, Idea Generation, Dry/Wet Experiment, and Experimental Reasoning. Complemented by our agentic evaluation framework and multi-metric protocol, SGI-Bench enables scalable, transparent, and domain-faithful assessment.
</p>
<p>
Experiments reveal a consistent pattern: in Deep Research, models show step-level alignment but low exact-match accuracy (10--20%), with brittleness in quantitative reasoning; in Idea Generation, hypotheses are fluent but underspecified and infeasible; in Dry Experiment, code is executable but PassAll@k remains low; in Wet Experiment, action sequences show omissions and misordering; and in Experimental Reasoning, causal reasoning outperforms comparative reasoning, with persistent multimodal challenges. These results highlight the gap between linguistic fluency and integrated scientific cognition. Moreover, SGI is a dynamic capacity: Test-Time Reinforcement Learning (TTRL) with novelty rewards improves idea generation without reference answers.
</p>
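A minimal sketch of what a retrieval-augmented novelty reward of this kind could look like, assuming generated ideas and retrieved prior-work abstracts are compared in a shared embedding space (the embedding model, retrieval corpus, and exact reward form are assumptions, not the paper's implementation):

# Illustrative sketch of a retrieval-augmented novelty reward in the
# spirit of the TTRL setup described above.
import numpy as np

def novelty_reward(idea_emb: np.ndarray, retrieved_embs: np.ndarray) -> float:
    """Reward an idea by its distance from the nearest retrieved prior work.

    idea_emb:       (d,) embedding of the generated hypothesis
    retrieved_embs: (k, d) embeddings of retrieved related abstracts
    """
    idea = idea_emb / np.linalg.norm(idea_emb)
    prior = retrieved_embs / np.linalg.norm(retrieved_embs, axis=1, keepdims=True)
    max_sim = float((prior @ idea).max())  # similarity to closest prior work
    return 1.0 - max_sim  # higher when the idea is far from all retrieved work

In a TTRL loop, this scalar would serve as the reward signal optimized at inference time, which is what lets the method improve hypothesis novelty without reference answers.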
paper/sections/0-abstract.tex (1 addition, 1 deletion)
\textbf{Abstract:}
Despite advances in scientific AI, a coherent framework for Scientific General Intelligence (SGI)—the ability to autonomously conceive, investigate, and reason across scientific domains—remains lacking.
We present an operational SGI definition grounded in the Practical Inquiry Model (PIM: Deliberation, Conception, Action, Perception) and operationalize it via four scientist-aligned tasks: deep research, idea generation, dry/wet experiments, and experimental reasoning.
SGI-Bench comprises over 1,000 expert-curated, cross-disciplinary samples inspired by Science's 125 Big Questions, enabling systematic evaluation of state-of-the-art LLMs. Results reveal gaps: low exact match (10--20\%) in deep research despite step-level alignment; ideas lacking feasibility and detail; high code executability but low execution result accuracy in dry experiments; low sequence fidelity in wet protocols; and persistent multimodal comparative-reasoning challenges.
We further introduce Test-Time Reinforcement Learning (TTRL), which optimizes retrieval-augmented novelty rewards at inference, enhancing hypothesis novelty without reference answers.
Together, our PIM-grounded definition, workflow-centric benchmark, and empirical insights establish a foundation for AI systems that genuinely participate in scientific discovery.