paper/main.bib (40 additions, 0 deletions)
@@ -758,6 +758,21 @@ @misc{Grok3_2025
 }
 
 
+@article{hu2025flowsearch,
+  title={FlowSearch: Advancing deep research with dynamic structured knowledge flow},
+  author={Hu, Yusong and Ma, Runmin and Fan, Yue and Shi, Jinxin and Cao, Zongsheng and Zhou, Yuhao and Yuan, Jiakang and Yan, Xiangchao and Zhang, Wenlong and Bai, Lei and others},
+  journal={arXiv preprint arXiv:2510.08521},
+  year={2025}
+}
+
+@article{shi2025dualresearch,
+  title={DualResearch: Entropy-Gated Dual-Graph Retrieval for Answer Reconstruction},
+  author={Shi, Jinxin and Cao, Zongsheng and Ma, Runmin and Hu, Yusong and Zhou, Jie and Li, Xin and Bai, Lei and He, Liang and Zhang, Bo},
+  journal={arXiv preprint arXiv:2510.08959},
+  year={2025}
+}
+
+
 @article{team2025novelseek,
   title={NovelSeek: When Agent Becomes the Scientist--Building Closed-Loop System from Hypothesis to Verification},
   author={Team, NovelSeek and Zhang, Bo and Feng, Shiyang and Yan, Xiangchao and Yuan, Jiakang and Yu, Zhiyin and He, Xiaohan and Huang, Songtao and Hou, Shaowei and Nie, Zheng and others},
...

+  title={Manalyzer: End-to-end Automated Meta-analysis with Multi-agent System},
+  author={Xu, Wanghan and Zhang, Wenlong and Ling, Fenghua and Fei, Ben and Hu, Yusong and Ren, Fangxuan and Lin, Jintai and Ouyang, Wanli and Bai, Lei},
+  journal={arXiv preprint arXiv:2505.20310},
+  year={2025}
+}
+
+@article{field2010meta,
+  title={How to do a meta-analysis},
+  author={Field, Andy P and Gillett, Raphael},
+  journal={British Journal of Mathematical and Statistical Psychology},
+  volume={63},
+  number={3},
+  pages={665--694},
+  year={2010},
+  publisher={Wiley Online Library}
+}
+
+@article{xu2025comprehensive,
+  title={A Comprehensive Survey of Deep Research: Systems, Methodologies, and Applications},
...
paper/sections/0-abstract.tex (1 addition, 1 deletion)
@@ -13,7 +13,7 @@
 
 Despite advances in scientific AI, a coherent framework for Scientific General Intelligence (SGI)—the ability to autonomously conceive, investigate, and reason across scientific domains—remains lacking.
 We present a principled SGI definition grounded in the Practical Inquiry Model (PIM: Deliberation, Conception, Action, Perception) and operationalize it via four scientist-aligned tasks: deep research, idea generation, AI-assisted experiments (dry/wet), and experimental reasoning.
-SGI-Bench comprises ~1{,}000 expert-curated, cross-disciplinary samples inspired by Science’s 125 Big Questions, enabling systematic evaluation of state-of-the-art LLMs. Results reveal gaps: low exact match (10--20\%) in deep research despite step-level alignment; ideas lacking feasibility and detail; high code executability but low execution result accuracy in dry experiments; low sequence fidelity in wet protocols; and persistent multimodal comparative-reasoning challenges.
+SGI-Bench comprises over 1,000 expert-curated, cross-disciplinary samples inspired by Science’s 125 Big Questions, enabling systematic evaluation of state-of-the-art LLMs. Results reveal gaps: low exact match (10--20\%) in deep research despite step-level alignment; ideas lacking feasibility and detail; high code executability but low execution result accuracy in dry experiments; low sequence fidelity in wet protocols; and persistent multimodal comparative-reasoning challenges.
 We further introduce Test-Time Reinforcement Learning (TTRL), which optimizes retrieval-augmented novelty rewards at inference, enhancing hypothesis novelty without reference answer.
 Together, our PIM-grounded definition, workflow-centric benchmark, and empirical insights establish a foundation for AI systems that genuinely participate in scientific discovery.
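
The abstract above mentions a retrieval-augmented novelty reward optimized at inference (TTRL) but does not spell out its form. Purely as a hedged illustration, and not the paper's implementation, the sketch below shows one plausible shape for such a reward: an embedding-based similarity to retrieved prior work, with novelty taken as one minus the highest similarity. The function name, the embedding representation, and the retrieval step are all assumptions.

import numpy as np

def novelty_reward(hypothesis_vec: np.ndarray, retrieved_vecs: np.ndarray) -> float:
    # Hypothetical reward, not the paper's TTRL code.
    # hypothesis_vec: embedding of the generated hypothesis, shape (d,)
    # retrieved_vecs: embeddings of retrieved prior-work passages, shape (n, d)
    # Reward is 1 minus the highest cosine similarity, so a hypothesis that closely
    # restates retrieved literature earns a low reward -- no reference answer needed.
    h = hypothesis_vec / np.linalg.norm(hypothesis_vec)
    r = retrieved_vecs / np.linalg.norm(retrieved_vecs, axis=1, keepdims=True)
    return float(1.0 - np.max(r @ h))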
paper/sections/1-introduction.tex (1 addition, 1 deletion)
@@ -24,7 +24,7 @@ \section{Introduction}
 \item\textbf{Scientific Experimental Reasoning (Perception):} This task requires models to analyze experimental results, interpret data trends, and identify meaningful conclusions.
 \end{itemize}
 
-Building upon our theoretical framework, the construction of SGI-Bench operationalizes the proposed definition of \textbf{Scientific General Intelligence (SGI)}. We began with foundational topics drawn from \textit{Science’s 125 Big Questions for the 21st Century}~\cite{sanders2021125}, spanning ten major disciplinary areas. Through multi-round collaborations with domain experts, we identified high-impact, AI-assisted research problems and curated raw source materials from leading journals such as \textit{Nature}, \textit{Science}, and \textit{Cell}. Together with PhD-level researchers, we implemented a multi-stage quality control pipeline involving human annotation, model-based verification, and rule-based consistency checks. The resulting benchmark comprises approximately 1,000 expert-curated samples that concretely instantiate the reasoning, creativity, and experimental competencies central to our definition of SGI.
+Building upon our theoretical framework, the construction of SGI-Bench operationalizes the proposed definition of \textbf{Scientific General Intelligence (SGI)}. We began with foundational topics drawn from \textit{Science’s 125 Big Questions for the 21st Century}~\cite{sanders2021125}, spanning ten major disciplinary areas. Through multi-round collaborations with domain experts, we identified high-impact, AI-assisted research problems and curated raw source materials from leading journals such as \textit{Nature}, \textit{Science}, and \textit{Cell}. Together with PhD-level researchers, we implemented a multi-stage quality control pipeline involving human annotation, model-based verification, and rule-based consistency checks. The resulting benchmark comprises over 1,000 expert-curated samples that concretely instantiate the reasoning, creativity, and experimental competencies central to our definition of SGI.
 
 To evaluate performance across these four dimensions, we found that conventional “LLM-as-a-judge”~\cite{li2025generation} paradigms are insufficient to handle the diverse and specialized metrics required by SGI assessment. To address this, we developed an agent-based evaluation framework following an \textbf{Agent-as-a-judge}~\cite{zhuge2024agent} paradigm. Equipped with tools such as a web search interface, Python interpreter, file reader, PDF parser, and discipline-specific metric functions, this framework ensures rigor, scalability, and transparency. It operates through four interdependent stages—\textit{Question Selection}, \textit{Metric Customization}, \textit{Prediction \& Evaluation}, and \textit{Report Generation}—each coordinated by specialized agents aligned with different aspects of scientific inquiry.
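
The paragraph above names the four stages of the Agent-as-a-judge pipeline without showing how they connect. The following is a minimal, hedged sketch of how such a four-stage flow could be wired together; every identifier here (JudgeContext, select_question, and so on) is hypothetical and not taken from the paper's repository.

from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class JudgeContext:
    question: dict                               # one benchmark sample
    metrics: list = field(default_factory=list)  # customized metric functions
    prediction: Optional[str] = None             # model output under evaluation
    scores: dict = field(default_factory=dict)   # metric name -> score

def select_question(pool: list) -> JudgeContext:
    # Stage 1: Question Selection -- pick a sample for this run (here simply the
    # first one; an agent would filter by task type and discipline).
    return JudgeContext(question=pool[0])

def customize_metrics(ctx: JudgeContext, registry: dict) -> JudgeContext:
    # Stage 2: Metric Customization -- attach the discipline-specific metric
    # functions that the sample declares it should be scored with.
    ctx.metrics = [registry[name] for name in ctx.question.get("metrics", [])]
    return ctx

def predict_and_evaluate(ctx: JudgeContext, model: Callable) -> JudgeContext:
    # Stage 3: Prediction & Evaluation -- query the model under test, then score
    # the answer with each metric (a metric may itself call tools such as web
    # search, a Python interpreter, or a PDF parser).
    ctx.prediction = model(ctx.question["prompt"])
    ctx.scores = {m.__name__: m(ctx.prediction, ctx.question) for m in ctx.metrics}
    return ctx

def generate_report(ctx: JudgeContext) -> str:
    # Stage 4: Report Generation -- summarize per-metric scores for the sample.
    return "\n".join(f"{name}: {value:.3f}" for name, value in ctx.scores.items())

A driver loop would chain these four stages over the benchmark pool and aggregate the per-sample reports by task and discipline.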