
Commit 32695ee

Author: unknown (committed)
Commit message: all done
1 parent 1bddc77 commit 32695ee

File tree: 6 files changed (+70, -27 lines)


paper/main.bib

Lines changed: 40 additions & 0 deletions
@@ -758,6 +758,21 @@ @misc{Grok3_2025
 }
 
 
+@article{hu2025flowsearch,
+  title={FlowSearch: Advancing deep research with dynamic structured knowledge flow},
+  author={Hu, Yusong and Ma, Runmin and Fan, Yue and Shi, Jinxin and Cao, Zongsheng and Zhou, Yuhao and Yuan, Jiakang and Yan, Xiangchao and Zhang, Wenlong and Bai, Lei and others},
+  journal={arXiv preprint arXiv:2510.08521},
+  year={2025}
+}
+
+@article{shi2025dualresearch,
+  title={DualResearch: Entropy-Gated Dual-Graph Retrieval for Answer Reconstruction},
+  author={Shi, Jinxin and Cao, Zongsheng and Ma, Runmin and Hu, Yusong and Zhou, Jie and Li, Xin and Bai, Lei and He, Liang and Zhang, Bo},
+  journal={arXiv preprint arXiv:2510.08959},
+  year={2025}
+}
+
+
 @article{team2025novelseek,
   title={NovelSeek: When Agent Becomes the Scientist--Building Closed-Loop System from Hypothesis to Verification},
   author={Team, NovelSeek and Zhang, Bo and Feng, Shiyang and Yan, Xiangchao and Yuan, Jiakang and Yu, Zhiyin and He, Xiaohan and Huang, Songtao and Hou, Shaowei and Nie, Zheng and others},
@@ -800,4 +815,29 @@ @misc{zhao2025swiftascalablelightweightinfrastructure
   archivePrefix={arXiv},
   primaryClass={cs.CL},
   url={https://arxiv.org/abs/2408.05517},
+}
+
+@article{xu2025manalyzer,
+  title={Manalyzer: End-to-end Automated Meta-analysis with Multi-agent System},
+  author={Xu, Wanghan and Zhang, Wenlong and Ling, Fenghua and Fei, Ben and Hu, Yusong and Ren, Fangxuan and Lin, Jintai and Ouyang, Wanli and Bai, Lei},
+  journal={arXiv preprint arXiv:2505.20310},
+  year={2025}
+}
+
+@article{field2010meta,
+  title={How to do a meta-analysis},
+  author={Field, Andy P and Gillett, Raphael},
+  journal={British Journal of Mathematical and Statistical Psychology},
+  volume={63},
+  number={3},
+  pages={665--694},
+  year={2010},
+  publisher={Wiley Online Library}
+}
+
+@article{xu2025comprehensive,
+  title={A Comprehensive Survey of Deep Research: Systems, Methodologies, and Applications},
+  author={Xu, Renjun and Peng, Jingwen},
+  journal={arXiv preprint arXiv:2506.12594},
+  year={2025}
 }

paper/sections/0-abstract.tex

Lines changed: 1 addition & 1 deletion
@@ -13,7 +13,7 @@
 
 Despite advances in scientific AI, a coherent framework for Scientific General Intelligence (SGI)—the ability to autonomously conceive, investigate, and reason across scientific domains—remains lacking.
 We present a principled SGI definition grounded in the Practical Inquiry Model (PIM: Deliberation, Conception, Action, Perception) and operationalize it via four scientist-aligned tasks: deep research, idea generation, AI-assisted experiments (dry/wet), and experimental reasoning.
-SGI-Bench comprises ~1{,}000 expert-curated, cross-disciplinary samples inspired by Science’s 125 Big Questions, enabling systematic evaluation of state-of-the-art LLMs. Results reveal gaps: low exact match (10--20\%) in deep research despite step-level alignment; ideas lacking feasibility and detail; high code executability but low execution result accuracy in dry experiments; low sequence fidelity in wet protocols; and persistent multimodal comparative-reasoning challenges.
+SGI-Bench comprises over 1,000 expert-curated, cross-disciplinary samples inspired by Science’s 125 Big Questions, enabling systematic evaluation of state-of-the-art LLMs. Results reveal gaps: low exact match (10--20\%) in deep research despite step-level alignment; ideas lacking feasibility and detail; high code executability but low execution result accuracy in dry experiments; low sequence fidelity in wet protocols; and persistent multimodal comparative-reasoning challenges.
 We further introduce Test-Time Reinforcement Learning (TTRL), which optimizes retrieval-augmented novelty rewards at inference, enhancing hypothesis novelty without reference answers.
 Together, our PIM-grounded definition, workflow-centric benchmark, and empirical insights establish a foundation for AI systems that genuinely participate in scientific discovery.
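
The abstract introduces TTRL only at a high level. As a purely illustrative aid, the sketch below shows one way a retrieval-augmented novelty reward could be computed and used to rank candidate hypotheses at inference time; the embedding-based cosine scoring, the best-of-n selection, and every function name here are assumptions made for this example, not the paper's actual TTRL implementation.

import numpy as np

def novelty_reward(hypothesis_vec, retrieved_vecs):
    # Reward = 1 - max cosine similarity to any retrieved prior-work
    # embedding: hypotheses that closely overlap existing literature
    # score low, clearly novel ones score high. No reference answer
    # is needed, only retrieved documents.
    if not retrieved_vecs:
        return 1.0
    h = np.asarray(hypothesis_vec, dtype=float)
    sims = []
    for r in retrieved_vecs:
        r = np.asarray(r, dtype=float)
        sims.append(float(np.dot(h, r) /
                          (np.linalg.norm(h) * np.linalg.norm(r) + 1e-12)))
    return 1.0 - max(sims)

def pick_most_novel(candidates, embed, retrieve):
    # Best-of-n selection at inference time: score each candidate
    # hypothesis with the novelty reward and keep the best one.
    # `embed` maps text to a vector; `retrieve` returns embeddings of
    # related prior work. Both are caller-supplied placeholders.
    scored = [(novelty_reward(embed(c), retrieve(c)), c) for c in candidates]
    return max(scored, key=lambda pair: pair[0])[1]

In the paper, TTRL presumably uses this kind of reward to drive an actual reinforcement-learning update rather than simple best-of-n selection; the sketch only illustrates how a novelty signal can be formed without a reference answer.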

paper/sections/1-introduction.tex

Lines changed: 1 addition & 1 deletion
@@ -24,7 +24,7 @@ \section{Introduction}
 \item \textbf{Scientific Experimental Reasoning (Perception):} This task requires models to analyze experimental results, interpret data trends, and identify meaningful conclusions.
 \end{itemize}
 
-Building upon our theoretical framework, the construction of SGI-Bench operationalizes the proposed definition of \textbf{Scientific General Intelligence (SGI)}. We began with foundational topics drawn from \textit{Science’s 125 Big Questions for the 21st Century}~\cite{sanders2021125}, spanning ten major disciplinary areas. Through multi-round collaborations with domain experts, we identified high-impact, AI-assisted research problems and curated raw source materials from leading journals such as \textit{Nature}, \textit{Science}, and \textit{Cell}. Together with PhD-level researchers, we implemented a multi-stage quality control pipeline involving human annotation, model-based verification, and rule-based consistency checks. The resulting benchmark comprises approximately 1,000 expert-curated samples that concretely instantiate the reasoning, creativity, and experimental competencies central to our definition of SGI.
+Building upon our theoretical framework, the construction of SGI-Bench operationalizes the proposed definition of \textbf{Scientific General Intelligence (SGI)}. We began with foundational topics drawn from \textit{Science’s 125 Big Questions for the 21st Century}~\cite{sanders2021125}, spanning ten major disciplinary areas. Through multi-round collaborations with domain experts, we identified high-impact, AI-assisted research problems and curated raw source materials from leading journals such as \textit{Nature}, \textit{Science}, and \textit{Cell}. Together with PhD-level researchers, we implemented a multi-stage quality control pipeline involving human annotation, model-based verification, and rule-based consistency checks. The resulting benchmark comprises over 1,000 expert-curated samples that concretely instantiate the reasoning, creativity, and experimental competencies central to our definition of SGI.
 
 To evaluate performance across these four dimensions, we found that conventional “LLM-as-a-judge”~\cite{li2025generation} paradigms are insufficient to handle the diverse and specialized metrics required by SGI assessment. To address this, we developed an agent-based evaluation framework following an \textbf{Agent-as-a-judge}~\cite{zhuge2024agent} paradigm. Equipped with tools such as a web search interface, Python interpreter, file reader, PDF parser, and discipline-specific metric functions, this framework ensures rigor, scalability, and transparency. It operates through four interdependent stages—\textit{Question Selection}, \textit{Metric Customization}, \textit{Prediction \& Evaluation}, and \textit{Report Generation}—each coordinated by specialized agents aligned with different aspects of scientific inquiry.
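
To make the four-stage Agent-as-a-judge pipeline described in the paragraph above more concrete, here is a minimal structural sketch. The class and function names, the dict-based state passing, and the tool signatures are illustrative assumptions, not the framework's actual API.

from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class JudgeAgent:
    # One specialized agent: a name, the tools it may call (e.g. web
    # search, Python interpreter, file reader, PDF parser, or
    # discipline-specific metric functions), and a run function that
    # consumes the shared state and returns its stage's output.
    name: str
    tools: Dict[str, Callable]
    run: Callable

def evaluate(samples: List[dict], agents: List[JudgeAgent]) -> List[dict]:
    # Route every benchmark sample through the four stages in order
    # (Question Selection -> Metric Customization -> Prediction &
    # Evaluation -> Report Generation), accumulating each agent's
    # output in a shared state dict so later stages can build on
    # earlier results.
    reports = []
    for sample in samples:
        state = {"sample": sample}
        for agent in agents:
            state.update(agent.run(state, agent.tools))
        reports.append(state)
    return reports

Each stage would be instantiated as one JudgeAgent whose run function wraps the underlying LLM calls and tool use; chaining them in a single loop keeps the evaluation traceable, in line with the transparency goal stated above.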
