\captionof{figure}{\textbf{Scientific General Intelligence (SGI).} We define SGI as an AI that can autonomously navigate the complete, iterative cycle of scientific inquiry with the versatility and proficiency of a human scientist. The teaser illustrates the Practical Inquiry Model's four quadrants—Deliberation (synthesis and critical evaluation of knowledge), Conception (idea generation), Action (experimental execution), and Perception (interpretation)—and how SGI-Bench operationalizes them through four task categories and an agent-based evaluation paradigm, together providing a principle-grounded, measurable framework for assessing scientific intelligence.}
paper/sections/0-abstract.tex (1 addition, 1 deletion)
@@ -15,7 +15,7 @@
Despite advances in scientific AI, a coherent framework for Scientific General Intelligence (SGI)—the ability to autonomously conceive, investigate, and reason across scientific domains—remains lacking.
We present a principled SGI definition grounded in the Practical Inquiry Model (PIM: Deliberation, Conception, Action, Perception) and operationalize it via four scientist-aligned tasks: deep research, idea generation, AI-assisted experiments (dry/wet), and experimental reasoning.
SGI-Bench comprises over 1,000 expert-curated, cross-disciplinary samples inspired by Science's 125 Big Questions, enabling systematic evaluation of state-of-the-art LLMs. Results reveal gaps: low exact match (10--20\%) in deep research despite step-level alignment; ideas lacking feasibility and detail; high code executability but low execution result accuracy in dry experiments; low sequence fidelity in wet protocols; and persistent multimodal comparative-reasoning challenges.
We further introduce Test-Time Reinforcement Learning (TTRL), which optimizes retrieval-augmented novelty rewards at inference, enhancing hypothesis novelty without reference answers.
Together, our PIM-grounded definition, workflow-centric benchmark, and empirical insights establish a foundation for AI systems that genuinely participate in scientific discovery.
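The TTRL mechanism summarized above (a retrieval-augmented novelty reward optimized at inference) can be made concrete with a short sketch. The snippet below is illustrative only, not the paper's implementation: `embed` is a toy stand-in for a real sentence encoder, `novelty_reward` assumes novelty is scored as distance from retrieved prior work, and best-of-n selection stands in for the actual test-time policy update.

```python
# Illustrative sketch of a retrieval-augmented novelty reward, as assumed
# from the abstract; names and the embedding are hypothetical stand-ins.
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Toy deterministic embedding; a real system would use a sentence encoder."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

def novelty_reward(candidate: str, prior_work: list[str]) -> float:
    """Reward = 1 - max cosine similarity to retrieved prior work, so
    hypotheses far from the existing literature score higher."""
    c = embed(candidate)
    return 1.0 - max(float(c @ embed(p)) for p in prior_work)

def best_of_n(candidates: list[str], prior_work: list[str]) -> str:
    """Minimal test-time use of the reward: keep the most novel sample.
    TTRL proper would instead use the reward to update the policy at inference."""
    return max(candidates, key=lambda h: novelty_reward(h, prior_work))

prior = ["Dark matter is a weakly interacting massive particle.",
         "Dark matter consists of axions."]
ideas = ["Dark matter consists of axions.",
         "Dark matter interacts only gravitationally via a hidden sector."]
print(best_of_n(ideas, prior))  # prefers the hypothesis farther from prior work
```

Rewarding dissimilarity from retrieved literature is what lets such a scheme operate without reference answers: the retrieval corpus, rather than a gold label, supplies the optimization signal.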
paper/sections/1-introduction.tex (2 additions, 2 deletions)
@@ -18,13 +18,13 @@ \section{Introduction}
Thus, to concretize the proposed definition of \textbf{Scientific General Intelligence (SGI)}, we develop \textbf{SGI-Bench: A Scientific Intelligence Benchmark for LLMs via Scientist-Aligned Workflows}. Rather than serving as yet another performance benchmark, SGI-Bench functions as an \emph{operational instantiation} of the SGI framework, quantitatively evaluating LLMs across the full spectrum of scientific cognition defined by the \textbf{Practical Inquiry Model}. By design, SGI-Bench is comprehensive in its disciplinary breadth, challenging in its difficulty, and unique in its explicit coverage of all four capabilities central to our definition of SGI. The benchmark structure is therefore organized into four corresponding task categories:
\begin{itemize}
\item\textbf{Scientific Deep Research (Deliberation):} This task evaluates models' ability to perform iterative, multi-step reasoning over complex scientific content.
\item\textbf{Idea Generation (Conception):} This task assesses creativity and methodological planning by asking models to generate novel hypotheses or experimental designs.
\item\textbf{AI-Assisted Scientific Experiment (Action):} This task evaluates the ability to plan and execute computational (dry) or laboratory-style (wet) experiments.
\item\textbf{Scientific Experimental Reasoning (Perception):} This task requires models to analyze experimental results, interpret data trends, and identify meaningful conclusions.
\end{itemize}
Building upon our theoretical framework, the construction of SGI-Bench operationalizes the proposed definition of \textbf{Scientific General Intelligence (SGI)}. We began with foundational topics drawn from \textit{Science's 125 Big Questions for the 21st Century}~\cite{sanders2021125}, spanning ten major disciplinary areas. Through multi-round collaborations with domain experts, we identified high-impact, AI-assisted research problems and curated raw source materials from leading journals such as \textit{Nature}, \textit{Science}, and \textit{Cell}. Together with PhD-level researchers, we implemented a multi-stage quality control pipeline involving human annotation, model-based verification, and rule-based consistency checks. The resulting benchmark comprises over 1,000 expert-curated samples that concretely instantiate the reasoning, creativity, and experimental competencies central to our definition of SGI.
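As a rough illustration of how such a multi-stage pipeline can be composed, the sketch below chains rule-based consistency checks, a model-based verifier, and a human-review stage. All function names and the stage ordering are hypothetical, and the model-based stage is reduced to a trivial heuristic placeholder.

```python
# Hypothetical sketch of a multi-stage quality-control pipeline; the actual
# SGI-Bench pipeline is not shown here, so every stage is a placeholder.
from dataclasses import dataclass, field

@dataclass
class Sample:
    question: str
    answer: str
    flags: list[str] = field(default_factory=list)

def rule_checks(s: Sample) -> Sample:
    # Cheap deterministic consistency checks run first.
    if not s.question.strip() or not s.answer.strip():
        s.flags.append("rule: empty field")
    return s

def model_verification(s: Sample) -> Sample:
    # Placeholder for an LLM verifier that cross-checks the answer against
    # the source paper; reduced here to a trivial length heuristic.
    if len(s.answer.split()) < 3:
        s.flags.append("model: answer too short to verify")
    return s

def human_review(s: Sample) -> Sample:
    # In the described pipeline, expert annotators resolve remaining flags;
    # this stub simply passes samples through.
    return s

def run_pipeline(samples: list[Sample]) -> list[Sample]:
    for stage in (rule_checks, model_verification, human_review):
        samples = [stage(s) for s in samples]
    return [s for s in samples if not s.flags]  # keep only clean samples
```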
When evaluating performance across these four dimensions, we found that conventional “LLM-as-a-judge”~\cite{li2025generation} paradigms are insufficient to handle the diverse and specialized metrics required by SGI assessment. To address this, we developed an agent-based evaluation framework following an \textbf{Agent-as-a-judge}~\cite{zhuge2024agent} paradigm. Equipped with tools such as a web search interface, Python interpreter, file reader, PDF parser, and discipline-specific metric functions, this framework ensures rigor, scalability, and transparency. It operates through four interdependent stages—\textit{Question Selection}, \textit{Metric Customization}, \textit{Prediction \& Evaluation}, and \textit{Report Generation}—each coordinated by specialized agents aligned with different aspects of scientific inquiry.
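To make the four-stage flow above easier to picture, here is a minimal sketch of how Question Selection, Metric Customization, Prediction \& Evaluation, and Report Generation might be wired together. All names are hypothetical, and exact match stands in for the discipline-specific metric functions the agents would actually select.

```python
# Hypothetical sketch of the four-stage Agent-as-a-judge flow; in the real
# framework each stage is an LLM agent with tools (web search, Python
# interpreter, file reader, PDF parser, discipline-specific metrics).
from dataclasses import dataclass, field
from typing import Callable

Metric = Callable[[str, str], float]

@dataclass
class Judgement:
    question: str
    metrics: dict[str, Metric] = field(default_factory=dict)
    scores: dict[str, float] = field(default_factory=dict)

def select_question(pool: list[str]) -> str:
    # Stage 1 (Question Selection): an agent filters and routes questions;
    # this stub just takes the first.
    return pool[0]

def customize_metrics(question: str) -> dict[str, Metric]:
    # Stage 2 (Metric Customization): an agent picks task-appropriate
    # metric functions; exact match is the simplest possible stand-in.
    return {"exact_match": lambda pred, ref: float(pred.strip() == ref.strip())}

def predict_and_evaluate(j: Judgement, model: Callable[[str], str], ref: str) -> None:
    # Stage 3 (Prediction & Evaluation): query the evaluated model, then score.
    pred = model(j.question)
    j.scores = {name: fn(pred, ref) for name, fn in j.metrics.items()}

def generate_report(j: Judgement) -> str:
    # Stage 4 (Report Generation): an agent drafts a transparent rationale;
    # this stub just formats the scores.
    return f"{j.question} -> " + ", ".join(f"{k}={v:.2f}" for k, v in j.scores.items())

j = Judgement(question=select_question(["What fraction of the universe is dark energy?"]))
j.metrics = customize_metrics(j.question)
predict_and_evaluate(j, model=lambda q: "about 68%", ref="about 68%")
print(generate_report(j))  # -> exact_match=1.00
```

Structuring evaluation this way keeps each stage independently replaceable, which is consistent with the rigor, scalability, and transparency goals stated above.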