paper/sections/2-benchmark.tex (2 additions, 1 deletion)
@@ -24,6 +24,7 @@ \section{Scientific General Intelligence: Concept and Operational Definition}
SGI-Bench departs from conventional benchmarks that emphasize factual recall or single-turn reasoning. Instead, it operationalizes the long-horizon workflow of scientific discovery into four interdependent stages: literature review (Deliberation), methodology design (Conception), experiment implementation (Action), and experimental analysis (Perception). These stages correspond to fundamental capabilities required of AI systems: information integration and understanding (Scientific Deep Research), design and planning (Idea Generation), experimental execution (AI-Assisted Scientific Experiment), and reasoning-based interpretation (Scientific Experimental Reasoning). Together, they form a unified framework that measures not only what models know but how they think, plan, and adapt in pursuit of new knowledge.
+
\begin{figure}[ht]
% \vspace{-0.5em}
\centerline
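The four-stage framework in the paragraph above maps each stage of the discovery workflow onto a capability and a corresponding benchmark task. A minimal sketch of that mapping as a plain data structure; the dictionary itself is illustrative and not part of the benchmark's code, only the names come from the paper's text:

```python
# Illustrative mapping of SGI-Bench's four interdependent stages to the
# capability and benchmark task each one operationalizes (names taken from
# the paragraph above; the structure itself is an editorial sketch).
SGI_BENCH_STAGES = {
    "Deliberation": {
        "workflow_stage": "literature review",
        "capability": "information integration and understanding",
        "task": "Scientific Deep Research",
    },
    "Conception": {
        "workflow_stage": "methodology design",
        "capability": "design and planning",
        "task": "Idea Generation",
    },
    "Action": {
        "workflow_stage": "experiment implementation",
        "capability": "experimental execution",
        "task": "AI-Assisted Scientific Experiment",
    },
    "Perception": {
        "workflow_stage": "experimental analysis",
        "capability": "reasoning-based interpretation",
        "task": "Scientific Experimental Reasoning",
    },
}
```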
@@ -462,7 +463,7 @@ \subsubsection{Metrics of AI-Assisted Scientific Experiment}
\paragraph{Dry Experiment}
\label{sec:MetricofDryExperiment}
-Dry experiments focus on code generation tasks. Specifically, each problem includes background information, data code, and main code with certain functions masked. The model is tasked with completing the missing functions. Each problem contains 5 unit tests. Our metrics capture both correctness and execution behavior of the generated code.~\cite{jain2024livecodebenchholisticcontaminationfree}
+Dry experiments focus on code generation tasks. Specifically, each problem includes background information, data code, and main code with certain functions masked. The model is tasked with completing the missing functions. Each problem contains 5 unit tests. Our metrics capture both correctness and execution behavior of the generated code~\cite{jain2024livecodebenchholisticcontaminationfree}.
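The dry-experiment metric described above (complete a masked function, then judge the completion by its 5 unit tests, tracking both correctness and execution behavior) can be sketched roughly as follows. This is not the benchmark's released harness; the `# <MASKED FUNCTION>` placeholder, the file layout, and the exact execution-behavior criterion are assumptions made for illustration:

```python
# Rough sketch of scoring one dry-experiment problem: insert the model's
# completion into the masked main code, then run each unit test in a
# subprocess. Marker name, file layout, and the execution-behavior criterion
# are illustrative assumptions, not the benchmark's actual harness.
import os
import subprocess
import tempfile


def score_completion(main_code: str, completion: str, unit_tests: list[str]) -> dict:
    program = main_code.replace("# <MASKED FUNCTION>", completion)
    passed = executed = 0
    with tempfile.TemporaryDirectory() as tmp:
        with open(os.path.join(tmp, "solution.py"), "w") as f:
            f.write(program)
        for test in unit_tests:
            with open(os.path.join(tmp, "run_test.py"), "w") as f:
                f.write("from solution import *\n" + test)
            try:
                proc = subprocess.run(
                    ["python", "run_test.py"],
                    cwd=tmp, capture_output=True, timeout=120,
                )
            except subprocess.TimeoutExpired:
                continue  # counts as neither executed nor passed
            if proc.returncode == 0:
                passed += 1    # correctness: the test's assertions held
            if proc.returncode == 0 or b"AssertionError" in proc.stderr:
                executed += 1  # execution behavior: the code ran without crashing
    return {
        "pass_rate": passed / len(unit_tests),
        "execution_rate": executed / len(unit_tests),
    }
```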
-\caption{\textbf{Deep Research Task Metrics (LLMs)}: Category-wise scores across Properties, Micro/Macro-Experiments, and Data.}
+\caption{\textbf{Deep Research Task Metrics (LLMs)}: Category-wise scores across Properties, Micro/Macro-Experiments, and Data. Note: Because subjects differ in their characteristics, the number of questions in each category is not the same (Figure~\ref{fig:data_distribution}); overall model performance therefore cannot be obtained by directly averaging the values in this table.}
-\caption{\textbf{Deep Research Task Metrics (Agents)}: Category-wise scores across Properties, Micro/Macro-Experiments, and Data.}
+\caption{\textbf{Deep Research Task Metrics (Agents)}: Category-wise scores across Properties, Micro/Macro-Experiments, and Data. Note: Because subjects differ in their characteristics, the number of questions in each category is not the same (Figure~\ref{fig:data_distribution}); overall model performance therefore cannot be obtained by directly averaging the values in this table.}
-\caption{\textbf{Dry Experiment Function Categories}: Completion scores across six function types.}
+\caption{\textbf{Dry Experiment Function Categories}: Completion scores across six function types. Note: Because subjects differ in their characteristics, the number of questions in each category is not the same (Figure~\ref{fig:data_distribution}); overall model performance therefore cannot be obtained by directly averaging the values in this table.}
-\caption{\textbf{Experimental Reasoning by Type (Multi-choice Accuracy)}: Scores across signal, attribute, comparative, and causal reasoning.}
+\caption{\textbf{Experimental Reasoning by Type (Multi-choice Accuracy)}: Scores across signal, attribute, comparative, and causal reasoning. Note: Because subjects differ in their characteristics, the number of questions in each category is not the same (Figure~\ref{fig:data_distribution}); overall model performance therefore cannot be obtained by directly averaging the values in this table.}
-\caption{\textbf{Deep Research Across Subjects (LLMs)}: Subject-wise scores across ten scientific domains.}
+\caption{\textbf{Deep Research Across Subjects (LLMs)}: Subject-wise scores across ten scientific domains. Note: Because subjects differ in their characteristics, the number of questions in each category is not the same (Figure~\ref{fig:data_distribution}); overall model performance therefore cannot be obtained by directly averaging the values in this table.}
-\caption{\textbf{Deep Research Across Subjects (Agents)}: Subject-wise scores across ten scientific domains.}
+\caption{\textbf{Deep Research Across Subjects (Agents)}: Subject-wise scores across ten scientific domains. Note: Because subjects differ in their characteristics, the number of questions in each category is not the same (Figure~\ref{fig:data_distribution}); overall model performance therefore cannot be obtained by directly averaging the values in this table.}
-\caption{\textbf{Idea Generation Across Subjects}: Subject-wise scores.}
+\caption{\textbf{Idea Generation Across Subjects}: Subject-wise scores. Note: Because subjects differ in their characteristics, the number of questions in each category is not the same (Figure~\ref{fig:data_distribution}); overall model performance therefore cannot be obtained by directly averaging the values in this table.}
-\caption{\textbf{Dry Experiment Across Subjects}: Subject-wise scores.}
+\caption{\textbf{Dry Experiment Across Subjects}: Subject-wise scores. Note: Because subjects differ in their characteristics, the number of questions in each category is not the same (Figure~\ref{fig:data_distribution}); overall model performance therefore cannot be obtained by directly averaging the values in this table.}
-\caption{\textbf{Wet Experiment Across Subjects}: Scores across Action Sequence Similarity (SS) and Parameter Accuracy (PA) categories.}
+\caption{\textbf{Wet Experiment Across Subjects}: Scores across Action Sequence Similarity (SS) and Parameter Accuracy (PA) categories. Note: Because subjects differ in their characteristics, the number of questions in each category is not the same (Figure~\ref{fig:data_distribution}); overall model performance therefore cannot be obtained by directly averaging the values in this table.}
-\caption{\textbf{Experimental Reasoning Across Subjects (Multi-choice Accuracy)}: Subject-wise scores across 10 scientific disciplines.}
+\caption{\textbf{Experimental Reasoning Across Subjects (Multi-choice Accuracy)}: Subject-wise scores across 10 scientific disciplines. Note: Because subjects differ in their characteristics, the number of questions in each category is not the same (Figure~\ref{fig:data_distribution}); overall model performance therefore cannot be obtained by directly averaging the values in this table.}
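The note appended to each caption above makes a quantitative point worth spelling out: because categories contain different numbers of questions, the overall score is the count-weighted mean of the category scores, not their plain average. A small illustration with made-up numbers (category names follow the captions above; the scores and counts are not benchmark results):

```python
# Why per-category scores cannot simply be averaged when categories contain
# different numbers of questions. Scores and counts below are invented for
# illustration only.
category_scores = {"Properties": 0.80, "Micro-Experiments": 0.40}  # per-category accuracy
category_counts = {"Properties": 30, "Micro-Experiments": 10}      # questions per category

unweighted = sum(category_scores.values()) / len(category_scores)
weighted = sum(
    category_scores[c] * category_counts[c] for c in category_scores
) / sum(category_counts.values())

print(f"unweighted mean of category scores: {unweighted:.2f}")  # 0.60
print(f"count-weighted overall score:       {weighted:.2f}")    # 0.70
```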