
Commit d3c570a

Author: unknown
Commit message: update
1 parent 5d157ac commit d3c570a

File tree

4 files changed: +12, -11 lines


md_images/data_distribution.png (binary image changed, -1.96 KB)

paper/imgs/data_distribution.png (binary image changed, -511 Bytes)

paper/sections/2-benchmark.tex

Lines changed: 2 additions & 1 deletion
@@ -24,6 +24,7 @@ \section{Scientific General Intelligence: Concept and Operational Definition}
 
 SGI-Bench departs from conventional benchmarks that emphasize factual recall or single-turn reasoning. Instead, it operationalizes the long-horizon workflow of scientific discovery into four interdependent stages: literature review (Deliberation), methodology design (Conception), experiment implementation (Action), and experimental analysis (Perception). These stages correspond to fundamental capabilities required of AI systems: information integration and understanding (Scientific Deep Research), design and planning (Idea Generation), experimental execution (AI-Assisted Scientific Experiment), and reasoning-based interpretation (Scientific Experimental Reasoning). Together, they form a unified framework that measures not only what models know but how they think, plan, and adapt in pursuit of new knowledge.
 
+
 \begin{figure}[ht]
 % \vspace{-0.5em}
 \centerline
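
For quick reference, the stage-to-capability correspondence described in the paragraph above can be written as a small lookup table. This is an illustrative sketch only; the dictionary and its field names are our own shorthand, not code from the SGI-Bench repository.

# Illustrative only: the four SGI-Bench stages and the capability each measures.
SGI_BENCH_STAGES = {
    "Deliberation": {"workflow": "literature review",
                     "capability": "Scientific Deep Research"},
    "Conception":   {"workflow": "methodology design",
                     "capability": "Idea Generation"},
    "Action":       {"workflow": "experiment implementation",
                     "capability": "AI-Assisted Scientific Experiment"},
    "Perception":   {"workflow": "experimental analysis",
                     "capability": "Scientific Experimental Reasoning"},
}
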
@@ -462,7 +463,7 @@ \subsubsection{Metrics of AI-Assisted Scientific Experiment}
 
 \paragraph{Dry Experiment}
 \label{sec: Metric of Dry Experiment}
-Dry experiments focus on code generation tasks. Specifically, each problem includes background information, data code, and main code with certain functions masked. The model is tasked with completing the missing functions. Each problem contains 5 unit tests. Our metrics capture both correctness and execution behavior of the generated code.~\cite{jain2024livecodebenchholisticcontaminationfree}
+Dry experiments focus on code generation tasks. Specifically, each problem includes background information, data code, and main code with certain functions masked. The model is tasked with completing the missing functions. Each problem contains 5 unit tests. Our metrics capture both correctness and execution behavior of the generated code~\cite{jain2024livecodebenchholisticcontaminationfree}.
 
 \begin{tcolorbox}[
 breakable,
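
To make the dry-experiment protocol concrete, here is a minimal scoring sketch. The `score_problem` helper and the plain `python -c` test runner are our own assumptions, not the paper's harness; the source only specifies masked functions, 5 unit tests per problem, and metrics covering correctness and execution behavior.

import os
import subprocess
import tempfile

def score_problem(completed_code: str, unit_tests: list) -> dict:
    """Hypothetical scorer: run each unit test against the model-completed
    code; report correctness (pass rate) and execution behavior (runtime
    errors other than failed assertions)."""
    passed, runtime_errors = 0, 0
    with tempfile.TemporaryDirectory() as tmp:
        with open(os.path.join(tmp, "solution.py"), "w") as f:
            f.write(completed_code)  # data code + main code with functions filled in
        for test in unit_tests:      # the benchmark fixes this at 5 per problem
            script = "from solution import *\n" + test
            try:
                proc = subprocess.run(["python", "-c", script], cwd=tmp,
                                      capture_output=True, timeout=60)
            except subprocess.TimeoutExpired:
                runtime_errors += 1
                continue
            if proc.returncode == 0:
                passed += 1
            elif b"AssertionError" not in proc.stderr:
                runtime_errors += 1  # crashed rather than returning a wrong answer
    n = len(unit_tests)
    return {"pass_rate": passed / n, "runtime_error_rate": runtime_errors / n}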

paper/sections/X-appendix.tex

Lines changed: 10 additions & 10 deletions
@@ -3377,7 +3377,7 @@ \subsection{Supplementary Evaluation Results}
 \bottomrule
 \end{tabular}
 }
-\caption{\textbf{Deep Research Task Metrics (LLMs)}: Category-wise scores across Properties, Micro/Macro-Experiments, and Data.}
+\caption{\textbf{Deep Research Task Metrics (LLMs)}: Category-wise scores across Properties, Micro/Macro-Experiments, and Data. Note: Subjects differ in character, so the number of questions per category is unequal (Figure~\ref{fig: data_distribution}); overall model performance therefore cannot be obtained by directly averaging the values in this table.}
 \label{tab:llms_deep_research_task_metric}
 \end{table}
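
The note added to each caption in this commit is worth spelling out with numbers. In the toy example below, the scores and question counts are invented, not taken from the paper; it simply shows that the plain mean of per-category scores can differ noticeably from the question-weighted mean, which is the quantity that corresponds to overall performance.

# Toy example with invented numbers: why per-category scores cannot simply
# be averaged when categories contain different numbers of questions.
scores = {"Properties": 62.0, "Micro-Exp": 48.0, "Macro-Exp": 55.0, "Data": 70.0}
counts = {"Properties": 120,  "Micro-Exp": 30,   "Macro-Exp": 45,   "Data": 205}

naive    = sum(scores.values()) / len(scores)                                  # 58.75
weighted = sum(scores[c] * counts[c] for c in scores) / sum(counts.values())   # 64.26
print(naive, weighted)  # the naive mean understates overall performance here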

@@ -3408,7 +3408,7 @@ \subsection{Supplementary Evaluation Results}
 \bottomrule
 \end{tabular}
 }
-\caption{\textbf{Deep Research Task Metrics (Agents)}: Category-wise scores across Properties, Micro/Macro-Experiments, and Data.}
+\caption{\textbf{Deep Research Task Metrics (Agents)}: Category-wise scores across Properties, Micro/Macro-Experiments, and Data. Note: Subjects differ in character, so the number of questions per category is unequal (Figure~\ref{fig: data_distribution}); overall model performance therefore cannot be obtained by directly averaging the values in this table.}
 \label{tab:agents_deep_research_task_metric}
 \end{table}

@@ -3449,7 +3449,7 @@ \subsection{Supplementary Evaluation Results}
 \bottomrule
 \end{tabular}
 }
-\caption{\textbf{Dry Experiment Function Categories}: Completion scores across six function types.}
+\caption{\textbf{Dry Experiment Function Categories}: Completion scores across six function types. Note: Subjects differ in character, so the number of questions per category is unequal (Figure~\ref{fig: data_distribution}); overall model performance therefore cannot be obtained by directly averaging the values in this table.}
 \label{tab:dry_task_metric_table}
 \end{table}

@@ -3485,7 +3485,7 @@ \subsection{Supplementary Evaluation Results}
 \bottomrule
 \end{tabular}
 }
-\caption{\textbf{Experimental Reasoning by Type (Multi-choice Accuracy)}: Scores across signal, attribute, comparative, and causal reasoning.}
+\caption{\textbf{Experimental Reasoning by Type (Multi-choice Accuracy)}: Scores across signal, attribute, comparative, and causal reasoning. Note: Subjects differ in character, so the number of questions per category is unequal (Figure~\ref{fig: data_distribution}); overall model performance therefore cannot be obtained by directly averaging the values in this table.}
 \label{tab:mcp_task_metric_table}
 \end{table}

@@ -3526,7 +3526,7 @@ \subsection{Supplementary Evaluation Results}
 \bottomrule
 \end{tabular}
 }
-\caption{\textbf{Deep Research Across Subjects (LLMs)}: Subject-wise scores across ten scientific domains.}
+\caption{\textbf{Deep Research Across Subjects (LLMs)}: Subject-wise scores across ten scientific domains. Note: Subjects differ in character, so the number of questions per category is unequal (Figure~\ref{fig: data_distribution}); overall model performance therefore cannot be obtained by directly averaging the values in this table.}
 \label{tab:llms_deep_research_subject_metric_table}
 \end{table}

@@ -3557,7 +3557,7 @@ \subsection{Supplementary Evaluation Results}
 \bottomrule
 \end{tabular}
 }
-\caption{\textbf{Deep Research Across Subjects (Agents)}: Subject-wise scores across ten scientific domains.}
+\caption{\textbf{Deep Research Across Subjects (Agents)}: Subject-wise scores across ten scientific domains. Note: Subjects differ in character, so the number of questions per category is unequal (Figure~\ref{fig: data_distribution}); overall model performance therefore cannot be obtained by directly averaging the values in this table.}
 \label{tab:agents_deep_research_subject_metric_table}
 \end{table}

@@ -3598,7 +3598,7 @@ \subsection{Supplementary Evaluation Results}
 \bottomrule
 \end{tabular}
 }
-\caption{\textbf{Idea Generation Across Subjects}: Subject-wise scores.}
+\caption{\textbf{Idea Generation Across Subjects}: Subject-wise scores. Note: Subjects differ in character, so the number of questions per category is unequal (Figure~\ref{fig: data_distribution}); overall model performance therefore cannot be obtained by directly averaging the values in this table.}
 \label{tab:idea_subject_metric_table}
 \end{table}

@@ -3639,7 +3639,7 @@ \subsection{Supplementary Evaluation Results}
 \bottomrule
 \end{tabular}
 }
-\caption{\textbf{Dry Experiment Across Subjects}: Subject-wise scores.}
+\caption{\textbf{Dry Experiment Across Subjects}: Subject-wise scores. Note: Subjects differ in character, so the number of questions per category is unequal (Figure~\ref{fig: data_distribution}); overall model performance therefore cannot be obtained by directly averaging the values in this table.}
 \label{tab:dry_subject_metric_table2}
 \end{table}

@@ -3680,7 +3680,7 @@ \subsection{Supplementary Evaluation Results}
 \bottomrule
 \end{tabular}
 }
-\caption{\textbf{Wet Experiment Across Subjects}: Scores across Action Sequence Similarity (SS) and Parameter Accuracy (PA) categories.}
+\caption{\textbf{Wet Experiment Across Subjects}: Scores across Action Sequence Similarity (SS) and Parameter Accuracy (PA) categories. Note: Subjects differ in character, so the number of questions per category is unequal (Figure~\ref{fig: data_distribution}); overall model performance therefore cannot be obtained by directly averaging the values in this table.}
 \label{tab:wet_subject_metric_table}
 \end{table}
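
The wet-experiment metrics are only named in this caption; their definitions live elsewhere in the paper. As a purely hypothetical illustration of what an action Sequence Similarity could look like, one common choice is a normalized longest-common-subsequence ratio over predicted versus reference action lists. Nothing below is the paper's actual definition of SS.

def lcs_similarity(pred, ref):
    """Hypothetical sequence-similarity measure: length of the longest common
    subsequence of two action lists, normalized by the longer list's length."""
    m, n = len(pred), len(ref)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if pred[i] == ref[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[m][n] / max(m, n) if max(m, n) else 1.0

# e.g. lcs_similarity(["weigh", "dissolve", "stir"], ["weigh", "stir"]) == 2/3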

@@ -3716,6 +3716,6 @@ \subsection{Supplementary Evaluation Results}
 \bottomrule
 \end{tabular}
 }
-\caption{\textbf{Experimental Reasoning Across Subjects (Multi-choice Accuracy)}: Subject-wise scores across 10 scientific disciplines.}
+\caption{\textbf{Experimental Reasoning Across Subjects (Multi-choice Accuracy)}: Subject-wise scores across 10 scientific disciplines. Note: Subjects differ in character, so the number of questions per category is unequal (Figure~\ref{fig: data_distribution}); overall model performance therefore cannot be obtained by directly averaging the values in this table.}
 \label{tab:mcp_subject_metric_table}
 \end{table}
