
Commit 0397391

Author: unknown
Commit message: update
1 parent 871f050 commit 0397391


3 files changed: +2 -2 lines changed


paper.pdf

Binary file not shown (-17 Bytes).

paper/sections/4-evaluation.tex

Lines changed: 1 addition & 1 deletion
@@ -380,7 +380,7 @@ \subsection{Scientific Experimental Reasoning}
\end{figure}

\textbf{Reasoning validity often exceeds answer accuracy.}
- As shown in Figure~\ref{fig: mm_results}, closed-source LLMs generally outperform open-source counterparts on both metrics, with the best closed-source models achieving higher MCA (e.g., up to 41.9) and RV (e.g., up to 57.1) than the best open-source models (MCA 37.8, RV 52.3). However, several open-source models remain competitive with or exceed some closed-source systems in specific metrics (e.g., Qwen3-VL-235B-A22B RV 50.5 > GPT-4o RV 45.4), indicating nontrivial overlap. Most models score higher in Reasoning Validity than in Multi-choice Accuracy, suggesting that even when the final choice is incorrect, explanations often preserve partial logical coherence. Variance is moderate—particularly among closed-source models—while only a few models (e.g., Intern-S1-mini) show noticeably lower performance, pointing to the importance of scale for robust multimodal scientific reasoning.
+ As shown in Figure~\ref{fig: mm_results}, closed-source LLMs generally outperform open-source counterparts on both metrics, with the best closed-source models achieving higher MCA (e.g., up to 41.9) and RV (e.g., up to 71.3) than the best open-source models (MCA 37.8, RV 52.3). However, several open-source models remain competitive with or exceed some closed-source systems in specific metrics (e.g., Qwen3-VL-235B-A22B RV 50.5 > GPT-4o RV 45.4), indicating nontrivial overlap. Most models score higher in Reasoning Validity than in Multi-choice Accuracy, suggesting that even when the final choice is incorrect, explanations often preserve partial logical coherence. Variance is moderate—particularly among closed-source models—while only a few models (e.g., Intern-S1-mini) show noticeably lower performance, pointing to the importance of scale for robust multimodal scientific reasoning.


\begin{figure}[ht]

paper/sections/5-discussion.tex

Lines changed: 1 addition & 1 deletion
@@ -186,7 +186,7 @@ \subsection{Fragmentation Across the Four Quadrants of SGI}
At a finer granularity, Deep Research tasks involving \textbf{Data} and \textbf{Properties} are the weakest: performance on these categories is substantially below that of \textbf{Micro-} and \textbf{Macro-experiment} questions, with \emph{all four categories rarely exceeding 30\%} accuracy (Figure~\ref{fig: deep research on different task}). This aligns with the task design: data/property questions require retrieving dispersed numerical details across heterogeneous papers, while experiment-oriented questions provide more structured evidence. The results thus expose a core SGI bottleneck: \emph{meta-analytic retrieval + numerical aggregation over scattered literature}.

\paragraph{Conception: Ideas lack implementability.}
- Idea Generation in SGI-Bench is assessed using \textbf{Effectiveness}, \textbf{Detailedness}, and \textbf{Feasibility} (Table~\ref{tab:idea_gen_res}). \textbf{Feasibility is low across models}: many systems score in the 14–20 range, and the best result reaches 22.90 (\texttt{o3}), indicating that feasibility consistently lags behind novelty and detailedness. \textbf{Detailedness remains insufficient for several models}, with implementation steps frequently missing concrete parameters, resource assumptions, or step ordering; \textbf{Effectiveness is moderate for most systems}, with the highest result of 40.92 (GPT-5) and open-source models clustering around 24.95–28.50 (e.g., DeepSeek-V3.2, Llama-4-Scout).
+ Idea Generation in SGI-Bench is assessed using \textbf{Effectiveness}, \textbf{Detailedness}, and \textbf{Feasibility} (Table~\ref{tab:idea_gen_res}). \textbf{Feasibility is low across models}: many systems score in the 14–20 range, and the best result reaches 22.90 (\texttt{o3}), indicating that feasibility consistently lags behind novelty and detailedness. \textbf{Detailedness remains insufficient for several models}, with implementation steps frequently missing concrete parameters, resource assumptions, or step ordering; \textbf{Effectiveness is moderate for most systems}, with the highest result of 51.36 (GPT-5.2-Pro) and open-source models clustering around 24.95–28.50 (e.g., DeepSeek-V3.2, Llama-4-Scout).

Recurring issues include: (i) underspecified implementation steps—absent data acquisition or preprocessing plans, missing hyperparameters or compute assumptions, vague module choices (e.g., solver type, training objective, evaluation protocol), and unclear interfaces, ordering, or data flow; and (ii) infeasible procedures—reliance on unavailable instruments or data, uncoordinated pipelines that cannot be executed, and designs lacking reproducibility.
