
Commit 0397391

Author: unknown
Commit message: update
1 parent 871f050 commit 0397391


3 files changed: +2 -2 lines changed


paper.pdf

Binary file not shown (-17 Bytes).

paper/sections/4-evaluation.tex

Lines changed: 1 addition & 1 deletion
@@ -380,7 +380,7 @@ \subsection{Scientific Experimental Reasoning}
\end{figure}

\textbf{Reasoning validity often exceeds answer accuracy.}
- As shown in Figure~\ref{fig: mm_results}, closed-source LLMs generally outperform open-source counterparts on both metrics, with the best closed-source models achieving higher MCA (e.g., up to 41.9) and RV (e.g., up to 57.1) than the best open-source models (MCA 37.8, RV 52.3). However, several open-source models remain competitive with or exceed some closed-source systems in specific metrics (e.g., Qwen3-VL-235B-A22B RV 50.5 > GPT-4o RV 45.4), indicating nontrivial overlap. Most models score higher in Reasoning Validity than in Multi-choice Accuracy, suggesting that even when the final choice is incorrect, explanations often preserve partial logical coherence. Variance is moderate—particularly among closed-source models—while only a few models (e.g., Intern-S1-mini) show noticeably lower performance, pointing to the importance of scale for robust multimodal scientific reasoning.
+ As shown in Figure~\ref{fig: mm_results}, closed-source LLMs generally outperform open-source counterparts on both metrics, with the best closed-source models achieving higher MCA (e.g., up to 41.9) and RV (e.g., up to 71.3) than the best open-source models (MCA 37.8, RV 52.3). However, several open-source models remain competitive with or exceed some closed-source systems in specific metrics (e.g., Qwen3-VL-235B-A22B RV 50.5 > GPT-4o RV 45.4), indicating nontrivial overlap. Most models score higher in Reasoning Validity than in Multi-choice Accuracy, suggesting that even when the final choice is incorrect, explanations often preserve partial logical coherence. Variance is moderate—particularly among closed-source models—while only a few models (e.g., Intern-S1-mini) show noticeably lower performance, pointing to the importance of scale for robust multimodal scientific reasoning.


\begin{figure}[ht]

paper/sections/5-discussion.tex

Lines changed: 1 addition & 1 deletion
@@ -186,7 +186,7 @@ \subsection{Fragmentation Across the Four Quadrants of SGI}
At a finer granularity, Deep Research tasks involving \textbf{Data} and \textbf{Properties} are the weakest: performance on these categories is substantially below that of \textbf{Micro-} and \textbf{Macro-experiment} questions, with \emph{all four categories rarely exceeding 30\%} accuracy (Figure~\ref{fig: deep research on different task}). This aligns with the task design: data/property questions require retrieving dispersed numerical details across heterogeneous papers, while experiment-oriented questions provide more structured evidence. The results thus expose a core SGI bottleneck: \emph{meta-analytic retrieval + numerical aggregation over scattered literature}.

\paragraph{Conception: Ideas lack implementability.}
- Idea Generation in SGI-Bench is assessed using \textbf{Effectiveness}, \textbf{Detailedness}, and \textbf{Feasibility} (Table~\ref{tab:idea_gen_res}). \textbf{Feasibility is low across models}: many systems score in the 14–20 range, and the best result reaches 22.90 (\texttt{o3}), indicating that feasibility consistently lags behind novelty and detailedness. \textbf{Detailedness remains insufficient for several models}, with implementation steps frequently missing concrete parameters, resource assumptions, or step ordering; \textbf{Effectiveness is moderate for most systems}, with the highest result of 40.92 (GPT-5) and open-source models clustering around 24.95–28.50 (e.g., DeepSeek-V3.2, Llama-4-Scout).
+ Idea Generation in SGI-Bench is assessed using \textbf{Effectiveness}, \textbf{Detailedness}, and \textbf{Feasibility} (Table~\ref{tab:idea_gen_res}). \textbf{Feasibility is low across models}: many systems score in the 14–20 range, and the best result reaches 22.90 (\texttt{o3}), indicating that feasibility consistently lags behind novelty and detailedness. \textbf{Detailedness remains insufficient for several models}, with implementation steps frequently missing concrete parameters, resource assumptions, or step ordering; \textbf{Effectiveness is moderate for most systems}, with the highest result of 51.36 (GPT-5.2-Pro) and open-source models clustering around 24.95–28.50 (e.g., DeepSeek-V3.2, Llama-4-Scout).

Recurring issues include: (i) underspecified implementation steps—absent data acquisition or preprocessing plans, missing hyperparameters or compute assumptions, vague module choices (e.g., solver type, training objective, evaluation protocol), and unclear interfaces, ordering, or data flow; and (ii) infeasible procedures—reliance on unavailable instruments or data, uncoordinated pipelines that cannot be executed, and designs lacking reproducibility.
