Scientific General Intelligence (SGI) is defined as an AI system that can autonomously navigate the full, iterative cycle of scientific inquiry—Deliberation, Conception, Action, and Perception—with the versatility and proficiency of a human scientist. SGI-Bench operationalizes this definition via four scientist-aligned task families: deep research, idea generation, dry/wet experiments, and multimodal experimental reasoning. The benchmark spans 10 disciplines and more than 1,000 expert-curated samples inspired by Science's 125 Big Questions.
SGI-Bench data is scientist-aligned and high-fidelity: an expert-sourced corpus spanning 10 disciplines (inspired by Science’s 125 Big Questions), questions constructed by 100+ Master's and PhD holders with continuous scientist-in-the-loop review, multi-stage cleaning (rules + model checks + expert QA) to ensure executability and unique answers, and difficulty filtering that removes items solved by >50% strong LLMs—yielding authentic, challenging, and broadly representative scientific tasks.
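As a rough illustration of the difficulty-filtering step, here is a minimal sketch assuming each candidate item carries per-model pass/fail verdicts from the strong-LLM panel (the data layout and function name are hypothetical, not the benchmark's released pipeline):

# Minimal sketch of the difficulty filter: items solved by more than
# 50% of a panel of strong LLMs are considered too easy and dropped.
from typing import Dict, List

def filter_by_difficulty(
    items: List[dict],
    results: Dict[str, Dict[str, bool]],  # item_id -> {model_name: solved?}
    threshold: float = 0.5,
) -> List[dict]:
    """Keep only items whose panel solve rate is at most `threshold`."""
    kept = []
    for item in items:
        verdicts = results[item["id"]]
        solve_rate = sum(verdicts.values()) / len(verdicts)
        if solve_rate <= threshold:  # solved by >50% of models -> dropped
            kept.append(item)
    return kept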
<p><strong class="text-slate-700">Overview Results Across SGI-Bench Tasks:</strong> Aggregated performance across Deep Research, Idea Generation, Dry/Wet Experiment, and Experimental Reasoning. The scores for Deep Research are based on the exact match metric (the strictest metric). Idea Generation scores are the average of four metrics evaluating ideas. Dry Experiment scores are based on PassAll@5 (the strictest metric). Wet Experiment scores are the average of action sequence similarity and parameter accuracy. Experimental Reasoning scores are based on the multi-choice accuracy metric (the strictest metric). The SGI-Score is the average across these tasks, reflecting the overall capability of an AI model in various scientific research scenarios. Note: The Experimental Reasoning metrics for Qwen3-Max and Qwen3-8B come from their multimodal versions.</p>
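For concreteness, a minimal sketch of this aggregation, assuming each per-task score is already normalized to [0, 1] (the field names are hypothetical; the benchmark's released scoring code may differ):

# Minimal sketch of the SGI-Score aggregation described above.
from statistics import mean

def sgi_score(task_scores: dict) -> float:
    """Average the five task-level scores into a single SGI-Score."""
    deep_research = task_scores["deep_research_exact_match"]
    idea_generation = mean(task_scores["idea_generation_metrics"])  # four idea metrics
    dry_experiment = task_scores["dry_experiment_pass_all_at_5"]
    wet_experiment = mean([
        task_scores["wet_action_sequence_similarity"],
        task_scores["wet_parameter_accuracy"],
    ])
    reasoning = task_scores["experimental_reasoning_accuracy"]
    return mean([deep_research, idea_generation, dry_experiment,
                 wet_experiment, reasoning])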
This work advances the study of Scientific General Intelligence (SGI) from both theory and practice. Grounded in the Practical Inquiry Model, we formalize SGI as the capacity to navigate the iterative cycle of Deliberation, Conception, Action, and Perception with the versatility of a human scientist. Building on this principle-grounded definition, we operationalize SGI through SGI-Bench, a comprehensive, scientist-aligned benchmark that instantiates four core task families: Scientific Deep Research, Idea Generation, Dry/Wet Experiment, and Experimental Reasoning. Complemented by our agentic evaluation framework and multi-metric protocol, SGI-Bench enables scalable, transparent, and domain-faithful assessment.
</p>
<p>
Experiments reveal a consistent pattern: in Deep Research, models show step-level alignment but low exact-match accuracy (10--20%), with brittleness in quantitative reasoning; in Idea Generation, hypotheses are fluent but underspecified and infeasible; in Dry Experiment, code is executable but PassAll@k remains low; in Wet Experiment, action sequences show omissions and misordering; and in Experimental Reasoning, causal reasoning outperforms comparative reasoning, with persistent multimodal challenges. These results highlight the gap between linguistic fluency and integrated scientific cognition. Moreover, SGI is a dynamic capacity: Test-Time Reinforcement Learning (TTRL) with novelty rewards improves idea generation without reference answers.
</p>
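A minimal sketch of what a retrieval-augmented novelty reward of this kind could look like, assuming generated ideas and retrieved prior-work abstracts are compared in a shared embedding space (the embedding model, retrieval corpus, and exact reward form are assumptions, not the paper's implementation):

# Illustrative sketch of a retrieval-augmented novelty reward in the
# spirit of the TTRL setup described above.
import numpy as np

def novelty_reward(idea_emb: np.ndarray, retrieved_embs: np.ndarray) -> float:
    """Reward an idea by its distance from the nearest retrieved prior work.

    idea_emb:       (d,) embedding of the generated hypothesis
    retrieved_embs: (k, d) embeddings of retrieved related abstracts
    """
    idea = idea_emb / np.linalg.norm(idea_emb)
    prior = retrieved_embs / np.linalg.norm(retrieved_embs, axis=1, keepdims=True)
    max_sim = float((prior @ idea).max())  # similarity to closest prior work
    return 1.0 - max_sim  # higher when the idea is far from all retrieved work

In a TTRL loop, this scalar would serve as the reward signal optimized at inference time, which is what lets the method improve hypothesis novelty without reference answers.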
paper/sections/0-abstract.tex (1 addition, 1 deletion)
\textbf{Abstract:}
Despite advances in scientific AI, a coherent framework for Scientific General Intelligence (SGI)—the ability to autonomously conceive, investigate, and reason across scientific domains—remains lacking.
We present an operational SGI definition grounded in the Practical Inquiry Model (PIM: Deliberation, Conception, Action, Perception) and operationalize it via four scientist-aligned tasks: deep research, idea generation, dry/wet experiments, and experimental reasoning.
SGI-Bench comprises over 1,000 expert-curated, cross-disciplinary samples inspired by Science's 125 Big Questions, enabling systematic evaluation of state-of-the-art LLMs. Results reveal gaps: low exact match (10--20\%) in deep research despite step-level alignment; ideas lacking feasibility and detail; high code executability but low execution result accuracy in dry experiments; low sequence fidelity in wet protocols; and persistent multimodal comparative-reasoning challenges.
We further introduce Test-Time Reinforcement Learning (TTRL), which optimizes retrieval-augmented novelty rewards at inference, enhancing hypothesis novelty without reference answers.
Together, our PIM-grounded definition, workflow-centric benchmark, and empirical insights establish a foundation for AI systems that genuinely participate in scientific discovery.