
Commit 67f42aa

YihengWang authored and committed
update README
1 parent 78b4ab0 commit 67f42aa

File tree

12 files changed: +83 −30 lines


README.md

Lines changed: 83 additions & 30 deletions
@@ -1,69 +1,112 @@
1-
# SciEval ToolKit
1+
<h1 align="center">SciEval ToolKit</h1>
22

3-
**SciEval** is an open-source evaluation framework and leaderboard for measuring the *scientific intelligence* of large language and vision–language models.
4-
It targets the full research workflow which is from scientific image understanding to hypothesis generation and provides a reproducible toolkit that unifies data loading, prompt construction, inference and evaluation.
3+
<p align="center"><strong>
4+
A unified evaluation toolkit and leaderboard for rigorously assessing the scientific intelligence of large language and vision–language models across the full research workflow.
5+
</strong></p>
6+
7+
<hr style="width:100%;margin:16px 0;border:0;border-top:0.1px solid #d0d7de;" />
8+
9+
<p align="center">
10+
<a href="https://opencompass.org.cn/Intern-Discovery-Eval">
11+
<img src="https://img.shields.io/badge/Website-SciEval-b8dcff?style=for-the-badge&logo=google-chrome&logoColor=white" />
12+
</a>&nbsp;&nbsp;&nbsp;
13+
<a href="https://huggingface.co/spaces/InternScience/SciEval-Leaderboard">
14+
<img src="https://img.shields.io/badge/LEADERBOARD-Scieval-f6e58d?style=for-the-badge&logo=huggingface" />
15+
</a>&nbsp;&nbsp;&nbsp;
16+
<a href="https://github.com/InternScience/SciEvalKit/blob/main/docs/SciEvalKit.pdf">
17+
<img src="https://img.shields.io/badge/REPORT-Technical-f4c2d7?style=for-the-badge" />
18+
</a>&nbsp;&nbsp;&nbsp;
19+
<a href="https://github.com/InternScience/SciEvalKit">
20+
<img src="https://img.shields.io/badge/GitHub-Repository-c7b9e2?style=for-the-badge&logo=github&logoColor=white" />
21+
</a>
22+
</p>
23+
24+
<p align="center">
25+
<img src="assets/icon/welcome.png" alt="welcome" height="24" style="vertical-align:middle;" />
26+
&nbsp;Welcome to the official repository of <strong>SciEval</strong>!
27+
</p>
28+
29+
30+
## <img src="assets/icon/why.png" alt="why" height="28" style="vertical-align:middle;" />&nbsp;Why SciEval?
31+
32+
33+
**SciEval** is an open‑source evaluation framework and leaderboard aimed at measuring the **scientific intelligence** of large language and vision–language models.
34+
Although modern frontier models often score around *90* on general‑purpose benchmarks, their performance drops sharply on rigorous, domain‑specific scientific tasks, revealing a persistent **general‑versus‑scientific gap** that motivates SciEval.
35+
Its design is shaped by the following core ideas:
36+
37+
- **Beyond general‑purpose benchmarks ▸** Traditional evaluations focus on surface‑level correctness or broad‑domain reasoning, hiding models’ weaknesses in realistic scientific problem solving. SciEval makes this **general‑versus‑scientific gap** explicit and supplies the evaluation infrastructure needed to guide the integration of broad instruction‑tuned abilities with specialised skills in coding, symbolic reasoning and diagram understanding.
38+
- **End‑to‑end workflow coverage ▸** SciEval spans the full research pipeline, covering **image interpretation, symbolic reasoning, executable code generation, and hypothesis generation**, rather than isolated subtasks.
39+
- **Capability‑oriented & reproducible ▸** A unified toolkit for **dataset construction, prompt engineering, inference, and expert‑aligned scoring** ensures transparent and repeatable comparisons.
40+
- **Grounded in real scenarios ▸** Benchmarks use domain‑specific data and tasks so performance reflects **actual scientific practice**, not synthetic proxies.
541

642
<div align="center">
743
<img src="assets/github.png" alt="SciEval capability radar" width="100%">
844
</div>
945

10-
Modern frontier language models routinely score near *90* on general‑purpose benchmarks, yet even the strongest model (e.g., **Gemini 3 Pro**) drops below *60* when challenged by rigorous, domain‑specific scientific tasks. SciEval makes this **general‑versus‑scientific gap** explicit and supplies the evaluation infrastructure needed to guide the integration of broad instruction‑tuned abilities with specialised skills in coding, symbolic reasoning and diagram understanding.
46+
47+
## <img src="assets/icon/progress.png" alt="progress" height="28" style="vertical-align:middle;" />&nbsp;Progress in Scientific Intelligence
48+
49+
*Real‑time updates: scores are synchronized with the [Intern‑Discovery‑Eval](https://opencompass.org.cn/Intern-Discovery-Eval/rank) leaderboard.*
1150

1251
<div align="center">
1352
<img src="assets/general_scientific_comparison.png" alt="SciEval capability radar" width="100%">
1453
</div>
1554

16-
## Key Features
17-
| Category | Highlights |
18-
|----------|------------|
19-
| **Five Core Dimensions** | Scientific Knowledge Understanding, Scientific Code Generation, Scientific Symbolic Reasoning, Scientific Hypothesis Generation, Scientific Image Understanding. |
20-
| **Discipline Coverage** | Life Science • Astronomy • Earth Science • Chemistry • Materials Science • Physics. |
21-
| **Multimodal & Executable Scoring** | Supports text, code, and image inputs; integrates code tasks and LLM-judge fallback for open-ended answers. |
22-
| **Reproducible & Extensible** | Clear dataset and model registries, minimised hard-coding and modular evaluators make new tasks or checkpoints easy to plug in. |
55+
- **General benchmarks overestimate scientific competence.** Even the strongest frontier models (e.g., **Gemini 3 Pro**) score below **60** on **Scientific Text Capability**, despite scoring near *90* on widely used general‑purpose benchmarks.
56+
- **Multimodal capability is breaking the 60‑point barrier.** **Gemini 3 Pro** leads **Scientific Multimodal Capability** with **62.88**, reflecting strong performance in multimodal perception and reasoning.
57+
- **Open‑source systems are rapidly closing the gap.** *Qwen3‑VL‑235B‑A22B* and *Qwen3‑Max* now match or surpass several proprietary models in symbolic reasoning and code generation, signalling healthy community progress.
58+
- **Symbolic reasoning and code generation remain bottlenecks.** No model exceeds **50** in equation‑level manipulation or **30** in end‑to‑end executable code tasks, indicating that scientific workflows requiring programmatic pipelines still fail frequently.
2359

24-
<hr style="height:1px;background:black;border:none;" />
2560

26-
## News
27-
* **2025‑12‑05 · SciEval v1 Launch**
28-
&nbsp;&nbsp;• Initial public release of a science‑focused evaluation toolkit and leaderboard devoted to realistic research workflows.
61+
## <img src="assets/icon/key.png" alt="key" height="28" style="vertical-align:middle;" />&nbsp;Key Features
2962

30-
&nbsp;&nbsp;• Initial evaluation of 20 frontier models (closed & open source) now live at <https://discovery.intern-ai.org.cn/sciprismax/leaderboard>.
63+
| Category | Highlights |
64+
| ----------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------- |
65+
| **Seven Core Dimensions** | Scientific Knowledge Understanding, Scientific Code Generation, Scientific Symbolic Reasoning, Scientific Hypothesis Generation, Scientific Multimodal Perception, Scientific Multimodal Reasoning, Scientific Multimodal Understanding. |
66+
| **Discipline Coverage** | Life Science • Astronomy • Earth Science • Chemistry • Materials Science • Physics. |
67+
| **Multimodal & Executable Scoring** | Supports text, code, and image inputs; integrates code tasks and LLM-judge fallback for open-ended answers. |
68+
| **Reproducible & Extensible** | Clear dataset and model registries, minimised hard-coding and modular evaluators make new tasks or checkpoints easy to plug in. |
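The registry idea can be illustrated with a generic plug‑in pattern; the names and classes below are purely hypothetical and are **not** SciEvalKit's actual API, just a sketch of how a new benchmark can register itself without touching the driver code.

```python
# Illustrative registry pattern (hypothetical, not SciEvalKit's actual API):
# benchmarks register themselves under a name, and a run.py-style driver
# looks them up by that name instead of hard-coding dataset classes.
DATASET_REGISTRY: dict[str, type] = {}


def register_dataset(name: str):
    def decorator(cls: type) -> type:
        DATASET_REGISTRY[name] = cls
        return cls
    return decorator


@register_dataset("MyNewBenchmark")
class MyNewBenchmark:
    def load(self) -> list[dict]:
        # Return prompt/answer records; a real loader would read files or a hub dataset.
        return [{"question": "2 + 2 = ?", "answer": "4"}]


if __name__ == "__main__":
    dataset = DATASET_REGISTRY["MyNewBenchmark"]()
    print(dataset.load())
```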
3169

32-
&nbsp;&nbsp;• Coverage: five scientific capability dimensions × six major disciplines in the initial benchmark suite.
3370

34-
* **Community Submissions Open**
35-
Submit your benchmarks via pull request to appear on the official leaderboard.
71+
## <img src="assets/icon/news.png" alt="news" height="28" style="vertical-align:middle;" />&nbsp;News
72+
* **[2025‑12‑12] · 📰 Evaluation Published on OpenCompass**
73+
&nbsp;&nbsp;• SciEval’s benchmark results are now live on the [OpenCompass](https://opencompass.org.cn/Intern-Discovery-Eval) platform, providing broader community visibility and comparison.
3674

37-
## Codebase Updates
38-
* **Execution‑based Scoring**
39-
Code‑generation tasks (SciCode, AstroVisBench) are now graded via sandboxed unit tests.
75+
* **[2025‑12‑05] · 🚀 SciEval v1 Launch**
76+
&nbsp;&nbsp;• Initial public release of a science‑focused evaluation toolkit and leaderboard devoted to realistic research workflows.
4077

41-
## Quick Start
78+
&nbsp;&nbsp;• Coverage: seven scientific capability dimensions × six major disciplines in the initial benchmark suite.
79+
* **[2025‑12‑05] · 🌟 Community Submissions Open**
80+
&nbsp;&nbsp;• Submit your benchmarks via pull request to appear on the official leaderboard.
4281

43-
Get from clone to first scores in minutes&mdash;see our local
44-
[QuickStart](docs/en/Quickstart.md) / [快速开始](docs/zh-CN/Quickstart.md)
45-
guides, or consult the VLMEvalKit tutorial
46-
<https://vlmevalkit.readthedocs.io/en/latest/Quickstart.html> for additional reference.
82+
## <img src="assets/icon/start.png" alt="start" height="28" style="vertical-align:middle;" />&nbsp;Quick Start
83+
84+
Get from clone to first scores in minutes&mdash;see our local [QuickStart](docs/en/Quickstart.md) / [快速开始](docs/zh-CN/Quickstart.md) guides, or consult the [VLMEvalKit tutorial](https://vlmevalkit.readthedocs.io/en/latest/Quickstart.html) for additional reference.
4785

4886
### 1 · Install
87+
4988
```bash
5089
git clone https://github.com/InternScience/SciEvalKit.git
51-
cd SciEval-Kit
90+
cd SciEvalKit
5291
pip install -e .[all] # brings in vllm, openai‑sdk, hf_hub, etc.
5392
```
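Tip: on zsh‑like shells, quote the extras spec, e.g. `pip install -e ".[all]"`, so the square brackets are not glob‑expanded.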
5493

55-
### 2 · (Optional) add API keys
94+
### 2 · (Optional) add API keys
95+
5696
Create a `.env` at the repo root **only if** you will call API models or
5797
use an LLM‑as‑judge backend:
98+
5899
```bash
59100
OPENAI_API_KEY=...
60101
GOOGLE_API_KEY=...
61102
DASHSCOPE_API_KEY=...
62103
```
104+
63105
If no keys are provided, SciEval falls back to rule‑based scoring
64106
whenever possible.
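This behaviour can be pictured with a small, purely hypothetical helper (not SciEval's actual internals): it calls an LLM judge only when one is configured and a key such as `OPENAI_API_KEY` is present, and otherwise falls back to rule‑based matching.

```python
import os
import re


def _normalize(text: str) -> str:
    # Collapse whitespace and lowercase so trivially different strings still match.
    return re.sub(r"\s+", " ", text).strip().lower()


def score_answer(prediction: str, reference: str, llm_judge=None) -> float:
    """Hypothetical scorer: use an LLM judge when configured, else rule-based matching."""
    if llm_judge is not None and os.getenv("OPENAI_API_KEY"):
        # Open-ended answers are delegated to the configured judge callable.
        return llm_judge(prediction, reference)
    # Rule-based fallback: exact match after normalisation.
    return 1.0 if _normalize(prediction) == _normalize(reference) else 0.0
```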
65107

66108
### 3 · Run an API demo test
109+
67110
```bash
68111
python run.py \
69112
--dataset SFE \
@@ -74,6 +117,7 @@ python run.py \
74117
```
75118

76119
### 4 · Evaluate a local/GPU model
120+
77121
```bash
78122
python run.py \
79123
--dataset MaScQA \
@@ -86,8 +130,17 @@ python run.py \
86130
# if the benchmark requires an LLM judge.
87131
```
88132

89-
## Acknowledgements
133+
## <img src="assets/icon/update.png" alt="update" height="28" style="vertical-align:middle;" />&nbsp;Codebase Updates
134+
135+
* **Execution‑based Scoring**
136+
Code‑generation tasks (SciCode, AstroVisBench) are now graded via sandboxed unit tests.
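  A minimal sketch of this pattern, assuming pytest‑style tests and not reflecting the toolkit's actual sandbox implementation: write the generated solution and its unit tests to a temporary directory and run them in a subprocess with a hard timeout.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_in_sandbox(solution_code: str, test_code: str, timeout: int = 60) -> bool:
    """Illustrative execution-based check: run unit tests against generated code in a subprocess."""
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "solution.py").write_text(solution_code)
        Path(tmp, "test_solution.py").write_text(test_code)
        try:
            # A separate process with a timeout limits the blast radius of faulty generated code.
            proc = subprocess.run(
                [sys.executable, "-m", "pytest", "-q", "test_solution.py"],
                cwd=tmp,
                capture_output=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return False
        return proc.returncode == 0
```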
137+
138+
139+
140+
## <img src="assets/icon/thanks.png" alt="thanks" height="28" style="vertical-align:middle;" />&nbsp;Acknowledgements
90141

91142
SciEval ToolKit is built on top of the excellent **[VLMEvalKit](https://github.com/open-compass/VLMEvalKit)** framework, and we thank the OpenCompass team not only for open‑sourcing their engine, but also for publishing thorough deployment and development guides ([Quick Start](https://vlmevalkit.readthedocs.io/en/latest/Quickstart.html) and [Development Notes](https://vlmevalkit.readthedocs.io/en/latest/Development.html)) that streamlined our integration.
92143

93144
We also acknowledge the core SciEval contributors for their efforts on dataset curation, evaluation design, and engine implementation: Jun Yao, Han Deng, Yizhou Wang, Jiabei Xiao, Jiaqi Liu, Encheng Su, Yujie Liu, Weida Wang, Junchi Yao, Haoran Sun, Runmin Ma, Bo Zhang, Dongzhan Zhou, Shufei Zhang, Peng Ye, Xiaosong Wang, and Shixiang Tang, as well as all community testers who provided early feedback.
145+
146+
SciEvalKit contributors can join the author list of the report based on their contributions to the repository. Specifically, this requires three major contributions (implementing a new benchmark, adding a foundation model, or contributing a major feature). We will update the report quarterly, and an additional section detailing each developer's contributions will be appended in the next update.
Binary assets changed:

assets/github.png (696 KB)
assets/icon/key.png (9.33 KB)
assets/icon/news.png (10.9 KB)
assets/icon/progress.png (8.94 KB)
assets/icon/start.png (10.2 KB)
assets/icon/thanks.png (3.87 KB)
assets/icon/update.png (7.49 KB)
assets/icon/welcome.png (9.31 KB)
