# SciEval ToolKit

**SciEval** is an open-source evaluation framework and leaderboard for measuring the *scientific intelligence* of large language and vision–language models.
It targets the full research workflow, from scientific image understanding to hypothesis generation, and provides a reproducible toolkit that unifies data loading, prompt construction, inference and evaluation.

Modern frontier language models routinely score near *90* on general‑purpose benchmarks, yet even the strongest model (e.g., **Gemini 3 Pro**) drops below *60* when challenged by rigorous, domain‑specific scientific tasks. SciEval makes this **general‑versus‑scientific gap** explicit and supplies the evaluation infrastructure needed to guide the integration of broad instruction‑tuned abilities with specialised skills in coding, symbolic reasoning and diagram understanding.

```
pip install -e .[all]   # brings in vllm, openai‑sdk, hf_hub, etc.
```

```
OPENAI_API_KEY=...
GOOGLE_API_KEY=...
DASHSCOPE_API_KEY=...
```

If no keys are provided, SciEval falls back to rule‑based scoring whenever possible.

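As a sketch, assuming SciEval also picks these keys up from the shell environment (and not only from a `.env` file), you can export them directly before running an evaluation:

```
export OPENAI_API_KEY=...
export GOOGLE_API_KEY=...
export DASHSCOPE_API_KEY=...
```
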
### 3 · Run an API demo test

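The sketch below assumes SciEval inherits VLMEvalKit's `run.py` command-line interface, where `--data` selects the benchmark and `--model` the model under test; the dataset and model names shown are placeholders, not confirmed identifiers:

```
# Placeholder names: substitute a dataset and model alias from your
# installed SciEval configuration.
python run.py \
    --data SciEval_Demo \
    --model GPT4o_API \
    --verbose
```
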
## Acknowledgements
SciEval ToolKit is built on top of the excellent **[VLMEvalKit](https://github.com/open-compass/VLMEvalKit)** framework, and we thank the OpenCompass team not only for open‑sourcing their engine, but also for publishing thorough deployment and development guides ([Quick Start](https://vlmevalkit.readthedocs.io/en/latest/Quickstart.html), [Development Notes](https://vlmevalkit.readthedocs.io/en/latest/Development.html)) that streamlined our integration.

We also acknowledge the core SciEval contributors for their efforts on dataset curation, evaluation design, and engine implementation: Jun Yao, Han Deng, Yizhou Wang, Jiabei Xiao, Jiaqi Liu, Encheng Su, Yujie Liu, Weida Wang, Junchi Yao, Haoran Sun, Runmin Ma, Bo Zhang, Dongzhan Zhou, Shufei Zhang, Peng Ye, Xiaosong Wang, and Shixiang Tang, as well as all community testers who provided early feedback.