
Commit bc7a58b

Authored by younesbelkada, DaGrapix, and EricSaikali
Feat: Add team Shaikespear submission from NeurIPS E2LM Competition (#3437)
* Add Shaipeskear submission

Co-authored-by: Anthony Kalaydjian <[email protected]>
Co-authored-by: EricSaikali <[email protected]>

* pre-commit

* move to e2lmc

---------

Co-authored-by: Anthony Kalaydjian <[email protected]>
Co-authored-by: EricSaikali <[email protected]>
1 parent b315ef3 commit bc7a58b

File tree

7 files changed, +98 -0 lines changed
Lines changed: 47 additions & 0 deletions
@@ -0,0 +1,47 @@
# SciKnowEval_mcqa

This task was submitted to the [NeurIPS 2025 E2LM](https://e2lmc.github.io/) competition, where it reached $3^{rd}$ place on the general leaderboard.
It is intended for evaluating Small Language Models (SLMs) in their early training stages. More details are provided in the competition [proposal paper](https://arxiv.org/pdf/2506.07731).

### Benchmark details

This task uses a subset of the [SciKnowEval](https://huggingface.co/datasets/hicai-zju/SciKnowEval) dataset. Specifically, it filters out non-MCQA samples and focuses on questions from levels L1, L2, and L3, which are designed to assess knowledge memory, comprehension, and reasoning, respectively, as described in the original [paper](https://arxiv.org/pdf/2406.09098v2).

The full SciKnowEval dataset is a comprehensive benchmark for evaluating the scientific knowledge and reasoning capabilities of Large Language Models (LLMs). It spans four scientific domains: Physics, Chemistry, Biology, and Materials.

SciKnowEval_mcqa dataset: https://huggingface.co/datasets/ShAIkespear/SciKnowEval_mcqa
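For a quick, standalone look at the data, the MCQA subset can be loaded directly with the `datasets` library. The sketch below follows the task configuration included in this commit (subject configs `Biology`, `Chemistry`, `Material`, `Physics`; splits `dev` and `test`; fields `question`, `choices`, `answer`); treat it as illustrative rather than an official loading recipe.

```python
# Inspect the MCQA subset directly. Subject names and split names follow the
# task configuration in this commit (Biology / Chemistry / Material / Physics;
# splits: dev, test); question, choices, and answer are the fields referenced
# by the prompt template.
from datasets import load_dataset

ds = load_dataset(
    "ShAIkespear/SciKnowEval_mcqa",
    "Biology",
    trust_remote_code=True,  # mirrors dataset_kwargs in the task template
)
print(ds)             # DatasetDict with the splits used by the task
print(ds["test"][0])  # a single MCQ: question, choices, answer
```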
### Citation

```
@misc{sci-know-2025-mcqa,
    title = "SciKnowEval_mcqa: A Benchmark for Small Language Model Evaluation in their Early Training Stages",
    author = "Anthony Kalaydjian and Eric Saikali",
    year = "2025",
}
```
### Groups and Tasks

#### Groups

* `sciknoweval_mcqa`: Evaluates `sciknoweval_mcqa_var5shots_Biology`, `sciknoweval_mcqa_var5shots_Chemistry`, `sciknoweval_mcqa_var5shots_Material`, and `sciknoweval_mcqa_var5shots_Physics`.

#### Tasks

* `sciknoweval_mcqa_var5shots_Biology`: Data across all remaining splits corresponding to Biology MCQs.
* `sciknoweval_mcqa_var5shots_Chemistry`: Data across all remaining splits corresponding to Chemistry MCQs.
* `sciknoweval_mcqa_var5shots_Material`: Data across all remaining splits corresponding to Materials MCQs.
* `sciknoweval_mcqa_var5shots_Physics`: Data across all remaining splits corresponding to Physics MCQs.
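As a usage sketch, the group or any individual subtask can be run through the harness in the usual way. The snippet below assumes the standard lm-evaluation-harness Python entry point (`lm_eval.simple_evaluate`); the model checkpoint and batch size are placeholders, not recommendations.

```python
# Sketch: evaluating a model on the group (or a single subtask) through
# lm-evaluation-harness. Model name and arguments are placeholders.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m",  # placeholder checkpoint
    tasks=["sciknoweval_mcqa"],  # or e.g. ["sciknoweval_mcqa_var5shots_Biology"]
    batch_size=8,
)
print(results["results"])  # per-task and group metrics
```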
### Checklist

For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
  * [x] Have you referenced the original paper that introduced the task?
  * [x] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?

If other tasks on this dataset are already supported:
* [x] Is the "Main" variant of this task clearly denoted?
* [x] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
group: sciknoweval_mcqa
group_alias: sciknoweval_mcqa (var5shots)
task:
  - sciknoweval_mcqa_task
aggregate_metric_list:
  - metric: acc
    weight_by_size: True
metadata:
  version: 2
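For context, `weight_by_size: True` asks the harness to aggregate the subtask accuracies into the group score weighted by the number of samples in each subtask, rather than as an unweighted mean. A minimal sketch of that aggregation, with hypothetical subtask counts and accuracies, is:

```python
# Minimal sketch of size-weighted aggregation (hypothetical counts and
# accuracies, not taken from the actual dataset or any real run).
subtask_results = {
    "sciknoweval_mcqa_var5shots_Biology":   {"acc": 0.41, "n": 1200},
    "sciknoweval_mcqa_var5shots_Chemistry": {"acc": 0.38, "n": 900},
    "sciknoweval_mcqa_var5shots_Material":  {"acc": 0.35, "n": 700},
    "sciknoweval_mcqa_var5shots_Physics":   {"acc": 0.33, "n": 600},
}

total = sum(r["n"] for r in subtask_results.values())
group_acc = sum(r["acc"] * r["n"] for r in subtask_results.values()) / total
print(f"sciknoweval_mcqa (weight_by_size=True): acc = {group_acc:.4f}")
```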
Lines changed: 14 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,14 @@
dataset_path: ShAIkespear/SciKnowEval_mcqa
output_type: multiple_choice
test_split: test
fewshot_split: dev
num_fewshot: 5
fewshot_config:
  sampler: first_n
doc_to_text: "Question: {{question.strip()}}\nAnswer:"
doc_to_choice: "{{choices}}"
doc_to_target: "{{answer}}"
metadata:
  version: 1.0
dataset_kwargs:
  trust_remote_code: true
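To make the template concrete, the sketch below shows how one request would be assembled under this config: the per-subject `description` (defined in the files below) is prepended, the first five `dev` examples would be inserted as few-shot context (omitted here for brevity, per `num_fewshot: 5` and `sampler: first_n`), and each entry of `choices` is scored as a continuation of the rendered prompt, with `answer` selecting the target. The example document and the index encoding of `answer` are assumptions for illustration only.

```python
# Illustrative rendering of one multiple-choice request under this template.
# The document below is hypothetical; real examples come from
# ShAIkespear/SciKnowEval_mcqa (fields: question, choices, answer).
description = (
    "The following are multiple choice questions (with answers) about Biology.\n\n"
)

doc = {
    "question": "Which organelle is the main site of ATP synthesis?",
    "choices": ["Nucleus", "Mitochondrion", "Ribosome", "Golgi apparatus"],
    "answer": 1,  # assumed to index into `choices`
}

# doc_to_text: "Question: {{question.strip()}}\nAnswer:"
context = description + f"Question: {doc['question'].strip()}\nAnswer:"

# output_type: multiple_choice scores every entry of `choices` as a
# continuation of the context; `answer` marks the gold choice.
for i, choice in enumerate(doc["choices"]):
    flag = "  <-- target" if i == doc["answer"] else ""
    print(f"{context} {choice}{flag}")
```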
Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
"dataset_name": "Biology"
"description": "The following are multiple choice questions (with answers) about Biology.\n\
  \n"
"include": "_var5shots_template_yaml"
"tag": "sciknoweval_mcqa_task"
"task": "sciknoweval_mcqa_var5shots_Biology"
"task_alias": "Biology"
Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
"dataset_name": "Chemistry"
"description": "The following are multiple choice questions (with answers) about Chemistry.\n\
  \n"
"include": "_var5shots_template_yaml"
"tag": "sciknoweval_mcqa_task"
"task": "sciknoweval_mcqa_var5shots_Chemistry"
"task_alias": "Chemistry"
Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
"dataset_name": "Material"
"description": "The following are multiple choice questions (with answers) about Material.\n\
  \n"
"include": "_var5shots_template_yaml"
"tag": "sciknoweval_mcqa_task"
"task": "sciknoweval_mcqa_var5shots_Material"
"task_alias": "Material"
Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
"dataset_name": "Physics"
"description": "The following are multiple choice questions (with answers) about Physics.\n\
  \n"
"include": "_var5shots_template_yaml"
"tag": "sciknoweval_mcqa_task"
"task": "sciknoweval_mcqa_var5shots_Physics"
"task_alias": "Physics"
