
Commit bbfc8c3

Merge branch 'scicode-bench:main' into main
2 parents adce2a4 + 24760e7

File tree

2 files changed (+12, -2 lines)


README.md

Lines changed: 3 additions & 1 deletion
@@ -5,6 +5,9 @@
 
 This repo contains the evaluation code for the paper "[SciCode: A Research Coding Benchmark Curated by Scientists](https://arxiv.org/abs/2407.13168)"
 
+## 🔔News
+**[2024-07-24]: We add the scientist-annotated background and support setup for w/ background evaluation.**
+
 ## Introduction
 SciCode is a challenging benchmark designed to evaluate the capabilities of language models (LMs) in generating code for solving realistic scientific research problems. It has a diverse coverage of **16** subdomains from **6** domains: Physics, Math, Material Science, Biology, and Chemistry. Unlike previous benchmarks that consist of exam-like question-answer pairs, SciCode is converted from real research problems. SciCode problems naturally factorize into multiple subproblems, each involving knowledge recall, reasoning, and code synthesis. In total, SciCode contains **338** subproblems decomposed from **80** challenging main problems, and it offers optional descriptions specifying useful scientific background information and scientist-annotated gold-standard solutions and test cases for evaluation. Claude3.5-Sonnet, the best-performing model among those tested, can solve only **4.6%** of the problems in the most realistic setting. Broadly, SciCode demonstrates a realistic and scientists' everyday workflow of identifying critical science concepts and facts and then transforming them into computation and simulation code. We believe SciCode not only helps demonstrate contemporary LLMs' progress towards helpful assistant for scientists but also helps shed light on future building and evaluation of scientific AI.
 
@@ -27,7 +30,6 @@ SciCode sources challenging and realistic research-level coding problems across
 | Qwen2-72B-Instruct | 17 | 1.5 |
 | Llama-3.1-70B-Instruct | 16.3 | 1.5 |
 | Mixtral-8x22B-Instruct | 16.3 | 0 |
-| GPT-4o-mini | 15.3 | 1.5 |
 | Llama-3-70B-Chat | 14.6 | 0 |
 
 ## Instructions to evaluate a new model

eval/scripts/README.md

Lines changed: 9 additions & 1 deletion
@@ -9,18 +9,26 @@ ANTHROPIC_KEY = 'your_api_key'
 GOOGLE_KEY = 'your_api_key'
 ```
 
-For example, to create model results with `gpt-4o` and the default settings, go to the root of this repo and run
+For example, to create model results with `gpt-4o` and the default settings, go to the root of this repo and run
 
 ```bash
 python eval/scripts/gencode_json.py --model gpt-4o
 ```
 
+For results with scientist-annotated background, run
+
+```bash
+python eval/scripts/gencode_json.py --model gpt-4o --with-background
+```
+
+
 ### Command-Line Arguments
 
 - `--model` - Specifies the model name used for generating responses.
 - `--output-dir` - Directory to store the generated code outputs (Default: `eval_results/generated_code`).
 - `--input-path` - Directory containing the JSON files describing the problems (Default: `eval/data/problems_all.jsonl`).
 - `--prompt-dir` - Directory where prompt files are saved (Default: `eval_results/prompt`).
+- `--with-background` - Include problem background if enabled.
 - `--temperature` - Controls the randomness of the generation (Default: 0).
 
 ## **Evaluate generated code**
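For reference, the flags documented in the diff above can be combined in a single call. The sketch below is illustrative rather than part of the commit: it pairs the new `--with-background` flag with the other documented options, spells out their stated defaults, and uses `gpt-4o` only because that is the model named in the README examples.

```bash
# Illustrative sketch, not part of this commit: combine --with-background with the
# other documented flags; the values shown simply restate the documented defaults.
python eval/scripts/gencode_json.py \
    --model gpt-4o \
    --with-background \
    --temperature 0 \
    --input-path eval/data/problems_all.jsonl \
    --output-dir eval_results/generated_code \
    --prompt-dir eval_results/prompt
```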

0 commit comments
