README.md (4 additions, 1 deletion)
@@ -72,6 +72,8 @@ accelerate config
This evaluation harness can also be used in an evaluation-only mode, in which a multi-CPU setting can be used. For large models, set the model's precision with the `--precision` flag instead of in the accelerate config, so that only one copy of the model is kept in memory.
+The evaluation part (solution execution) for [MultiPL-E](https://github.com/nuprl/MultiPL-E) requires extra dependencies for some programming languages; we provide a Dockerfile with all of them. See the [Docker](#docker-containers) section for more details.
## Usage
You can use this evaluation harness to generate text solutions to code benchmarks with your model, to evaluate (and execute) the solutions, or to do both. While it is better to use GPUs for generation, the evaluation only requires CPUs, so it might be beneficial to separate these two steps. By default, both generation and evaluation are performed.
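For example, here is a minimal sketch of that split workflow, combining the evaluation-only mode and the `--precision` flag described above: generate and save solutions on GPU first, then execute and score them on CPU. The `--load_generations_path` flag, the `bf16` value, and the placeholder names are assumptions rather than documented defaults, so adapt them to the harness's actual interface.

```bash
# Step 1 (GPU): generate solutions only and save them to a file.
# --precision bf16 keeps a single reduced-precision copy of a large model in memory (value is an assumption).
accelerate launch main.py \
  --model <MODEL_NAME> \
  --tasks <TASK> \
  --n_samples 100 \
  --batch_size 10 \
  --precision bf16 \
  --generation_only \
  --save_generations \
  --save_generations_path generations.json

# Step 2 (CPU): evaluation-only pass that executes and scores the saved generations.
# --load_generations_path is assumed to be the flag that reloads the saved file.
accelerate launch main.py \
  --model <MODEL_NAME> \
  --tasks <TASK> \
  --n_samples 100 \
  --load_generations_path generations.json \
  --allow_code_execution
```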
@@ -90,6 +92,7 @@ accelerate launch main.py \
  --do_sample True \
  --n_samples 100 \
  --batch_size 10 \
+ --precision <PRECISION> \
  --allow_code_execution \
  --save_generations
```
@@ -151,7 +154,7 @@ accelerate launch main.py \
  --save_generations_path generations_py.json
```

-To run the container (here from the image `evaluation-harness`) to evaluate `generations_py.json` (or another file, mounted with `-v`), specify `n_samples` and allow code execution with `--allow_code_execution` (and add `--limit` with the number of problems if it was used during generation):
+To run the container (here from the image `evaluation-harness-multiple`) to evaluate `generations_py.json` (or another file, mounted with `-v`), specify `n_samples` and allow code execution with `--allow_code_execution` (and add `--limit` with the number of problems if it was used during generation):
docs/README.md (57 additions, 22 deletions)
@@ -75,6 +75,63 @@ accelerate launch main.py \
Low temperatures generally work better for small $k$ in $pass@k$.
### DS-1000
[DS-1000](https://ds1000-code-gen.github.io/): Code generation benchmark with 1000 data science questions spanning seven Python libraries that (1) reflects diverse, realistic, and practical use cases, (2) has a reliable metric, (3) defends against memorization by perturbing questions.
The task can be specified as `--tasks ds1000-$SUBSET-$MODE`, where subset can include `all` libraries or any of the following subsets: `numpy`, `scipy`, `pandas`, `tensorflow`, `pytorch`, `sklearn`, `matplotlib`. Supported generation modes are `completion` (purely autoregressive) or `insertion` (via fill-in-middle [FIM]).
- Prompts & Generation: prompts include partial code with one or more missing lines. The form of such prompts varies between `completion` and `insertion` modes (`[insert]` token used to reflect FIM region). Default generation args are reflected below.
- Evaluation: generations are evaluated via execution of unit tests. As in the original manuscript, $pass@1$ is evaluated over each of `num_samples` and the mean pass rate is returned as the metric. Default evaluation args are presented below.
Below is the command to run evaluation on the full benchmark in insertion mode with the arguments that correspond to the original manuscript.
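As an illustration, a hedged sketch of such a run, following the `accelerate launch` pattern used elsewhere in this guide; the task name comes from the `ds1000-$SUBSET-$MODE` pattern above, while the model name and sampling values are placeholders and assumptions rather than the manuscript's exact settings.

```bash
# ds1000-all-insertion = all libraries, insertion (FIM) mode, per the pattern documented above.
# Temperature and n_samples below are assumptions, not the paper's settings.
accelerate launch main.py \
  --model <MODEL_NAME> \
  --tasks ds1000-all-insertion \
  --do_sample True \
  --temperature 0.2 \
  --n_samples 40 \
  --batch_size 10 \
  --allow_code_execution \
  --save_generations
```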
### MultiPL-E
[MultiPL-E](https://huggingface.co/datasets/nuprl/MultiPL-E) is a benchmark for evaluating large language models for code generation that supports 18 programming languages. It takes the OpenAI "HumanEval" Python benchmark and uses little compilers to translate it to the other languages. We use an implementation similar to [the original repository](https://github.com/nuprl/MultiPL-E/tree/main), and the evaluation parameters are similar to HumanEval's. For this benchmark, we strongly recommend using the provided Dockerfile to build the MultiPL-E container with all required dependencies, and for extra safety, especially when evaluating on languages like `bash`:
```bash
$ sudo make DOCKERFILE=Dockerfile-multiple all
```

This creates an image called `evaluation-harness-multiple`.
Suppose you generated text with the `bigcode/santacoder` model and saved it in `generations_py.json` with:
```bash
accelerate launch main.py \
  --model bigcode/santacoder \
  --tasks multiple-py \
  --max_length_generation 650 \
  --temperature 0.8 \
  --do_sample True \
  --n_samples 200 \
  --batch_size 200 \
  --trust_remote_code \
  --generation_only \
  --save_generations \
  --save_generations_path generations_py.json
```
To run the container (here from the image `evaluation-harness-multiple`) to evaluate `generations_py.json` (or another file, mounted with `-v`), specify `n_samples` and allow code execution with `--allow_code_execution` (and add `--limit` with the number of problems if it was used during generation):
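A hedged sketch of what that invocation could look like, assuming the generations file is mounted read-only into the container at `/app` and that a `--load_generations_path` flag reloads the saved solutions; adjust paths and flags to the actual interface.

```bash
# Mount the generations file into the container (the /app path and the
# --load_generations_path flag are assumptions) and run the evaluation only.
$ sudo docker run -v $(pwd)/generations_py.json:/app/generations_py.json:ro \
    -it evaluation-harness-multiple python3 main.py \
    --model bigcode/santacoder \
    --tasks multiple-py \
    --load_generations_path /app/generations_py.json \
    --allow_code_execution \
    --temperature 0.8 \
    --n_samples 200
```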
Execution time may vary depending on the programming language.
### APPS
[APPS](https://huggingface.co/datasets/codeparrot/apps) is a challenging benchmark for code generation with 10,000 Python problems, 5,000 for training and 5,000 for evaluation. It has three difficulty levels: introductory, interview and competition.
@@ -144,28 +201,6 @@ accelerate launch main.py \
We expect a model [finetuned](https://github.com/bigcode-project/bigcode-evaluation-harness/tree/main/finetuning/APPS) on the train split of APPS.
TODO: add few-shot setup for APPS.
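For illustration, a hedged sketch of evaluating such a finetuned checkpoint on the introductory split; the `apps-introductory` task name and every argument value here are assumptions patterned on the other benchmarks in this guide, not confirmed defaults.

```bash
# Task name assumed from the difficulty levels listed above (introductory, interview, competition).
accelerate launch main.py \
  --model <FINETUNED_APPS_MODEL> \
  --tasks apps-introductory \
  --max_length_generation 1024 \
  --do_sample True \
  --temperature 0.2 \
  --n_samples 10 \
  --batch_size 10 \
  --allow_code_execution \
  --save_generations
```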
-### DS-1000
-[DS-1000](https://ds1000-code-gen.github.io/): Code generation benchmark with 1000 data science questions spanning seven Python libraries that (1) reflects diverse, realistic, and practical use cases, (2) has a reliable metric, (3) defends against memorization by perturbing questions.
-
-The task can be specified as `--tasks ds1000-$SUBSET-$MODE`, where subset can include `all` libraries or any of the following subsets: `numpy`, `scipy`, `pandas`, `tensorflow`, `pytorch`, `sklearn`, `matplotlib`. Supported generation modes are `completion` (purely autoregressive) or `insertion` (via fill-in-middle [FIM]).
-
-- Prompts & Generation: prompts include partial code with one or more missing lines. The form of such prompts varies between `completion` and `insertion` modes (`[insert]` token used to reflect FIM region). Default generation args are reflected below.
-- Evaluation: generations are evaluated via execution of unit tests. As in the original manuscript, $pass@1$ is evaluated over each of `num_samples` and the mean pass rate is returned as the metric. Default evaluation args are presented below.
-
-Below is the command to run evaluation on the full benchmark in insertion mode with the arguments that correspond to the original manuscript.