
Commit 8b5a5c8

add multipl-e to docs
1 parent 6960ff3 commit 8b5a5c8

2 files changed: 61 additions & 23 deletions


README.md

Lines changed: 4 additions & 1 deletion
@@ -72,6 +72,8 @@ accelerate config

This evaluation harness can also be used in an evaluation-only mode; you can use a multi-CPU setting. For large models, set the precision of the model using the `--precision` flag instead of accelerate config to keep only one copy of the model in memory.

+The evaluation part (solution execution) for [MultiPL-E](https://github.com/nuprl/MultiPL-E) requires extra dependencies for some programming languages; we provide a Dockerfile with all the dependencies (see the [Docker](#docker-containers) section for more details).
+
## Usage
You can use this evaluation harness to generate text solutions to code benchmarks with your model, to evaluate (and execute) the solutions, or to do both. While it is better to use GPUs for generation, the evaluation only requires CPUs, so it might be beneficial to separate these two steps. By default, both generation and evaluation are performed.

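The hunk above mentions an evaluation-only mode and the `--precision` flag. A rough, hypothetical sketch of such a run, combining flags that appear elsewhere in these docs (the model, task, and `<PRECISION>` values are placeholders):

```bash
# Hypothetical evaluation-only run: load previously saved generations and
# execute them, setting the model precision once via --precision.
accelerate launch main.py \
  --model bigcode/santacoder \
  --tasks multiple-py \
  --load_generations_path generations_py.json \
  --allow_code_execution \
  --precision <PRECISION> \
  --n_samples 200
```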
@@ -90,6 +92,7 @@ accelerate launch main.py \
  --do_sample True \
  --n_samples 100 \
  --batch_size 10 \
+ --precision <PRECISION> \
  --allow_code_execution \
  --save_generations
```
@@ -151,7 +154,7 @@ accelerate launch main.py \
  --save_generations_path generations_py.json
```

-To run the container (here from image `evaluation-harness`) to evaluate on `generations_py.json` (or another file; mount it with `-v`), specify `n_samples` and allow code execution with `--allow_code_execution` (and add the number of problems with `--limit` if it was used during generation):
+To run the container (here from image `evaluation-harness-multiple`) to evaluate on `generations_py.json` (or another file; mount it with `-v`), specify `n_samples` and allow code execution with `--allow_code_execution` (and add the number of problems with `--limit` if it was used during generation):
```bash
$ sudo docker run -v $(pwd)/generations_py.json:/app/generations_py.json:ro -it evaluation-harness-multiple python3 main.py \
  --model bigcode/santacoder \

docs/README.md

Lines changed: 57 additions & 22 deletions
@@ -75,6 +75,63 @@ accelerate launch main.py \

Low temperatures generally work better for small $k$ in pass@k.

+### DS-1000
+[DS-1000](https://ds1000-code-gen.github.io/): Code generation benchmark with 1000 data science questions spanning seven Python libraries that (1) reflects diverse, realistic, and practical use cases, (2) has a reliable metric, (3) defends against memorization by perturbing questions.
+
+The task can be specified as `--tasks ds1000-$SUBSET-$MODE`, where subset can include `all` libraries or any of the following subsets: `numpy`, `scipy`, `pandas`, `tensorflow`, `pytorch`, `sklearn`, `matplotlib`. Supported generation modes are `completion` (purely autoregressive) or `insertion` (via fill-in-middle [FIM]).
+
+- Prompts & Generation: prompts include partial code with one or more missing lines. The form of such prompts varies between `completion` and `insertion` modes (`[insert]` token used to reflect FIM region). Default generation args are reflected below.
+- Evaluation: generations are evaluated via execution of unit tests. As in the original manuscript, $pass@1$ is evaluated over each of `num_samples` and the mean pass rate is returned as the metric. Default evaluation args are presented below.
+
+Below is the command to run evaluation on the full benchmark in insertion mode with the arguments that correspond to the original manuscript.
+
+```bash
+export TF_FORCE_GPU_ALLOW_GROWTH=true
+TF_CPP_MIN_LOG_LEVEL=3 accelerate launch main.py \
+  --model <MODEL_NAME> \
+  --batch_size <BATCH_SIZE> \
+  --tasks ds1000-all-insertion \
+  --n_samples 40 \
+  --max_length_generation 1024 \
+  --temperature 0.2 \
+  --top_p 0.95 \
+  --allow_code_execution
+```
+
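Since the added text specifies tasks as `ds1000-$SUBSET-$MODE`, a single library can also be evaluated in completion mode with the same arguments; a hypothetical sketch in which only the task name differs from the command above:

```bash
# Hypothetical: evaluate only the numpy subset in purely autoregressive
# completion mode; all other arguments mirror the insertion-mode command.
export TF_FORCE_GPU_ALLOW_GROWTH=true
TF_CPP_MIN_LOG_LEVEL=3 accelerate launch main.py \
  --model <MODEL_NAME> \
  --batch_size <BATCH_SIZE> \
  --tasks ds1000-numpy-completion \
  --n_samples 40 \
  --max_length_generation 1024 \
  --temperature 0.2 \
  --top_p 0.95 \
  --allow_code_execution
```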
+### MultiPL-E
+[MultiPL-E](https://huggingface.co/datasets/nuprl/MultiPL-E) is a benchmark for evaluating large language models for code generation that supports 18 programming languages. It takes the OpenAI "HumanEval" Python benchmark and uses little compilers to translate it to the other languages. We use an implementation similar to [the original repository](https://github.com/nuprl/MultiPL-E/tree/main), and the evaluation parameters are similar to those of HumanEval. For this benchmark we strongly recommend using the provided Dockerfile to build the MultiPL-E container with all the required dependencies, and for extra safety, especially when evaluating on languages like `bash`:
+```bash
+$ sudo make DOCKERFILE=Dockerfile-multiple all
+```
+This creates an image called `evaluation-harness-multiple`.
+
+Suppose you generated text with the `bigcode/santacoder` model and saved it in `generations_py.json` with:
+```bash
+accelerate launch main.py \
+  --model bigcode/santacoder \
+  --tasks multiple-py \
+  --max_length_generation 650 \
+  --temperature 0.8 \
+  --do_sample True \
+  --n_samples 200 \
+  --batch_size 200 \
+  --trust_remote_code \
+  --generation_only \
+  --save_generations \
+  --save_generations_path generations_py.json
+```
+To run the container (here from image `evaluation-harness-multiple`) to evaluate on `generations_py.json` (or another file; mount it with `-v`), specify `n_samples` and allow code execution with `--allow_code_execution` (and add the number of problems with `--limit` if it was used during generation):
+```bash
+$ sudo docker run -v $(pwd)/generations_py.json:/app/generations_py.json:ro -it evaluation-harness-multiple python3 main.py \
+  --model bigcode/santacoder \
+  --tasks multiple-py \
+  --load_generations_path /app/generations_py.json \
+  --allow_code_execution \
+  --temperature 0.8 \
+  --n_samples 200
+```
+Execution time may vary depending on the programming language.
+
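The commands above target Python via the `multiple-py` task. Assuming the other supported languages follow the same `multiple-<language>` naming pattern (an assumption, since this diff only shows the Python task), generation for Java might look like the following hypothetical sketch:

```bash
# Hypothetical: same generation settings as the Python example, but for Java.
# Assumes a multiple-java task mirroring multiple-py.
accelerate launch main.py \
  --model bigcode/santacoder \
  --tasks multiple-java \
  --max_length_generation 650 \
  --temperature 0.8 \
  --do_sample True \
  --n_samples 200 \
  --batch_size 200 \
  --trust_remote_code \
  --generation_only \
  --save_generations \
  --save_generations_path generations_java.json
```

The resulting file can then be evaluated inside the `evaluation-harness-multiple` container as shown above.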
### APPS
[APPS](https://huggingface.co/datasets/codeparrot/apps): is a challenging benchmark for code generation with 10,000 Python problems,
5,000 for the training and 5000 for the evaluation. It has three difficulty levels: introductory, interview and competition.
@@ -144,28 +201,6 @@ accelerate launch main.py \
We expect a model [finetuned](https://github.com/bigcode-project/bigcode-evaluation-harness/tree/main/finetuning/APPS) on the train split of APPS.
TODO: add few-shot setup for APPS.

-### DS-1000
-[DS-1000](https://ds1000-code-gen.github.io/): Code generation benchmark with 1000 data science questions spanning seven Python libraries that (1) reflects diverse, realistic, and practical use cases, (2) has a reliable metric, (3) defends against memorization by perturbing questions.
-
-The task can be specified as `--tasks ds1000-$SUBSET-$MODE`, where subset can include `all` libraries or any of the following subsets: `numpy`, `scipy`, `pandas`, `tensorflow`, `pytorch`, `sklearn`, `matplotlib`. Supported generation modes are `completion` (purely autoregressive) or `insertion` (via fill-in-middle [FIM]).
-
-- Prompts & Generation: prompts include partial code with one or more missing lines. The form of such prompts varies between `completion` and `insertion` modes (`[insert]` token used to reflect FIM region). Default generation args are reflected below.
-- Evaluation: generations are evaluated via execution of unit tests. As in the original manuscript, $pass@1$ is evaluated over each of `num_samples` and the mean pass rate is returned as the metric. Default evaluation args are presented below.
-
-Below is the command to run evaluation on the full benchmark in insertion mode with the arguments that correspond to the original manuscript.
-
-```bash
-export TF_FORCE_GPU_ALLOW_GROWTH=true
-TF_CPP_MIN_LOG_LEVEL=3 accelerate launch main.py \
-  --model <MODEL_NAME> \
-  --batch_size <BATCH_SIZE> \
-  --tasks ds1000-all-insertion \
-  --n_samples 40 \
-  --max_length_generation 1024 \
-  --temperature 0.2 \
-  --top_p 0.95 \
-  --allow_code_execution
-```

## Code generation benchmarks without unit tests