
Commit c3b9717

Merge pull request #67 from bigcode-project/update-readme
Update readme & docs
2 parents c9259e7 + 41ef785 commit c3b9717

File tree: 4 files changed (+91, -143 lines)


README.md

Lines changed: 32 additions & 21 deletions
@@ -17,26 +17,33 @@
 
 ## Features
 
-This is a framework for the evaluation of code generation models. This is a work in progress part of the BigCode project, and is inspired from [EleutherAI/lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) for evaluating language models in general. We welcome contributions to fix issues, enhance features and add new benchmarks. You can find a contribution guides in [`docs/guide.md`](https://github.com/bigcode-project/bigcode-evaluation-harness/blob/main/docs/guide.md) and [`CONTRIBUTING.md`](https://github.com/bigcode-project/bigcode-evaluation-harness/blob/main/CONTRIBUTING.md) and more documentation in [`docs/README.md`](https://github.com/bigcode-project/bigcode-evaluation-harness/blob/main/docs/README.md).
+This is a framework for the evaluation of code generation models. This work is inspired by [EleutherAI/lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) for evaluating language models in general. We welcome contributions to fix issues, enhance features and add new benchmarks. You can find contribution guides in [`docs/guide.md`](https://github.com/bigcode-project/bigcode-evaluation-harness/blob/main/docs/guide.md) and [`CONTRIBUTING.md`](https://github.com/bigcode-project/bigcode-evaluation-harness/blob/main/CONTRIBUTING.md), and more documentation in [`docs/README.md`](https://github.com/bigcode-project/bigcode-evaluation-harness/blob/main/docs/README.md).
 
 Below are the features and tasks of this framework:
 
-- Any autoregressive model available on [Hugging Face hub](https://huggingface.co/) can be used, but we recommend using code generation models trained specifically on Code such as [CodeParrot](https://huggingface.co/codeparrot/codeparrot), [InCoder](https://huggingface.co/facebook/incoder-6B) and [CodeGen](https://huggingface.co/Salesforce/codegen-16B-mono).
-- 3 code generation **Python** tasks (with unit tests): [HumanEval](https://huggingface.co/datasets/openai_humaneval), [APPS](https://huggingface.co/datasets/codeparrot/apps) and [MBPP](https://huggingface.co/datasets/mbpp).
-- [CoNaLa](https://huggingface.co/datasets/neulab/conala) for **Python** code generation (2-shot setting and evaluation with BLEU score)
-- [Concode](https://huggingface.co/datasets/code_x_glue_tc_text_to_code) for **Java** code generation (2-shot setting and evaluation with BLEU score)
-- Code to text task from [CodeXGLUE](https://huggingface.co/datasets/code_x_glue_ct_code_to_text) (zero-shot & fine-tuning) for 6 languages: **Python, Go, Ruby, Java, JavaScript and PHP.**
-- 3 multilingual downstream classification tasks: [Java Complexity prediction](https://huggingface.co/datasets/codeparrot/codecomplex), [Java code equivalence prediction](https://huggingface.co/datasets/code_x_glue_cc_clone_detection_big_clone_bench), [C code defect prediction](https://huggingface.co/datasets/code_x_glue_cc_defect_detection).
+- Features:
+  - Any autoregressive model available on the [Hugging Face hub](https://huggingface.co/) can be used, but we recommend using code generation models trained specifically on code, such as [SantaCoder](https://huggingface.co/bigcode/santacoder), [InCoder](https://huggingface.co/facebook/incoder-6B) and [CodeGen](https://huggingface.co/Salesforce/codegen-16B-mono).
+  - We provide multi-GPU text generation with `accelerate` and Dockerfiles for evaluating inside Docker containers, for security and reproducibility.
 
+- Tasks:
+  - 4 code generation **Python** tasks (with unit tests): [HumanEval](https://huggingface.co/datasets/openai_humaneval), [APPS](https://huggingface.co/datasets/codeparrot/apps), [MBPP](https://huggingface.co/datasets/mbpp) and [DS-1000](https://github.com/HKUNLP/DS-1000/), in both completion (left-to-right) and insertion (FIM) mode.
+  - [MultiPL-E](https://github.com/nuprl/MultiPL-E) evaluation suite (HumanEval translated into **18** programming languages).
+  - [PAL](https://github.com/reasoning-machines/pal) Program-aided Language Models evaluation for grade-school math problems: [GSM8K](https://huggingface.co/datasets/gsm8k) and [GSM-HARD](https://huggingface.co/datasets/reasoning-machines/gsm-hard). These problems are solved by generating reasoning chains of text and code.
+  - Code-to-text task from [CodeXGLUE](https://huggingface.co/datasets/code_x_glue_ct_code_to_text) (zero-shot & fine-tuning) for 6 languages: **Python, Go, Ruby, Java, JavaScript and PHP**, plus the documentation translation task from [CodeXGLUE](https://huggingface.co/datasets/code_x_glue_tt_text_to_text).
+  - [CoNaLa](https://huggingface.co/datasets/neulab/conala) for **Python** code generation (2-shot setting and evaluation with BLEU score).
+  - [Concode](https://huggingface.co/datasets/code_x_glue_tc_text_to_code) for **Java** code generation (2-shot setting and evaluation with BLEU score).
+  - 3 multilingual downstream classification tasks: [Java Complexity prediction](https://huggingface.co/datasets/codeparrot/codecomplex), [Java code equivalence prediction](https://huggingface.co/datasets/code_x_glue_cc_clone_detection_big_clone_bench) and [C code defect prediction](https://huggingface.co/datasets/code_x_glue_cc_defect_detection).
+
+More details about each task can be found in the documentation in [`docs/README.md`](https://github.com/bigcode-project/bigcode-evaluation-harness/blob/main/docs/README.md).
 ## Setup
 
 ```bash
 git clone https://github.com/bigcode-project/bigcode-evaluation-harness.git
 cd bigcode-evaluation-harness
 ```
-Install [`torch`](https://pytorch.org/get-started/locally/) based on your device type and the other packages using:
+Install [`torch`](https://pytorch.org/get-started/locally/) based on your device type, then install the other packages using:
 ```
-pip install -r requirements.txt
+pip install -e .
 ```
 To run the `DS-1000` benchmark, additional constraints must be resolved.
 ```
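The feature list in the hunk above notes that any autoregressive model on the Hugging Face hub can be plugged in. As a minimal sketch (illustrative only, not part of the harness CLI, and assuming `transformers` and `torch` are installed), this is roughly the kind of causal language model the harness wraps; `bigcode/santacoder` ships custom modeling code, hence `trust_remote_code=True`:

```python
# Illustrative sketch: load a hub code model directly, outside the harness.
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/santacoder"  # any causal LM from the hub should work
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(checkpoint, trust_remote_code=True)

prompt = "def fibonacci(n):"
inputs = tokenizer(prompt, return_tensors="pt")
# Sampling settings mirror the kind of flags used in the commands later in this README.
outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.2,
    top_p=0.95,
    max_new_tokens=64,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```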
@@ -63,15 +70,17 @@ We use [`accelerate`](https://huggingface.co/docs/accelerate/index) to generate
 accelerate config
 ```
 
-This evaluation harness can also be used in an evaluation only mode, you can use a Multi-CPU setting. For this mode you can also find an example of setup instructions in `evaluation_setup.sh`, where we configure the environment and evaluate some MBPP generations donwloaded from the hub.
+This evaluation harness can also be used in an evaluation-only mode with a Multi-CPU setting. For large models, we recommend specifying the precision of the model with the `--precision` flag instead of accelerate config, so that only one copy of the model is kept in memory.
+
+The evaluation part (solutions execution) for [MultiPL-E](https://github.com/nuprl/MultiPL-E) requires extra dependencies for some programming languages; we provide a Dockerfile with all dependencies, see the [Docker](#docker-containers) section for more details.
 
 ## Usage
 You can use this evaluation harness to generate text solutions to code benchmarks with your model, to evaluate (and execute) the solutions, or to do both. While it is better to use GPUs for the generation, the evaluation only requires CPUs, so it might be beneficial to separate these two steps. By default both generation and evaluation are performed.
 
 For more details on how to evaluate on the tasks, please refer to the documentation in [`docs/README.md`](https://github.com/bigcode-project/bigcode-evaluation-harness/blob/main/docs/README.md).
 
 ### Generation and evaluation
-Below are some examples to generate and evaluate on some tasks.
+Below is an example to generate and evaluate on a task.
 
 ```bash
 accelerate launch main.py \
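The `--precision` recommendation in the hunk above is about memory: load the checkpoint once in reduced precision rather than keeping a full-precision copy around. A minimal sketch of what that amounts to with `transformers` (an assumption about what the flag maps to under the hood, not the harness's actual loading code; the checkpoint name is just an example from the feature list):

```python
# Sketch: load a model in half precision so only one reduced-precision copy sits in memory.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Salesforce/codegen-16B-mono",  # example checkpoint; 16B weights in fp16 are about 32 GB
    torch_dtype=torch.float16,      # roughly what a setting like --precision fp16 implies
)
print(model.dtype)  # torch.float16
```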
@@ -83,6 +92,7 @@ accelerate launch main.py \
   --do_sample True \
   --n_samples 100 \
   --batch_size 10 \
+  --precision <PRECISION> \
   --allow_code_execution \
   --save_generations
 ```
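The `--n_samples 100` in the hunk above is what feeds pass@k: n candidate solutions are generated per problem, c of them pass the unit tests, and the unbiased estimator from the Codex paper is averaged over problems. A reference sketch of that estimator (the standard formula, shown here for clarity rather than as the harness's own implementation):

```python
# Unbiased pass@k estimator (Chen et al., 2021): 1 - C(n-c, k) / C(n, k),
# evaluated in a numerically stable product form.
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """n: samples per problem, c: samples that passed, k: evaluation budget."""
    if n - c < k:
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 100 samples per problem, 13 of which pass the tests.
print(round(pass_at_k(100, 13, 1), 3))   # 0.13
print(round(pass_at_k(100, 13, 10), 3))  # ~0.77
```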
@@ -97,21 +107,22 @@ Some tasks don't require code execution such as
 
 ### Generation only
 
-If you want to generate solutions without executing and evaluating the code, call `--generation_only`, in addition to the instructions above. This will save the solutions in a json file in the working directory.
+If you want to generate solutions without executing and evaluating the code, call `--generation_only`, in addition to the instructions above. This will save the solutions in a json file at the path given by `save_generations_path` (in the working directory by default).
 
-This can be useful if you don't want to execute code in the machine you're using for generations for security or efficiency reasons. For instance, you can do the generations on multiple GPUs, but switch to a multiple workers CPU machine for the execution, which can save money and time.
+This can be useful if you don't want to execute code on the machine you're using for generation, for security or efficiency reasons. For instance, you can do the generation on multiple GPUs, but switch to a multi-worker CPU machine or a docker container for the execution.
 
 ### Evaluation only
 
-If you already have the generations in a json file from this evaluation harness and want to evaluate them, specify the path of the generations via the `generation_path` argument. You may need to reconfigure `accelerate` to use multiple CPUs. For this mode, you can also find an example of setup instructions in `evaluation_setup.sh`.
+If you already have the generations in a json file from this evaluation harness and want to evaluate them, specify the path of the generations via the `load_generations_path` argument. You may need to reconfigure `accelerate` to use multiple CPUs.
 
-Below is an example, be mind of specifying arguments proper to the task you are evaluating on, and note that `model` value here only serves for documenting the experiment.
+Below is an example. Be mindful of specifying the arguments proper to the task you are evaluating on, and note that the `model` value here only serves to document the experiment. Also add `--n_samples` to specify the number of samples to evaluate per problem (usually the same value used during generation).
 
 ```bash
 accelerate launch main.py --tasks mbpp --allow_code_execution --load_generations_path generations.json --model incoder-temperature-08
 ```
+
 ## Docker containers
-For safety, we provide a Dockerfiles to do the execution inside a docker container. To do that, first, do the generation on your machine and save them in generations.json by adding the flag --generation_only to the command. Then build the docker container and run the evaluation inside it.
+For safety, we provide Dockerfiles to run the execution inside a docker container. To do that, first do the generation on your machine and save it in `generations.json`, for example by adding the flag `--generation_only` to the command. Then build the docker container and run the evaluation inside it.
 
 ### Building Docker image
 Here's how to build a docker image for the evaluation harness:
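When splitting generation and evaluation as described in the hunk above, it can help to sanity-check the generations file before shipping it to the evaluation machine or container. A small sketch along these lines works if the file has the layout the harness typically writes (one list of candidate solution strings per problem); treat that layout as an assumption and verify it against your own file:

```python
# Sketch: quick sanity check of a saved generations file before an evaluation-only run.
import json

with open("generations.json") as f:
    generations = json.load(f)  # assumed layout: list[list[str]], one inner list per problem

print(f"problems: {len(generations)}")
print(f"samples for the first problem: {len(generations[0])}")
print(generations[0][0][:200])  # preview the first candidate solution
```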
@@ -127,7 +138,7 @@ $ sudo make DOCKERFILE=Dockerfile-multiple all
 This creates an image called `evaluation-harness-multiple`.
 
 ### Evaluating inside a container
-Suppose you generated text with the `bigcode/santacoder` model and saved it in `generations.json` with:
+Suppose you generated text with the `bigcode/santacoder` model and saved it in `generations_py.json` with:
 ```bash
 accelerate launch main.py \
   --model bigcode/santacoder \
@@ -143,7 +154,7 @@ accelerate launch main.py \
   --save_generations_path generations_py.json
 ```
 
-To run the container (here from image `evaluation-harness`) to evaluate on `generations.json`, or another file mount it with `-v`, specify `n_samples` and allow code execution with `--allow_code_execution` (and add the number of problems `--limit` if it was used during generation):
+To run the container (here from the image `evaluation-harness-multiple`) and evaluate on `generations_py.json`, or on another file mounted with `-v`, specify `n_samples` and allow code execution with `--allow_code_execution` (and add the number of problems with `--limit` if it was used during generation):
 ```bash
 $ sudo docker run -v $(pwd)/generations_py.json:/app/generations_py.json:ro -it evaluation-harness-multiple python3 main.py \
   --model bigcode/santacoder \
@@ -161,9 +172,7 @@ To implement a new task in this evaluation harness, see the guide in [`docs/guid
 We provide documentation for the existing benchmarks and how we make the evaluation in [`docs/README.md`](https://github.com/bigcode-project/bigcode-evaluation-harness/blob/main/docs/README.md).
 
 ## Remarks
-* Currenltly, we use parallel evaluation across multiple GPUs using `accelerate`, this assumes that you can fit the model in one GPU.
-* Please note this evaluation harness tries to cover a wide set of models, but there could still be room for improvement based on each model, some might require different prompt engineering or post-processing of the code generations.
-* For some scores of ongoing experiments please refer to [`example_scores/README.md`](https://github.com/bigcode-project/bigcode-evaluation-harness/blob/master/example_scores/README.md).
+* Currently, we use data-parallel evaluation across multiple GPUs using `accelerate`; this assumes that you can fit the model in one GPU.
 
 ## Acknowledgements
 We thank EleutherAI for their work on the [lm-evaluation harness](https://github.com/EleutherAI/lm-evaluation-harness) from which this repository is inspired.
@@ -174,6 +183,8 @@ We thank EleutherAI for their work on the [lm-evaluation harness](https://github
 @software{bigcode-evaluation-harness,
   author = {Ben Allal, Loubna and
             Muennighoff, Niklas and
+            Kumar Umapathi, Logesh and
+            Lipkin, Ben and
             Von Werra, Leandro},
   title = {A framework for the evaluation of code generation models},
   howpublished = {\url{https://github.com/bigcode-project/bigcode-evaluation-harness}},

docs/README.md

Lines changed: 59 additions & 22 deletions
@@ -75,6 +75,65 @@ accelerate launch main.py \
 
 Low temperatures generally work better for small $k$ in pass@k.
 
+### DS-1000
+[DS-1000](https://ds1000-code-gen.github.io/): Code generation benchmark with 1000 data science questions spanning seven Python libraries that (1) reflects diverse, realistic, and practical use cases, (2) has a reliable metric, (3) defends against memorization by perturbing questions.
+
+The task can be specified as `--tasks ds1000-$SUBSET-$MODE`, where the subset can be `all` libraries or any of the following: `numpy`, `scipy`, `pandas`, `tensorflow`, `pytorch`, `sklearn`, `matplotlib`. Supported generation modes are `completion` (purely autoregressive) or `insertion` (via fill-in-middle [FIM]).
+
+- Prompts & Generation: prompts include partial code with one or more missing lines. The form of such prompts varies between `completion` and `insertion` modes (an `[insert]` token marks the FIM region). Default generation args are reflected below.
+- Evaluation: generations are evaluated via execution of unit tests. As in the original manuscript, $pass@1$ is evaluated over each of `num_samples` and the mean pass rate is returned as the metric. Default evaluation args are presented below.
+
+Below is the command to run evaluation on the full benchmark in insertion mode with the arguments that correspond to the original manuscript.
+
+```bash
+export TF_FORCE_GPU_ALLOW_GROWTH=true
+TF_CPP_MIN_LOG_LEVEL=3 accelerate launch main.py \
+  --model <MODEL_NAME> \
+  --batch_size <BATCH_SIZE> \
+  --tasks ds1000-all-insertion \
+  --n_samples 40 \
+  --max_length_generation 1024 \
+  --temperature 0.2 \
+  --top_p 0.95 \
+  --allow_code_execution
+```
+
+### MultiPL-E
+[MultiPL-E](https://huggingface.co/datasets/nuprl/MultiPL-E) is a benchmark for evaluating large language models for code generation that supports 18 programming languages. It takes the OpenAI "HumanEval" Python benchmark and uses little compilers to translate it to other languages. We use a similar implementation to [the original repository](https://github.com/nuprl/MultiPL-E/tree/main), and the evaluation parameters are similar to HumanEval's. For this benchmark, we strongly recommend using the provided Dockerfile to build the MultiPL-E container with all required dependencies, for extra safety especially when evaluating on languages like `bash`.
+Tasks are named `multiple-<LANG>` where `<LANG>` is the language name, e.g. `multiple-py` for Python.
+
+```bash
+$ sudo make DOCKERFILE=Dockerfile-multiple all
+```
+This creates an image called `evaluation-harness-multiple`.
+
+Suppose you generated text with the `bigcode/santacoder` model and saved it in `generations_py.json` with:
+```bash
+accelerate launch main.py \
+  --model bigcode/santacoder \
+  --tasks multiple-py \
+  --max_length_generation 650 \
+  --temperature 0.8 \
+  --do_sample True \
+  --n_samples 200 \
+  --batch_size 200 \
+  --trust_remote_code \
+  --generation_only \
+  --save_generations \
+  --save_generations_path generations_py.json
+```
+To run the container (here from the image `evaluation-harness-multiple`) and evaluate on `generations_py.json`, or on another file mounted with `-v`, specify `n_samples` and allow code execution with `--allow_code_execution` (and add the number of problems with `--limit` if it was used during generation):
+```bash
+$ sudo docker run -v $(pwd)/generations_py.json:/app/generations_py.json:ro -it evaluation-harness-multiple python3 main.py \
+  --model bigcode/santacoder \
+  --tasks multiple-py \
+  --load_generations_path /app/generations_py.json \
+  --allow_code_execution \
+  --temperature 0.8 \
+  --n_samples 200
+```
+Execution time may vary depending on the programming language.
+
 ### APPS
 [APPS](https://huggingface.co/datasets/codeparrot/apps): is a challenging benchmark for code generation with 10,000 Python problems,
 5,000 for the training and 5000 for the evaluation. It has three difficulty levels: introductory, interview and competition.
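Since the DS-1000 section added above parameterizes the task name as `ds1000-$SUBSET-$MODE`, a tiny helper like the following can enumerate every valid task string for scripting sweeps over subsets (purely illustrative; the subset and mode names are the ones listed above):

```python
# Illustrative helper: enumerate DS-1000 task names of the form ds1000-<subset>-<mode>.
SUBSETS = ["all", "numpy", "scipy", "pandas", "tensorflow", "pytorch", "sklearn", "matplotlib"]
MODES = ["completion", "insertion"]

tasks = [f"ds1000-{subset}-{mode}" for subset in SUBSETS for mode in MODES]
print(tasks[:3])  # ['ds1000-all-completion', 'ds1000-all-insertion', 'ds1000-numpy-completion']
```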
@@ -144,28 +203,6 @@ accelerate launch main.py \
 We expect a model [finetuned](https://github.com/bigcode-project/bigcode-evaluation-harness/tree/main/finetuning/APPS) on the train split of APPS.
 TODO: add few-shot setup for APPS.
 
-### DS-1000
-[DS-1000](https://ds1000-code-gen.github.io/): Code generation benchmark with 1000 data science questions spanning seven Python libraries that (1) reflects diverse, realistic, and practical use cases, (2) has a reliable metric, (3) defends against memorization by perturbing questions.
-
-The task can be specified as `--tasks ds1000-$SUBSET-$MODE`, where subset can include `all` libraries or any of the following subsets: `numpy`, `scipy`, `pandas`, `tensorflow`, `pytorch`, `sklearn`, `matplotlib`. Supported generation modes are `completion` (purely autoregressive) or `insertion` (via fill-in-middle [FIM]).
-
-- Prompts & Generation: prompts include partial code with one or more missing lines. The form of such prompts varies between `completion` and `insertion` modes (`[insert]` token used to reflect FIM region). Default generation args are reflected below.
-- Evaluation: generations are evaluated via execution of unit tests. As in the original manuscript, $pass@1$ is evaluated over each of `num_samples` and the mean pass rate is returned as the metric. Default evaluation args are presented below.
-
-Below is the command to run evaluation on the full benchmark in insertion mode with the arguments that correspond to the original manuscript.
-
-```bash
-export TF_FORCE_GPU_ALLOW_GROWTH=true
-TF_CPP_MIN_LOG_LEVEL=3 accelerate launch main.py \
-  --model <MODEL_NAME> \
-  --batch_size <BATCH_SIZE> \
-  --tasks ds1000-all-insertion \
-  --n_samples 40 \
-  --max_length_generation 1024 \
-  --temperature 0.2 \
-  --top_p 0.95 \
-  --allow_code_execution
-```
 
 ## Code generation benchmarks without unit tests
 
