
Commit 46f2736

add new tasks to readme and fix some errors
1 parent c9259e7 commit 46f2736


README.md

Lines changed: 22 additions & 15 deletions
@@ -17,26 +17,30 @@
 
 ## Features
 
-This is a framework for the evaluation of code generation models. This is a work in progress part of the BigCode project, and is inspired from [EleutherAI/lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) for evaluating language models in general. We welcome contributions to fix issues, enhance features and add new benchmarks. You can find a contribution guides in [`docs/guide.md`](https://github.com/bigcode-project/bigcode-evaluation-harness/blob/main/docs/guide.md) and [`CONTRIBUTING.md`](https://github.com/bigcode-project/bigcode-evaluation-harness/blob/main/CONTRIBUTING.md) and more documentation in [`docs/README.md`](https://github.com/bigcode-project/bigcode-evaluation-harness/blob/main/docs/README.md).
+This is a framework for the evaluation of code generation models. This work is inspired by [EleutherAI/lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) for evaluating language models in general. We welcome contributions to fix issues, enhance features and add new benchmarks. You can find a contribution guide in [`docs/guide.md`](https://github.com/bigcode-project/bigcode-evaluation-harness/blob/main/docs/guide.md) and [`CONTRIBUTING.md`](https://github.com/bigcode-project/bigcode-evaluation-harness/blob/main/CONTRIBUTING.md), and more documentation in [`docs/README.md`](https://github.com/bigcode-project/bigcode-evaluation-harness/blob/main/docs/README.md).
 
 Below are the features and tasks of this framework:
 
-- Any autoregressive model available on [Hugging Face hub](https://huggingface.co/) can be used, but we recommend using code generation models trained specifically on Code such as [CodeParrot](https://huggingface.co/codeparrot/codeparrot), [InCoder](https://huggingface.co/facebook/incoder-6B) and [CodeGen](https://huggingface.co/Salesforce/codegen-16B-mono).
-- 3 code generation **Python** tasks (with unit tests): [HumanEval](https://huggingface.co/datasets/openai_humaneval), [APPS](https://huggingface.co/datasets/codeparrot/apps) and [MBPP](https://huggingface.co/datasets/mbpp).
+- Any autoregressive model available on the [Hugging Face hub](https://huggingface.co/) can be used, but we recommend using code generation models trained specifically on code, such as [SantaCoder](https://huggingface.co/bigcode/santacoder), [InCoder](https://huggingface.co/facebook/incoder-6B) and [CodeGen](https://huggingface.co/Salesforce/codegen-16B-mono).
+- 4 code generation **Python** tasks (with unit tests): [HumanEval](https://huggingface.co/datasets/openai_humaneval), [APPS](https://huggingface.co/datasets/codeparrot/apps), [MBPP](https://huggingface.co/datasets/mbpp) and [DS-1000](https://github.com/HKUNLP/DS-1000/), in both completion (left-to-right) and insertion (FIM) mode.
+- [MultiPL-E](https://github.com/nuprl/MultiPL-E) evaluation suite (HumanEval translated into **18** programming languages).
+- [PAL](https://github.com/reasoning-machines/pal) Program-aided Language Models evaluation for grade school math problems: [GSM8K](https://huggingface.co/datasets/gsm8k) and [GSM-HARD](https://huggingface.co/datasets/reasoning-machines/gsm-hard). These problems are solved by generating reasoning chains of text and code.
+- Code-to-text task from [CodeXGLUE](https://huggingface.co/datasets/code_x_glue_ct_code_to_text) (zero-shot & fine-tuning) for 6 languages: **Python, Go, Ruby, Java, JavaScript and PHP**, plus the documentation translation task from [CodeXGLUE](https://huggingface.co/datasets/code_x_glue_tt_text_to_text).
 - [CoNaLa](https://huggingface.co/datasets/neulab/conala) for **Python** code generation (2-shot setting and evaluation with BLEU score)
 - [Concode](https://huggingface.co/datasets/code_x_glue_tc_text_to_code) for **Java** code generation (2-shot setting and evaluation with BLEU score)
-- Code to text task from [CodeXGLUE](https://huggingface.co/datasets/code_x_glue_ct_code_to_text) (zero-shot & fine-tuning) for 6 languages: **Python, Go, Ruby, Java, JavaScript and PHP.**
 - 3 multilingual downstream classification tasks: [Java Complexity prediction](https://huggingface.co/datasets/codeparrot/codecomplex), [Java code equivalence prediction](https://huggingface.co/datasets/code_x_glue_cc_clone_detection_big_clone_bench), [C code defect prediction](https://huggingface.co/datasets/code_x_glue_cc_defect_detection).
+- Dockerfiles for evaluating on Docker containers for security and reproducibility.
 
+More details about each task can be found in the documentation in [`docs/README.md`](https://github.com/bigcode-project/bigcode-evaluation-harness/blob/main/docs/README.md).
 ## Setup
 
 ```bash
 git clone https://github.com/bigcode-project/bigcode-evaluation-harness.git
 cd bigcode-evaluation-harness
 ```
-Install [`torch`](https://pytorch.org/get-started/locally/) based on your device type and the other packages using:
+Install [`torch`](https://pytorch.org/get-started/locally/) based on your device type, and install the other packages using:
 ```
-pip install -r requirements.txt
+pip install -e .
 ```
 To run the `DS-1000` benchmark, additional constraints must be resolved.
 ```
@@ -63,15 +67,15 @@ We use [`accelerate`](https://huggingface.co/docs/accelerate/index) to generate
 accelerate config
 ```
 
-This evaluation harness can also be used in an evaluation only mode, you can use a Multi-CPU setting. For this mode you can also find an example of setup instructions in `evaluation_setup.sh`, where we configure the environment and evaluate some MBPP generations donwloaded from the hub.
+This evaluation harness can also be used in an evaluation-only mode, where a Multi-CPU setting can be used. For large models, set the precision of the model with the `--precision` flag instead of accelerate config, so that only one copy of the model is kept in memory.
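
A minimal sketch of the large-model case described above; the `fp16` value and the model and task names are illustrative assumptions, not taken from this commit:

```bash
# Hypothetical example: load the model in reduced precision so that only
# one copy is kept in memory (fp16 is assumed to be an accepted value).
accelerate launch main.py \
  --model bigcode/santacoder \
  --tasks mbpp \
  --precision fp16 \
  --allow_code_execution
```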

 ## Usage
 You can use this evaluation harness to generate text solutions to code benchmarks with your model, to evaluate (and execute) the solutions, or to do both. While it is better to use GPUs for the generation, the evaluation only requires CPUs, so it might be beneficial to separate these two steps. By default both generation and evaluation are performed.
 
 For more details on how to evaluate on the tasks, please refer to the documentation in [`docs/README.md`](https://github.com/bigcode-project/bigcode-evaluation-harness/blob/main/docs/README.md).
 
 ### Generation and evaluation
-Below are some examples to generate and evaluate on some tasks.
+Below is an example to generate and evaluate on a task.
 
 ```bash
 accelerate launch main.py \
@@ -97,21 +101,22 @@ Some tasks don't require code execution such as
 
 ### Generation only
 
-If you want to generate solutions without executing and evaluating the code, call `--generation_only`, in addition to the instructions above. This will save the solutions in a json file in the working directory.
+If you want to generate solutions without executing and evaluating the code, add the flag `--generation_only` to the instructions above. This will save the solutions in a json file in the working directory, at the path given by `save_generations_path`.
 
-This can be useful if you don't want to execute code in the machine you're using for generations for security or efficiency reasons. For instance, you can do the generations on multiple GPUs, but switch to a multiple workers CPU machine for the execution, which can save money and time.
+This can be useful if you don't want to execute code on the machine you're using for generation, for security or efficiency reasons. For instance, you can do the generations on multiple GPUs, but switch to a multi-worker CPU machine or a Docker container for the execution.
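
A minimal sketch of such a generation-only run, using only flags mentioned in this README (the model, task and file name are illustrative):

```bash
# Generate solutions only; nothing is executed or evaluated in this step.
accelerate launch main.py \
  --model bigcode/santacoder \
  --tasks mbpp \
  --generation_only \
  --save_generations_path generations.json
```

The resulting json file can then be moved to a CPU-only machine or a Docker container for the evaluation step.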

 ### Evaluation only
 
-If you already have the generations in a json file from this evaluation harness and want to evaluate them, specify the path of the generations via the `generation_path` argument. You may need to reconfigure `accelerate` to use multiple CPUs. For this mode, you can also find an example of setup instructions in `evaluation_setup.sh`.
+If you already have the generations in a json file from this evaluation harness and want to evaluate them, specify the path of the generations via the `load_generations_path` argument. You may need to reconfigure `accelerate` to use multiple CPUs. For this mode, you can also find an example of setup instructions in `evaluation_setup.sh`.
 
-Below is an example, be mind of specifying arguments proper to the task you are evaluating on, and note that `model` value here only serves for documenting the experiment.
+Below is an example; be mindful to specify the arguments appropriate to the task you are evaluating on, and note that the `model` value here only serves to document the experiment. Also add `--n_samples` to specify the number of samples to evaluate per problem (usually the same value used during generation).
 
 ```bash
 accelerate launch main.py --tasks mbpp --allow_code_execution --load_generations_path generations.json --model incoder-temperature-08
 ```
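
For instance, if 20 samples per problem were generated, the same call could carry `--n_samples` explicitly (the value 20 here is illustrative):

```bash
accelerate launch main.py --tasks mbpp --allow_code_execution \
  --load_generations_path generations.json \
  --model incoder-temperature-08 --n_samples 20
```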
+
 ## Docker containers
-For safety, we provide a Dockerfiles to do the execution inside a docker container. To do that, first, do the generation on your machine and save them in generations.json by adding the flag --generation_only to the command. Then build the docker container and run the evaluation inside it.
+For safety, we provide Dockerfiles to do the execution inside a docker container. To do that, first run the generation on your machine and save the output, for example in `generations.json`, by adding the flag `--generation_only` to the command. Then build the docker container and run the evaluation inside it.
 
 ### Building Docker image
 Here's how to build a docker image for the evaluation harness:
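
The build commands themselves fall outside this diff's hunks; as a sketch, the `make` invocation visible in the next hunk header would be run as follows (the Dockerfile name is taken from that header):

```bash
# Build the evaluation image from Dockerfile-multiple; per the context below,
# this creates an image called `evaluation-harness-multiple`.
sudo make DOCKERFILE=Dockerfile-multiple all
```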
@@ -127,7 +132,7 @@ $ sudo make DOCKERFILE=Dockerfile-multiple all
 This creates an image called `evaluation-harness-multiple`.
 
 ### Evaluating inside a container
-Suppose you generated text with the `bigcode/santacoder` model and saved it in `generations.json` with:
+Suppose you generated text with the `bigcode/santacoder` model and saved it in `generations_py.json` with:
 ```bash
 accelerate launch main.py \
   --model bigcode/santacoder \
@@ -143,7 +148,7 @@ accelerate launch main.py \
   --save_generations_path generations_py.json
 ```
 
-To run the container (here from image `evaluation-harness`) to evaluate on `generations.json`, or another file mount it with `-v`, specify `n_samples` and allow code execution with `--allow_code_execution` (and add the number of problems `--limit` if it was used during generation):
+To run the container (here from the image `evaluation-harness`) to evaluate on `generations_py.json` or another file, mount it with `-v`, specify `n_samples`, and allow code execution with `--allow_code_execution` (and add the number of problems `--limit` if it was used during generation):
 ```bash
 $ sudo docker run -v $(pwd)/generations_py.json:/app/generations_py.json:ro -it evaluation-harness-multiple python3 main.py \
   --model bigcode/santacoder \
@@ -174,6 +179,8 @@ We thank EleutherAI for their work on the [lm-evaluation harness](https://github
 @software{bigcode-evaluation-harness,
   author = {Ben Allal, Loubna and
             Muennighoff, Niklas and
+            Kumar Umapathi, Logesh and
+            Lipkin, Ben and
             Von Werra, Leandro},
   title = {A framework for the evaluation of code generation models},
   howpublished = {\url{https://github.com/bigcode-project/bigcode-evaluation-harness}},
