Commit ae94984

doc: 0.2.0 pre-release
1 parent db22671 commit ae94984

File tree: 3 files changed, +750 -276 lines


ADVANCED_USAGE.md

Lines changed: 301 additions & 0 deletions
## 🔥 Advanced Start

To get started, please first set up the environment:

```bash
# Install to use bigcodebench.evaluate
pip install bigcodebench --upgrade
# If you want to run the evaluation locally, you also need to install the evaluation requirements
pip install -I -r https://raw.githubusercontent.com/bigcode-project/bigcodebench/main/Requirements/requirements-eval.txt

# Install to use bigcodebench.generate
# You are strongly recommended to install the [generate] dependencies in a separate environment
pip install bigcodebench[generate] --upgrade
```

<details><summary>⏬ Install nightly version <i>:: click to expand ::</i></summary>
<div>

```bash
# Install to use bigcodebench.evaluate
pip install "git+https://github.com/bigcode-project/bigcodebench.git" --upgrade

# Install to use bigcodebench.generate
pip install "git+https://github.com/bigcode-project/bigcodebench.git#egg=bigcodebench[generate]" --upgrade
```

</div>
</details>

<details><summary>⏬ Using BigCodeBench as a local repo? <i>:: click to expand ::</i></summary>
<div>

```bash
git clone https://github.com/bigcode-project/bigcodebench.git
cd bigcodebench
export PYTHONPATH=$PYTHONPATH:$(pwd)
# Install to use bigcodebench.evaluate
pip install -e .
# Install to use bigcodebench.generate
pip install -e .[generate]
```

</div>
</details>
### 🚀 Local Generation

```bash
# When decoding greedily, there is no need to set temperature and n_samples
bigcodebench.generate \
  --model [model_name] \
  --split [complete|instruct] \
  --subset [full|hard] \
  [--greedy] \
  --bs [bs] \
  --temperature [temp] \
  --n_samples [n_samples] \
  --resume \
  --backend [vllm|openai|mistral|anthropic|google|hf] \
  --tp [TENSOR_PARALLEL_SIZE] \
  [--trust_remote_code] \
  [--base_url [base_url]] \
  [--tokenizer_name [tokenizer_name]]
```
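
For instance, a minimal greedy run on the Hard subset of the Complete split might look like the sketch below; the model name is only a placeholder, so substitute whatever Hugging Face model (or API model) you actually want to benchmark:

```bash
# A sketch of a greedy run; the model id is a placeholder, not a recommendation
bigcodebench.generate \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --split complete \
  --subset hard \
  --greedy \
  --bs 1 \
  --resume \
  --backend vllm \
  --tp 1
```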

The generated code samples will be stored in a file named `[model_name]--bigcodebench-[instruct|complete]--[backend]-[temp]-[n_samples].jsonl`. Alternatively, you can use the following command to utilize our pre-built docker images for generating code samples:

```bash
# If you are using GPUs
docker run --gpus '"device=$CUDA_VISIBLE_DEVICES"' -v $(pwd):/app -t bigcodebench/bigcodebench-generate:latest \
  --model [model_name] \
  --split [complete|instruct] \
  --subset [full|hard] \
  [--greedy] \
  --bs [bs] \
  --temperature [temp] \
  --n_samples [n_samples] \
  --resume \
  --backend [vllm|openai|mistral|anthropic|google|hf] \
  --tp [TENSOR_PARALLEL_SIZE]

# ...Or if you are using CPUs
docker run -v $(pwd):/app -t bigcodebench/bigcodebench-generate:latest \
  --model [model_name] \
  --split [complete|instruct] \
  --subset [full|hard] \
  [--greedy] \
  --bs [bs] \
  --temperature [temp] \
  --n_samples [n_samples] \
  --resume \
  --backend [vllm|hf|openai|mistral|anthropic|google]
```

```bash
# If you wish to use gated or private Hugging Face models and datasets
docker run -e HUGGING_FACE_HUB_TOKEN=$token -v $(pwd):/app -t bigcodebench/bigcodebench-generate:latest # omit other arguments

# Similarly, to use other backends that require authentication
docker run -e OPENAI_API_KEY=$OPENAI_API_KEY -v $(pwd):/app -t bigcodebench/bigcodebench-generate:latest # omit other arguments
docker run -e ANTHROPIC_KEY=$ANTHROPIC_KEY -v $(pwd):/app -t bigcodebench/bigcodebench-generate:latest # omit other arguments
docker run -e MISTRAL_KEY=$MISTRAL_KEY -v $(pwd):/app -t bigcodebench/bigcodebench-generate:latest # omit other arguments
docker run -e GOOGLE_API_KEY=$GOOGLE_API_KEY -v $(pwd):/app -t bigcodebench/bigcodebench-generate:latest # omit other arguments
```

You can then run the pre-built container as shown above.

<details><summary>🤔 Structure of `problem`? <i>:: click to expand ::</i></summary>
<div>

* `task_id` is the identifier string for the task
* `entry_point` is the name of the function
* `complete_prompt` is the prompt for BigCodeBench-Complete
* `instruct_prompt` is the prompt for BigCodeBench-Instruct
* `canonical_solution` is the ground-truth implementation
* `test` is the `unittest.TestCase` class

</div>
</details>

> [!Note]
>
> **Expected Schema of `[model_name]--bigcodebench-[task]--[backend]-[temp]-[n_samples].jsonl`**
>
> 1. `task_id`: the task ID, one of the keys of `get_bigcodebench()`
> 2. `solution` (optional): the self-contained solution (usually including the prompt)
> 3. `raw_solution` (optional): the raw solution generated by the LLM
> * Example: `{"task_id": "BigCodeBench/?", "solution": "def f():\n return 1", "raw_solution": "def f():\n return 1\nprint(f())"}`
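
For illustration, the snippet below hand-writes a single-entry file in this schema; the task ID and solutions are dummies, used only to show the field layout:

```bash
# A hypothetical one-line samples file, only to illustrate the expected fields
cat > samples.jsonl << 'EOF'
{"task_id": "BigCodeBench/0", "solution": "def f():\n    return 1", "raw_solution": "def f():\n    return 1\nprint(f())"}
EOF
```

The quoted `'EOF'` delimiter keeps the contents verbatim, so the `\n` sequences remain JSON escape characters in the file.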

<details><summary>🔎 Checking the compatibility of post-processed code <i>:: click to expand ::</i></summary>
<div>

To double-check the post-processing results, you can use `bigcodebench.syncheck` to verify the code validity before and after sanitization; it prints erroneous code snippets and explains why they are wrong:

```bash
# 💡 If you are storing codes in jsonl:
bigcodebench.syncheck --samples samples.jsonl

# 💡 If you are storing codes in directories:
bigcodebench.syncheck --samples /path/to/vicuna-[??]b_temp_[??]

# 💡 Or change the entrypoint to bigcodebench.syncheck in any pre-built docker image, like
docker run -it --entrypoint bigcodebench.syncheck -v $(pwd):/app bigcodebench/bigcodebench-evaluate:latest --samples samples.jsonl
```

</div>
</details>

### Local Evaluation

You are strongly recommended to use a sandbox such as [Docker](https://docs.docker.com/get-docker/):

```bash
# Mount the current directory to the container
# To change the RAM address space limit (in MB, 30 GB by default): `--max-as-limit XXX`
# To change the RAM data segment limit (in MB, 30 GB by default): `--max-data-limit XXX`
# To change the RAM stack limit (in MB, 10 MB by default): `--max-stack-limit XXX`
# To increase the execution time limit (in seconds, 240 seconds by default): `--min-time-limit XXX`
docker run -v $(pwd):/app bigcodebench/bigcodebench-evaluate:latest --local_execute --split [complete|instruct] --subset [full|hard] --samples samples-sanitized-calibrated.jsonl

# If you only want to check the ground truths
docker run -v $(pwd):/app bigcodebench/bigcodebench-evaluate:latest --local_execute --split [complete|instruct] --subset [full|hard] --samples samples-sanitized-calibrated.jsonl --check-gt-only
```

...Or if you want to try it locally regardless of the risks ⚠️:

First, install the dependencies for BigCodeBench:

```bash
pip install -r https://raw.githubusercontent.com/bigcode-project/bigcodebench/main/Requirements/requirements-eval.txt
```

Then, run the evaluation:

```bash
# ...Or locally ⚠️
bigcodebench.evaluate --local_execute --split [complete|instruct] --subset [full|hard] --samples samples-sanitized-calibrated.jsonl
# ...If you really don't want to check the ground truths
bigcodebench.evaluate --local_execute --split [complete|instruct] --subset [full|hard] --samples samples-sanitized-calibrated.jsonl --no-gt
# If you want to save the pass rate to a file
bigcodebench.evaluate --local_execute --split [complete|instruct] --subset [full|hard] --samples samples-sanitized-calibrated.jsonl --save_pass_rate

# You are strongly recommended to use the following commands to clean up the environment after evaluation:
pids=$(ps -u $(id -u) -o pid,comm | grep 'bigcodebench' | awk '{print $1}'); if [ -n "$pids" ]; then echo $pids | xargs -r kill; fi;
rm -rf /tmp/*
```

> [!Tip]
>
> If you want to customize the `k` in `Pass@k`, please pass `--pass_k` with a comma-separated string.
> For example, if you want to use `Pass@1` and `Pass@100`, you can pass `--pass_k 1,100`.
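
For instance, a local run that reports `Pass@1` and `Pass@5` could look like the sketch below (it assumes you generated at least 5 samples per task):

```bash
# A sketch: report Pass@1 and Pass@5 on the Hard subset of the Complete split
bigcodebench.evaluate --local_execute --split complete --subset hard \
  --samples samples-sanitized-calibrated.jsonl --pass_k 1,5
```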

> [!Tip]
>
> Do you use a very slow machine?
>
> LLM solutions are regarded as **failed** on timeout (and OOM, etc.).
> Specifically, we set a dynamic timeout based on the ground-truth solution's runtime.
>
> Additionally, you are **NOT** encouraged to overstress your test bed while running the evaluation.
> For example, using `--parallel 64` on a 4-core machine, or doing other heavy work during the evaluation, is a bad idea...

<details><summary>⌨️ More command-line flags <i>:: click to expand ::</i></summary>
<div>

* `--parallel`: by default half of the cores

</div>
</details>

The output should look like the following (a GPT-4 greedy-decoding example):

```
Asserting the groundtruth...
Expected outputs computed in 1200.0 seconds
Reading samples...
1140it [00:00, 1901.64it/s]
Evaluating samples...
100%|██████████████████████████████████████████| 1140/1140 [19:53<00:00, 6.75it/s]
BigCodeBench-Instruct-calibrated
Groundtruth pass rate: 1.000
pass@1: 0.568
```

- A cache file named like `samples_eval_results.json` will be created. Remove it to re-run the evaluation.

<details><summary>🤔 How long will it take? <i>:: click to expand ::</i></summary>
<div>

If you do greedy decoding, where there is only one sample per task, the evaluation should take just a few minutes on an Intel(R) Xeon(R) Gold 6150 CPU @ 2.70GHz (2 sockets, 18 cores per socket). However, if you have multiple samples per task, the evaluation will take longer.
Here are some tips to speed up the evaluation:

* Use `--parallel $(nproc)`
* Use our pre-evaluated results (see [LLM-generated code](#-llm-generated-code))

</div>
</details>

## 🔍 Failure Inspection

You can inspect the failed samples by using the following command:

```bash
# Inspect the failed samples and save the results to `inspect/`
bigcodebench.inspect --eval_results sample-sanitized-calibrated_eval_results.json --split complete --subset hard

# Re-run the inspection in place
bigcodebench.inspect --eval_results sample-sanitized-calibrated_eval_results.json --split complete --subset hard --in_place
```

## 🚀 Full Script

We provide a sample script to run the full pipeline:

```bash
bash run.sh
```

## 📊 Result Analysis

We provide a script to replicate analyses such as Elo Rating and Task Solve Rate, which help you further understand model performance.

To run the analysis, you need to put all the `samples_eval_results.json` files in a `results` folder located in the same directory as the script.

```bash
cd analysis
python get_results.py
```
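
Putting it together, the setup could look like the sketch below; the source path of the result files is an assumption, so point it at wherever your evaluation outputs actually live:

```bash
# A sketch: collect the evaluation results and run the analysis script
# (the source glob is an assumption; adjust it to your own *_eval_results.json files)
mkdir -p analysis/results
cp ./*_eval_results.json analysis/results/
cd analysis
python get_results.py
```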

## 💻 LLM-generated Code

We share pre-generated code samples from the LLMs we have [evaluated](https://huggingface.co/spaces/bigcode/bigcodebench-leaderboard):
* See the attachments of our [v0.1.5 release](https://github.com/bigcode-project/bigcodebench/releases/tag/v0.1.5). We include both `sanitized_samples.zip` and `sanitized_samples_calibrated.zip` for your convenience.
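
If you want to start from these pre-generated samples, you can fetch the release attachments directly; the asset URLs below are assumed from GitHub's usual release-download pattern, so double-check them against the release page:

```bash
# Assumed asset URLs following GitHub's release-download convention; verify against the v0.1.5 release page
wget https://github.com/bigcode-project/bigcodebench/releases/download/v0.1.5/sanitized_samples.zip
wget https://github.com/bigcode-project/bigcodebench/releases/download/v0.1.5/sanitized_samples_calibrated.zip
unzip sanitized_samples_calibrated.zip
```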

## 🐞 Known Issues

- [x] Due to [the Hugging Face tokenizer update](https://github.com/huggingface/transformers/pull/31305), some tokenizers may be broken and will degrade the performance of the evaluation. Therefore, we initialize them with `legacy=False`. If you notice unexpected behavior, please try `--tokenizer_legacy` during generation.

- [x] Due to flakiness in the evaluation, the execution results may vary slightly (~0.2% for the Full set and ~0.6% for the Hard set) between runs. We are working on improving the evaluation stability.

- [x] You may get errors like `ImportError: /usr/local/lib/python3.10/site-packages/matplotlib/_c_internal_utils.cpython-310-x86_64-linux-gnu.so: failed to map segment from shared object` when running the evaluation. This is due to the memory limit of the docker container. You can increase the container's memory limit to solve this issue. If the issue persists, please use the real-time code execution session in the [leaderboard](https://huggingface.co/spaces/bigcode/bigcodebench-leaderboard) to evaluate the code.

- [x] We are aware that some users need a proxy to access the internet. We are working on a subset of the tasks that does not require internet access to evaluate the code.

## 📜 Citation

```bibtex
@article{zhuo2024bigcodebench,
  title={BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions},
  author={Zhuo, Terry Yue and Vu, Minh Chien and Chim, Jenny and Hu, Han and Yu, Wenhao and Widyasari, Ratnadira and Yusuf, Imam Nur Bani and Zhan, Haolan and He, Junda and Paul, Indraneil and others},
  journal={arXiv preprint arXiv:2406.15877},
  year={2024}
}
```

## 🙏 Acknowledgement

- [EvalPlus](https://github.com/evalplus/evalplus)
