@@ -227,12 +227,12 @@ docker run -it --entrypoint bigcodebench.syncheck -v $(pwd):/app bigcodebench/bi
 You are strongly recommended to use a sandbox such as [docker](https://docs.docker.com/get-docker/):
 
 ```bash
-# mount the current directory to the container
-docker run -v $(pwd):/app bigcodebench/bigcodebench-evaluate:latest --subset [complete|instruct] --samples samples.jsonl
+# Mount the current directory to the container
+docker run -v $(pwd):/app bigcodebench/bigcodebench-evaluate:latest --subset [complete|instruct] --samples samples-sanitized-calibrated
 # ...Or locally ⚠️
-bigcodebench.evaluate --subset [complete|instruct] --samples samples.jsonl
+bigcodebench.evaluate --subset [complete|instruct] --samples samples-sanitized-calibrated
 # ...If the ground truth is working locally (due to some flaky tests)
-bigcodebench.evaluate --subset [complete|instruct] --samples samples.jsonl --no-gt
+bigcodebench.evaluate --subset [complete|instruct] --samples samples-sanitized-calibrated --no-gt
 ```
 
 ...Or if you want to try it locally regardless of the risks ⚠️:
@@ -247,9 +247,9 @@ Then, run the evaluation:
 
 ```bash
 # ...Or locally ⚠️
-bigcodebench.evaluate --subset [complete|instruct] --samples samples-calibrated.jsonl
+bigcodebench.evaluate --subset [complete|instruct] --samples samples-sanitized-calibrated.jsonl
 # ...If the ground truth is not working locally
-bigcodebench.evaluate --subset [complete|instruct] --samples samples-calibrated.jsonl --no-gt
+bigcodebench.evaluate --subset [complete|instruct] --samples samples-sanitized-calibrated --no-gt
 ```
 
 > [!Tip]
@@ -303,7 +303,7 @@ Here are some tips to speed up the evaluation:
 You can inspect the failed samples by using the following command:
 
 ```bash
-bigcodebench.inspect --eval-results sample-sanitized_eval_results.json --in-place
+bigcodebench.inspect --eval-results sample-sanitized-calibrated_eval_results.json --in-place
 ```
 
 ## Full Script