| 1 | +## **SciCode Evaluation using `inspect_ai`** |
| 2 | + |
| 3 | +### 1. Set Up Your API Keys |
| 4 | + |
| 5 | +Follow [`inspect_ai`'s official documentation](https://inspect.ai-safety-institute.org.uk/#getting-started) to set up the API keys that correspond to the models you would like to evaluate. |
| 6 | + |
| 7 | +### 2. Set Up Command-Line Arguments if Needed |
| 8 | + |
| 9 | +In most cases, once the API key is set up, you can start the SciCode evaluation directly with the following command. |
| 10 | + |
| 11 | +```bash |
| 12 | +inspect eval scicode.py --model <your_model> --temperature 0 |
| 13 | +``` |
| 14 | + |
| 15 | +Some additional command-line arguments may also be useful: |
| 16 | + |
| 17 | +- `--max-connections`: Maximum number of concurrent API connections to the evaluated model. |
| 18 | +- `--limit`: Maximum number of samples to evaluate from the SciCode dataset. |
| 19 | +- `-T input_path=<another_input_json_file>`: Use a different JSON dataset (e.g., the dev set). |
| 20 | +- `-T output_dir=<your_output_dir>`: Override the default output directory (`./tmp`). |
| 21 | +- `-T with_background=True/False`: Whether to include problem background. |
| 22 | +- `-T mode=normal/gold/dummy`: Select the run mode. Besides `normal`, two additional modes are provided for sanity checks. |
| 23 | + - `normal` mode is the standard mode for evaluating a model. |
| 24 | + - `gold` mode loads the gold answers and can only be used on the dev set. |
| 25 | + - `dummy` mode does not call any real LLM and generates dummy outputs instead. |
| 26 | + |
| 27 | +For example, the following command runs five samples from the dev set with background in `gold` mode: |
| 28 | + |
| 29 | +```bash |
| 30 | +inspect eval scicode.py \ |
| 31 | + --model openai/gpt-4o \ |
| 32 | + --temperature 0 \ |
| 33 | + --limit 5 \ |
| 34 | + -T input_path=../data/problems_dev.jsonl \ |
| 35 | + -T output_dir=./tmp/dev \ |
| 36 | + -T with_background=True \ |
| 37 | + -T mode=gold |
| 38 | +``` |
| 39 | + |
| 40 | +For more information regarding `inspect_ai`, we refer users to its [official documentation](https://inspect.ai-safety-institute.org.uk/). |
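
The arguments described above can also be assembled by a small wrapper script. The following is only a sketch under stated assumptions: `build_scicode_cmd` is a hypothetical helper that is not part of SciCode or `inspect_ai`. It prints the command instead of running it, and it adds the dev-set path from the example above whenever `gold` mode is requested, since that mode only works on the dev set.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of SciCode or inspect_ai): prints an
# `inspect eval` command built from a model name, a sample limit, and a mode.
set -euo pipefail

build_scicode_cmd() {
  local model="$1" limit="$2" mode="$3"

  # Only the three modes listed in this README are accepted.
  case "$mode" in
    normal|gold|dummy) ;;
    *) echo "unknown mode: $mode" >&2; return 1 ;;
  esac

  local cmd=(inspect eval scicode.py
             --model "$model" --temperature 0
             --limit "$limit" -T "mode=$mode")

  # gold mode needs the dev set; the path is taken from the example above.
  if [[ "$mode" == "gold" ]]; then
    cmd+=(-T "input_path=../data/problems_dev.jsonl")
  fi

  # Print the command on one line so it can be reviewed before running.
  printf '%s\n' "${cmd[*]}"
}

# Usage: print the command for a five-sample gold run.
build_scicode_cmd "openai/gpt-4o" 5 gold
```

Printing rather than executing keeps the sketch safe to try without API keys; the printed line can be pasted into a shell once it looks right.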