scicode-bench
diff --git a/.github/workflows/tests.yaml b/.github/workflows/tests.yaml
Lines changed: 1 addition & 1 deletion
diff --git a/.gitignore b/.gitignore
Lines changed: 5 additions & 2 deletions
diff --git a/README.md b/README.md
Lines changed: 15 additions & 0 deletions
diff --git a/eval/inspect_ai/README.md b/eval/inspect_ai/README.md
Lines changed: 40 additions & 0 deletions
@@ -20,7 +20,7 @@ jobs:
         uses: actions/checkout@v2
       - uses: actions/setup-python@v5
         with:
-          python-version: '3.9'
+          python-version: '3.10'
       - name: Create dummy keys.cfg
         run: touch keys.cfg
       - name: Install uv
 
@@ -2,10 +2,13 @@
 keys.cfg
 **/test_result/**
 **/output/**
-**/eval_results/**
+**/eval_results*/**
 eval/logs/**
 *.h5
-
+logs/**
+**/logs/**
+**/tmp/**
+integration/**
 
 # -------
 
 
@@ -7,6 +7,8 @@ This repo contains the evaluation code for the paper "[SciCode: A Research Codin
 
 ## 🔔News
 
+**[2025-01-24]: SciCode has been integrated with [`inspect_ai`](https://inspect.ai-safety-institute.org.uk/) for easier and faster model evaluations.**
+
 **[2024-11-04]: Leaderboard is on! Check [here](https://scicode-bench.github.io/leaderboard/). We have also added Claude Sonnet 3.5 (new) results.**
 
 **[2024-10-01]: We have added OpenAI o1-mini and o1-preview results.**
@@ -54,6 +56,19 @@ SciCode sources challenging and realistic research-level coding problems across
 4. Run `eval/scripts/gencode_json.py` to generate new model outputs (see the [`eval/scripts` readme](eval/scripts/)) for more information
 5. Run `eval/scripts/test_generated_code.py` to evaluate the unittests
 
+
+## Instructions to evaluate a new model using `inspect_ai` (recommended)
+
+SciCode has been integrated with `inspect_ai` for easier and faster model evaluation than the method above. Complete the first three steps in the [above section](#instructions-to-evaluate-a-new-model), then go to the `eval/inspect_ai` directory, set up the corresponding API key, and run the following command:
+
+```bash
+cd eval/inspect_ai
+export OPENAI_API_KEY=your-openai-api-key
+inspect eval scicode.py --model openai/gpt-4o --temperature 0
+```
+
+For more details on using `inspect_ai`, see the [`eval/inspect_ai` readme](eval/inspect_ai/).
+
 ## More information and FAQ
 
 More information, including a [FAQ section](https://scicode-bench.github.io/faq/), is provided on our [website](https://scicode-bench.github.io/).
 
@@ -0,0 +1,40 @@
+## **SciCode Evaluation using `inspect_ai`**
+
+### 1. Set Up Your API Keys
+
+Users can follow [`inspect_ai`'s official documentation](https://inspect.ai-safety-institute.org.uk/#getting-started) to set up the corresponding API keys for the types of models they would like to evaluate.
+
+### 2. Set Up Command-Line Arguments if Needed
+
+In most cases, after setting up the key, users can start the SciCode evaluation directly with the following command.
+
+```bash
+inspect eval scicode.py --model <your_model> --temperature 0
+```
+
+However, there are some additional command line arguments that could be useful as well.
+
+- `--max_connections`: Maximum number of API connections to the evaluated model.
+- `--limit`: Limit on the number of samples to evaluate from the SciCode dataset.
+- `-T input_path=<another_input_json_file>`: Useful when the user wants to switch to another JSON dataset (e.g., the dev set).
+- `-T output_dir=<your_output_dir>`: This changes the default output directory (`./tmp`).
+- `-T with_background=True/False`: Whether to include problem background.
+- `-T mode=normal/gold/dummy`: This provides two additional modes for sanity checks.
+    - `normal` mode is the standard mode to evaluate a model
+    - `gold` mode loads the gold answers and can only be used on the dev set
+    - `dummy` mode does not call any real LLMs and generates some dummy outputs
+
+For example, users can run five samples on the dev set with background as follows:
+
+```bash
+inspect eval scicode.py \
+    --model openai/gpt-4o \
+    --temperature 0 \
+    --limit 5 \
+    -T input_path=../data/problems_dev.jsonl \
+    -T output_dir=./tmp/dev \
+    -T with_background=True \
+    -T mode=gold
+```
+
+For more information regarding `inspect_ai`, we refer users to its [official documentation](https://inspect.ai-safety-institute.org.uk/).
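
When the evaluation is launched from a script rather than typed by hand, the command-line options documented in the new README can be assembled programmatically. The following is a minimal sketch, not part of this PR: the helper `build_inspect_command` and its parameters are hypothetical, and it only emits flags that appear in the README above (`--model`, `--temperature`, `--limit`, and repeated `-T key=value` task arguments).

```python
import shlex


def build_inspect_command(model, temperature=0, limit=None, task_args=None):
    """Assemble an `inspect eval scicode.py` argument list.

    Hypothetical helper: only flags shown in the README are emitted.
    """
    cmd = [
        "inspect", "eval", "scicode.py",
        "--model", model,
        "--temperature", str(temperature),
    ]
    if limit is not None:
        cmd += ["--limit", str(limit)]
    # Task-level options are passed as repeated `-T key=value` pairs.
    for key, value in (task_args or {}).items():
        cmd += ["-T", f"{key}={value}"]
    return cmd


# Mirrors the dev-set example from the README.
cmd = build_inspect_command(
    "openai/gpt-4o",
    limit=5,
    task_args={
        "input_path": "../data/problems_dev.jsonl",
        "output_dir": "./tmp/dev",
        "with_background": True,
        "mode": "gold",
    },
)

# Print a copy-pasteable shell command.
print(shlex.join(cmd))
```

The list form can be handed to `subprocess.run(cmd, cwd="eval/inspect_ai", check=True)` without shell quoting concerns, assuming `inspect` is installed and the relevant API key is exported in the environment.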