3 files changed: +26 -7 lines changed
@@ -34,6 +34,7 @@ SciCode sources challenging and realistic research-level coding problems across
 | 🥇 OpenAI o1-preview | <div align="center">**7.7**</div> | <div align="center" style="color:grey">28.5</div> |
 | 🥈 Claude3.5-Sonnet | <div align="center">**4.6**</div> | <div align="center" style="color:grey">26.0</div> |
 | 🥉 Claude3.5-Sonnet (new) | <div align="center">**4.6**</div> | <div align="center" style="color:grey">25.3</div> |
+| Deepseek-v3 | <div align="center">**3.1**</div> | <div align="center" style="color:grey">23.7</div> |
 | Deepseek-Coder-v2 | <div align="center">**3.1**</div> | <div align="center" style="color:grey">21.2</div> |
 | GPT-4o | <div align="center">**1.5**</div> | <div align="center" style="color:grey">25.0</div> |
 | GPT-4-Turbo | <div align="center">**1.5**</div> | <div align="center" style="color:grey">22.9</div> |
@@ -14,10 +14,11 @@ inspect eval scicode.py --model <your_model> --temperature 0

 However, there are some additional command line arguments that could be useful as well.

-- `--max_connections`: Maximum amount of API connections to the evaluated model.
+- `--max-connections`: Maximum amount of API connections to the evaluated model.
 - `--limit`: Limit of the number of samples to evaluate in the SciCode dataset.
 - `-T input_path=<another_input_json_file>`: This is useful when user wants to change to another json dataset (e.g., the dev set).
 - `-T output_dir=<your_output_dir>`: This changes the default output directory (`./tmp`).
+- `-T h5py_file=<your_h5py_file>`: This is used if your h5py file is not downloaded in the recommended directory.
 - `-T with_background=True/False`: Whether to include problem background.
 - `-T mode=normal/gold/dummy`: This provides two additional modes for sanity checks.
     - `normal` mode is the standard mode to evaluate a model
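Reviewer note: the options listed in this hunk combine into a single `inspect eval` invocation. The sketch below shows one way to assemble that argument list programmatically. It is illustrative only: `build_inspect_command` is a hypothetical helper that is not part of this PR or of `inspect_ai`, and it uses only flags that appear in this diff.

```python
import shlex


def build_inspect_command(model, temperature=0, max_connections=None,
                          limit=None, task_args=None):
    """Assemble an `inspect eval scicode.py` argument list (hypothetical helper)."""
    cmd = ["inspect", "eval", "scicode.py",
           "--model", model,
           "--temperature", str(temperature)]
    if max_connections is not None:
        cmd += ["--max-connections", str(max_connections)]
    if limit is not None:
        cmd += ["--limit", str(limit)]
    # Task-level options are passed as repeated `-T key=value` pairs.
    for key, value in (task_args or {}).items():
        cmd += ["-T", f"{key}={value}"]
    return cmd


command = build_inspect_command(
    "together/deepseek-ai/DeepSeek-V3",
    max_connections=2,
    task_args={"output_dir": "./tmp/deepseek-v3", "with_background": False},
)
# Print a shell-quoted version that can be pasted into a terminal.
print(shlex.join(command))
```

The resulting list can be handed to `subprocess.run(command, check=True)` if running the evaluation from a script is preferred over the shell.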
@@ -37,6 +38,19 @@ inspect eval scicode.py \
     -T mode=gold
 ```

+User can run the evaluation on `Deepseek-v3` using together ai via the following command:
+
+```bash
+export TOGETHER_API_KEY=<YOUR_API_KEY>
+inspect eval scicode.py \
+    --model together/deepseek-ai/DeepSeek-V3 \
+    --temperature 0 \
+    --max-connections 2 \
+    --max-tokens 32784 \
+    -T output_dir=./tmp/deepseek-v3 \
+    -T with_background=False
+```
+
 For more information regarding `inspect_ai`, we refer users to its [official documentation](https://inspect.ai-safety-institute.org.uk/).

 ### Extra: How SciCode are Evaluated Under the Hood?
@@ -336,12 +336,16 @@ async def solve(state: TaskState, generate: Generate) -> TaskState:
             elif params["mode"] == "gold":
                 response_from_llm = generate_gold_response(state.metadata, idx + 1)
             else:
-                # ===Model Generation===
-                state.user_prompt.text = prompt
-                state_copy = copy.deepcopy(state)
-                result = await generate(state=state_copy)
-                response_from_llm = result.output.completion
-                # ===Model Generation===
+                try:
+                    # ===Model Generation===
+                    state.user_prompt.text = prompt
+                    state_copy = copy.deepcopy(state)
+                    result = await generate(state=state_copy)
+                    response_from_llm = result.output.completion
+                    # ===Model Generation===
+                except:
+                    print(f"Failed to generate response for problem {prob_id} step {idx + 1}.")
+                    response_from_llm = generate_dummy_response(prompt)
             prompt_assistant.register_previous_response(
                 prob_data=state.metadata,
                 response=response_from_llm,
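Reviewer note: the hunk above wraps model generation in a bare `except:` and falls back to `generate_dummy_response(prompt)`. The sketch below isolates that fallback pattern with a narrower `except Exception`, which may be worth considering because a bare `except:` also catches `KeyboardInterrupt` and task cancellation. Everything here is a stand-in under stated assumptions: `flaky_generate` and `generate_with_fallback` are hypothetical names, and the stub `generate_dummy_response` does not reproduce the repository's real helper, whose output is not shown in this diff.

```python
import asyncio


def generate_dummy_response(prompt: str) -> str:
    """Stand-in for the repository's helper; the real output format is assumed."""
    return "# dummy response"


async def flaky_generate(prompt: str) -> str:
    """Hypothetical model call that always fails, to exercise the fallback path."""
    raise RuntimeError("simulated API failure")


async def generate_with_fallback(prompt: str, prob_id: str, step: int) -> str:
    try:
        return await flaky_generate(prompt)
    except Exception as exc:
        # `Exception` excludes KeyboardInterrupt and asyncio.CancelledError
        # (Python 3.8+), so interrupts and cancellation still propagate.
        print(f"Failed to generate response for problem {prob_id} step {step}: {exc!r}")
        return generate_dummy_response(prompt)


response = asyncio.run(generate_with_fallback("Write a function.", prob_id="1", step=1))
print(response)
```

Including the exception in the log message makes it easier to tell a rate limit from a malformed request when reviewing failed steps afterwards.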