`docs/en/Quickstart.md` (2 additions, 0 deletions)
@@ -193,6 +193,8 @@ Some datasets have specific requirements during evaluation:
* **SciCode:**
  * **Environment Dependencies:** Before running, you need to download the runtime dependency file `test_data.h5` according to the [official instructions](https://github.com/scicode-bench/SciCode) and place it in the `scieval/dataset/SciCode/eval/data` directory.
  * **Evaluation Files:** By default, the framework stores the model's inference results in an `xlsx` file for easy viewing. For SciCode, however, the output of some models, such as `deepseek-R1`, may exceed the cell length limit of `xlsx`. In that case, set the environment variable `PRED_FORMAT` to `json` or `tsv` (only `xlsx`, `json`, and `tsv` are currently supported); see the pre-flight sketch after this list.
* **SGI-Bench-1.0:**
  * **Instructions for use:** See `scieval/dataset/SGI_Bench_1_0/readme.md` for details.
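Both SciCode requirements above can be checked before a run starts. Below is a minimal pre-flight sketch, assuming the evaluation is launched from the repository root and that the framework reads `PRED_FORMAT` from the process environment; the script itself is illustrative and not part of the framework:

```python
import os
from pathlib import Path

# Path given in the SciCode instructions above; test_data.h5 itself must be
# downloaded manually from the official SciCode repository.
scicode_data = Path("scieval/dataset/SciCode/eval/data/test_data.h5")
if not scicode_data.is_file():
    raise SystemExit(
        f"Missing {scicode_data}: download test_data.h5 per the official "
        "SciCode instructions and place it in that directory."
    )

# Store predictions as JSON instead of the default xlsx so that very long
# outputs (e.g. deepseek-R1 on SciCode) are not truncated by the xlsx cell
# length limit; "tsv" is the other supported alternative.
os.environ["PRED_FORMAT"] = "json"

# ...then launch the evaluation from this process, or equivalently export
# PRED_FORMAT in your shell before invoking the framework's usual entry point.
```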
You are an expert in systematically validating and evaluating LLM-generated solutions. Your task is to rigorously analyze the correctness of a provided solution by comparing it step-by-step against the reference solution, and output **only** a structured verification list—with no additional text.
## Instructions
1. Break down the given LLM solution into individual steps and evaluate each one against the corresponding reference solution steps.
2. For each step, include the following three components:
   - **solution_step**: The specific part of the LLM solution being evaluated.
   - **reason**: A clear, critical explanation of whether the step contains errors, omissions, or deviations from the reference approach. Be stringent in your assessment.
   - **judge**: Your verdict: either `"correct"` or `"incorrect"`.
3. If the final LLM answer is incorrect, you must identify at least one step in your analysis as incorrect.
4. Justify your judgments rigorously, pointing out even minor inaccuracies or logical flaws.
5. Do not attempt to answer the original question—your role is strictly to evaluate.
6. Output **only** a list of dictionaries in the exact format provided below. Do not include any other text or comments.
## Question
{ques_dict['question']}
## Reference Solution Steps
{newline.join(ques_dict['steps'])}
## Reference Answer
{ques_dict['answer']}
## LLM Solution Steps
{ques_dict['prediction']}
## LLM Answer
{extract_final_answer(ques_dict['prediction'])}
## Output Example
[
    {{"solution_step": "step content", "reason": "reason of the judgement", "judge": "correct or incorrect"}},
    {{"solution_step": "step content", "reason": "reason of the judgement", "judge": "correct or incorrect"}},