|
+EVAL_PROMPT = (
| 3 | + "You are a helpful assistant who evaluates the correctness and quality of models' outputs.\nPlease as a grading " |
+    'expert, judge whether the final answers given by the candidates below are consistent with the standard answers, '
+    'that is, whether the candidates answered correctly. \n \n Here are some evaluation criteria:\n '
+    "1. Please refer to the given standard answer. You don't need to re-generate the answer to the question because "
+    "the standard answer has been given. You only need to judge whether the candidate's answer is consistent with the "
+    "standard answer according to the form of the question. Don't try to answer the original question. You can assume "
+    "that the standard answer is definitely correct.\n 2. Because the candidate's answer may be different from the "
+    'standard answer in the form of expression, before making a judgment, please understand the question and the '
+    "standard answer first, and then judge whether the candidate's answer is correct, but be careful not to try to "
+    'answer the original question.\n 3. Some answers may contain multiple items, such as multiple-choice questions, '
+    'multiple-select questions, fill-in-the-blank questions, etc. As long as the answer is the same as the standard '
+    'answer, it is enough. For multiple-select questions and multiple-blank fill-in-the-blank questions, the candidate '
+    'needs to answer all the corresponding options or blanks correctly to be considered correct.\n 4. Some answers '
+    'may be expressed in different ways, such as some answers may be a mathematical expression, some answers may be a '
+    'textual description, as long as the meaning expressed is the same. And some formulas are expressed in different '
+    'ways, but they are equivalent and correct.\n 5. If the prediction is given with \\boxed{{}}, please ignore '
+    "the \\boxed{{}} and only judge whether the candidate's answer is consistent with the standard answer.\n\n "
+    'Please judge whether the following answers are consistent with the standard answer based on the above criteria. '
+    'Grade the predicted answer of this new question as one of:\n A: CORRECT \n B: INCORRECT\n Just return '
| 22 | + "the letters \"A\" or \"B\", with no text around it.\n\n Here is your task. Simply reply with either CORRECT, " |
| 23 | + "INCORRECT. Don't apologize or correct yourself if there was a mistake; we are just trying to grade the answer." |
+    '\n\n\n <Original Question Begin>: \n\n{question}\n\n<Original Question End>\n\n\n '
+    '<Gold Target Begin>: \n{answer}\n<Gold Target End>\n\n\n <Predicted Answer Begin>: \n{prediction}\n'
| 26 | + "<Predicted End>\n\n\n \n Judging the correctness of candidates' answers:\"\n" |
+)
+
 datasets = [
     dict(
         abbr='UGD_hard',
         path='./data/UGD_hard_oc.jsonl',
         evaluator=dict(
             type='llm_evaluator',
             judge_cfg=dict(),
+            prompt=EVAL_PROMPT,
         ),
         n=1,
     ),
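
For context on the template above: `EVAL_PROMPT` is a plain `str.format` string, which is why the literal `\boxed{}` braces are doubled as `\boxed{{}}` while `{question}`, `{answer}`, and `{prediction}` stay single for substitution. A minimal sketch of that substitution step, assuming nothing beyond the `EVAL_PROMPT` defined in this diff:

```python
# Minimal sketch: filling EVAL_PROMPT's str.format placeholders.
# The doubled braces in \boxed{{}} come out as literal \boxed{} here.
filled = EVAL_PROMPT.format(
    question="What is 2 + 2?",
    answer="4",
    prediction="The result is \\boxed{4}.",
)
print(filled)
```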
|
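Because the prompt instructs the judge to return the bare letter "A" or "B", scoring reduces to a string check on the reply. A hedged sketch of that last step; `query_judge` is a hypothetical stand-in for whatever judge-model client `judge_cfg` ends up configuring, not an API from this repo:

```python
def query_judge(prompt: str) -> str:
    """Hypothetical stand-in for the judge-model client built from judge_cfg."""
    raise NotImplementedError


def grade(question: str, answer: str, prediction: str) -> bool:
    """Return True iff the judge labels the prediction CORRECT ("A")."""
    reply = query_judge(
        EVAL_PROMPT.format(question=question, answer=answer, prediction=prediction)
    )
    # The prompt asks for a bare "A" (CORRECT) or "B" (INCORRECT); tolerate
    # surrounding whitespace or a verbose reply that starts with the letter.
    return reply.strip().upper().startswith("A")
```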