Commit 2f13380

[Fix] add eval_prompt for ugd_hard (#2376)
1 parent eaf6ef2 commit 2f13380

File tree

1 file changed: +28 −0 lines


opencompass/configs/chatml_datasets/UGD_hard/UGD_hard_gen.py

Lines changed: 28 additions & 0 deletions
@@ -1,11 +1,39 @@
 
+EVAL_PROMPT = (
+    "You are a helpful assistant who evaluates the correctness and quality of models' outputs.\nPlease as a grading "
+    'expert, judge whether the final answers given by the candidates below are consistent with the standard answers, '
+    'that is, whether the candidates answered correctly. \n \n Here are some evaluation criteria:\n '
+    "1. Please refer to the given standard answer. You don't need to re-generate the answer to the question because "
+    "the standard answer has been given. You only need to judge whether the candidate's answer is consistent with the "
+    "standard answer according to the form of the question. Don't try to answer the original question. You can assume "
+    "that the standard answer is definitely correct.\n 2. Because the candidate's answer may be different from the "
+    'standard answer in the form of expression, before making a judgment, please understand the question and the '
+    "standard answer first, and then judge whether the candidate's answer is correct, but be careful not to try to "
+    'answer the original question.\n 3. Some answers may contain multiple items, such as multiple-choice questions, '
+    'multiple-select questions, fill-in-the-blank questions, etc. As long as the answer is the same as the standard '
+    'answer, it is enough. For multiple-select questions and multiple-blank fill-in-the-blank questions, the candidate '
+    'needs to answer all the corresponding options or blanks correctly to be considered correct.\n 4. Some answers '
+    'may be expressed in different ways, such as some answers may be a mathematical expression, some answers may be a '
+    'textual description, as long as the meaning expressed is the same. And some formulas are expressed in different '
+    'ways, but they are equivalent and correct.\n 5. If the prediction is given with \\boxed{{}}, please ignore '
+    "the \\boxed{{}} and only judge whether the candidate's answer is consistent with the standard answer.\n\n "
+    'Please judge whether the following answers are consistent with the standard answer based on the above criteria. '
+    'Grade the predicted answer of this new question as one of:\n A: CORRECT \n B: INCORRECT\n Just return '
+    "the letters \"A\" or \"B\", with no text around it.\n\n Here is your task. Simply reply with either CORRECT, "
+    "INCORRECT. Don't apologize or correct yourself if there was a mistake; we are just trying to grade the answer."
+    '\n\n\n <Original Question Begin>: \n\n{question}\n\n<Original Question End>\n\n\n '
+    '<Gold Target Begin>: \n{answer}\n<Gold Target End>\n\n\n <Predicted Answer Begin>: \n{prediction}\n'
+    "<Predicted End>\n\n\n \n Judging the correctness of candidates' answers:\"\n"
+)
+
 datasets = [
     dict(
         abbr='UGD_hard',
         path='./data/UGD_hard_oc.jsonl',
         evaluator=dict(
             type='llm_evaluator',
             judge_cfg=dict(),
+            prompt=EVAL_PROMPT,
         ),
         n=1,
     ),
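The EVAL_PROMPT added above leaves `{question}`, `{answer}`, and `{prediction}` as placeholders (and doubles literal braces, as in `\boxed{{}}`, so `str.format` does not treat them as fields), and asks the judge model to reply with a bare "A" or "B". A minimal sketch of how such a template might be filled and its verdict parsed, using a shortened hypothetical template rather than opencompass's actual evaluator code:

```python
# Minimal sketch (not opencompass internals): filling a judge prompt
# template via str.format and mapping the judge's "A"/"B" reply to a
# correctness flag. The template and sample values below are hypothetical.
TEMPLATE = (
    '<Original Question Begin>: \n\n{question}\n\n<Original Question End>\n\n'
    '<Gold Target Begin>: \n{answer}\n<Gold Target End>\n\n'
    '<Predicted Answer Begin>: \n{prediction}\n<Predicted End>\n'
    'Just return the letters "A" or "B", with no text around it.\n'
)

filled = TEMPLATE.format(
    question='What is 2 + 2?',
    answer='4',
    prediction='The answer is 4.',
)


def parse_verdict(reply: str) -> bool:
    """Return True if the judge graded the answer CORRECT ('A')."""
    return reply.strip().upper().startswith('A')


print(filled)
print(parse_verdict('A'))  # judge said CORRECT
```

Because literal braces in the real prompt are written as `{{}}`, the template stays safe to pass through `str.format` with only the three named placeholders substituted.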
