I ran Llama-3-8B-Instruct on MathBench, and the summary is:
| dataset | version | metric | mode | llama-3-8b-instruct-hf |
|---|---|---|---|---|
| **MathBench Application Accuracy** | - | - | - | - |
| mathbench-college-single_choice_cn | 783703 | acc_1 | gen | 38.00 |
| mathbench-college-single_choice_en | b0fb1b | acc_1 | gen | 40.00 |
| mathbench-high-single_choice_cn | 783703 | acc_1 | gen | 39.33 |
| mathbench-high-single_choice_en | b0fb1b | acc_1 | gen | 38.67 |
| mathbench-middle-single_choice_cn | 783703 | acc_1 | gen | 49.33 |
| mathbench-middle-single_choice_en | b0fb1b | acc_1 | gen | 28.67 |
| mathbench-primary-cloze_cn | ea47a6 | accuracy | gen | 63.33 |
| mathbench-primary-cloze_en | bcc9c6 | accuracy | gen | 71.33 |
| mathbench-arithmetic-cloze_en | bcc9c6 | accuracy | gen | 52.67 |
| **MathBench Application CircularEval** | - | - | - | - |
| mathbench-college-single_choice_cn | 783703 | perf_4 | gen | 9.33 |
| mathbench-college-single_choice_en | b0fb1b | perf_4 | gen | 15.33 |
| mathbench-high-single_choice_cn | 783703 | perf_4 | gen | 8.67 |
| mathbench-high-single_choice_en | b0fb1b | perf_4 | gen | 12.67 |
| mathbench-middle-single_choice_cn | 783703 | perf_4 | gen | 23.33 |
| mathbench-middle-single_choice_en | b0fb1b | perf_4 | gen | 9.33 |
| **MathBench Knowledge CircularEval** | - | - | - | - |
| mathbench-college_knowledge-single_choice_cn | 783703 | perf_4 | gen | 52.85 |
| mathbench-college_knowledge-single_choice_en | b0fb1b | perf_4 | gen | 66.77 |
| mathbench-high_knowledge-single_choice_cn | 783703 | perf_4 | gen | 35.11 |
| mathbench-high_knowledge-single_choice_en | b0fb1b | perf_4 | gen | 58.72 |
| mathbench-middle_knowledge-single_choice_cn | 783703 | perf_4 | gen | 41.62 |
| mathbench-middle_knowledge-single_choice_en | b0fb1b | perf_4 | gen | 64.57 |
| mathbench-primary_knowledge-single_choice_cn | 783703 | perf_4 | gen | 37.98 |
| mathbench-primary_knowledge-single_choice_en | b0fb1b | perf_4 | gen | 67.89 |
| **MathBench Knowledge Accuracy** | - | - | - | - |
| mathbench-college_knowledge-single_choice_cn | 783703 | acc_1 | gen | 71.52 |
| mathbench-college_knowledge-single_choice_en | b0fb1b | acc_1 | gen | 77.22 |
| mathbench-high_knowledge-single_choice_cn | 783703 | acc_1 | gen | 60.00 |
| mathbench-high_knowledge-single_choice_en | b0fb1b | acc_1 | gen | 75.09 |
| mathbench-middle_knowledge-single_choice_cn | 783703 | acc_1 | gen | 64.97 |
| mathbench-middle_knowledge-single_choice_en | b0fb1b | acc_1 | gen | 81.71 |
| mathbench-primary_knowledge-single_choice_cn | 783703 | acc_1 | gen | 64.42 |
| mathbench-primary_knowledge-single_choice_en | b0fb1b | acc_1 | gen | 82.57 |
I would like to ask how the reported model performance is calculated. For example, the Primary Application score you report for Llama-3-8B-Instruct is 71.0, but the summary above shows 71.33 for primary-cloze_en and 63.33 for primary-cloze_cn, and the average of these two is smaller. Likewise, the Primary Theory score you report is 60.2, but the summary shows 67.89 for single_choice_en and 37.98 for single_choice_cn, whose average is much smaller. The same discrepancy appears for the other Llama-3-8B-Instruct results.
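For reference, this is the simple averaging I am doing (a minimal Python sketch; the per-split numbers come from the summary table above, and the 71.0 / 60.2 figures are the published results I am comparing against, not values produced by any MathBench code):

```python
# Per-split scores copied from the summary above.
primary_application = {
    "mathbench-primary-cloze_en": 71.33,
    "mathbench-primary-cloze_cn": 63.33,
}
primary_theory = {
    "mathbench-primary_knowledge-single_choice_en": 67.89,
    "mathbench-primary_knowledge-single_choice_cn": 37.98,
}

def mean(scores: dict) -> float:
    """Plain arithmetic mean over the splits."""
    return sum(scores.values()) / len(scores)

print(f"Primary Application mean: {mean(primary_application):.2f} (reported: 71.0)")
print(f"Primary Theory mean:      {mean(primary_theory):.2f} (reported: 60.2)")
# Prints 67.33 and 52.94, both lower than the reported numbers.
```

So a plain mean of the cn/en splits does not reproduce the reported scores; I would like to know what aggregation is actually used.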