
Questions about the model performance #21

@Datoow


I evaluated Llama-3-8B-Instruct on MathBench; the summary is:

dataset                                        version  metric    mode  llama-3-8b-instruct-hf

######## MathBench Application Accuracy ########
mathbench-college-single_choice_cn             783703   acc_1     gen   38.00
mathbench-college-single_choice_en             b0fb1b   acc_1     gen   40.00
mathbench-high-single_choice_cn                783703   acc_1     gen   39.33
mathbench-high-single_choice_en                b0fb1b   acc_1     gen   38.67
mathbench-middle-single_choice_cn              783703   acc_1     gen   49.33
mathbench-middle-single_choice_en              b0fb1b   acc_1     gen   28.67
mathbench-primary-cloze_cn                     ea47a6   accuracy  gen   63.33
mathbench-primary-cloze_en                     bcc9c6   accuracy  gen   71.33
mathbench-arithmetic-cloze_en                  bcc9c6   accuracy  gen   52.67

######## MathBench Application CircularEval ########
mathbench-college-single_choice_cn             783703   perf_4    gen   9.33
mathbench-college-single_choice_en             b0fb1b   perf_4    gen   15.33
mathbench-high-single_choice_cn                783703   perf_4    gen   8.67
mathbench-high-single_choice_en                b0fb1b   perf_4    gen   12.67
mathbench-middle-single_choice_cn              783703   perf_4    gen   23.33
mathbench-middle-single_choice_en              b0fb1b   perf_4    gen   9.33

######## MathBench Knowledge CircularEval ########
mathbench-college_knowledge-single_choice_cn   783703   perf_4    gen   52.85
mathbench-college_knowledge-single_choice_en   b0fb1b   perf_4    gen   66.77
mathbench-high_knowledge-single_choice_cn      783703   perf_4    gen   35.11
mathbench-high_knowledge-single_choice_en      b0fb1b   perf_4    gen   58.72
mathbench-middle_knowledge-single_choice_cn    783703   perf_4    gen   41.62
mathbench-middle_knowledge-single_choice_en    b0fb1b   perf_4    gen   64.57
mathbench-primary_knowledge-single_choice_cn   783703   perf_4    gen   37.98
mathbench-primary_knowledge-single_choice_en   b0fb1b   perf_4    gen   67.89

######## MathBench Knowledge Accuracy ########
mathbench-college_knowledge-single_choice_cn   783703   acc_1     gen   71.52
mathbench-college_knowledge-single_choice_en   b0fb1b   acc_1     gen   77.22
mathbench-high_knowledge-single_choice_cn      783703   acc_1     gen   60.00
mathbench-high_knowledge-single_choice_en      b0fb1b   acc_1     gen   75.09
mathbench-middle_knowledge-single_choice_cn    783703   acc_1     gen   64.97
mathbench-middle_knowledge-single_choice_en    b0fb1b   acc_1     gen   81.71
mathbench-primary_knowledge-single_choice_cn   783703   acc_1     gen   64.42
mathbench-primary_knowledge-single_choice_en   b0fb1b   acc_1     gen   82.57

I would like to ask how the model performance is calculated. For example, the Primary Application score you report for Llama-3-8B-Instruct is 71.0, but the summary here shows 71.33 for primary-cloze_en and 63.33 for primary-cloze_cn, and the average of those two is smaller. Likewise, the Primary Theory score you report is 60.2, but the summary shows 67.89 for primary_knowledge-single_choice_en and 37.98 for primary_knowledge-single_choice_cn, and the average of those two is much smaller.

The same mismatch appears for the other Llama-3-8B-Instruct scores.
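To make the mismatch concrete, here is a minimal Python sketch of the check behind the numbers above. It assumes the reported score is a plain mean of the CN and EN splits, which is exactly the assumption I am asking about:

```python
# Quick check: plain CN/EN means of the summary scores above versus the
# reported leaderboard numbers. Assumption (the point of this question):
# the reported score is a simple average of the _cn and _en splits.

splits = {
    "Primary Application": {"cn": 63.33, "en": 71.33, "reported": 71.0},
    "Primary Theory":      {"cn": 37.98, "en": 67.89, "reported": 60.2},
}

for name, s in splits.items():
    mean_cn_en = (s["cn"] + s["en"]) / 2
    print(f'{name}: mean(cn, en) = {mean_cn_en:.2f}, reported = {s["reported"]}')

# Both means (roughly 67.3 and 52.9) fall below the reported 71.0 and 60.2,
# so the reported numbers cannot be a plain CN/EN average of these splits.
```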
