
[Bug] GPQA score low with thinking enabled #2356

@qianyp18

Description

Prerequisite

Type

I'm evaluating with the officially supported tasks/models/datasets.

Environment

env: H20
inference: sglang v0.5.6
model: DeepSeek-V3.2
opencompass: 0.5.1.post1

Reproduces the problem - code/configuration sample

None

Reproduces the problem - command or script

python3 run.py --models my_api.py --datasets gpqa_gen.py --debug

dataset       version  metric    mode  opencompass.models.OpenAISDK_opencompass-0.5.1.post1_Deepseek_V32
GPQA_diamond  5aeece   accuracy  gen   79.29

python3 run.py --models my_think_api.py --datasets gpqa_gen.py --debug

dataset       version  metric    mode  opencompass.models.OpenAISDK_opencompass-0.5.1.post1_Deepseek_V32
GPQA_diamond  5aeece   accuracy  gen   76.26

For the thinking run, manual review shows the score should be 85.85.
I checked the evaluation logic in gpqa.py and found that the regular expression behaves unreasonably in some scenarios: it should take the last stated answer in the response, but instead it returns the first match.

For example, "I'll answer: ANSWER: C" will return A instead of C.
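
To make the failure mode concrete, here is a minimal sketch. It assumes gpqa.py extracts the option with a case-insensitive `ANSWER\s*:\s*([A-D])`-style regex via `re.search`; the pattern and the last-match fix below are illustrative, not the project's actual code.

```python
import re

# Illustrative reproduction of the extraction bug. The pattern below is an
# assumption modelled on a case-insensitive "ANSWER: X" regex for GPQA
# scoring; the exact expression in gpqa.py may differ.
ANSWER_PATTERN = r'(?i)ANSWER\s*:\s*([A-D])'

text = "I'll answer: ANSWER: C"

# Current behaviour: re.search returns the first match. Because the pattern
# is case-insensitive, "answer: A" already matches, with the captured 'A'
# being the first letter of the word "ANSWER", so the graded option is A.
first = re.search(ANSWER_PATTERN, text)
print(first.group(1).upper())  # prints 'A' (wrong)

# One possible fix (a sketch, not an official patch): let a greedy prefix
# skip everything before the last "ANSWER: X" occurrence so the final stated
# answer wins; (?s) lets '.' cross newlines in long thinking traces.
LAST_ANSWER_PATTERN = r'(?si).*ANSWER\s*:\s*([A-D])'
last = re.search(LAST_ANSWER_PATTERN, text)
print(last.group(1).upper())  # prints 'C' (expected)
```

Taking the last occurrence matters most for thinking outputs, where the model may restate or revise the answer several times before giving the final one.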

20251210_161518_h20_gpqa.tar.gz
20251210_211428_h20_gpqa_thinking_32k.tar.gz

Reproduces the problem - error message

None

Other information

No response
