
[Bug] GPQA score low with thinking enabled #2356

@qianyp18

Description

Prerequisite

Type

I'm evaluating with the officially supported tasks/models/datasets.

Environment

env: H20
inference: sglang v0.5.6
model: DeepSeek-V3.2
opencompass: 0.5.1.post1

Reproduces the problem - code/configuration sample

None

Reproduces the problem - command or script

python3 run.py --models my_api.py --datasets gpqa_gen.py --debug

dataset       version  metric    mode  opencompass.models.OpenAISDK_opencompass-0.5.1.post1_Deepseek_V32
GPQA_diamond  5aeece   accuracy  gen   79.29

python3 run.py --models my_think_api.py --datasets gpqa_gen.py --debug

dataset       version  metric    mode  opencompass.models.OpenAISDK_opencompass-0.5.1.post1_Deepseek_V32
GPQA_diamond  5aeece   accuracy  gen   76.26

For the thinking run, manual review shows the score should be 85.85.
I checked the evaluation logic in gpqa.py and found that the regular expression behaves unreasonably in some scenarios: it should take the last stated answer in the response, but instead it returns the first match.

For example, "I'll answer: ANSWER: C" will return A instead of C.
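
To make the failure mode concrete, here is a minimal sketch. It assumes gpqa.py extracts the option with a case-insensitive `ANSWER\s*:\s*([A-D])`-style regex via `re.search`; the pattern and the last-match fix below are illustrative, not the project's actual code.

```python
import re

# Illustrative reproduction of the extraction bug. The pattern below is an
# assumption modelled on a case-insensitive "ANSWER: X" regex for GPQA
# scoring; the exact expression in gpqa.py may differ.
ANSWER_PATTERN = r'(?i)ANSWER\s*:\s*([A-D])'

text = "I'll answer: ANSWER: C"

# Current behaviour: re.search returns the first match. Because the pattern
# is case-insensitive, "answer: A" already matches, with the captured 'A'
# being the first letter of the word "ANSWER", so the graded option is A.
first = re.search(ANSWER_PATTERN, text)
print(first.group(1).upper())  # prints 'A' (wrong)

# One possible fix (a sketch, not an official patch): let a greedy prefix
# skip everything before the last "ANSWER: X" occurrence so the final stated
# answer wins; (?s) lets '.' cross newlines in long thinking traces.
LAST_ANSWER_PATTERN = r'(?si).*ANSWER\s*:\s*([A-D])'
last = re.search(LAST_ANSWER_PATTERN, text)
print(last.group(1).upper())  # prints 'C' (expected)
```

Taking the last occurrence matters most for thinking outputs, where the model may restate or revise the answer several times before giving the final one.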

20251210_161518_h20_gpqa.tar.gz
20251210_211428_h20_gpqa_thinking_32k.tar.gz

Reproduces the problem - error message

None

Other information

No response
