Prerequisite
- I have searched Issues and Discussions but cannot get the expected help.
- The bug has not been fixed in the latest version.
Type
I'm evaluating with the officially supported tasks/models/datasets.
Environment
env: H20
inference: sglang v0.5.6
model: DeepSeek V3.2
opencompass: 0.5.1.post1
Reproduces the problem - code/configuration sample
None
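The config itself was not attached. As a rough sketch, assuming the `OpenAISDK` backend that the result column names point to, `my_api.py` might look like the following; the endpoint URL, model path, and generation parameters below are placeholders, not the reporter's actual values:

```python
# Hypothetical my_api.py: OpenCompass model config for an
# OpenAI-compatible sglang endpoint. All values are placeholders.
from opencompass.models import OpenAISDK

models = [
    dict(
        type=OpenAISDK,
        abbr='Deepseek_V32',
        path='deepseek-v3.2',            # model name served by sglang
        key='EMPTY',                     # local endpoint, no real key
        openai_api_base='http://127.0.0.1:30000/v1',
        max_out_len=8192,
        batch_size=8,
        temperature=0.0,
    ),
]
```

`my_think_api.py` presumably differs only in enabling the model's thinking mode and a larger output budget (the "32k" in the second attached archive's name).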
Reproduces the problem - command or script
python3 run.py --models my_api.py --datasets gpqa_gen.py --debug
| dataset | version | metric | mode | opencompass.models.OpenAISDK_opencompass-0.5.1.post1_Deepseek_V32 |
|---|---|---|---|---|
| GPQA_diamond | 5aeece | accuracy | gen | 79.29 |
python3 run.py --models my_think_api.py --datasets gpqa_gen.py --debug
| dataset | version | metric | mode | opencompass.models.OpenAISDK_opencompass-0.5.1.post1_Deepseek_V32 |
|---|---|---|---|---|
| GPQA_diamond | 5aeece | accuracy | gen | 76.26 |
For the thinking run, after manually reviewing the predictions, the score should be 85.85.
I checked the evaluation logic in gpqa.py and found that the answer-extraction regular expression misbehaves in some scenarios: it should take the last stated answer, but instead it returns the first match. For example, for the output "I'll answer: ANSWER: C", the extractor returns A (the leading letter of the word "ANSWER") instead of C; a minimal sketch of the problem and a possible fix follows.
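To make the failure mode concrete, here is a self-contained illustration; the patterns are simplified stand-ins, not the exact ones in gpqa.py:

```python
import re

text = "I'll answer: ANSWER: C"

# First-match extraction (simplified): re.search returns the leftmost
# match, so the capture grabs the 'A' that starts the word "ANSWER".
first = re.search(r'answer[^A-D]*([A-D])', text, re.IGNORECASE)
print(first.group(1))  # -> A (wrong)

# Last-match extraction: require word boundaries around the option
# letter and keep the final match, so the closing "ANSWER: C" wins
# over the earlier mention of the word "answer".
matches = re.findall(r'answer\b[^A-D]*?\b([A-D])\b', text, re.IGNORECASE)
print(matches[-1] if matches else None)  # -> C
```

Anchoring the option letter with `\b` stops the extractor from capturing letters inside words, and keeping the final match honors the model's closing answer.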
Log archives:
- 20251210_161518_h20_gpqa.tar.gz
- 20251210_211428_h20_gpqa_thinking_32k.tar.gz
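To reproduce the manual review against the attached predictions, a rough re-scoring sketch, assuming OpenCompass's usual `predictions/<model_abbr>/<dataset>.json` dump layout (a dict keyed by record index, each record carrying `prediction` and `gold` fields; adjust if the actual dump differs):

```python
import json
import re
import sys

def last_option(text: str):
    """Last-match extractor from the sketch above."""
    matches = re.findall(r'answer\b[^A-D]*?\b([A-D])\b', text, re.IGNORECASE)
    return matches[-1] if matches else None

# Usage: python rescore.py predictions/<model_abbr>/GPQA_diamond.json
with open(sys.argv[1]) as f:
    records = json.load(f)

correct = total = 0
for rec in records.values():
    pred = last_option(rec.get('prediction', ''))
    total += 1
    correct += int(pred is not None and pred == rec.get('gold'))
print(f'accuracy: {100 * correct / total:.2f}')
```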
Reproduces the problem - error message
None
Other information
No response
