Inconsistency in benchmark?

This might be a misunderstanding on my part, please correct me if I'm wrong.

In the standard GSM8K dataset (http://platinum-bench.csail.mit.edu/inspect?model=gemini-2.5-pro-exp-03-25&dataset=gsm8k), I'm seeing that many of the LLMs get this problem wrong:

<img width="809" alt="Image" src="https://github.com/user-attachments/assets/a0d391f9-2f95-4eda-b68a-401e80076d0a" />

It looks to me like the problem's solution is incorrect. 80 flagstones would be 80 * 75 = 6,000 pounds, which would require 3 trucks (each carries 2,000 pounds).
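
To make that arithmetic easy to double-check, here is a minimal sketch in Python. It assumes only the figures quoted above (80 flagstones, 75 pounds each, 2,000 pounds per truck); the function name `trucks_needed` is made up for illustration, and the full problem statement from the screenshot is not reproduced here.

```python
def trucks_needed(num_flagstones: int, pounds_each: int, truck_capacity: int) -> int:
    """Return the number of trucks needed to carry the whole load."""
    total_pounds = num_flagstones * pounds_each
    # Ceiling division with integers: a partially filled truck still counts.
    return -(-total_pounds // truck_capacity)


total = 80 * 75
trucks = trucks_needed(80, 75, 2000)
print(total, trucks)
```

With these inputs the total is 6,000 pounds and the load fills exactly 3 trucks, so no rounding is involved.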

However, in `GSM8K-Platinum`, I also see this problem come up with the same listed solution:

<img width="784" alt="Image" src="https://github.com/user-attachments/assets/fd3ea5c4-b5f8-48c4-b178-9a341e56dac1" />

That said, some of the models (like Gemini 2.5 Pro) are not listed, presumably because they had 0 errors?

Thanks for your time.
