Rerunning LLM benchmarks #37

@ml-evs

As expected, several new SoTA LLMs have been released since our preprint. For resubmission, I would like to benchmark:

  • Gemini 2.5
  • OpenAI gpt-4.1
  • Claude Sonnet 4

I also want to make the benchmarking a bit more lenient, i.e., running the prediction prompt multiple times per model and averaging the results.
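
A minimal sketch of what the "multiple runs, then average" leniency could look like; `query_model` and `score_prediction` are hypothetical placeholders for whatever model call and scoring function the benchmark actually uses, not functions from our codebase:

```python
from statistics import mean, stdev

def benchmark_prompt(query_model, score_prediction, prompt: str, n_runs: int = 5):
    """Run the same prediction prompt n_runs times and average the scores.

    query_model: callable taking a prompt string and returning a prediction.
    score_prediction: callable taking a prediction and returning a float score.
    """
    scores = []
    for _ in range(n_runs):
        prediction = query_model(prompt)        # one independent LLM call per run
        scores.append(score_prediction(prediction))
    return {
        "mean": mean(scores),
        "stdev": stdev(scores) if len(scores) > 1 else 0.0,
        "n_runs": n_runs,
    }
```

Reporting the standard deviation alongside the mean would also let us see how much run-to-run variance each model has, rather than judging it on a single sample.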
