Open
As expected, several new SoTA LLMs have been released since our preprint. For the resubmission, I would like to benchmark:
- Gemini 2.5
- OpenAI gpt-4.1
- Claude Sonnet 4
I also want to make the benchmarking a bit more lenient, i.e., run the prediction prompt multiple times for each model and average the results.
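A minimal sketch of the repeat-and-average idea, to make the proposal concrete. Everything here is an assumption rather than existing project code: `predict` is a hypothetical callable that sends the prediction prompt to one model and returns a numeric score, and `n_runs=5` is an arbitrary default.

```python
from statistics import mean, stdev
from typing import Callable


def averaged_score(
    predict: Callable[[str], float],  # hypothetical: prompt -> numeric score
    prompt: str,
    n_runs: int = 5,  # arbitrary default, not a project setting
) -> tuple[float, float]:
    """Run the same prompt n_runs times and return (mean, sample std dev)."""
    if n_runs < 1:
        raise ValueError("n_runs must be at least 1")
    # Each call is an independent sample of the model's output.
    scores = [predict(prompt) for _ in range(n_runs)]
    # Sample standard deviation is undefined for a single run; report 0.0.
    spread = stdev(scores) if n_runs > 1 else 0.0
    return mean(scores), spread
```

Reporting the spread next to the mean would show how much each model's predictions vary between runs, which seems useful to have on hand for the resubmission.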