Rerunning LLM benchmarks #37

@ml-evs

As expected, several new SoTA LLMs have been released since our preprint. For resubmission, I would like to benchmark:

  • Gemini 2.5
  • OpenAI gpt-4.1
  • Claude Sonnet 4

I also want to make the benchmarking a bit more lenient, i.e., running the prediction prompt multiple times per model and averaging the results.
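
A minimal sketch of what the "multiple runs, then average" leniency could look like; `query_model` and `score_prediction` are hypothetical placeholders for whatever model call and scoring function the benchmark actually uses, not functions from our codebase:

```python
from statistics import mean, stdev

def benchmark_prompt(query_model, score_prediction, prompt: str, n_runs: int = 5):
    """Run the same prediction prompt n_runs times and average the scores.

    query_model: callable taking a prompt string and returning a prediction.
    score_prediction: callable taking a prediction and returning a float score.
    """
    scores = []
    for _ in range(n_runs):
        prediction = query_model(prompt)        # one independent LLM call per run
        scores.append(score_prediction(prediction))
    return {
        "mean": mean(scores),
        "stdev": stdev(scores) if len(scores) > 1 else 0.0,
        "n_runs": n_runs,
    }
```

Reporting the standard deviation alongside the mean would also let us see how much run-to-run variance each model has, rather than judging it on a single sample.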
