Here is the score we were able to reproduce without tool use, and we encourage you to try reproducing it as well!
 We’ve observed that the numbers may vary slightly across runs, so feel free to run the evaluation multiple times to get a sense of the variance.
-For a quick correctness check, we recommend starting with the low reasoning effort setting (120b-low), which should complete within minutes.
+For a quick correctness check, we recommend starting with the low reasoning effort setting (`--reasoning-effort low`), which should complete within minutes.