New eval_group pipeline to run predefined sets of benchmarks like artificial intelligence index #550
Kipok announced in Announcements
Example command to reproduce AAI scores for qwen3-non-reasoning; output will be in
Example command to reproduce AAI scores for qwen3-with-reasoning; output will be in
There is a bit of a mismatch in LCB scores, which we will debug, but overall things match quite well.
Just merged #549, which adds a new eval_group pipeline that can be used to run predefined groups of benchmarks (you can also provide your own config there). The main goal right now is to make it easy to reproduce AAI scores, but we can add more groups in the future. This should also be quite useful for experiments where you measure many benchmarks, want some kind of aggregate score, and need an easy way to submit all of them.
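To illustrate the aggregate-score idea, here is a minimal sketch of averaging per-benchmark results from a group into one number. The benchmark names and scores below are made up for illustration; this is not the pipeline's actual config or code.

```python
# Illustrative sketch: an eval group runs a predefined set of benchmarks,
# then combines their scores into a single aggregate number.
from statistics import mean

# Hypothetical group results: benchmark name -> accuracy
# (as if read from the per-benchmark eval outputs).
aai_group_scores = {
    "aime25": 0.80,
    "gpqa": 0.65,
    "livecodebench": 0.55,
}

def aggregate_score(scores: dict[str, float]) -> float:
    """Simple unweighted average over all benchmarks in the group."""
    return mean(scores.values())

print(round(aggregate_score(aai_group_scores), 4))  # -> 0.6667
```

A real group config could of course weight benchmarks differently; the point is just that one submission produces one comparable number.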
In addition to that, there are a few important changes added in that PR:
- You can now add `cpu_partition: cpu` to your cluster config (assuming `cpu` is the name of the CPU partition, if you have one).
- New `split=test_v5_2408_2502`.
- New `++remove_thinking=True` option. This is already done in evaluation by default, but here you can do it in the generation directly. This is mostly useful for running LLM-as-a-judge on the output of `ns generate`, e.g. for the hle benchmark, where we don't want to show the thinking part.
- New `++prompt_suffix` parameter to `ns generate`, which is a quick way to turn thinking on / off for qwen3 models. We will add more general `chat_template_kwargs` support in the future.
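To make the two thinking-related options concrete, here is a minimal sketch assuming Qwen3-style `<think>...</think>` reasoning tags. The function names and the `/no_think` suffix usage are illustrative sketches of the behavior, not the pipeline's actual implementation.

```python
import re

def remove_thinking(generation: str) -> str:
    """Sketch of ++remove_thinking=True: drop everything up to and
    including the closing </think> tag, so a downstream LLM-as-a-judge
    never sees the reasoning trace."""
    return re.sub(r"^.*?</think>\s*", "", generation, count=1, flags=re.DOTALL)

def apply_prompt_suffix(prompt: str, suffix: str) -> str:
    """Sketch of ++prompt_suffix: append a suffix (e.g. a soft switch
    like '/no_think' for qwen3 models) to the user prompt."""
    return prompt + suffix

sample = "<think>Let me work through this...</think>\nThe answer is 42."
print(remove_thinking(sample))  # -> The answer is 42.
print(apply_prompt_suffix("What is 2+2?", " /no_think"))
```

Note that `remove_thinking` leaves text without a `</think>` tag unchanged, which is the safe behavior for non-reasoning outputs.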