New eval_group pipeline to run predefined sets of benchmarks like artificial intelligence index #550
Kipok announced in Announcements
Example command to reproduce AAI scores for qwen3-non-reasoning; output will be in
Example command to reproduce AAI scores for qwen3-with-reasoning; output will be in
There is a bit of a mismatch in LCB scores, which we will debug, but overall things match quite well.
Just merged #549, which adds a new eval_group pipeline that can be used to run predefined groups of benchmarks (you can also provide your own config there). The main goal right now is to make it easy to reproduce AAI scores, but we can add more groups in the future. This should also be quite useful for experiments where you measure many benchmarks, want some kind of aggregate score, and need an easy way to submit all of them.
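To illustrate the aggregate-score idea, here is a minimal sketch of averaging per-benchmark results from a group into one number. The benchmark names and scores below are made up for illustration; this is not the pipeline's actual config or code.

```python
# Illustrative sketch: an eval group runs a predefined set of benchmarks,
# then combines their scores into a single aggregate number.
from statistics import mean

# Hypothetical group results: benchmark name -> accuracy
# (as if read from the per-benchmark eval outputs).
aai_group_scores = {
    "aime25": 0.80,
    "gpqa": 0.65,
    "livecodebench": 0.55,
}

def aggregate_score(scores: dict[str, float]) -> float:
    """Simple unweighted average over all benchmarks in the group."""
    return mean(scores.values())

print(round(aggregate_score(aai_group_scores), 4))  # -> 0.6667
```

A real group config could of course weight benchmarks differently; the point is just that one submission produces one comparable number.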
In addition to that, there are a few important changes added in that PR:
- You can now add `cpu_partition: cpu` to your cluster config (assuming `cpu` is the name of the CPU partition, if you have one).
- New `split=test_v5_2408_2502`.
- New `++remove_thinking=True` option. This is already done in evaluation by default, but here you can do it in the generation directly. This is mostly useful for running LLM-as-a-judge on the output of `ns generate`, e.g. for the hle benchmark, where we don't want to show the thinking part.
- New `++prompt_suffix` parameter to `ns generate`, which is a quick way to turn thinking on / off for qwen3 models. We will add more general `chat_template_kwargs` support in the future.
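To make the two thinking-related options concrete, here is a minimal sketch assuming Qwen3-style `<think>...</think>` reasoning tags. The function names and the `/no_think` suffix usage are illustrative sketches of the behavior, not the pipeline's actual implementation.

```python
import re

def remove_thinking(generation: str) -> str:
    """Sketch of ++remove_thinking=True: drop everything up to and
    including the closing </think> tag, so a downstream LLM-as-a-judge
    never sees the reasoning trace."""
    return re.sub(r"^.*?</think>\s*", "", generation, count=1, flags=re.DOTALL)

def apply_prompt_suffix(prompt: str, suffix: str) -> str:
    """Sketch of ++prompt_suffix: append a suffix (e.g. a soft switch
    like '/no_think' for qwen3 models) to the user prompt."""
    return prompt + suffix

sample = "<think>Let me work through this...</think>\nThe answer is 42."
print(remove_thinking(sample))  # -> The answer is 42.
print(apply_prompt_suffix("What is 2+2?", " /no_think"))
```

Note that `remove_thinking` leaves text without a `</think>` tag unchanged, which is the safe behavior for non-reasoning outputs.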