
Commit 92fa45d

[Benchmark]Add aime26 and SLA auto tune (#1230)
* update * update * update
1 parent ce1dd98 commit 92fa45d


20 files changed: +722 −298 lines

Makefile

Lines changed: 9 additions & 5 deletions
@@ -23,18 +23,21 @@ default: install
 # PARAMETERS:
 #   BENCHMARK    Specific benchmark name (e.g. gsm8k, mmlu).
 #                Omit to process ALL registered benchmarks.
-#   FORCE=1      Force re-translate even when a translation already exists.
-#                Only applies to docs-translate.
+#   FORCE=1      Force recompute/re-translate even when data already exists.
+#                Applies to docs-update, docs-update-stats, and docs-translate.
 #   WORKERS      Parallel worker count for update / translate (default: 4).
 #
 # COMMON USAGE:
 #   make docs                                    # Full pipeline: translate → generate → build HTML
 #   make docs-update                             # Update metadata for ALL benchmarks
 #   make docs-update BENCHMARK=gsm8k             # Update metadata for ONE benchmark
+#   make docs-update BENCHMARK="gsm8k mmlu"      # Update metadata for MULTIPLE benchmarks
 #   make docs-update-stats                       # Update metadata + stats for ALL benchmarks
 #   make docs-update-stats BENCHMARK=gsm8k       # Update metadata + stats for ONE benchmark
+#   make docs-update-stats BENCHMARK="gsm8k mmlu" # Update metadata + stats for MULTIPLE benchmarks
 #   make docs-translate                          # Translate only untranslated benchmarks (ALL)
 #   make docs-translate BENCHMARK=gsm8k          # Translate ONE benchmark (skip if done)
+#   make docs-translate BENCHMARK="gsm8k mmlu"   # Translate MULTIPLE benchmarks
 #   make docs-translate FORCE=1                  # Force re-translate ALL benchmarks
 #   make docs-translate BENCHMARK=gsm8k FORCE=1  # Force re-translate ONE benchmark
 #   make docs-generate                           # Regenerate .md files from persisted JSON data

@@ -44,12 +47,13 @@ default: install
 # ============================================================================

 # Parameters
+# BENCHMARK: one or more benchmark names, space-separated (e.g. BENCHMARK="gsm8k mmlu")
 BENCHMARK ?=
 FORCE ?=
 WORKERS ?= 4

 # Internal helpers
-# When BENCHMARK is set: pass it as positional arg; otherwise use --all flag
+# When BENCHMARK is set: pass name(s) as positional args; otherwise use --all flag
 _BENCH_ARGS = $(if $(BENCHMARK),$(BENCHMARK),--all)
 # When FORCE is non-empty (e.g. FORCE=1): append --force flag
 _FORCE_FLAG = $(if $(FORCE),--force,)

@@ -61,11 +65,11 @@ docs: docs-translate docs-generate

 .PHONY: docs-update
 docs-update:
-	python -m evalscope.cli.cli benchmark-info $(_BENCH_ARGS) --update --workers $(WORKERS)
+	python -m evalscope.cli.cli benchmark-info $(_BENCH_ARGS) --update $(_FORCE_FLAG) --workers $(WORKERS)

 .PHONY: docs-update-stats
 docs-update-stats:
-	python -m evalscope.cli.cli benchmark-info $(_BENCH_ARGS) --update --compute-stats --workers $(WORKERS)
+	python -m evalscope.cli.cli benchmark-info $(_BENCH_ARGS) --update --compute-stats $(_FORCE_FLAG) --workers $(WORKERS)

 .PHONY: docs-translate
 docs-translate:
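To see what the `$(if ...)` helpers in the hunk above expand to, here is a plain-shell rendering of the same logic (a sketch, not part of the Makefile; the variable values mirror the documented defaults):

```shell
# Shell equivalent of _BENCH_ARGS / _FORCE_FLAG (sketch):
BENCHMARK="gsm8k mmlu"
FORCE=1
WORKERS=4
BENCH_ARGS=${BENCHMARK:---all}   # positional names if set, else --all
FORCE_FLAG=${FORCE:+--force}     # --force only when FORCE is non-empty
echo python -m evalscope.cli.cli benchmark-info $BENCH_ARGS --update $FORCE_FLAG --workers $WORKERS
```

With `BENCHMARK` unset and `FORCE` empty, the same expansion yields `benchmark-info --all --update --workers 4`, matching the Makefile's fallback behavior.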

docs/en/benchmarks/aime24.md

Lines changed: 13 additions & 16 deletions
@@ -26,29 +26,28 @@ AIME 2024 (American Invitational Mathematics Examination 2024) is a benchmark ba
 - Answers should be formatted within `\boxed{}` for proper extraction
 - Only integer answers are accepted (matching AIME format)
 - Problems are significantly harder than GSM8K or standard MATH benchmark
-- Reference solutions available in metadata for analysis


 ## Properties

 | Property | Value |
 |----------|-------|
 | **Benchmark Name** | `aime24` |
-| **Dataset ID** | [HuggingFaceH4/aime_2024](https://modelscope.cn/datasets/HuggingFaceH4/aime_2024/summary) |
+| **Dataset ID** | [evalscope/aime24](https://modelscope.cn/datasets/evalscope/aime24/summary) |
 | **Paper** | N/A |
 | **Tags** | `Math`, `Reasoning` |
 | **Metrics** | `acc` |
 | **Default Shots** | 0-shot |
-| **Evaluation Split** | `train` |
+| **Evaluation Split** | `test` |


 ## Data Statistics

 | Metric | Value |
 |--------|-------|
 | Total Samples | 30 |
-| Prompt Length (Mean) | 405.33 chars |
-| Prompt Length (Min/Max) | 185 / 1009 chars |
+| Prompt Length (Mean) | 462.33 chars |
+| Prompt Length (Min/Max) | 242 / 1066 chars |

 ## Sample Example

@@ -58,28 +57,26 @@ AIME 2024 (American Invitational Mathematics Examination 2024) is a benchmark ba
 {
   "input": [
     {
-      "id": "2f896eed",
-      "content": "Every morning Aya goes for a $9$-kilometer-long walk and stops at a coffee shop afterwards. When she walks at a constant speed of $s$ kilometers per hour, the walk takes her 4 hours, including $t$ minutes spent in the coffee shop. When she wa ... [TRUNCATED] ... e coffee shop. Suppose Aya walks at $s+\\frac{1}{2}$ kilometers per hour. Find the number of minutes the walk takes her, including the $t$ minutes spent in the coffee shop.\nPlease reason step by step, and put your final answer within \\boxed{}."
+      "id": "32187d6f",
+      "content": "\nSolve the following math problem step by step. Put your answer inside \\boxed{}.\n\nEvery morning Aya goes for a $9$-kilometer-long walk and stops at a coffee shop afterwards. When she walks at a constant speed of $s$ kilometers per hour, the w ... [TRUNCATED 164 chars] ... g $t$ minutes spent in the coffee shop. Suppose Aya walks at $s+\\frac{1}{2}$ kilometers per hour. Find the number of minutes the walk takes her, including the $t$ minutes spent in the coffee shop.\n\nRemember to put your answer inside \\boxed{}."
     }
   ],
-  "target": "204",
+  "target": "\\boxed{204}",
   "id": 0,
-  "group_id": 0,
-  "metadata": {
-    "problem_id": 60,
-    "solution": "$\\frac{9}{s} + t = 4$ in hours and $\\frac{9}{s+2} + t = 2.4$ in hours.\nSubtracting the second equation from the first, we get, \n$\\frac{9}{s} - \\frac{9}{s+2} = 1.6$\nMultiplying by $(s)(s+2)$, we get \n$9s+18-9s=18=1.6s^{2} + 3.2s$\nMultiplying b ... [TRUNCATED] ... s = 2.5$. Now, $2.5+0.5 = 3$. Taking $\\frac{9}{3} = 3$, we find that it will take three hours for the 9 kilometers to be traveled. The t minutes spent at the coffeeshop can be written as $144-48(2.5)$, so t = 24. $180 + 24 = 204$. -sepehr2010"
-  }
+  "group_id": 0
 }
 ```

-*Note: Some content was truncated for display.*
-
 ## Prompt Template

 **Prompt Template:**
 ```text
+
+Solve the following math problem step by step. Put your answer inside \boxed{{}}.
+
 {question}
-Please reason step by step, and put your final answer within \boxed{{}}.
+
+Remember to put your answer inside \boxed{{}}.
 ```
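A note on the doubled braces: the template is rendered with Python's `str.format`, so `{{` and `}}` escape to literal braces while `{question}` stays a placeholder. A minimal sketch (the question string here is an illustrative placeholder, and the template text is copied from the block above):

```python
# Sketch: why the template writes \boxed{{}} rather than \boxed{}.
# str.format treats {} as a placeholder, so literal braces must be doubled.
template = (
    "\nSolve the following math problem step by step. "
    "Put your answer inside \\boxed{{}}.\n\n"
    "{question}\n\n"
    "Remember to put your answer inside \\boxed{{}}."
)
prompt = template.format(question="Find the sum of all integer bases $b>9$ ...")
print("\\boxed{}" in prompt)  # the doubled braces render as literal {}
```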

 ## Usage

docs/en/benchmarks/aime25.md

Lines changed: 6 additions & 19 deletions
@@ -25,15 +25,14 @@ AIME 2025 (American Invitational Mathematics Examination 2025) is a benchmark ba
 - Default configuration uses **0-shot** evaluation
 - Answers should be formatted within `\boxed{}` for proper extraction
 - Uses LLM-as-judge for mathematical equivalence checking
-- Subsets: AIME2025-I, AIME2025-II


 ## Properties

 | Property | Value |
 |----------|-------|
 | **Benchmark Name** | `aime25` |
-| **Dataset ID** | [opencompass/AIME2025](https://modelscope.cn/datasets/opencompass/AIME2025/summary) |
+| **Dataset ID** | [evalscope/aime25](https://modelscope.cn/datasets/evalscope/aime25/summary) |
 | **Paper** | N/A |
 | **Tags** | `Math`, `Reasoning` |
 | **Metrics** | `acc` |

@@ -46,26 +45,19 @@ AIME 2025 (American Invitational Mathematics Examination 2025) is a benchmark ba
 | Metric | Value |
 |--------|-------|
 | Total Samples | 30 |
-| Prompt Length (Mean) | 491.07 chars |
-| Prompt Length (Min/Max) | 212 / 1119 chars |
-
-**Per-Subset Statistics:**
-
-| Subset | Samples | Prompt Mean | Prompt Min | Prompt Max |
-|--------|---------|-------------|------------|------------|
-| `AIME2025-I` | 15 | 476.07 | 212 | 768 |
-| `AIME2025-II` | 15 | 506.07 | 234 | 1119 |
+| Prompt Length (Mean) | 604.93 chars |
+| Prompt Length (Min/Max) | 208 / 1862 chars |

 ## Sample Example

-**Subset**: `AIME2025-I`
+**Subset**: `default`

 ```json
 {
   "input": [
     {
-      "id": "99875e2d",
-      "content": "\nSolve the following math problem step by step. Put your answer inside \\boxed{}.\n\nFind the sum of all integer bases $b>9$ for which $17_{b}$ is a divisor of $97_{b}$.\n\nRemember to put your answer inside \\boxed{}."
+      "id": "bff66863",
+      "content": "\nSolve the following math problem step by step. Put your answer inside \\boxed{}.\n\nFind the sum of all integer bases $b>9$ for which $17_b$ is a divisor of $97_b.$\n\nRemember to put your answer inside \\boxed{}."
     }
   ],
   "target": "70",

@@ -110,11 +102,6 @@ task_cfg = TaskConfig(
     api_url='OPENAI_API_COMPAT_URL',
     api_key='EMPTY_TOKEN',
     datasets=['aime25'],
-    dataset_args={
-        'aime25': {
-            # subset_list: ['AIME2025-I', 'AIME2025-II'] # optional, evaluate specific subsets
-        }
-    },
     limit=10, # Remove this line for formal evaluation
 )

docs/en/benchmarks/aime26.md

Lines changed: 111 additions & 0 deletions
@@ -0,0 +1,111 @@
+# AIME-2026
+
+## Overview
+
+AIME 2026 (American Invitational Mathematics Examination 2026) is a benchmark based on problems from the prestigious AIME competition, one of the most challenging high school mathematics contests in the United States. It tests advanced mathematical reasoning and problem-solving skills.
+
+## Task Description
+
+- **Task Type**: Competition Mathematics Problem Solving
+- **Input**: AIME-level mathematical problem
+- **Output**: Integer answer (0-999) with step-by-step reasoning
+- **Difficulty**: Advanced high school / early undergraduate level
+
+## Key Features
+
+- Problems from AIME I and AIME II 2026 competitions
+- Answers are always integers between 0 and 999
+- Requires creative mathematical reasoning and problem-solving
+- Topics: algebra, geometry, number theory, combinatorics, probability
+- Represents top-tier high school mathematics competition difficulty
+
+## Evaluation Notes
+
+- Default configuration uses **0-shot** evaluation
+- Answers should be formatted within `\boxed{}` for proper extraction
+- Uses LLM-as-judge for mathematical equivalence checking
+
+## Properties
+
+| Property | Value |
+|----------|-------|
+| **Benchmark Name** | `aime26` |
+| **Dataset ID** | [evalscope/aime26](https://modelscope.cn/datasets/evalscope/aime26/summary) |
+| **Paper** | N/A |
+| **Tags** | `Math`, `Reasoning` |
+| **Metrics** | `acc` |
+| **Default Shots** | 0-shot |
+| **Evaluation Split** | `test` |
+
+## Data Statistics
+
+| Metric | Value |
+|--------|-------|
+| Total Samples | 30 |
+| Prompt Length (Mean) | 517.7 chars |
+| Prompt Length (Min/Max) | 263 / 1026 chars |
+
+## Sample Example
+
+**Subset**: `default`
+
+```json
+{
+  "input": [
+    {
+      "id": "a6d469bc",
+      "content": "\nSolve the following math problem step by step. Put your answer inside \\boxed{}.\n\nPatrick started walking at a constant rate along a straight road from school to the park. One hour after Patrick left, Tanya started running along the same road ... [TRUNCATED 270 chars] ... ya ran, and all three arrived at the park at the same time. The distance from the school to the park is $\\frac{m}{n}$ miles, where $m$ and $n$ are relatively prime positive integers. Find $m + n$.\n\nRemember to put your answer inside \\boxed{}."
+    }
+  ],
+  "target": "277",
+  "id": 0,
+  "group_id": 0
+}
+```
+
+## Prompt Template
+
+**Prompt Template:**
+```text
+
+Solve the following math problem step by step. Put your answer inside \boxed{{}}.
+
+{question}
+
+Remember to put your answer inside \boxed{{}}.
+```
+
+## Usage
+
+### Using CLI
+
+```bash
+evalscope eval \
+  --model YOUR_MODEL \
+  --api-url OPENAI_API_COMPAT_URL \
+  --api-key EMPTY_TOKEN \
+  --datasets aime26 \
+  --limit 10 # Remove this line for formal evaluation
+```
+
+### Using Python
+
+```python
+from evalscope import run_task
+from evalscope.config import TaskConfig
+
+task_cfg = TaskConfig(
+    model='YOUR_MODEL',
+    api_url='OPENAI_API_COMPAT_URL',
+    api_key='EMPTY_TOKEN',
+    datasets=['aime26'],
+    limit=10, # Remove this line for formal evaluation
+)
+
+run_task(task_cfg=task_cfg)
+```

docs/en/get_started/supported_dataset/llm.md

Lines changed: 2 additions & 0 deletions
@@ -7,6 +7,7 @@ Below is the list of supported LLM benchmarks. Click on a benchmark name for det
 | `aa_lcr` | [AA-LCR](../../benchmarks/aa_lcr.md) | `Knowledge`, `LongContext`, `Reasoning` |
 | `aime24` | [AIME-2024](../../benchmarks/aime24.md) | `Math`, `Reasoning` |
 | `aime25` | [AIME-2025](../../benchmarks/aime25.md) | `Math`, `Reasoning` |
+| `aime26` | [AIME-2026](../../benchmarks/aime26.md) | `Math`, `Reasoning` |
 | `alpaca_eval` | [AlpacaEval2.0](../../benchmarks/alpaca_eval.md) | `Arena`, `InstructionFollowing` |
 | `amc` | [AMC](../../benchmarks/amc.md) | `Math`, `Reasoning` |
 | `anat_em` | [AnatEM](../../benchmarks/anat_em.md) | `Knowledge`, `NER` |

@@ -116,6 +117,7 @@ Below is the list of supported LLM benchmarks. Click on a benchmark name for det
 ../../benchmarks/aa_lcr.md
 ../../benchmarks/aime24.md
 ../../benchmarks/aime25.md
+../../benchmarks/aime26.md
 ../../benchmarks/alpaca_eval.md
 ../../benchmarks/amc.md
 ../../benchmarks/anat_em.md
docs/en/user_guides/stress_test/sla_auto_tune.md

Lines changed: 59 additions & 6 deletions
@@ -11,12 +11,15 @@ The SLA (Service Level Agreement) auto-tuning feature allows users to define ser

 ## Parameter Description

-See [Parameter Description](./parameters.md#sla-settings) for details.
-
-Main parameters:
-- `--sla-auto-tune`: Enable auto-tuning.
-- `--sla-variable`: Adjustment variable, `parallel` or `rate`.
-- `--sla-params`: Define SLA rules.
+| Parameter | Type | Description | Default |
+|-----------|------|-------------|---------|
+| `--sla-auto-tune` | `bool` | Whether to enable SLA auto-tuning mode | `False` |
+| `--sla-variable` | `str` | Variable to auto-tune<br>Options: `parallel` (concurrency), `rate` (request rate) | `parallel` |
+| `--sla-params` | `str` | SLA constraints as a JSON string; supports multiple constraint groups (AND/OR logic), see [description below](#sla-params-logic) | `None` |
+| `--sla-upper-bound` | `int` | Maximum concurrency/rate limit during auto-tuning | `65536` |
+| `--sla-lower-bound` | `int` | Minimum concurrency/rate limit during auto-tuning | `1` |
+| `--sla-num-runs` | `int` | Number of repeated runs per test point (the average is taken to reduce fluctuation) | `3` |
+| `--sla-number-multiplier` | `float` | Multiplier of total requests relative to the concurrency/rate per test, i.e. `number = round(parallel × N)`; defaults to `2` when not set | `None` |

 ## Supported Metrics and Operators

@@ -31,6 +34,56 @@ Main parameters:
 | **Throughput** | `rps` | Requests per second | `>=`, `>`, `max` |
 | | `tps` | Tokens per second | `>=`, `>`, `max` |

+## `--sla-params` Logic
+
+`--sla-params` accepts a **JSON array**, where each element is an **object (group)**. The logic rules are:
+
+- **Multiple metrics within the same object**: **AND** (all must be satisfied simultaneously)
+- **Between different objects**: **OR** (satisfying any one group is sufficient)
+
+The overall semantics are: `(Group1 ConditionA AND Group1 ConditionB) OR (Group2 ConditionC AND Group2 ConditionD) OR ...`
+
+### AND Example: Satisfy TTFT and TPOT Simultaneously
+
+Write multiple metrics in the **same object** to indicate they must **all** be satisfied:
+
+```bash
+--sla-params '[{"avg_ttft": "<=2", "avg_tpot": "<=0.05"}]'
+```
+
+Meaning: find the maximum concurrency satisfying **`avg_ttft <= 2s` AND `avg_tpot <= 0.05s`**. A concurrency level passes only when both metrics are met.
+
+### OR Example: Independently Evaluate Multiple TTFT Thresholds
+
+Write each metric in a **different object** so each group of conditions is evaluated **independently**:
+
+```bash
+--sla-params '[{"p99_ttft": "<0.05"}, {"p99_ttft": "<0.01"}]'
+```
+
+Meaning: find the maximum request rate satisfying **`p99_ttft < 0.05s`** and, separately, the maximum satisfying **`p99_ttft < 0.01s`**; each group reports its result independently.
+
+### AND + OR Combined Example
+
+```bash
+--sla-params '[{"avg_ttft": "<=1", "avg_tpot": "<=0.05"}, {"p99_latency": "<=5"}]'
+```
+
+Meaning:
+- **Group 1**: `avg_ttft <= 1s` **AND** `avg_tpot <= 0.05s` (both satisfied simultaneously)
+- **Group 2**: `p99_latency <= 5s`
+- Each group independently completes a binary search and reports its own maximum concurrency.
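The AND/OR semantics above can be sketched in a few lines of Python. This is an illustrative model only, not EvalScope's actual implementation, and it covers only the threshold operators (`<=`, `>=`, `<`, `>`), not the `max`/`min` extremum mode:

```python
# Sketch of --sla-params group logic: AND across metrics inside one object,
# OR across objects. `measured` maps metric names to observed values.
import json
import operator
import re

OPS = {"<=": operator.le, ">=": operator.ge, "<": operator.lt, ">": operator.gt}

def sla_satisfied(sla_params: str, measured: dict) -> bool:
    groups = json.loads(sla_params)
    def cond_ok(metric, expr):
        op, threshold = re.match(r"(<=|>=|<|>)(.+)", expr).groups()
        return OPS[op](measured[metric], float(threshold))
    # any(...) = OR across groups; all(...) = AND within a group
    return any(all(cond_ok(m, e) for m, e in g.items()) for g in groups)

measured = {"avg_ttft": 0.8, "avg_tpot": 0.06, "p99_latency": 4.2}
# Group 1 fails (avg_tpot > 0.05) but Group 2 passes, so the overall result is True:
print(sla_satisfied('[{"avg_ttft": "<=1", "avg_tpot": "<=0.05"}, {"p99_latency": "<=5"}]', measured))
```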
+
+### Extremum Optimization Mode
+
+When the array contains **only one object with only one metric**, and the operator is `max` or `min`, the tool enters extremum optimization mode and directly searches for the pressure value that optimizes that metric:
+
+```bash
+--sla-params '[{"tps": "max"}]'
+```
+
+Meaning: find the concurrency that maximizes TPS (token throughput).
+
 ## Workflow

 1. **Baseline Test**: Start testing with the user-specified initial `parallel` or `rate` (a small value such as 1 or 2 is recommended).
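The per-group binary search mentioned above can be sketched as follows. `run_benchmark` and the toy TTFT model are hypothetical stand-ins for a real stress-test run, and the sketch assumes the SLA predicate is monotone in concurrency (true at low concurrency, false past some threshold):

```python
# Sketch: binary-search the largest concurrency in [lower, upper] whose
# measured metrics still pass the SLA predicate.
def find_max_parallel(run_benchmark, sla_ok, lower=1, upper=65536):
    best = None
    while lower <= upper:
        mid = (lower + upper) // 2
        if sla_ok(run_benchmark(mid)):
            best = mid           # SLA holds here; try higher concurrency
            lower = mid + 1
        else:
            upper = mid - 1      # SLA violated; back off
    return best

# Toy model: TTFT grows linearly with concurrency; SLA requires avg_ttft <= 2s.
fake_run = lambda parallel: {"avg_ttft": 0.01 * parallel}
print(find_max_parallel(fake_run, lambda m: m["avg_ttft"] <= 2))  # 200
```

Returning `None` when no concurrency in the range passes mirrors the case where even the lower bound violates the SLA.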
