@@ -58,28 +57,26 @@ AIME 2024 (American Invitational Mathematics Examination 2024) is a benchmark ba
 {
   "input": [
     {
-      "id": "2f896eed",
-      "content": "Every morning Aya goes for a $9$-kilometer-long walk and stops at a coffee shop afterwards. When she walks at a constant speed of $s$ kilometers per hour, the walk takes her 4 hours, including $t$ minutes spent in the coffee shop. When she wa ... [TRUNCATED] ... e coffee shop. Suppose Aya walks at $s+\\frac{1}{2}$ kilometers per hour. Find the number of minutes the walk takes her, including the $t$ minutes spent in the coffee shop.\nPlease reason step by step, and put your final answer within \\boxed{}."
+      "id": "32187d6f",
+      "content": "\nSolve the following math problem step by step. Put your answer inside \\boxed{}.\n\nEvery morning Aya goes for a $9$-kilometer-long walk and stops at a coffee shop afterwards. When she walks at a constant speed of $s$ kilometers per hour, the w ... [TRUNCATED 164 chars] ... g $t$ minutes spent in the coffee shop. Suppose Aya walks at $s+\\frac{1}{2}$ kilometers per hour. Find the number of minutes the walk takes her, including the $t$ minutes spent in the coffee shop.\n\nRemember to put your answer inside \\boxed{}."
     }
   ],
-  "target": "204",
+  "target": "\\boxed{204}",
   "id": 0,
-  "group_id": 0,
-  "metadata": {
-    "problem_id": 60,
-    "solution": "$\\frac{9}{s} + t = 4$ in hours and $\\frac{9}{s+2} + t = 2.4$ in hours.\nSubtracting the second equation from the first, we get, \n$\\frac{9}{s} - \\frac{9}{s+2} = 1.6$\nMultiplying by $(s)(s+2)$, we get \n$9s+18-9s=18=1.6s^{2} + 3.2s$\nMultiplying b ... [TRUNCATED] ... s = 2.5$. Now, $2.5+0.5 = 3$. Taking $\\frac{9}{3} = 3$, we find that it will take three hours for the 9 kilometers to be traveled. The t minutes spent at the coffeeshop can be written as $144-48(2.5)$, so t = 24. $180 + 24 = 204$. -sepehr2010"
-  }
+  "group_id": 0
 }
 ```
 
-*Note: Some content was truncated for display.*
-
 ## Prompt Template
 
 **Prompt Template:**
 ```text
+
+Solve the following math problem step by step. Put your answer inside \boxed{{}}.
+
 {question}
-Please reason step by step, and put your final answer within \boxed{{}}.
-      "content": "\nSolve the following math problem step by step. Put your answer inside \\boxed{}.\n\nFind the sum of all integer bases $b>9$ for which $17_{b}$ is a divisor of $97_{b}$.\n\nRemember to put your answer inside \\boxed{}."
+      "id": "bff66863",
+      "content": "\nSolve the following math problem step by step. Put your answer inside \\boxed{}.\n\nFind the sum of all integer bases $b>9$ for which $17_b$ is a divisor of $97_b.$\n\nRemember to put your answer inside \\boxed{}."
     }
   ],
   "target": "70",
@@ -110,11 +102,6 @@ task_cfg = TaskConfig(
     api_url='OPENAI_API_COMPAT_URL',
     api_key='EMPTY_TOKEN',
     datasets=['aime25'],
-    dataset_args={
-        'aime25': {
-            # subset_list: ['AIME2025-I', 'AIME2025-II']  # optional, evaluate specific subsets
-        }
-    },
     limit=10,  # Remove this line for formal evaluation
+AIME 2026 (American Invitational Mathematics Examination 2026) is a benchmark based on problems from the prestigious AIME competition, one of the most challenging high school mathematics contests in the United States. It tests advanced mathematical reasoning and problem-solving skills.
+
+## Task Description
+
+- **Task Type**: Competition Mathematics Problem Solving
+- **Input**: AIME-level mathematical problem
+- **Output**: Integer answer (0-999) with step-by-step reasoning
+- **Difficulty**: Advanced high school / early undergraduate level
+
+## Key Features
+
+- Problems from AIME I and AIME II 2026 competitions
+- Answers are always integers between 0 and 999
+- Requires creative mathematical reasoning and problem-solving
+- Topics: algebra, geometry, number theory, combinatorics, probability
+- Represents top-tier high school mathematics competition difficulty
+      "content": "\nSolve the following math problem step by step. Put your answer inside \\boxed{}.\n\nPatrick started walking at a constant rate along a straight road from school to the park. One hour after Patrick left, Tanya started running along the same road ... [TRUNCATED 270 chars] ... ya ran, and all three arrived at the park at the same time. The distance from the school to the park is $\\frac{m}{n}$ miles, where $m$ and $n$ are relatively prime positive integers. Find $m + n$.\n\nRemember to put your answer inside \\boxed{}."
+    }
+  ],
+  "target": "277",
+  "id": 0,
+  "group_id": 0
+}
+```
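The sample records in this diff store targets in two forms: the AIME 2024 record changes its target from `"204"` to `"\\boxed{204}"`, while the record above keeps the bare `"277"`. A scorer therefore has to cope with both. The sketch below shows one way to do that; `extract_boxed` and `same_answer` are hypothetical helpers written for illustration and are not taken from EvalScope.

```python
from typing import Optional


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, or None if absent."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    chars = []
    # Walk forward from the opening brace and stop at its matching close,
    # so nested braces such as \frac{1}{2} are kept intact.
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars).strip()
        chars.append(ch)
    return None  # unbalanced braces


def same_answer(prediction: str, target: str) -> bool:
    """Compare a model response with a target that may or may not be boxed."""
    predicted = extract_boxed(prediction)
    boxed_target = extract_boxed(target)
    expected = boxed_target if boxed_target is not None else target.strip()
    return predicted is not None and predicted == expected
```

Comparison here is plain string equality, which is enough for integer answers in the 0-999 range; normalizing forms such as `007` or `7.0` is left out on purpose.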
+
+## Prompt Template
+
+**Prompt Template:**
+```text
+
+Solve the following math problem step by step. Put your answer inside \boxed{{}}.
+
+{question}
+
+Remember to put your answer inside \boxed{{}}.
+```
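The doubled braces in `\boxed{{}}` suggest that the template is filled in with Python's `str.format`, where `{{}}` renders as a literal `{}` and `{question}` is the only placeholder. That is an assumption about how the template is consumed, and `build_prompt` is a hypothetical name. Under that assumption, a minimal sketch is:

```python
# Template text copied from the block above, including the leading blank line.
PROMPT_TEMPLATE = (
    "\n"
    "Solve the following math problem step by step. "
    "Put your answer inside \\boxed{{}}.\n"
    "\n"
    "{question}\n"
    "\n"
    "Remember to put your answer inside \\boxed{{}}."
)


def build_prompt(question: str) -> str:
    """Fill the single {question} placeholder; {{}} collapses to {}."""
    return PROMPT_TEMPLATE.format(question=question)


if __name__ == "__main__":
    print(build_prompt("Find the sum of all integer bases $b>9$ ..."))
```

Braces inside the question itself (for example `$17_{b}$`) are safe, because only the template string is parsed for placeholders, not the substituted value. The rendered text matches the `content` field shown in the sample records.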
+
+## Usage
+
+### Using CLI
+
+```bash
+evalscope eval \
+    --model YOUR_MODEL \
+    --api-url OPENAI_API_COMPAT_URL \
+    --api-key EMPTY_TOKEN \
+    --datasets aime26 \
+    --limit 10  # Remove this line for formal evaluation
+```
+
+### Using Python
+
+```python
+from evalscope import run_task
+from evalscope.config import TaskConfig
+
+task_cfg = TaskConfig(
+    model='YOUR_MODEL',
+    api_url='OPENAI_API_COMPAT_URL',
+    api_key='EMPTY_TOKEN',
+    datasets=['aime26'],
+    limit=10,  # Remove this line for formal evaluation
docs/en/user_guides/stress_test/sla_auto_tune.md
+59 -6 (59 additions, 6 deletions)
@@ -11,12 +11,15 @@ The SLA (Service Level Agreement) auto-tuning feature allows users to define ser
 
 ## Parameter Description
 
-See [Parameter Description](./parameters.md#sla-settings) for details.
-
-Main parameters:
-- `--sla-auto-tune`: Enable auto-tuning.
-- `--sla-variable`: Adjustment variable, `parallel` or `rate`.
-- `--sla-params`: Define SLA rules.
+| Parameter | Type | Description | Default |
+|------|------|------|--------|
+| `--sla-auto-tune` | `bool` | Whether to enable SLA auto-tuning mode | `False` |
+| `--sla-variable` | `str` | Variable for auto-tuning<br>Options: `parallel` (concurrency), `rate` (request rate) | `parallel` |
+| `--sla-params` | `str` | SLA constraint conditions, JSON string, supports multiple constraint groups (AND/OR logic), see [description below](#sla-params-logic) | `None` |
+| `--sla-upper-bound` | `int` | Maximum concurrency/rate limit during auto-tuning | `65536` |
+| `--sla-lower-bound` | `int` | Minimum concurrency/rate limit during auto-tuning | `1` |
+| `--sla-num-runs` | `int` | Number of repeated runs per test point (average taken to reduce fluctuation) | `3` |
+| `--sla-number-multiplier` | `float` | Multiplier of total requests relative to concurrency/rate per test, i.e. `number = round(parallel × N)`; defaults to `2` when not set | `None` |
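The last row's formula, `number = round(parallel × N)` with a fallback of `2` when the option is unset, can be illustrated with a short sketch. `requests_per_test` is a hypothetical helper, not part of the tool, and it uses Python's built-in `round`, which rounds exact halves to the nearest even integer; whether the tool rounds the same way is not stated in this document.

```python
from typing import Optional

DEFAULT_MULTIPLIER = 2.0  # the table says the multiplier defaults to 2 when not set


def requests_per_test(level: float, multiplier: Optional[float] = None) -> int:
    """Total requests for one test point: round(level * N)."""
    n = DEFAULT_MULTIPLIER if multiplier is None else multiplier
    return round(level * n)


if __name__ == "__main__":
    print(requests_per_test(8))        # 8 concurrent requests, default multiplier
    print(requests_per_test(8, 1.5))   # same level with an explicit multiplier
```

With the default multiplier, a test point at concurrency 8 sends 16 requests; with a multiplier of 1.5 it sends 12.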
 
 ## Supported Metrics and Operators
 
@@ -31,6 +34,56 @@ Main parameters:
 | **Throughput** | `rps` | Requests per second | `>=`, `>`, `max` |
 | | `tps` | Tokens per second | `>=`, `>`, `max` |
 
+## `--sla-params` Logic
+
+`--sla-params` accepts a **JSON array**, where each element is an **object (group)**. Logic rules are as follows:
+
+- **Multiple metrics within the same object**: **AND** (must all be satisfied simultaneously)
+- **Between different objects**: **OR** (any one group satisfied is sufficient)
+
+The overall semantics are: `(Group1 ConditionA AND Group1 ConditionB) OR (Group2 ConditionC AND Group2 ConditionD) OR ...`
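The grouping rule above can be sketched in a few lines of Python. This is an illustration of the AND/OR semantics only, not the tool's implementation: `evaluate_sla` is a hypothetical function, and the condition strings (an operator followed by a number, such as `"<=2"`) are an assumed encoding chosen to resemble the `'[{"tps": "max"}]'` style used elsewhere in this document.

```python
import json
import operator
import re
from typing import Dict, List, Tuple

_OPS = {"<=": operator.le, "<": operator.lt, ">=": operator.ge, ">": operator.gt}
# Assumed condition syntax: comparison operator, then a non-negative number.
_COND = re.compile(r"^\s*(<=|>=|<|>)\s*(\d+(?:\.\d+)?)\s*$")


def _holds(value: float, condition: str) -> bool:
    """Check one measured value against one condition string."""
    match = _COND.match(condition)
    if match is None:
        raise ValueError(f"unsupported condition: {condition!r}")
    return _OPS[match.group(1)](value, float(match.group(2)))


def evaluate_sla(sla_params: str, measured: Dict[str, float]) -> Tuple[List[bool], bool]:
    """Return (result per group, overall result) for a JSON array of groups."""
    groups = json.loads(sla_params)
    # Metrics inside one object are ANDed together.
    per_group = [
        all(_holds(measured[name], cond) for name, cond in group.items())
        for group in groups
    ]
    # Separate objects are ORed together.
    return per_group, any(per_group)


if __name__ == "__main__":
    params = '[{"avg_ttft": "<=2", "avg_tpot": "<=0.05"}, {"p99_ttft": "<0.05"}]'
    measured = {"avg_ttft": 1.5, "avg_tpot": 0.08, "p99_ttft": 0.04}
    print(evaluate_sla(params, measured))  # ([False, True], True)
```

In the usage example the first group fails because `avg_tpot` exceeds its limit even though `avg_ttft` passes, while the second group passes on its own, so the overall result is true. Returning the per-group list alongside the overall flag keeps each group's outcome visible.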
+
+### AND Example: Satisfy TTFT and TPOT Simultaneously
+
+Write multiple metrics in the **same object** to indicate they must **all** be satisfied:
+Meaning: Find the maximum concurrency satisfying **`avg_ttft <= 2s` AND `avg_tpot <= 0.05s`**. Only when both metrics are met does that concurrency level pass.
+
+### OR Example: Independently Evaluate Multiple TTFT Thresholds
+
+Write each metric in a **different object** so each group of conditions is evaluated **independently**:
+Meaning: Find the maximum request rate satisfying **`p99_ttft < 0.05s`** and, separately, the maximum request rate satisfying **`p99_ttft < 0.01s`**; each group outputs its result independently.
+- Each group independently completes a binary search and outputs its maximum concurrency value separately.
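The per-group binary search mentioned above can be sketched as follows. This is a minimal illustration, assuming that a group passes at every load level up to some limit and fails beyond it; `max_passing_level` and the toy latency model are hypothetical, and the real tool may differ (for example by starting from a baseline test and averaging several runs per point). The default bounds mirror the `--sla-lower-bound` and `--sla-upper-bound` defaults.

```python
from typing import Callable, Dict, Optional


def max_passing_level(
    passes: Callable[[int], bool],
    lower: int = 1,
    upper: int = 65536,
) -> Optional[int]:
    """Largest level in [lower, upper] for which passes(level) is True."""
    best = None
    lo, hi = lower, upper
    while lo <= hi:
        mid = (lo + hi) // 2
        if passes(mid):
            best = mid      # mid satisfies the SLA; try a higher load
            lo = mid + 1
        else:
            hi = mid - 1    # mid violates the SLA; try a lower load
    return best


if __name__ == "__main__":
    # Toy model (assumption): p99 TTFT grows by 2 ms per unit of load.
    limits_ms: Dict[str, int] = {"p99_ttft < 0.05s": 50, "p99_ttft < 0.01s": 10}
    results = {
        name: max_passing_level(lambda level, limit=limit: 2 * level < limit)
        for name, limit in limits_ms.items()
    }
    print(results)  # {'p99_ttft < 0.05s': 24, 'p99_ttft < 0.01s': 4}
```

Each group gets its own search and its own answer, which matches the independent reporting described above. The function returns `None` when even the lower bound fails.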
+
+### Extremum Optimization Mode
+
+When the array has **only one object with only one metric**, and the operator is `max` or `min`, the tool enters extremum optimization mode and directly finds the pressure value corresponding to the optimal metric:
+
+```bash
+--sla-params '[{"tps": "max"}]'
+```
+
+Meaning: Find the concurrency corresponding to maximum TPS (token throughput).
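As a rough illustration of what "the concurrency corresponding to maximum TPS" means, the sketch below averages repeated measurements per concurrency level (in the spirit of `--sla-num-runs`) and picks the level with the best average. How the tool actually chooses which levels to probe is not described here, so the fixed set of levels, the sample numbers, and the `best_level` helper are all assumptions.

```python
from statistics import mean
from typing import Dict, List


def best_level(runs: Dict[int, List[float]], mode: str = "max") -> int:
    """Return the level whose averaged metric is largest ("max") or smallest ("min")."""
    if mode not in ("max", "min"):
        raise ValueError(f"mode must be 'max' or 'min', got {mode!r}")
    averaged = {level: mean(values) for level, values in runs.items()}
    pick = max if mode == "max" else min
    return pick(averaged, key=averaged.get)


if __name__ == "__main__":
    # Made-up TPS samples: three runs at each concurrency level.
    tps_runs = {
        1: [100.0, 110.0, 105.0],
        4: [380.0, 400.0, 390.0],
        16: [350.0, 340.0, 360.0],
    }
    print(best_level(tps_runs))  # 4
```

In this made-up data throughput peaks at concurrency 4 and drops at 16, so 4 is reported.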
+
 ## Workflow
 
 1. **Baseline Test**: Start testing with the user-specified initial `parallel` or `rate` (recommended to set a small value, such as 1 or 2).