
Commit 92fa45d

[Benchmark]Add aime26 and SLA auto tune (#1230)
* update * update * update
1 parent ce1dd98 commit 92fa45d


20 files changed: +722 −298 lines

Makefile

Lines changed: 9 additions & 5 deletions
@@ -23,18 +23,21 @@ default: install
 # PARAMETERS:
 #   BENCHMARK    Specific benchmark name (e.g. gsm8k, mmlu).
 #                Omit to process ALL registered benchmarks.
-#   FORCE=1      Force re-translate even when a translation already exists.
-#                Only applies to docs-translate.
+#   FORCE=1      Force recompute/re-translate even when data already exists.
+#                Applies to docs-update, docs-update-stats, and docs-translate.
 #   WORKERS      Parallel worker count for update / translate (default: 4).
 #
 # COMMON USAGE:
 #   make docs                                    # Full pipeline: translate → generate → build HTML
 #   make docs-update                             # Update metadata for ALL benchmarks
 #   make docs-update BENCHMARK=gsm8k             # Update metadata for ONE benchmark
+#   make docs-update BENCHMARK="gsm8k mmlu"      # Update metadata for MULTIPLE benchmarks
 #   make docs-update-stats                       # Update metadata + stats for ALL benchmarks
 #   make docs-update-stats BENCHMARK=gsm8k       # Update metadata + stats for ONE benchmark
+#   make docs-update-stats BENCHMARK="gsm8k mmlu" # Update metadata + stats for MULTIPLE benchmarks
 #   make docs-translate                          # Translate only untranslated benchmarks (ALL)
 #   make docs-translate BENCHMARK=gsm8k          # Translate ONE benchmark (skip if done)
+#   make docs-translate BENCHMARK="gsm8k mmlu"   # Translate MULTIPLE benchmarks
 #   make docs-translate FORCE=1                  # Force re-translate ALL benchmarks
 #   make docs-translate BENCHMARK=gsm8k FORCE=1  # Force re-translate ONE benchmark
 #   make docs-generate                           # Regenerate .md files from persisted JSON data

@@ -44,12 +47,13 @@ default: install
 # ============================================================================

 # Parameters
+# BENCHMARK: one or more benchmark names, space-separated (e.g. BENCHMARK="gsm8k mmlu")
 BENCHMARK ?=
 FORCE ?=
 WORKERS ?= 4

 # Internal helpers
-# When BENCHMARK is set: pass it as positional arg; otherwise use --all flag
+# When BENCHMARK is set: pass name(s) as positional args; otherwise use --all flag
 _BENCH_ARGS = $(if $(BENCHMARK),$(BENCHMARK),--all)
 # When FORCE is non-empty (e.g. FORCE=1): append --force flag
 _FORCE_FLAG = $(if $(FORCE),--force,)

@@ -61,11 +65,11 @@ docs: docs-translate docs-generate

 .PHONY: docs-update
 docs-update:
-	python -m evalscope.cli.cli benchmark-info $(_BENCH_ARGS) --update --workers $(WORKERS)
+	python -m evalscope.cli.cli benchmark-info $(_BENCH_ARGS) --update $(_FORCE_FLAG) --workers $(WORKERS)

 .PHONY: docs-update-stats
 docs-update-stats:
-	python -m evalscope.cli.cli benchmark-info $(_BENCH_ARGS) --update --compute-stats --workers $(WORKERS)
+	python -m evalscope.cli.cli benchmark-info $(_BENCH_ARGS) --update --compute-stats $(_FORCE_FLAG) --workers $(WORKERS)

 .PHONY: docs-translate
 docs-translate:
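To see what the `$(if ...)` helpers in the hunk above expand to, here is a plain-shell rendering of the same logic (a sketch, not part of the Makefile; the variable values mirror the documented defaults):

```shell
# Shell equivalent of _BENCH_ARGS / _FORCE_FLAG (sketch):
BENCHMARK="gsm8k mmlu"
FORCE=1
WORKERS=4
BENCH_ARGS=${BENCHMARK:---all}   # positional names if set, else --all
FORCE_FLAG=${FORCE:+--force}     # --force only when FORCE is non-empty
echo python -m evalscope.cli.cli benchmark-info $BENCH_ARGS --update $FORCE_FLAG --workers $WORKERS
```

With `BENCHMARK` unset and `FORCE` empty, the same expansion yields `benchmark-info --all --update --workers 4`, matching the Makefile's fallback behavior.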

docs/en/benchmarks/aime24.md

Lines changed: 13 additions & 16 deletions
@@ -26,29 +26,28 @@ AIME 2024 (American Invitational Mathematics Examination 2024) is a benchmark ba
 - Answers should be formatted within `\boxed{}` for proper extraction
 - Only integer answers are accepted (matching AIME format)
 - Problems are significantly harder than GSM8K or standard MATH benchmark
-- Reference solutions available in metadata for analysis


 ## Properties

 | Property | Value |
 |----------|-------|
 | **Benchmark Name** | `aime24` |
-| **Dataset ID** | [HuggingFaceH4/aime_2024](https://modelscope.cn/datasets/HuggingFaceH4/aime_2024/summary) |
+| **Dataset ID** | [evalscope/aime24](https://modelscope.cn/datasets/evalscope/aime24/summary) |
 | **Paper** | N/A |
 | **Tags** | `Math`, `Reasoning` |
 | **Metrics** | `acc` |
 | **Default Shots** | 0-shot |
-| **Evaluation Split** | `train` |
+| **Evaluation Split** | `test` |


 ## Data Statistics

 | Metric | Value |
 |--------|-------|
 | Total Samples | 30 |
-| Prompt Length (Mean) | 405.33 chars |
-| Prompt Length (Min/Max) | 185 / 1009 chars |
+| Prompt Length (Mean) | 462.33 chars |
+| Prompt Length (Min/Max) | 242 / 1066 chars |

 ## Sample Example

@@ -58,28 +57,26 @@ AIME 2024 (American Invitational Mathematics Examination 2024) is a benchmark ba
 {
   "input": [
     {
-      "id": "2f896eed",
-      "content": "Every morning Aya goes for a $9$-kilometer-long walk and stops at a coffee shop afterwards. When she walks at a constant speed of $s$ kilometers per hour, the walk takes her 4 hours, including $t$ minutes spent in the coffee shop. When she wa ... [TRUNCATED] ... e coffee shop. Suppose Aya walks at $s+\\frac{1}{2}$ kilometers per hour. Find the number of minutes the walk takes her, including the $t$ minutes spent in the coffee shop.\nPlease reason step by step, and put your final answer within \\boxed{}."
+      "id": "32187d6f",
+      "content": "\nSolve the following math problem step by step. Put your answer inside \\boxed{}.\n\nEvery morning Aya goes for a $9$-kilometer-long walk and stops at a coffee shop afterwards. When she walks at a constant speed of $s$ kilometers per hour, the w ... [TRUNCATED 164 chars] ... g $t$ minutes spent in the coffee shop. Suppose Aya walks at $s+\\frac{1}{2}$ kilometers per hour. Find the number of minutes the walk takes her, including the $t$ minutes spent in the coffee shop.\n\nRemember to put your answer inside \\boxed{}."
     }
   ],
-  "target": "204",
+  "target": "\\boxed{204}",
   "id": 0,
-  "group_id": 0,
-  "metadata": {
-    "problem_id": 60,
-    "solution": "$\\frac{9}{s} + t = 4$ in hours and $\\frac{9}{s+2} + t = 2.4$ in hours.\nSubtracting the second equation from the first, we get, \n$\\frac{9}{s} - \\frac{9}{s+2} = 1.6$\nMultiplying by $(s)(s+2)$, we get \n$9s+18-9s=18=1.6s^{2} + 3.2s$\nMultiplying b ... [TRUNCATED] ... s = 2.5$. Now, $2.5+0.5 = 3$. Taking $\\frac{9}{3} = 3$, we find that it will take three hours for the 9 kilometers to be traveled. The t minutes spent at the coffeeshop can be written as $144-48(2.5)$, so t = 24. $180 + 24 = 204$. -sepehr2010"
-  }
+  "group_id": 0
 }
 ```

-*Note: Some content was truncated for display.*
-
 ## Prompt Template

 **Prompt Template:**
 ```text
+
+Solve the following math problem step by step. Put your answer inside \boxed{{}}.
+
 {question}
-Please reason step by step, and put your final answer within \boxed{{}}.
+
+Remember to put your answer inside \boxed{{}}.
 ```
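A note on the doubled braces: the template is rendered with Python's `str.format`, so `{{` and `}}` escape to literal braces while `{question}` stays a placeholder. A minimal sketch (the question string here is an illustrative placeholder, and the template text is copied from the block above):

```python
# Sketch: why the template writes \boxed{{}} rather than \boxed{}.
# str.format treats {} as a placeholder, so literal braces must be doubled.
template = (
    "\nSolve the following math problem step by step. "
    "Put your answer inside \\boxed{{}}.\n\n"
    "{question}\n\n"
    "Remember to put your answer inside \\boxed{{}}."
)
prompt = template.format(question="Find the sum of all integer bases $b>9$ ...")
print("\\boxed{}" in prompt)  # the doubled braces render as literal {}
```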

 ## Usage

docs/en/benchmarks/aime25.md

Lines changed: 6 additions & 19 deletions
@@ -25,15 +25,14 @@ AIME 2025 (American Invitational Mathematics Examination 2025) is a benchmark ba
 - Default configuration uses **0-shot** evaluation
 - Answers should be formatted within `\boxed{}` for proper extraction
 - Uses LLM-as-judge for mathematical equivalence checking
-- Subsets: AIME2025-I, AIME2025-II


 ## Properties

 | Property | Value |
 |----------|-------|
 | **Benchmark Name** | `aime25` |
-| **Dataset ID** | [opencompass/AIME2025](https://modelscope.cn/datasets/opencompass/AIME2025/summary) |
+| **Dataset ID** | [evalscope/aime25](https://modelscope.cn/datasets/evalscope/aime25/summary) |
 | **Paper** | N/A |
 | **Tags** | `Math`, `Reasoning` |
 | **Metrics** | `acc` |

@@ -46,26 +45,19 @@ AIME 2025 (American Invitational Mathematics Examination 2025) is a benchmark ba
 | Metric | Value |
 |--------|-------|
 | Total Samples | 30 |
-| Prompt Length (Mean) | 491.07 chars |
-| Prompt Length (Min/Max) | 212 / 1119 chars |
-
-**Per-Subset Statistics:**
-
-| Subset | Samples | Prompt Mean | Prompt Min | Prompt Max |
-|--------|---------|-------------|------------|------------|
-| `AIME2025-I` | 15 | 476.07 | 212 | 768 |
-| `AIME2025-II` | 15 | 506.07 | 234 | 1119 |
+| Prompt Length (Mean) | 604.93 chars |
+| Prompt Length (Min/Max) | 208 / 1862 chars |

 ## Sample Example

-**Subset**: `AIME2025-I`
+**Subset**: `default`

 ```json
 {
   "input": [
     {
-      "id": "99875e2d",
-      "content": "\nSolve the following math problem step by step. Put your answer inside \\boxed{}.\n\nFind the sum of all integer bases $b>9$ for which $17_{b}$ is a divisor of $97_{b}$.\n\nRemember to put your answer inside \\boxed{}."
+      "id": "bff66863",
+      "content": "\nSolve the following math problem step by step. Put your answer inside \\boxed{}.\n\nFind the sum of all integer bases $b>9$ for which $17_b$ is a divisor of $97_b.$\n\nRemember to put your answer inside \\boxed{}."
     }
   ],
   "target": "70",

@@ -110,11 +102,6 @@ task_cfg = TaskConfig(
     api_url='OPENAI_API_COMPAT_URL',
     api_key='EMPTY_TOKEN',
     datasets=['aime25'],
-    dataset_args={
-        'aime25': {
-            # subset_list: ['AIME2025-I', 'AIME2025-II'] # optional, evaluate specific subsets
-        }
-    },
     limit=10, # Remove this line for formal evaluation
 )

docs/en/benchmarks/aime26.md

Lines changed: 111 additions & 0 deletions
@@ -0,0 +1,111 @@
+# AIME-2026
+
+## Overview
+
+AIME 2026 (American Invitational Mathematics Examination 2026) is a benchmark based on problems from the prestigious AIME competition, one of the most challenging high school mathematics contests in the United States. It tests advanced mathematical reasoning and problem-solving skills.
+
+## Task Description
+
+- **Task Type**: Competition Mathematics Problem Solving
+- **Input**: AIME-level mathematical problem
+- **Output**: Integer answer (0-999) with step-by-step reasoning
+- **Difficulty**: Advanced high school / early undergraduate level
+
+## Key Features
+
+- Problems from AIME I and AIME II 2026 competitions
+- Answers are always integers between 0 and 999
+- Requires creative mathematical reasoning and problem-solving
+- Topics: algebra, geometry, number theory, combinatorics, probability
+- Represents top-tier high school mathematics competition difficulty
+
+## Evaluation Notes
+
+- Default configuration uses **0-shot** evaluation
+- Answers should be formatted within `\boxed{}` for proper extraction
+- Uses LLM-as-judge for mathematical equivalence checking
+
+## Properties
+
+| Property | Value |
+|----------|-------|
+| **Benchmark Name** | `aime26` |
+| **Dataset ID** | [evalscope/aime26](https://modelscope.cn/datasets/evalscope/aime26/summary) |
+| **Paper** | N/A |
+| **Tags** | `Math`, `Reasoning` |
+| **Metrics** | `acc` |
+| **Default Shots** | 0-shot |
+| **Evaluation Split** | `test` |
+
+## Data Statistics
+
+| Metric | Value |
+|--------|-------|
+| Total Samples | 30 |
+| Prompt Length (Mean) | 517.7 chars |
+| Prompt Length (Min/Max) | 263 / 1026 chars |
+
+## Sample Example
+
+**Subset**: `default`
+
+```json
+{
+  "input": [
+    {
+      "id": "a6d469bc",
+      "content": "\nSolve the following math problem step by step. Put your answer inside \\boxed{}.\n\nPatrick started walking at a constant rate along a straight road from school to the park. One hour after Patrick left, Tanya started running along the same road ... [TRUNCATED 270 chars] ... ya ran, and all three arrived at the park at the same time. The distance from the school to the park is $\\frac{m}{n}$ miles, where $m$ and $n$ are relatively prime positive integers. Find $m + n$.\n\nRemember to put your answer inside \\boxed{}."
+    }
+  ],
+  "target": "277",
+  "id": 0,
+  "group_id": 0
+}
+```
+
+## Prompt Template
+
+**Prompt Template:**
+```text
+
+Solve the following math problem step by step. Put your answer inside \boxed{{}}.
+
+{question}
+
+Remember to put your answer inside \boxed{{}}.
+```
+
+## Usage
+
+### Using CLI
+
+```bash
+evalscope eval \
+  --model YOUR_MODEL \
+  --api-url OPENAI_API_COMPAT_URL \
+  --api-key EMPTY_TOKEN \
+  --datasets aime26 \
+  --limit 10 # Remove this line for formal evaluation
+```
+
+### Using Python
+
+```python
+from evalscope import run_task
+from evalscope.config import TaskConfig
+
+task_cfg = TaskConfig(
+    model='YOUR_MODEL',
+    api_url='OPENAI_API_COMPAT_URL',
+    api_key='EMPTY_TOKEN',
+    datasets=['aime26'],
+    limit=10, # Remove this line for formal evaluation
+)
+
+run_task(task_cfg=task_cfg)
+```

docs/en/get_started/supported_dataset/llm.md

Lines changed: 2 additions & 0 deletions
@@ -7,6 +7,7 @@ Below is the list of supported LLM benchmarks. Click on a benchmark name for det
 | `aa_lcr` | [AA-LCR](../../benchmarks/aa_lcr.md) | `Knowledge`, `LongContext`, `Reasoning` |
 | `aime24` | [AIME-2024](../../benchmarks/aime24.md) | `Math`, `Reasoning` |
 | `aime25` | [AIME-2025](../../benchmarks/aime25.md) | `Math`, `Reasoning` |
+| `aime26` | [AIME-2026](../../benchmarks/aime26.md) | `Math`, `Reasoning` |
 | `alpaca_eval` | [AlpacaEval2.0](../../benchmarks/alpaca_eval.md) | `Arena`, `InstructionFollowing` |
 | `amc` | [AMC](../../benchmarks/amc.md) | `Math`, `Reasoning` |
 | `anat_em` | [AnatEM](../../benchmarks/anat_em.md) | `Knowledge`, `NER` |

@@ -116,6 +117,7 @@ Below is the list of supported LLM benchmarks. Click on a benchmark name for det
 ../../benchmarks/aa_lcr.md
 ../../benchmarks/aime24.md
 ../../benchmarks/aime25.md
+../../benchmarks/aime26.md
 ../../benchmarks/alpaca_eval.md
 ../../benchmarks/amc.md
 ../../benchmarks/anat_em.md
docs/en/user_guides/stress_test/sla_auto_tune.md

Lines changed: 59 additions & 6 deletions
@@ -11,12 +11,15 @@ The SLA (Service Level Agreement) auto-tuning feature allows users to define ser

 ## Parameter Description

-See [Parameter Description](./parameters.md#sla-settings) for details.
-
-Main parameters:
-- `--sla-auto-tune`: Enable auto-tuning.
-- `--sla-variable`: Adjustment variable, `parallel` or `rate`.
-- `--sla-params`: Define SLA rules.
+| Parameter | Type | Description | Default |
+|-----------|------|-------------|---------|
+| `--sla-auto-tune` | `bool` | Whether to enable SLA auto-tuning mode | `False` |
+| `--sla-variable` | `str` | Variable to auto-tune<br>Options: `parallel` (concurrency), `rate` (request rate) | `parallel` |
+| `--sla-params` | `str` | SLA constraints as a JSON string; supports multiple constraint groups (AND/OR logic), see [description below](#sla-params-logic) | `None` |
+| `--sla-upper-bound` | `int` | Maximum concurrency/rate limit during auto-tuning | `65536` |
+| `--sla-lower-bound` | `int` | Minimum concurrency/rate limit during auto-tuning | `1` |
+| `--sla-num-runs` | `int` | Number of repeated runs per test point (the average is taken to reduce fluctuation) | `3` |
+| `--sla-number-multiplier` | `float` | Multiplier of total requests relative to the concurrency/rate per test, i.e. `number = round(parallel × N)`; defaults to `2` when not set | `None` |

 ## Supported Metrics and Operators

@@ -31,6 +34,56 @@ Main parameters:
 | **Throughput** | `rps` | Requests per second | `>=`, `>`, `max` |
 | | `tps` | Tokens per second | `>=`, `>`, `max` |

+## `--sla-params` Logic
+
+`--sla-params` accepts a **JSON array**, where each element is an **object (group)**. The logic rules are:
+
+- **Multiple metrics within the same object**: **AND** (all must be satisfied simultaneously)
+- **Between different objects**: **OR** (satisfying any one group is sufficient)
+
+The overall semantics are: `(Group1 ConditionA AND Group1 ConditionB) OR (Group2 ConditionC AND Group2 ConditionD) OR ...`
+
+### AND Example: Satisfy TTFT and TPOT Simultaneously
+
+Write multiple metrics in the **same object** to indicate they must **all** be satisfied:
+
+```bash
+--sla-params '[{"avg_ttft": "<=2", "avg_tpot": "<=0.05"}]'
+```
+
+Meaning: find the maximum concurrency satisfying **`avg_ttft <= 2s` AND `avg_tpot <= 0.05s`**. A concurrency level passes only when both metrics are met.
+
+### OR Example: Independently Evaluate Multiple TTFT Thresholds
+
+Write each metric in a **different object** so each group of conditions is evaluated **independently**:
+
+```bash
+--sla-params '[{"p99_ttft": "<0.05"}, {"p99_ttft": "<0.01"}]'
+```
+
+Meaning: find the maximum request rate satisfying **`p99_ttft < 0.05s`** and, separately, the maximum satisfying **`p99_ttft < 0.01s`**; each group reports its result independently.
+
+### AND + OR Combined Example
+
+```bash
+--sla-params '[{"avg_ttft": "<=1", "avg_tpot": "<=0.05"}, {"p99_latency": "<=5"}]'
+```
+
+Meaning:
+- **Group 1**: `avg_ttft <= 1s` **AND** `avg_tpot <= 0.05s` (both satisfied simultaneously)
+- **Group 2**: `p99_latency <= 5s`
+- Each group independently completes a binary search and reports its own maximum concurrency.
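The AND/OR semantics above can be sketched in a few lines of Python. This is an illustrative model only, not EvalScope's actual implementation, and it covers only the threshold operators (`<=`, `>=`, `<`, `>`), not the `max`/`min` extremum mode:

```python
# Sketch of --sla-params group logic: AND across metrics inside one object,
# OR across objects. `measured` maps metric names to observed values.
import json
import operator
import re

OPS = {"<=": operator.le, ">=": operator.ge, "<": operator.lt, ">": operator.gt}

def sla_satisfied(sla_params: str, measured: dict) -> bool:
    groups = json.loads(sla_params)
    def cond_ok(metric, expr):
        op, threshold = re.match(r"(<=|>=|<|>)(.+)", expr).groups()
        return OPS[op](measured[metric], float(threshold))
    # any(...) = OR across groups; all(...) = AND within a group
    return any(all(cond_ok(m, e) for m, e in g.items()) for g in groups)

measured = {"avg_ttft": 0.8, "avg_tpot": 0.06, "p99_latency": 4.2}
# Group 1 fails (avg_tpot > 0.05) but Group 2 passes, so the overall result is True:
print(sla_satisfied('[{"avg_ttft": "<=1", "avg_tpot": "<=0.05"}, {"p99_latency": "<=5"}]', measured))
```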
+
+### Extremum Optimization Mode
+
+When the array contains **only one object with only one metric**, and the operator is `max` or `min`, the tool enters extremum optimization mode and directly searches for the pressure value that optimizes that metric:
+
+```bash
+--sla-params '[{"tps": "max"}]'
+```
+
+Meaning: find the concurrency that maximizes TPS (token throughput).
+
 ## Workflow

 1. **Baseline Test**: Start testing with the user-specified initial `parallel` or `rate` (a small value such as 1 or 2 is recommended).
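The per-group binary search mentioned above can be sketched as follows. `run_benchmark` and the toy TTFT model are hypothetical stand-ins for a real stress-test run, and the sketch assumes the SLA predicate is monotone in concurrency (true at low concurrency, false past some threshold):

```python
# Sketch: binary-search the largest concurrency in [lower, upper] whose
# measured metrics still pass the SLA predicate.
def find_max_parallel(run_benchmark, sla_ok, lower=1, upper=65536):
    best = None
    while lower <= upper:
        mid = (lower + upper) // 2
        if sla_ok(run_benchmark(mid)):
            best = mid           # SLA holds here; try higher concurrency
            lower = mid + 1
        else:
            upper = mid - 1      # SLA violated; back off
    return best

# Toy model: TTFT grows linearly with concurrency; SLA requires avg_ttft <= 2s.
fake_run = lambda parallel: {"avg_ttft": 0.01 * parallel}
print(find_max_parallel(fake_run, lambda m: m["avg_ttft"] <= 2))  # 200
```

Returning `None` when no concurrency in the range passes mirrors the case where even the lower bound violates the SLA.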
