
Commit 537470c

polish doc
1 parent 76b53a6 commit 537470c

File tree

2 files changed: +75 -48 lines


docs/sphinx_doc/source/tutorial/metrics_reference.md

Lines changed: 37 additions & 23 deletions
@@ -20,7 +20,17 @@ In the following, metrics are categorized by their source component (where they
 
 Explorer metrics track performance during the rollout phase where the model generates responses, including rollout metrics (`rollout/`), eval metrics (`eval/`), and some time metrics (`time/`).
 
-#### Metric Aggregation Levels
+
+#### Rollout Metrics (`rollout/`)
+
+Rollout metrics track performance during the rollout phase where the model generates responses.
+
+- **Format**: `rollout/{metric_name}/{statistic}`
+- **Examples**:
+  - `rollout/accuracy/mean`: Average accuracy of generated responses
+  - `rollout/format_score/mean`: Average format correctness score
+
+**Metric Aggregation Process**:
 
 Consider an exploration step with `batch_size` tasks, where each task has `repeat_times` runs. Rollout metrics (e.g., `rollout/`) are computed and aggregated at different levels:

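A minimal Python sketch of the two-level rollout aggregation described above (illustrative values and helper code, not the framework's actual implementation):

```python
import statistics

# Each inner list holds the per-run accuracy of one task:
# repeat_times = 4 runs per task, batch_size = 2 tasks in this toy example.
runs_per_task = [
    [0.8, 0.8, 0.9, 0.9],   # task 1 -> task-level rollout/accuracy = 0.85
    [0.6, 0.7, 0.7, 0.8],   # task 2 (hypothetical) -> 0.70
]

# Task level: average the metric over the repeat_times runs of each task.
task_level = [statistics.mean(runs) for runs in runs_per_task]

# Step level: aggregate the task-level values across the batch; the mean is
# what would be logged under a key like `rollout/accuracy/mean`.
step_metric = {"rollout/accuracy/mean": statistics.mean(task_level)}  # 0.775
print(step_metric)
```
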
@@ -60,13 +70,28 @@ graph TD
     end
 ```
 
+#### Eval Metrics (`eval/`) and Benchmark Metrics (`bench/`)
+
+Evaluation metrics measure model performance on held-out evaluation tasks. These metrics are computed during periodic evaluation runs.
+
+- **Format**: `eval/{task_name}/{metric_name}/{statistic}` or `bench/{task_name}/{metric_name}/{statistic}`
+- **Examples**:
+  - `eval/gsm8k-eval/accuracy/mean@4`: Mean accuracy across repeat_times=4 runs
+  - `bench/gsm8k-eval/accuracy/best@4`: Best accuracy value across repeat_times=4 runs
+
+- **Note**:
+  - Eval and bench metrics are computed in the same way; the only difference is the prefix of the metric name.
+  - By default, only the *mean* of the metric is returned. If you want to return detailed statistics, you can set `monitor.detailed_stats` to `True` in the config.
+
 Consider an evaluation step with `len(eval_taskset)` tasks, where each task has `repeat_times` runs. Evaluation metrics (e.g., `eval/`, `bench/`) are computed and aggregated at different levels:
 
 - **Task level**: Task-level metrics (e.g., `mean@4`, `std@4`, `best@2`, `worst@2`) are computed from the k runs of each task.
 
 - **Step level**: By default, we report the mean of the metric across all evaluation tasks. For example, `mean@k`, `std@k`, `best@k`, `worst@k` of the metrics across all evaluation tasks are reported. If you want to return detailed statistics, including mean, std, min, and max, you can set `monitor.detailed_stats` to `True` in the config.
 
-The following diagram illustrates the aggregation process on a dummy dataset with three tasks for evaluation metrics. By default, `mean@k`, `std@k`, `best@k`, `worst@k` of the metrics across all evaluation tasks are reported. you may configure `monitor.detailed_stats` to `True` in the config to return detailed statistics, including mean, std, min, max, e.g., `eval/dummy/accuracy/mean@2/mean=0.83`, `eval/dummy/accuracy/mean@2/std=0.062`, `eval/dummy/accuracy/mean@2/max=0.9`, and `eval/dummy/accuracy/mean@2/min=0.75`.
+**Metric Aggregation Process**:
+
+The following diagram illustrates the aggregation process on a dummy dataset with three tasks for evaluation metrics. By default, `mean@k`, `std@k`, `best@k`, `worst@k` of the metrics across all evaluation tasks are reported. You can set `monitor.detailed_stats` to `True` in the config to return detailed statistics.
 
 ```mermaid
 graph TD
@@ -98,28 +123,17 @@ graph TD
     end
 ```
 
+When you set `monitor.detailed_stats` to `True`, you will get detailed statistics including mean, std, min, max, e.g., `eval/dummy/accuracy/mean@2/mean=0.83`, `eval/dummy/accuracy/mean@2/std=0.062`, `eval/dummy/accuracy/mean@2/max=0.9`, and `eval/dummy/accuracy/mean@2/min=0.75`:
 
-#### Rollout Metrics (`rollout/`)
-
-Rollout metrics track performance during the rollout phase where the model generates responses.
-
-- **Format**: `rollout/{metric_name}/{statistic}`
-- **Examples**:
-  - `rollout/accuracy/mean`: Average accuracy of generated responses
-  - `rollout/format_score/mean`: Average format correctness score
-
-#### Eval Metrics (`eval/`) and Benchmark Metrics (`bench/`)
-
-Evaluation metrics measure model performance on held-out evaluation tasks. These metrics are computed during periodic evaluation runs.
-
-- **Format**: `eval/{task_name}/{metric_name}/{statistic}` or `bench/{task_name}/{metric_name}/{statistic}`
-- **Examples**:
-  - `eval/gsm8k-eval/accuracy/mean@4`: Mean accuracy across repeat_times=4 runs
-  - `bench/gsm8k-eval/accuracy/best@4`: Best accuracy value across repeat_times=4 runs
-
-- **Note**:
-  - Eval and bench metrics are computed in the same way, the only difference is the prefix of the metric name.
-  - By default, only the *mean* of the metric is returned. If you want to return detailed statistics, you can set `monitor.detailed_stats` to `True` in the config.
+```mermaid
+graph TD
+    subgraph Step_Metrics_Summary["Detailed Statistics"]
+        Stat1["eval/dummy/accuracy/mean@2/mean=0.83<br/>eval/dummy/accuracy/mean@2/std=0.062<br/>eval/dummy/accuracy/mean@2/max=0.9<br/>eval/dummy/accuracy/mean@2/min=0.75"]
+        Stat2["eval/dummy/accuracy/std@2/mean=0.083<br/>eval/dummy/accuracy/std@2/std=0.047<br/>eval/dummy/accuracy/std@2/max=0.15<br/>eval/dummy/accuracy/std@2/min=0.05"]
+        Step_Metrics --> Stat1
+        Step_Metrics --> Stat2
+    end
+```
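
To make the task-level statistics concrete, here is a small Python sketch (not the project's actual implementation). The per-run accuracies are hypothetical values, chosen only so that the resulting summaries are consistent with the numbers quoted above; a population standard deviation is assumed for `std@k`:

```python
import statistics

# Hypothetical per-run accuracies for three dummy eval tasks (repeat_times = 2).
runs_per_task = {
    "task1": [0.60, 0.90],
    "task2": [0.80, 0.90],
    "task3": [0.85, 0.95],
}

# Task level: mean@2, std@2, best@2, worst@2 computed from the k = 2 runs of each task.
task_level = {
    name: {
        "mean@2": statistics.mean(runs),
        "std@2": statistics.pstdev(runs),  # population std assumed here
        "best@2": max(runs),
        "worst@2": min(runs),
    }
    for name, runs in runs_per_task.items()
}
print(task_level["task1"])  # -> mean@2=0.75, std@2~0.15, best@2=0.9, worst@2=0.6
```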
 
 
 #### Time Metrics (`time/`)

docs/sphinx_doc/source_zh/tutorial/metrics_reference.md

Lines changed: 38 additions & 25 deletions
@@ -20,7 +20,17 @@
 
 Explorer metrics track performance during the rollout phase where the model generates responses, including rollout metrics (`rollout/`), eval metrics (`eval/`), and some time metrics (`time/`).
 
-#### Metric Aggregation Levels
+
+#### Rollout Metrics (`rollout/`)
+
+Rollout metrics track performance during the rollout phase where the model generates responses.
+
+- **Format**: `rollout/{metric_name}/{statistic}`
+- **Examples**:
+  - `rollout/accuracy/mean`: Average accuracy of generated responses
+  - `rollout/format_score/mean`: Average format correctness score
+
+**Metric Aggregation Process**:
 
 Consider an exploration step with `batch_size` tasks, where each task has `repeat_times` runs. Rollout metrics (e.g., `rollout/`) are computed and aggregated at different levels:

@@ -60,13 +70,28 @@ graph TD
     end
 ```
 
+#### Eval Metrics (`eval/`) and Benchmark Metrics (`bench/`)
+
+Evaluation metrics measure model performance on held-out evaluation tasks. These metrics are computed during periodic evaluation runs.
+
+- **Format**: `eval/{task_name}/{metric_name}/{statistic}` or `bench/{task_name}/{metric_name}/{statistic}`
+- **Examples**:
+  - `eval/gsm8k-eval/accuracy/mean@4`: Mean accuracy across repeat_times=4 runs
+  - `bench/gsm8k-eval/accuracy/best@4`: Best accuracy value across repeat_times=4 runs
+
+- **Note**:
+  - Eval and bench metrics are computed in the same way; the only difference is the prefix of the metric name.
+  - By default, only the *mean* of the metric is returned. If you want to return detailed statistics, you can set `monitor.detailed_stats` to `True` in the config.
+
 Consider an evaluation step with `len(eval_taskset)` tasks, where each task has `repeat_times` runs. Evaluation metrics (e.g., `eval/`, `bench/`) are computed and aggregated at different levels:
 
 - **Task level**: Task-level metrics (e.g., `mean@4`, `std@4`, `best@2`, `worst@2`) are computed from the k runs of each task.
 
 - **Step level**: By default, we report the mean of the metric across all evaluation tasks. For example, `mean@k`, `std@k`, `best@k`, `worst@k` of the metrics across all evaluation tasks are reported. If you want to return detailed statistics, including mean, std, min, and max, you can set `monitor.detailed_stats` to `True` in the config.
 
-The following diagram illustrates the aggregation process on a dummy dataset with three tasks for evaluation metrics. By default, `mean@k`, `std@k`, `best@k`, `worst@k` of the metrics across all evaluation tasks are reported. You can set `monitor.detailed_stats` to `True` in the config to return detailed statistics, including mean, std, min, and max, e.g., `eval/dummy/accuracy/mean@2/mean=0.83`, `eval/dummy/accuracy/mean@2/std=0.062`, `eval/dummy/accuracy/mean@2/max=0.9`, and `eval/dummy/accuracy/mean@2/min=0.75`.
+**Metric Aggregation Process**:
+
+The following diagram illustrates the aggregation process on a dummy dataset with three tasks for evaluation metrics. By default, `mean@k`, `std@k`, `best@k`, `worst@k` of the metrics across all evaluation tasks are reported. You can set `monitor.detailed_stats` to `True` in the config to return detailed statistics.
 
 ```mermaid
 graph TD
@@ -97,29 +122,17 @@ graph TD
     Task3_Metric --> Step_Metrics
     end
 ```
+When you set `monitor.detailed_stats` to `True`, you will get detailed statistics including mean, std, min, and max, e.g., `eval/dummy/accuracy/mean@2/mean=0.83`, `eval/dummy/accuracy/mean@2/std=0.062`, `eval/dummy/accuracy/mean@2/max=0.9`, and `eval/dummy/accuracy/mean@2/min=0.75`:
 
-
-#### Rollout Metrics (`rollout/`)
-
-Rollout metrics track performance during the rollout phase where the model generates responses.
-
-- **Format**: `rollout/{metric_name}/{statistic}`
-- **Examples**:
-  - `rollout/accuracy/mean`: Average accuracy of generated responses
-  - `rollout/format_score/mean`: Average format correctness score
-
-#### Eval Metrics (`eval/`) and Benchmark Metrics (`bench/`)
-
-Evaluation metrics measure model performance on held-out evaluation tasks. These metrics are computed during periodic evaluation runs.
-
-- **Format**: `eval/{task_name}/{metric_name}/{statistic}` or `bench/{task_name}/{metric_name}/{statistic}`
-- **Examples**:
-  - `eval/gsm8k-eval/accuracy/mean@4`: Mean accuracy across repeat_times=4 runs
-  - `bench/gsm8k-eval/accuracy/best@4`: Best accuracy value across repeat_times=4 runs
-
-- **Note**:
-  - Eval and bench metrics are computed in the same way; the only difference is the prefix of the metric name.
-  - By default, only the *mean* of the metric is returned. If you want to return detailed statistics, you can set `monitor.detailed_stats` to `True` in the config.
+```mermaid
+graph TD
+    subgraph Step_Metrics_Summary["Detailed Statistics"]
+        Stat1["eval/dummy/accuracy/mean@2/mean=0.83<br/>eval/dummy/accuracy/mean@2/std=0.062<br/>eval/dummy/accuracy/mean@2/max=0.9<br/>eval/dummy/accuracy/mean@2/min=0.75"]
+        Stat2["eval/dummy/accuracy/std@2/mean=0.083<br/>eval/dummy/accuracy/std@2/std=0.047<br/>eval/dummy/accuracy/std@2/max=0.15<br/>eval/dummy/accuracy/std@2/min=0.05"]
+        Step_Metrics --> Stat1
+        Step_Metrics --> Stat2
+    end
+```
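
The step-level detailed statistics quoted above can be reproduced with a short Python sketch (the three task-level `mean@2` values are inferred from the summary numbers, and a population standard deviation is assumed; this is illustrative, not the framework's actual code):

```python
import statistics

# Task-level mean@2 values for the three dummy eval tasks, inferred so that the
# step-level summaries match the numbers quoted above.
task_means = [0.75, 0.85, 0.90]

detailed = {
    "eval/dummy/accuracy/mean@2/mean": round(statistics.mean(task_means), 2),   # 0.83
    "eval/dummy/accuracy/mean@2/std": round(statistics.pstdev(task_means), 3),  # 0.062
    "eval/dummy/accuracy/mean@2/max": max(task_means),                          # 0.9
    "eval/dummy/accuracy/mean@2/min": min(task_means),                          # 0.75
}
print(detailed)
```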
 
 
 #### Time Metrics (`time/`)
@@ -165,11 +178,11 @@ graph TD
         Run1_2["Run 2<br/>reward_mean: 0.8"]
         Run2_1["Run 3<br/>reward_mean: 0.9"]
         Run2_2["Run 4<br/>reward_mean: 0.9"]
+        end
     Run1_1 --> Task1_Metric["rollout/accuracy<br/>= 0.85"]
     Run1_2 --> Task1_Metric
     Run2_1 --> Task1_Metric
     Run2_2 --> Task1_Metric
-        end
     end
 ```
