
Commit 397caaa

improve doc
1 parent 537470c commit 397caaa

File tree

2 files changed (+15, -16 lines)


docs/sphinx_doc/source/tutorial/metrics_reference.md

Lines changed: 7 additions & 8 deletions

@@ -34,12 +34,11 @@ Rollout metrics track performance during the rollout phase where the model generates responses
 
 Consider an exploration step with `batch_size` tasks, where each task has `repeat_times` runs. Rollout metrics (e.g., `rollout/`) are computed and aggregated at different levels:
 
-- **Task level**: Metrics aggregated across `repeat_times` runs of the same task. For example, `rollout/accuracy` is the average accuracy of all runs of the task.
+- From *run level* to *task level*: In the `calculate_task_level_metrics` function, metrics are aggregated across the `repeat_times` runs of the same task. For example, `rollout/accuracy` is the average accuracy of all runs of the task.
 
-- **Step level**: Metrics are reported at the step level. For example, `rollout/accuracy/mean`, `rollout/accuracy/max`, `rollout/accuracy/min` are the average, max, and min accuracy (`rollout/accuracy`) of all tasks in the step.
+- From *task level* to *step level*: In the `gather_metrics` function, metrics are aggregated across all tasks in the step. For example, `rollout/accuracy/mean`, `rollout/accuracy/max`, `rollout/accuracy/min` are the average, max, and min accuracy (`rollout/accuracy`) of all tasks in the step.
 
 The following diagram illustrates the aggregation process for rollout metrics:
-
 ```mermaid
 graph TD
     subgraph Step["Batch_size=3 Tasks"]
@@ -49,7 +48,6 @@ graph TD
             Run1_1 --> Task1_Metric["rollout/accuracy<br/>= 0.85"]
             Run1_2 --> Task1_Metric
         end
-
         subgraph Task2["Task 2 (repeat_times=2)"]
             Run2_1["Run 1<br/>accuracy: 0.6"]
             Run2_2["Run 2<br/>accuracy: 0.9"]
@@ -64,7 +62,7 @@ graph TD
             Run3_2 --> Task3_Metric
         end
 
-        Task1_Metric --> Step_Metrics["Step Level Metrics<br/>rollout/accuracy/mean=0.83<br/>rollout/accuracy/max=0.9<br/>rollout/accuracy/min=0.6"]
+        Task1_Metric --> Step_Metrics["Step Level Metrics<br/>rollout/accuracy/mean=0.83<br/>rollout/accuracy/max=0.9<br/>rollout/accuracy/min=0.75"]
         Task2_Metric --> Step_Metrics
         Task3_Metric --> Step_Metrics
     end
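
The two aggregation levels documented in this hunk are easy to check numerically. Below is a minimal sketch: the function names `calculate_task_level_metrics` and `gather_metrics` come from the doc text, but their bodies here are illustrative assumptions, and the per-run accuracies are invented values chosen so the task-level means come out to the diagram's 0.85, 0.75, and 0.9.

```python
# Minimal sketch of the two-level rollout aggregation described above.
# calculate_task_level_metrics / gather_metrics are named in the doc;
# their bodies here are illustrative assumptions, not the real implementation.
from statistics import mean

def calculate_task_level_metrics(runs: list[dict]) -> dict:
    # Run level -> task level: average each metric over the repeat_times runs.
    return {key: mean(run[key] for run in runs) for key in runs[0]}

def gather_metrics(task_metrics: list[dict]) -> dict:
    # Task level -> step level: mean/max/min of each metric across all tasks.
    step = {}
    for key in task_metrics[0]:
        values = [t[key] for t in task_metrics]
        step[f"{key}/mean"] = mean(values)
        step[f"{key}/max"] = max(values)
        step[f"{key}/min"] = min(values)
    return step

# batch_size=3 tasks, repeat_times=2; run accuracies are assumed values
# that reproduce the diagram's task-level numbers (0.85, 0.75, 0.9).
tasks = [
    [{"rollout/accuracy": 0.8}, {"rollout/accuracy": 0.9}],
    [{"rollout/accuracy": 0.6}, {"rollout/accuracy": 0.9}],
    [{"rollout/accuracy": 0.9}, {"rollout/accuracy": 0.9}],
]
print(gather_metrics([calculate_task_level_metrics(r) for r in tasks]))
# {'rollout/accuracy/mean': 0.833..., 'rollout/accuracy/max': 0.9,
#  'rollout/accuracy/min': 0.75}
```

This also shows why the diagram's `min` had to change from 0.6 to 0.75: the step-level `min` is taken over task-level averages, not over raw runs, so Task 2's 0.6 run is first averaged up to 0.75.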
@@ -83,13 +81,14 @@ Evaluation metrics measure model performance on held-out evaluation tasks. These
 - Eval and bench metrics are computed in the same way, the only difference is the prefix of the metric name.
 - By default, only the *mean* of the metric is returned. If you want to return detailed statistics, you can set `monitor.detailed_stats` to `True` in the config.
 
+**Metric Aggregation Process**:
+
 Consider an evaluation step with `len(eval_taskset)` tasks, where each task has `repeat_times` runs. Evaluation metrics (e.g., `eval/`, `bench/`) are computed and aggregated at different levels:
 
-- **Task level**: Task-level metrics include (e.g., `mean@4`, `std@4`, `best@2`, `worst@2`) that are computed from k runs of the task.
+- From *run level* to *task level*: In the `calculate_task_level_metrics` function, metrics are aggregated across the `repeat_times` runs of the same task. For example, `eval/dummy/accuracy/mean@2` is the average accuracy of all runs of the task.
 
-- **Step level**: By default, we report the mean of the metric across all evaluation tasks. For example, `mean@k`, `std@k`, `best@k`, `worst@k` of the metrics across all evaluation tasks are reported. If you want to return detailed statistics, including mean, std, min, max, you can set `monitor.detailed_stats` to `True` in the config.
+- From *task level* to *step level*: In the `gather_eval_metrics` function, metrics are aggregated across all tasks in the step. For example, `eval/dummy/accuracy/mean@2`, `eval/dummy/accuracy/std@2`, `eval/dummy/accuracy/best@2`, `eval/dummy/accuracy/worst@2` are the average, std, best, and worst accuracy (`eval/dummy/accuracy`) of all tasks in the step.
 
-**Metric Aggregation Process**:
 
 The following diagram illustrates the aggregation process on a dummy dataset with three tasks for evaluation metrics. By default, `mean@k`, `std@k`, `best@k`, `worst@k` of the metrics across all evaluation tasks are reported. You can set `monitor.detailed_stats` to `True` in the config to return detailed statistics.
 
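For the eval path, the same two-level flow applies, but with `@k` statistics at the task level. Here is a hedged sketch: `gather_eval_metrics` is named in the doc, while `eval_task_level_metrics`, the metric prefix, the run values, and the use of population standard deviation are all assumptions for illustration.

```python
# Illustrative sketch of eval-metric aggregation. gather_eval_metrics is named
# in the doc; eval_task_level_metrics and all values below are assumptions.
from statistics import mean, pstdev

def eval_task_level_metrics(runs: list[float], k: int,
                            prefix: str = "eval/dummy/accuracy") -> dict:
    # Run level -> task level: mean@k / std@k / best@k / worst@k over k runs.
    return {
        f"{prefix}/mean@{k}": mean(runs),
        f"{prefix}/std@{k}": pstdev(runs),  # population std (an assumption)
        f"{prefix}/best@{k}": max(runs),
        f"{prefix}/worst@{k}": min(runs),
    }

def gather_eval_metrics(task_metrics: list[dict]) -> dict:
    # Task level -> step level: by default, only the mean across tasks.
    return {key: mean(t[key] for t in task_metrics) for key in task_metrics[0]}

# len(eval_taskset)=3 dummy tasks, repeat_times=2 (assumed accuracies).
tasks = [[0.8, 0.9], [0.6, 0.9], [0.9, 0.9]]
step = gather_eval_metrics([eval_task_level_metrics(r, k=2) for r in tasks])
print(step["eval/dummy/accuracy/mean@2"])   # ~0.83 (mean of 0.85, 0.75, 0.9)
print(step["eval/dummy/accuracy/best@2"])   # 0.9
```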

docs/sphinx_doc/source_zh/tutorial/metrics_reference.md

Lines changed: 8 additions & 8 deletions

@@ -34,9 +34,9 @@ Rollout metrics track the performance of the rollout phase where the model generates responses.
 
 Consider an exploration step with `batch_size` tasks, where each task has `repeat_times` runs. Rollout metrics (e.g., `rollout/`) are computed and aggregated at different levels:
 
-- **Task level**: Metrics aggregated across the `repeat_times` runs of the same task. For example, `rollout/accuracy` is the average accuracy of all runs of the task.
+- From *run level* to *task level*: In the `calculate_task_level_metrics` function, metrics are aggregated across the `repeat_times` runs of the same task. For example, `rollout/accuracy` is the average accuracy of all runs of the task.
 
-- **Step level**: Metrics are reported at the step level. For example, `rollout/accuracy/mean`, `rollout/accuracy/max`, `rollout/accuracy/min` are respectively the average, maximum, and minimum accuracy (`rollout/accuracy`) of all tasks in the step.
+- From *task level* to *step level*: In the `gather_metrics` function, metrics are aggregated across all tasks in the step. For example, `rollout/accuracy/mean`, `rollout/accuracy/max`, `rollout/accuracy/min` are respectively the average, maximum, and minimum accuracy (`rollout/accuracy`) of all tasks in the step.
 
 The following diagram illustrates the aggregation process for rollout metrics:
 
@@ -64,7 +64,7 @@ graph TD
             Run3_2 --> Task3_Metric
         end
 
-        Task1_Metric --> Step_Metrics["Step Level Metrics<br/>rollout/accuracy/mean=0.83<br/>rollout/accuracy/max=0.9<br/>rollout/accuracy/min=0.6"]
+        Task1_Metric --> Step_Metrics["Step Level Metrics<br/>rollout/accuracy/mean=0.83<br/>rollout/accuracy/max=0.9<br/>rollout/accuracy/min=0.75"]
         Task2_Metric --> Step_Metrics
         Task3_Metric --> Step_Metrics
     end
@@ -83,13 +83,13 @@ graph TD
 - Eval and bench metrics are computed in the same way; the only difference is the prefix of the metric name.
 - By default, only the *mean* of the metric is returned. If you want to return detailed statistics, you can set `monitor.detailed_stats` to `True` in the config.
 
-Consider an evaluation step with `len(eval_taskset)` tasks, where each task has `repeat_times` runs. Evaluation metrics (e.g., `eval/`, `bench/`) are computed and aggregated at different levels
+**Metric Aggregation Process**:
 
-- **Task level**: Task-level metrics include (e.g., `mean@4`, `std@4`, `best@2`, `worst@2`) metrics computed from the task's k runs.
+Consider an evaluation step with `len(eval_taskset)` tasks, where each task has `repeat_times` runs. Evaluation metrics (e.g., `eval/`, `bench/`) are computed and aggregated at different levels:
 
-- **Step level**: By default, we report the mean of the metric across all evaluation tasks. For example, the `mean@k`, `std@k`, `best@k`, `worst@k` of the metrics across all evaluation tasks are reported. If you want to return detailed statistics, including mean, std, min, max, you can set `monitor.detailed_stats` to `True` in the config.
+- From *run level* to *task level*: In the `calculate_task_level_metrics` function, metrics are aggregated across the `repeat_times` runs of the same task. For example, `eval/dummy/accuracy/mean@2` is the average accuracy of all runs of the task.
 
-**Metric Aggregation Process**:
+- From *task level* to *step level*: In the `gather_eval_metrics` function, metrics are aggregated across all tasks in the step. For example, `eval/dummy/accuracy/mean@2`, `eval/dummy/accuracy/std@2`, `eval/dummy/accuracy/best@2`, `eval/dummy/accuracy/worst@2` are respectively the average, standard deviation, best, and worst accuracy (`eval/dummy/accuracy`) of all tasks in the step.
 
 The following diagram illustrates the aggregation process for evaluation metrics on a dummy dataset with three tasks. By default, the `mean@k`, `std@k`, `best@k`, `worst@k` of the metrics across all evaluation tasks are reported. You can set `monitor.detailed_stats` to `True` in the config to return detailed statistics.
 
@@ -122,7 +122,7 @@ graph TD
     Task3_Metric --> Step_Metrics
     end
 ```
-When you set `monitor.detailed_stats` to `True`,, you get detailed information, including mean, std, min, max, e.g., `eval/dummy/accuracy/mean@2/mean=0.83`, `eval/dummy/accuracy/mean@2/std=0.062`, `eval/dummy/accuracy/mean@2/max=0.9`, `eval/dummy/accuracy/mean@2/min=0.75`.
+When you set `monitor.detailed_stats` to `True`, you get detailed statistics, including mean, std, min, max, e.g., `eval/dummy/accuracy/mean@2/mean=0.83`, `eval/dummy/accuracy/mean@2/std=0.062`, `eval/dummy/accuracy/mean@2/max=0.9`, `eval/dummy/accuracy/mean@2/min=0.75`.
 
 ```mermaid
 graph TD
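
The `monitor.detailed_stats` example fixed in the last hunk can also be sanity-checked numerically. In this minimal sketch, `expand_detailed_stats` is a hypothetical helper (not a project function), the task-level `mean@2` values 0.85, 0.75, 0.9 are assumed, and population standard deviation is used because it reproduces the documented 0.83 / 0.062 / 0.9 / 0.75.

```python
# Hypothetical helper illustrating the /mean, /std, /max, /min name expansion
# documented above; expand_detailed_stats is not a real project function.
from statistics import mean, pstdev

def expand_detailed_stats(name: str, values: list[float]) -> dict[str, float]:
    return {
        f"{name}/mean": round(mean(values), 3),
        f"{name}/std": round(pstdev(values), 3),  # population std gives 0.062
        f"{name}/max": max(values),
        f"{name}/min": min(values),
    }

# Assumed task-level mean@2 values, consistent with the diagrams above.
print(expand_detailed_stats("eval/dummy/accuracy/mean@2", [0.85, 0.75, 0.9]))
# {'eval/dummy/accuracy/mean@2/mean': 0.833, '.../std': 0.062,
#  '.../max': 0.9, '.../min': 0.75}
```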
