docs/sphinx_doc/source/tutorial/metrics_reference.md (7 additions, 8 deletions)
@@ -34,12 +34,11 @@ Rollout metrics track performance during the rollout phase where the model gener
 
 Consider an exploration step with `batch_size` tasks, where each task has `repeat_times` runs. Rollout metrics (e.g., `rollout/`) are computed and aggregated at different levels:
 
-- **Task level**: Metrics aggregated across `repeat_times` runs of the same task. For example, `rollout/accuracy` is the average accuracy of all runs of the task.
+- From *run level* to *task level*: In the `calculate_task_level_metrics` function, metrics are aggregated across the `repeat_times` runs of the same task. For example, `rollout/accuracy` is the average accuracy of all runs of the task.
 
-- **Step level**: Metrics are reported at the step level. For example, `rollout/accuracy/mean`, `rollout/accuracy/max`, `rollout/accuracy/min` are the average, max, and min accuracy (`rollout/accuracy`) of all tasks in the step.
+- From *task level* to *step level*: In the `gather_metrics` function, metrics are aggregated across all tasks in the step. For example, `rollout/accuracy/mean`, `rollout/accuracy/max`, and `rollout/accuracy/min` are the average, maximum, and minimum accuracy (`rollout/accuracy`) of all tasks in the step.
 
 The following diagram illustrates the aggregation process for rollout metrics:
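For readers skimming this hunk, here is a minimal Python sketch of the two-stage rollout-metric aggregation the new wording describes. It is illustrative only: the dummy numbers and the plain-dict data layout are assumptions, and the repository's actual `calculate_task_level_metrics` and `gather_metrics` functions may use different signatures and data structures.

```python
# Minimal, illustrative sketch of the rollout-metric aggregation described above
# (not the repository's implementation). Dummy accuracies for batch_size = 3
# tasks, each with repeat_times = 2 runs.
from statistics import mean

runs_per_task = [
    [{"accuracy": 1.0}, {"accuracy": 0.0}],  # task 0
    [{"accuracy": 1.0}, {"accuracy": 1.0}],  # task 1
    [{"accuracy": 0.0}, {"accuracy": 0.0}],  # task 2
]

# Run level -> task level: average each metric over the runs of one task.
task_level = [
    {"rollout/accuracy": mean(run["accuracy"] for run in runs)}
    for runs in runs_per_task
]  # task-level accuracies: 0.5, 1.0, 0.0

# Task level -> step level: report mean/max/min of the task-level metric.
values = [t["rollout/accuracy"] for t in task_level]
step_metrics = {
    "rollout/accuracy/mean": mean(values),  # 0.5
    "rollout/accuracy/max": max(values),    # 1.0
    "rollout/accuracy/min": min(values),    # 0.0
}
print(step_metrics)
```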
@@ -83,13 +81,14 @@ Evaluation metrics measure model performance on held-out evaluation tasks. These
 - Eval and bench metrics are computed in the same way; the only difference is the prefix of the metric name.
 - By default, only the *mean* of the metric is returned. If you want to return detailed statistics, you can set `monitor.detailed_stats` to `True` in the config.
 
+**Metric Aggregation Process**:
+
 Consider an evaluation step with `len(eval_taskset)` tasks, where each task has `repeat_times` runs. Evaluation metrics (e.g., `eval/`, `bench/`) are computed and aggregated at different levels:
 
-- **Task level**: Task-level metrics include statistics (e.g., `mean@4`, `std@4`, `best@2`, `worst@2`) that are computed from the k runs of the task.
+- From *run level* to *task level*: In the `calculate_task_level_metrics` function, metrics are aggregated across the `repeat_times` runs of the same task. For example, `eval/dummy/accuracy/mean@2` is the average accuracy of all runs of the task.
 
-- **Step level**: By default, we report the mean of the metric across all evaluation tasks. For example, `mean@k`, `std@k`, `best@k`, `worst@k` of the metrics across all evaluation tasks are reported. If you want to return detailed statistics, including mean, std, min, and max, you can set `monitor.detailed_stats` to `True` in the config.
+- From *task level* to *step level*: In the `gather_eval_metrics` function, metrics are aggregated across all tasks in the step. For example, `eval/dummy/accuracy/mean@2`, `eval/dummy/accuracy/std@2`, `eval/dummy/accuracy/best@2`, and `eval/dummy/accuracy/worst@2` are the average, standard deviation, best, and worst accuracy (`eval/dummy/accuracy`) of all tasks in the step.
 
-**Metric Aggregation Process**:
 
 The following diagram illustrates the aggregation process on a dummy dataset with three tasks for evaluation metrics. By default, `mean@k`, `std@k`, `best@k`, and `worst@k` of the metrics across all evaluation tasks are reported. You can set `monitor.detailed_stats` to `True` in the config to return detailed statistics.
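Similarly, the sketch below illustrates the evaluation-metric aggregation this hunk describes, assuming k = `repeat_times` = 2 and taking `best@k`/`worst@k` to be the max/min over the k runs of a task; those definitions, the dummy numbers, and the data layout are assumptions, not the repository's `gather_eval_metrics` implementation.

```python
# Illustrative sketch only, not the repository's actual eval-metric code.
from statistics import mean, pstdev

# Run-level accuracies for three dummy eval tasks, k = repeat_times = 2 runs each.
runs_per_task = [[1.0, 0.0], [1.0, 1.0], [0.0, 0.0]]
k = 2

# Run level -> task level: mean@k, std@k, best@k, worst@k over the k runs of a task.
task_level = [
    {
        f"eval/dummy/accuracy/mean@{k}": mean(runs),
        f"eval/dummy/accuracy/std@{k}": pstdev(runs),
        f"eval/dummy/accuracy/best@{k}": max(runs),   # assumed: best = max over runs
        f"eval/dummy/accuracy/worst@{k}": min(runs),  # assumed: worst = min over runs
    }
    for runs in runs_per_task
]

# Task level -> step level: by default, report the mean of each task-level
# statistic across all tasks in the evaluation step.
step_metrics = {
    name: mean(task[name] for task in task_level) for name in task_level[0]
}
print(step_metrics)  # mean@2 = 0.5, std@2 ~ 0.167, best@2 ~ 0.667, worst@2 ~ 0.333
```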