docs/sphinx_doc/source/tutorial/metrics_reference.md (7 additions, 8 deletions)
@@ -34,12 +34,11 @@ Rollout metrics track performance during the rollout phase where the model gener
 
 Consider an exploration step with `batch_size` tasks, where each task has `repeat_times` runs. Rollout metrics (e.g., `rollout/`) are computed and aggregated at different levels:
 
-- **Task level**: Metrics aggregated across `repeat_times` runs of the same task. For example, `rollout/accuracy` is the average accuracy of all runs of the task.
+- From *run level* to *task level*: In the `calculate_task_level_metrics` function, metrics are aggregated across the `repeat_times` runs of the same task. For example, `rollout/accuracy` is the average accuracy of all runs of the task.
 
-- **Step level**: Metrics are reported at the step level. For example, `rollout/accuracy/mean`, `rollout/accuracy/max`, `rollout/accuracy/min` are the average, max, and min accuracy (`rollout/accuracy`) of all tasks in the step.
+- From *task level* to *step level*: In the `gather_metrics` function, metrics are aggregated across all tasks in the step. For example, `rollout/accuracy/mean`, `rollout/accuracy/max`, and `rollout/accuracy/min` are the average, maximum, and minimum accuracy (`rollout/accuracy`) of all tasks in the step.
 
 The following diagram illustrates the aggregation process for rollout metrics:
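For readers skimming this hunk, here is a minimal Python sketch of the two-stage rollout-metric aggregation the new wording describes. It is illustrative only: the dummy numbers and the plain-dict data layout are assumptions, and the repository's actual `calculate_task_level_metrics` and `gather_metrics` functions may use different signatures and data structures.

```python
# Minimal, illustrative sketch of the rollout-metric aggregation described above
# (not the repository's implementation). Dummy accuracies for batch_size = 3
# tasks, each with repeat_times = 2 runs.
from statistics import mean

runs_per_task = [
    [{"accuracy": 1.0}, {"accuracy": 0.0}],  # task 0
    [{"accuracy": 1.0}, {"accuracy": 1.0}],  # task 1
    [{"accuracy": 0.0}, {"accuracy": 0.0}],  # task 2
]

# Run level -> task level: average each metric over the runs of one task.
task_level = [
    {"rollout/accuracy": mean(run["accuracy"] for run in runs)}
    for runs in runs_per_task
]  # task-level accuracies: 0.5, 1.0, 0.0

# Task level -> step level: report mean/max/min of the task-level metric.
values = [t["rollout/accuracy"] for t in task_level]
step_metrics = {
    "rollout/accuracy/mean": mean(values),  # 0.5
    "rollout/accuracy/max": max(values),    # 1.0
    "rollout/accuracy/min": min(values),    # 0.0
}
print(step_metrics)
```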
@@ -83,13 +81,14 @@ Evaluation metrics measure model performance on held-out evaluation tasks. These
 - Eval and bench metrics are computed in the same way; the only difference is the prefix of the metric name.
 - By default, only the *mean* of the metric is returned. If you want to return detailed statistics, you can set `monitor.detailed_stats` to `True` in the config.
 
+**Metric Aggregation Process**:
+
 Consider an evaluation step with `len(eval_taskset)` tasks, where each task has `repeat_times` runs. Evaluation metrics (e.g., `eval/`, `bench/`) are computed and aggregated at different levels:
 
-- **Task level**: Task-level metrics include statistics (e.g., `mean@4`, `std@4`, `best@2`, `worst@2`) that are computed from the k runs of the task.
+- From *run level* to *task level*: In the `calculate_task_level_metrics` function, metrics are aggregated across the `repeat_times` runs of the same task. For example, `eval/dummy/accuracy/mean@2` is the average accuracy of all runs of the task.
 
-- **Step level**: By default, we report the mean of the metric across all evaluation tasks. For example, `mean@k`, `std@k`, `best@k`, `worst@k` of the metrics across all evaluation tasks are reported. If you want to return detailed statistics, including mean, std, min, and max, you can set `monitor.detailed_stats` to `True` in the config.
+- From *task level* to *step level*: In the `gather_eval_metrics` function, metrics are aggregated across all tasks in the step. For example, `eval/dummy/accuracy/mean@2`, `eval/dummy/accuracy/std@2`, `eval/dummy/accuracy/best@2`, and `eval/dummy/accuracy/worst@2` are the average, standard deviation, best, and worst accuracy (`eval/dummy/accuracy`) of all tasks in the step.
 
-**Metric Aggregation Process**:
 
 The following diagram illustrates the aggregation process on a dummy dataset with three tasks for evaluation metrics. By default, `mean@k`, `std@k`, `best@k`, and `worst@k` of the metrics across all evaluation tasks are reported. You can set `monitor.detailed_stats` to `True` in the config to return detailed statistics.
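Similarly, the sketch below illustrates the evaluation-metric aggregation this hunk describes, assuming k = `repeat_times` = 2 and taking `best@k`/`worst@k` to be the max/min over the k runs of a task; those definitions, the dummy numbers, and the data layout are assumptions, not the repository's `gather_eval_metrics` implementation.

```python
# Illustrative sketch only, not the repository's actual eval-metric code.
from statistics import mean, pstdev

# Run-level accuracies for three dummy eval tasks, k = repeat_times = 2 runs each.
runs_per_task = [[1.0, 0.0], [1.0, 1.0], [0.0, 0.0]]
k = 2

# Run level -> task level: mean@k, std@k, best@k, worst@k over the k runs of a task.
task_level = [
    {
        f"eval/dummy/accuracy/mean@{k}": mean(runs),
        f"eval/dummy/accuracy/std@{k}": pstdev(runs),
        f"eval/dummy/accuracy/best@{k}": max(runs),   # assumed: best = max over runs
        f"eval/dummy/accuracy/worst@{k}": min(runs),  # assumed: worst = min over runs
    }
    for runs in runs_per_task
]

# Task level -> step level: by default, report the mean of each task-level
# statistic across all tasks in the evaluation step.
step_metrics = {
    name: mean(task[name] for task in task_level) for name in task_level[0]
}
print(step_metrics)  # mean@2 = 0.5, std@2 ~ 0.167, best@2 ~ 0.667, worst@2 ~ 0.333
```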