Explorer metrics track performance during the rollout phase, where the model generates responses; they include rollout metrics (`rollout/`), eval metrics (`eval/`), and some time metrics (`time/`).

#### Rollout Metrics (`rollout/`)

Rollout metrics track performance during the rollout phase where the model generates responses.

- **Format**: `rollout/{metric_name}/{statistic}`
- **Examples**:
  - `rollout/accuracy/mean`: Average accuracy of generated responses
  - `rollout/format_score/mean`: Average format correctness score

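
As an illustration of this naming scheme only (a minimal sketch with made-up values and hypothetical helper names, not the project's actual implementation), per-response scores can be reduced into `rollout/{metric_name}/{statistic}` entries as follows:

```python
from statistics import mean

# Hypothetical per-response scores collected during one exploration step
# (the metric names mirror the examples above; the values are made up).
run_scores = [
    {"accuracy": 1.0, "format_score": 0.9},
    {"accuracy": 0.0, "format_score": 1.0},
    {"accuracy": 1.0, "format_score": 0.8},
]

def to_rollout_metrics(runs):
    """Reduce per-response scores into `rollout/{metric_name}/{statistic}` entries."""
    return {
        f"rollout/{name}/mean": mean(r[name] for r in runs)
        for name in runs[0]
    }

print(to_rollout_metrics(run_scores))
# ≈ {'rollout/accuracy/mean': 0.667, 'rollout/format_score/mean': 0.9}
```
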
**Metric Aggregation Process**:

Consider an exploration step with `batch_size` tasks, where each task has `repeat_times` runs. Rollout metrics (e.g., `rollout/`) are computed and aggregated at different levels:
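
Below is a minimal sketch of this aggregation, assuming (as for the evaluation metrics described later) that the `repeat_times` runs of each task are reduced first and the resulting task-level values are then averaged over the `batch_size` tasks of the step; the numbers and helper names are made up, and the actual pipeline may differ.

```python
from statistics import mean

# One exploration step: batch_size=2 tasks, repeat_times=3 runs per task
# (hypothetical accuracy values).
runs_per_task = {
    "task_0": [1.0, 0.0, 1.0],
    "task_1": [1.0, 1.0, 0.0],
}

# Task level: reduce the repeat_times runs of each task.
task_level = {task: mean(scores) for task, scores in runs_per_task.items()}

# Step level: average the task-level values across the batch and log them
# under the `rollout/{metric_name}/{statistic}` naming scheme.
step_metrics = {"rollout/accuracy/mean": mean(task_level.values())}
print(step_metrics)  # ≈ {'rollout/accuracy/mean': 0.667}
```
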
#### Eval Metrics (`eval/`) and Benchmark Metrics (`bench/`)
Evaluation metrics measure model performance on held-out evaluation tasks. These metrics are computed during periodic evaluation runs.

- **Format**: `eval/{task_name}/{metric_name}/{statistic}` or `bench/{task_name}/{metric_name}/{statistic}`
- **Examples**:
  - `eval/gsm8k-eval/accuracy/mean@4`: Mean accuracy across `repeat_times=4` runs
  - `bench/gsm8k-eval/accuracy/best@4`: Best accuracy across `repeat_times=4` runs
- **Note**:
  - Eval and bench metrics are computed in the same way; the only difference is the metric name prefix.
  - By default, only the *mean* of each metric is returned. If you want detailed statistics, set `monitor.detailed_stats` to `True` in the config.

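
To make the `@k` statistics concrete, here is a minimal sketch (hypothetical run scores and helper code, not the library's implementation; it uses the population standard deviation for `std@k` as an example) that computes task-level statistics for a single task with `repeat_times=4` runs:

```python
from statistics import mean, pstdev

# Hypothetical accuracy of repeat_times=4 runs of the task "gsm8k-eval".
run_accuracies = [1.0, 0.0, 1.0, 1.0]
k = len(run_accuracies)

# `bench/` metrics are built the same way, just with the `bench/` prefix.
task_metrics = {
    f"eval/gsm8k-eval/accuracy/mean@{k}": mean(run_accuracies),
    f"eval/gsm8k-eval/accuracy/std@{k}": pstdev(run_accuracies),
    f"eval/gsm8k-eval/accuracy/best@{k}": max(run_accuracies),
    f"eval/gsm8k-eval/accuracy/worst@{k}": min(run_accuracies),
}
print(task_metrics)
# ≈ {'eval/gsm8k-eval/accuracy/mean@4': 0.75, 'eval/gsm8k-eval/accuracy/std@4': 0.433,
#    'eval/gsm8k-eval/accuracy/best@4': 1.0, 'eval/gsm8k-eval/accuracy/worst@4': 0.0}
```
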
Consider an evaluation step with `len(eval_taskset)` tasks, where each task has `repeat_times` runs. Evaluation metrics (e.g., `eval/`, `bench/`) are computed and aggregated at different levels:

- **Task level**: Task-level metrics (e.g., `mean@4`, `std@4`, `best@2`, `worst@2`) are computed from the k (`repeat_times`) runs of each task.
- **Step level**: By default, we report the mean of each task-level metric (`mean@k`, `std@k`, `best@k`, `worst@k`) across all evaluation tasks. If you want detailed statistics (mean, std, min, max), set `monitor.detailed_stats` to `True` in the config.

**Metric Aggregation Process**:

The following diagram illustrates the aggregation process for evaluation metrics on a dummy dataset with three tasks. By default, the `mean@k`, `std@k`, `best@k`, and `worst@k` of each metric across all evaluation tasks are reported; you can set `monitor.detailed_stats` to `True` in the config to return detailed statistics.

*(Mermaid flowchart: the per-run metrics of each dummy task are reduced to task-level statistics, which are then reduced to the reported step-level values.)*

When you set `monitor.detailed_stats` to `True`, you will get detailed statistics (mean, std, min, max) for each metric, e.g., `eval/dummy/accuracy/mean@2/mean=0.83`, `eval/dummy/accuracy/mean@2/std=0.062`, `eval/dummy/accuracy/mean@2/max=0.9`, and `eval/dummy/accuracy/mean@2/min=0.75`.
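
And here is a minimal sketch of the step-level behavior controlled by `monitor.detailed_stats` (the task-level values below are made up, chosen only so the output lines up with the example above; the helper is illustrative, not the monitor's actual API):

```python
from statistics import mean, pstdev

# Hypothetical task-level `mean@2` accuracies for the three dummy tasks.
task_level_values = {"mean@2": [0.75, 0.85, 0.90]}

def step_level(task_level_values, detailed_stats=False):
    """Aggregate task-level statistics across all evaluation tasks."""
    metrics = {}
    for stat_name, values in task_level_values.items():
        if detailed_stats:
            metrics[f"eval/dummy/accuracy/{stat_name}/mean"] = mean(values)
            metrics[f"eval/dummy/accuracy/{stat_name}/std"] = pstdev(values)
            metrics[f"eval/dummy/accuracy/{stat_name}/max"] = max(values)
            metrics[f"eval/dummy/accuracy/{stat_name}/min"] = min(values)
        else:  # default: only the mean across tasks is reported
            metrics[f"eval/dummy/accuracy/{stat_name}"] = mean(values)
    return metrics

print(step_level(task_level_values, detailed_stats=True))
# ≈ {'eval/dummy/accuracy/mean@2/mean': 0.83, 'eval/dummy/accuracy/mean@2/std': 0.062,
#    'eval/dummy/accuracy/mean@2/max': 0.9, 'eval/dummy/accuracy/mean@2/min': 0.75}
```
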