
Commit 76b53a6: update docs
1 parent ead5523

4 files changed, +285 -99 lines changed

docs/sphinx_doc/source/index.rst

Lines changed: 1 addition & 0 deletions
@@ -26,6 +26,7 @@ Welcome to Trinity-RFT's documentation!
 tutorial/trinity_gpu_configs.md
 tutorial/synchronizer.md
 tutorial/align_with_verl.md
+tutorial/metrics_reference.md


 .. toctree::

docs/sphinx_doc/source/tutorial/metrics_reference.md

Lines changed: 100 additions & 99 deletions
@@ -1,6 +1,6 @@
 # Metrics Reference

-This document provides an overview of the metric categories used in Trinity-RFT for tracking performance.
+This document provides an overview of the metric categories used in Trinity-RFT for tracking exploration, evaluation, and training progress.

 ## Metric Naming Convention

@@ -22,11 +22,81 @@ Explorer metrics track performance during the rollout phase where the model gene

 #### Metric Aggregation Levels

-Consider a task with `repeat_times` runs, an exploration step with `batch_size` tasks, and an evalutation step with `eval_taskset_size` tasks. Explorer metrics are computed and aggregated at different levels:
-
-- **Task level**: Metrics aggregated across `repeat_times` runs of the same task. For exploration tasks, the metrics are aggregated across all runs of the task, e.g., `rollout/accuracy` is the average accuracy of all runs of the task. For evaluation tasks, task-level metrics include (e.g., `mean@4`, `std@4`, `best@2`, `worst@2`) that are computed from k runs of the task.
-
-- **Step level**: For most cases, the metrics are reported at the step level. For example, `rollout/accuracy/mean`, `rollout/accuracy/max`, `rollout/accuracy/min` are the average, max, and min accuracy (`rollout/accuracy`) of all tasks in the step. As for evaluation tasks, we report the mean of the metric across all evaluation tasks by default; if you want to return detailed statistics, you can set `monitor.detailed_stats` to `True` in the config.
+Consider an exploration step with `batch_size` tasks, where each task has `repeat_times` runs. Rollout metrics (e.g., `rollout/`) are computed and aggregated at different levels:
+
+- **Task level**: Metrics aggregated across `repeat_times` runs of the same task. For example, `rollout/accuracy` is the average accuracy of all runs of the task.
+
+- **Step level**: Metrics are reported at the step level. For example, `rollout/accuracy/mean`, `rollout/accuracy/max`, `rollout/accuracy/min` are the average, max, and min accuracy (`rollout/accuracy`) of all tasks in the step.
+
+The following diagram illustrates the aggregation process for rollout metrics:
+
+```mermaid
+graph TD
+subgraph Step["Batch_size=3 Tasks"]
+subgraph Task1["Task 1 (repeat_times=2)"]
+Run1_1["Run 1<br/>accuracy: 0.8"]
+Run1_2["Run 2<br/>accuracy: 0.9"]
+Run1_1 --> Task1_Metric["rollout/accuracy<br/>= 0.85"]
+Run1_2 --> Task1_Metric
+end
+
+subgraph Task2["Task 2 (repeat_times=2)"]
+Run2_1["Run 1<br/>accuracy: 0.6"]
+Run2_2["Run 2<br/>accuracy: 0.9"]
+Run2_1 --> Task2_Metric["rollout/accuracy<br/>= 0.75"]
+Run2_2 --> Task2_Metric
+end
+
+subgraph TaskN["Task 3 (repeat_times=2)"]
+Run3_1["Run 1<br/>accuracy: 0.95"]
+Run3_2["Run 2<br/>accuracy: 0.85"]
+Run3_1 --> Task3_Metric["rollout/accuracy<br/>= 0.9"]
+Run3_2 --> Task3_Metric
+end
+
+Task1_Metric --> Step_Metrics["Step Level Metrics<br/>rollout/accuracy/mean=0.83<br/>rollout/accuracy/max=0.9<br/>rollout/accuracy/min=0.6"]
+Task2_Metric --> Step_Metrics
+Task3_Metric --> Step_Metrics
+end
+```
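
To make the aggregation concrete, the snippet below reproduces the task-level and step-level rollout statistics described above in plain Python. It is a minimal sketch: the data layout and names are illustrative and are not part of the Trinity-RFT API.

```python
from statistics import mean

# Per-run rollout metrics for one exploration step
# (batch_size=3 tasks, repeat_times=2 runs per task).
step_runs = [
    [{"accuracy": 0.8}, {"accuracy": 0.9}],    # Task 1
    [{"accuracy": 0.6}, {"accuracy": 0.9}],    # Task 2
    [{"accuracy": 0.95}, {"accuracy": 0.85}],  # Task 3
]

# Task level: average each metric over the repeat_times runs of one task.
task_accuracy = [mean(run["accuracy"] for run in runs) for runs in step_runs]
# -> [0.85, 0.75, 0.9]

# Step level: aggregate the task-level values over all tasks in the step,
# following the task-level definition in the text above.
step_metrics = {
    "rollout/accuracy/mean": round(mean(task_accuracy), 2),  # 0.83
    "rollout/accuracy/max": max(task_accuracy),
    "rollout/accuracy/min": min(task_accuracy),
}
```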
+
+Consider an evaluation step with `len(eval_taskset)` tasks, where each task has `repeat_times` runs. Evaluation metrics (e.g., `eval/`, `bench/`) are computed and aggregated at different levels:
+
+- **Task level**: Task-level metrics (e.g., `mean@4`, `std@4`, `best@2`, `worst@2`) are computed from the k runs of each task.
+
+- **Step level**: By default, we report the mean of each task-level metric across all evaluation tasks, i.e., the averages of `mean@k`, `std@k`, `best@k`, and `worst@k` over all evaluation tasks. If you want detailed statistics (mean, std, min, max), set `monitor.detailed_stats` to `True` in the config.
+
+The following diagram illustrates the aggregation process for evaluation metrics on a dummy dataset with three tasks. By default, the means of `mean@k`, `std@k`, `best@k`, and `worst@k` across all evaluation tasks are reported. You may set `monitor.detailed_stats` to `True` in the config to return detailed statistics (mean, std, min, max), e.g., `eval/dummy/accuracy/mean@2/mean=0.83`, `eval/dummy/accuracy/mean@2/std=0.062`, `eval/dummy/accuracy/mean@2/max=0.9`, and `eval/dummy/accuracy/mean@2/min=0.75`.
+
+```mermaid
+graph TD
+subgraph Step["len(eval_taskset)=3 Tasks"]
+subgraph Task1["Task 1 (repeat_times=2)"]
+Run1_1["Run 1<br/>accuracy: 0.8"]
+Run1_2["Run 2<br/>accuracy: 0.9"]
+Run1_1 --> Task1_Metric["eval/dummy/accuracy/mean@2=0.85<br/>eval/dummy/accuracy/std@2=0.05"]
+Run1_2 --> Task1_Metric
+end
+
+subgraph Task2["Task 2 (repeat_times=2)"]
+Run2_1["Run 1<br/>accuracy: 0.6"]
+Run2_2["Run 2<br/>accuracy: 0.9"]
+Run2_1 --> Task2_Metric["eval/dummy/accuracy/mean@2=0.75<br/>eval/dummy/accuracy/std@2=0.15"]
+Run2_2 --> Task2_Metric
+end
+
+subgraph TaskN["Task 3 (repeat_times=2)"]
+Run3_1["Run 1<br/>accuracy: 0.95"]
+Run3_2["Run 2<br/>accuracy: 0.85"]
+Run3_1 --> Task3_Metric["eval/dummy/accuracy/mean@2=0.9<br/>eval/dummy/accuracy/std@2=0.05"]
+Run3_2 --> Task3_Metric
+end
+
+Task1_Metric --> Step_Metrics["Step Level Metrics<br/>eval/dummy/accuracy/mean@2=0.83<br/>eval/dummy/accuracy/std@2=0.083"]
+Task2_Metric --> Step_Metrics
+Task3_Metric --> Step_Metrics
+end
+```
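
To make the eval aggregation concrete, here is a minimal Python sketch that reproduces the numbers in the diagram. It assumes plain lists of per-run accuracies and uses the population standard deviation; the exact estimators used by Trinity-RFT (including the bootstrap behind `best@k`/`worst@k`) may differ.

```python
from statistics import mean, pstdev

# Per-run accuracies for three evaluation tasks (repeat_times=2).
eval_runs = {
    "task_1": [0.8, 0.9],
    "task_2": [0.6, 0.9],
    "task_3": [0.95, 0.85],
}

# Task level: mean@k and std@k over the k runs of each task.
mean_at_k = {t: mean(v) for t, v in eval_runs.items()}   # 0.85, 0.75, 0.9
std_at_k = {t: pstdev(v) for t, v in eval_runs.items()}  # 0.05, 0.15, 0.05

# Step level (default): average each task-level metric over all tasks.
report = {
    "eval/dummy/accuracy/mean@2": mean(mean_at_k.values()),  # ~0.83
    "eval/dummy/accuracy/std@2": mean(std_at_k.values()),    # ~0.083
}

# With monitor.detailed_stats set to True, detailed statistics of each
# task-level metric would also be reported, e.g. for mean@2:
detailed = {
    "eval/dummy/accuracy/mean@2/mean": mean(mean_at_k.values()),     # ~0.83
    "eval/dummy/accuracy/mean@2/std": pstdev(mean_at_k.values()),    # ~0.062
    "eval/dummy/accuracy/mean@2/max": max(mean_at_k.values()),       # 0.9
    "eval/dummy/accuracy/mean@2/min": min(mean_at_k.values()),       # 0.75
}
```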


 #### Rollout Metrics (`rollout/`)
@@ -36,11 +106,7 @@ Rollout metrics track performance during the rollout phase where the model gener
 - **Format**: `rollout/{metric_name}/{statistic}`
 - **Examples**:
   - `rollout/accuracy/mean`: Average accuracy of generated responses
-  - `rollout/format_score/std`: Average format correctness score
-  - `rollout/finished_task_count`: Number of completed rollout tasks
-  - `rollout/model_version`: Model version used for rollout
-  - `rollout/time/run_execution/mean`: Average execution time per rollout
-
+  - `rollout/format_score/mean`: Average format correctness score

 #### Eval Metrics (`eval/`) and Benchmark Metrics (`bench/`)

@@ -49,9 +115,7 @@ Evaluation metrics measure model performance on held-out evaluation tasks. These
 - **Format**: `eval/{task_name}/{metric_name}/{statistic}` or `bench/{task_name}/{metric_name}/{statistic}`
 - **Examples**:
   - `eval/gsm8k-eval/accuracy/mean@4`: Mean accuracy across repeat_times=4 runs
-  - `eval/gsm8k-eval/accuracy/best@2`: Best accuracy value across k=2 runs, computed by bootstrap method
-  - `eval/gsm8k-eval/accuracy/worst@2`: Worst accuracy value across k=2 runs, computed by bootstrap method
-  - `bench/gsm8k-eval/accuracy/mean@4`: Mean accuracy across repeat_times=4 runs
+  - `bench/gsm8k-eval/accuracy/best@4`: Best accuracy value across repeat_times=4 runs

 - **Note**:
   - Eval and bench metrics are computed in the same way; the only difference is the prefix of the metric name.
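
The `best@k`/`worst@k` statistics referenced above are described as bootstrap estimates over the available runs. As a rough illustration only (not Trinity-RFT's exact implementation), one common estimator resamples k runs from the n available runs and averages the best and worst score of each resample:

```python
import random
from statistics import mean

def best_worst_at_k(values, k, n_bootstrap=1000, seed=0):
    """Rough bootstrap estimate of best@k / worst@k from run-level scores.

    Repeatedly samples k runs (with replacement) from the available runs and
    averages the best and worst score of each sample. Illustrative only; the
    estimator used by Trinity-RFT may differ (e.g., sampling without
    replacement or using a closed-form expectation).
    """
    rng = random.Random(seed)
    best, worst = [], []
    for _ in range(n_bootstrap):
        sample = [rng.choice(values) for _ in range(k)]
        best.append(max(sample))
        worst.append(min(sample))
    return mean(best), mean(worst)

accuracies = [1.0, 0.0, 1.0, 1.0]  # four runs of one eval task
best2, worst2 = best_worst_at_k(accuracies, k=2)
print(f"best@2={best2:.3f}, worst@2={worst2:.3f}")
```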
@@ -65,99 +129,23 @@ Time metrics measure execution duration for various operations throughout the tr
 - **Format**: `time/{operation_name}`
 - **Examples**:
   - `time/eval`: Time from the start of submitting evaluation tasks to the end of the evaluation phase; this duration includes both evaluation tasks and some rollout tasks.
-  - `time/read_experience`: Time to read experiences from taskset
-  - `time/wait_explore_step`: Time waiting for a rollout/exploration step completion
-  - `time/update_critic`: Time to update critic model
-  - `time/update_actor`: Time to update actor model
-  - `time/sync_weight`: Time to synchronize model weights
-  - `time/save_checkpoint`: Time to save model checkpoint
   - `time/train_step`: Total time for one training step
-  - `time/trainer_sync_interval`: Time interval between trainer synchronizations

 **Note**:
 - Time measurements can be inaccurate due to the asynchronous nature of the exploration pipeline, but they are still useful for monitoring the overall training progress.
 - The metrics above are reported in seconds unless otherwise specified.
 - Some training operations also report per-token timing metrics with the prefix `timing_per_token_ms/` (e.g., `timing_per_token_ms/update_actor`, `timing_per_token_ms/update_critic`, `timing_per_token_ms/adv`, `timing_per_token_ms/values`). These metrics normalize execution time by the number of tokens processed, providing efficiency measurements independent of batch size.
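
As a simple illustration of that per-token normalization (hypothetical numbers, not an actual Trinity-RFT API call):

```python
# Per-token timing: normalize an operation's wall-clock time by the number
# of tokens it processed in that step (illustrative numbers only).
update_actor_seconds = 12.4     # corresponds to time/update_actor
tokens_in_step = 2_097_152      # total tokens processed in the step

timing_per_token_ms = update_actor_seconds * 1000 / tokens_in_step
print(f"timing_per_token_ms/update_actor = {timing_per_token_ms:.4f} ms")
```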


-### Training Metrics
-
-This category includes metrics that track the training dynamics of the policy (actor) model (`actor/`) and the value function (critic) model (`critic/`), as well as some performance metrics (`perf/`, `global_seqlen/`, `response_length/`, `prompt_length/`, `time/`).
-
-#### Actor Metrics (`actor/`)
-
-Actor metrics track the training dynamics of the policy (actor) model in reinforcement learning.
-
-- **Format**: `actor/{metric_name}`
-- **Examples**:
-  - `actor/pg_loss`: Policy gradient loss
-  - `actor/entropy_loss`: Entropy regularization loss
-  - `actor/kl_loss`: KL divergence loss
-  - `actor/ppo_kl`: PPO-specific KL divergence
-  - `actor/pg_clipfrac`: Fraction of policy gradient updates clipped
-  - `actor/final_loss`: Final loss used to update the actor model, usually a combination of policy gradient loss, entropy regularization loss, and KL divergence loss.
-
-#### Critic Metrics (`critic/`)
+### Training Metrics

-Critic metrics track the training dynamics of the value function (critic) model.
-
-- **Format**: `critic/{metric_name}/{statistic}`
-- **Examples**:
-  - `critic/score/mean`: Mean sequence-level score
-  - `critic/rewards/mean`: Mean sequence-level reward
-  - `critic/advantages/mean`: Mean advantage values
-  - `critic/returns/mean`: Mean return values
-
-#### Performance Metrics (`perf/`)
-
-Performance metrics measure computational efficiency and resource utilization.
-
-- **Format**: `perf/{metric_name}`
-- **Examples**:
-  - `perf/mfu/actor`: Model FLOPs Utilization (MFU) for actor
-  - `perf/mfu/critic`: Model FLOPs Utilization (MFU) for critic
-  - `perf/mfu/actor_infer`: Model FLOPs Utilization for actor inference (when recomputing logprobs)
-  - `perf/max_memory_allocated_gb`: Peak GPU memory allocated
-  - `perf/max_memory_reserved_gb`: Peak GPU memory reserved
-  - `perf/cpu_memory_used_gb`: CPU memory usage
-  - `perf/total_num_tokens`: Total number of tokens processed
-  - `perf/time_per_step`: Time per training step
-  - `perf/throughput`: Tokens processed per second
-
-#### Global Sequence Length Metrics (`global_seqlen/`)
-
-Global sequence length metrics track sequence length statistics across the training batch.
-
-- **Format**: `global_seqlen/{statistic}`
-- **Examples**:
-  - `global_seqlen/mean`: Mean sequence length
-  - `global_seqlen/min`: Minimum sequence length
-  - `global_seqlen/max`: Maximum sequence length
-  - `global_seqlen/minmax_diff`: Difference between max and min
-  - `global_seqlen/balanced_min`: Balanced minimum (for load balancing)
-  - `global_seqlen/balanced_max`: Balanced maximum (for load balancing)
-
-#### Response and Prompt Length Metrics (`response_length/` and `prompt_length/`)
-
-Metrics tracking the length of generated responses and input prompts.
-
-- **Format**: `response_length/{statistic}` or `prompt_length/{statistic}`
-- **Examples**:
-  - `response_length/mean`: Mean response length in tokens
-  - `response_length/max`: Maximum response length
-  - `response_length/min`: Minimum response length
-  - `response_length/clip_ratio`: Fraction of responses clipped to max length
-  - `prompt_length/mean`: Mean prompt length in tokens
-  - `prompt_length/clip_ratio`: Fraction of prompts clipped to max length
-
-
-**Note**:
-- `/clip_ratio` means the fraction of responses/prompts that matches the max length (instead of being truncated).
+This category includes metrics that track the training dynamics of the policy (actor) model (`actor/`) and the value function (critic) model (`critic/`), as well as some performance metrics (`perf/`, `global_seqlen/`, `response_length/`, `prompt_length/`, `time/`). These metrics are adapted from [veRL](https://github.com/volcengine/verl). Interested users can refer to the [veRL documentation](https://verl.readthedocs.io/en/latest/index.html) for more details.


 ### Data Processing Metrics

-This category includes metrics that track the processing of experiences through various pipeline operators (`experience_pipeline/`) and data sampling statistics (`sample/`).
+This category includes metrics that track the processing of experiences through various pipeline operators (`experience_pipeline/`) and data sampling statistics (`sample/`). These metrics are aggregated at the step level, as the experience pipeline and data sampling are performed in each step.
+

 #### Experience Pipeline Metrics (`experience_pipeline/` and `time/experience_pipeline/`)

@@ -166,11 +154,24 @@ Experience pipeline metrics track the processing of experiences through various
 - **Format**: `experience_pipeline/{metric_name}`
 - **Examples**:
   - `experience_pipeline/experience_count`: Number of experiences processed
-  - `experience_pipeline/filtered_count`: Number of experiences filtered out
-  - `experience_pipeline/group_advantages/reward_mean/mean`: Mean reward statistics
-  - `time/experience_pipeline/operator/{operator_name}`: Time for specific pipeline operators
-  - `time/experience_pipeline/write`: Time to write experiences to storage
-  - `time/experience_pipeline/total`: Total time for experience processing
+  - `experience_pipeline/group_advantages/reward_mean/mean`: Here `reward_mean` is the mean reward over the runs of each task; the reported value is the mean of these per-task means over all tasks in the step.
+
+The following diagram illustrates the aggregation process for data processing metrics:
+```mermaid
+graph TD
+subgraph Step["4 Experiences in one step"]
+subgraph Task1["Task 1 (repeat_times=4)"]
+Run1_1["Run 1<br/>reward: 0.8"]
+Run1_2["Run 2<br/>reward: 0.8"]
+Run2_1["Run 3<br/>reward: 0.9"]
+Run2_2["Run 4<br/>reward: 0.9"]
+Run1_1 --> Task1_Metric["group_advantages/reward_mean<br/>= 0.85"]
+Run1_2 --> Task1_Metric
+Run2_1 --> Task1_Metric
+Run2_2 --> Task1_Metric
+end
+end
+```
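
A minimal sketch of how such a two-level statistic could be derived from a flat list of experiences; the field names and grouping key are illustrative, not the actual Trinity-RFT data structures:

```python
from collections import defaultdict
from statistics import mean

# Flat list of experiences produced in one step; each experience carries
# the id of the task (group) it came from and its reward.
experiences = [
    {"task_id": "task_1", "reward": 0.8},
    {"task_id": "task_1", "reward": 0.8},
    {"task_id": "task_1", "reward": 0.9},
    {"task_id": "task_1", "reward": 0.9},
    {"task_id": "task_2", "reward": 0.5},
    {"task_id": "task_2", "reward": 0.7},
]

# Group experiences by task and compute each task's reward_mean.
groups = defaultdict(list)
for exp in experiences:
    groups[exp["task_id"]].append(exp["reward"])
reward_mean_per_task = {t: mean(r) for t, r in groups.items()}  # 0.85, 0.6

# Step level: report the count and the mean of the per-task means.
metrics = {
    "experience_pipeline/experience_count": len(experiences),
    "experience_pipeline/group_advantages/reward_mean/mean": mean(
        reward_mean_per_task.values()
    ),  # 0.725
}
```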

 #### Sample Metrics (`sample/`)

docs/sphinx_doc/source_zh/index.rst

Lines changed: 1 addition & 0 deletions
@@ -25,6 +25,7 @@
 tutorial/trinity_gpu_configs.md
 tutorial/synchronizer.md
 tutorial/align_with_verl.md
+tutorial/metrics_reference.md

 .. toctree::
 :maxdepth: 1
