agentscope-ai · pan-x-c · Jan 23, 2026 · Jan 20, 2026 · Jan 21, 2026 · Jan 21, 2026
diff --git a/docs/sphinx_doc/source/index.rst b/docs/sphinx_doc/source/index.rst
@@ -26,6 +26,7 @@ Welcome to Trinity-RFT's documentation!
    tutorial/trinity_gpu_configs.md
    tutorial/synchronizer.md
    tutorial/align_with_verl.md
+   tutorial/metrics_reference.md
 
 
 .. toctree::

diff --git a/docs/sphinx_doc/source/tutorial/metrics_reference.md b/docs/sphinx_doc/source/tutorial/metrics_reference.md
@@ -0,0 +1,196 @@
+# Metrics Reference
+
+This document provides an overview of the metric categories used in Trinity-RFT for tracking exploration, evaluation, and training progress.
+
+## Metric Naming Convention
+
+Most metrics follow a hierarchical naming convention: `{category}/{taskset_name}/{metric_name}/{statistic}`
+
+- **Category**: Broad functional area (rollout, eval, time, actor, critic, etc.)
+- **Taskset name**: Name of the taskset used, only applicable for eval metrics
+- **Metric name**: Specific metric being measured
+- **Statistic**: Aggregation method (mean, max, min, std, etc.) if applicable
+
+
+## Metric Categories
+
+In the following, metrics are categorized by their source component (where they are generated) and their metric prefix (the first part of the metric name).
+
+### Explorer Metrics
+
+Explorer metrics track performance during the rollout phase where the model generates responses, including rollout metrics (`rollout/`), eval metrics (`eval/`), and some time metrics (`time/`).
+
+
+#### Rollout Metrics (`rollout/`)
+
+Rollout metrics track performance during the rollout phase where the model generates responses.
+
+- **Format**: `rollout/{metric_name}/{statistic}`
+- **Examples**:
+  - `rollout/accuracy/mean`: Average accuracy of generated responses
+  - `rollout/format_score/mean`: Average format correctness score
+
+**Metric Aggregation Process**:
+
+Consider an exploration step with `batch_size` tasks, where each task has `repeat_times` runs. Rollout metrics (e.g., `rollout/`) are computed and aggregated at different levels:
+
+- From *run level* to *task level*: In `calculate_task_level_metrics` function, metrics are aggregated across `repeat_times` runs of the same task. For example, `rollout/accuracy` is the average accuracy of all runs of the task.
+
+- From *task level* to *step level*: In `gather_metrics` function, metrics are aggregated across all tasks in the step. For example, `rollout/accuracy/mean`, `rollout/accuracy/max`, `rollout/accuracy/min` are the average, max, and min accuracy (`rollout/accuracy`) of all tasks in the step.
+
+The following diagram illustrates the aggregation process for rollout metrics:
+```mermaid
+graph TD
+    subgraph Step["Batch_size=3 Tasks"]
+        subgraph Task1["Task 1 (repeat_times=2)"]
+            Run1_1["Run 1<br/>accuracy: 0.8"]
+            Run1_2["Run 2<br/>accuracy: 0.9"]
+            Run1_1 --> Task1_Metric["rollout/accuracy<br/>= 0.85"]
+            Run1_2 --> Task1_Metric
+        end
+        subgraph Task2["Task 2 (repeat_times=2)"]
+            Run2_1["Run 1<br/>accuracy: 0.6"]
+            Run2_2["Run 2<br/>accuracy: 0.9"]
+            Run2_1 --> Task2_Metric["rollout/accuracy<br/>= 0.75"]
+            Run2_2 --> Task2_Metric
+        end
+
+        subgraph TaskN["Task 3 (repeat_times=2)"]
+            Run3_1["Run 1<br/>accuracy: 0.95"]
+            Run3_2["Run 2<br/>accuracy: 0.85"]
+            Run3_1 --> Task3_Metric["rollout/accuracy<br/>= 0.9"]
+            Run3_2 --> Task3_Metric
+        end
+
+        Task1_Metric --> Step_Metrics["Step Level Metrics<br/>rollout/accuracy/mean=0.83<br/>rollout/accuracy/max=0.9<br/>rollout/accuracy/min=0.75"]
+        Task2_Metric --> Step_Metrics
+        Task3_Metric --> Step_Metrics
+    end
+```
+
+#### Eval Metrics (`eval/`) and Benchmark Metrics (`bench/`)
+
+Evaluation metrics measure model performance on held-out evaluation tasks. These metrics are computed during periodic evaluation runs.
+
+- **Format**: `eval/{task_name}/{metric_name}/{statistic}` or `bench/{task_name}/{metric_name}/{statistic}`
+- **Examples**:
+  - `eval/gsm8k-eval/accuracy/mean@4`: Mean accuracy across repeat_times=4 runs
+  - `bench/gsm8k-eval/accuracy/best@4`: Best accuracy value across repeat_times=4 runs
+
+- **Note**:
+  - Eval and bench metrics are computed in the same way, the only difference is the prefix of the metric name.
+  - By default, only the *mean* of the metric is returned. If you want to return detailed statistics, you can set `monitor.detailed_stats` to `True` in the config.
+
+**Metric Aggregation Process**:
+
+Consider an evaluation step with `len(eval_taskset)` tasks, where each task has `repeat_times` runs. Evaluation metrics (e.g., `eval/`, `bench/`) are computed and aggregated at different levels:
+
+- From *run level* to *task level*: In `calculate_task_level_metrics` function, metrics are aggregated across `repeat_times` runs of the same task. For example, `eval/dummy/accuracy/mean@2` is the average accuracy of all runs of the task.
+
+- From *task level* to *step level*: In `gather_eval_metrics` function, metrics are aggregated across all tasks in the step. For example, `eval/dummy/accuracy/mean@2`, `eval/dummy/accuracy/std@2`, `eval/dummy/accuracy/best@2`, `eval/dummy/accuracy/worst@2` are the average, std, best, and worst accuracy (`eval/dummy/accuracy`) of all tasks in the step.
+
+
+The following diagram illustrates the aggregation process on a dummy dataset with three tasks for evaluation metrics. By default, `mean@k`, `std@k`, `best@k`, `worst@k` of the metrics across all evaluation tasks are reported. You can set `monitor.detailed_stats` to `True` in the config to return detailed statistics.
+
+```mermaid
+graph TD
+    subgraph Step["len(eval_taskset)=3 Tasks"]
+        subgraph Task1["Task 1 (repeat_times=2)"]
+            Run1_1["Run 1<br/>accuracy: 0.8"]
+            Run1_2["Run 2<br/>accuracy: 0.9"]
+            Run1_1 --> Task1_Metric["eval/dummy/accuracy/mean@2=0.85<br/>eval/dummy/accuracy/std@2=0.05"]
+            Run1_2 --> Task1_Metric
+        end
+
+        subgraph Task2["Task 2 (repeat_times=2)"]
+            Run2_1["Run 1<br/>accuracy: 0.6"]
+            Run2_2["Run 2<br/>accuracy: 0.9"]
+            Run2_1 --> Task2_Metric["eval/dummy/accuracy/mean@2=0.75<br/>eval/dummy/accuracy/std@2=0.15"]
+            Run2_2 --> Task2_Metric
+        end
+
+        subgraph TaskN["Task 3 (repeat_times=2)"]
+            Run3_1["Run 1<br/>accuracy: 0.95"]
+            Run3_2["Run 2<br/>accuracy: 0.85"]
+            Run3_1 --> Task3_Metric["eval/dummy/accuracy/mean@2=0.9<br/>eval/dummy/accuracy/std@2=0.05"]
+            Run3_2 --> Task3_Metric
+        end
+
+        Task1_Metric --> Step_Metrics["Step Level Metrics<br/>eval/dummy/accuracy/mean@2=0.83<br/>eval/dummy/accuracy/std@2=0.083"]
+        Task2_Metric --> Step_Metrics
+        Task3_Metric --> Step_Metrics
+    end
+```
+
+When you set `monitor.detailed_stats` to `True`, you will get detailed statistics including mean, std, min, max, e.g., `eval/dummy/accuracy/mean@2/mean=0.83`, `eval/dummy/accuracy/mean@2/std=0.062`, `eval/dummy/accuracy/mean@2/max=0.9`, and `eval/dummy/accuracy/mean@2/min=0.75`:
+
+```mermaid
+graph TD
+    subgraph Step_Metrics_Summary["Detailed Statistics"]
+        Stat1["eval/dummy/accuracy/mean@2/mean=0.83<br/>eval/dummy/accuracy/mean@2/std=0.062<br/>eval/dummy/accuracy/mean@2/max=0.9<br/>eval/dummy/accuracy/mean@2/min=0.75"]
+        Stat2["eval/dummy/accuracy/std@2/mean=0.083<br/>eval/dummy/accuracy/std@2/std=0.047<br/>eval/dummy/accuracy/std@2/max=0.15<br/>eval/dummy/accuracy/std@2/min=0.05"]
+        Step_Metrics --> Stat1
+        Step_Metrics --> Stat2
+    end
+```
+
+
+#### Time Metrics (`time/`)
+
+Time metrics measure execution duration for various operations throughout the training pipeline.
+
+- **Format**: `time/{operation_name}`
+- **Examples**:
+  - `time/eval`: Time from the start of submitting evaluation tasks to the end of the evaluation phase; this duration includes both evaluation tasks and some rollout tasks.
+  - `time/train_step`: Total time for one training step
+
+**Note**:
+  - Time measuring can be inaccurate due to the asynchronous nature of the exploration pipeline, but it is still useful for monitoring the overall training progress.
+  - Above metrics are reported in seconds unless otherwise specified.
+  - Some training operations also report per-token timing metrics with the prefix `timing_per_token_ms/` (e.g., `timing_per_token_ms/update_actor`, `timing_per_token_ms/update_critic`, `timing_per_token_ms/adv`, `timing_per_token_ms/values`). These metrics normalize execution time by the number of tokens processed, providing efficiency measurements independent of batch size.
+
+
+### Training Metrics
+
+This category includes metrics that track the training dynamics of the policy (actor) model (`actor/`) and the value function (critic) model (`critic/`), as well as some performance metrics (`perf/`, `global_seqlen/`, `response_length/`, `prompt_length/`, `time/`). These metrics are adapted from [veRL](https://github.com/volcengine/verl). Interested users can refer to the [veRL documentation](https://verl.readthedocs.io/en/latest/index.html) for more details.
+
+
+### Data Processing Metrics
+
+This category includes metrics that track the processing of experiences through various pipeline operators (`experience_pipeline/`) and data sampling statistics (`sample/`). These metrics are aggregated at the step level, as the experience pipeline and data sampling are performed in each step.
+
+
+#### Experience Pipeline Metrics (`experience_pipeline/` and `time/experience_pipeline/`)
+
+Experience pipeline metrics track the processing of experiences through various pipeline operators. Each metric represents the count of the specific operator in one step.
+
+- **Format**: `experience_pipeline/{metric_name}`
+- **Examples**:
+  - `experience_pipeline/experience_count`: Number of experiences processed
+  - `experience_pipeline/group_advantages/reward_mean/mean`: Here `reward_mean` is the mean reward of each task, then we compute the mean of the mean rewards of all tasks in the step.
+
+The following diagram illustrates the aggregation process for data processing metrics:
+```mermaid
+graph TD
+    subgraph Step["4 Experiences in one step"]
+        subgraph Task1["Experience 1"]
+            Run1_1["Run 1<br/>reward_mean: 0.8"]
+            Run1_2["Run 2<br/>reward_mean: 0.8"]
+            Run2_1["Run 3<br/>reward_mean: 0.9"]
+            Run2_2["Run 4<br/>reward_mean: 0.9"]
+            Run1_1 --> Task1_Metric["rollout/accuracy<br/>= 0.85"]
+            Run1_2 --> Task1_Metric
+            Run2_1 --> Task1_Metric
+            Run2_2 --> Task1_Metric
+        end
+    end
+```
+
+#### Sample Metrics (`sample/`)
+
+Sample metrics track data sampling statistics during training.
+
+- **Format**: `sample/{metric_name}`
+- **Examples**:
+  - `sample/model_version/mean`: Mean model version of sampled experiences
+  - `sample/task_count`: Number of tasks in the sampled batch
diff --git a/docs/sphinx_doc/source_zh/index.rst b/docs/sphinx_doc/source_zh/index.rst
@@ -25,6 +25,7 @@
    tutorial/trinity_gpu_configs.md
    tutorial/synchronizer.md
    tutorial/align_with_verl.md
+   tutorial/metrics_reference.md
 
 .. toctree::
    :maxdepth: 1