Trinity-RFT uses a unified config file to manage all config items. For the data module, you need to focus on the `data_processor` section in the config file.

In this example, assume that you need to rank all math questions and their corresponding answers by difficulty. You can set the config items as in the following example:

```yaml
data_processor:
  # basic info
  source_data_path: /PATH/TO/GSM8K/
  load_kwargs:
    split: 'train'  # only need the train split
  format:  # set the field mappings
    ...
```
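
For GSM8K, the field mapping might look like the following sketch; `prompt_key` and `response_key` are the mapping keys documented in the config reference, while `'question'`/`'answer'` are GSM8K's column names (adjust them for your own dataset):

```yaml
format:
  prompt_key: 'question'   # column holding the math question
  response_key: 'answer'   # column holding the reference answer
```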

If you are not familiar with Data-Juicer, the data module provides a natural-language way to describe the data processing you need. The basic config items are similar:

```yaml
data_processor:
  # basic info
  source_data_path: /PATH/TO/GSM8K/
  load_kwargs:
    split: 'train'  # only need the train split
  format:  # set the field mappings
    ...
```

After preparing the Data-Juicer data processing recipe, you can set the `dj_config_path` config item to point to it.

Trinity-RFT uses a unified config file to manage all config items. For the data module, you need to focus on the `data_processor` section in the config file.

In this example, assume that you need to rank all math questions and their corresponding answers by difficulty. You can set the config items as in the following example:

```yaml
data_processor:
  ...
```

Here you can set the basic information for the example dataset, the database used to store the result dataset, and some other items about downstream dataset loading for exploring and training, similar to the example above.

For this example, we assume that you are somewhat familiar with the basic usage of Data-Juicer, so we prepare a Data-Juicer data processing recipe in [`tests/test_configs/human_annotator_test_dj_cfg.yaml`](https://github.com/modelscope/Trinity-RFT/blob/main/tests/test_configs/human_annotator_test_dj_cfg.yaml) that includes the `human_preference_annotation_mapper` OP. For example:
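
A minimal sketch of such a recipe is shown below; the top-level keys follow the standard Data-Juicer recipe layout, and the paths and OP arguments are placeholders rather than the real values (see the linked recipe file for those):

```yaml
# Data-Juicer recipe (sketch; paths and OP arguments are illustrative)
project_name: 'human-annotator-demo'
dataset_path: '/path/to/input/dataset'
export_path: '/path/to/output/dataset'

process:
  - human_preference_annotation_mapper:
      # OP arguments omitted; refer to the linked recipe for the real settings
```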

---

**`docs/sphinx_doc/source/tutorial/example_mix_algo.md`**

The first term corresponds to the standard GRPO objective, which aims to maximize the expected reward of the generated responses.

We prompt a powerful LLM to generate responses with a CoT process for some pre-defined questions. The collected data are viewed as experiences from an expert. We store them in a `jsonl` file `expert_data.jsonl` with the following format:

```json
{
  "messages": [
    { "role": "system", "content": <system_prompt> },
    { "role": "user", "content": "What is the sum of 4 and 12?" },
    ...
  ]
}
```

---

**`docs/sphinx_doc/source/tutorial/example_multi_turn.md`**

To run the ALFWorld and WebShop environments, you need to set up the corresponding environments first.

- WebShop is a simulated online shopping environment where AI agents learn to shop based on user requirements. The platform allows agents to browse products, compare options, and make purchase decisions, mimicking real-world e-commerce interactions.

You may refer to their original repositories to complete the setup.
- For ALFWorld, refer to the [ALFWorld](https://github.com/alfworld/alfworld) repository.
- For WebShop, refer to the [WebShop](https://github.com/princeton-nlp/WebShop) repository.

### Data Preparation

Our dataset follows the format of the Hugging Face `datasets` library, so we need to convert the environment dataset accordingly.

The task is described as an environment instead of a single prompt.
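
As a rough sketch of such a conversion (the record fields and paths here are illustrative assumptions, not the repo's actual schema):

```python
from datasets import Dataset

# Hypothetical conversion: wrap per-task environment identifiers into a
# Hugging Face dataset so tasks can be loaded like any other dataset.
task_records = [{"task_id": i, "game_file": f"/PATH/TO/ALFWORLD/GAMES/task_{i}"} for i in range(100)]

dataset = Dataset.from_list(task_records)         # build an in-memory dataset
dataset.save_to_disk("/PATH/TO/ALFWORLD_TASKS")   # reload later with datasets.load_from_disk
```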

## Step 2: Config preparation and running the experiment

You can refer to [Quick Start](./example_reasoning_basic.md) to set up the config and other prerequisites. The default config files are [`alfworld.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_alfworld/alfworld.yaml) and [`webshop.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_webshop/webshop.yaml), respectively.

You may revise the configurations as needed and run the experiment!

```bash
# ...
```

The multi-turn interaction is implemented as a workflow class:

```python
class AlfworldWorkflow(MultiTurnWorkflow):
    ...
```
106
106
107
-
and include them in the init files in`trinity/common/workflows/__init__.py`
107
+
and include it in the init file`trinity/common/workflows/__init__.py`
108
108
109
109

```diff
# -*- coding: utf-8 -*-
...
+    "AlfworldWorkflow",
]
```

Then you are all set! It should be pretty simple😄, and the training processes in both environments converge.

---

**`docs/sphinx_doc/source/tutorial/trinity_configs.md`**

The configuration for **Trinity-RFT** is defined in a `YAML` file and organized into several sections:
```yaml
project: Trinity-RFT
name: example
mode: both
checkpoint_root_dir: /PATH/TO/CHECKPOINT
...
```

---

Specifies the algorithm type and its related hyperparameters.
```yaml
algorithm:
  algorithm_type: grpo
  repeat_times: 8

  # The following parameters are optional
  # If not specified, they will automatically be set based on the `algorithm_type`
  ...
  entropy_loss_fn: "default"
```

- `algorithm_type`: Type of reinforcement learning algorithm. Supported types: `ppo`, `grpo`, `opmd`, `dpo`, `sft`, `mix`.
- `repeat_times`: Number of times each task is repeated. Default is `1`. In `dpo`, this is automatically set to `2`. Some algorithms such as GRPO and OPMD require `repeat_times` > 1.
- `sample_strategy`: The sampling strategy used for loading experiences from the experience buffer.
- `advantage_fn`: The advantage function used for computing advantages.
- `kl_penalty_fn`: The KL penalty function used for computing the KL penalty applied to the reward.
- `kl_loss_fn`: The KL loss function used for computing KL loss.
- `entropy_loss_fn`: The entropy loss function used for computing entropy loss.

---

```yaml
monitor:
  monitor_type: wandb
```

- `monitor_type`: Type of monitoring system. Options:
  - `wandb`: Logs to [Weights & Biases](https://docs.wandb.ai/quickstart/). Requires logging in and setting `WANDB_API_KEY`. Project and run names match the `project` and `name` fields in global configs.
  - `tensorboard`: Logs to [TensorBoard](https://www.tensorflow.org/tensorboard). Files are saved under `<checkpoint_root_dir>/<project>/<name>/monitor/tensorboard`.

---

Defines the model paths and token limits.
```yaml
model:
  model_path: /PATH/TO/MODEL/
  critic_model_path: ''
  max_prompt_tokens: 4096
  max_response_tokens: 16384
```

- `model_path`: Path to the model being trained.
- `critic_model_path`: Optional path to a separate critic model. If empty, defaults to `model_path`.
- `max_prompt_tokens`: Maximum number of tokens allowed in input prompts.
- `max_response_tokens`: Maximum number of tokens allowed in generated responses.

---

```yaml
buffer:
  ...
  default_reward_fn_type: 'countdown_reward'
```

- `batch_size`: Number of tasks used per training step. *Please do not multiply this value by `algorithm.repeat_times` manually*.
- `total_epochs`: Total number of training epochs.
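
For example, assuming `batch_size: 32` and `algorithm.repeat_times: 8`, each explore step runs 32 tasks, each repeated 8 times, which yields 32 × 8 = 256 experiences; the framework handles this multiplication internally, which is why you should not fold `repeat_times` into `batch_size` yourself.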

### Explorer Input

The configuration for each task dataset is defined as follows:
- For `file` storage type, the path points to the directory that contains the task dataset files.
- For `queue` storage type, the path is optional. You can back up the data in the queue by specifying an SQLite database path here.
- For `sql` storage type, the path points to the SQLite database file.
- `subset_name`: The subset name of the task dataset. Default is `None`.
- `split`: The split of the task dataset. Default is `train`.
- `format`: Defines keys for prompts and responses in the dataset.
  - `prompt_key`: Specifies which column in the dataset contains the prompt data.
  - `response_key`: Specifies which column in the dataset contains the response data.
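
Putting these together, a task dataset entry might look like the following sketch (the enclosing key name and the GSM8K column names are illustrative assumptions, not verified against the schema):

```yaml
taskset:
  storage_type: file
  path: /PATH/TO/GSM8K/
  subset_name: main          # assumed subset name
  split: train
  format:
    prompt_key: 'question'   # column containing the prompt
    response_key: 'answer'   # column containing the response
```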

---

- `sync_method`: Method of synchronization. Options:
  - `nccl`: Uses NCCL for fast synchronization. Supported in `both` mode.
  - `checkpoint`: Loads the latest model from disk. Supported in `train`, `explore`, or `bench` mode.
- `sync_interval`: Interval (in steps) of model weight synchronization between the trainer and the explorer.
- `sync_timeout`: Timeout duration for synchronization.
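
A sketch of this section under those semantics (the interval and timeout values below are illustrative, not recommended defaults):

```yaml
synchronizer:
  sync_method: nccl   # requires `mode: both`
  sync_interval: 10   # sync model weights every 10 steps
  sync_timeout: 1200  # assumed to be in seconds
```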

---

- `trainer_type`: Trainer backend implementation. Currently only supports `verl`.
- `save_interval`: Frequency (in steps) at which to save model checkpoints.
- `trainer_config_path`: The path to the trainer configuration file.
- `trainer_config`: The trainer configuration provided inline. Only one of `trainer_config_path` and `trainer_config` should be specified.
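
A minimal sketch of this section (the `save_interval` value and the config path are illustrative):

```yaml
trainer:
  trainer_type: verl
  save_interval: 100   # save a checkpoint every 100 steps
  trainer_config_path: /PATH/TO/TRAINER_CONFIG.yaml   # or provide `trainer_config` inline
```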

---

Configures preprocessing and data cleaning pipelines.