And you can set the `clean_strategy` to 'iterative' to get a better dataset.

All config items in the `data` section can be found [here](trinity_configs.md). A prepared config file for this example of GSM-8K can be found in [the config file of gsm8k](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_gsm8k_task_pipeline/gsm8k.yaml).
If you follow the steps above, Trinity-RFT will send a request to the data processor server, and the data active iterator will be activated to compute difficulty scores for each sample in the raw dataset and rank the dataset by those scores. After that, the data processor server stores the resulting dataset in the output buffer; when exploring begins, the explorer loads the prepared dataset and continues with the downstream steps.
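The ranking step above can be pictured with a small, self-contained sketch (plain Python; the field names `question` and `difficulty_score` are illustrative, not Trinity-RFT's actual schema):

```python
# Illustrative sketch of ranking a dataset by per-sample difficulty scores,
# as the data active iterator does conceptually. Field names are hypothetical.
samples = [
    {"question": "2 + 2 = ?", "difficulty_score": 0.1},
    {"question": "Solve x^2 - 5x + 6 = 0", "difficulty_score": 0.6},
    {"question": "Prove sqrt(2) is irrational", "difficulty_score": 0.9},
]

# Sort by difficulty (descending here; the actual ordering is up to the configuration).
ranked = sorted(samples, key=lambda s: s["difficulty_score"], reverse=True)
print([s["difficulty_score"] for s in ranked])  # → [0.9, 0.6, 0.1]
```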
## Example: Data Processor for Experience Pipeline
In this example, you will learn how to apply the data processor of Trinity-RFT to reshape the rewards of experiences after exploring. It takes the GSM-8K dataset as the example dataset.
Before getting started, you need to prepare the main environment of Trinity-RFT and start the server for the data processor, following the first subsection of the previous example.
### Configure the Data Processor
In this example, assume that you need to add an extra reward item to the experiences output by the explorer, based on their quality scores. You can set the `experience_pipeline` config like the following example:
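A minimal illustration of the shape such a config might take (all values here are assumptions for illustration only; see the linked config reference below for the authoritative schema):

```yaml
# Hypothetical sketch of an `experience_pipeline` config; values are illustrative.
data_processor_url: 'http://127.0.0.1:5005/data_processor'
experience_pipeline:
  input_buffers:
    - name: gsm8k_exp_output   # aligned with `explorer_output` in the `buffer` config
  output_buffer:
    name: gsm8k_train_input    # aligned with `trainer_input` in the `buffer` config
  format:
    reward_key: 'reward'       # field that stores the original reward
  reward_shaping:
    - stats_key: 'llm_quality_score'
      op_type: ADD
      weight: 1.0
```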
Here you can set the input/output buffers for the experience pipeline, and some other items about reward shaping:

- `data_processor_url`: the URL of the data processor service, which was started in the previous step.
- `experience_pipeline`: the configs for the experience pipeline. The experience pipeline processes the experiences output by the explorer, with steps such as reward shaping, data filtering, and augmentation. It consists of several inner configs:
  - `input_buffers`: the input buffers for the experience pipeline. The pipeline usually loads from the explorer output buffer, so we need to specify the `explorer_output` in the `buffer` config; here we only need to specify a name that is aligned with the `explorer_output`. Multiple input buffers are allowed, but for now we only need one.
  - `output_buffer`: the output buffer for the experience pipeline. The pipeline usually writes results to the input buffer of the trainer, so we only need to specify a buffer name that is aligned with the `trainer_input` in the `buffer` config.
  - `format`: some dataset format config items, which map original data field names to unified ones. Here we only need to specify the field name that stores the original reward information.
  - `reward_shaping`: the methods used to reshape the reward. Usually we use stats computed by Data-Juicer operators as new reward items. It's a list that allows multiple reward-shaping methods. Each item in the list has the following config items:
    - `stats_key`: which stats to use as the new reward item.
    - `op_type`: the operation that applies the new reward item to the original reward. For now, `ADD`, `SUB`, `MUL`, and `DIV` are supported.
    - `weight`: the weight of the new reward item.

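How a reward-shaping item combines with the original reward can be sketched in plain Python. This is a hypothetical re-implementation of the semantics described above, not Trinity-RFT's actual code; in particular, applying the weight before `MUL`/`DIV` is an assumption.

```python
# Hypothetical sketch of reward shaping: combine a Data-Juicer stat with the
# original reward using one of the supported operations and a weight.
def reshape_reward(original_reward: float, stat: float, op_type: str, weight: float) -> float:
    weighted = weight * stat
    if op_type == "ADD":
        return original_reward + weighted
    if op_type == "SUB":
        return original_reward - weighted
    if op_type == "MUL":
        return original_reward * weighted
    if op_type == "DIV":
        return original_reward / weighted
    raise ValueError(f"unsupported op_type: {op_type}")

# e.g. add a quality score of 0.8 (weight 0.5) on top of an original reward of 1.0
print(reshape_reward(1.0, 0.8, "ADD", 0.5))  # → 1.4
```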
In addition, there are several config items in the `experience_pipeline` part related to the data active iterator, which is used to compute the stats for reshaping rewards. This part is similar to the `task_pipeline` part in the previous example. The Data-Juicer config used here is:
```yaml
# This is a Data-Juicer data processing recipe
project_name: 'gsm-8k-experience-quality'

np: 32

process:
  - llm_quality_score_filter:
      api_or_hf_model: "qwen2.5-32b-instruct"  # use "qwen2.5-32b-instruct" to calculate the quality scores
      min_score: 0.0
      input_keys: ["prompt_text", "prompt_text"]  # set input_keys and field_names to the existing key names in gsm-8k. Here the quality scores are calculated according to both questions and answers.
      field_names: ["prompt", "response"]
```
All config items in the `data` section can be found [here](trinity_configs.md). A prepared config file for this example of GSM-8K can be found in [the config file of gsm8k](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_gsm8k_experience_pipeline/gsm8k.yaml).
### Exploring & Training
After preparing the config files of Trinity-RFT, you can start your Ray cluster and run the RFT process, including the data active iterator part, with the following commands:
```shell
# start the ray cluster
# on master node
ray start --head
# on worker nodes
ray start --address=<master_address>

# run RFT
trinity run --config <Trinity-RFT_config_path>
```
If you follow the steps above, Trinity-RFT will send a request to the data processor server and prepare the experience pipeline.
The pipeline watches the explorer output buffer. Once there is a new batch of experiences, the data processor will compute stats for them and reshape the rewards. Then it writes the reshaped experiences to the trainer input buffer for training.

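The watch-compute-write loop can be sketched with plain in-memory queues. Everything below is hypothetical (the real pipeline uses Trinity-RFT's buffer implementations and Data-Juicer operators); it only illustrates the data flow just described.

```python
from queue import Queue

# Hypothetical sketch of the experience pipeline loop: watch the explorer
# output buffer, compute a stat per experience, reshape rewards, and write
# the results to the trainer input buffer.
explorer_output: Queue = Queue()
trainer_input: Queue = Queue()

def quality_stat(exp: dict) -> float:
    # Stand-in for a Data-Juicer operator such as llm_quality_score_filter.
    return 0.8

def process_batch(batch: list) -> list:
    for exp in batch:
        # ADD op with weight 1.0, as in the reward_shaping config above.
        exp["reward"] = exp["reward"] + 1.0 * quality_stat(exp)
    return batch

# The explorer writes a batch; the pipeline picks it up and forwards it.
explorer_output.put([{"prompt": "2 + 2 = ?", "reward": 1.0}])
batch = explorer_output.get(timeout=1)  # the "watch" step, simplified
trainer_input.put(process_batch(batch))

result = trainer_input.get()
print(result[0]["reward"])  # → 1.8
```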
## Example: Human in the Loop
Sometimes, you might need to involve human feedback on some raw data. In this example, you will learn how to annotate raw data to get a better dataset before training. This example takes an example Q&A dataset and tries to select the chosen and rejected responses for the DPO method.