Commit d16f0a8

* update example docs for experience pipeline

1 parent d9501cf commit d16f0a8

File tree

2 files changed: +97, -1 lines


docs/sphinx_doc/source/tutorial/example_data_functionalities.md

Lines changed: 95 additions & 1 deletion
@@ -141,7 +141,7 @@ And you can set the `clean_strategy` to 'iterative' to get a better dataset.

-All config items in the `data` section can be found [here](trinity_configs.md). A prepared config file for this example of GSM-8K can be found in [the config file of gsm8k](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_gsm8k/gsm8k.yaml).
+All config items in the `data` section can be found [here](trinity_configs.md). A prepared config file for this example of GSM-8K can be found in [the config file of gsm8k](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_gsm8k_task_pipeline/gsm8k.yaml).
@@ -165,6 +165,100 @@ trinity run --config <Trinity-RFT_config_path>

If you follow the steps above, Trinity-RFT will send a request to the data processor server; the data active iterator will be activated to compute a difficulty score for each sample in the raw dataset and rank the dataset by those scores. The data processor server then stores the resulting dataset in the output buffer; when exploring begins, it loads the prepared dataset and continues the downstream steps.
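The score-rank-store flow described above can be sketched as follows; the samples, scorer, and list-based buffer here are made up for illustration and do not reflect the actual data processor server code:

```python
# Toy illustration of the data active iterator flow: score each sample,
# rank the dataset by difficulty, and store the result in an output buffer.
def difficulty_score(sample):
    # stand-in for a real scorer, e.g. an LLM-based difficulty operator
    return len(sample["question"]) / 100.0

raw_dataset = [
    {"question": "2 + 2 = ?"},
    {"question": "A train travels 60 km in 45 minutes; what is its speed in km/h?"},
    {"question": "What is 7 * 8?"},
]

# compute a difficulty score for each sample, then rank hardest-first
for sample in raw_dataset:
    sample["difficulty"] = difficulty_score(sample)
ranked = sorted(raw_dataset, key=lambda s: s["difficulty"], reverse=True)

# the "output buffer" here is just a list; the real server persists the result
output_buffer = ranked
print([round(s["difficulty"], 2) for s in output_buffer])
```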
## Example: Data Processor for Experience Pipeline
In this example, you will learn how to apply the data processor of Trinity-RFT to reshape the rewards of experiences after exploring. It takes the GSM-8K dataset as the example dataset.

Before getting started, you need to prepare the main environment of Trinity-RFT and start the server for the data processor, following the first subsection of the previous example.
### Configure the Data Processor
In this example, assume that you want to add an extra reward item, based on quality scores of the experiences, to the experiences output by the explorer. You can set the `experience_pipeline` config as in the following example:
```yaml
data_processor:
  data_processor_url: 'http://127.0.0.1:5005/data_processor'
  # experience pipeline related
  experience_pipeline:
    # I/O buffers
    input_buffers:
      - name: gsm8k_exp_output
    output_buffer:
      name: reshaped_gsm8k_exp_input
    # format mapping
    format:
      reward_key: 'reward'  # the key name of the reward in the experience
    # data active iterator related
    dj_config_path: 'examples/grpo_gsm8k_experience_pipeline/dj_scoring_exp.yaml'
    clean_strategy: 'iterative'
    # reward shaping
    reward_shaping:
      - stats_key: 'llm_quality_score'
        op_type: ADD
        weight: 1.0

# the buffer config
buffer:
  ...
  explorer_output:
    name: gsm8k_exp_output
    storage_type: queue
    path: 'sqlite:///gsm8k_exp_output.db'
  trainer_input:
    experience_buffer:
      name: reshaped_gsm8k_exp_input
      storage_type: queue
      path: 'sqlite:///reshaped_gsm8k_exp_input.db'
```

Here you can set the input/output buffers for the experience pipeline, along with the reward shaping items:

+ `data_processor_url`: the URL of the data processor service, which was started in the previous step.
+ `experience_pipeline`: the configs for the experience pipeline, which processes the experiences output by the explorer (e.g., reward shaping, data filtering, and augmentation). It consists of several inner configs:
  + `input_buffers`: the input buffers for the experience pipeline. The pipeline usually loads from the explorer output buffer, so the name specified here must match the `explorer_output` in the `buffer` config. Multiple input buffers are allowed, but for now we only need one.
  + `output_buffer`: the output buffer for the experience pipeline. The pipeline usually writes its results to the trainer's input buffer, so the buffer name here must match the `trainer_input` in the `buffer` config.
  + `format`: dataset format config items, used to map original data field names to unified ones. Here we only need to specify the field name that stores the original reward information.
  + `reward_shaping`: the methods used to reshape the reward. Usually, stats computed by Data-Juicer operators serve as new reward items. It's a list, so multiple reshaping methods are allowed. Each item has the following config items:
    + `stats_key`: which stats to use as the new reward item.
    + `op_type`: the operator that applies the new reward item to the original reward. For now, "ADD", "SUB", "MUL", and "DIV" are supported.
    + `weight`: the weight of the new reward item.
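Each reshaping rule combines a computed stat with the current reward. A minimal sketch of the arithmetic (the `shape_reward` helper is hypothetical, not a Trinity-RFT API):

```python
# Minimal sketch (not Trinity-RFT's actual implementation) of how a
# `reward_shaping` entry combines a computed stat with the original reward.
OPS = {
    "ADD": lambda r, s: r + s,
    "SUB": lambda r, s: r - s,
    "MUL": lambda r, s: r * s,
    "DIV": lambda r, s: r / s,
}

def shape_reward(reward, stats, reward_shaping):
    """Apply each rule in order: reward = op(reward, weight * stat)."""
    for rule in reward_shaping:
        stat = stats[rule["stats_key"]]
        op = OPS[rule["op_type"]]
        reward = op(reward, rule["weight"] * stat)
    return reward

# Example matching the config above: add `llm_quality_score` with weight 1.0
shaped = shape_reward(
    reward=0.5,
    stats={"llm_quality_score": 0.8},
    reward_shaping=[{"stats_key": "llm_quality_score", "op_type": "ADD", "weight": 1.0}],
)
print(shaped)
```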
In addition, there are several config items in the `experience_pipeline` part related to the data active iterator, which computes the stats used to reshape rewards. This part is similar to the `task_pipeline` part in the previous example. The Data-Juicer config used here is:
```yaml
# This is a Data-Juicer data processing recipe
project_name: 'gsm-8k-experience-quality'

np: 32

process:
  - llm_quality_score_filter:
      api_or_hf_model: "qwen2.5-32b-instruct"  # use "qwen2.5-32b-instruct" to calculate the quality scores
      min_score: 0.0
      input_keys: ["prompt_text", "prompt_text"]  # set input_keys and field_names to the existing key names in gsm-8k. Here the quality scores are calculated according to both questions and answers.
      field_names: ["prompt", "response"]
```

All config items in the `data` section can be found [here](trinity_configs.md). A prepared config file for this example of GSM-8K can be found in [the config file of gsm8k](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_gsm8k_experience_pipeline/gsm8k.yaml).
### Exploring & Training
After preparing the config files of Trinity-RFT, you can start your ray cluster and run the RFT process including the data active iterator part with the following commands:
```shell
# start the ray cluster
# on master node
ray start --head
# on worker nodes
ray start --address=<master_address>

# run RFT
trinity run --config <Trinity-RFT_config_path>
```

If you follow the steps above, Trinity-RFT will send a request to the data processor server and prepare the experience pipeline. The pipeline watches the explorer output buffer; once a new batch of experiences arrives, the data processor computes stats for them and reshapes the rewards, then writes the reshaped experiences to the trainer input buffer for training.
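This watch-process-write cycle can be sketched as follows; it is a toy illustration with in-memory queues and stub scoring/shaping functions, not Trinity-RFT's actual implementation:

```python
import queue

def run_experience_pipeline(explorer_output, trainer_input, compute_stats, shape):
    """Drain the explorer output buffer; for each batch, compute stats,
    reshape rewards, and forward the result to the trainer input buffer."""
    while True:
        try:
            batch = explorer_output.get_nowait()
        except queue.Empty:
            break  # the real pipeline keeps watching; we stop once drained
        for exp in batch:
            stats = compute_stats(exp)  # e.g. {'llm_quality_score': ...}
            exp["reward"] = shape(exp["reward"], stats)
        trainer_input.put(batch)

# demo with stub scoring/shaping functions
explorer_output, trainer_input = queue.Queue(), queue.Queue()
explorer_output.put([{"prompt": "1+1=?", "response": "2", "reward": 0.5}])
run_experience_pipeline(
    explorer_output,
    trainer_input,
    compute_stats=lambda exp: {"llm_quality_score": 0.8},
    shape=lambda reward, stats: reward + stats["llm_quality_score"],
)
processed = trainer_input.get()
print(processed[0]["reward"])
```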
## Example: Human in the Loop

Sometimes, you might need to involve human feedback for some raw data. In this example, you will learn how to annotate raw data to get a better dataset before training. This example takes an example Q&A dataset and tries to select chosen and rejected responses for the DPO method.

examples/grpo_gsm8k_experience_pipeline/dj_scoring_exp.yaml

Lines changed: 2 additions & 0 deletions
```diff
@@ -1,6 +1,8 @@
 # This is a Data-Juicer data processing recipe
 project_name: 'gsm-8k-experience-quality'

+np: 32
+
 process:
   - llm_quality_score_filter:
       api_or_hf_model: "qwen2.5-32b-instruct" # use "qwen2.5-32b-instruct" to calculate the quality scores.
```
