
Commit eddf4e4

Refactor data module and support task pipeline in data processor (#92)

1 parent 6f2d7c7 · commit eddf4e4

31 files changed · +756 −553 lines

.gitignore

Lines changed: 1 addition & 0 deletions

```diff
@@ -84,6 +84,7 @@ ENV/
 logs/
 
 # data-juicer
+tmp/
 outputs/
 # agentscope
 runs/
```

docs/sphinx_doc/source/tutorial/example_data_functionalities.md

Lines changed: 113 additions & 83 deletions
````diff
@@ -1,80 +1,97 @@
 # Data Processing
 
-## Example: reasoning task
+## Example: Data Processor for Task Pipeline
 
-In this example, you will learn how to apply the data module of Trinity-RFT to prepare the dataset before exploring and training. This example takes GSM-8K dataset as the example dataset to figure out:
+In this example, you will learn how to apply the data processor of Trinity-RFT to prepare and prioritize the dataset before task exploring and training. This example takes the GSM-8K dataset as the example dataset to figure out:
 
-1. how to prepare the data module
-2. how to configure the data module
-3. what the data module can do
+1. how to prepare the data processor
+2. how to configure the data processor
+3. what the data processor can do
 
-Before getting started, you need to prepare the main environment of Trinity-RFT according to the [installation section of the README file](../main.md), and you need to install [postgresql](https://www.postgresql.org/docs/current/tutorial-install.html) as well.
+Before getting started, you need to prepare the main environment of Trinity-RFT according to the [installation section of the README file](../main.md).
 
 ### Data Preparation
 
-#### Prepare the Data Module
+#### Prepare the Data Processor
 
-As the overall framework of Trinity-RFT shows, the data module is one of the high-level functions. Trinity-RFT encapsulates the data module as an independent service to avoid dependency conflict issues. Thus you need to prepare a split environment for this module and start the server.
+As the overall framework of Trinity-RFT shows, the data processor is one of the high-level functions. Trinity-RFT encapsulates the data processor as an independent service to avoid dependency conflict issues. Thus you need to prepare a split environment for this module and start the server.
 
 ```shell
-# prepare split environments, including the one of data module
+# prepare split environments, including the one of the data processor
 python scripts/install.py
 
 # start all split servers
 python scripts/start_servers.py
 ```
 
-### Configure the Data Module
+### Configure the Data Processor
 
-Trinity-RFT uses a unified config file to manage all config items. For the data module, you need to focus on the `data_processor` section in the config file.
+Trinity-RFT uses a unified config file to manage all config items. For the data processor, you need to focus on the `data_processor` section in the config file.
 
 In this example, assume that you need to rank all math questions and corresponding answers by their difficulties. So you can set these config items like the following example:
 
 ```yaml
 data_processor:
-  # basic info
-  source_data_path: /PATH/TO/GSM8K/
-  load_kwargs:
-    split: 'train' # only need the train split
-  format: # set the field mappings
-    prompt_key: 'question'
-    response_key: 'answer'
-  # database related. The result dataset will be stored in the database.
-  db_url: 'postgresql://{user_name}@localhost:5432/{db_name}'
+  data_processor_url: 'http://127.0.0.1:5005/data_processor'
+  # task pipeline related
+  task_pipeline:
+    # I/O buffers
+    input_buffers:
+      - name: 'raw_input'
+        path: /PATH/TO/GSM8K/
+        storage_type: 'file'
+        raw: true
+    output_buffer:
+      name: 'raw_output'
+      path: /PATH/TO/OUTPUT/JSONL/FILE
+      storage_type: 'file'
+    # format mapping
+    format:
+      prompt_key: 'question'
+      response_key: 'answer'
 ```
 
-Here you can set the basic information for the GSM-8K dataset, database information that is used to store the result dataset, and some other items about downstream dataset loading for exploring and training:
+Here you can set the basic buffers for the GSM-8K dataset input and output, and some other items about downstream dataset loading for exploring and training:
 
-+ `source_data_path`: the path to the raw dataset.
-+ `load_kwargs`: extra config arguments for loading the raw dataset. Mainly for the `load_dataset` method in HuggingFace `datasets` library.
-+ `format`: some dataset format config items, which are used to map original data field names to unified ones.
-+ `db_url`: the URL of the postgresql database to store the result dataset.
++ `data_processor_url`: the URL of the data processor service, which is started in the previous step.
++ `task_pipeline`: the configs for the task pipeline, which is used to process the raw dataset. It consists of several inner configs:
+  + `input_buffers`: the input buffers for the task pipeline. We usually load from raw dataset files in this pipeline, thus we need to set the dataset `path`, set the `storage_type` to "file", and set `raw` to true. Multiple input buffers are allowed, and each buffer can be named with the `name` field.
+  + `output_buffer`: the output buffer for the task pipeline. We usually store the processed dataset in files as well, thus we need to set the `storage_type` to "file".
+  + `format`: some dataset format config items, which are used to map original data field names to unified ones.
 
-In addition, there are several config items related to the data active iterator, which is used to prepare a better dataset. The core part of the data active iterator, Data-Juicer, provides tens of operators to help clean or calculate key information for each sample in the dataset. You can configure this part depending on how familiar you are with Data-Juicer.
+In addition, there are several config items related to the data active iterator in the `task_pipeline` part, which is used to prepare a better dataset. The core part of the data active iterator, Data-Juicer, provides tens of operators to help clean or calculate key information for each sample in the dataset. You can configure this part depending on how familiar you are with Data-Juicer.
 
 #### Not familiar with Data-Juicer
-If you are not familiar with Data-Juicer, the data module provides a natural-language-based method to config the data processing recipe. What you need to do is only describe the demands of how you want to prepare for the raw dataset, and an agent will be invoked to arrange the data processing recipe for you. Here is an example:
+If you are not familiar with Data-Juicer, the data processor provides a natural-language-based method to configure the data processing recipe. You only need to describe how you want the raw dataset to be prepared, and an agent will be invoked to arrange the data processing recipe for you. Here is an example:
 
 ```yaml
 data_processor:
-  # basic info
-  source_data_path: /PATH/TO/GSM8K/
-  load_kwargs:
-    split: 'train' # only need the train split
-  format: # set the field mappings
-    prompt_key: 'question'
-    response_key: 'answer'
-  # database related. The result dataset will be stored in the database.
-  db_url: 'postgresql://{user_name}@localhost:5432/{db_name}'
-
-  #### new part about data active iterator
-  dj_process_desc: 'Please compute difficulty scores for these math questions.'
-  agent_model_name: 'qwen-max'
-  agent_model_config:
-    config_name: 'my-qwen-instruction'
-    model_type: 'dashscope_chat'
-    model_name: 'qwen2.5-72b-instruct'
-  clean_strategy: 'iterative'
+  data_processor_url: 'http://127.0.0.1:5005/data_processor'
+  # task pipeline related
+  task_pipeline:
+    # I/O buffers
+    input_buffers:
+      - name: 'raw_input'
+        path: /PATH/TO/GSM8K/
+        storage_type: 'file'
+        raw: true
+    output_buffer:
+      name: 'raw_output'
+      path: /PATH/TO/OUTPUT/JSONL/FILE
+      storage_type: 'file'
+    # format mapping
+    format:
+      prompt_key: 'question'
+      response_key: 'answer'
+
+    #### new part about data active iterator
+    dj_process_desc: 'Please compute difficulty scores for these math questions.'
+    agent_model_name: 'qwen-max'
+    agent_model_config:
+      config_name: 'my-qwen-instruction'
+      model_type: 'dashscope_chat'
+      model_name: 'qwen2.5-72b-instruct'
+    clean_strategy: 'iterative'
 ```
 
 You can write your demand description in config item `dj_process_desc`, and set the model name and configs used for the agent in config items `agent_model_name` and `agent_model_config`. Here we use Qwen2.5-72b-Instruct as our recipe managing agent. And you can set the `clean_strategy` to 'iterative' to get a better dataset.
````
````diff
@@ -99,19 +116,27 @@ After preparing the Data-Juicer data processing recipe, you can set the `dj_conf
 
 ```yaml
 data_processor:
-  # basic info
-  source_data_path: /PATH/TO/GSM8K/
-  load_kwargs:
-    split: 'train' # only need the train split
-  format: # set the field mappings
-    prompt_key: 'question'
-    response_key: 'answer'
-  # database related. The result dataset will be stored in the database.
-  db_url: 'postgresql://{user_name}@localhost:5432/{db_name}'
-
-  #### new part about data active iterator
-  dj_config_path: '/path/to/the/Data-Juicer/data/processing/recipe/above.yaml'
-  clean_strategy: 'iterative'
+  data_processor_url: 'http://127.0.0.1:5005/data_processor'
+  # task pipeline related
+  task_pipeline:
+    # I/O buffers
+    input_buffers:
+      - name: 'raw_input'
+        path: /PATH/TO/GSM8K/
+        storage_type: 'file'
+        raw: true
+    output_buffer:
+      name: 'raw_output'
+      path: /PATH/TO/OUTPUT/JSONL/FILE
+      storage_type: 'file'
+    # format mapping
+    format:
+      prompt_key: 'question'
+      response_key: 'answer'
+
+    #### new part about data active iterator
+    dj_config_path: '/path/to/the/Data-Juicer/data/processing/recipe/above.yaml'
+    clean_strategy: 'iterative'
 ```
 
 And you can set the `clean_strategy` to 'iterative' to get a better dataset.
````
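For readers writing such a recipe by hand: the file referenced by `dj_config_path` is a plain Data-Juicer YAML recipe. Below is a minimal sketch; the `project_name`, `np`, and the `text_length_filter` operator with its arguments are illustrative assumptions only, not part of this commit — a real recipe for this example would use difficulty-scoring operators instead.

```yaml
# Minimal sketch of a Data-Juicer recipe file (illustrative assumptions only).
# A real recipe for this example would use difficulty-scoring operators.
project_name: 'gsm8k-demo'   # assumed project name
np: 4                        # number of worker processes
process:                     # list of Data-Juicer operators, applied in order
  - text_length_filter:      # example operator: keep samples in a length range
      min_len: 10
      max_len: 2048
```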
````diff
@@ -123,7 +148,7 @@ All config items in the `data` section can be found [here](trinity_configs.md).
 
 
 ```{note}
-Only when one of `dj_process_desc` and `dj_config_path` is provided, the data module and the data active iterator will be activated. Otherwise, this part will be skipped and it will enter into the exploring stage directly.
+The data processor and the data active iterator are activated only when one of the `xxx_pipeline` configs is provided and one of `dj_process_desc` and `dj_config_path` is set in the pipeline config. Otherwise, this part will be skipped and Trinity-RFT will enter the exploring stage directly.
 ```
 
 ### Exploring & Training
````
````diff
@@ -140,49 +165,54 @@ ray start --address=<master_address>
 trinity run --config <Trinity-RFT_config_path>
 ```
 
-If you follow the steps above, Trinity-RFT will send a request to the data module server, the data active iterator will be activated and compute difficulty scores for each sample in the raw dataset. After that, the data module server stores the result dataset into the database, when exploring begins, it will load the prepared dataset and continue the downstream steps.
+If you follow the steps above, Trinity-RFT will send a request to the data processor server; the data active iterator will be activated, compute difficulty scores for each sample in the raw dataset, and rank the dataset according to the difficulty scores. After that, the data processor server stores the result dataset into the output buffer; when exploring begins, it will load the prepared dataset and continue the downstream steps.
 
-
-
-## Example: human in the loop
+## Example: Human in the Loop
 Sometimes, you might need to involve human feedbacks for some raw data. In this example, you will learn how to annotate raw data to get a better dataset before training. This example takes an example Q&A dataset and tries to select the chosen and rejected ones for DPO method.
 
-Before getting started, you need to prepare the main environment of Trinity-RFT according to the installation section of the README file, install postgresql, and [start a label-studio server](https://github.com/modelscope/data-juicer/tree/main/tools/humanops) from Data-Juicer from source.
+Before getting started, you need to prepare the main environment of Trinity-RFT according to the installation section of the README file, and [start a label-studio server](https://github.com/modelscope/data-juicer/tree/main/tools/humanops) from Data-Juicer from source.
 
 ### Data Preparation
 
-#### Prepare the Data Module
+#### Prepare the Data Processor
 
-As the overall framework of Trinity-RFT shows, the data module is one of the high-level functions. Trinity-RFT encapsulates the data module as an independent service to avoid dependency conflict issues. Thus you need to prepare a split environment for this module and start the server.
+As the overall framework of Trinity-RFT shows, the data processor is one of the high-level functions. Trinity-RFT encapsulates the data processor as an independent service to avoid dependency conflict issues. Thus you need to prepare a split environment for this module and start the server.
 
 ```shell
-# prepare split environments, including the one of data module
+# prepare split environments, including the one of the data processor
 python scripts/install.py
 
 # start all split servers
 python scripts/start_servers.py
 ```
 
-### Configure the Data Module
+### Configure the Data Processor
 
-Trinity-RFT uses a unified config file to manage all config items. For the data module, you need to focus on the `data_processor` section in the config file.
+Trinity-RFT uses a unified config file to manage all config items. For the data processor, you need to focus on the `data_processor` section in the config file.
 
-In this example, assume that you need to rank all math questions and corresponding answers by their difficulties. So you can set these config items like the following example:
+In this example, assume that you need to select the chosen and rejected responses for the DPO method. So you can set these config items like the following example:
 
 ```yaml
 data_processor:
-  # basic info
-  source_data_path: 'tests/test_data/test_human_annotator'
-  load_kwargs:
-    split: 'train' # only need the train split
-  format: # set the field mappings
-    prompt_key: 'prompt'
-    chosen_key: 'chosen'
-    rejected_key: 'rejected'
-  #### new part about data active iterator
-  dj_config_path: 'tests/test_configs/human_annotator_test_dj_cfg.yaml'
-  # database related. The result dataset will be stored in the database.
-  db_url: 'postgresql://{user_name}@localhost:5432/{db_name}'
+  data_processor_url: 'http://127.0.0.1:5005/data_processor'
+  # task pipeline related
+  task_pipeline:
+    # I/O buffers
+    input_buffers:
+      - name: 'raw_input'
+        path: 'tests/test_data/test_human_annotator'
+        storage_type: 'file'
+        raw: true
+    output_buffer:
+      name: 'raw_output'
+      path: './outputs/task_pipeline_output/prioritized_gsm8k.jsonl'
+      storage_type: 'file'
+    format: # set the field mappings
+      prompt_key: 'prompt'
+      chosen_key: 'chosen'
+      rejected_key: 'rejected'
+    #### new part about data active iterator
+    dj_config_path: 'tests/test_configs/human_annotator_test_dj_cfg.yaml'
 ```
 
 Here you can set the basic information for the example dataset, database information that is used to store the result dataset, and some other items about downstream dataset loading for exploring and training, which is similar to the example above.
@@ -223,7 +253,7 @@ You can set more config items for this OP (e.g. notification when annotation is
 
 ### Start Running
 
-When you start running with the RFT config, the data module will start the OP `human_preference_annotation_mapper`, and then you can find a new project on the "Projects" page of the label-studio server.
+When you start running with the RFT config, the data processor will start the OP `human_preference_annotation_mapper`, and then you can find a new project on the "Projects" page of the label-studio server.
 
 ![](../../assets/data-projects.png)
 
````
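A note on the recipe used in this second example: `tests/test_configs/human_annotator_test_dj_cfg.yaml` is not shown in this commit, but judging from the OP named above, its core would be a `process` list containing `human_preference_annotation_mapper`. The sketch below is an assumption about its shape, not the actual file.

```yaml
# Assumed shape of a recipe like human_annotator_test_dj_cfg.yaml.
# Only the op name comes from the diff above; the rest is illustrative.
process:
  - human_preference_annotation_mapper:
      # label-studio connection and notification settings would go here
```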

environments/data.yaml

Lines changed: 0 additions & 5 deletions

```diff
@@ -6,10 +6,5 @@ dependencies:
   - pip:
     - py-data-juicer
    - agentscope
-    - flask
-    - omegaconf
-    - sqlalchemy
-    - psycopg2
-    - networkx
     - transformers
     - "-e ..[dev]"
```

examples/grpo_gsm8k/gsm8k.yaml

Lines changed: 0 additions & 15 deletions

```diff
@@ -4,19 +4,6 @@ checkpoint_root_dir: /PATH/TO/CHECKPOINT/
 algorithm:
   algorithm_type: grpo
   repeat_times: 8
-data_processor:
-  # basic info
-  source_data_path: 'openai/gsm8k'
-  # data active iterator related
-  dj_process_desc: 'Please compute difficulty scores for these math questions.'
-  agent_model_name: 'qwen-max'
-  agent_model_config:
-    config_name: 'my-qwen-instruction'
-    model_type: 'dashscope_chat'
-    model_name: 'qwen2.5-72b-instruct'
-  clean_strategy: 'iterative'
-  # db related
-  db_url: ''
 
 model:
   model_path: /PATH/TO/MODEL/
@@ -41,9 +28,7 @@ buffer:
       prompt_key: 'question'
       response_key: 'answer'
     rollout_args:
-      n: 8
       temperature: 1.0
-      logprobs: 0
   eval_tasksets:
   - name: gsm8k-eval
     storage_type: file
```
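The `data_processor` section is dropped from this example config; per the new README below, the task pipeline now handles dataset prioritization. If you wanted to wire it back in, a sketch mirroring the updated tutorial above might look like this (paths, buffer names, and the output location are illustrative assumptions):

```yaml
# Sketch only: a task-pipeline version of the removed section, assembled from
# the tutorial diff above. Paths and buffer names are illustrative assumptions.
data_processor:
  data_processor_url: 'http://127.0.0.1:5005/data_processor'
  task_pipeline:
    input_buffers:
      - name: 'raw_input'
        path: 'openai/gsm8k'   # raw dataset, as in the removed config
        storage_type: 'file'
        raw: true
    output_buffer:
      name: 'raw_output'
      path: './outputs/task_pipeline_output/prioritized_gsm8k.jsonl'
      storage_type: 'file'
    format:
      prompt_key: 'question'
      response_key: 'answer'
    dj_process_desc: 'Please compute difficulty scores for these math questions.'
    agent_model_name: 'qwen-max'
    clean_strategy: 'iterative'
```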
(new file)

Lines changed: 7 additions & 0 deletions
```diff
@@ -0,0 +1,7 @@
+# GRPO on GSM8K dataset with Task Pipeline
+
+This example shows the usage of GRPO on the GSM8K dataset, with a task pipeline to prioritize the raw dataset before training.
+
+For more detailed information, please refer to the [documentation](../../docs/sphinx_doc/source/tutorial/example_data_functionalities.md).
+
+The config files are located in [`gsm8k.yaml`](gsm8k.yaml) and [`train_gsm8k.yaml`](train_gsm8k.yaml).
```
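Once the split servers are running, launching this example presumably follows the tutorial's command, e.g. `trinity run --config examples/grpo_gsm8k/gsm8k.yaml` (the config path is inferred from this example's directory, not stated in the commit).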
