
Commit 2baed2e

Commit message: * modify according to xuchen's comments
1 parent: 1666099

File tree: 7 files changed (+35, -41 lines)


docs/sphinx_doc/source/tutorial/example_data_functionalities.md

Lines changed: 19 additions & 25 deletions
@@ -2,22 +2,22 @@
 
 ## Example: Data Processor for Task Pipeline
 
-In this example, you will learn how to apply the data processor workflow of Trinity-RFT to prepare and prioritize the dataset before task exploring and training. This example takes GSM-8K dataset as the example dataset to figure out:
+In this example, you will learn how to apply the data processor of Trinity-RFT to prepare and prioritize the dataset before task exploring and training. This example takes GSM-8K dataset as the example dataset to figure out:
 
-1. how to prepare the data workflow
+1. how to prepare the data processor
 2. how to configure the data processor
-3. what the data workflow can do
+3. what the data processor can do
 
 Before getting started, you need to prepare the main environment of Trinity-RFT according to the [installation section of the README file](../main.md).
 
 ### Data Preparation
 
-#### Prepare the Data Workflow
+#### Prepare the Data Processor
 
-As the overall framework of Trinity-RFT shows, the data workflow is one of the high-level functions. Trinity-RFT encapsulates the data workflow as an independent service to avoid dependency conflict issues. Thus you need to prepare a split environment for this module and start the server.
+As the overall framework of Trinity-RFT shows, the data processor is one of the high-level functions. Trinity-RFT encapsulates the data processor as an independent service to avoid dependency conflict issues. Thus you need to prepare a split environment for this module and start the server.
 
 ```shell
-# prepare split environments, including the one of data workflow
+# prepare split environments, including the one of data processor
 python scripts/install.py
 
 # start all split servers
@@ -32,7 +32,7 @@ In this example, assume that you need to rank all math questions and correspondi
 
 ```yaml
 data_processor:
-  data_workflow_url: 'http://127.0.0.1:5005/data_workflow'
+  data_processor_url: 'http://127.0.0.1:5005/data_processor'
   # task pipeline related
   task_pipeline:
     # I/O buffers
@@ -53,7 +53,7 @@ data_processor:
 
 Here you can set the basic buffers for the GSM-8K dataset input and output and some other items about downstream dataset loading for exploring and training:
 
-+ `data_workflow_url`: the URL of the data processor service, which is started in the previous step.
++ `data_processor_url`: the URL of the data processor service, which is started in the previous step.
 + `task_pipeline`: the configs for the task pipeline. Task pipeline is used to process the raw dataset. It consists of several inner configs:
   + `input_buffers`: the input buffers for the task pipeline. We usually load from raw dataset files in this pipeline, thus we need to the dataset `path` and set the `storage_type` to "file" and set `raw` to True. It allows multiple input buffers. We can name each buffer with the `name` field.
   + `output_buffer`: the output buffer for the task pipeline. We usually store the processed dataset in files as well, thus we need to set the `storage_type` to "file".
@@ -66,7 +66,7 @@ If you are not familiar with Data-Juicer, the data processor provides a natural-
 
 ```yaml
 data_processor:
-  data_workflow_url: 'http://127.0.0.1:5005/data_workflow'
+  data_processor_url: 'http://127.0.0.1:5005/data_processor'
   # task pipeline related
   task_pipeline:
     # I/O buffers
@@ -116,7 +116,7 @@ After preparing the Data-Juicer data processing recipe, you can set the `dj_conf
 
 ```yaml
 data_processor:
-  data_workflow_url: 'http://127.0.0.1:5005/data_workflow'
+  data_processor_url: 'http://127.0.0.1:5005/data_processor'
   # task pipeline related
   task_pipeline:
     # I/O buffers
@@ -148,7 +148,7 @@ All config items in the `data` section can be found [here](trinity_configs.md).
 
 
 ```{note}
-Only when one of `xxx_pipeline` is provided, and one of `dj_process_desc` and `dj_config_path` in the pipeline config is provided, the data workflow and the data active iterator will be activated. Otherwise, this part will be skipped and it will enter into the exploring stage directly.
+Only when one of `xxx_pipeline` is provided, and one of `dj_process_desc` and `dj_config_path` in the pipeline config is provided, the data processor and the data active iterator will be activated. Otherwise, this part will be skipped and it will enter into the exploring stage directly.
 ```
 
 ### Exploring & Training
@@ -165,13 +165,7 @@ ray start --address=<master_address>
 trinity run --config <Trinity-RFT_config_path>
 ```
 
-If you follow the steps above, Trinity-RFT will send a request to the data workflow server, the data active iterator will be activated, compute difficulty scores for each sample in the raw dataset, and rank the dataset according to difficulty scores. After that, the data workflow server stores the result dataset into the output buffer, when exploring begins, it will load the prepared dataset and continue the downstream steps.
-
-
-## Example: Data Processor for Experience Pipeline
-
-TBD.
-
+If you follow the steps above, Trinity-RFT will send a request to the data processor server, the data active iterator will be activated, compute difficulty scores for each sample in the raw dataset, and rank the dataset according to difficulty scores. After that, the data processor server stores the result dataset into the output buffer, when exploring begins, it will load the prepared dataset and continue the downstream steps.
 
 ## Example: Human in the Loop
 Sometimes, you might need to involve human feedbacks for some raw data. In this example, you will learn how to annotate raw data to get a better dataset before training. This example takes an example Q&A dataset and tries to select the chosen and rejected ones for DPO method.
@@ -180,27 +174,27 @@ Before getting started, you need to prepare the main environment of Trinity-RFT
 
 ### Data Preparation
 
-#### Prepare the Data Workflow
+#### Prepare the Data Processor
 
-As the overall framework of Trinity-RFT shows, the data workflow is one of the high-level functions. Trinity-RFT encapsulates the data workflow as an independent service to avoid dependency conflict issues. Thus you need to prepare a split environment for this module and start the server.
+As the overall framework of Trinity-RFT shows, the data processor is one of the high-level functions. Trinity-RFT encapsulates the data processor as an independent service to avoid dependency conflict issues. Thus you need to prepare a split environment for this module and start the server.
 
 ```shell
-# prepare split environments, including the one of data workflow
+# prepare split environments, including the one of data processor
 python scripts/install.py
 
 # start all split servers
 python scripts/start_servers.py
 ```
 
-### Configure the Data Workflow
+### Configure the Data Processor
 
-Trinity-RFT uses a unified config file to manage all config items. For the data workflow, you need to focus on the `data_processor` section in the config file.
+Trinity-RFT uses a unified config file to manage all config items. For the data processor, you need to focus on the `data_processor` section in the config file.
 
 In this example, assume that you need to select the chosen and rejected responses for DPO method. So you can set these config items like the following example:
 
 ```yaml
 data_processor:
-  data_workflow_url: 'http://127.0.0.1:5005/data_workflow'
+  data_processor_url: 'http://127.0.0.1:5005/data_processor'
   # task pipeline related
   task_pipeline:
     # I/O buffers
@@ -259,7 +253,7 @@ You can set more config items for this OP (e.g. notification when annotation is
 
 ### Start Running
 
-When you start running with the RFT config, the data workflow will start the OP `human_preference_annotation_mapper`, and then you can find a new project on the "Projects" page of the label-studio server.
+When you start running with the RFT config, the data processor will start the OP `human_preference_annotation_mapper`, and then you can find a new project on the "Projects" page of the label-studio server.
 
 ![](../../assets/data-projects.png)
 
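For reference, the task-pipeline flow this tutorial describes (ask the data processor service to prepare the dataset, then launch exploring and training) can be sketched in a few lines of Python. This is an illustrative sketch rather than code from the diff: the config path is the example file touched by this commit, and calling `trinity run` through `subprocess` stands in for the CLI step, which normally issues the activation request itself.

```python
# Illustrative sketch only: manually trigger the task pipeline on the data
# processor service, then launch exploring/training. In normal usage,
# `trinity run` performs the activation request itself (see launcher.py below).
import subprocess

from trinity.cli.client import request

CONFIG = "examples/grpo_gsm8k_task_pipeline/gsm8k.yaml"

res = request(
    url="http://127.0.0.1:5005/data_processor/task_pipeline",
    configPath=CONFIG,
)
if res and res.get("return_code") == 0:
    # The ranked dataset now sits in the task pipeline's output buffer;
    # exploring loads it from there when training starts.
    subprocess.run(["trinity", "run", "--config", CONFIG], check=True)
```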

examples/grpo_gsm8k_task_pipeline/gsm8k.yaml

Lines changed: 1 addition & 1 deletion
@@ -5,7 +5,7 @@ algorithm:
   algorithm_type: grpo
   repeat_times: 8
 data_processor:
-  data_workflow_url: 'http://127.0.0.1:5005/data_workflow'
+  data_processor_url: 'http://127.0.0.1:5005/data_processor'
   # task pipeline related
   task_pipeline:
     # I/O buffers
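A quick way to confirm the renamed key is picked up from this example config is to load the YAML directly. This is a throwaway check, assuming PyYAML is installed; Trinity-RFT itself loads configs through `trinity.common.config.load_config`.

```python
# Throwaway sanity check (illustrative, not part of the commit): confirm the
# example config now uses data_processor_url instead of data_workflow_url.
import yaml  # PyYAML, assumed available

with open("examples/grpo_gsm8k_task_pipeline/gsm8k.yaml") as f:
    cfg = yaml.safe_load(f)

dp = cfg.get("data_processor", {})
assert "data_processor_url" in dp, "config still uses the old data_workflow_url key"
print(dp["data_processor_url"])  # expected: http://127.0.0.1:5005/data_processor
```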

trinity/cli/client.py

Lines changed: 2 additions & 2 deletions
@@ -31,12 +31,12 @@ def request(url, **kwargs):
 
 if __name__ == "__main__":
     # --- only for local testing
-    LOCAL_DATA_WORKFLOW_SERVER_URL = "http://127.0.0.1:5005/data_workflow"
+    LOCAL_DATA_PROCESSOR_SERVER_URL = "http://127.0.0.1:5005/data_processor"
     LOCAL_TRINITY_TRAINING_SERVER_URL = "http://127.0.0.1:5006/trinity_rft"
     # --- only for local testing
 
     res = request(
-        url=LOCAL_DATA_WORKFLOW_SERVER_URL,
+        url=LOCAL_DATA_PROCESSOR_SERVER_URL,
         configPath="examples/grpo_gsm8k/gsm8k.yaml",
     )
     if res:
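The renamed constant is consumed by the `request` helper defined at the top of this file, whose body is not shown in the diff. A minimal sketch of what such a helper could look like, assuming it issues a GET with the keyword arguments as query parameters and returns the JSON body; this is an illustration, not the actual implementation.

```python
# Hypothetical sketch of a helper with the same signature as
# trinity.cli.client.request; the real implementation may differ.
import requests


def request(url: str, **kwargs) -> dict:
    """GET a Trinity-RFT service endpoint, passing kwargs as query parameters."""
    resp = requests.get(url, params=kwargs, timeout=3600)
    resp.raise_for_status()
    return resp.json()  # e.g. {"return_code": 0, ...}


# Usage mirrors the local-testing block above and the data/readme.md example:
# request(url="http://127.0.0.1:5005/data_processor/task_pipeline",
#         configPath="examples/grpo_gsm8k/gsm8k.yaml")
```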

trinity/cli/launcher.py

Lines changed: 8 additions & 8 deletions
@@ -144,13 +144,13 @@ def both(config: Config) -> None:
     ray.get(trainer.shutdown.remote())
 
 
-def activate_data_module(data_workflow_url: str, config_path: str):
+def activate_data_module(data_processor_url: str, config_path: str):
     """Check whether to activate data module and preprocess datasets."""
     from trinity.cli.client import request
 
-    logger.info(f"Activating data module of {data_workflow_url}...")
+    logger.info(f"Activating data module of {data_processor_url}...")
     res = request(
-        url=data_workflow_url,
+        url=data_processor_url,
         configPath=config_path,
     )
     if res["return_code"] != 0:
@@ -190,7 +190,7 @@ def validate_data_pipeline(data_pipeline_config: DataPipelineConfig, pipeline_ty
         return False
     elif pipeline_type == "experience":
         # experience pipeline specific
-        pass
+        raise NotImplementedError("experience_pipeline is not implemented yet.")
     else:
         logger.warning(
             f'Invalid pipeline type: {pipeline_type}. Should be one of ["task", "experience"].'
@@ -207,21 +207,21 @@ def run(config_path: str, dlc: bool = False, plugin_dir: str = None):
     # try to activate task pipeline for raw data
     data_processor_config = config.data_processor
     if (
-        data_processor_config.data_workflow_url
+        data_processor_config.data_processor_url
         and data_processor_config.task_pipeline
         and validate_data_pipeline(data_processor_config.task_pipeline, "task")
     ):
         activate_data_module(
-            f"{data_processor_config.data_workflow_url}/task_pipeline", config_path
+            f"{data_processor_config.data_processor_url}/task_pipeline", config_path
         )
     # try to activate experience pipeline for experiences
     if (
-        data_processor_config.data_workflow_url
+        data_processor_config.data_processor_url
         and data_processor_config.experience_pipeline
         and validate_data_pipeline(data_processor_config.experience_pipeline, "experience")
     ):
         activate_data_module(
-            f"{data_processor_config.data_workflow_url}/experience_pipeline", config_path
+            f"{data_processor_config.data_processor_url}/experience_pipeline", config_path
         )
     ray_namespace = f"{config.project}-{config.name}"
     if dlc:
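To make the URL composition concrete: the launcher appends the pipeline type to `data_processor_url` before calling `activate_data_module`. The standalone sketch below mirrors that logic with a stripped-down stand-in for `DataProcessorConfig`; it is an illustration only, and the real config class in trinity/common/config.py has more fields.

```python
# Illustration only: how pipeline endpoints are derived from data_processor_url.
# DataProcessorConfigStub is a simplified stand-in for the real config class.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DataProcessorConfigStub:
    data_processor_url: Optional[str] = None  # renamed from data_workflow_url
    task_pipeline: Optional[dict] = None
    experience_pipeline: Optional[dict] = None


def pipeline_endpoints(cfg: DataProcessorConfigStub) -> List[str]:
    """Return the endpoints the launcher would request for the configured pipelines."""
    endpoints = []
    if cfg.data_processor_url and cfg.task_pipeline:
        endpoints.append(f"{cfg.data_processor_url}/task_pipeline")
    if cfg.data_processor_url and cfg.experience_pipeline:
        endpoints.append(f"{cfg.data_processor_url}/experience_pipeline")
    return endpoints


cfg = DataProcessorConfigStub(
    data_processor_url="http://127.0.0.1:5005/data_processor",
    task_pipeline={"input_buffers": [], "output_buffer": {}},
)
print(pipeline_endpoints(cfg))
# ['http://127.0.0.1:5005/data_processor/task_pipeline']
```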

trinity/common/config.py

Lines changed: 1 addition & 1 deletion
@@ -131,7 +131,7 @@ class DataPipelineConfig:
 class DataProcessorConfig:
     """Data-Juicer config"""
 
-    data_workflow_url: Optional[str] = None
+    data_processor_url: Optional[str] = None
 
     # support two types of data pipelines for now
     # 1. For task. Data preprocessing from raw dataset to the task set

trinity/data/readme.md

Lines changed: 2 additions & 2 deletions
@@ -88,14 +88,14 @@ synth_data = synthesizer.process(clean_data)
 - Then you need to prepare the `data_processor` section in the config file (e.g. [test_cfg.yaml](tests/test_configs/active_iterator_test_cfg.yaml))
 - For the `dj_config_path` argument in it, you can either specify a data-juicer config file path (e.g. [test_dj_cfg.yaml](tests/test_configs/active_iterator_test_dj_cfg.yaml)), or write the demand in `dj_process_desc` argument in natural language and our agent will help you to organize the data-juicer config.
 - Finally you can send requests to the data server to start an active iterator to process datasets in many ways:
-  - Request with `curl`: `curl "http://127.0.0.1:5000/data_workflow?configPath=tests%2Ftest_configs%2Factive_iterator_test_cfg.yaml"`
+  - Request with `curl`: `curl "http://127.0.0.1:5005/data_processor/task_pipeline?configPath=tests%2Ftest_configs%2Factive_iterator_test_cfg.yaml"`
   - Request using our simple client:
 
 ```python
 from trinity.cli.client import request
 
 res = request(
-    url="http://127.0.0.1:5005/data_workflow",
+    url="http://127.0.0.1:5005/data_processor/task_pipeline",
     configPath="tests/test_configs/active_iterator_test_cfg.yaml"
 )
 
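The `%2F` sequences in the updated curl example are just URL-encoded slashes in the config path; for reference, they can be produced with the standard library (illustrative snippet):

```python
# Illustration: the configPath query value in the curl example is simply the
# URL-encoded file path; safe="" makes quote() encode the slashes as %2F.
from urllib.parse import quote

config_path = "tests/test_configs/active_iterator_test_cfg.yaml"
encoded = quote(config_path, safe="")
print(encoded)  # tests%2Ftest_configs%2Factive_iterator_test_cfg.yaml
print(f"http://127.0.0.1:5005/data_processor/task_pipeline?configPath={encoded}")
```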

trinity/data/server.py

Lines changed: 2 additions & 2 deletions
@@ -4,11 +4,11 @@
 
 app = Flask(__name__)
 
-APP_NAME = "data_workflow"
+APP_NAME = "data_processor"
 
 
 @app.route(f"/{APP_NAME}/<pipeline_type>", methods=["GET"])
-def data_workflow(pipeline_type):
+def data_processor(pipeline_type):
     from trinity.common.config import load_config
     from trinity.data.controllers.active_iterator import DataActiveIterator
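For orientation, the renamed Flask endpoint dispatches on the `<pipeline_type>` path segment, which is how the launcher's `/task_pipeline` and `/experience_pipeline` URLs reach the same handler. The self-contained sketch below shows that shape; the handler body, response fields, and port are assumptions for illustration, whereas the real handler imports `load_config` and `DataActiveIterator` as shown above.

```python
# Minimal, self-contained sketch of the renamed endpoint's shape; the real
# handler in trinity/data/server.py loads the config and drives the data
# active iterator, which is elided here.
from flask import Flask, jsonify, request

app = Flask(__name__)

APP_NAME = "data_processor"  # previously "data_workflow"


@app.route(f"/{APP_NAME}/<pipeline_type>", methods=["GET"])
def data_processor(pipeline_type):
    # pipeline_type is "task_pipeline" or "experience_pipeline", matching the
    # URLs composed in trinity/cli/launcher.py.
    config_path = request.args.get("configPath")
    if not config_path or pipeline_type not in ("task_pipeline", "experience_pipeline"):
        return jsonify({"return_code": 1, "message": "bad request"}), 400
    # ... load the config and run the corresponding pipeline here ...
    return jsonify({"return_code": 0, "message": f"{pipeline_type} finished"})


if __name__ == "__main__":
    # Port assumed to match the 5005 used in the example URLs.
    app.run(host="127.0.0.1", port=5005)
```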
