Merged

28 commits
e0874db
* reformat existing configs
HYLcool Jun 13, 2025
a1ce540
* after pre-commit
HYLcool Jun 13, 2025
a63b5f4
* modified according to the latest buffer imp.
HYLcool Jun 13, 2025
2c7fd58
* sort by priority before writing to buffer
HYLcool Jun 13, 2025
d60a931
* refine args and messages
HYLcool Jun 13, 2025
086fade
+ Add tests for file buffer
HYLcool Jun 13, 2025
e81ae48
Merge branch 'main' into refactor/data
HYLcool Jun 16, 2025
4ab6676
* merge from main
HYLcool Jun 16, 2025
ec45e8c
* after pre-commit
HYLcool Jun 16, 2025
c15ef5f
* fix task parser
HYLcool Jun 16, 2025
2bdcdea
* update .gitignore
HYLcool Jun 16, 2025
7511c96
* update .gitignore
HYLcool Jun 16, 2025
c93ff10
* fix wrong way to check the types of processors in task parser
HYLcool Jun 17, 2025
66eaece
* update data dependencies
HYLcool Jun 17, 2025
40ee1b6
Merge branch 'main' into refactor/data
HYLcool Jun 17, 2025
55480e3
* create dir first before writing to files
HYLcool Jun 17, 2025
a888035
* create dir first before writing to files
HYLcool Jun 17, 2025
ec7f7c6
+ add missing finish method for file_wrapper
HYLcool Jun 17, 2025
d73f89c
+ add trust_remote_code=True to all load_dataset invoking
HYLcool Jun 18, 2025
df5e4db
* only consider numeric stats
HYLcool Jun 19, 2025
6163ec5
* make task pipeline work
HYLcool Jun 19, 2025
253f2e1
* modify docs for task pipeline
HYLcool Jun 19, 2025
ef781fd
* after pre-commit
HYLcool Jun 19, 2025
1666099
* rename unit test
HYLcool Jun 19, 2025
2baed2e
* modify according to xuchen's comments
HYLcool Jun 20, 2025
8384b9c
Merge branch 'algorithm_dev' into refactor/data
HYLcool Jun 20, 2025
9e2653b
Merge branch 'algorithm_dev' into refactor/data
HYLcool Jun 20, 2025
4fb41a1
fix example config
pan-x-c Jun 20, 2025
1 change: 1 addition & 0 deletions .gitignore
@@ -84,6 +84,7 @@ ENV/
logs/

# data-juicer
tmp/
outputs/
# agentscope
runs/
196 changes: 113 additions & 83 deletions docs/sphinx_doc/source/tutorial/example_data_functionalities.md
@@ -1,80 +1,97 @@
# Data Processing

## Example: reasoning task
## Example: Data Processor for Task Pipeline

In this example, you will learn how to apply the data module of Trinity-RFT to prepare the dataset before exploring and training. This example takes GSM-8K dataset as the example dataset to figure out:
In this example, you will learn how to apply the data processor of Trinity-RFT to prepare and prioritize the dataset before task exploring and training. This example uses the GSM-8K dataset to illustrate:

1. how to prepare the data module
2. how to configure the data module
3. what the data module can do
1. how to prepare the data processor
2. how to configure the data processor
3. what the data processor can do

Before getting started, you need to prepare the main environment of Trinity-RFT according to the [installation section of the README file](../main.md), and you need to install [postgresql](https://www.postgresql.org/docs/current/tutorial-install.html) as well.
Before getting started, you need to prepare the main environment of Trinity-RFT according to the [installation section of the README file](../main.md).

### Data Preparation

#### Prepare the Data Module
#### Prepare the Data Processor

As the overall framework of Trinity-RFT shows, the data module is one of the high-level functions. Trinity-RFT encapsulates the data module as an independent service to avoid dependency conflict issues. Thus you need to prepare a split environment for this module and start the server.
As the overall framework of Trinity-RFT shows, the data processor is one of the high-level functions. Trinity-RFT encapsulates the data processor as an independent service to avoid dependency conflict issues. Thus you need to prepare a split environment for this module and start the server.

```shell
# prepare split environments, including the one of data module
# prepare split environments, including the one of data processor
python scripts/install.py

# start all split servers
python scripts/start_servers.py
```
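
Once the servers are up, you can optionally verify that the data processor service is reachable. This is a minimal sketch, assuming the default URL used in the config below:

```shell
# Print the HTTP status code returned by the data processor endpoint;
# any HTTP response confirms the service is listening.
curl -s -o /dev/null -w "%{http_code}\n" http://127.0.0.1:5005/data_processor
```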

### Configure the Data Module
### Configure the Data Processor

Trinity-RFT uses a unified config file to manage all config items. For the data module, you need to focus on the `data_processor` section in the config file.
Trinity-RFT uses a unified config file to manage all config items. For the data processor, you need to focus on the `data_processor` section in the config file.

In this example, assume that you need to rank all math questions and corresponding answers by their difficulties. So you can set these config items like the following example:

```yaml
data_processor:
# basic info
source_data_path: /PATH/TO/GSM8K/
load_kwargs:
split: 'train' # only need the train split
format: # set the field mappings
prompt_key: 'question'
response_key: 'answer'
# database related. The result dataset will be stored in the database.
db_url: 'postgresql://{user_name}@localhost:5432/{db_name}'
data_processor_url: 'http://127.0.0.1:5005/data_processor'
# task pipeline related
task_pipeline:
# I/O buffers
input_buffers:
- name: 'raw_input'
path: /PATH/TO/GSM8K/
storage_type: 'file'
raw: true
output_buffer:
name: 'raw_output'
path: /PATH/TO/OUTPUT/JSONL/FILE
storage_type: 'file'
# format mapping
format:
prompt_key: 'question'
response_key: 'answer'
```

Here you can set the basic information for the GSM-8K dataset, database information that is used to store the result dataset, and some other items about downstream dataset loading for exploring and training:
Here you can set the input and output buffers for the GSM-8K dataset, as well as some other items about downstream dataset loading for exploring and training:

+ `source_data_path`: the path to the raw dataset.
+ `load_kwargs`: extra config arguments for loading the raw dataset. Mainly for the `load_dataset` method in HuggingFace `datasets` library.
+ `format`: some dataset format config items, which are used to map original data field names to unified ones.
+ `db_url`: the URL of the postgresql database to store the result dataset.
+ `data_processor_url`: the URL of the data processor service, which is started in the previous step.
+ `task_pipeline`: the configs for the task pipeline. Task pipeline is used to process the raw dataset. It consists of several inner configs:
+ `input_buffers`: the input buffers for the task pipeline. We usually load from raw dataset files in this pipeline, thus we need to set the dataset `path`, set the `storage_type` to "file", and set `raw` to true. Multiple input buffers are allowed, and each buffer can be named with the `name` field.
+ `output_buffer`: the output buffer for the task pipeline. We usually store the processed dataset in files as well, thus we need to set the `storage_type` to "file". A sample output record is sketched after this list.
+ `format`: some dataset format config items, which are used to map original data field names to unified ones.
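
For intuition, a single record in the output JSONL file might look like the sketch below. This is illustrative only: the stats field follows Data-Juicer's convention, but the score name `difficulty_score` is an assumption that depends on which operators end up in your recipe.

```json
{
  "question": "Natalia sold clips to 48 of her friends in April...",
  "answer": "Natalia sold 48/2 = 24 clips in May...",
  "__dj__stats__": {
    "difficulty_score": 0.73
  }
}
```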

In addition, there are several config items related to the data active iterator, which is used to prepare a better dataset. The core part of the data active iterator, Data-Juicer, provides tens of operators to help clean or calculate key information for each sample in the dataset. You can configure this part depending on how familiar you are with Data-Juicer.
In addition, there are several config items related to the data active iterator in the `task_pipeline` part, which is used to prepare a better dataset. The core part of the data active iterator, Data-Juicer, provides dozens of operators that help clean the dataset or compute key information for each sample. You can configure this part depending on how familiar you are with Data-Juicer.

#### Not familiar with Data-Juicer
If you are not familiar with Data-Juicer, the data module provides a natural-language-based method to config the data processing recipe. What you need to do is only describe the demands of how you want to prepare for the raw dataset, and an agent will be invoked to arrange the data processing recipe for you. Here is an example:
If you are not familiar with Data-Juicer, the data processor provides a natural-language-based method to configure the data processing recipe. You only need to describe how you want the raw dataset to be prepared, and an agent will be invoked to arrange the data processing recipe for you. Here is an example:

```yaml
data_processor:
# basic info
source_data_path: /PATH/TO/GSM8K/
load_kwargs:
split: 'train' # only need the train split
format: # set the field mappings
prompt_key: 'question'
response_key: 'answer'
# database related. The result dataset will be stored in the database.
db_url: 'postgresql://{user_name}@localhost:5432/{db_name}'

#### new part about data active iterator
dj_process_desc: 'Please compute difficulty scores for these math questions.'
agent_model_name: 'qwen-max'
agent_model_config:
config_name: 'my-qwen-instruction'
model_type: 'dashscope_chat'
model_name: 'qwen2.5-72b-instruct'
clean_strategy: 'iterative'
data_processor_url: 'http://127.0.0.1:5005/data_processor'
# task pipeline related
task_pipeline:
# I/O buffers
input_buffers:
- name: 'raw_input'
path: /PATH/TO/GSM8K/
storage_type: 'file'
raw: true
output_buffer:
name: 'raw_output'
path: /PATH/TO/OUTPUT/JSONL/FILE
storage_type: 'file'
# format mapping
format:
prompt_key: 'question'
response_key: 'answer'

#### new part about data active iterator
dj_process_desc: 'Please compute difficulty scores for these math questions.'
agent_model_name: 'qwen-max'
agent_model_config:
config_name: 'my-qwen-instruction'
model_type: 'dashscope_chat'
model_name: 'qwen2.5-72b-instruct'
clean_strategy: 'iterative'
```

You can write your demand description in the config item `dj_process_desc`, and set the model name and configs used for the agent in the config items `agent_model_name` and `agent_model_config`. Here we use Qwen2.5-72b-Instruct as our recipe-managing agent. You can also set the `clean_strategy` to 'iterative' to get a better dataset.
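
If you are comfortable writing recipes yourself, a minimal Data-Juicer recipe for this scenario might look like the following sketch. The op and its arguments are illustrative assumptions (here, perplexity serves as a rough difficulty proxy); check the operator list of your installed Data-Juicer version for the exact ops available:

```yaml
# Hypothetical recipe sketch: attach a difficulty-style stat to each sample.
project_name: 'gsm8k-difficulty'
np: 4  # number of worker processes

process:
  - perplexity_filter:   # real op; used here only to compute perplexity stats
      lang: 'en'
      max_ppl: 100000    # keep essentially all samples; we only want the stats
```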
@@ -99,19 +116,27 @@ After preparing the Data-Juicer data processing recipe, you can set the `dj_config_path`

```yaml
data_processor:
# basic info
source_data_path: /PATH/TO/GSM8K/
load_kwargs:
split: 'train' # only need the train split
format: # set the field mappings
prompt_key: 'question'
response_key: 'answer'
# database related. The result dataset will be stored in the database.
db_url: 'postgresql://{user_name}@localhost:5432/{db_name}'

#### new part about data active iterator
dj_config_path: '/path/to/the/Data-Juicer/data/processing/recipe/above.yaml'
clean_strategy: 'iterative'
data_processor_url: 'http://127.0.0.1:5005/data_processor'
# task pipeline related
task_pipeline:
# I/O buffers
input_buffers:
- name: 'raw_input'
path: /PATH/TO/GSM8K/
storage_type: 'file'
raw: true
output_buffer:
name: 'raw_output'
path: /PATH/TO/OUTPUT/JSONL/FILE
storage_type: 'file'
# format mapping
format:
prompt_key: 'question'
response_key: 'answer'

#### new part about data active iterator
dj_config_path: '/path/to/the/Data-Juicer/data/processing/recipe/above.yaml'
clean_strategy: 'iterative'
```

And you can set the `clean_strategy` to 'iterative' to get a better dataset.
@@ -123,7 +148,7 @@ All config items in the `data` section can be found [here](trinity_configs.md).


```{note}
Only when one of `dj_process_desc` and `dj_config_path` is provided, the data module and the data active iterator will be activated. Otherwise, this part will be skipped and it will enter into the exploring stage directly.
The data processor and the data active iterator are activated only when one of the `xxx_pipeline` configs is provided and that pipeline config contains either `dj_process_desc` or `dj_config_path`. Otherwise, this part is skipped and the workflow enters the exploring stage directly.
```

### Exploring & Training
@@ -140,49 +165,54 @@ ray start --address=<master_address>
trinity run --config <Trinity-RFT_config_path>
```

If you follow the steps above, Trinity-RFT will send a request to the data module server, the data active iterator will be activated and compute difficulty scores for each sample in the raw dataset. After that, the data module server stores the result dataset into the database, when exploring begins, it will load the prepared dataset and continue the downstream steps.
If you follow the steps above, Trinity-RFT will send a request to the data processor server; the data active iterator will be activated, compute a difficulty score for each sample in the raw dataset, and rank the dataset by those scores. After that, the data processor server stores the result dataset in the output buffer. When exploring begins, it loads the prepared dataset and continues with the downstream steps.
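
To sanity-check the result, you can peek at the first record of the output buffer file. The path below is an assumption matching the example output path used elsewhere in this tutorial:

```shell
# Pretty-print the first prioritized sample, including its computed stats.
head -n 1 ./outputs/task_pipeline_output/prioritized_gsm8k.jsonl | python -m json.tool
```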



## Example: human in the loop
## Example: Human in the Loop
Sometimes, you might need to involve human feedback on some raw data. In this example, you will learn how to annotate raw data to get a better dataset before training. This example uses a sample Q&A dataset and selects the chosen and rejected responses for the DPO method.

Before getting started, you need to prepare the main environment of Trinity-RFT according to the installation section of the README file, install postgresql, and [start a label-studio server](https://github.com/modelscope/data-juicer/tree/main/tools/humanops) from Data-Juicer from source.
Before getting started, you need to prepare the main environment of Trinity-RFT according to the installation section of the README file, and [start a label-studio server](https://github.com/modelscope/data-juicer/tree/main/tools/humanops) from the Data-Juicer source.

### Data Preparation

#### Prepare the Data Module
#### Prepare the Data Processor

As the overall framework of Trinity-RFT shows, the data module is one of the high-level functions. Trinity-RFT encapsulates the data module as an independent service to avoid dependency conflict issues. Thus you need to prepare a split environment for this module and start the server.
As the overall framework of Trinity-RFT shows, the data processor is one of the high-level functions. Trinity-RFT encapsulates the data processor as an independent service to avoid dependency conflict issues. Thus you need to prepare a split environment for this module and start the server.

```shell
# prepare split environments, including the one of data module
# prepare split environments, including the one of data processor
python scripts/install.py

# start all split servers
python scripts/start_servers.py
```

### Configure the Data Module
### Configure the Data Processor

Trinity-RFT uses a unified config file to manage all config items. For the data module, you need to focus on the `data_processor` section in the config file.
Trinity-RFT uses a unified config file to manage all config items. For the data processor, you need to focus on the `data_processor` section in the config file.

In this example, assume that you need to rank all math questions and corresponding answers by their difficulties. So you can set these config items like the following example:
In this example, assume that you need to select the chosen and rejected responses for the DPO method. You can set these config items as in the following example:

```yaml
data_processor:
# basic info
source_data_path: 'tests/test_data/test_human_annotator'
load_kwargs:
split: 'train' # only need the train split
format: # set the field mappings
prompt_key: 'prompt'
chosen_key: 'chosen'
rejected_key: 'rejected'
#### new part about data active iterator
dj_config_path: 'tests/test_configs/human_annotator_test_dj_cfg.yaml'
# database related. The result dataset will be stored in the database.
db_url: 'postgresql://{user_name}@localhost:5432/{db_name}'
data_processor_url: 'http://127.0.0.1:5005/data_processor'
# task pipeline related
task_pipeline:
# I/O buffers
input_buffers:
- name: 'raw_input'
path: 'tests/test_data/test_human_annotator'
storage_type: 'file'
raw: true
output_buffer:
name: 'raw_output'
path: './outputs/task_pipeline_output/prioritized_gsm8k.jsonl'
storage_type: 'file'
format: # set the field mappings
prompt_key: 'prompt'
chosen_key: 'chosen'
rejected_key: 'rejected'
#### new part about data active iterator
dj_config_path: 'tests/test_configs/human_annotator_test_dj_cfg.yaml'
```
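
The `dj_config_path` above points to a Data-Juicer recipe that enables the annotation OP. A hypothetical sketch of such a recipe is shown below; the argument names are assumptions for illustration, so consult the Data-Juicer humanops documentation for the exact fields:

```yaml
# Hypothetical sketch of a human-annotation recipe; field names are
# illustrative assumptions, not the authoritative Data-Juicer schema.
process:
  - human_preference_annotation_mapper:
      api_url: 'http://localhost:7070'            # your label-studio server
      api_key: 'YOUR_LABEL_STUDIO_API_KEY'        # its access token
      project_name: 'Human Preference Annotation' # project shown in label-studio
```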

Here you can set the basic information for the example dataset, the input and output buffers used to store the result dataset, and some other items about downstream dataset loading for exploring and training, similar to the example above.
@@ -223,7 +253,7 @@ You can set more config items for this OP (e.g. notification when annotation is

### Start Running

When you start running with the RFT config, the data module will start the OP `human_preference_annotation_mapper`, and then you can find a new project on the "Projects" page of the label-studio server.
When you start running with the RFT config, the data processor will start the OP `human_preference_annotation_mapper`, and then you can find a new project on the "Projects" page of the label-studio server.

![](../../assets/data-projects.png)

5 changes: 0 additions & 5 deletions environments/data.yaml
@@ -6,10 +6,5 @@ dependencies:
- pip:
- py-data-juicer
- agentscope
- flask
- omegaconf
- sqlalchemy
- psycopg2
- networkx
- transformers
- "-e ..[dev]"
15 changes: 0 additions & 15 deletions examples/grpo_gsm8k/gsm8k.yaml
@@ -4,19 +4,6 @@ checkpoint_root_dir: /PATH/TO/CHECKPOINT/
algorithm:
algorithm_type: grpo
repeat_times: 8
data_processor:
# basic info
source_data_path: 'openai/gsm8k'
# data active iterator related
dj_process_desc: 'Please compute difficulty scores for these math questions.'
agent_model_name: 'qwen-max'
agent_model_config:
config_name: 'my-qwen-instruction'
model_type: 'dashscope_chat'
model_name: 'qwen2.5-72b-instruct'
clean_strategy: 'iterative'
# db related
db_url: ''

model:
model_path: /PATH/TO/MODEL/
@@ -41,9 +28,7 @@ buffer:
prompt_key: 'question'
response_key: 'answer'
rollout_args:
n: 8
temperature: 1.0
logprobs: 0
eval_tasksets:
- name: gsm8k-eval
storage_type: file
7 changes: 7 additions & 0 deletions examples/grpo_gsm8k_task_pipeline/README.md
@@ -0,0 +1,7 @@
# GRPO on GSM8K dataset with Task Pipeline

This example shows the usage of GRPO on the GSM8K dataset, with a task pipeline to prioritize the raw dataset before training.

For more detailed information, please refer to the [documentation](../../docs/sphinx_doc/source/tutorial/example_data_functionalities.md).

The config files are located in [`gsm8k.yaml`](gsm8k.yaml) and [`train_gsm8k.yaml`](train_gsm8k.yaml).
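
A typical launch from the repository root might look like this, assuming the Ray cluster and split servers from the tutorial are already running:

```shell
trinity run --config examples/grpo_gsm8k_task_pipeline/gsm8k.yaml
```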