Commit c85d853

Update docs (#89)
1 parent f24db44 commit c85d853

File tree

9 files changed: +65 additions, -48 deletions


README.md

Lines changed: 2 additions & 2 deletions
@@ -266,7 +266,7 @@ Then, for command-line users, run the RFT process with the following command:
 trinity run --config <config_path>
 ```
 
-> For example, below is the command for fine-tuning Qwen-2.5-1.5B-Instruct on GSM8k dataset using GRPO algorithm:
+> For example, below is the command for fine-tuning Qwen2.5-1.5B-Instruct on GSM8k dataset using GRPO algorithm:
 > ```shell
 > trinity run --config examples/grpo_gsm8k/gsm8k.yaml
 > ```
@@ -279,7 +279,7 @@ For more detailed examples about how to use Trinity-RFT, please refer to the fol
 + [Off-policy mode of RFT](./docs/sphinx_doc/source/tutorial/example_reasoning_advanced.md)
 + [Asynchronous mode of RFT](./docs/sphinx_doc/source/tutorial/example_async_mode.md)
 + [Multi-turn tasks](./docs/sphinx_doc/source/tutorial/example_multi_turn.md)
-+ [Offline learning by DPO](./docs/sphinx_doc/source/tutorial/example_dpo.md)
++ [Offline learning by DPO or SFT](./docs/sphinx_doc/source/tutorial/example_dpo.md)
 + [Advanced data processing / human-in-the-loop](./docs/sphinx_doc/source/tutorial/example_data_functionalities.md)
 
 

docs/sphinx_doc/source/main.md

Lines changed: 10 additions & 12 deletions
@@ -84,15 +84,18 @@ e.g., utilizing NCCL (when feasible) for model weight synchronization, sequence
 
 ## Getting started
 
-
-*Note: this project is currently under active development; comments and suggestions are welcome!*
-
+```{note}
+Note: This project is currently under active development; comments and suggestions are welcome!
+```
 
 
 
 ### Step 1: preparations
 
-
+Trinity-RFT requires
+Python version >= 3.10,
+CUDA version >= 12.4,
+and at least 2 GPUs.
 
 
 Installation from source (recommended):
@@ -146,11 +149,6 @@ docker build -f scripts/docker/Dockerfile -t trinity-rft:latest .
 docker run -it --gpus all --shm-size="64g" --rm -v $PWD:/workspace -v <root_path_of_data_and_checkpoints>:/data trinity-rft:latest
 ```
 
-Trinity-RFT requires
-Python version >= 3.10,
-CUDA version >= 12.4,
-and at least 2 GPUs.
-
 
 ### Step 2: prepare dataset and model
 
@@ -243,15 +241,15 @@ trinity run --config <config_path>
 
 
 
-For example, below is the command for fine-tuning Qwen-2.5-1.5B-Instruct on GSM8k dataset using GRPO algorithm:
+For example, below is the command for fine-tuning Qwen2.5-1.5B-Instruct on GSM8k dataset using GRPO algorithm:
 
 ```shell
 trinity run --config examples/grpo_gsm8k/gsm8k.yaml
 ```
 
 
 
-More example config files can be found in `examples`.
+More example config files can be found in [`examples`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/).
 
 
 
@@ -260,7 +258,7 @@ For more detailed examples about how to use Trinity-RFT, please refer to the fol
 + [Off-policy mode of RFT](tutorial/example_reasoning_advanced.md)
 + [Asynchronous mode of RFT](tutorial/example_async_mode.md)
 + [Multi-turn tasks](tutorial/example_multi_turn.md)
-+ [Offline learning by DPO](tutorial/example_dpo.md)
++ [Offline learning by DPO or SFT](tutorial/example_dpo.md)
 + [Advanced data processing / human-in-the-loop](tutorial/example_data_functionalities.md)
 
 
docs/sphinx_doc/source/tutorial/example_async_mode.md

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 # Asynchronous RFT
 
-This example shows how to run RFT in a fully asynchronous mode with the GRPO algorithm, Qwen-2.5-1.5B-Instruct model and GSM8K dataset.
+This example shows how to run RFT in a fully asynchronous mode with the GRPO algorithm, Qwen2.5-1.5B-Instruct model and GSM8K dataset.
 
 Trinity-RFT supports an asynchronous mode by running the trainer and explorer in separate processes.
 
docs/sphinx_doc/source/tutorial/example_data_functionalities.md

Lines changed: 6 additions & 6 deletions
@@ -26,14 +26,14 @@ python scripts/start_servers.py
 
 ### Configure the Data Module
 
-Trinity-RFT uses a unified config file to manage all config items. For the data module, you need to focus on the `data` section in the config file.
+Trinity-RFT uses a unified config file to manage all config items. For the data module, you need to focus on the `data_processor` section in the config file.
 
 In this example, assume that you need to rank all math questions and corresponding answers by their difficulties. So you can set these config items like the following example:
 
 ```yaml
 data_processor:
   # basic info
-  source_data_path: '/path/to/gsm8k'
+  source_data_path: /PATH/TO/GSM8K/
   load_kwargs:
     split: 'train' # only need the train split
   format: # set the field mappings
@@ -58,7 +58,7 @@ If you are not familiar with Data-Juicer, the data module provides a natural-lan
 ```yaml
 data_processor:
   # basic info
-  source_data_path: '/path/to/gsm8k'
+  source_data_path: /PATH/TO/GSM8K/
   load_kwargs:
     split: 'train' # only need the train split
   format: # set the field mappings
@@ -100,7 +100,7 @@ After preparing the Data-Juicer data processing recipe, you can set the `dj_conf
 ```yaml
 data_processor:
   # basic info
-  source_data_path: '/path/to/gsm8k'
+  source_data_path: /PATH/TO/GSM8K/
   load_kwargs:
     split: 'train' # only need the train split
   format: # set the field mappings
@@ -165,7 +165,7 @@ python scripts/start_servers.py
 
 ### Configure the Data Module
 
-Trinity-RFT uses a unified config file to manage all config items. For the data module, you need to focus on the `data` section in the config file.
+Trinity-RFT uses a unified config file to manage all config items. For the data module, you need to focus on the `data_processor` section in the config file.
 
 In this example, assume that you need to rank all math questions and corresponding answers by their difficulties. So you can set these config items like the following example:
 
@@ -187,7 +187,7 @@ data_processor:
 
 Here you can set the basic information for the example dataset, database information that is used to store the result dataset, and some other items about downstream dataset loading for exploring and training, which is similar to the example above.
 
-For this example, we assume that you are somehow familiar with the basic usage of Data-Juicer, so we need to prepare a Data-Juicer data processing recipe in `tests/test_configs/human_annotator_test_dj_cfg.yaml` that includes an OP of `human_preference_annotation_mapper`. For example:
+For this example, we assume that you are somehow familiar with the basic usage of Data-Juicer, so we need to prepare a Data-Juicer data processing recipe in [`tests/test_configs/human_annotator_test_dj_cfg.yaml`](https://github.com/modelscope/Trinity-RFT/blob/main/tests/test_configs/human_annotator_test_dj_cfg.yaml) that includes an OP of `human_preference_annotation_mapper`. For example:
 
 ```yaml
 project_name: 'demo-human-annotator'

docs/sphinx_doc/source/tutorial/example_dpo.md

Lines changed: 2 additions & 2 deletions
@@ -1,12 +1,12 @@
 # Offline DPO and SFT
 
-This example describes DPO and SFT based on the Qwen-2.5-1.5B-Instruct model.
+This example describes DPO and SFT based on the Qwen2.5-1.5B-Instruct model.
 
 ## Step 1: Model and Data Preparation
 
 ### Model Preparation
 
-Download the Qwen-2.5-1.5B-Instruct model to the local directory `$MODEL_PATH/Qwen2.5-1.5B-Instruct`:
+Download the Qwen2.5-1.5B-Instruct model to the local directory `$MODEL_PATH/Qwen2.5-1.5B-Instruct`:
 
 ```shell
 # Using Modelscope

docs/sphinx_doc/source/tutorial/example_mix_algo.md

Lines changed: 13 additions & 1 deletion
@@ -25,9 +25,15 @@ The first term corresponds to the standard GRPO objective, which aims to maximiz
 We prompt a powerful LLM to generate responses with the CoT process for some pre-defined questions. The collected data are viewed as some experiences from an expert. We store them in a `jsonl` file `expert_data.jsonl` with the following format:
 
 ```json
-{"question": "What is the average of 4, 6, and 8?","response": "I add the numbers together and divide by the count: 4 + 6 + 8 = 18, divided by 3 gives 6. The answer is 6."}
+{
+  "messages": [
+    { "role": "system", "content": <system_prompt> },
+    { "role": "user", "content": "What is the sum of 4 and 12?" },
+    { "role": "assistant", "content": "<think>thinking process...</think>\n<answer>16</answer>" } ]
+},
 ...
 ```
+The path to expert data is passed to `buffer.trainer_input.sft_warmup_dataset` for later use.
 
 
 ## Step 1: Define the Algorithm
@@ -296,3 +302,9 @@ algorithm:
   read_batch_size_expert: 64
   read_batch_size_usual: 192
 ```
+
+With the above configurations, the experiment can be run with the following command:
+
+```bash
+trinity run --config examples/mix_math/mix_math.yaml
+```
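For reference, a minimal sketch of wiring the expert data file into the unified config is shown below; only the key path `buffer.trainer_input.sft_warmup_dataset` is confirmed by the text above, and the remaining keys are assumed by analogy with the task dataset settings in the configuration guide:

```yaml
# Sketch only: nested keys and values below are assumed, not taken from this commit.
buffer:
  trainer_input:
    sft_warmup_dataset:
      storage_type: file                # assumed, as for task datasets
      path: /PATH/TO/EXPERT_DATA_DIR/   # directory containing expert_data.jsonl
```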

docs/sphinx_doc/source/tutorial/example_multi_turn.md

Lines changed: 5 additions & 5 deletions
@@ -15,8 +15,8 @@ To run the ALFworld and WebShop env, you need to setup the corresponding environ
 - WebShop is a simulated online shopping environment where AI agents learn to shop based on user requirements. The platform allows agents to browse products, compare options, and make purchase decisions, mimicking real-world e-commerce interactions.
 
 You may refer to their original environment to complete the setup.
-- For ALFworld, refer to: https://github.com/alfworld/alfworld
-- For WebShop, refer to: https://github.com/princeton-nlp/WebShop
+- For ALFWorld, refer to the [ALFWorld](https://github.com/alfworld/alfworld) repository.
+- For WebShop, refer to the [WebShop](https://github.com/princeton-nlp/WebShop) repository.
 
 ### Data Preparation
 Our dataset follows the format in Huggingface datasets library, so we should correspondingly convert our env dataset.
@@ -36,7 +36,7 @@ The task is described as an environment instead of a single prompt.
 
 ## Step 2: Config preparation and run the experiment
 
-You can refer to `example_reasoning_basic` to setup the config and others. The default config files are [`alfworld.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_alfworld/alfworld.yaml) and [`webshop.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_webshop/webshop.yaml), respectively.
+You can refer to [Quick Start](./example_reasoning_basic.md) to setup the config and others. The default config files are [`alfworld.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_alfworld/alfworld.yaml) and [`webshop.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_webshop/webshop.yaml), respectively.
 You may revise the configurations properly and run the experiment!
 
 ```bash
@@ -104,7 +104,7 @@ class AlfworldWorkflow(MultiTurnWorkflow):
 ...
 ```
 
-and include them in the init files in `trinity/common/workflows/__init__.py`
+and include it in the init file `trinity/common/workflows/__init__.py`
 
 ```diff
 # -*- coding: utf-8 -*-
@@ -120,7 +120,7 @@ and include them in the init files in `trinity/common/workflows/__init__.py`
 ]
 ```
 
-Then you are all set! It should be pretty simple😄, and both environments converge.
+Then you are all set! It should be pretty simple😄, and the training processes in both environments converge.
 
 ![](../../assets/alfworld_reward_curve.png)
 ![](../../assets/webshop_reward_curve.png)

docs/sphinx_doc/source/tutorial/example_reasoning_basic.md

Lines changed: 7 additions & 1 deletion
@@ -37,6 +37,12 @@ pip install flash-attn -v
 # pip install flash-attn -v --no-build-isolation
 ```
 
+Installation using pip:
+
+```shell
+pip install trinity-rft
+```
+
 Installation from docker:
 
 We provided a dockerfile for Trinity-RFT.
@@ -60,7 +66,7 @@ docker run -it --gpus all --shm-size="64g" --rm -v $PWD:/workspace -v <root_path
 
 **Model Preparation.**
 
-Download the Qwen-2.5-1.5B-Instruct model to the local directory `$MODEL_PATH/Qwen2.5-1.5B-Instruct`:
+Download the Qwen2.5-1.5B-Instruct model to the local directory `$MODEL_PATH/Qwen2.5-1.5B-Instruct`:
 
 ```bash
 # Using Modelscope

docs/sphinx_doc/source/tutorial/trinity_configs.md

Lines changed: 19 additions & 18 deletions
@@ -8,7 +8,7 @@ The configuration for **Trinity-RFT** is defined in a `YAML` file and organized
 
 ```yaml
 project: Trinity-RFT
-name: tutorial
+name: example
 mode: both
 checkpoint_root_dir: /PATH/TO/CHECKPOINT
 
@@ -78,7 +78,7 @@ Specifies the algorithm type and its related hyperparameters.
 ```yaml
 algorithm:
   algorithm_type: grpo
-  repeat_times: 1
+  repeat_times: 8
 
   # The following parameters are optional
   # If not specified, they will automatically be set based on the `algorithm_type`
@@ -89,12 +89,11 @@
   entropy_loss_fn: "default"
 ```
 
-- `algorithm_type`: Type of reinforcement learning algorithm. Supported types: `ppo`, `grpo`, `opmd`, `dpo`.
-- `repeat_times`: Number of times each task is repeated. Default is `1`. In `dpo`, this is automatically set to `2`.
-
+- `algorithm_type`: Type of reinforcement learning algorithm. Supported types: `ppo`, `grpo`, `opmd`, `dpo`, `sft`, `mix`.
+- `repeat_times`: Number of times each task is repeated. Default is `1`. In `dpo`, this is automatically set to `2`. Some algorithms such as GRPO and OPMD require `repeat_times` > 1.
 - `sample_strategy`: The sampling strategy used for loading experiences from experience buffer.
 - `advantage_fn`: The advantage function used for computing advantages.
-- `kl_penalty_fn`: The KL penalty function used for computing KL penalty.
+- `kl_penalty_fn`: The KL penalty function used for computing KL penalty applied in reward.
 - `kl_loss_fn`: The KL loss function used for computing KL loss.
 - `entropy_loss_fn`: The entropy loss function used for computing entropy loss.
 
@@ -111,8 +110,8 @@ monitor:
 ```
 
 - `monitor_type`: Type of monitoring system. Options:
-  - `wandb`: Logs to Weights & Biases. Requires logging in and setting `WANDB_API_KEY`. Project and run names match the `project` and `name` fields in global configs.
-  - `tensorboard`: Logs to TensorBoard. Files are saved under `<checkpoint_root_dir>/<project>/<name>/monitor/tensorboard`.
+  - `wandb`: Logs to [Weights & Biases](https://docs.wandb.ai/quickstart/). Requires logging in and setting `WANDB_API_KEY`. Project and run names match the `project` and `name` fields in global configs.
+  - `tensorboard`: Logs to [TensorBoard](https://www.tensorflow.org/tensorboard). Files are saved under `<checkpoint_root_dir>/<project>/<name>/monitor/tensorboard`.
 
 ---
 
@@ -122,13 +121,13 @@ Defines the model paths and token limits.
 
 ```yaml
 model:
-  model_path: '/PATH/TO/MODEL/CHECKPOINT/'
+  model_path: /PATH/TO/MODEL/
   critic_model_path: ''
   max_prompt_tokens: 4096
   max_response_tokens: 16384
 ```
 
-- `model_path`: Path to the model checkpoint being trained.
+- `model_path`: Path to the model being trained.
 - `critic_model_path`: Optional path to a separate critic model. If empty, defaults to `model_path`.
 - `max_prompt_tokens`: Maximum number of tokens allowed in input prompts.
 - `max_response_tokens`: Maximum number of tokens allowed in generated responses.
@@ -175,8 +174,8 @@ buffer:
   default_reward_fn_type: 'countdown_reward'
 ```
 
-- `batch_size`: Number of samples used per training step. *Please do not multiply this value by the `algorithm.repeat_times` manually*.
-- `total_epochs`: Total number of training epochs. Not applicable for streaming datasets (e.g., queue-based buffers).
+- `batch_size`: Number of tasks used per training step. *Please do not multiply this value by the `algorithm.repeat_times` manually*.
+- `total_epochs`: Total number of training epochs.
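As an illustration of how these two settings interact (values assumed, not taken from this commit): with the sketch below, each step explores 32 tasks, each task is rolled out 8 times, and the trainer therefore consumes 32 × 8 = 256 experiences per step, while `batch_size` itself stays at 32.

```yaml
# Illustrative values only.
algorithm:
  algorithm_type: grpo
  repeat_times: 8   # each task is rolled out 8 times
buffer:
  batch_size: 32    # tasks per step; do not multiply by repeat_times
```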
 
 ### Explorer Input
 
@@ -227,6 +226,8 @@ The configuration for each task dataset is defined as follows:
   - For `file` storage type, the path is the path to the directory that contains the task dataset files.
   - For `queue` storage type, the path is optional. You can back up the data in the queue by specifying a sqlite database path here.
   - For `sql` storage type, the path is the path to the sqlite database file.
+- `subset_name`: The subset name of the task dataset. Default is `None`.
+- `split`: The split of the task dataset. Default is `train`.
 - `format`: Defines keys for prompts and responses in the dataset.
   - `prompt_key`: Specifies which column in the dataset contains the prompt data.
   - `response_key`: Specifies which column in the dataset contains the response data.
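For illustration, a task dataset entry using the newly documented `subset_name` and `split` fields might look like the following sketch (key nesting and field values assumed, not taken from this commit):

```yaml
# Sketch only: assumed layout of a task dataset entry.
buffer:
  explorer_input:
    taskset:
      storage_type: file
      path: /PATH/TO/GSM8K/
      subset_name: main        # optional; defaults to None
      split: train             # defaults to 'train'
      format:
        prompt_key: question   # assumed column names
        response_key: answer
```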
@@ -302,9 +303,9 @@ synchronizer:
 ```
 
 - `sync_method`: Method of synchronization. Options:
-  - `nccl`: Uses NCCL for fast synchronization.
-  - `checkpoint`: Loads latest model from disk.
-- `sync_interval`: Interval (in steps) between synchronizations.
+  - `nccl`: Uses NCCL for fast synchronization. Supported for `both` mode.
+  - `checkpoint`: Loads latest model from disk. Supported for `train`, `explore`, or `bench` mode.
+- `sync_interval`: Interval (in steps) of model weight synchronization between trainer and explorer.
 - `sync_timeout`: Timeout duration for synchronization.
 
 ---
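A minimal synchronizer sketch consistent with the options above (values are illustrative, not taken from this commit):

```yaml
synchronizer:
  sync_method: nccl    # requires mode: both
  sync_interval: 10    # sync model weights between trainer and explorer every 10 steps (assumed value)
  sync_timeout: 1200   # assumed timeout in seconds
```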
@@ -324,7 +325,7 @@ trainer:
 - `trainer_type`: Trainer backend implementation. Currently only supports `verl`.
 - `save_interval`: Frequency (in steps) at which to save model checkpoints.
 - `trainer_config_path`: The path to the trainer configuration file.
-- `train_config`: The configuration of the trainer. Only one needs to be set for `trainer.trainer_config` and `trainer.trainer_config_path`
+- `trainer_config`: The trainer configuration provided inline. Only one of `trainer_config_path` and `trainer_config` should be specified.
 
 ---
 
@@ -334,7 +335,7 @@ Configures preprocessing and data cleaning pipelines.
 
 ```yaml
 data_processor:
-  source_data_path: '/PATH/TO/DATASET'
+  source_data_path: /PATH/TO/DATASET
   load_kwargs:
     split: 'train'
   format:
@@ -345,7 +346,7 @@ data_processor:
   db_url: 'postgresql://{username}@localhost:5432/{db_name}'
 ```
 
-- `source_data_path`: Path to the raw dataset.
+- `source_data_path`: Path to the task dataset.
 - `load_kwargs`: Arguments passed to HuggingFace’s `load_dataset()`.
 - `dj_config_path`: Path to Data-Juicer configuration for cleaning.
 - `clean_strategy`: Strategy for iterative data cleaning.
