
Commit a7b1992

Merge branch 'main' into feat/training_service
# Conflicts:
#   trinity/cli/client.py

2 parents c80d093 + e4a356b

40 files changed: +415 / -76 lines

README.md

Lines changed: 43 additions & 32 deletions
@@ -3,42 +3,56 @@
 <!-- ![trinity-rft](./docs/sphinx_doc/assets/trinity-title.png) -->

 <div align="center">
-<img src="./docs/sphinx_doc/assets/trinity-title.png" alt="Trinity-RFT">
+<img src="./docs/sphinx_doc/assets/trinity-title.png" alt="Trinity-RFT" style="height: 100px;">
 </div>

+&nbsp;

-Trinity-RFT is a general-purpose, flexible and scalable framework designed for reinforcement fine-tuning (RFT) of large language models (LLM).
-Built with a decoupled architecture, seamless integration for agentic workflows, and systematic data processing pipelines, Trinity-RFT can be easily adapted for diverse application scenarios, and serve as a platform for exploring advanced reinforcement learning (RL) paradigms.
+**Trinity-RFT is a general-purpose, flexible and scalable framework designed for reinforcement fine-tuning (RFT) of large language models (LLM).**

+Built with a decoupled design, seamless integration for agentic workflows, and systematic data processing pipelines, Trinity-RFT can be easily adapted for diverse application scenarios, and serve as a platform for exploring advanced reinforcement learning (RL) paradigms.

-**Vision of this project:**

-Current RFT approaches, such as RLHF (Reinforcement Learning from Human Feedback) with proxy reward models or training long-CoT reasoning LLMs with rule-based rewards, are limited in their ability to handle dynamic, real-world learning.
-Trinity-RFT envisions a future where AI agents learn by interacting directly with environments, collecting delayed or complex reward signals, and continuously refining their behavior through advanced RL paradigms.
-For example, imagine an AI scientist that designs an experiment, executes it via interacting with the environment, waits for feedback (while working on some other tasks concurrently), and iteratively updates itself based on true environmental rewards when the experiment is finally finished.
+## Vision of this project

+Current RFT approaches, such as RLHF (Reinforcement Learning from Human Feedback) with proxy reward models or training long-CoT reasoning models with rule-based rewards, are limited in their ability to handle dynamic, real-world learning.

+Trinity-RFT envisions a future where AI agents learn by interacting directly with environments, collecting delayed or complex reward signals, and continuously refining their behavior through RL.

+For example, imagine an AI scientist that designs an experiment, executes it, waits for feedback (while working on other tasks concurrently), and iteratively updates itself based on true environmental rewards when the experiment is finally finished.

 Trinity-RFT offers a path into this future by addressing critical gaps in existing solutions.

-**Key features of Trinity-RFT:**
+## Key features

 + **Unified RFT modes & algorithm support.**
-Trinity-RFT unifies and generalizes existing RFT methodologies into a flexible and configurable framework, supporting synchronous/asynchronous and on-policy/off-policy/offline training, as well as hybrid modes that combine the above seamlessly into a single learning process (e.g., incorporating expert trajectories or high-quality SFT data to accelerate an online RL process).
+Trinity-RFT unifies and generalizes existing RFT methodologies into a flexible and configurable framework, supporting synchronous/asynchronous and on-policy/off-policy/offline training, as well as hybrid modes that combine them seamlessly into a single learning process.

 + **Agent-environment interaction as a first-class citizen.**
-Trinity-RFT natively models the challenges of RFT with real-world agent-environment interactions. It allows delayed rewards in multi-step and/or time-lagged feedback loops, handles long-tailed latencies and environment/agent failures gracefully, and supports distributed deployment where explorers (i.e., the rollout agents) and trainers (i.e., the policy model trained by RL) can operate across separate clusters or devices (e.g., explorers on edge devices, trainers in cloud clusters) and scale up independently.
+Trinity-RFT allows delayed rewards in multi-step/time-lagged feedback loops, handles long-tailed latencies and environment/agent failures gracefully, and supports distributed deployment where explorers and trainers can operate across separate devices and scale up independently.

 + **Data processing pipelines optimized for RFT with diverse/messy data.**
-These include converting raw datasets to prompt/task sets for RL, cleaning/filtering/prioritizing experiences stored in the replay buffer, synthesizing data for tasks and experiences, offering user interfaces for RFT with human in the loop, managing the task and experience buffers (e.g., supporting collection of lagged reward signals), among others.
+These include converting raw datasets to prompt/task sets for RL, cleaning/filtering/prioritizing experiences stored in the replay buffer, synthesizing data for tasks and experiences, offering user interfaces for human in the loop, etc.
+<!-- managing the task and experience buffers (e.g., supporting collection of lagged reward signals) -->
@@ -59,40 +73,40 @@ These include converting raw datasets to prompt/task sets for RL, cleaning/filte
 The overall design of Trinity-RFT exhibits a trinity:
 + RFT-core;
 + agent-environment interaction;
-+ data processing pipelines tailored to RFT.
++ data processing pipelines tailored to RFT;

-In particular, the design of RFT-core also exhibits a trinity:
+and the design of RFT-core also exhibits a trinity:
 + explorer;
 + trainer;
 + manager & buffer.

-The explorer, powered by the rollout model, interacts with the environment and generates rollout trajectories to be stored in the experience buffer.
-The trainer, powered by the policy model, samples batches of experiences from the buffer and updates the policy via RL algorithms.
-These two can be completely decoupled and act asynchronously, except that they share the same experience buffer, and their model weights are synchronized once in a while (according to a schedule specified by user configurations).
+The *explorer*, powered by the rollout model, interacts with the environment and generates rollout trajectories to be stored in the experience buffer.

+The *trainer*, powered by the policy model, samples batches of experiences from the buffer and updates the policy via RL algorithms.

-Such a decoupled design is crucial for making the aforementioned features of Trinity-RFT possible,
-e.g., flexible and configurable RFT modes (on-policy/off-policy, synchronous/asynchronous, immediate/lagged rewards),
+These two can be completely decoupled and act asynchronously, except that they share the same experience buffer, and their model weights are synchronized once in a while.
+Such a decoupled design is crucial for making the aforementioned features of Trinity-RFT possible.

+<!-- e.g., flexible and configurable RFT modes (on-policy/off-policy, synchronous/asynchronous, immediate/lagged rewards),
 fault tolerance for failures of explorer (agent/environment) or trainer,
 high efficiency in the presence of long-tailed rollout latencies,
 data processing pipelines and human in the loop of RFT (e.g., via acting on the experience buffer, which is implemented as a persistent database),
-among others.
+among others. -->

 Meanwhile, Trinity-RFT has done the dirty work for ensuring high efficiency in every component of the framework,
-e.g., utilizing NCCL (when feasible) for model weight synchronization, sequence concatenation with proper masking for multi-turn conversations and ReAct workflows, pipeline parallelism for the synchronous RFT mode, among many others.
+e.g., utilizing NCCL (when feasible) for model weight synchronization, sequence concatenation with proper masking for multi-turn conversations and ReAct-style workflows, pipeline parallelism for the synchronous RFT mode, among many others.

 ## Getting started

-*Note: this project is currently under active development; comments and suggestions are welcome!*
+> [!NOTE]
+> This project is currently under active development. Comments and suggestions are welcome!
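The explorer/trainer/buffer trinity described above maps onto top-level sections of the main config file. Below is a minimal, illustrative sketch only: the section and field names are taken from config excerpts quoted elsewhere on this page, while the values and exact schema are assumptions.

```yaml
# Illustrative sketch only -- values and exact schema are assumptions.
explorer:                 # rollout side
  repeat_times: 8         # hypothetical value: rollouts generated per task
trainer:                  # policy-update side
  trainer_type: 'verl'
  algorithm_type: ppo
synchronizer:             # how explorer and trainer model weights are kept in sync
  sync_method: 'offline'
buffer:
  # task/experience dataset settings shared by explorer and trainer
```

The `buffer`, `synchronizer`, and `trainer` sections appear in more detail in the GSM8k, DPO, and `trinity_configs.md` excerpts further down this page.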

@@ -186,20 +200,17 @@ For more details about dataset downloading, please refer to [Huggingface](https:
 ### Step 3: configurations

-You may customize the configurations in `scripts/config/{config_name}.yaml`and `scripts/config/{train_config_name}.yaml`. For example, the model and dataset are specified as:
+You may customize the configurations in [`examples`](examples/). For example, the model and dataset are specified as:

 ```yaml
 model:
   model_path: $MODEL_PATH/{model_name}

 data:
   dataset_path: $DATASET_PATH/{dataset_name}
-
-trainer:
-  trainer_config_path: scripts/config/{train_config_name}.yaml
 ```

-You may use the default configurations located in the directory `scripts/config`. Please refer to `examples` for more details.
+Please refer to [`examples`](examples/) for more details.
@@ -218,7 +229,7 @@ ray start --address=<master_address>
-Optionally, we can login into wandb to monitor the RFT process. More details of wandb can be found in its [docs](https://docs.wandb.ai/quickstart/).
+Optionally, we can log in to [wandb](https://docs.wandb.ai/quickstart/) to better monitor the RFT process:

 ```shell
 export WANDB_API_KEY=<your_api_key>
@@ -238,16 +249,16 @@ trinity run --config <config_path>
 For example, below is the command for fine-tuning Qwen-2.5-1B-Instruct on GSM8k dataset using GRPO algorithm:

 ```shell
-trinity run --config scripts/config/gsm8k.yaml
+trinity run --config examples/grpo_gsm8k/gsm8k.yaml
 ```

-More example config files can be found in `scripts/config`.
+More example config files can be found in `examples`.

-For more detailed examples about how to use Trinity-RFT, please refer to the following documents:
+For more detailed examples about how to use Trinity-RFT, please refer to the following tutorials:
 + [A quick example with GSM8k](./docs/sphinx_doc/source/tutorial/example_reasoning_basic.md);
 + [Off-policy / asynchronous modes of RFT](./docs/sphinx_doc/source/tutorial/example_reasoning_advanced.md);
 + [Multi-turn tasks](./docs/sphinx_doc/source/tutorial/example_multi_turn.md);
Two additional files (557 KB and 239 KB, likely binary assets) also changed in this commit; their previews are not rendered here.

docs/sphinx_doc/source/main.md

Lines changed: 4 additions & 7 deletions
@@ -180,20 +180,17 @@ For more details about dataset downloading, please refer to [Huggingface](https:
 ### Step 3: configurations

-You may customize the configurations in `scripts/config/{config_name}.yaml`and `scripts/config/{train_config_name}.yaml`. For example, the model and dataset are specified as:
+You may customize the configurations in [`examples`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/). For example, the model and dataset are specified as:

 ```yaml
 model:
   model_path: $MODEL_PATH/{model_name}

 data:
   dataset_path: $DATASET_PATH/{dataset_name}
-
-trainer:
-  trainer_config_path: scripts/config/{train_config_name}.yaml
 ```

-You may use the default configurations located in the directory `scripts/config`. Please refer to `examples` for more details.
+Please refer to [`examples`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/) for more details.

@@ -232,12 +229,12 @@ trinity run --config <config_path>
 For example, below is the command for fine-tuning Qwen-2.5-1B-Instruct on GSM8k dataset using GRPO algorithm:

 ```shell
-trinity run --config scripts/config/gsm8k.yaml
+trinity run --config examples/grpo_gsm8k/gsm8k.yaml
 ```

-More example config files can be found in `scripts/config`.
+More example config files can be found in `examples`.
docs/sphinx_doc/source/tutorial/example_data_functionalities.md

Lines changed: 4 additions & 3 deletions
@@ -133,12 +133,13 @@ And you can set the `clean_strategy` to 'iterative' to get a better dataset.

-All config items in the `data` section can be found [here](trinity_configs.md). A prepared config file for this example of GSM-8K can be found in [the config file of gsm8k](../../../../scripts/config/gsm8k.yaml).
+All config items in the `data` section can be found [here](trinity_configs.md). A prepared config file for this example of GSM-8K can be found in [the config file of gsm8k](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_gsm8k/gsm8k.yaml).

-> [!NOTE]
-> Only when one of `dj_process_desc` and `dj_config_path` is provided, the data module and the data active iterator will be activated. Otherwise, this part will be skipped and it will enter into the exploring stage directly.
+```{note}
+Only when one of `dj_process_desc` and `dj_config_path` is provided, the data module and the data active iterator will be activated. Otherwise, this part will be skipped and it will enter into the exploring stage directly.
+```
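The note names `dj_process_desc` and `dj_config_path`, but the changed lines never show them in place. A hedged sketch of how they might sit in the `data` section (the field names come from this page; the nesting and values are assumptions):

```yaml
# Illustrative sketch only; exact nesting and values may differ.
data:
  dataset_path: $DATASET_PATH/{dataset_name}
  dj_config_path: <path_to_data_juicer_config>                 # provide this ...
  # dj_process_desc: <description of the desired processing>   # ... or this
  clean_strategy: iterative                                    # mentioned earlier in this tutorial
```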
 ### Exploring & Training
 After preparing the config files of Trinity-RFT, you can start your ray cluster and run the RFT process including the data active iterator part with the following commands:

docs/sphinx_doc/source/tutorial/example_dpo.md

Lines changed: 4 additions & 4 deletions
@@ -38,12 +38,12 @@ Note that the dataset has the keys `prompt`, `chosen` and `rejected`. If not, pa

 ### Configuration

-We use the configurations in `scripts/config/dpo.yaml`and `scripts/config/train_dpo.yaml` for this experiment. Some important setups are listed in the following:
+We use the configurations in [`dpo.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/dpo_humanlike/dpo.yaml) and [`train_dpo.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/dpo_humanlike/train_dpo.yaml) for this experiment. Some important setups are listed in the following:

 We run the experiment in train mode, as there is no Explorer. To enable this mode, we set `mode` to `train` and `sync_method` to `offline`. The value of `sync_iteration_interval` can be set to the same value as `save_freq`.

 ```yaml
-# scripts/config/dpo.yaml
+# In dpo.yaml
 mode: train
 synchronizer:
   sync_method: 'offline'
@@ -60,7 +60,7 @@ buffer:
 trainer:
   algorithm_type: dpo

-# scripts/config/train_dpo.yaml
+# In train_dpo.yaml
 actor_rollout_ref:
   actor:
     alg_type: dpo
@@ -73,5 +73,5 @@ actor_rollout_ref:
 Run the RFT process with the following command:

 ```shell
-trinity run --config scripts/config/dpo.yaml
+trinity run --config examples/dpo_humanlike/dpo.yaml
 ```
docs/sphinx_doc/source/tutorial/example_multi_turn.md

Lines changed: 3 additions & 3 deletions
@@ -36,15 +36,15 @@ The task is described as an environment instead of a single prompt.

 ## Step 2: Config preparation and run the experiment

-You can refer to `example_reasoning_basic` to setup the config and others. The default config files are `scripts/config/alfworld.yaml` and `scripts/config/webshop.yaml`, respectively.
+You can refer to `example_reasoning_basic` to setup the config and others. The default config files are [`alfworld.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_alfworld/alfworld.yaml) and [`webshop.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_webshop/webshop.yaml), respectively.
 You may revise the configurations properly and run the experiment!

 ```bash
 # For ALFworld env
-trinity run --config scripts/config/alfworld.yaml
+trinity run --config examples/grpo_alfworld/alfworld.yaml

 # For WebShop env
-trinity run --config scripts/config/webshop.yaml
+trinity run --config examples/grpo_webshop/webshop.yaml
 ```

 ## Advance: How to build your own environment
docs/sphinx_doc/source/tutorial/example_reasoning_advanced.md

Lines changed: 3 additions & 3 deletions
@@ -17,11 +17,11 @@ The algorithm design and analysis can be found in this [technical report](../../

 To try out the OPMD algorithm:
 ```shell
-trinity run --config scripts/config/gsm8k_opmd.yaml
+trinity run --config examples/opmd_gsm8k/opmd_gsm8k.yaml
 ```

 Note that in this config file, `sync_iteration_interval` is set to 10, i.e., the model weights of explorer and trainer are synchronized only once every 10 training steps, which leads to a challenging off-policy scenario (potentially with abrupt distribution shift during the RFT process).
-Other configurations of particular interest are explained at the beginning of `scripts/config/train_gsm8k_opmd.yaml`.
+Other configurations of particular interest are explained at the beginning of [`train_opmd_gsm8k.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/opmd_gsm8k/train_opmd_gsm8k.yaml).
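As a concrete illustration of the off-policy setting mentioned above, a minimal sketch (the field name is taken from this page; its exact position in the config file is an assumption):

```yaml
# Illustrative sketch only; the exact nesting of this field may differ.
synchronizer:
  sync_iteration_interval: 10   # sync explorer/trainer weights once every 10 training steps
```
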
@@ -48,4 +48,4 @@ To run this mode, the explorer and trainer need to be launched separately, with
-We are still testing this mode more thoroughly. A concrete example is coming soon!
+*We are still testing this mode more thoroughly. A concrete example is coming soon!*

docs/sphinx_doc/source/tutorial/example_reasoning_basic.md

Lines changed: 7 additions & 7 deletions
@@ -48,15 +48,15 @@ synchronizer:

 ### Use GRPO or PPO Algorithm

-We use the configurations in `scripts/config/gsm8k.yaml`and `scripts/config/train_gsm8k.yaml` for this experiment. Some important setups are listed in the following:
+We use the configurations in [`gsm8k.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_gsm8k/gsm8k.yaml) and [`train_gsm8k.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_gsm8k/train_gsm8k.yaml) for this experiment. Some important setups are listed in the following:

 ```yaml
-# scripts/config/gsm8k.yaml
+# In gsm8k.yaml
 explorer:
   repeat_times: {number of rollouts for each task}

-# scripts/config/train_gsm8k.yaml
+# In train_gsm8k.yaml
 actor_rollout_ref:
   actor:
     use_kl_loss: True (for GRPO) / False (for PPO)
@@ -69,7 +69,7 @@ algorithm:

 Run the RFT process with the following command:
 ```bash
-trinity run --config scripts/config/gsm8k.yaml
+trinity run --config examples/grpo_gsm8k/gsm8k.yaml
 ```

@@ -79,14 +79,14 @@ trinity run --config scripts/config/gsm8k.yaml
 Before RFT, we may use SFT as a warmup step. We need to set `trainer.sft_warmup_iteration > 0` and prepare the SFT data to `buffer.train_dataset.path=$DATASET_PATH/{sft_data}`.

 ```yaml
-# Properly set the following configs in scripts/config/gsm8k.yaml
+# Properly set the following configs in gsm8k.yaml
 buffer:
   sft_warmup_dataset:
     storage_type: file
     algorithm_type: sft
     path: <$DATASET_PATH/{sft_data}>
     kwargs:
-      prompt_type: <prompt_type> # messages/plaintext
+      prompt_type: <prompt_type> # messages/plaintext/chatpair
       prompt_key: <prompt_key>
       response_key: <response_key>
 trainer:
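The hunk above cuts off inside the `trainer` block, so the warmup switch named in the prose is not visible here. A hedged sketch of that single setting (the field name is taken from the `trinity_configs.md` excerpt below; the value is illustrative):

```yaml
# Illustrative sketch only.
trainer:
  sft_warmup_iteration: 10   # any value > 0 enables the SFT warmup stage before RFT
```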
@@ -95,5 +95,5 @@ trainer:
 The following command runs SFT and RFT in sequence:
 ```bash
-trinity run --config scripts/config/gsm8k.yaml
+trinity run --config examples/grpo_gsm8k/gsm8k.yaml
 ```

docs/sphinx_doc/source/tutorial/trinity_configs.md

Lines changed: 2 additions & 2 deletions
@@ -1,6 +1,6 @@
 # Trinity-RFT Configuration

-The following is the main config file for Trinity-RFT. Take `scripts/config/countdown.yaml` as an example.
+The following is the main config file for Trinity-RFT. Take `countdown.yaml` as an example.

 ## Monitor
@@ -165,7 +165,7 @@ synchronizer:
 trainer:
   trainer_type: 'verl'
   algorithm_type: ppo
-  trainer_config_path: 'scripts/config/train_countdown.yaml'
+  trainer_config_path: 'examples/ppo_countdown/train_countdown.yaml'
   sft_warmup_iteration: 0
   eval_interval: 1000
 ```
