Commit 24f8baa

Update README.md and main.md (#47)
1 parent d397909 commit 24f8baa

3 files changed (+61, -46 lines)

README.md

Lines changed: 36 additions & 25 deletions
@@ -3,18 +3,18 @@
  <!-- ![trinity-rft](./docs/sphinx_doc/assets/trinity-title.png) -->

  <div align="center">
- <img src="./docs/sphinx_doc/assets/trinity-title.png" alt="Trinity-RFT" style="height: 100px;">
+ <img src="./docs/sphinx_doc/assets/trinity-title.png" alt="Trinity-RFT" style="height: 120px;">
  </div>

  &nbsp;

- **Trinity-RFT is a general-purpose, flexible and scalable framework designed for reinforcement fine-tuning (RFT) of large language models (LLM).**
+ **Trinity-RFT is a general-purpose, flexible, scalable and user-friendly framework designed for reinforcement fine-tuning (RFT) of large language models (LLM).**

- Built with a decoupled design, seamless integration for agentic workflows, and systematic data processing pipelines, Trinity-RFT can be easily adapted for diverse application scenarios, and serve as a platform for exploring advanced reinforcement learning (RL) paradigms.
+ Built with a decoupled design, seamless integration for agent-environment interaction, and systematic data processing pipelines, Trinity-RFT can be easily adapted for diverse application scenarios, and serve as a unified platform for exploring advanced reinforcement learning (RL) paradigms.

@@ -23,15 +23,15 @@ Built with a decoupled design, seamless integration for agentic workflows, and s
  ## Vision of this project

- Current RFT approaches, such as RLHF (Reinforcement Learning from Human Feedback) with proxy reward models or training long-CoT reasoning models with rule-based rewards, are limited in their ability to handle dynamic, real-world learning.
+ Current RFT approaches, such as RLHF (Reinforcement Learning from Human Feedback) with proxy reward models or training long-CoT reasoning models with rule-based rewards, are limited in their ability to handle dynamic, real-world, and continuous learning.

  Trinity-RFT envisions a future where AI agents learn by interacting directly with environments, collecting delayed or complex reward signals, and continuously refining their behavior through RL.

  For example, imagine an AI scientist that designs an experiment, executes it, waits for feedback (while working on other tasks concurrently), and iteratively updates itself based on true environmental rewards when the experiment is finally finished.

- Trinity-RFT offers a path into this future by addressing critical gaps in existing solutions.
+ Trinity-RFT offers a path into this future by providing various useful features.

@@ -42,7 +42,7 @@ Trinity-RFT offers a path into this future by addressing critical gaps in existi
  + **Unified RFT modes & algorithm support.**
- Trinity-RFT unifies and generalizes existing RFT methodologies into a flexible and configurable framework, supporting synchronous/asynchronous and on-policy/off-policy/offline training, as well as hybrid modes that combine them seamlessly into a single learning process.
+ Trinity-RFT unifies and generalizes existing RFT methodologies into a flexible and configurable framework, supporting synchronous/asynchronous, on-policy/off-policy, and online/offline training, as well as hybrid modes that combine them seamlessly into a single learning process.

  + **Agent-environment interaction as a first-class citizen.**
@@ -51,9 +51,7 @@ Trinity-RFT allows delayed rewards in multi-step/time-lagged feedback loops, han
  + **Data processing pipelines optimized for RFT with diverse/messy data.**
- These include converting raw datasets to prompt/task sets for RL, cleaning/filtering/prioritizing experiences stored in the replay buffer, synthesizing data for tasks and experiences, offering user interfaces for human in the loop, etc.
- <!-- managing the task and experience buffers (e.g., supporting collection of lagged reward signals) -->
+ These include converting raw datasets to task sets for RL, cleaning/filtering/prioritizing experiences stored in the replay buffer, synthesizing data for tasks and experiences, offering user interfaces for human in the loop, etc.

@@ -73,20 +71,20 @@ These include converting raw datasets to prompt/task sets for RL, cleaning/filte
  The overall design of Trinity-RFT exhibits a trinity:
  + RFT-core;
  + agent-environment interaction;
- + data processing pipelines tailored to RFT;
+ + data processing pipelines;

  and the design of RFT-core also exhibits a trinity:
  + explorer;
  + trainer;
- + manager & buffer.
+ + buffer.

  The *explorer*, powered by the rollout model, interacts with the environment and generates rollout trajectories to be stored in the experience buffer.

- The *trainer*, powered by the policy model, samples batches of experiences from the buffer and updates the policy via RL algorithms.
+ The *trainer*, powered by the policy model, samples batches of experiences from the buffer and updates the policy model via RL algorithms.

- These two can be completely decoupled and act asynchronously, except that they share the same experience buffer, and their model weights are synchronized once in a while.
+ These two can be completely decoupled and act asynchronously on separate machines, except that they share the same experience buffer, and their model weights are synchronized once in a while.
  Such a decoupled design is crucial for making the aforementioned features of Trinity-RFT possible.

  <!-- e.g., flexible and configurable RFT modes (on-policy/off-policy, synchronous/asynchronous, immediate/lagged rewards),
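The explorer/trainer/buffer decoupling described in this hunk can be pictured with a minimal sketch. The snippet below is purely illustrative: the names (`explorer`, `trainer`, `SYNC_INTERVAL`) and the in-process queue standing in for the experience buffer are assumptions for exposition, not Trinity-RFT's actual classes, APIs, or configuration.

```python
# Conceptual sketch only: an explorer thread fills a shared experience buffer
# while a trainer thread consumes batches and periodically "synchronizes" weights.
# None of these names correspond to Trinity-RFT's real implementation.
import queue
import random
import threading

experience_buffer = queue.Queue()      # stand-in for the shared experience buffer
shared_weights = {"version": 0}        # stand-in for synchronized model weights
SYNC_INTERVAL = 4                      # illustrative: sync weights every N trainer steps


def explorer(num_rollouts: int) -> None:
    """Rollout side: interact with the environment and store trajectories."""
    for step in range(num_rollouts):
        trajectory = {
            "step": step,
            "reward": random.random(),                  # placeholder reward signal
            "policy_version": shared_weights["version"],
        }
        experience_buffer.put(trajectory)


def trainer(num_steps: int, batch_size: int = 2) -> None:
    """Training side: sample experience batches and update the policy."""
    for step in range(1, num_steps + 1):
        batch = [experience_buffer.get() for _ in range(batch_size)]
        # ... compute the RL loss on `batch` and update the policy model here ...
        if step % SYNC_INTERVAL == 0:
            shared_weights["version"] += 1              # periodic weight synchronization


t_explore = threading.Thread(target=explorer, args=(16,))
t_train = threading.Thread(target=trainer, args=(8,))
t_explore.start()
t_train.start()
t_explore.join()
t_train.join()
print("final policy version:", shared_weights["version"])
```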
@@ -97,8 +95,11 @@ among others. -->
- Meanwhile, Trinity-RFT has done the dirty work for ensuring high efficiency in every component of the framework,
- e.g., utilizing NCCL (when feasible) for model weight synchronization, sequence concatenation with proper masking for multi-turn conversations and ReAct-style workflows, pipeline parallelism for the synchronous RFT mode, among many others.
+ Meanwhile, Trinity-RFT has done a lot of work to ensure high efficiency and robustness in every component of the framework,
+ e.g., utilizing NCCL (when feasible) for model weight synchronization, sequence concatenation with proper masking for multi-turn conversations and ReAct-style workflows, pipeline parallelism for the synchronous RFT mode,
+ asynchronous and concurrent LLM inference for rollout,
+ fault tolerance for agent/environment failures,
+ among many others.

@@ -146,8 +147,7 @@ pip install flash-attn -v
  ```

  Installation from docker:
- We provided a dockerfile for Trinity-RFT (trinity)
+ we have provided a dockerfile for Trinity-RFT (trinity)

  ```shell
  git clone https://github.com/modelscope/Trinity-RFT
@@ -163,6 +163,12 @@ docker run -it --gpus all --shm-size="64g" --rm -v $PWD:/workspace -v <root_path
  ```

+ Trinity-RFT requires
+ Python version >= 3.10,
+ CUDA version >= 12.4,
+ and at least 2 GPUs.

  ### Step 2: prepare dataset and model
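The prerequisites introduced in this hunk (Python >= 3.10, CUDA >= 12.4, at least 2 GPUs) can be sanity-checked before installation with a short script like the one below. It is an optional helper, not part of Trinity-RFT, and it assumes `nvidia-smi` is available on PATH.

```python
# Optional pre-install check for the stated prerequisites; not part of Trinity-RFT.
import subprocess
import sys

# Python version check (>= 3.10 per the README).
assert sys.version_info >= (3, 10), f"Python >= 3.10 required, found {sys.version.split()[0]}"

try:
    # Count visible GPUs; at least 2 are required per the README.
    gpu_names = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"], text=True
    ).strip().splitlines()
    print(f"Detected {len(gpu_names)} GPU(s); at least 2 are required.")
except (OSError, subprocess.CalledProcessError):
    print("nvidia-smi not found or failed; verify CUDA >= 12.4 manually (e.g. `nvcc --version`).")
```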

@@ -203,7 +209,7 @@ For more details about dataset downloading, please refer to [Huggingface](https:
  For convenience, Trinity-RFT provides a web interface for configuring your RFT process.

  > [!NOTE]
- > This is a experimental feature. We will continue to improve it and make it more user-friendly.
+ > This is an experimental feature, and we will continue to improve it.

  ```bash
  trinity studio --port 8080
@@ -214,7 +220,7 @@ Then you can configure your RFT process in the web page and generate a config fi
  You can save the config for later use or run it directly as described in the following section.

- For advanced users, you can also manually configure your RFT process by editing the config file.
+ Advanced users can also configure the RFT process by editing the config file directly.
  We provide a set of example config files in [`examples`](examples/).

@@ -253,12 +259,12 @@ For studio users, just click the "Run" button in the web page.
  For more detailed examples about how to use Trinity-RFT, please refer to the following tutorials:
- + [A quick example with GSM8k](./docs/sphinx_doc/source/tutorial/example_reasoning_basic.md);
- + [Off-policy mode of RFT](./docs/sphinx_doc/source/tutorial/example_reasoning_advanced.md);
- + [Asynchronous mode of RFT](./docs/sphinx_doc/source/tutorial/example_async_mode.md);
- + [Multi-turn tasks](./docs/sphinx_doc/source/tutorial/example_multi_turn.md);
- + [Data processing pipelines](./docs/sphinx_doc/source/tutorial/example_data_functionalities.md);
- + [Offline learning by DPO](./docs/sphinx_doc/source/tutorial/example_dpo.md).
+ + [A quick example with GSM8k](./docs/sphinx_doc/source/tutorial/example_reasoning_basic.md)
+ + [Off-policy mode of RFT](./docs/sphinx_doc/source/tutorial/example_reasoning_advanced.md)
+ + [Asynchronous mode of RFT](./docs/sphinx_doc/source/tutorial/example_async_mode.md)
+ + [Multi-turn tasks](./docs/sphinx_doc/source/tutorial/example_multi_turn.md)
+ + [Offline learning by DPO](./docs/sphinx_doc/source/tutorial/example_dpo.md)
+ + [Advanced data processing / human-in-the-loop](./docs/sphinx_doc/source/tutorial/example_data_functionalities.md)

@@ -279,6 +285,11 @@ Please refer to [this document](./docs/sphinx_doc/source/tutorial/trinity_config
  Please refer to [this document](./docs/sphinx_doc/source/tutorial/trinity_programming_guide.md).

+ ## Upcoming features
+
+ A tentative roadmap: https://github.com/modelscope/Trinity-RFT/issues/51

  ## Contribution guide
(third changed file: 46.5 KB, rendered as an image preview on the original page; not reproduced here)

docs/sphinx_doc/source/main.md

Lines changed: 25 additions & 21 deletions
@@ -5,21 +5,22 @@
- Trinity-RFT is a general-purpose, flexible and scalable framework designed for reinforcement fine-tuning (RFT) of large language models (LLM).
- Built with a decoupled architecture, seamless integration for agentic workflows, and systematic data processing pipelines, Trinity-RFT can be easily adapted for diverse application scenarios, and serve as a platform for exploring advanced reinforcement learning (RL) paradigms.
+ Trinity-RFT is a general-purpose, flexible, scalable and user-friendly framework designed for reinforcement fine-tuning (RFT) of large language models (LLM).
+ Built with a decoupled design, seamless integration for agent-environment interaction, and systematic data processing pipelines, Trinity-RFT can be easily adapted for diverse application scenarios, and serve as a unified platform for exploring advanced reinforcement learning (RL) paradigms.

- **Vision of this project:**
+ **Vision of this project:**

- Current RFT approaches, such as RLHF (Reinforcement Learning from Human Feedback) with proxy reward models or training long-CoT reasoning LLMs with rule-based rewards, are limited in their ability to handle dynamic, real-world learning.
- Trinity-RFT envisions a future where AI agents learn by interacting directly with environments, collecting delayed or complex reward signals, and continuously refining their behavior through advanced RL paradigms.
- For example, imagine an AI scientist that designs an experiment, executes it via interacting with the environment, waits for feedback (while working on some other tasks concurrently), and iteratively updates itself based on true environmental rewards when the experiment is finally finished.
- Trinity-RFT offers a path into this future by addressing critical gaps in existing solutions.
+ Current RFT approaches, such as RLHF (Reinforcement Learning from Human Feedback) with proxy reward models or training long-CoT reasoning models with rule-based rewards, are limited in their ability to handle dynamic, real-world, and continuous learning.
+ Trinity-RFT envisions a future where AI agents learn by interacting directly with environments, collecting delayed or complex reward signals, and continuously refining their behavior through RL.
+ For example, imagine an AI scientist that designs an experiment, executes it, waits for feedback (while working on other tasks concurrently), and iteratively updates itself based on true environmental rewards when the experiment is finally finished.
+ Trinity-RFT offers a path into this future by providing various useful features.

@@ -29,13 +30,13 @@ Trinity-RFT offers a path into this future by addressing critical gaps in existi
  + **Unified RFT modes & algorithm support.**
- Trinity-RFT unifies and generalizes existing RFT methodologies into a flexible and configurable framework, supporting synchronous/asynchronous and on-policy/off-policy/offline training, as well as hybrid modes that combine the above seamlessly into a single learning process (e.g., incorporating expert trajectories or high-quality SFT data to accelerate an online RL process).
+ Trinity-RFT unifies and generalizes existing RFT methodologies into a flexible and configurable framework, supporting synchronous/asynchronous, on-policy/off-policy, and online/offline training, as well as hybrid modes that combine the above seamlessly into a single learning process (e.g., incorporating expert trajectories or high-quality SFT data to accelerate an online RL process).

  + **Agent-environment interaction as a first-class citizen.**
  Trinity-RFT natively models the challenges of RFT with real-world agent-environment interactions. It allows delayed rewards in multi-step and/or time-lagged feedback loops, handles long-tailed latencies and environment/agent failures gracefully, and supports distributed deployment where explorers (i.e., the rollout agents) and trainers (i.e., the policy model trained by RL) can operate across separate clusters or devices (e.g., explorers on edge devices, trainers in cloud clusters) and scale up independently.

  + **Data processing pipelines optimized for RFT with diverse/messy data.**
- These include converting raw datasets to prompt/task sets for RL, cleaning/filtering/prioritizing experiences stored in the replay buffer, synthesizing data for tasks and experiences, offering user interfaces for RFT with human in the loop, managing the task and experience buffers (e.g., supporting collection of lagged reward signals), among others.
+ These include converting raw datasets to task sets for RL, cleaning/filtering/prioritizing experiences stored in the replay buffer, synthesizing data for tasks and experiences, offering user interfaces for RFT with human in the loop, among others.

@@ -51,20 +52,20 @@ These include converting raw datasets to prompt/task sets for RL, cleaning/filte
  The overall design of Trinity-RFT exhibits a trinity:
  + RFT-core;
  + agent-environment interaction;
- + data processing pipelines tailored to RFT.
+ + data processing pipelines.

  In particular, the design of RFT-core also exhibits a trinity:
  + explorer;
  + trainer;
- + manager & buffer.
+ + buffer.

  The explorer, powered by the rollout model, interacts with the environment and generates rollout trajectories to be stored in the experience buffer.
  The trainer, powered by the policy model, samples batches of experiences from the buffer and updates the policy via RL algorithms.
- These two can be completely decoupled and act asynchronously, except that they share the same experience buffer, and their model weights are synchronized once in a while (according to a schedule specified by user configurations).
+ These two can be completely decoupled and act asynchronously on separate machines, except that they share the same experience buffer, and their model weights are synchronized once in a while (according to a schedule specified by user configurations).

  Such a decoupled design is crucial for making the aforementioned features of Trinity-RFT possible,
@@ -77,7 +78,7 @@ among others.
  Meanwhile, Trinity-RFT has done the dirty work for ensuring high efficiency in every component of the framework,
- e.g., utilizing NCCL (when feasible) for model weight synchronization, sequence concatenation with proper masking for multi-turn conversations and ReAct workflows, pipeline parallelism for the synchronous RFT mode, among many others.
+ e.g., utilizing NCCL (when feasible) for model weight synchronization, sequence concatenation with proper masking for multi-turn conversations and ReAct workflows, pipeline parallelism for the synchronous RFT mode, asynchronous and concurrent LLM inference for rollout, among many others.

@@ -125,8 +126,7 @@ pip install flash-attn -v
  Installation from docker:
- We provided a dockerfile for Trinity-RFT (trinity)
+ we have provided a dockerfile for Trinity-RFT (trinity)

  ```shell
  git clone https://github.com/modelscope/Trinity-RFT
@@ -141,6 +141,10 @@ docker build -f scripts/docker/Dockerfile -t trinity-rft:latest .
  docker run -it --gpus all --shm-size="64g" --rm -v $PWD:/workspace -v <root_path_of_data_and_checkpoints>:/data trinity-rft:latest
  ```

+ Trinity-RFT requires
+ Python version >= 3.10,
+ CUDA version >= 12.4,
+ and at least 2 GPUs.

  ### Step 2: prepare dataset and model
@@ -247,12 +251,12 @@ More example config files can be found in `examples`.
  For more detailed examples about how to use Trinity-RFT, please refer to the following documents:
- + [A quick example with GSM8k](tutorial/example_reasoning_basic.md);
- + [Off-policy mode of RFT](tutorial/example_reasoning_advanced.md);
- + [Asynchronous mode of RFT](tutorial/example_async_mode.md);
- + [Multi-turn tasks](tutorial/example_multi_turn.md);
- + [Data processing pipelines](tutorial/example_data_functionalities.md);
- + [Offline learning by DPO](tutorial/example_dpo.md).
+ + [A quick example with GSM8k](tutorial/example_reasoning_basic.md)
+ + [Off-policy mode of RFT](tutorial/example_reasoning_advanced.md)
+ + [Asynchronous mode of RFT](tutorial/example_async_mode.md)
+ + [Multi-turn tasks](tutorial/example_multi_turn.md)
+ + [Offline learning by DPO](tutorial/example_dpo.md)
+ + [Advanced data processing / human-in-the-loop](tutorial/example_data_functionalities.md)

0 commit comments
