-Trinity-RFT is a general-purpose, flexible and scalable framework designed for reinforcement fine-tuning (RFT) of large language models (LLM).
-Built with a decoupled architecture, seamless integration for agentic workflows, and systematic data processing pipelines, Trinity-RFT can be easily adapted for diverse application scenarios, and serve as a platform for exploring advanced reinforcement learning (RL) paradigms.
+**Trinity-RFT is a general-purpose, flexible and scalable framework designed for reinforcement fine-tuning (RFT) of large language models (LLM).**
+Built with a decoupled design, seamless integration for agentic workflows, and systematic data processing pipelines, Trinity-RFT can be easily adapted for diverse application scenarios, and serve as a platform for exploring advanced reinforcement learning (RL) paradigms.

-**Vision of this project:**
-Current RFT approaches, such as RLHF (Reinforcement Learning from Human Feedback) with proxy reward models or training long-CoT reasoning LLMs with rule-based rewards, are limited in their ability to handle dynamic, real-world learning.
-Trinity-RFT envisions a future where AI agents learn by interacting directly with environments, collecting delayed or complex reward signals, and continuously refining their behavior through advanced RL paradigms.
-For example, imagine an AI scientist that designs an experiment, executes it via interacting with the environment, waits for feedback (while working on some other tasks concurrently), and iteratively updates itself based on true environmental rewards when the experiment is finally finished.
+## Vision of this project
+Current RFT approaches, such as RLHF (Reinforcement Learning from Human Feedback) with proxy reward models or training long-CoT reasoning models with rule-based rewards, are limited in their ability to handle dynamic, real-world learning.
+Trinity-RFT envisions a future where AI agents learn by interacting directly with environments, collecting delayed or complex reward signals, and continuously refining their behavior through RL.
+For example, imagine an AI scientist that designs an experiment, executes it, waits for feedback (while working on other tasks concurrently), and iteratively updates itself based on true environmental rewards when the experiment is finally finished.

Trinity-RFT offers a path into this future by addressing critical gaps in existing solutions.

-**Key features of Trinity-RFT:**
+## Key features
+**Unified RFT modes & algorithm support.**
-Trinity-RFT unifies and generalizes existing RFT methodologies into a flexible and configurable framework, supporting synchronous/asynchronous and on-policy/off-policy/offline training, as well as hybrid modes that combine the above seamlessly into a single learning process (e.g., incorporating expert trajectories or high-quality SFT data to accelerate an online RL process).
+Trinity-RFT unifies and generalizes existing RFT methodologies into a flexible and configurable framework, supporting synchronous/asynchronous and on-policy/off-policy/offline training, as well as hybrid modes that combine them seamlessly into a single learning process.
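As a minimal sketch of how these modes surface in configuration (the keys below are taken from the examples further down in this diff; the values are placeholders and the exact nesting may differ):

```yaml
# Minimal sketch -- keys from the examples below; values are placeholders.
mode: train                     # 'train' runs the trainer alone, as in the DPO example
synchronizer:
  sync_method: 'offline'        # offline synchronization, as in the DPO example
  sync_iteration_interval: 10   # sync explorer/trainer weights every N steps (see the OPMD example)
```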
+**Agent-environment interaction as a first-class citizen.**
-Trinity-RFT natively models the challenges of RFT with real-world agent-environment interactions. It allows delayed rewards in multi-step and/or time-lagged feedback loops, handles long-tailed latencies and environment/agent failures gracefully, and supports distributed deployment where explorers (i.e., the rollout agents) and trainers (i.e., the policy model trained by RL) can operate across separate clusters or devices (e.g., explorers on edge devices, trainers in cloud clusters) and scale up independently.
+Trinity-RFT allows delayed rewards in multi-step/time-lagged feedback loops, handles long-tailed latencies and environment/agent failures gracefully, and supports distributed deployment where explorers and trainers can operate across separate devices and scale up independently.
+**Data processing pipelines optimized for RFT with diverse/messy data.**
-These include converting raw datasets to prompt/task sets for RL, cleaning/filtering/prioritizing experiences stored in the replay buffer, synthesizing data for tasks and experiences, offering user interfaces for RFT with human in the loop, managing the task and experience buffers (e.g., supporting collection of lagged reward signals), among others.
+These include converting raw datasets to prompt/task sets for RL, cleaning/filtering/prioritizing experiences stored in the replay buffer, synthesizing data for tasks and experiences, offering user interfaces for human in the loop, etc.
+<!-- managing the task and experience buffers (e.g., supporting collection of lagged reward signals) -->
@@ -59,40 +73,40 @@ These include converting raw datasets to prompt/task sets for RL, cleaning/filte
The overall design of Trinity-RFT exhibits a trinity:
+ RFT-core;
+ agent-environment interaction;
-+ data processing pipelines tailored to RFT.
++ data processing pipelines tailored to RFT;

-In particular, the design of RFT-core also exhibits a trinity:
+and the design of RFT-core also exhibits a trinity:
+ explorer;
+ trainer;
+ manager & buffer.

-The explorer, powered by the rollout model, interacts with the environment and generates rollout trajectories to be stored in the experience buffer.
-The trainer, powered by the policy model, samples batches of experiences from the buffer and updates the policy via RL algorithms.
-These two can be completely decoupled and act asynchronously, except that they share the same experience buffer, and their model weights are synchronized once in a while (according to a schedule specified by user configurations).
+The *explorer*, powered by the rollout model, interacts with the environment and generates rollout trajectories to be stored in the experience buffer.
+The *trainer*, powered by the policy model, samples batches of experiences from the buffer and updates the policy via RL algorithms.
-Such a decoupled design is crucial for making the aforementioned features of Trinity-RFT possible,
-e.g., flexible and configurable RFT modes (on-policy/off-policy, synchronous/asynchronous, immediate/lagged rewards),
+These two can be completely decoupled and act asynchronously, except that they share the same experience buffer, and their model weights are synchronized once in a while.
+Such a decoupled design is crucial for making the aforementioned features of Trinity-RFT possible.
fault tolerance for failures of explorer (agent/environment) or trainer,
high efficiency in the presence of long-tailed rollout latencies,
data processing pipelines and human in the loop of RFT (e.g., via acting on the experience buffer, which is implemented as a persistent database),
-among others.
+among others.-->
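As a rough sketch of how this decoupling appears in configuration (keys are taken from the examples later in this diff; the values and nesting are placeholders), the explorer and trainer each get their own section:

```yaml
# Rough sketch -- explorer and trainer are configured as separate sections.
explorer:
  repeat_times: 8         # placeholder: number of rollouts generated per task
trainer:
  algorithm_type: dpo     # placeholder: the RL algorithm used by the trainer
```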
Meanwhile, Trinity-RFT has done the dirty work for ensuring high efficiency in every component of the framework,
-e.g., utilizing NCCL (when feasible) for model weight synchronization, sequence concatenation with proper masking for multi-turn conversations and ReAct workflows, pipeline parallelism for the synchronous RFT mode, among many others.
+e.g., utilizing NCCL (when feasible) for model weight synchronization, sequence concatenation with proper masking for multi-turn conversations and ReAct-style workflows, pipeline parallelism for the synchronous RFT mode, among many others.
## Getting started
-*Note: this project is currently under active development; comments and suggestions are welcome!*
+> [!NOTE]
+> This project is currently under active development. Comments and suggestions are welcome!
@@ -186,20 +200,17 @@ For more details about dataset downloading, please refer to [Huggingface](https:
### Step 3: configurations

-You may customize the configurations in `scripts/config/{config_name}.yaml` and `scripts/config/{train_config_name}.yaml`. For example, the model and dataset are specified as:
+You may customize the configurations in [`examples`](examples/). For example, the model and dataset are specified as:
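The concrete snippet falls outside this diff excerpt. As a purely hypothetical illustration (the key names below are assumptions, except `buffer.train_dataset.path`, which appears later in this diff), such a specification might look like:

```yaml
# Hypothetical sketch -- only buffer.train_dataset.path is confirmed elsewhere in this diff.
model:
  model_path: $MODEL_PATH/{model_name}      # assumed key for the base model checkpoint
buffer:
  train_dataset:
    path: $DATASET_PATH/{dataset_name}      # path to the prompt/task dataset
```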
**docs/sphinx_doc/source/main.md** (4 additions & 7 deletions)
@@ -180,20 +180,17 @@ For more details about dataset downloading, please refer to [Huggingface](https:
### Step 3: configurations

-You may customize the configurations in `scripts/config/{config_name}.yaml` and `scripts/config/{train_config_name}.yaml`. For example, the model and dataset are specified as:
+You may customize the configurations in [`examples`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/). For example, the model and dataset are specified as:
**docs/sphinx_doc/source/tutorial/example_data_functionalities.md** (4 additions & 3 deletions)
@@ -133,12 +133,13 @@ And you can set the `clean_strategy` to 'iterative' to get a better dataset.
-All config items in the `data` section can be found [here](trinity_configs.md). A prepared config file for this example of GSM-8K can be found in [the config file of gsm8k](../../../../scripts/config/gsm8k.yaml).
+All config items in the `data` section can be found [here](trinity_configs.md). A prepared config file for this example of GSM-8K can be found in [the config file of gsm8k](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_gsm8k/gsm8k.yaml).

-> [!NOTE]
-> Only when one of `dj_process_desc` and `dj_config_path` is provided, the data module and the data active iterator will be activated. Otherwise, this part will be skipped and it will enter into the exploring stage directly.
+```{note}
+The data module and the data active iterator are activated only when one of `dj_process_desc` and `dj_config_path` is provided. Otherwise, this part will be skipped and the process will enter the exploring stage directly.
+```
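For orientation, a rough sketch of such a `data` section, using only the option names mentioned on this page (`dj_config_path`, `dj_process_desc`, `clean_strategy`); the authoritative schema is in trinity_configs.md and the exact nesting may differ:

```yaml
# Rough sketch -- option names from this page; see trinity_configs.md for the real schema.
data:
  dj_config_path: 'path/to/your_data_processing_recipe.yaml'   # or provide dj_process_desc instead
  # dj_process_desc: 'a natural-language description of the desired processing'
  clean_strategy: 'iterative'                                   # iterate cleaning to get a better dataset
```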
### Exploring & Training
After preparing the config files of Trinity-RFT, you can start your ray cluster and run the RFT process including the data active iterator part with the following commands:
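The command block itself is not shown in this diff excerpt; a minimal sketch, assuming a single-node Ray cluster and the `trinity run` usage shown elsewhere in this diff:

```shell
# Minimal sketch: start a local Ray cluster, then launch the RFT process.
ray start --head
trinity run --config examples/grpo_gsm8k/gsm8k.yaml   # substitute your own config file
```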
**docs/sphinx_doc/source/tutorial/example_dpo.md** (4 additions & 4 deletions)
@@ -38,12 +38,12 @@ Note that the dataset has the keys `prompt`, `chosen` and `rejected`. If not, pa
### Configuration

-We use the configurations in `scripts/config/dpo.yaml` and `scripts/config/train_dpo.yaml` for this experiment. Some important setups are listed in the following:
+We use the configurations in [`dpo.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/dpo_humanlike/dpo.yaml) and [`train_dpo.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/dpo_humanlike/train_dpo.yaml) for this experiment. Some important setups are listed below:

We run the experiment in train mode, as there is no Explorer. To enable this mode, we set `mode` to `train` and `sync_method` to `offline`. The value of `sync_iteration_interval` can be set to the same value as `save_freq`.

```yaml
-#scripts/config/dpo.yaml
+#In dpo.yaml
mode: train
synchronizer:
  sync_method: 'offline'
@@ -60,7 +60,7 @@ buffer:
trainer:
  algorithm_type: dpo

-#scripts/config/train_dpo.yaml
+#In train_dpo.yaml
actor_rollout_ref:
  actor:
    alg_type: dpo
@@ -73,5 +73,5 @@ actor_rollout_ref:
Run the RFT process with the following command:

```shell
-trinity run --config scripts/config/dpo.yaml
+trinity run --config examples/dpo_humanlike/dpo.yaml
**docs/sphinx_doc/source/tutorial/example_multi_turn.md** (3 additions & 3 deletions)
@@ -36,15 +36,15 @@ The task is described as an environment instead of a single prompt.
## Step 2: Config preparation and run the experiment

-You can refer to `example_reasoning_basic` to setup the config and others. The default config files are `scripts/config/alfworld.yaml` and `scripts/config/webshop.yaml`, respectively.
+You can refer to `example_reasoning_basic` to set up the config and other details. The default config files are [`alfworld.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_alfworld/alfworld.yaml) and [`webshop.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_webshop/webshop.yaml), respectively.

You may revise the configurations as needed and run the experiment!

```bash
# For ALFworld env
-trinity run --config scripts/config/alfworld.yaml
+trinity run --config examples/grpo_alfworld/alfworld.yaml

# For WebShop env
-trinity run --config scripts/config/webshop.yaml
+trinity run --config examples/grpo_webshop/webshop.yaml
**docs/sphinx_doc/source/tutorial/example_reasoning_advanced.md** (3 additions & 3 deletions)
@@ -17,11 +17,11 @@ The algorithm design and analysis can be found in this [technical report](../../
To try out the OPMD algorithm:

```shell
-trinity run --config scripts/config/gsm8k_opmd.yaml
+trinity run --config examples/opmd_gsm8k/opmd_gsm8k.yaml
```

Note that in this config file, `sync_iteration_interval` is set to 10, i.e., the model weights of explorer and trainer are synchronized only once every 10 training steps, which leads to a challenging off-policy scenario (potentially with abrupt distribution shift during the RFT process).
-Other configurations of particular interest are explained at the beginning of `scripts/config/train_gsm8k_opmd.yaml`.
+Other configurations of particular interest are explained at the beginning of [`train_opmd_gsm8k.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/opmd_gsm8k/train_opmd_gsm8k.yaml).
@@ -48,4 +48,4 @@ To run this mode, the explorer and trainer need to be launched separately, with
-We are still testing this mode more thoroughly. A concrete example is coming soon!
+*We are still testing this mode more thoroughly. A concrete example is coming soon!*
**docs/sphinx_doc/source/tutorial/example_reasoning_basic.md** (7 additions & 7 deletions)
@@ -48,15 +48,15 @@ synchronizer:
### Use GRPO or PPO Algorithm

-We use the configurations in `scripts/config/gsm8k.yaml` and `scripts/config/train_gsm8k.yaml` for this experiment. Some important setups are listed in the following:
+We use the configurations in [`gsm8k.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_gsm8k/gsm8k.yaml) and [`train_gsm8k.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_gsm8k/train_gsm8k.yaml) for this experiment. Some important setups are listed below:

```yaml
-# scripts/config/gsm8k.yaml
+# In gsm8k.yaml
explorer:
  repeat_times: {number of rollouts for each task}

-# scripts/config/train_gsm8k.yaml
+# In train_gsm8k.yaml
actor_rollout_ref:
  actor:
    use_kl_loss: True (for GRPO) / False (for PPO)
@@ -69,7 +69,7 @@ algorithm:
Run the RFT process with the following command:

```bash
-trinity run --config scripts/config/gsm8k.yaml
+trinity run --config examples/grpo_gsm8k/gsm8k.yaml
```
@@ -79,14 +79,14 @@ trinity run --config scripts/config/gsm8k.yaml
Before RFT, we may use SFT as a warmup step. We need to set `trainer.sft_warmup_iteration > 0` and provide the SFT data via `buffer.train_dataset.path=$DATASET_PATH/{sft_data}`.

```yaml
-# Properly set the following configs in scripts/config/gsm8k.yaml
+# Properly set the following configs in gsm8k.yaml
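# (The remainder of this block is not shown in the diff excerpt. The lines below are a
#  rough sketch based on the keys described in the paragraph above; exact nesting may differ.)
trainer:
  sft_warmup_iteration: {a positive number of SFT warmup steps}
buffer:
  train_dataset:
    path: $DATASET_PATH/{sft_data}
```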