11 changes: 4 additions & 7 deletions README.md
@@ -200,20 +200,17 @@ For more details about dataset downloading, please refer to [Huggingface](https:
### Step 3: configurations


-You may customize the configurations in `scripts/config/{config_name}.yaml` and `scripts/config/{train_config_name}.yaml`. For example, the model and dataset are specified as:
+You may customize the configurations in [`examples`](examples/). For example, the model and dataset are specified as:

```yaml
model:
  model_path: $MODEL_PATH/{model_name}

data:
  dataset_path: $DATASET_PATH/{dataset_name}

-trainer:
-  trainer_config_path: scripts/config/{train_config_name}.yaml
```

-You may use the default configurations located in the directory `scripts/config`. Please refer to `examples` for more details.
+Please refer to [`examples`](examples/) for more details.



@@ -252,12 +249,12 @@ trinity run --config <config_path>
For example, below is the command for fine-tuning Qwen-2.5-1B-Instruct on the GSM8k dataset using the GRPO algorithm:

```shell
-trinity run --config scripts/config/gsm8k.yaml
+trinity run --config examples/grpo_gsm8k/gsm8k.yaml
```



-More example config files can be found in `scripts/config`.
+More example config files can be found in `examples`.



11 changes: 4 additions & 7 deletions docs/sphinx_doc/source/main.md
@@ -180,20 +180,17 @@ For more details about dataset downloading, please refer to [Huggingface](https:
### Step 3: configurations


-You may customize the configurations in `scripts/config/{config_name}.yaml` and `scripts/config/{train_config_name}.yaml`. For example, the model and dataset are specified as:
+You may customize the configurations in [`examples`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/). For example, the model and dataset are specified as:

```yaml
model:
  model_path: $MODEL_PATH/{model_name}

data:
  dataset_path: $DATASET_PATH/{dataset_name}

-trainer:
-  trainer_config_path: scripts/config/{train_config_name}.yaml
```

-You may use the default configurations located in the directory `scripts/config`. Please refer to `examples` for more details.
+Please refer to [`examples`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/) for more details.



@@ -232,12 +229,12 @@ trinity run --config <config_path>
For example, below is the command for fine-tuning Qwen-2.5-1B-Instruct on the GSM8k dataset using the GRPO algorithm:

```shell
-trinity run --config scripts/config/gsm8k.yaml
+trinity run --config examples/grpo_gsm8k/gsm8k.yaml
```



-More example config files can be found in `scripts/config`.
+More example config files can be found in `examples`.



@@ -133,7 +133,7 @@ And you can set the `clean_strategy` to 'iterative' to get a better dataset.



-All config items in the `data` section can be found [here](trinity_configs.md). A prepared config file for this GSM-8K example can be found in [the config file of gsm8k](../../../../scripts/config/gsm8k.yaml).
+All config items in the `data` section can be found [here](trinity_configs.md). A prepared config file for this GSM-8K example can be found in [the config file of gsm8k](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_gsm8k/gsm8k.yaml).
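
For orientation, a minimal sketch of a `data` section is shown below, mirroring the fields used in `examples/grpo_math/math.yaml` elsewhere in this PR; the paths and values are placeholders, not a definitive recipe:

```yaml
data:
  dataset_path: /PATH/TO/DATASET/   # placeholder; point at your prepared dataset
  train_split: train
  eval_split: test
  format_config:
    prompt_key: 'question'          # keys depend on the dataset schema
    response_key: 'gt_answer'
  total_epoch: 20
  batch_size: 288
  default_workflow_type: 'math_workflow'
```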



8 changes: 4 additions & 4 deletions docs/sphinx_doc/source/tutorial/example_dpo.md
@@ -38,12 +38,12 @@ Note that the dataset has the keys `prompt`, `chosen` and `rejected`. If not, pa

### Configuration

-We use the configurations in `scripts/config/dpo.yaml` and `scripts/config/train_dpo.yaml` for this experiment. Some important settings are listed below:
+We use the configurations in [`dpo.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/dpo_humanlike/dpo.yaml) and [`train_dpo.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/dpo_humanlike/train_dpo.yaml) for this experiment. Some important settings are listed below:

We run the experiment in train mode, as there is no Explorer. To enable this mode, we set `mode` to `train` and `sync_method` to `offline`. The value of `sync_iteration_interval` can be set to the same value as `save_freq`.

```yaml
-# scripts/config/dpo.yaml
+# In dpo.yaml
mode: train
synchronizer:
  sync_method: 'offline'
```
@@ -60,7 +60,7 @@ buffer:
```yaml
trainer:
  algorithm_type: dpo

-# scripts/config/train_dpo.yaml
+# In train_dpo.yaml
actor_rollout_ref:
  actor:
    alg_type: dpo
```
@@ -73,5 +73,5 @@ actor_rollout_ref:
Run the RFT process with the following command:

```shell
-trinity run --config scripts/config/dpo.yaml
+trinity run --config examples/dpo_humanlike/dpo.yaml
```
6 changes: 3 additions & 3 deletions docs/sphinx_doc/source/tutorial/example_multi_turn.md
@@ -36,15 +36,15 @@ The task is described as an environment instead of a single prompt.

## Step 2: Prepare the config and run the experiment

-You can refer to `example_reasoning_basic` to set up the config and other settings. The default config files are `scripts/config/alfworld.yaml` and `scripts/config/webshop.yaml`, respectively.
+You can refer to `example_reasoning_basic` to set up the config and other settings. The default config files are [`alfworld.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_alfworld/alfworld.yaml) and [`webshop.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_webshop/webshop.yaml), respectively.
You may revise the configurations as needed and run the experiment!

```bash
# For ALFworld env
-trinity run --config scripts/config/alfworld.yaml
+trinity run --config examples/grpo_alfworld/alfworld.yaml

# For WebShop env
-trinity run --config scripts/config/webshop.yaml
+trinity run --config examples/grpo_webshop/webshop.yaml
```

## Advanced: How to build your own environment
4 changes: 2 additions & 2 deletions docs/sphinx_doc/source/tutorial/example_reasoning_advanced.md
@@ -17,11 +17,11 @@ The algorithm design and analysis can be found in this [technical report](../../

To try out the OPMD algorithm:
```shell
-trinity run --config scripts/config/gsm8k_opmd.yaml
+trinity run --config examples/opmd_gsm8k/opmd_gsm8k.yaml
```

Note that in this config file, `sync_iteration_interval` is set to 10, i.e., the model weights of the explorer and trainer are synchronized only once every 10 training steps, which leads to a challenging off-policy scenario (potentially with abrupt distribution shift during the RFT process).
-Other configurations of particular interest are explained at the beginning of `scripts/config/train_gsm8k_opmd.yaml`.
+Other configurations of particular interest are explained at the beginning of [`train_opmd_gsm8k.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/opmd_gsm8k/train_opmd_gsm8k.yaml).
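
For reference, a minimal sketch of the off-policy setting described above, following the `synchronizer` schema used by the other example configs in this PR (the `sync_method` value is an assumption):

```yaml
synchronizer:
  sync_method: 'online'         # assumed, as in examples/grpo_math/math.yaml
  sync_iteration_interval: 10   # explorer/trainer weights sync once every 10 steps
```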



14 changes: 7 additions & 7 deletions docs/sphinx_doc/source/tutorial/example_reasoning_basic.md
@@ -48,15 +48,15 @@ synchronizer:

### Use GRPO or PPO Algorithm

-We use the configurations in `scripts/config/gsm8k.yaml` and `scripts/config/train_gsm8k.yaml` for this experiment. Some important settings are listed below:
+We use the configurations in [`gsm8k.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_gsm8k/gsm8k.yaml) and [`train_gsm8k.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_gsm8k/train_gsm8k.yaml) for this experiment. Some important settings are listed below:


```yaml
-# scripts/config/gsm8k.yaml
+# In gsm8k.yaml
explorer:
  repeat_times: {number of rollouts for each task}

-# scripts/config/train_gsm8k.yaml
+# In train_gsm8k.yaml
actor_rollout_ref:
  actor:
    use_kl_loss: True (for GRPO) / False (for PPO)
```
@@ -69,7 +69,7 @@ algorithm:

Run the RFT process with the following command:
```bash
-trinity run --config scripts/config/gsm8k.yaml
+trinity run --config examples/grpo_gsm8k/gsm8k.yaml
```


@@ -79,14 +79,14 @@ trinity run --config scripts/config/gsm8k.yaml
Before RFT, we may use SFT as a warmup step. We need to set `trainer.sft_warmup_iteration > 0` and prepare the SFT data at `buffer.train_dataset.path=$DATASET_PATH/{sft_data}`.

```yaml
-# Properly set the following configs in scripts/config/gsm8k.yaml
+# Properly set the following configs in gsm8k.yaml
buffer:
  sft_warmup_dataset:
    storage_type: file
    algorithm_type: sft
    path: <$DATASET_PATH/{sft_data}>
    kwargs:
-      prompt_type: <prompt_type> # messages/plaintext
+      prompt_type: <prompt_type> # messages/plaintext/chatpair
      prompt_key: <prompt_key>
      response_key: <response_key>
trainer:
```
@@ -95,5 +95,5 @@
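
The `trainer` side of the warmup is collapsed in the diff above; as a minimal sketch of the setting described in the text (the value `2` is an arbitrary assumption, any positive integer enables the warmup):

```yaml
trainer:
  sft_warmup_iteration: 2  # > 0 runs SFT warmup before RFT
```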

The following command runs SFT and RFT in sequence:
```bash
-trinity run --config scripts/config/gsm8k.yaml
+trinity run --config examples/grpo_gsm8k/gsm8k.yaml
```
4 changes: 2 additions & 2 deletions docs/sphinx_doc/source/tutorial/trinity_configs.md
@@ -1,6 +1,6 @@
# Trinity-RFT Configuration

-The following is the main config file for Trinity-RFT. Take `scripts/config/countdown.yaml` as an example.
+The following is the main config file for Trinity-RFT. Take `countdown.yaml` as an example.


## Monitor
@@ -165,7 +165,7 @@ synchronizer:
```yaml
trainer:
  trainer_type: 'verl'
  algorithm_type: ppo
-  trainer_config_path: 'scripts/config/train_countdown.yaml'
+  trainer_config_path: 'examples/ppo_countdown/train_countdown.yaml'
  sft_warmup_iteration: 0
  eval_interval: 1000
```
7 changes: 7 additions & 0 deletions examples/dpo_humanlike/README.md
@@ -0,0 +1,7 @@
# DPO on HumanLike Dataset

This example shows the usage of DPO on the HumanLike dataset.

For more detailed information, please refer to the [documentation](../../docs/sphinx_doc/source/tutorial/example_dpo.md).

The config files are [`dpo.yaml`](dpo.yaml) and [`train_dpo.yaml`](train_dpo.yaml).
@@ -53,7 +53,7 @@ synchronizer:
trainer:
  trainer_type: 'verl'
  algorithm_type: dpo
-  trainer_config_path: 'scripts/config/train_dpo.yaml'
+  trainer_config_path: 'examples/dpo_humanlike/train_dpo.yaml'
monitor:
  cache_root_dir: ""
  project: "dpo_example"
File renamed without changes.
7 changes: 7 additions & 0 deletions examples/grpo_alfworld/README.md
@@ -0,0 +1,7 @@
# GRPO on ALFWorld Dataset

This example shows the usage of GRPO on the ALFWorld dataset.

For more detailed information, please refer to the [documentation](../../docs/sphinx_doc/source/tutorial/example_multi_turn.md).

The config files are [`alfworld.yaml`](alfworld.yaml) and [`train_alfworld.yaml`](train_alfworld.yaml).
@@ -49,7 +49,7 @@ synchronizer:
trainer:
  trainer_type: 'verl'
  algorithm_type: ppo
-  trainer_config_path: 'scripts/config/train_alfworld.yaml'
+  trainer_config_path: 'examples/grpo_alfworld/train_alfworld.yaml'
monitor:
  cache_root_dir: ""
  project: "ALFWORLD"
7 changes: 7 additions & 0 deletions examples/grpo_gsm8k/README.md
@@ -0,0 +1,7 @@
# GRPO on GSM8K Dataset

This example shows the usage of GRPO on the GSM8K dataset.

For more detailed information, please refer to the [documentation](../../docs/sphinx_doc/source/tutorial/example_reasoning_basic.md).

The config files are [`gsm8k.yaml`](gsm8k.yaml) and [`train_gsm8k.yaml`](train_gsm8k.yaml).
@@ -67,7 +67,7 @@ synchronizer:
trainer:
  trainer_type: 'verl'
  algorithm_type: ppo
-  trainer_config_path: 'scripts/config/train_gsm8k.yaml'
+  trainer_config_path: 'examples/grpo_gsm8k/train_gsm8k.yaml'
  sft_warmup_iteration: 0 # Set to integer to enable sft warmup
  eval_interval: 50
monitor:
File renamed without changes.
7 changes: 7 additions & 0 deletions examples/grpo_math/README.md
@@ -0,0 +1,7 @@
# Example: PPO on MATH dataset

This example shows the usage of PPO on the MATH dataset.

For more detailed information, please refer to the [documentation](../../docs/sphinx_doc/source/tutorial/example_reasoning_basic.md).

The config files are [`math.yaml`](math.yaml) and [`train_math.yaml`](train_math.yaml).
63 changes: 63 additions & 0 deletions examples/grpo_math/math.yaml
@@ -0,0 +1,63 @@
data:
  # basic info
  dataset_path: /PATH/TO/DATASET/
  # dataset_config:
  train_split: train
  eval_split: test
  format_config:
    prompt_key: 'question'
    response_key: 'gt_answer'
  # db related
  db_url: ''
  # downstream loading related
  total_epoch: 20
  batch_size: 288
  default_workflow_type: 'math_workflow'
model:
  model_path: /PATH/TO/MODEL/
  max_prompt_tokens: 1024
  max_response_tokens: 3072
  checkpoint_path: /PATH/TO/CHECKPOINT/
  load_checkpoint: true
cluster:
  node_num: 1
  gpu_per_node: 8
buffer:
  max_retry_times: 3
  max_retry_interval: 1
  train_dataset:
    name: math_buffer
    storage_type: queue
    algorithm_type: ppo
    path: 'sqlite:////math.db'
explorer:
  engine_type: vllm_async
  engine_num: 2
  runner_num: 32
  tensor_parallel_size: 1
  enable_prefix_caching: false
  enforce_eager: true
  dtype: bfloat16
  temperature: 1.0
  top_p: 1.0
  top_k: -1
  seed: 42
  logprobs: 0
  repeat_times: 8
  use_ray: false
  backend: 'nccl'
  max_pending_requests: 32
  max_waiting_steps: 4
synchronizer:
  sync_method: 'online'
  sync_iteration_interval: 2
trainer:
  trainer_type: 'verl'
  algorithm_type: ppo
  trainer_config_path: 'examples/grpo_math/train_math.yaml'
  sft_warmup_iteration: 0 # Set to integer to enable sft warmup
  eval_interval: 10
monitor:
  cache_root_dir: ""
  project: grpo_math
  name: grpo_math_example
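
Following the pattern of the other examples in this PR, this config would presumably be launched with:

```shell
trinity run --config examples/grpo_math/math.yaml
```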