diff --git a/README.md b/README.md
Lines changed: 4 additions & 7 deletions
diff --git a/docs/sphinx_doc/source/main.md b/docs/sphinx_doc/source/main.md
Lines changed: 4 additions & 7 deletions
diff --git a/docs/sphinx_doc/source/tutorial/example_data_functionalities.md b/docs/sphinx_doc/source/tutorial/example_data_functionalities.md
Lines changed: 1 addition & 1 deletion
diff --git a/docs/sphinx_doc/source/tutorial/example_dpo.md b/docs/sphinx_doc/source/tutorial/example_dpo.md
Lines changed: 4 additions & 4 deletions
diff --git a/docs/sphinx_doc/source/tutorial/example_multi_turn.md b/docs/sphinx_doc/source/tutorial/example_multi_turn.md
Lines changed: 3 additions & 3 deletions
diff --git a/docs/sphinx_doc/source/tutorial/example_reasoning_advanced.md b/docs/sphinx_doc/source/tutorial/example_reasoning_advanced.md
Lines changed: 2 additions & 2 deletions
diff --git a/docs/sphinx_doc/source/tutorial/example_reasoning_basic.md b/docs/sphinx_doc/source/tutorial/example_reasoning_basic.md
Lines changed: 7 additions & 7 deletions
diff --git a/docs/sphinx_doc/source/tutorial/trinity_configs.md b/docs/sphinx_doc/source/tutorial/trinity_configs.md
Lines changed: 2 additions & 2 deletions
diff --git a/examples/dpo_humanlike/README.md b/examples/dpo_humanlike/README.md
Lines changed: 7 additions & 0 deletions
diff --git a/scripts/config/dpo.yaml b/examples/dpo_humanlike/dpo.yaml
scripts/config/dpo.yaml renamed to examples/dpo_humanlike/dpo.yaml
Lines changed: 1 addition & 1 deletion
@@ -200,20 +200,17 @@ For more details about dataset downloading, please refer to [Huggingface](https:
 ### Step 3: configurations
 
 
-You may customize the configurations in `scripts/config/{config_name}.yaml`and `scripts/config/{train_config_name}.yaml`. For example, the model and dataset are specified as:
+You may customize the configurations in [`examples`](examples/). For example, the model and dataset are specified as:
 
 ```yaml
 model:
   model_path: $MODEL_PATH/{model_name}
 
 data:
   dataset_path: $DATASET_PATH/{dataset_name}
-
-trainer:
-  trainer_config_path: scripts/config/{train_config_name}.yaml
 ```
 
-You may use the default configurations located in the directory `scripts/config`. Please refer to `examples` for more details.
+Please refer to [`examples`](examples/) for more details.
 
 
 
@@ -252,12 +249,12 @@ trinity run --config <config_path>
 For example, below is the command for fine-tuning Qwen-2.5-1B-Instruct on GSM8k dataset using GRPO algorithm:
 
 ```shell
-trinity run --config scripts/config/gsm8k.yaml
+trinity run --config examples/grpo_gsm8k/gsm8k.yaml
 ```
 
 
 
-More example config files can be found in `scripts/config`.
+More example config files can be found in `examples`.
 
 
 
 
@@ -180,20 +180,17 @@ For more details about dataset downloading, please refer to [Huggingface](https:
 ### Step 3: configurations
 
 
-You may customize the configurations in `scripts/config/{config_name}.yaml`and `scripts/config/{train_config_name}.yaml`. For example, the model and dataset are specified as:
+You may customize the configurations in [`examples`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/). For example, the model and dataset are specified as:
 
 ```yaml
 model:
   model_path: $MODEL_PATH/{model_name}
 
 data:
   dataset_path: $DATASET_PATH/{dataset_name}
-
-trainer:
-  trainer_config_path: scripts/config/{train_config_name}.yaml
 ```
 
-You may use the default configurations located in the directory `scripts/config`. Please refer to `examples` for more details.
+Please refer to [`examples`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/) for more details.
 
 
 
@@ -232,12 +229,12 @@ trinity run --config <config_path>
 For example, below is the command for fine-tuning Qwen-2.5-1B-Instruct on GSM8k dataset using GRPO algorithm:
 
 ```shell
-trinity run --config scripts/config/gsm8k.yaml
+trinity run --config examples/grpo_gsm8k/gsm8k.yaml
 ```
 
 
 
-More example config files can be found in `scripts/config`.
+More example config files can be found in `examples`.
 
 
 
 
@@ -133,7 +133,7 @@ And you can set the `clean_strategy` to 'iterative' to get a better dataset.
 
 
 
-All config items in the `data` section can be found [here](trinity_configs.md). A prepared config file for this example of GSM-8K can be found in [the config file of gsm8k](../../../../scripts/config/gsm8k.yaml).
+All config items in the `data` section can be found [here](trinity_configs.md). A prepared config file for this example of GSM-8K can be found in [the config file of gsm8k](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_gsm8k/gsm8k.yaml).
 
 
 
 
@@ -38,12 +38,12 @@ Note that the dataset has the keys `prompt`, `chosen` and `rejected`. If not, pa
 
 ### Configuration
 
-We use the configurations in `scripts/config/dpo.yaml`and `scripts/config/train_dpo.yaml` for this experiment. Some important setups are listed in the following:
+We use the configurations in [`dpo.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/dpo_humanlike/dpo.yaml) and [`train_dpo.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/dpo_humanlike/train_dpo.yaml) for this experiment. Some important settings are listed below:
 
 We run the experiment in train mode, since there is no Explorer. To enable this mode, set `mode` to `train` and `sync_method` to `offline`. The value of `sync_iteration_interval` can be set to the same value as `save_freq`.
 
 ```yaml
-# scripts/config/dpo.yaml
+# In dpo.yaml
 mode: train
 synchronizer:
   sync_method: 'offline'
@@ -60,7 +60,7 @@ buffer:
 trainer:
   algorithm_type: dpo
 
-# scripts/config/train_dpo.yaml
+# In train_dpo.yaml
 actor_rollout_ref:
   actor:
     alg_type: dpo
@@ -73,5 +73,5 @@ actor_rollout_ref:
 Run RFT process with the following command:
 
 ```shell
-trinity run --config scripts/config/dpo.yaml
+trinity run --config examples/dpo_humanlike/dpo.yaml
 ```
@@ -36,15 +36,15 @@ The task is described as an environment instead of a single prompt.
 
 ## Step 2: Config preparation and run the experiment
 
-You can refer to `example_reasoning_basic` to setup the config and others. The default config files are `scripts/config/alfworld.yaml` and `scripts/config/webshop.yaml`, respectively.
+You can refer to `example_reasoning_basic` to set up the config and other settings. The default config files are [`alfworld.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_alfworld/alfworld.yaml) and [`webshop.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_webshop/webshop.yaml), respectively.
 Revise the configurations as needed, then run the experiment:
 
 ```bash
 # For ALFworld env
-trinity run --config scripts/config/alfworld.yaml
+trinity run --config examples/grpo_alfworld/alfworld.yaml
 
 # For WebShop env
-trinity run --config scripts/config/webshop.yaml
+trinity run --config examples/grpo_webshop/webshop.yaml
 ```
 
 ## Advanced: How to build your own environment
 
@@ -17,11 +17,11 @@ The algorithm design and analysis can be found in this [technical report](../../
 
 To try out the OPMD algorithm:
 ```shell
-trinity run --config scripts/config/gsm8k_opmd.yaml
+trinity run --config examples/opmd_gsm8k/opmd_gsm8k.yaml
 ```
 
 Note that in this config file, `sync_iteration_interval` is set to 10, i.e., the model weights of explorer and trainer are synchronized only once every 10 training steps, which leads to a challenging off-policy scenario (potentially with abrupt distribution shift during the RFT process).
-Other configurations of particular interest are explained at the beginning of `scripts/config/train_gsm8k_opmd.yaml`.
+Other configurations of particular interest are explained at the beginning of [`train_opmd_gsm8k.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/opmd_gsm8k/train_opmd_gsm8k.yaml).
 
 
 
 
@@ -48,15 +48,15 @@ synchronizer:
 
 ### Use GRPO or PPO Algorithm
 
-We use the configurations in `scripts/config/gsm8k.yaml`and `scripts/config/train_gsm8k.yaml` for this experiment. Some important setups are listed in the following:
+We use the configurations in [`gsm8k.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_gsm8k/gsm8k.yaml) and [`train_gsm8k.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_gsm8k/train_gsm8k.yaml) for this experiment. Some important settings are listed below:
 
 
 ```yaml
-# scripts/config/gsm8k.yaml
+# In gsm8k.yaml
 explorer:
   repeat_times: {number of rollouts for each task}
 
-# scripts/config/train_gsm8k.yaml
+# In train_gsm8k.yaml
 actor_rollout_ref:
   actor:
     use_kl_loss: True (for GRPO) / False (for PPO)
@@ -69,7 +69,7 @@ algorithm:
 
 Run the RFT process with the following command:
 ```bash
-trinity run --config scripts/config/gsm8k.yaml
+trinity run --config examples/grpo_gsm8k/gsm8k.yaml
 ```
 
 
@@ -79,14 +79,14 @@ trinity run --config scripts/config/gsm8k.yaml
 Before RFT, we may use SFT as a warmup step. We need to set `trainer.sft_warmup_iteration > 0` and prepare the SFT data to `buffer.train_dataset.path=$DATASET_PATH/{sft_data}`.
 
 ```yaml
-# Properly set the following configs in scripts/config/gsm8k.yaml
+# Properly set the following configs in gsm8k.yaml
 buffer:
   sft_warmup_dataset:
     storage_type: file
     algorithm_type: sft
     path: <$DATASET_PATH/{sft_data}>
     kwargs:
-      prompt_type: <prompt_type> # messages/plaintext
+      prompt_type: <prompt_type> # messages/plaintext/chatpair
       prompt_key: <prompt_key>
       response_key: <response_key>
 trainer:
@@ -95,5 +95,5 @@ trainer:
 
 The following command runs SFT and RFT in sequence:
 ```bash
-trinity run --config scripts/config/gsm8k.yaml
+trinity run --config examples/grpo_gsm8k/gsm8k.yaml
 ```
@@ -1,6 +1,6 @@
 # Trinity-RFT Configuration
 
-The following is the main config file for Trinity-RFT. Take `scripts/config/countdown.yaml` as an example.
+The following is the main config file for Trinity-RFT. Take `countdown.yaml` as an example.
 
 
 ## Monitor
@@ -165,7 +165,7 @@ synchronizer:
 trainer:
   trainer_type: 'verl'
   algorithm_type: ppo
-  trainer_config_path: 'scripts/config/train_countdown.yaml'
+  trainer_config_path: 'examples/ppo_countdown/train_countdown.yaml'
   sft_warmup_iteration: 0
   eval_interval: 1000
 ```
 
@@ -0,0 +1,7 @@
+# DPO on HumanLike Dataset
+
+This example shows the usage of DPO on the HumanLike dataset.
+
+For more detailed information, please refer to the [documentation](../../docs/sphinx_doc/source/tutorial/example_dpo.md).
+
+The config files are located in [`dpo.yaml`](dpo.yaml) and [`train_dpo.yaml`](train_dpo.yaml).
@@ -53,7 +53,7 @@ synchronizer:
 trainer:
   trainer_type: 'verl'
   algorithm_type: dpo
-  trainer_config_path: 'scripts/config/train_dpo.yaml'
+  trainer_config_path: 'examples/dpo_humanlike/train_dpo.yaml'
 monitor:
   cache_root_dir: ""
   project: "dpo_example"
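
Since this change moves every config from `scripts/config` to `examples`, it may be worth checking that no stale references remain in the docs or YAML files. The following is a minimal sketch, not part of this change; the function name `find_stale_references` and its parameters are hypothetical, and it simply does a substring search over `.md` and `.yaml` files under a given root.

```python
from pathlib import Path


def find_stale_references(root, needle="scripts/config", suffixes=(".md", ".yaml")):
    """Return (relative path, line number, stripped line) for each line containing `needle`.

    Hypothetical helper: only files under `root` whose suffix is in `suffixes` are scanned.
    """
    root = Path(root)
    hits = []
    # Sort so the report order is stable across runs and platforms.
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if needle in line:
                hits.append((path.relative_to(root).as_posix(), lineno, line.strip()))
    return hits
```

Calling `find_stale_references(".")` from the repository root would list any remaining mentions of the old directory; an empty list suggests the rename was applied consistently, at least in the scanned file types.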