11 changes: 4 additions & 7 deletions README.md
@@ -200,20 +200,17 @@ For more details about dataset downloading, please refer to [Huggingface](https:
### Step 3: configurations


-You may customize the configurations in `scripts/config/{config_name}.yaml` and `scripts/config/{train_config_name}.yaml`. For example, the model and dataset are specified as:
+You may customize the configurations in [`examples`](examples/). For example, the model and dataset are specified as:

```yaml
model:
  model_path: $MODEL_PATH/{model_name}

data:
  dataset_path: $DATASET_PATH/{dataset_name}

-trainer:
-  trainer_config_path: scripts/config/{train_config_name}.yaml
```

-You may use the default configurations located in the directory `scripts/config`. Please refer to `examples` for more details.
+Please refer to [`examples`](examples/) for more details.



@@ -252,12 +249,12 @@ trinity run --config <config_path>
For example, below is the command for fine-tuning Qwen-2.5-1B-Instruct on the GSM8k dataset using the GRPO algorithm:

```shell
-trinity run --config scripts/config/gsm8k.yaml
+trinity run --config examples/grpo_gsm8k/gsm8k.yaml
```



-More example config files can be found in `scripts/config`.
+More example config files can be found in `examples`.



11 changes: 4 additions & 7 deletions docs/sphinx_doc/source/main.md
@@ -180,20 +180,17 @@ For more details about dataset downloading, please refer to [Huggingface](https:
### Step 3: configurations


-You may customize the configurations in `scripts/config/{config_name}.yaml` and `scripts/config/{train_config_name}.yaml`. For example, the model and dataset are specified as:
+You may customize the configurations in [`examples`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/). For example, the model and dataset are specified as:

```yaml
model:
  model_path: $MODEL_PATH/{model_name}

data:
  dataset_path: $DATASET_PATH/{dataset_name}

-trainer:
-  trainer_config_path: scripts/config/{train_config_name}.yaml
```

-You may use the default configurations located in the directory `scripts/config`. Please refer to `examples` for more details.
+Please refer to [`examples`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/) for more details.



@@ -232,12 +229,12 @@ trinity run --config <config_path>
For example, below is the command for fine-tuning Qwen-2.5-1B-Instruct on the GSM8k dataset using the GRPO algorithm:

```shell
-trinity run --config scripts/config/gsm8k.yaml
+trinity run --config examples/grpo_gsm8k/gsm8k.yaml
```



-More example config files can be found in `scripts/config`.
+More example config files can be found in `examples`.



@@ -133,7 +133,7 @@ And you can set the `clean_strategy` to 'iterative' to get a better dataset.



-All config items in the `data` section can be found [here](trinity_configs.md). A prepared config file for this GSM-8K example can be found in [the config file of gsm8k](../../../../scripts/config/gsm8k.yaml).
+All config items in the `data` section can be found [here](trinity_configs.md). A prepared config file for this GSM-8K example can be found in [the config file of gsm8k](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_gsm8k/gsm8k.yaml).
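
For orientation, a minimal sketch of a `data` section is shown below, mirroring the fields used in `examples/grpo_math/math.yaml` elsewhere in this PR; the paths and values are placeholders, not a definitive recipe:

```yaml
data:
  dataset_path: /PATH/TO/DATASET/   # placeholder; point at your prepared dataset
  train_split: train
  eval_split: test
  format_config:
    prompt_key: 'question'          # keys depend on the dataset schema
    response_key: 'gt_answer'
  total_epoch: 20
  batch_size: 288
  default_workflow_type: 'math_workflow'
```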



8 changes: 4 additions & 4 deletions docs/sphinx_doc/source/tutorial/example_dpo.md
@@ -38,12 +38,12 @@ Note that the dataset has the keys `prompt`, `chosen` and `rejected`. If not, pa

### Configuration

-We use the configurations in `scripts/config/dpo.yaml` and `scripts/config/train_dpo.yaml` for this experiment. Some important settings are listed below:
+We use the configurations in [`dpo.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/dpo_humanlike/dpo.yaml) and [`train_dpo.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/dpo_humanlike/train_dpo.yaml) for this experiment. Some important settings are listed below:

We run the experiment in train mode, as there is no Explorer. To enable this mode, we set `mode` to `train` and `sync_method` to `offline`. The value of `sync_iteration_interval` can be set to the same value as `save_freq`.

```yaml
-# scripts/config/dpo.yaml
+# In dpo.yaml
mode: train
synchronizer:
  sync_method: 'offline'
```
@@ -60,7 +60,7 @@ buffer:
```yaml
trainer:
  algorithm_type: dpo

-# scripts/config/train_dpo.yaml
+# In train_dpo.yaml
actor_rollout_ref:
  actor:
    alg_type: dpo
```
@@ -73,5 +73,5 @@ actor_rollout_ref:
Run the RFT process with the following command:

```shell
-trinity run --config scripts/config/dpo.yaml
+trinity run --config examples/dpo_humanlike/dpo.yaml
```
6 changes: 3 additions & 3 deletions docs/sphinx_doc/source/tutorial/example_multi_turn.md
@@ -36,15 +36,15 @@ The task is described as an environment instead of a single prompt.

## Step 2: Prepare the config and run the experiment

-You can refer to `example_reasoning_basic` to set up the config and other settings. The default config files are `scripts/config/alfworld.yaml` and `scripts/config/webshop.yaml`, respectively.
+You can refer to `example_reasoning_basic` to set up the config and other settings. The default config files are [`alfworld.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_alfworld/alfworld.yaml) and [`webshop.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_webshop/webshop.yaml), respectively.
You may revise the configurations as needed and run the experiment!

```bash
# For ALFworld env
-trinity run --config scripts/config/alfworld.yaml
+trinity run --config examples/grpo_alfworld/alfworld.yaml

# For WebShop env
-trinity run --config scripts/config/webshop.yaml
+trinity run --config examples/grpo_webshop/webshop.yaml
```

## Advanced: How to build your own environment
4 changes: 2 additions & 2 deletions docs/sphinx_doc/source/tutorial/example_reasoning_advanced.md
@@ -17,11 +17,11 @@ The algorithm design and analysis can be found in this [technical report](../../

To try out the OPMD algorithm:
```shell
-trinity run --config scripts/config/gsm8k_opmd.yaml
+trinity run --config examples/opmd_gsm8k/opmd_gsm8k.yaml
```

Note that in this config file, `sync_iteration_interval` is set to 10, i.e., the model weights of the explorer and trainer are synchronized only once every 10 training steps, which leads to a challenging off-policy scenario (potentially with abrupt distribution shift during the RFT process).
-Other configurations of particular interest are explained at the beginning of `scripts/config/train_gsm8k_opmd.yaml`.
+Other configurations of particular interest are explained at the beginning of [`train_opmd_gsm8k.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/opmd_gsm8k/train_opmd_gsm8k.yaml).
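
For reference, a minimal sketch of the off-policy setting described above, following the `synchronizer` schema used by the other example configs in this PR (the `sync_method` value is an assumption):

```yaml
synchronizer:
  sync_method: 'online'         # assumed, as in examples/grpo_math/math.yaml
  sync_iteration_interval: 10   # explorer/trainer weights sync once every 10 steps
```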



14 changes: 7 additions & 7 deletions docs/sphinx_doc/source/tutorial/example_reasoning_basic.md
@@ -48,15 +48,15 @@ synchronizer:

### Use GRPO or PPO Algorithm

-We use the configurations in `scripts/config/gsm8k.yaml` and `scripts/config/train_gsm8k.yaml` for this experiment. Some important settings are listed below:
+We use the configurations in [`gsm8k.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_gsm8k/gsm8k.yaml) and [`train_gsm8k.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_gsm8k/train_gsm8k.yaml) for this experiment. Some important settings are listed below:


```yaml
-# scripts/config/gsm8k.yaml
+# In gsm8k.yaml
explorer:
  repeat_times: {number of rollouts for each task}

-# scripts/config/train_gsm8k.yaml
+# In train_gsm8k.yaml
actor_rollout_ref:
  actor:
    use_kl_loss: True (for GRPO) / False (for PPO)
```
@@ -69,7 +69,7 @@ algorithm:

Run the RFT process with the following command:
```bash
-trinity run --config scripts/config/gsm8k.yaml
+trinity run --config examples/grpo_gsm8k/gsm8k.yaml
```


@@ -79,14 +79,14 @@ trinity run --config scripts/config/gsm8k.yaml
Before RFT, we may use SFT as a warmup step. We need to set `trainer.sft_warmup_iteration > 0` and prepare the SFT data at `buffer.train_dataset.path=$DATASET_PATH/{sft_data}`.

```yaml
-# Properly set the following configs in scripts/config/gsm8k.yaml
+# Properly set the following configs in gsm8k.yaml
buffer:
  sft_warmup_dataset:
    storage_type: file
    algorithm_type: sft
    path: <$DATASET_PATH/{sft_data}>
    kwargs:
-      prompt_type: <prompt_type> # messages/plaintext
+      prompt_type: <prompt_type> # messages/plaintext/chatpair
      prompt_key: <prompt_key>
      response_key: <response_key>
trainer:
```
@@ -95,5 +95,5 @@
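
The `trainer` side of the warmup is collapsed in the diff above; as a minimal sketch of the setting described in the text (the value `2` is an arbitrary assumption, any positive integer enables the warmup):

```yaml
trainer:
  sft_warmup_iteration: 2  # > 0 runs SFT warmup before RFT
```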

The following command runs SFT and RFT in sequence:
```bash
-trinity run --config scripts/config/gsm8k.yaml
+trinity run --config examples/grpo_gsm8k/gsm8k.yaml
```
4 changes: 2 additions & 2 deletions docs/sphinx_doc/source/tutorial/trinity_configs.md
@@ -1,6 +1,6 @@
# Trinity-RFT Configuration

-The following is the main config file for Trinity-RFT. Take `scripts/config/countdown.yaml` as an example.
+The following is the main config file for Trinity-RFT. Take `countdown.yaml` as an example.


## Monitor
@@ -165,7 +165,7 @@ synchronizer:
```yaml
trainer:
  trainer_type: 'verl'
  algorithm_type: ppo
-  trainer_config_path: 'scripts/config/train_countdown.yaml'
+  trainer_config_path: 'examples/ppo_countdown/train_countdown.yaml'
  sft_warmup_iteration: 0
  eval_interval: 1000
```
7 changes: 7 additions & 0 deletions examples/dpo_humanlike/README.md
@@ -0,0 +1,7 @@
# DPO on HumanLike Dataset

This example shows the usage of DPO on the HumanLike dataset.

For more detailed information, please refer to the [documentation](../../docs/sphinx_doc/source/tutorial/example_dpo.md).

The config files are [`dpo.yaml`](dpo.yaml) and [`train_dpo.yaml`](train_dpo.yaml).
@@ -53,7 +53,7 @@ synchronizer:
trainer:
  trainer_type: 'verl'
  algorithm_type: dpo
-  trainer_config_path: 'scripts/config/train_dpo.yaml'
+  trainer_config_path: 'examples/dpo_humanlike/train_dpo.yaml'
monitor:
  cache_root_dir: ""
  project: "dpo_example"
File renamed without changes.
7 changes: 7 additions & 0 deletions examples/grpo_alfworld/README.md
@@ -0,0 +1,7 @@
# GRPO on ALFWorld Dataset

This example shows the usage of GRPO on the ALFWorld dataset.

For more detailed information, please refer to the [documentation](../../docs/sphinx_doc/source/tutorial/example_multi_turn.md).

The config files are [`alfworld.yaml`](alfworld.yaml) and [`train_alfworld.yaml`](train_alfworld.yaml).
@@ -49,7 +49,7 @@ synchronizer:
trainer:
  trainer_type: 'verl'
  algorithm_type: ppo
-  trainer_config_path: 'scripts/config/train_alfworld.yaml'
+  trainer_config_path: 'examples/grpo_alfworld/train_alfworld.yaml'
monitor:
  cache_root_dir: ""
  project: "ALFWORLD"
7 changes: 7 additions & 0 deletions examples/grpo_gsm8k/README.md
@@ -0,0 +1,7 @@
# GRPO on GSM8K Dataset

This example shows the usage of GRPO on the GSM8K dataset.

For more detailed information, please refer to the [documentation](../../docs/sphinx_doc/source/tutorial/example_reasoning_basic.md).

The config files are [`gsm8k.yaml`](gsm8k.yaml) and [`train_gsm8k.yaml`](train_gsm8k.yaml).
@@ -67,7 +67,7 @@ synchronizer:
trainer:
  trainer_type: 'verl'
  algorithm_type: ppo
-  trainer_config_path: 'scripts/config/train_gsm8k.yaml'
+  trainer_config_path: 'examples/grpo_gsm8k/train_gsm8k.yaml'
  sft_warmup_iteration: 0 # Set to integer to enable sft warmup
  eval_interval: 50
monitor:
File renamed without changes.
7 changes: 7 additions & 0 deletions examples/grpo_math/README.md
@@ -0,0 +1,7 @@
# Example: PPO on MATH dataset

This example shows the usage of PPO on the MATH dataset.

For more detailed information, please refer to the [documentation](../../docs/sphinx_doc/source/tutorial/example_reasoning_basic.md).

The config files are [`math.yaml`](math.yaml) and [`train_math.yaml`](train_math.yaml).
63 changes: 63 additions & 0 deletions examples/grpo_math/math.yaml
@@ -0,0 +1,63 @@
data:
  # basic info
  dataset_path: /PATH/TO/DATASET/
  # dataset_config:
  train_split: train
  eval_split: test
  format_config:
    prompt_key: 'question'
    response_key: 'gt_answer'
  # db related
  db_url: ''
  # downstream loading related
  total_epoch: 20
  batch_size: 288
  default_workflow_type: 'math_workflow'
model:
  model_path: /PATH/TO/MODEL/
  max_prompt_tokens: 1024
  max_response_tokens: 3072
  checkpoint_path: /PATH/TO/CHECKPOINT/
  load_checkpoint: true
cluster:
  node_num: 1
  gpu_per_node: 8
buffer:
  max_retry_times: 3
  max_retry_interval: 1
  train_dataset:
    name: math_buffer
    storage_type: queue
    algorithm_type: ppo
    path: 'sqlite:////math.db'
explorer:
  engine_type: vllm_async
  engine_num: 2
  runner_num: 32
  tensor_parallel_size: 1
  enable_prefix_caching: false
  enforce_eager: true
  dtype: bfloat16
  temperature: 1.0
  top_p: 1.0
  top_k: -1
  seed: 42
  logprobs: 0
  repeat_times: 8
  use_ray: false
  backend: 'nccl'
  max_pending_requests: 32
  max_waiting_steps: 4
synchronizer:
  sync_method: 'online'
  sync_iteration_interval: 2
trainer:
  trainer_type: 'verl'
  algorithm_type: ppo
  trainer_config_path: 'examples/grpo_math/train_math.yaml'
  sft_warmup_iteration: 0 # Set to integer to enable sft warmup
  eval_interval: 10
monitor:
  cache_root_dir: ""
  project: grpo_math
  name: grpo_math_example
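
Following the pattern of the other examples in this PR, this config would presumably be launched with:

```shell
trinity run --config examples/grpo_math/math.yaml
```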