# Offline DPO and SFT

This example describes DPO and SFT based on the Qwen-2.5-1.5B-Instruct model.

## Step 1: Model and Data Preparation

### Model Preparation

Download the Qwen2.5-1.5B-Instruct model to the local directory `$MODEL_PATH/Qwen2.5-1.5B-Instruct`. More details on model downloading can be found at [ModelScope](https://modelscope.cn).

### Data Preparation

For DPO, we download the [Human-like-DPO-dataset](https://huggingface.co/datasets/HumanLLMs/Human-Like-DPO-Dataset) to the local directory `$DATASET_PATH/human_like_dpo_dataset`:

```shell
# Using Modelscope
# NOTE: assumed invocation; adjust the dataset id or CLI flags if yours differ.
modelscope download --dataset HumanLLMs/Human-Like-DPO-Dataset --local_dir $DATASET_PATH/human_like_dpo_dataset
```

More details on dataset downloading can be found at [ModelScope](https://modelscope.cn).

Note that the dataset has the keys `prompt`, `chosen`, and `rejected`. If your dataset uses different key names, set the corresponding keys in the `format` section of the config.
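
As a quick sanity check, you can inspect the keys before training. This is a minimal sketch assuming the Hugging Face `datasets` package and loading directly from the Hub; point it at your local copy under `$DATASET_PATH/human_like_dpo_dataset` if you prefer.

```python
from datasets import load_dataset

# Load the preference dataset (swap in your local path if you downloaded it already).
ds = load_dataset("HumanLLMs/Human-Like-DPO-Dataset", split="train")

# The DPO reader expects these three keys; if yours differ, adjust the `format`
# section of the buffer config accordingly.
print(ds.column_names)  # expected: ['prompt', 'chosen', 'rejected']

sample = ds[0]
print(sample["prompt"][:80])    # preview the prompt text
print(sample["chosen"][:80])    # preferred response
print(sample["rejected"][:80])  # dispreferred response
```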

For SFT, we download a dataset to the local directory `/PATH/TO/SFT_DATASET/`. SFT datasets usually contain message-based data, as sketched below.
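
To make "message-based data" concrete, here is a purely hypothetical record in the layout that the `messages` key of the SFT config below refers to; your dataset's fields and content may differ.

```python
# One hypothetical SFT record: a list of role/content chat turns under "messages".
sft_record = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Give me three tips for writing clear commit messages."},
        {"role": "assistant", "content": "1. Use the imperative mood. 2. Keep the subject short. 3. Explain the why in the body."},
    ]
}

# A JSONL file with one such object per line is a common on-disk layout
# for /PATH/TO/SFT_DATASET/.
```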

## Step 2: Setup Configuration

### Configuration for DPO

We use the configurations in [`dpo.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/dpo_humanlike/dpo.yaml) and [`train_dpo.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/dpo_humanlike/train_dpo.yaml) for this experiment. Some important settings are listed below:

```yaml
algorithm:
  algorithm_type: dpo
  kl_coef: 0.1  # value of beta in DPO
checkpoint_root_dir: /PATH/TO/CHECKPOINT/
model:
  model_path: $MODEL_PATH/Qwen2.5-1.5B-Instruct
cluster:
  node_num: 1
  gpu_per_node: 8
buffer:
  # ...
  batch_size: 64
  trainer_input:
    experience_buffer:
      name: human_like_dpo
      storage_type: file
      path: $DATASET_PATH/human_like_dpo_dataset
      format:
        prompt_type: plaintext  # plaintext/messages/chatpair
        prompt_key: prompt
        # ...
trainer:
  # ...
  save_interval: 30
```
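
For context on the `kl_coef` setting above: it plays the role of β in the standard DPO objective (a textbook formulation, not a claim about this repository's exact implementation), where y_w and y_l are the chosen and rejected responses and π_ref is the frozen reference model:

$$
\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
$$

A larger β keeps the trained policy closer to the reference model.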

### Configuration for SFT

We set `algorithm_type` to `sft` to run the SFT process, and modify the config file `sft.yaml` as follows:

```yaml
project: <project_name>
name: <experiment_name>
mode: train
algorithm:
  algorithm_type: sft
checkpoint_root_dir: /PATH/TO/CHECKPOINT/
model:
  model_path: /PATH/TO/MODEL/
cluster:
  node_num: 1
  gpu_per_node: 2
buffer:
  total_epochs: 5
  batch_size: 64
  trainer_input:
    experience_buffer:
      name: <sft_dataset_name>
      storage_type: file
      path: /PATH/TO/SFT_DATASET/
      split: train
      format:
        prompt_type: messages
        messages_key: messages
trainer:
  trainer_config_path: /PATH/TO/TRAIN_CONFIG_YAML/
  save_interval: 50
```
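
As a rough sanity check of these numbers (assuming `save_interval` counts training steps, and using a made-up dataset size purely for illustration), the checkpoint cadence works out as follows:

```python
import math

# Only batch_size, total_epochs and save_interval come from the config above;
# the dataset size is a hypothetical placeholder.
num_samples = 10_000
batch_size = 64
total_epochs = 5
save_interval = 50  # assumed to mean: write a checkpoint every 50 training steps

steps_per_epoch = math.ceil(num_samples / batch_size)  # 157
total_steps = steps_per_epoch * total_epochs           # 785
num_checkpoints = total_steps // save_interval         # 15

print(steps_per_epoch, total_steps, num_checkpoints)
```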

## Step 3: Run the Experiment

Run the DPO process with the following command:

```shell
trinity run --config examples/dpo_humanlike/dpo.yaml
```

Or, for SFT:

```shell
trinity run --config /PATH/TO/sft.yaml
```