Commit 0e56607

Add doc for SFT (#81)

1 parent dc8cb0c commit 0e56607

File tree

1 file changed: +50 -10 lines changed


docs/sphinx_doc/source/tutorial/example_dpo.md

Lines changed: 50 additions & 10 deletions
@@ -1,6 +1,6 @@
-# Offline DPO
+# Offline DPO and SFT
 
-This example describes DPO based on the Qwen-2.5-1.5B-Instruct model and [Human-like-DPO-dataset](https://huggingface.co/datasets/HumanLLMs/Human-Like-DPO-Dataset).
+This example describes DPO and SFT based on the Qwen-2.5-1.5B-Instruct model.
 
 ## Step 1: Model and Data Preparation
 
@@ -20,7 +20,7 @@ More details of model downloading are referred to [ModelScope](https://modelscop
 
 ### Data Preparation
 
-Download the Human-Like-DPO-Dataset dataset to the local directory `$DATASET_PATH/human_like_dpo_dataset`:
+For DPO, we download the [Human-like-DPO-dataset](https://huggingface.co/datasets/HumanLLMs/Human-Like-DPO-Dataset) to the local directory `$DATASET_PATH/human_like_dpo_dataset`:
 
 ```shell
 # Using Modelscope
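The hunk above ends inside the download snippet, so the full command is not shown. For reference only, a minimal sketch of fetching the same dataset from the Hugging Face Hub; the tutorial's own snippet uses ModelScope, so the exact command may differ:

```shell
# Hypothetical sketch (not the tutorial's snippet): download the DPO dataset
# from the Hugging Face Hub into the directory referenced by the config below.
huggingface-cli download HumanLLMs/Human-Like-DPO-Dataset \
  --repo-type dataset \
  --local-dir "$DATASET_PATH/human_like_dpo_dataset"
```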
@@ -34,9 +34,11 @@ More details of dataset downloading are referred to [ModelScop
 
 Note that the dataset has the keys `prompt`, `chosen` and `rejected`. If not, pass the proper keys to the config.
 
-## Step 2: Setup Configuration and Run Experiment
+For SFT, we download the dataset, which usually contains message-based data, to the local directory `/PATH/TO/SFT_DATASET/`.
 
-### Configuration
+## Step 2: Setup Configuration
+
+### Configuration for DPO
 
 We use the configurations in [`dpo.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/dpo_humanlike/dpo.yaml) and [`train_dpo.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/dpo_humanlike/train_dpo.yaml) for this experiment. Some important setups are listed in the following:
 
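As a quick illustration of the "message-based data" mentioned in the added line above, one record of such an SFT dataset typically looks like the following; the field names match the `messages_key: messages` setting shown later in the SFT config, while the text values are made up and the exact schema of your dataset may differ:

```json
{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Give me one tip for writing clear documentation."},
    {"role": "assistant", "content": "Lead with the task the reader wants to accomplish, then show a minimal working example."}
  ]
}
```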
@@ -53,7 +55,7 @@
   kl_coef: 0.1  # value of beta in DPO
 checkpoint_root_dir: /PATH/TO/CHECKPOINT/
 model:
-  model_path: /PATH/TO/MODEL/
+  model_path: $MODEL_PATH/Qwen2.5-1.5B-Instruct
 cluster:
   node_num: 1
   gpu_per_node: 8
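For context on the `kl_coef` setting above: it plays the role of β in the standard DPO objective (Rafailov et al., 2023). As a reference formula, not taken from this tutorial:

```latex
% beta (kl_coef above) scales the implicit reward margin between the
% chosen response y_w and the rejected response y_l under the reference policy.
\mathcal{L}_{\mathrm{DPO}}(\theta)
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\!\left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]
```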
@@ -62,9 +64,9 @@ buffer:
   batch_size: 64
   trainer_input:
     experience_buffer:
-      name: dpo_buffer
+      name: human_like_dpo
       storage_type: file
-      path: /PATH/TO/DATASET/
+      path: $DATASET_PATH/human_like_dpo_dataset
       format:
         prompt_type: plaintext  # plaintext/messages/chatpair
         prompt_key: prompt
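As a quick illustration of the `plaintext` format configured above and the `prompt`/`chosen`/`rejected` keys mentioned in Step 1, a single record might look like the following; only the key names come from the tutorial, the text values are made up:

```json
{
  "prompt": "How do you usually spend your weekends?",
  "chosen": "I love spending weekends outdoors! Hiking or a long walk always recharges me. What about you?",
  "rejected": "Weekends are the two days at the end of the week."
}
```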
@@ -75,10 +77,48 @@ trainer:
   save_interval: 30
 ```
 
-### Run the Experiment
+### Configuration for SFT
+
+We set `algorithm_type` to `sft` to run the SFT process. Then we modify the config file `sft.yaml` with the following changes:
+
+```yaml
+project: <project_name>
+name: <experiment_name>
+mode: train
+algorithm:
+  algorithm_type: sft
+checkpoint_root_dir: /PATH/TO/CHECKPOINT/
+model:
+  model_path: /PATH/TO/MODEL/
+cluster:
+  node_num: 1
+  gpu_per_node: 2
+buffer:
+  total_epochs: 5
+  batch_size: 64
+  trainer_input:
+    experience_buffer:
+      name: <sft_dataset_name>
+      storage_type: file
+      path: /PATH/TO/SFT_DATASET/
+      split: train
+      format:
+        prompt_type: messages
+        messages_key: messages
+trainer:
+  trainer_config_path: /PATH/TO/TRAIN_CONFIG_YAML/
+  save_interval: 50
+```
+
+## Step 3: Run the Experiment
 
-Run RFT process with the following command:
+Run the DPO process with the following command:
 
 ```shell
 trinity run --config examples/dpo_humanlike/dpo.yaml
 ```
+or, for SFT:
+
+```shell
+trinity run --config /PATH/TO/sft.yaml
+```
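Putting the two commands together, a minimal end-to-end sketch; the environment variables and paths are the placeholders used earlier in this tutorial, not fixed values:

```shell
# Hypothetical sketch: point the placeholders at your local copies, then launch.
export MODEL_PATH=/PATH/TO/MODELS        # contains Qwen2.5-1.5B-Instruct
export DATASET_PATH=/PATH/TO/DATASETS    # contains human_like_dpo_dataset

trinity run --config examples/dpo_humanlike/dpo.yaml   # offline DPO
trinity run --config /PATH/TO/sft.yaml                 # SFT
```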
