Commit 0e56607

Add doc for SFT (#81)

1 parent dc8cb0c commit 0e56607

File tree

1 file changed: +50 -10 lines changed


docs/sphinx_doc/source/tutorial/example_dpo.md

Lines changed: 50 additions & 10 deletions
@@ -1,6 +1,6 @@
-# Offline DPO
+# Offline DPO and SFT
 
-This example describes DPO based on the Qwen-2.5-1.5B-Instruct model and [Human-like-DPO-dataset](https://huggingface.co/datasets/HumanLLMs/Human-Like-DPO-Dataset).
+This example describes DPO and SFT based on the Qwen-2.5-1.5B-Instruct model.
 
 ## Step 1: Model and Data Preparation
 
@@ -20,7 +20,7 @@ More details of model downloading are referred to [ModelScope](https://modelscop
 
 ### Data Preparation
 
-Download the Human-Like-DPO-Dataset dataset to the local directory `$DATASET_PATH/human_like_dpo_dataset`:
+For DPO, we download the [Human-like-DPO-dataset](https://huggingface.co/datasets/HumanLLMs/Human-Like-DPO-Dataset) to the local directory `$DATASET_PATH/human_like_dpo_dataset`:
 
 ```shell
 # Using Modelscope
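The hunk above ends inside the download snippet, so the full command is not shown. For reference only, a minimal sketch of fetching the same dataset from the Hugging Face Hub; the tutorial's own snippet uses ModelScope, so the exact command may differ:

```shell
# Hypothetical sketch (not the tutorial's snippet): download the DPO dataset
# from the Hugging Face Hub into the directory referenced by the config below.
huggingface-cli download HumanLLMs/Human-Like-DPO-Dataset \
  --repo-type dataset \
  --local-dir "$DATASET_PATH/human_like_dpo_dataset"
```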
@@ -34,9 +34,11 @@ More details of dataset downloading are referred to [ModelScop
 
 Note that the dataset has the keys `prompt`, `chosen` and `rejected`. If not, pass the proper keys to the config.
 
-## Step 2: Setup Configuration and Run Experiment
+For SFT, we download the dataset, which usually contains message-based data, to the local directory `/PATH/TO/SFT_DATASET/`.
 
-### Configuration
+## Step 2: Setup Configuration
+
+### Configuration for DPO
 
 We use the configurations in [`dpo.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/dpo_humanlike/dpo.yaml) and [`train_dpo.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/dpo_humanlike/train_dpo.yaml) for this experiment. Some important setups are listed in the following:
 
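As a quick illustration of the "message-based data" mentioned in the added line above, one record of such an SFT dataset typically looks like the following; the field names match the `messages_key: messages` setting shown later in the SFT config, while the text values are made up and the exact schema of your dataset may differ:

```json
{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Give me one tip for writing clear documentation."},
    {"role": "assistant", "content": "Lead with the task the reader wants to accomplish, then show a minimal working example."}
  ]
}
```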
@@ -53,7 +55,7 @@
   kl_coef: 0.1  # value of beta in DPO
 checkpoint_root_dir: /PATH/TO/CHECKPOINT/
 model:
-  model_path: /PATH/TO/MODEL/
+  model_path: $MODEL_PATH/Qwen2.5-1.5B-Instruct
 cluster:
   node_num: 1
   gpu_per_node: 8
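For context on the `kl_coef` setting above: it plays the role of β in the standard DPO objective (Rafailov et al., 2023). As a reference formula, not taken from this tutorial:

```latex
% beta (kl_coef above) scales the implicit reward margin between the
% chosen response y_w and the rejected response y_l under the reference policy.
\mathcal{L}_{\mathrm{DPO}}(\theta)
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\!\left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]
```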
@@ -62,9 +64,9 @@ buffer:
   batch_size: 64
   trainer_input:
     experience_buffer:
-      name: dpo_buffer
+      name: human_like_dpo
       storage_type: file
-      path: /PATH/TO/DATASET/
+      path: $DATASET_PATH/human_like_dpo_dataset
       format:
         prompt_type: plaintext  # plaintext/messages/chatpair
         prompt_key: prompt
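As a quick illustration of the `plaintext` format configured above and the `prompt`/`chosen`/`rejected` keys mentioned in Step 1, a single record might look like the following; only the key names come from the tutorial, the text values are made up:

```json
{
  "prompt": "How do you usually spend your weekends?",
  "chosen": "I love spending weekends outdoors! Hiking or a long walk always recharges me. What about you?",
  "rejected": "Weekends are the two days at the end of the week."
}
```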
@@ -75,10 +77,48 @@ trainer:
   save_interval: 30
 ```
 
-### Run the Experiment
+### Configuration for SFT
+
+We set `algorithm_type` to `sft` to run the SFT process. Then we modify the config file `sft.yaml` with the following changes:
+
+```yaml
+project: <project_name>
+name: <experiment_name>
+mode: train
+algorithm:
+  algorithm_type: sft
+checkpoint_root_dir: /PATH/TO/CHECKPOINT/
+model:
+  model_path: /PATH/TO/MODEL/
+cluster:
+  node_num: 1
+  gpu_per_node: 2
+buffer:
+  total_epochs: 5
+  batch_size: 64
+  trainer_input:
+    experience_buffer:
+      name: <sft_dataset_name>
+      storage_type: file
+      path: /PATH/TO/SFT_DATASET/
+      split: train
+      format:
+        prompt_type: messages
+        messages_key: messages
+trainer:
+  trainer_config_path: /PATH/TO/TRAIN_CONFIG_YAML/
+  save_interval: 50
+```
+
+## Step 3: Run the Experiment
 
-Run RFT process with the following command:
+Run the DPO process with the following command:
 
 ```shell
 trinity run --config examples/dpo_humanlike/dpo.yaml
 ```
+or, for SFT:
+
+```shell
+trinity run --config /PATH/TO/sft.yaml
+```
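Putting the two commands together, a minimal end-to-end sketch; the environment variables and paths are the placeholders used earlier in this tutorial, not fixed values:

```shell
# Hypothetical sketch: point the placeholders at your local copies, then launch.
export MODEL_PATH=/PATH/TO/MODELS        # contains Qwen2.5-1.5B-Instruct
export DATASET_PATH=/PATH/TO/DATASETS    # contains human_like_dpo_dataset

trinity run --config examples/dpo_humanlike/dpo.yaml   # offline DPO
trinity run --config /PATH/TO/sft.yaml                 # SFT
```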
