diff --git a/docs/sphinx_doc/source/tutorial/example_dpo.md b/docs/sphinx_doc/source/tutorial/example_dpo.md
index 44543ff2bc..b5846bc24b 100644
--- a/docs/sphinx_doc/source/tutorial/example_dpo.md
+++ b/docs/sphinx_doc/source/tutorial/example_dpo.md
@@ -1,6 +1,6 @@
-# Offline DPO
+# Offline DPO and SFT
 
-This example describes DPO based on the Qwen-2.5-1.5B-Instruct model and [Human-like-DPO-dataset](https://huggingface.co/datasets/HumanLLMs/Human-Like-DPO-Dataset).
+This example describes DPO and SFT based on the Qwen2.5-1.5B-Instruct model.
 
 ## Step 1: Model and Data Preparation
 
@@ -20,7 +20,7 @@ More details of model downloading are referred to [ModelScope](https://modelscop
 
 ### Data Preparation
 
-Download the Human-Like-DPO-Dataset dataset to the local directory `$DATASET_PATH/human_like_dpo_dataset`:
+For DPO, we download the [Human-like-DPO-dataset](https://huggingface.co/datasets/HumanLLMs/Human-Like-DPO-Dataset) to the local directory `$DATASET_PATH/human_like_dpo_dataset`:
 
 ```shell
 # Using Modelscope
@@ -34,9 +34,11 @@ More details of dataset downloading are referred to [ModelScope](https://modelsc
 
 Note that the dataset has the keys `prompt`, `chosen` and `rejected`. If not, pass the proper keys to the config.
 
-## Step 2: Setup Configuration and Run Experiment
+For SFT, we download an SFT dataset, which usually contains message-based data, to the local directory `/PATH/TO/SFT_DATASET/`.
 
-### Configuration
+## Step 2: Setup Configuration
+
+### Configuration for DPO
 
 We use the configurations in [`dpo.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/dpo_humanlike/dpo.yaml) and [`train_dpo.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/dpo_humanlike/train_dpo.yaml) for this experiment. Some important setups are listed in the following:
 
@@ -53,7 +55,7 @@ algorithm:
   kl_coef: 0.1 # value of beta in DPO
 checkpoint_root_dir: /PATH/TO/CHECKPOINT/
 model:
-  model_path: /PATH/TO/MODEL/
+  model_path: $MODEL_PATH/Qwen2.5-1.5B-Instruct
 cluster:
   node_num: 1
   gpu_per_node: 8
@@ -62,9 +64,9 @@ buffer:
   batch_size: 64
   trainer_input:
     experience_buffer:
-      name: dpo_buffer
+      name: human_like_dpo
       storage_type: file
-      path: /PATH/TO/DATASET/
+      path: $DATASET_PATH/human_like_dpo_dataset
       format:
         prompt_type: plaintext # plaintext/messages/chatpair
         prompt_key: prompt
@@ -75,10 +77,48 @@ trainer:
   save_interval: 30
 ```
 
-### Run the Experiment
+### Configuration for SFT
+
+We set `algorithm_type` to `sft` to run the SFT process, and modify the config file `sft.yaml` with the following settings:
+
+```yaml
+project:
+name:
+mode: train
+algorithm:
+  algorithm_type: sft
+checkpoint_root_dir: /PATH/TO/CHECKPOINT/
+model:
+  model_path: /PATH/TO/MODEL/
+cluster:
+  node_num: 1
+  gpu_per_node: 2
+buffer:
+  total_epochs: 5
+  batch_size: 64
+  trainer_input:
+    experience_buffer:
+      name:
+      storage_type: file
+      path: /PATH/TO/SFT_DATASET/
+      split: train
+      format:
+        prompt_type: messages
+        messages_key: messages
+trainer:
+  trainer_config_path: /PATH/TO/TRAIN_CONFIG_YAML/
+  save_interval: 50
+```
+
+## Step 3: Run the Experiment
 
-Run RFT process with the following command:
+Run the DPO process with the following command:
 
 ```shell
 trinity run --config examples/dpo_humanlike/dpo.yaml
 ```
+or, for SFT:
+
+```shell
+trinity run --config /PATH/TO/sft.yaml
+```
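
For reference, the DPO `format` section above maps the keys `prompt`, `chosen`, and `rejected` from the Human-Like-DPO dataset. Under `prompt_type: plaintext`, a single record would look roughly like the following sketch; the field values here are illustrative stand-ins, not an actual row from the dataset:

```json
{
  "prompt": "How do you usually unwind after a long day?",
  "chosen": "Honestly, I love putting on some music and cooking something simple. It helps me reset.",
  "rejected": "I am a language model and do not have days or ways to unwind."
}
```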
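
Similarly, the SFT configuration above declares `prompt_type: messages` with `messages_key: messages`. Assuming the common chat-style schema for message-based data, one record in `/PATH/TO/SFT_DATASET/` might look like this sketch (roles and contents are invented for illustration):

```json
{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize what DPO optimizes in one sentence."},
    {"role": "assistant", "content": "DPO fine-tunes a model to prefer chosen responses over rejected ones without training an explicit reward model."}
  ]
}
```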