
Commit 8384b9c

Merge branch 'algorithm_dev' into refactor/data

# Conflicts:
#   docs/sphinx_doc/source/tutorial/example_data_functionalities.md
#   trinity/buffer/buffer.py
#   trinity/buffer/reader/file_reader.py
#   trinity/cli/launcher.py

2 parents: 2baed2e + 99a772a

File tree: 95 files changed (+4645 / -4243 lines)


README.md

Lines changed: 7 additions & 4 deletions
@@ -148,8 +148,11 @@ pip install -e .\[dev\]
 
 # Install flash-attn after all dependencies are installed
 # Note: flash-attn will take a long time to compile, please be patient.
-pip install flash-attn -v
-# Try the following command if you encounter errors during installation
+# for bash
+pip install -e .[flash_attn]
+# for zsh
+pip install -e .\[flash_attn\]
+# Try the following command if you encounter errors during flash-attn installation
 # pip install flash-attn -v --no-build-isolation
 ```
 
@@ -263,7 +266,7 @@ Then, for command-line users, run the RFT process with the following command:
 trinity run --config <config_path>
 ```
 
-> For example, below is the command for fine-tuning Qwen-2.5-1.5B-Instruct on GSM8k dataset using GRPO algorithm:
+> For example, below is the command for fine-tuning Qwen2.5-1.5B-Instruct on GSM8k dataset using GRPO algorithm:
 > ```shell
 > trinity run --config examples/grpo_gsm8k/gsm8k.yaml
 > ```
@@ -276,7 +279,7 @@ For more detailed examples about how to use Trinity-RFT, please refer to the fol
 + [Off-policy mode of RFT](./docs/sphinx_doc/source/tutorial/example_reasoning_advanced.md)
 + [Asynchronous mode of RFT](./docs/sphinx_doc/source/tutorial/example_async_mode.md)
 + [Multi-turn tasks](./docs/sphinx_doc/source/tutorial/example_multi_turn.md)
-+ [Offline learning by DPO](./docs/sphinx_doc/source/tutorial/example_dpo.md)
++ [Offline learning by DPO or SFT](./docs/sphinx_doc/source/tutorial/example_dpo.md)
 + [Advanced data processing / human-in-the-loop](./docs/sphinx_doc/source/tutorial/example_data_functionalities.md)

docs/sphinx_doc/source/conf.py

Lines changed: 2 additions & 1 deletion
@@ -22,12 +22,13 @@
     "sphinx.ext.napoleon",
     "sphinx.ext.autosectionlabel",
     "myst_parser",
+    "sphinx.ext.mathjax",
 ]
 source_suffix = {
     ".rst": "restructuredtext",
     ".md": "markdown",
 }
-myst_enable_extensions = ["colon_fence"]
+myst_enable_extensions = ["colon_fence", "amsmath", "dollarmath"]
 
 # Prefix document path to section labels, otherwise autogenerated labels would
 # look like 'heading' rather than 'path/to/file:heading'
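For context on what the new extensions enable: with `sphinx.ext.mathjax` plus the `dollarmath` and `amsmath` MyST extensions, the tutorial markdown pages can embed TeX between dollar signs. A minimal illustration (not taken from the repo's docs):

```markdown
The objective $J(\theta)$ can now be written inline, or as a display block:

$$
J(\theta) = \mathbb{E}_{x \sim \mathcal{D}}\bigl[\, r_\theta(x) \,\bigr]
$$
```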

docs/sphinx_doc/source/index.rst

Lines changed: 11 additions & 2 deletions
@@ -14,16 +14,24 @@ Welcome to Trinity-RFT's documentation!
    :maxdepth: 1
    :glob:
    :hidden:
-   :caption: Tutorial
+   :caption: Examples
 
    tutorial/example_reasoning_basic.md
    tutorial/example_reasoning_advanced.md
    tutorial/example_async_mode.md
    tutorial/example_multi_turn.md
    tutorial/example_dpo.md
    tutorial/example_data_functionalities.md
-   tutorial/trinity_configs.md
+
+.. toctree::
+   :maxdepth: 2
+   :glob:
+   :hidden:
+   :caption: Guidelines
+
    tutorial/trinity_programming_guide.md
+   tutorial/trinity_configs.md
+   tutorial/example_mix_algo.md
 
 .. toctree::
    :maxdepth: 1
@@ -33,6 +41,7 @@ Welcome to Trinity-RFT's documentation!
    build_api/trinity.buffer
    build_api/trinity.explorer
    build_api/trinity.trainer
+   build_api/trinity.algorithm
    build_api/trinity.manager
    build_api/trinity.common
    build_api/trinity.utils

docs/sphinx_doc/source/main.md

Lines changed: 10 additions & 12 deletions
@@ -84,15 +84,18 @@ e.g., utilizing NCCL (when feasible) for model weight synchronization, sequence
 
 ## Getting started
 
-
-*Note: this project is currently under active development; comments and suggestions are welcome!*
-
+```{note}
+Note: This project is currently under active development; comments and suggestions are welcome!
+```
 
 
 
 ### Step 1: preparations
 
-
+Trinity-RFT requires
+Python version >= 3.10,
+CUDA version >= 12.4,
+and at least 2 GPUs.
 
 
 Installation from source (recommended):
@@ -146,11 +149,6 @@ docker build -f scripts/docker/Dockerfile -t trinity-rft:latest .
 docker run -it --gpus all --shm-size="64g" --rm -v $PWD:/workspace -v <root_path_of_data_and_checkpoints>:/data trinity-rft:latest
 ```
 
-Trinity-RFT requires
-Python version >= 3.10,
-CUDA version >= 12.4,
-and at least 2 GPUs.
-
 
 ### Step 2: prepare dataset and model
 
@@ -243,15 +241,15 @@ trinity run --config <config_path>
 
 
 
-For example, below is the command for fine-tuning Qwen-2.5-1.5B-Instruct on GSM8k dataset using GRPO algorithm:
+For example, below is the command for fine-tuning Qwen2.5-1.5B-Instruct on GSM8k dataset using GRPO algorithm:
 
 ```shell
 trinity run --config examples/grpo_gsm8k/gsm8k.yaml
 ```
 
 
 
-More example config files can be found in `examples`.
+More example config files can be found in [`examples`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/).
 
 
 
@@ -260,7 +258,7 @@ For more detailed examples about how to use Trinity-RFT, please refer to the fol
 + [Off-policy mode of RFT](tutorial/example_reasoning_advanced.md)
 + [Asynchronous mode of RFT](tutorial/example_async_mode.md)
 + [Multi-turn tasks](tutorial/example_multi_turn.md)
-+ [Offline learning by DPO](tutorial/example_dpo.md)
++ [Offline learning by DPO or SFT](tutorial/example_dpo.md)
 + [Advanced data processing / human-in-the-loop](tutorial/example_data_functionalities.md)
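The requirements stated in this file (Python >= 3.10, CUDA >= 12.4, at least 2 GPUs) can be sanity-checked before installation with a few generic commands; this snippet is illustrative and not part of the commit:

```shell
python --version       # expect Python 3.10 or newer
nvcc --version         # expect CUDA release 12.4 or newer
nvidia-smi -L | wc -l  # expect a count of 2 or more GPUs
```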

docs/sphinx_doc/source/tutorial/example_async_mode.md

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 # Asynchronous RFT
 
-This example shows how to run RFT in a fully asynchronous mode with the GRPO algorithm, Qwen-2.5-1.5B-Instruct model and GSM8K dataset.
+This example shows how to run RFT in a fully asynchronous mode with the GRPO algorithm, Qwen2.5-1.5B-Instruct model and GSM8K dataset.
 
 Trinity-RFT supports an asynchronous mode by running the trainer and explorer in separate processes.

docs/sphinx_doc/source/tutorial/example_data_functionalities.md

Lines changed: 7 additions & 7 deletions
@@ -38,12 +38,12 @@ data_processor:
   # I/O buffers
   input_buffers:
     - name: 'raw_input'
-      path: 'openai/gsm8k'
+      path: /PATH/TO/GSM8K/
       storage_type: 'file'
       raw: true
   output_buffer:
     name: 'raw_output'
-    path: './outputs/task_pipeline_output/prioritized_gsm8k.jsonl'
+    path: /PATH/TO/OUTPUT/JSONL/FILE
     storage_type: 'file'
   # format mapping
   format:
@@ -72,12 +72,12 @@ data_processor:
   # I/O buffers
   input_buffers:
     - name: 'raw_input'
-      path: 'openai/gsm8k'
+      path: /PATH/TO/GSM8K/
      storage_type: 'file'
       raw: true
   output_buffer:
     name: 'raw_output'
-    path: './outputs/task_pipeline_output/prioritized_gsm8k.jsonl'
+    path: /PATH/TO/OUTPUT/JSONL/FILE
     storage_type: 'file'
   # format mapping
   format:
@@ -122,12 +122,12 @@ data_processor:
   # I/O buffers
   input_buffers:
     - name: 'raw_input'
-      path: 'openai/gsm8k'
+      path: /PATH/TO/GSM8K/
       storage_type: 'file'
       raw: true
   output_buffer:
     name: 'raw_output'
-    path: './outputs/task_pipeline_output/prioritized_gsm8k.jsonl'
+    path: /PATH/TO/OUTPUT/JSONL/FILE
     storage_type: 'file'
   # format mapping
   format:
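These hunks switch the input buffer from the `openai/gsm8k` Hub id to a local directory placeholder. One way to materialize such a local copy, assuming the `huggingface-cli` tool from `huggingface_hub` is available (the target path is a placeholder, not defined by this commit):

```shell
# download the GSM8K dataset repository into a local folder
pip install -U huggingface_hub
huggingface-cli download openai/gsm8k --repo-type dataset --local-dir /PATH/TO/GSM8K/
```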
@@ -217,7 +217,7 @@ data_processor:
 
 Here you can set the basic information for the example dataset, database information that is used to store the result dataset, and some other items about downstream dataset loading for exploring and training, which is similar to the example above.
 
-For this example, we assume that you are somehow familiar with the basic usage of Data-Juicer, so we need to prepare a Data-Juicer data processing recipe in `tests/test_configs/human_annotator_test_dj_cfg.yaml` that includes an OP of `human_preference_annotation_mapper`. For example:
+For this example, we assume that you are somehow familiar with the basic usage of Data-Juicer, so we need to prepare a Data-Juicer data processing recipe in [`tests/test_configs/human_annotator_test_dj_cfg.yaml`](https://github.com/modelscope/Trinity-RFT/blob/main/tests/test_configs/human_annotator_test_dj_cfg.yaml) that includes an OP of `human_preference_annotation_mapper`. For example:
 
 ```yaml
 project_name: 'demo-human-annotator'

docs/sphinx_doc/source/tutorial/example_dpo.md

Lines changed: 54 additions & 13 deletions
@@ -1,12 +1,12 @@
-# Offline DPO
+# Offline DPO and SFT
 
-This example describes DPO based on the Qwen-2.5-1.5B-Instruct model and [Human-like-DPO-dataset](https://huggingface.co/datasets/HumanLLMs/Human-Like-DPO-Dataset).
+This example describes DPO and SFT based on the Qwen2.5-1.5B-Instruct model.
 
 ## Step 1: Model and Data Preparation
 
 ### Model Preparation
 
-Download the Qwen-2.5-1.5B-Instruct model to the local directory `$MODEL_PATH/Qwen2.5-1.5B-Instruct`:
+Download the Qwen2.5-1.5B-Instruct model to the local directory `$MODEL_PATH/Qwen2.5-1.5B-Instruct`:
 
 ```shell
 # Using Modelscope
@@ -20,7 +20,7 @@ More details of model downloading are referred to [ModelScope](https://modelscop
 
 ### Data Preparation
 
-Download the Human-Like-DPO-Dataset dataset to the local directory `$DATASET_PATH/human_like_dpo_dataset`:
+For DPO, we download the [Human-like-DPO-dataset](https://huggingface.co/datasets/HumanLLMs/Human-Like-DPO-Dataset) to the local directory `$DATASET_PATH/human_like_dpo_dataset`:
 
 ```shell
 # Using Modelscope
@@ -34,9 +34,11 @@ More details of dataset downloading are referred to [ModelScope](https://modelsc
 
 Note that the dataset has the keys `prompt`, `chosen` and `rejected`. If not, pass the proper keys to the config.
 
-## Step 2: Setup Configuration and Run Experiment
+For SFT, we download the dataset to the local directory `/PATH/TO/SFT_DATASET/`, which usually contains message-based data.
 
-### Configuration
+## Step 2: Setup Configuration
+
+### Configuration for DPO
 
 We use the configurations in [`dpo.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/dpo_humanlike/dpo.yaml) and [`train_dpo.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/dpo_humanlike/train_dpo.yaml) for this experiment. Some important setups are listed in the following:
 
@@ -48,9 +50,12 @@ name: <experiment_name>
 mode: train
 algorithm:
   algorithm_type: dpo
+  kl_loss_fn: k1
+  kl_loss_fn_args:
+    kl_coef: 0.1 # value of beta in DPO
 checkpoint_root_dir: /PATH/TO/CHECKPOINT/
 model:
-  model_path: /PATH/TO/MODEL/
+  model_path: $MODEL_PATH/Qwen2.5-1.5B-Instruct
 cluster:
   node_num: 1
   gpu_per_node: 8
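For reference, the `kl_coef` introduced in this hunk plays the role of the temperature β in the standard DPO objective (Rafailov et al., 2023); the formula below is background, not part of the commit:

```latex
\mathcal{L}_{\mathrm{DPO}}(\theta) =
  -\,\mathbb{E}_{(x,\,y_w,\,y_l) \sim \mathcal{D}}
  \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
  \right) \right]
```

A larger β keeps the learned policy closer to the reference model, which is why the value is exposed here through a KL-style coefficient.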
@@ -59,9 +64,9 @@ buffer:
   batch_size: 64
   trainer_input:
     experience_buffer:
-      name: dpo_buffer
+      name: human_like_dpo
       storage_type: file
-      path: /PATH/TO/DATASET/
+      path: $DATASET_PATH/human_like_dpo_dataset
       format:
         prompt_type: plaintext # plaintext/messages/chatpair
         prompt_key: prompt
@@ -70,14 +75,50 @@ buffer:
 trainer:
   trainer_config_path: 'examples/dpo_humanlike/train_dpo.yaml'
   save_interval: 30
-  actor_use_kl_loss: True
-  actor_kl_loss_coef: 0.1 # value of beta in DPO
 ```
 
-### Run the Experiment
+### Configuration for SFT
+
+We set the `algorithm_type` as `sft` to run SFT process. Then we modify the config file `sft.yaml` with the following changes:
+
+```yaml
+project: <project_name>
+name: <experiment_name>
+mode: train
+algorithm:
+  algorithm_type: sft
+checkpoint_root_dir: /PATH/TO/CHECKPOINT/
+model:
+  model_path: /PATH/TO/MODEL/
+cluster:
+  node_num: 1
+  gpu_per_node: 2
+buffer:
+  total_epochs: 5
+  batch_size: 64
+  trainer_input:
+    experience_buffer:
+      name: <sft_dataset_name>
+      storage_type: file
+      path: /PATH/TO/SFT_DATASET/
+      split: train
+      format:
+        prompt_type: messages
+        messages_key: messages
+trainer:
+  trainer_config_path: /PATH/TO/TRAIN_CONFIG_YAML/
+  save_interval: 50
+```
+
+## Step 3: Run the Experiment
 
-Run RFT process with the following command:
+Run DPO process with the following command:
 
 ```shell
 trinity run --config examples/dpo_humanlike/dpo.yaml
 ```
+or, for SFT:
+
+```shell
+trinity run --config /PATH/TO/sft.yaml
+```
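As a rough sketch of the "message-based data" the SFT configuration above expects: with `prompt_type: messages` and `messages_key: messages`, each record presumably carries a chat-style message list. The exact schema is an assumption inferred from those keys, not something defined in this commit; a hypothetical one-line JSONL file could be created like this:

```shell
# write a single hypothetical SFT record in the messages format (field names assumed)
mkdir -p /PATH/TO/SFT_DATASET
cat > /PATH/TO/SFT_DATASET/train.jsonl << 'EOF'
{"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What is 2 + 2?"}, {"role": "assistant", "content": "2 + 2 = 4."}]}
EOF
```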
