# Example Summary

> From the Dataset Perspective

This guide lists the examples from the dataset perspective, so you can easily see which datasets the examples cover.

| Dataset | Algorithm | Use Case | References |
| --- | --- | --- | --- |
| [openai/gsm8k](https://huggingface.co/datasets/openai/gsm8k) | GRPO | Regular RFT | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_gsm8k), [doc](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_reasoning_basic.html) |
| | GRPO | Asynchronous training | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/async_gsm8k), [doc](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_async_mode.html) |
| | Multi-Step GRPO | AgentScope ReAct agent training | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/agentscope_react), [doc](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_react.html) |
| | AsymRE | Regular RFT | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/asymre_gsm8k) |
| | CISPO | Regular RFT | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/cispo_gsm8k) |
| | GRPO | Training with prioritized tasks | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_gsm8k_task_pipeline), [doc](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_data_functionalities.html#example-data-processor-for-task-pipeline) |
| | GRPO | Training with reward reshaping on experiences | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_gsm8k_experience_pipeline), [doc](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_data_functionalities.html#example-data-processor-for-experience-pipeline) |
| | GRPO | Training with RULER (Relative Universal LLM-Elicited Rewards) | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_gsm8k_ruler) |
| | GRPO | Training a policy model as its own reward model | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_gsm8k_trainable_ruler) |
| | GRPO | Training using LoRA | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_lora_gsm8k) |
| | OPMD | Off-policy RFT | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/opmd_gsm8k), [doc](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_reasoning_advanced.html) |
| | REC | Training with group-relative REINFORCE variants | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/rec_gsm8k) |
| | sPPO | Training with the sPPO algorithm | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/sppo_gsm8k) |
| | TOPR | Tapered off-policy RFT | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/topr_gsm8k) |
| Math category tasks | GRPO | Training with rewards from RM-Gallery | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_math) |
| | AsymRE | Regular RFT | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/asymre_math) |
| | MIX | Training with "expert" data generated by a more advanced LLM | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/mix_math), [doc](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_mix_algo.html) |
| [ALFWorld](https://github.com/alfworld/alfworld) | GRPO | Concatenated multi-turn RFT | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_alfworld), [doc](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_multi_turn.html) |
| | Multi-Step GRPO | General multi-step RFT | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_alfworld_general_multi_step), [doc](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_step_wise.html) |
| [SciWorld](https://github.com/allenai/ScienceWorld) | GRPO | Concatenated multi-turn RFT | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_sciworld) |
| [WebShop](https://github.com/princeton-nlp/WebShop) | GRPO | Concatenated multi-turn RFT | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_webshop), [doc](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_multi_turn.html) |
| [callanwu/WebWalkerQA](https://huggingface.co/datasets/callanwu/WebWalkerQA) | Multi-Step GRPO | Multi-turn web search agent training | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/agentscope_websearch) |
| [corbt/enron-emails](https://huggingface.co/datasets/corbt/enron-emails) | Multi-Step GRPO | Multi-turn email search agent training | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_email_search), [doc](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_search_email.html) |
| [open-r1/DAPO-Math-17k-Processed](https://huggingface.co/datasets/open-r1/DAPO-Math-17k-Processed) | GRPO | Regular RFT | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/dapo_math) |
| [LLM360/guru-RL-92k](https://huggingface.co/datasets/LLM360/guru-RL-92k) | GRPO | Training with Bayesian online task selection | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/bots) |
| [Frozen Lake](https://gymnasium.farama.org/environments/toy_text/frozen_lake/) | GRPO | Concatenated multi-turn RFT | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_frozen_lake) |
| [anisha2102/RaR-Medicine](https://huggingface.co/datasets/anisha2102/RaR-Medicine) | GRPO | Training with rewards from an LLM judge and rubrics for a non-verifiable medical QA task | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_rubric_as_reward) |
| [Team-ACE/ToolACE](https://huggingface.co/datasets/Team-ACE/ToolACE) | GRPO | Regular RFT for tool calling | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_toolcall) |
| [hiyouga/geometry3k](https://huggingface.co/datasets/hiyouga/geometry3k) | GRPO | Regular RFT for VLM | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_vlm) |
| | MIX | Training with "expert" data generated by a more advanced LLM | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/mix_vlm) |
| [datajuicer/RealMedConv](https://huggingface.co/datasets/datajuicer/RealMedConv) | GRPO | Regular RFT for learning to ask questions proactively | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/learn_to_ask) |
| [datajuicer/Trinity-ToolAce-RL-split](https://huggingface.co/datasets/datajuicer/Trinity-ToolAce-RL-split) | CHORD | Training with dynamic SFT + RL integration | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/mix_chord) |
| [datajuicer/Trinity-ToolAce-SFT-split](https://huggingface.co/datasets/datajuicer/Trinity-ToolAce-SFT-split) | CHORD | Training with dynamic SFT + RL integration | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/mix_chord) |
| [Jiayi-Pan/Countdown-Tasks-3to4](https://huggingface.co/datasets/Jiayi-Pan/Countdown-Tasks-3to4) | PPO | Training with a critic model | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/ppo_countdown) |
| | PPO | Training with Megatron-LM as the backend | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/ppo_countdown_megatron) |
| | PPO | Training with experience replay | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/ppo_countdown_exp_replay) |
| [open-r1/Mixture-of-Thoughts](https://huggingface.co/datasets/open-r1/Mixture-of-Thoughts) | SFT | Regular SFT | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/sft_mot), [doc](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_dpo.html#configuration-for-sft) |
| [HumanLLMs/Human-Like-DPO-Dataset](https://huggingface.co/datasets/HumanLLMs/Human-Like-DPO-Dataset) | DPO | Training based on prepared human preferences | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/dpo_humanlike), [doc](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_dpo.html) |
| toy dataset | DPO | Training based on human-in-the-loop real-time preference annotation | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/dpo_human_in_the_loop), [doc](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_data_functionalities.html#example-human-in-the-loop) |
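
Most of the datasets above are hosted on Hugging Face. If you want to peek at one of them before running the corresponding example, the sketch below uses the Hugging Face `datasets` library; it is not part of Trinity-RFT itself, and the `main` config plus the `question`/`answer` columns are specific to `openai/gsm8k`. The linked example directories and docs show how each dataset is actually wired into a training run.

```python
# Minimal sketch (assumption: the `datasets` package is installed):
# download and inspect one of the listed datasets before using it in an example.
from datasets import load_dataset

# "main" is the standard GSM8K config; "socratic" is the alternative one.
ds = load_dataset("openai/gsm8k", "main", split="train")

print(ds)                 # row count and column names ("question", "answer")
print(ds[0]["question"])  # a sample problem statement
print(ds[0]["answer"])    # the reference solution, ending with "#### <final answer>"
```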