diff --git a/docs/sphinx_doc/source/index.rst b/docs/sphinx_doc/source/index.rst
index 21794b4138..6e56b032b9 100644
--- a/docs/sphinx_doc/source/index.rst
+++ b/docs/sphinx_doc/source/index.rst
@@ -42,6 +42,7 @@ Welcome to Trinity-RFT's documentation!
    tutorial/example_dpo.md
    tutorial/example_megatron.md
    tutorial/example_data_functionalities.md
+   tutorial/example_dataset_perspective.md
 
 .. toctree::
    :maxdepth: 2
diff --git a/docs/sphinx_doc/source/tutorial/example_dataset_perspective.md b/docs/sphinx_doc/source/tutorial/example_dataset_perspective.md
new file mode 100644
index 0000000000..635ab7fb23
--- /dev/null
+++ b/docs/sphinx_doc/source/tutorial/example_dataset_perspective.md
@@ -0,0 +1,47 @@
+# Example Summary
+
+> From the Dataset Perspective
+
+This guide lists the examples from the dataset perspective, so you can easily see which datasets the examples cover.
+
+| Dataset | Algorithm | Use Case | References |
+| --- | --- | --- | --- |
+| [openai/gsm8k](https://huggingface.co/datasets/openai/gsm8k) | GRPO | Regular RFT | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_gsm8k), [doc](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_reasoning_basic.html) |
+| | GRPO | Asynchronous training | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/async_gsm8k), [doc](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_async_mode.html) |
+| | Multi-Step GRPO | AgentScope ReAct agent training | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/agentscope_react), [doc](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_react.html) |
+| | AsymRE | Regular RFT | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/asymre_gsm8k) |
+| | CISPO | Regular RFT | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/cispo_gsm8k) |
+| | GRPO | Training with prioritized tasks | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_gsm8k_task_pipeline), [doc](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_data_functionalities.html#example-data-processor-for-task-pipeline) |
+| | GRPO | Training with reward reshaping on experiences | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_gsm8k_experience_pipeline), [doc](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_data_functionalities.html#example-data-processor-for-experience-pipeline) |
+| | GRPO | Training with RULER (Relative Universal LLM-Elicited Rewards) | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_gsm8k_ruler) |
+| | GRPO | Training a policy model as its own reward model | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_gsm8k_trainable_ruler) |
+| | GRPO | Training with LoRA | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_lora_gsm8k) |
+| | OPMD | Off-policy RFT | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/opmd_gsm8k), [doc](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_reasoning_advanced.html) |
+| | REC | Training with group-relative REINFORCE variants | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/rec_gsm8k) |
+| | sPPO | Training with the sPPO algorithm | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/sppo_gsm8k) |
+| | TOPR | Tapered off-policy RFT | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/topr_gsm8k) |
+| Math category tasks | GRPO | Training with rewards from RM-Gallery | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_math) |
+| | AsymRE | Regular RFT | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/asymre_math) |
+| | MIX | Training with "expert" data generated by a more advanced LLM | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/mix_math), [doc](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_mix_algo.html) |
+| [ALFWorld](https://github.com/alfworld/alfworld) | GRPO | Concatenated multi-turn RFT | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_alfworld), [doc](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_multi_turn.html) |
+| | Multi-Step GRPO | General multi-step RFT | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_alfworld_general_multi_step), [doc](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_step_wise.html) |
+| [SciWorld](https://github.com/allenai/ScienceWorld) | GRPO | Concatenated multi-turn RFT | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_sciworld) |
+| [WebShop](https://github.com/princeton-nlp/WebShop) | GRPO | Concatenated multi-turn RFT | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_webshop), [doc](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_multi_turn.html) |
+| [callanwu/WebWalkerQA](https://huggingface.co/datasets/callanwu/WebWalkerQA) | Multi-Step GRPO | Multi-turn web search agent training | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/agentscope_websearch) |
+| [corbt/enron-emails](https://huggingface.co/datasets/corbt/enron-emails) | Multi-Step GRPO | Multi-turn email search agent training | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_email_search), [doc](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_search_email.html) |
+| [open-r1/DAPO-Math-17k-Processed](https://huggingface.co/datasets/open-r1/DAPO-Math-17k-Processed) | GRPO | Regular RFT | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/dapo_math) |
+| [LLM360/guru-RL-92k](https://huggingface.co/datasets/LLM360/guru-RL-92k) | GRPO | Training with Bayesian online task selection | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/bots) |
+| [Frozen Lake](https://gymnasium.farama.org/environments/toy_text/frozen_lake/) | GRPO | Concatenated multi-turn RFT | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_frozen_lake) |
+| [anisha2102/RaR-Medicine](https://huggingface.co/datasets/anisha2102/RaR-Medicine) | GRPO | Training with rewards from an LLM judge and rubrics for a non-verifiable medicine QA task | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_rubric_as_reward) |
+| [Team-ACE/ToolACE](https://huggingface.co/datasets/Team-ACE/ToolACE) | GRPO | Regular RFT for tool calling | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_toolcall) |
+| [hiyouga/geometry3k](https://huggingface.co/datasets/hiyouga/geometry3k) | GRPO | Regular RFT for VLMs | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_vlm) |
+| | MIX | Training with "expert" data generated by a more advanced LLM | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/mix_vlm) |
+| [datajuicer/RealMedConv](https://huggingface.co/datasets/datajuicer/RealMedConv) | GRPO | Regular RFT for learning to ask proactively | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/learn_to_ask) |
+| [datajuicer/Trinity-ToolAce-RL-split](https://huggingface.co/datasets/datajuicer/Trinity-ToolAce-RL-split) | CHORD | Training with dynamic SFT + RL integration | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/mix_chord) |
+| [datajuicer/Trinity-ToolAce-SFT-split](https://huggingface.co/datasets/datajuicer/Trinity-ToolAce-SFT-split) | CHORD | Training with dynamic SFT + RL integration | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/mix_chord) |
+| [Jiayi-Pan/Countdown-Tasks-3to4](https://huggingface.co/datasets/Jiayi-Pan/Countdown-Tasks-3to4) | PPO | Training based on the critic model | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/ppo_countdown) |
+| | PPO | Training with Megatron-LM as the backend | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/ppo_countdown_megatron) |
+| | PPO | Training with experience replay | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/ppo_countdown_exp_replay) |
+| [open-r1/Mixture-of-Thoughts](https://huggingface.co/datasets/open-r1/Mixture-of-Thoughts) | SFT | Regular SFT | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/sft_mot), [doc](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_dpo.html#configuration-for-sft) |
+| [HumanLLMs/Human-Like-DPO-Dataset](https://huggingface.co/datasets/HumanLLMs/Human-Like-DPO-Dataset) | DPO | Training based on prepared human preferences | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/dpo_humanlike), [doc](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_dpo.html) |
+| Toy dataset | DPO | Training based on human-in-the-loop real-time preference annotation | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/dpo_human_in_the_loop), [doc](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_data_functionalities.html#example-human-in-the-loop) |
diff --git a/docs/sphinx_doc/source_zh/index.rst b/docs/sphinx_doc/source_zh/index.rst
index 3e4fbc276f..4b36aafbe6 100644
--- a/docs/sphinx_doc/source_zh/index.rst
+++ b/docs/sphinx_doc/source_zh/index.rst
@@ -40,6 +40,7 @@
    tutorial/example_dpo.md
    tutorial/example_megatron.md
    tutorial/example_data_functionalities.md
+   tutorial/example_dataset_perspective.md
 
 .. toctree::
    :maxdepth: 2
diff --git a/docs/sphinx_doc/source_zh/tutorial/example_dataset_perspective.md b/docs/sphinx_doc/source_zh/tutorial/example_dataset_perspective.md
new file mode 100644
index 0000000000..24b85906b3
--- /dev/null
+++ b/docs/sphinx_doc/source_zh/tutorial/example_dataset_perspective.md
@@ -0,0 +1,47 @@
+# 样例总览
+
+> 从数据集视角出发
+
+该文档从数据集视角提供了样例列表,方便用户快速了解各样例所覆盖的数据集。
+
+| 数据集 | 算法 | 使用场景 | 参考文档 |
+| --- | --- | --- | --- |
+| [openai/gsm8k](https://huggingface.co/datasets/openai/gsm8k) | GRPO | 常规 RFT | [样例位置](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_gsm8k), [相关文档](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_reasoning_basic.html) |
+| | GRPO | 异步训练 | [样例位置](https://github.com/modelscope/Trinity-RFT/tree/main/examples/async_gsm8k), [相关文档](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_async_mode.html) |
+| | Multi-Step GRPO | AgentScope ReAct 智能体训练 | [样例位置](https://github.com/modelscope/Trinity-RFT/tree/main/examples/agentscope_react), [相关文档](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_react.html) |
+| | AsymRE | 常规 RFT | [样例位置](https://github.com/modelscope/Trinity-RFT/tree/main/examples/asymre_gsm8k) |
+| | CISPO | 常规 RFT | [样例位置](https://github.com/modelscope/Trinity-RFT/tree/main/examples/cispo_gsm8k) |
+| | GRPO | 使用优先级任务进行训练 | [样例位置](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_gsm8k_task_pipeline), [相关文档](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_data_functionalities.html#example-data-processor-for-task-pipeline) |
+| | GRPO | 在经验上进行奖励重塑的训练 | [样例位置](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_gsm8k_experience_pipeline), [相关文档](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_data_functionalities.html#example-data-processor-for-experience-pipeline) |
+| | GRPO | 使用 RULER (Relative Universal LLM-Elicited Rewards) 进行训练 | [样例位置](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_gsm8k_ruler) |
+| | GRPO | 训练策略模型作为其自身的奖励模型 | [样例位置](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_gsm8k_trainable_ruler) |
+| | GRPO | 使用 LoRA 进行训练 | [样例位置](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_lora_gsm8k) |
+| | OPMD | 异策略 RFT | [样例位置](https://github.com/modelscope/Trinity-RFT/tree/main/examples/opmd_gsm8k), [相关文档](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_reasoning_advanced.html) |
+| | REC | 使用组相对强化变体进行训练 | [样例位置](https://github.com/modelscope/Trinity-RFT/tree/main/examples/rec_gsm8k) |
+| | sPPO | 使用 sPPO 算法进行训练 | [样例位置](https://github.com/modelscope/Trinity-RFT/tree/main/examples/sppo_gsm8k) |
+| | TOPR | 渐减式异策略 RFT | [样例位置](https://github.com/modelscope/Trinity-RFT/tree/main/examples/topr_gsm8k) |
+| 数学类型任务 | GRPO | 使用 RM-Gallery 的奖励进行训练 | [样例位置](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_math) |
+| | AsymRE | 常规 RFT | [样例位置](https://github.com/modelscope/Trinity-RFT/tree/main/examples/asymre_math) |
+| | MIX | 使用更先进大模型生成的“专家”数据进行训练 | [样例位置](https://github.com/modelscope/Trinity-RFT/tree/main/examples/mix_math), [相关文档](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_mix_algo.html) |
+| [ALFWorld](https://github.com/alfworld/alfworld) | GRPO | 拼接多轮 RFT | [样例位置](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_alfworld), [相关文档](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_multi_turn.html) |
+| | Multi-Step GRPO | 通用多轮 RFT | [样例位置](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_alfworld_general_multi_step), [相关文档](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_step_wise.html) |
+| [SciWorld](https://github.com/allenai/ScienceWorld) | GRPO | 拼接多轮 RFT | [样例位置](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_sciworld) |
+| [WebShop](https://github.com/princeton-nlp/WebShop) | GRPO | 拼接多轮 RFT | [样例位置](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_webshop), [相关文档](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_multi_turn.html) |
+| [callanwu/WebWalkerQA](https://huggingface.co/datasets/callanwu/WebWalkerQA) | Multi-Step GRPO | 多轮网页搜索智能体训练 | [样例位置](https://github.com/modelscope/Trinity-RFT/tree/main/examples/agentscope_websearch) |
+| [corbt/enron-emails](https://huggingface.co/datasets/corbt/enron-emails) | Multi-Step GRPO | 多轮邮件搜索智能体训练 | [样例位置](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_email_search), [相关文档](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_search_email.html) |
+| [open-r1/DAPO-Math-17k-Processed](https://huggingface.co/datasets/open-r1/DAPO-Math-17k-Processed) | GRPO | 常规 RFT | [样例位置](https://github.com/modelscope/Trinity-RFT/tree/main/examples/dapo_math) |
+| [LLM360/guru-RL-92k](https://huggingface.co/datasets/LLM360/guru-RL-92k) | GRPO | 使用贝叶斯在线任务选择进行训练 | [样例位置](https://github.com/modelscope/Trinity-RFT/tree/main/examples/bots) |
+| [Frozen Lake](https://gymnasium.farama.org/environments/toy_text/frozen_lake/) | GRPO | 拼接多轮 RFT | [样例位置](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_frozen_lake) |
+| [anisha2102/RaR-Medicine](https://huggingface.co/datasets/anisha2102/RaR-Medicine) | GRPO | 针对不可验证医学问答任务,使用大模型裁判和评分标准提供奖励进行训练 | [样例位置](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_rubric_as_reward) |
+| [Team-ACE/ToolACE](https://huggingface.co/datasets/Team-ACE/ToolACE) | GRPO | 针对工具调用的常规 RFT | [样例位置](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_toolcall) |
+| [hiyouga/geometry3k](https://huggingface.co/datasets/hiyouga/geometry3k) | GRPO | 针对视觉语言模型的常规 RFT | [样例位置](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_vlm) |
+| | MIX | 使用更先进大模型生成的“专家”数据进行训练 | [样例位置](https://github.com/modelscope/Trinity-RFT/tree/main/examples/mix_vlm) |
+| [datajuicer/RealMedConv](https://huggingface.co/datasets/datajuicer/RealMedConv) | GRPO | 学习主动提问的常规 RFT | [样例位置](https://github.com/modelscope/Trinity-RFT/tree/main/examples/learn_to_ask) |
+| [datajuicer/Trinity-ToolAce-RL-split](https://huggingface.co/datasets/datajuicer/Trinity-ToolAce-RL-split) | CHORD | 动态 SFT 与 RL 联合训练 | [样例位置](https://github.com/modelscope/Trinity-RFT/tree/main/examples/mix_chord) |
+| [datajuicer/Trinity-ToolAce-SFT-split](https://huggingface.co/datasets/datajuicer/Trinity-ToolAce-SFT-split) | CHORD | 动态 SFT 与 RL 联合训练 | [样例位置](https://github.com/modelscope/Trinity-RFT/tree/main/examples/mix_chord) |
+| [Jiayi-Pan/Countdown-Tasks-3to4](https://huggingface.co/datasets/Jiayi-Pan/Countdown-Tasks-3to4) | PPO | 基于 critic 模型的训练 | [样例位置](https://github.com/modelscope/Trinity-RFT/tree/main/examples/ppo_countdown) |
+| | PPO | 使用 Megatron-LM 作为训练后端 | [样例位置](https://github.com/modelscope/Trinity-RFT/tree/main/examples/ppo_countdown_megatron) |
+| | PPO | 使用经验回放进行训练 | [样例位置](https://github.com/modelscope/Trinity-RFT/tree/main/examples/ppo_countdown_exp_replay) |
+| [open-r1/Mixture-of-Thoughts](https://huggingface.co/datasets/open-r1/Mixture-of-Thoughts) | SFT | 常规 SFT | [样例位置](https://github.com/modelscope/Trinity-RFT/tree/main/examples/sft_mot), [相关文档](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_dpo.html#configuration-for-sft) |
+| [HumanLLMs/Human-Like-DPO-Dataset](https://huggingface.co/datasets/HumanLLMs/Human-Like-DPO-Dataset) | DPO | 基于预设人类偏好的训练 | [样例位置](https://github.com/modelscope/Trinity-RFT/tree/main/examples/dpo_humanlike), [相关文档](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_dpo.html) |
+| 示例数据 | DPO | 基于训练环路中人类实时偏好标注的训练 | [样例位置](https://github.com/modelscope/Trinity-RFT/tree/main/examples/dpo_human_in_the_loop), [相关文档](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_data_functionalities.html#example-human-in-the-loop) |