diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 32f5894edc..367719d17e 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -4,7 +4,7 @@ Thank you for your interest in Trinity-RFT! Our framework is built on a decouple ## Where to Contribute -Trinity-RFT provides modular interfaces for different technical interests. Please refer to our [Developer Guide](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/develop_overview.html) for detailed implementation standards: +Trinity-RFT provides modular interfaces for different technical interests. Please refer to our [Developer Guide](https://agentscope-ai.github.io/Trinity-RFT/en/main/tutorial/develop_overview.html) for detailed implementation standards: | Focus Area | Interface/Code Directory | Potential Tasks | | :--- | :--- | :--- | @@ -41,10 +41,10 @@ To ensure a smooth review process, please complete the following: ## Additional Guidelines -- **Bug Reports & Feature Requests**: Please use [GitHub Issues](https://github.com/modelscope/Trinity-RFT/issues). For bugs, include reproduction steps, environment info, and error logs. +- **Bug Reports & Feature Requests**: Please use [GitHub Issues](https://github.com/agentscope-ai/Trinity-RFT/issues). For bugs, include reproduction steps, environment info, and error logs. - **Major Changes**: For significant architectural changes or large features, please open an issue first to discuss the design with the maintainers. - **Documentation**: We highly value improvements to our tutorials, docstrings, and translations. -*For a deep dive into the framework's architecture, please refer to the [Full Doc](https://modelscope.github.io/Trinity-RFT/en/main/index.html).* +*For a deep dive into the framework's architecture, please refer to the [Full Doc](https://agentscope-ai.github.io/Trinity-RFT/en/main/index.html).* **Thank you for helping us build a better Reinforcement Fine-Tuning framework!** diff --git a/README.md b/README.md index 4e68842c90..19fbcc46fa 100644 --- a/README.md +++ b/README.md @@ -1,4 +1,4 @@ -[**中文主页**](https://github.com/modelscope/Trinity-RFT/blob/main/README_zh.md) | [**Tutorial**](https://modelscope.github.io/Trinity-RFT/) | [**FAQ**](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/faq.html) +[**中文主页**](https://github.com/agentscope-ai/Trinity-RFT/blob/main/README_zh.md) | [**Tutorial**](https://agentscope-ai.github.io/Trinity-RFT/) | [**FAQ**](https://agentscope-ai.github.io/Trinity-RFT/en/main/tutorial/faq.html)
Trinity-RFT @@ -9,7 +9,7 @@
[![paper](http://img.shields.io/badge/cs.LG-2505.17826-B31B1B?logo=arxiv&logoColor=red)](https://arxiv.org/abs/2505.17826) -[![doc](https://img.shields.io/badge/Docs-blue?logo=markdown)](https://modelscope.github.io/Trinity-RFT/) +[![doc](https://img.shields.io/badge/Docs-blue?logo=markdown)](https://agentscope-ai.github.io/Trinity-RFT/) [![pypi](https://img.shields.io/pypi/v/trinity-rft?logo=pypi&color=026cad)](https://pypi.org/project/trinity-rft/) ![license](https://img.shields.io/badge/license-Apache--2.0-000000.svg) @@ -26,21 +26,21 @@ It decouples RFT into three components that work in coordination: Trinity-RFT provides functionalities for users with different backgrounds and objectives: -* 🤖 **Agent application developers:** Train LLM-powered agents and improve their capabilities in specific domains [[tutorial]](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/develop_workflow.html) -* 🧠 **Reinforcement learning researchers:** Design, implement and validate new RL algorithms using compact, plug-and-play modules that allow non-invasive customization [[tutorial]](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/develop_algorithm.html) -* 📊 **Data engineers:** Create RFT datasets and build data pipelines for cleaning, augmentation, and human-in-the-loop scenarios [[tutorial]](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/develop_operator.html) +* 🤖 **Agent application developers:** Train LLM-powered agents and improve their capabilities in specific domains [[tutorial]](https://agentscope-ai.github.io/Trinity-RFT/en/main/tutorial/develop_workflow.html) +* 🧠 **Reinforcement learning researchers:** Design, implement and validate new RL algorithms using compact, plug-and-play modules that allow non-invasive customization [[tutorial]](https://agentscope-ai.github.io/Trinity-RFT/en/main/tutorial/develop_algorithm.html) +* 📊 **Data engineers:** Create RFT datasets and build data pipelines for cleaning, augmentation, and human-in-the-loop scenarios [[tutorial]](https://agentscope-ai.github.io/Trinity-RFT/en/main/tutorial/develop_operator.html) ## 🚀 News -* [2026-01] [[Release Notes]](https://github.com/modelscope/Trinity-RFT/releases/tag/v0.4.1) Trinity-RFT v0.4.1 released: upgraded verl to v0.7.0, Tinker backend supports OpenAI API, bug fixes. +* [2026-01] [[Release Notes]](https://github.com/agentscope-ai/Trinity-RFT/releases/tag/v0.4.1) Trinity-RFT v0.4.1 released: upgraded verl to v0.7.0, Tinker backend supports OpenAI API, bug fixes. * [2026-01] Introducing [R3L](https://github.com/shiweijiezero/R3L): a systematic reflect-then-retry RL mechanism with efficient language-guided exploration and stable off-policy learning ([paper](https://arxiv.org/abs/2601.03715)). -* [2025-12] [[Release Notes]](https://github.com/modelscope/Trinity-RFT/releases/tag/v0.4.0) Trinity-RFT v0.4.0 released: added [Tinker](https://thinkingmachines.ai/tinker/) backend for users **without GPUs**, add more benchmarks, enhance online RL and more. +* [2025-12] [[Release Notes]](https://github.com/agentscope-ai/Trinity-RFT/releases/tag/v0.4.0) Trinity-RFT v0.4.0 released: added [Tinker](https://thinkingmachines.ai/tinker/) backend for users **without GPUs**, added more benchmarks, enhanced online RL, and more. * [2025-12] Trinity-RFT powers the medical and health business of "Taobao Shangou", enabling the AI agent to understand vague symptoms, proactively ask follow-up questions, and provide precise recommendations ([News](https://tech.china.com.cn/sx/20251201/411376.shtml)).
-* [2025-11] [[Release Notes](https://github.com/modelscope/Trinity-RFT/releases/tag/v0.3.3)] Trinity-RFT v0.3.3 released: bug fixes. -* [2025-11] Introducing [Learn-to-Ask](https://github.com/modelscope/Trinity-RFT/tree/main/examples/learn_to_ask): a framework for training proactive dialogue agents from offline expert data ([paper](https://arxiv.org/pdf/2510.25441)). -* [2025-11] Introducing [BOTS](https://github.com/modelscope/Trinity-RFT/tree/main/examples/bots): online RL task selection for efficient LLM fine-tuning ([paper](https://arxiv.org/pdf/2510.26374)). -* [2025-09] [Our paper](https://arxiv.org/pdf/2509.24203) reveals a novel off-policy interpretation for group-relative REINFORCE and its variants like GRPO and AsymRE ([implementation](https://github.com/modelscope/Trinity-RFT/tree/main/examples/rec_gsm8k)). -* [2025-08] Introducing [CHORD](https://github.com/modelscope/Trinity-RFT/tree/main/examples/mix_chord): dynamic SFT + RL integration for advanced LLM fine-tuning ([paper](https://arxiv.org/pdf/2508.11408)). +* [2025-11] [[Release Notes](https://github.com/agentscope-ai/Trinity-RFT/releases/tag/v0.3.3)] Trinity-RFT v0.3.3 released: bug fixes. +* [2025-11] Introducing [Learn-to-Ask](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/learn_to_ask): a framework for training proactive dialogue agents from offline expert data ([paper](https://arxiv.org/pdf/2510.25441)). +* [2025-11] Introducing [BOTS](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/bots): online RL task selection for efficient LLM fine-tuning ([paper](https://arxiv.org/pdf/2510.26374)). +* [2025-09] [Our paper](https://arxiv.org/pdf/2509.24203) reveals a novel off-policy interpretation for group-relative REINFORCE and its variants like GRPO and AsymRE ([implementation](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/rec_gsm8k)). +* [2025-08] Introducing [CHORD](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/mix_chord): dynamic SFT + RL integration for advanced LLM fine-tuning ([paper](https://arxiv.org/pdf/2508.11408)).
More...
    @@ -60,15 +60,15 @@ Trinity-RFT provides functionalities for users with different backgrounds and ob | Category | Tutorial / Guideline | |-----------------------------------|------------------------------------------------------------------------------------------------------------------| -| *Run diverse RFT modes* | • [Quick start: GRPO on GSM8k](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_reasoning_basic.html)
    • [Off-policy RFT](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_reasoning_advanced.html)
    • [Fully asynchronous RFT](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_async_mode.html)
    • [Offline learning by DPO or SFT](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_dpo.html)
    • [RFT without local GPU (Tinker Backend)](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_tinker_backend.html) | -| *Multi-step agentic RL* | • [Concatenated multi-turn workflow](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_multi_turn.html)
    • [General multi-step workflow](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_step_wise.html)
    • [ReAct workflow with an agent framework](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_react.html)
    • [Example: train a web-search agent](https://github.com/modelscope/Trinity-RFT/tree/main/examples/agentscope_websearch) | -| *Full-lifecycle data pipelines* | • [Rollout task mixing and selection](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/develop_selector.html)
    • [Online task curriculum](https://github.com/modelscope/Trinity-RFT/tree/main/examples/bots) (📝 [paper](https://arxiv.org/pdf/2510.26374))
    • [Research project: learn-to-ask](https://github.com/modelscope/Trinity-RFT/tree/main/examples/learn_to_ask) (📝 [paper](https://arxiv.org/pdf/2510.25441))
    • [Experience replay with prioritization](https://github.com/modelscope/Trinity-RFT/tree/main/examples/ppo_countdown_exp_replay)
    • [Advanced data processing & human-in-the-loop](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_data_functionalities.html) | -| *Algorithm development* | • [RL algorithm development with Trinity-RFT](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_mix_algo.html) (📝 [paper](https://arxiv.org/pdf/2508.11408))
    • [Research project: R3L (reflect-then-retry RL)](https://github.com/shiweijiezero/R3L) (📝 [paper](https://arxiv.org/abs/2601.03715))
    • [Research project: group-relative REINFORCE](https://github.com/modelscope/Trinity-RFT/tree/main/examples/rec_gsm8k) (📝 [paper](https://arxiv.org/abs/2509.24203))
    • Non-verifiable domains: [RULER](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_gsm8k_ruler), [trainable RULER](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_gsm8k_trainable_ruler), [rubric-as-reward](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_rubric_as_reward) | -| *Benchmarks* | • [Benchmark toolkit (quick verification & experimentation)](https://github.com/modelscope/Trinity-RFT/tree/main/benchmark/README.md)
    • [Guru-Math benchmark & comparison with veRL](https://github.com/modelscope/Trinity-RFT/tree/main/benchmark/reports/guru_math.md)
    • [FrozenLake benchmark & comparison with rLLM](https://github.com/modelscope/Trinity-RFT/tree/main/benchmark/reports/frozenlake.md)
    • [Alfworld benchmark & comparison with rLLM](https://github.com/modelscope/Trinity-RFT/tree/main/benchmark/reports/alfworld.md) | -| *Going deeper into Trinity-RFT* | • [Full configurations](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/trinity_configs.html)
    • [GPU resource and training configuration guide](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/trinity_gpu_configs.html)
    • [Understand the coordination between explorer and trainer](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/synchronizer.html)
    • [How to align configuration with veRL](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/align_with_verl.html) | +| *Run diverse RFT modes* | • [Quick start: GRPO on GSM8k](https://agentscope-ai.github.io/Trinity-RFT/en/main/tutorial/example_reasoning_basic.html)
    • [Off-policy RFT](https://agentscope-ai.github.io/Trinity-RFT/en/main/tutorial/example_reasoning_advanced.html)
    • [Fully asynchronous RFT](https://agentscope-ai.github.io/Trinity-RFT/en/main/tutorial/example_async_mode.html)
    • [Offline learning by DPO or SFT](https://agentscope-ai.github.io/Trinity-RFT/en/main/tutorial/example_dpo.html)
    • [RFT without local GPU (Tinker Backend)](https://agentscope-ai.github.io/Trinity-RFT/en/main/tutorial/example_tinker_backend.html) | +| *Multi-step agentic RL* | • [Concatenated multi-turn workflow](https://agentscope-ai.github.io/Trinity-RFT/en/main/tutorial/example_multi_turn.html)
    • [General multi-step workflow](https://agentscope-ai.github.io/Trinity-RFT/en/main/tutorial/example_step_wise.html)
    • [ReAct workflow with an agent framework](https://agentscope-ai.github.io/Trinity-RFT/en/main/tutorial/example_react.html)
    • [Example: train a web-search agent](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/agentscope_websearch) | +| *Full-lifecycle data pipelines* | • [Rollout task mixing and selection](https://agentscope-ai.github.io/Trinity-RFT/en/main/tutorial/develop_selector.html)
    • [Online task curriculum](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/bots) (📝 [paper](https://arxiv.org/pdf/2510.26374))
    • [Research project: learn-to-ask](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/learn_to_ask) (📝 [paper](https://arxiv.org/pdf/2510.25441))
    • [Experience replay with prioritization](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/ppo_countdown_exp_replay)
    • [Advanced data processing & human-in-the-loop](https://agentscope-ai.github.io/Trinity-RFT/en/main/tutorial/example_data_functionalities.html) | +| *Algorithm development* | • [RL algorithm development with Trinity-RFT](https://agentscope-ai.github.io/Trinity-RFT/en/main/tutorial/example_mix_algo.html) (📝 [paper](https://arxiv.org/pdf/2508.11408))
    • [Research project: R3L (reflect-then-retry RL)](https://github.com/shiweijiezero/R3L) (📝 [paper](https://arxiv.org/abs/2601.03715))
    • [Research project: group-relative REINFORCE](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/rec_gsm8k) (📝 [paper](https://arxiv.org/abs/2509.24203))
    • Non-verifiable domains: [RULER](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/grpo_gsm8k_ruler), [trainable RULER](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/grpo_gsm8k_trainable_ruler), [rubric-as-reward](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/grpo_rubric_as_reward) | +| *Benchmarks* | • [Benchmark toolkit (quick verification & experimentation)](https://github.com/agentscope-ai/Trinity-RFT/tree/main/benchmark/README.md)
    • [Guru-Math benchmark & comparison with veRL](https://github.com/agentscope-ai/Trinity-RFT/tree/main/benchmark/reports/guru_math.md)
    • [FrozenLake benchmark & comparison with rLLM](https://github.com/agentscope-ai/Trinity-RFT/tree/main/benchmark/reports/frozenlake.md)
    • [Alfworld benchmark & comparison with rLLM](https://github.com/agentscope-ai/Trinity-RFT/tree/main/benchmark/reports/alfworld.md) | +| *Going deeper into Trinity-RFT* | • [Full configurations](https://agentscope-ai.github.io/Trinity-RFT/en/main/tutorial/trinity_configs.html)
    • [GPU resource and training configuration guide](https://agentscope-ai.github.io/Trinity-RFT/en/main/tutorial/trinity_gpu_configs.html)
    • [Understand the coordination between explorer and trainer](https://agentscope-ai.github.io/Trinity-RFT/en/main/tutorial/synchronizer.html)
    • [How to align configuration with veRL](https://agentscope-ai.github.io/Trinity-RFT/en/main/tutorial/align_with_verl.html) | > [!NOTE] -> For more tutorials, please refer to the [Trinity-RFT documentation](https://modelscope.github.io/Trinity-RFT/). +> For more tutorials, please refer to the [Trinity-RFT documentation](https://agentscope-ai.github.io/Trinity-RFT/). ## 🌟 Key Features @@ -98,19 +98,19 @@ Trinity-RFT provides functionalities for users with different backgrounds and ob | Algorithm | Doc / Example | Source Code | Key Configurations | |------------------------|-------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------|--------------------------------| -| PPO [[Paper](https://arxiv.org/pdf/1707.06347)] | [[Doc](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_reasoning_basic.html)] [[Countdown Example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/ppo_countdown)] | [[Code](https://github.com/modelscope/Trinity-RFT/tree/main/trinity/algorithm/policy_loss_fn/ppo_policy_loss.py)] | `algorithm_type: ppo` | -| GRPO [[Paper](https://arxiv.org/pdf/2402.03300)] | [[Doc](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_reasoning_basic.html)] [[GSM8K Example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_gsm8k)] | [[Code](https://github.com/modelscope/Trinity-RFT/tree/main/trinity/algorithm/advantage_fn/grpo_advantage.py)] | `algorithm_type: grpo` | -| CHORD 💡 [[Paper](https://arxiv.org/pdf/2508.11408)] | [[Doc](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_mix_algo.html)] [[ToolACE Example](https://github.com/modelscope/Trinity-RFT/blob/main/examples/mix_chord/mix_chord_toolace.yaml)] | [[Code](https://github.com/modelscope/Trinity-RFT/tree/main/trinity/algorithm/policy_loss_fn/chord_policy_loss.py)] | `algorithm_type: mix_chord` | -| REC Series 💡 [[Paper](https://arxiv.org/pdf/2509.24203)] | [[GSM8K Example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/rec_gsm8k)] | [[Code](https://github.com/modelscope/Trinity-RFT/tree/main/trinity/algorithm/policy_loss_fn/rec_policy_loss.py)] | `algorithm_type: rec` | -| RLOO [[Paper](https://arxiv.org/pdf/2402.14740)] | - | [[Code](https://github.com/modelscope/Trinity-RFT/tree/main/trinity/algorithm/advantage_fn/rloo_advantage.py)] | `algorithm_type: rloo` | -| REINFORCE++ [[Paper](https://arxiv.org/pdf/2501.03262)] | - | [[Code](https://github.com/modelscope/Trinity-RFT/tree/main/trinity/algorithm/advantage_fn/reinforce_advantage.py)] | `algorithm_type: reinforceplusplus` | -| GSPO [[Paper](https://arxiv.org/pdf/2507.18071)] | - | [[Code](https://github.com/modelscope/Trinity-RFT/tree/main/trinity/algorithm/policy_loss_fn/gspo_policy_loss.py)] | `algorithm_type: gspo` | -| TOPR [[Paper](https://arxiv.org/pdf/2503.14286)] | [[GSM8K Example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/topr_gsm8k)] | [[Code](https://github.com/modelscope/Trinity-RFT/tree/main/trinity/algorithm/policy_loss_fn/topr_policy_loss.py)] | `algorithm_type: topr` | -| sPPO [[Paper](https://arxiv.org/pdf/2108.05828)] | [[GSM8K Example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/sppo_gsm8k)] | [[Code](https://github.com/modelscope/Trinity-RFT/tree/main/trinity/algorithm/policy_loss_fn/sppo_loss_fn.py)] | `algorithm_type: sppo` | -| AsymRE [[Paper](https://arxiv.org/pdf/2506.20520)] | [[GSM8K 
Example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/asymre_gsm8k)] | [[Code](https://github.com/modelscope/Trinity-RFT/tree/main/trinity/algorithm/advantage_fn/asymre_advantage.py)] | `algorithm_type: asymre` | -| CISPO [[Paper](https://arxiv.org/pdf/2506.13585)] | - | [[Code](https://github.com/modelscope/Trinity-RFT/tree/main/trinity/algorithm/policy_loss_fn/cispo_policy_loss.py)] | `algorithm_type: cispo` | -| SAPO [[Paper](https://arxiv.org/pdf/2511.20347)] | - | [[Code](https://github.com/modelscope/Trinity-RFT/tree/main/trinity/algorithm/policy_loss_fn/sapo_policy_loss.py)] | `algorithm_type: sapo` | -| On-Policy Distillation [[Blog](https://thinkingmachines.ai/blog/on-policy-distillation/)] [[Paper](https://arxiv.org/pdf/2306.13649)] | [[GSM8K Example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/on_policy_distill)] | [[Code](https://github.com/modelscope/Trinity-RFT/tree/main/trinity/common/workflows/on_policy_distill_workflow.py)] | `algorithm_type: on_policy_distill` | +| PPO [[Paper](https://arxiv.org/pdf/1707.06347)] | [[Doc](https://agentscope-ai.github.io/Trinity-RFT/en/main/tutorial/example_reasoning_basic.html)] [[Countdown Example](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/ppo_countdown)] | [[Code](https://github.com/agentscope-ai/Trinity-RFT/tree/main/trinity/algorithm/policy_loss_fn/ppo_policy_loss.py)] | `algorithm_type: ppo` | +| GRPO [[Paper](https://arxiv.org/pdf/2402.03300)] | [[Doc](https://agentscope-ai.github.io/Trinity-RFT/en/main/tutorial/example_reasoning_basic.html)] [[GSM8K Example](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/grpo_gsm8k)] | [[Code](https://github.com/agentscope-ai/Trinity-RFT/tree/main/trinity/algorithm/advantage_fn/grpo_advantage.py)] | `algorithm_type: grpo` | +| CHORD 💡 [[Paper](https://arxiv.org/pdf/2508.11408)] | [[Doc](https://agentscope-ai.github.io/Trinity-RFT/en/main/tutorial/example_mix_algo.html)] [[ToolACE Example](https://github.com/agentscope-ai/Trinity-RFT/blob/main/examples/mix_chord/mix_chord_toolace.yaml)] | [[Code](https://github.com/agentscope-ai/Trinity-RFT/tree/main/trinity/algorithm/policy_loss_fn/chord_policy_loss.py)] | `algorithm_type: mix_chord` | +| REC Series 💡 [[Paper](https://arxiv.org/pdf/2509.24203)] | [[GSM8K Example](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/rec_gsm8k)] | [[Code](https://github.com/agentscope-ai/Trinity-RFT/tree/main/trinity/algorithm/policy_loss_fn/rec_policy_loss.py)] | `algorithm_type: rec` | +| RLOO [[Paper](https://arxiv.org/pdf/2402.14740)] | - | [[Code](https://github.com/agentscope-ai/Trinity-RFT/tree/main/trinity/algorithm/advantage_fn/rloo_advantage.py)] | `algorithm_type: rloo` | +| REINFORCE++ [[Paper](https://arxiv.org/pdf/2501.03262)] | - | [[Code](https://github.com/agentscope-ai/Trinity-RFT/tree/main/trinity/algorithm/advantage_fn/reinforce_advantage.py)] | `algorithm_type: reinforceplusplus` | +| GSPO [[Paper](https://arxiv.org/pdf/2507.18071)] | - | [[Code](https://github.com/agentscope-ai/Trinity-RFT/tree/main/trinity/algorithm/policy_loss_fn/gspo_policy_loss.py)] | `algorithm_type: gspo` | +| TOPR [[Paper](https://arxiv.org/pdf/2503.14286)] | [[GSM8K Example](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/topr_gsm8k)] | [[Code](https://github.com/agentscope-ai/Trinity-RFT/tree/main/trinity/algorithm/policy_loss_fn/topr_policy_loss.py)] | `algorithm_type: topr` | +| sPPO [[Paper](https://arxiv.org/pdf/2108.05828)] | [[GSM8K 
Example](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/sppo_gsm8k)] | [[Code](https://github.com/agentscope-ai/Trinity-RFT/tree/main/trinity/algorithm/policy_loss_fn/sppo_loss_fn.py)] | `algorithm_type: sppo` | +| AsymRE [[Paper](https://arxiv.org/pdf/2506.20520)] | [[GSM8K Example](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/asymre_gsm8k)] | [[Code](https://github.com/agentscope-ai/Trinity-RFT/tree/main/trinity/algorithm/advantage_fn/asymre_advantage.py)] | `algorithm_type: asymre` | +| CISPO [[Paper](https://arxiv.org/pdf/2506.13585)] | - | [[Code](https://github.com/agentscope-ai/Trinity-RFT/tree/main/trinity/algorithm/policy_loss_fn/cispo_policy_loss.py)] | `algorithm_type: cispo` | +| SAPO [[Paper](https://arxiv.org/pdf/2511.20347)] | - | [[Code](https://github.com/agentscope-ai/Trinity-RFT/tree/main/trinity/algorithm/policy_loss_fn/sapo_policy_loss.py)] | `algorithm_type: sapo` | +| On-Policy Distillation [[Blog](https://thinkingmachines.ai/blog/on-policy-distillation/)] [[Paper](https://arxiv.org/pdf/2306.13649)] | [[GSM8K Example](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/on_policy_distill)] | [[Code](https://github.com/agentscope-ai/Trinity-RFT/tree/main/trinity/common/workflows/on_policy_distill_workflow.py)] | `algorithm_type: on_policy_distill` | --- @@ -152,7 +152,7 @@ Run a simple example: trinity run --config examples/tinker/tinker.yaml ``` -This example is designed to run on CPU-only machines. See the complete [Tinker training example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/tinker) for more details. +This example is designed to run on CPU-only machines. See the complete [Tinker training example](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/tinker) for more details. To run Trinity-RFT on GPU machines instead, please follow the steps below. @@ -179,7 +179,7 @@ If you plan to customize or contribute to Trinity-RFT, this is the best option. First, clone the repository: ```bash -git clone https://github.com/modelscope/Trinity-RFT +git clone https://github.com/agentscope-ai/Trinity-RFT cd Trinity-RFT ``` @@ -259,7 +259,7 @@ uv pip install trinity-rft uv pip install flash-attn==2.8.1 ``` -> For training with **Megatron-LM**, please refer to [Megatron-LM Backend](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_megatron.html). +> For training with **Megatron-LM**, please refer to [Megatron-LM Backend](https://agentscope-ai.github.io/Trinity-RFT/en/main/tutorial/example_megatron.html). ### Step 2: prepare dataset and model Trinity-RFT supports most datasets and models from Huggingface and ModelScope. @@ -333,7 +333,7 @@ ray start --head ray start --address= ``` -(Optional) You may use [Wandb](https://docs.wandb.ai/quickstart/) / [TensorBoard](https://www.tensorflow.org/tensorboard) / [MLFlow](https://mlflow.org) for better monitoring. Please refer to [this documentation](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/trinity_configs.html#monitor-configuration) for the corresponding configurations. +(Optional) You may use [Wandb](https://docs.wandb.ai/quickstart/) / [TensorBoard](https://www.tensorflow.org/tensorboard) / [MLFlow](https://mlflow.org) for better monitoring. Please refer to [this documentation](https://agentscope-ai.github.io/Trinity-RFT/en/main/tutorial/trinity_configs.html#monitor-configuration) for the corresponding configurations. 
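As a rough sketch of how these settings fit together, the snippet below shows where the `algorithm_type` values from the algorithm table above and a monitor choice might sit in a run config. Apart from `algorithm_type` itself, the section layout and key names are illustrative assumptions; please check the linked configuration documentation for the exact schema.

```yaml
# Illustrative sketch only: the `algorithm_type` values come from the algorithm
# table above, while the surrounding section and key names are assumptions
# rather than the verified Trinity-RFT schema (see the configuration docs).
algorithm:
  algorithm_type: grpo   # e.g. ppo, grpo, mix_chord, rec, ...
monitor:
  monitor_type: wandb    # or tensorboard / mlflow, as mentioned above
```

A config like this would then be launched the same way as the quick-start example, e.g. `trinity run --config my_config.yaml` (the file name here is only a placeholder).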
For example, to log in to Wandb: ```shell @@ -369,7 +369,7 @@ We welcome contributions of all kinds, including: If you're new to the project, documentation and example updates are a great place to start. -See [CONTRIBUTING.md](./CONTRIBUTING.md) for detailed contribution guidelines, as well as our [good-first-issue list](https://github.com/modelscope/Trinity-RFT/issues/470). +See [CONTRIBUTING.md](./CONTRIBUTING.md) for detailed contribution guidelines, as well as our [good-first-issue list](https://github.com/agentscope-ai/Trinity-RFT/issues/470). ## Acknowledgements diff --git a/README_zh.md b/README_zh.md index 305f89b68f..9348f6c907 100644 --- a/README_zh.md +++ b/README_zh.md @@ -1,4 +1,4 @@ -[**English Homepage**](https://github.com/modelscope/Trinity-RFT/blob/main/README.md) | [**中文文档**](https://modelscope.github.io/Trinity-RFT/zh/) | [**常见问题**](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/faq.html) +[**English Homepage**](https://github.com/agentscope-ai/Trinity-RFT/blob/main/README.md) | [**中文文档**](https://agentscope-ai.github.io/Trinity-RFT/zh/) | [**常见问题**](https://agentscope-ai.github.io/Trinity-RFT/zh/main/tutorial/faq.html)
    Trinity-RFT @@ -12,7 +12,7 @@
    [![paper](http://img.shields.io/badge/cs.LG-2505.17826-B31B1B?logo=arxiv&logoColor=red)](https://arxiv.org/abs/2505.17826) -[![doc](https://img.shields.io/badge/Docs-blue?logo=markdown)](https://modelscope.github.io/Trinity-RFT/) +[![doc](https://img.shields.io/badge/Docs-blue?logo=markdown)](https://agentscope-ai.github.io/Trinity-RFT/) [![pypi](https://img.shields.io/pypi/v/trinity-rft?logo=pypi&color=026cad)](https://pypi.org/project/trinity-rft/) ![license](https://img.shields.io/badge/license-Apache--2.0-000000.svg) @@ -31,26 +31,26 @@ Trinity-RFT 是一个通用、灵活、用户友好的大语言模型(LLM) Trinity-RFT 面向不同背景和目标的用户提供相应功能: -* 🤖 **智能体应用开发者:** 训练智能体应用,以增强其在特定领域中完成任务的能力 [[教程]](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/develop_workflow.html) +* 🤖 **智能体应用开发者:** 训练智能体应用,以增强其在特定领域中完成任务的能力 [[教程]](https://agentscope-ai.github.io/Trinity-RFT/zh/main/tutorial/develop_workflow.html) -* 🧠 **强化学习算法研究者:** 通过定制化简洁、可插拔的模块,设计、实现与验证新的强化学习算法 [[教程]](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/develop_algorithm.html) +* 🧠 **强化学习算法研究者:** 通过定制化简洁、可插拔的模块,设计、实现与验证新的强化学习算法 [[教程]](https://agentscope-ai.github.io/Trinity-RFT/zh/main/tutorial/develop_algorithm.html) -* 📊 **数据工程师:** 设计针对任务定制的数据集,构建处理流水线以支持数据清洗、增强以及人类参与场景 [[教程]](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/develop_operator.html) +* 📊 **数据工程师:** 设计针对任务定制的数据集,构建处理流水线以支持数据清洗、增强以及人类参与场景 [[教程]](https://agentscope-ai.github.io/Trinity-RFT/zh/main/tutorial/develop_operator.html) ## 🚀 新闻 -* [2026-01] [[发布说明]](https://github.com/modelscope/Trinity-RFT/releases/tag/v0.4.1) Trinity-RFT v0.4.1 发布:升级 verl 至 v0.7.0,Tinker 后端支持 OpenAI API,修复若干 Bug。 +* [2026-01] [[发布说明]](https://github.com/agentscope-ai/Trinity-RFT/releases/tag/v0.4.1) Trinity-RFT v0.4.1 发布:升级 verl 至 v0.7.0,Tinker 后端支持 OpenAI API,修复若干 Bug。 * [2026-01] 推出 [R3L](https://github.com/shiweijiezero/R3L):基于反思-重试的强化学习机制,由自然语言反馈引导高效探索,并达成稳定的 off-policy 学习([论文](https://arxiv.org/abs/2601.03715))。 -* [2025-12] [[发布说明]](https://github.com/modelscope/Trinity-RFT/releases/tag/v0.4.0) Trinity-RFT v0.4.0 发布:新增[Tinker](https://thinkingmachines.ai/tinker/) 后端以支持在 **无 GPU** 的设备上训练,增加更多基准测试,增强在线 RL 等功能。 +* [2025-12] [[发布说明]](https://github.com/agentscope-ai/Trinity-RFT/releases/tag/v0.4.0) Trinity-RFT v0.4.0 发布:新增[Tinker](https://thinkingmachines.ai/tinker/) 后端以支持在 **无 GPU** 的设备上训练,增加更多基准测试,增强在线 RL 等功能。 * [2025-12] Trinity-RFT 已支持 [tinker](https://thinkingmachines.ai/tinker/) 训练后端,可在**无 GPU 的设备**上进行模型训练。 * [2025-12] Trinity-RFT 助力淘宝闪购医药健康业务,让 AI 智能体能够理解模糊症状、主动询问后续问题,并提供精准推荐([新闻](https://tech.china.com.cn/sx/20251201/411376.shtml))。 -* [2025-11] [[发布说明](https://github.com/modelscope/Trinity-RFT/releases/tag/v0.3.3)] Trinity-RFT v0.3.3 发布:修复若干 Bug。 -* [2025-11] 推出 [Learn-to-Ask](https://github.com/modelscope/Trinity-RFT/tree/main/examples/learn_to_ask):利用离线专家数据,训练具备主动问询能力的对话智能体([论文](https://arxiv.org/pdf/2510.25441)). 
-* [2025-11] 推出 [BOTS](https://github.com/modelscope/Trinity-RFT/tree/main/examples/bots):在线 RL 任务选择,实现高效 LLM 微调([论文](https://arxiv.org/pdf/2510.26374))。 -* [2025-09] 我们的 [论文](https://arxiv.org/pdf/2509.24203) 揭示了 group-relative REINFORCE 及其变种(如 GRPO 和 AsymRE)的 off-policy 解释([代码](https://github.com/modelscope/Trinity-RFT/tree/main/examples/rec_gsm8k))。 -* [2025-08] 推出 [CHORD](https://github.com/modelscope/Trinity-RFT/tree/main/examples/mix_chord):动态 SFT + RL 集成,实现进阶 LLM 微调([论文](https://arxiv.org/pdf/2508.11408))。 +* [2025-11] [[发布说明](https://github.com/agentscope-ai/Trinity-RFT/releases/tag/v0.3.3)] Trinity-RFT v0.3.3 发布:修复若干 Bug。 +* [2025-11] 推出 [Learn-to-Ask](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/learn_to_ask):利用离线专家数据,训练具备主动问询能力的对话智能体([论文](https://arxiv.org/pdf/2510.25441)). +* [2025-11] 推出 [BOTS](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/bots):在线 RL 任务选择,实现高效 LLM 微调([论文](https://arxiv.org/pdf/2510.26374))。 +* [2025-09] 我们的 [论文](https://arxiv.org/pdf/2509.24203) 揭示了 group-relative REINFORCE 及其变种(如 GRPO 和 AsymRE)的 off-policy 解释([代码](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/rec_gsm8k))。 +* [2025-08] 推出 [CHORD](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/mix_chord):动态 SFT + RL 集成,实现进阶 LLM 微调([论文](https://arxiv.org/pdf/2508.11408))。
    More...
      @@ -73,16 +73,16 @@ Trinity-RFT 面向不同背景和目标的用户提供相应功能: | 类别 | 教程 / 指南 | | --- | ----| -| *运行各种 RFT 模式* | + [快速开始:在 GSM8k 上运行 GRPO](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_reasoning_basic.html)
      + [Off-policy RFT](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_reasoning_advanced.html)
      + [全异步 RFT](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_async_mode.html)
      + [通过 DPO 或 SFT 进行离线学习](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_dpo.html)
      + [在无GPU环境下运行RFT训练(Tinker 后端)](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_tinker_backend.html) | -| *多轮智能体强化学习* | + [拼接多轮任务](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_multi_turn.html)
      + [通用多轮任务](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_step_wise.html)
      + [调用智能体框架中的 ReAct 工作流](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_react.html)
      + [例子:训练一个网络搜索智能体](https://github.com/modelscope/Trinity-RFT/tree/main/examples/agentscope_websearch) | -| *全生命周期的数据流水线* | + [Rollout 任务混合与选取](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/develop_selector.html)
      + [在线任务选择](https://github.com/modelscope/Trinity-RFT/tree/main/examples/bots) (📝 [论文](https://arxiv.org/pdf/2510.26374))
      + [研究项目:learn-to-ask](https://github.com/modelscope/Trinity-RFT/tree/main/examples/learn_to_ask) (📝 [论文](https://arxiv.org/pdf/2510.25441))
      + [经验回放机制](https://github.com/modelscope/Trinity-RFT/tree/main/examples/ppo_countdown_exp_replay)
      + [高级数据处理能力 & Human-in-the-loop](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_data_functionalities.html) | -| *强化学习算法开发* | + [使用 Trinity-RFT 进行 RL 算法开发](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_mix_algo.html) (📝 [论文](https://arxiv.org/pdf/2508.11408))
      + [研究项目: R3L (基于反思-重试的强化学习)](https://github.com/shiweijiezero/R3L) (📝 [论文](https://arxiv.org/abs/2601.03715))
      + [研究项目: group-relative REINFORCE](https://github.com/modelscope/Trinity-RFT/tree/main/examples/rec_gsm8k) (📝 [论文](https://arxiv.org/abs/2509.24203))
      + 不可验证的领域: [RULER](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_gsm8k_ruler), [可训练 RULER](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_gsm8k_trainable_ruler), [rubric-as-reward](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_rubric_as_reward) | -| *基准测试* | + [基准测试工具 (快速验证与实验)](https://github.com/modelscope/Trinity-RFT/tree/main/benchmark/README.md)
      + [Guru-Math 测试 & 对比 veRL](https://github.com/modelscope/Trinity-RFT/tree/main/benchmark/reports/guru_math.md)
      + [FrozenLake 测试 & 对比 rLLM](https://github.com/modelscope/Trinity-RFT/tree/main/benchmark/reports/frozenlake.md)
      + [Alfworld 测试 & 对比 rLLM](https://github.com/modelscope/Trinity-RFT/tree/main/benchmark/reports/alfworld.md) | -| *深入认识 Trinity-RFT* | + [完整配置指南](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/trinity_configs.html)
      + [GPU 资源与训练配置对应指南](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/trinity_gpu_configs.html)
      + [理解 explorer-trainer 同步逻辑](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/synchronizer.html)
      + [如何与 verl 对齐配置](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/align_with_verl.html) | +| *运行各种 RFT 模式* | + [快速开始:在 GSM8k 上运行 GRPO](https://agentscope-ai.github.io/Trinity-RFT/zh/main/tutorial/example_reasoning_basic.html)
      + [Off-policy RFT](https://agentscope-ai.github.io/Trinity-RFT/zh/main/tutorial/example_reasoning_advanced.html)
      + [全异步 RFT](https://agentscope-ai.github.io/Trinity-RFT/zh/main/tutorial/example_async_mode.html)
      + [通过 DPO 或 SFT 进行离线学习](https://agentscope-ai.github.io/Trinity-RFT/zh/main/tutorial/example_dpo.html)
      + [在无GPU环境下运行RFT训练(Tinker 后端)](https://agentscope-ai.github.io/Trinity-RFT/zh/main/tutorial/example_tinker_backend.html) | +| *多轮智能体强化学习* | + [拼接多轮任务](https://agentscope-ai.github.io/Trinity-RFT/zh/main/tutorial/example_multi_turn.html)
      + [通用多轮任务](https://agentscope-ai.github.io/Trinity-RFT/zh/main/tutorial/example_step_wise.html)
      + [调用智能体框架中的 ReAct 工作流](https://agentscope-ai.github.io/Trinity-RFT/zh/main/tutorial/example_react.html)
      + [例子:训练一个网络搜索智能体](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/agentscope_websearch) | +| *全生命周期的数据流水线* | + [Rollout 任务混合与选取](https://agentscope-ai.github.io/Trinity-RFT/zh/main/tutorial/develop_selector.html)
      + [在线任务选择](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/bots) (📝 [论文](https://arxiv.org/pdf/2510.26374))
      + [研究项目:learn-to-ask](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/learn_to_ask) (📝 [论文](https://arxiv.org/pdf/2510.25441))
      + [经验回放机制](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/ppo_countdown_exp_replay)
      + [高级数据处理能力 & Human-in-the-loop](https://agentscope-ai.github.io/Trinity-RFT/zh/main/tutorial/example_data_functionalities.html) | +| *强化学习算法开发* | + [使用 Trinity-RFT 进行 RL 算法开发](https://agentscope-ai.github.io/Trinity-RFT/zh/main/tutorial/example_mix_algo.html) (📝 [论文](https://arxiv.org/pdf/2508.11408))
      + [研究项目: R3L (基于反思-重试的强化学习)](https://github.com/shiweijiezero/R3L) (📝 [论文](https://arxiv.org/abs/2601.03715))
      + [研究项目: group-relative REINFORCE](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/rec_gsm8k) (📝 [论文](https://arxiv.org/abs/2509.24203))
      + 不可验证的领域: [RULER](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/grpo_gsm8k_ruler), [可训练 RULER](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/grpo_gsm8k_trainable_ruler), [rubric-as-reward](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/grpo_rubric_as_reward) | +| *基准测试* | + [基准测试工具 (快速验证与实验)](https://github.com/agentscope-ai/Trinity-RFT/tree/main/benchmark/README.md)
      + [Guru-Math 测试 & 对比 veRL](https://github.com/agentscope-ai/Trinity-RFT/tree/main/benchmark/reports/guru_math.md)
      + [FrozenLake 测试 & 对比 rLLM](https://github.com/agentscope-ai/Trinity-RFT/tree/main/benchmark/reports/frozenlake.md)
      + [Alfworld 测试 & 对比 rLLM](https://github.com/agentscope-ai/Trinity-RFT/tree/main/benchmark/reports/alfworld.md) | +| *深入认识 Trinity-RFT* | + [完整配置指南](https://agentscope-ai.github.io/Trinity-RFT/zh/main/tutorial/trinity_configs.html)
      + [GPU 资源与训练配置对应指南](https://agentscope-ai.github.io/Trinity-RFT/zh/main/tutorial/trinity_gpu_configs.html)
      + [理解 explorer-trainer 同步逻辑](https://agentscope-ai.github.io/Trinity-RFT/zh/main/tutorial/synchronizer.html)
      + [如何与 verl 对齐配置](https://agentscope-ai.github.io/Trinity-RFT/zh/main/tutorial/align_with_verl.html) | > [!NOTE] -> 更多教程请参考 [Trinity-RFT 文档](https://modelscope.github.io/Trinity-RFT/)。 +> 更多教程请参考 [Trinity-RFT 文档](https://agentscope-ai.github.io/Trinity-RFT/)。 @@ -117,23 +117,23 @@ Trinity-RFT 面向不同背景和目标的用户提供相应功能: ## 🔨 算法支持 -下表列出了 Trinity-RFT 支持的算法,更多算法请参考 [算法模块](https://github.com/modelscope/Trinity-RFT/blob/main/trinity/algorithm/algorithm.py)。您也可以通过自定义不同的模块来构建新算法,参见 [教程](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/develop_algorithm.html)。 +下表列出了 Trinity-RFT 支持的算法,更多算法请参考 [算法模块](https://github.com/agentscope-ai/Trinity-RFT/blob/main/trinity/algorithm/algorithm.py)。您也可以通过自定义不同的模块来构建新算法,参见 [教程](https://agentscope-ai.github.io/Trinity-RFT/zh/main/tutorial/develop_algorithm.html)。 | 算法 | 文档/示例 | 核心代码 | 关键配置 | |:-----------|:-----------|:---------------|:-----------| -| PPO [[论文](https://arxiv.org/pdf/1707.06347)] | [[文档](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_reasoning_basic.html)] [[Countdown 例子](https://github.com/modelscope/Trinity-RFT/tree/main/examples/ppo_countdown)] | [[代码](https://github.com/modelscope/Trinity-RFT/tree/main/trinity/algorithm/policy_loss_fn/ppo_policy_loss.py)] | `algorithm_type: ppo` | -| GRPO [[论文](https://arxiv.org/pdf/2402.03300)] | [[文档](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_reasoning_basic.html)] [[GSM8K 例子](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_gsm8k)]| [[代码](https://github.com/modelscope/Trinity-RFT/tree/main/trinity/algorithm/advantage_fn/grpo_advantage.py)] | `algorithm_type: grpo` | -| CHORD 💡 [[论文](https://arxiv.org/pdf/2508.11408)] | [[文档](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_mix_algo.html)] [[ToolACE 例子](https://github.com/modelscope/Trinity-RFT/blob/main/examples/mix_chord/mix_chord_toolace.yaml)] | [[代码](https://github.com/modelscope/Trinity-RFT/tree/main/trinity/algorithm/policy_loss_fn/chord_policy_loss.py)] | `algorithm_type: mix_chord` | -| REC Series 💡 [[论文](https://arxiv.org/pdf/2509.24203)] | [[GSM8K 例子](https://github.com/modelscope/Trinity-RFT/tree/main/examples/rec_gsm8k)] | [[代码](https://github.com/modelscope/Trinity-RFT/tree/main/trinity/algorithm/policy_loss_fn/rec_policy_loss.py)] | `algorithm_type: rec` | -| RLOO [[论文](https://arxiv.org/pdf/2402.14740)] | - | [[代码](https://github.com/modelscope/Trinity-RFT/tree/main/trinity/algorithm/advantage_fn/rloo_advantage.py)] | `algorithm_type: rloo` | -| REINFORCE++ [[论文](https://arxiv.org/pdf/2501.03262)] | - | [[代码](https://github.com/modelscope/Trinity-RFT/tree/main/trinity/algorithm/advantage_fn/reinforce_advantage.py)] | `algorithm_type: reinforceplusplus` | -| GSPO [[论文](https://arxiv.org/pdf/2507.18071)] | - | [[代码](https://github.com/modelscope/Trinity-RFT/tree/main/trinity/algorithm/policy_loss_fn/gspo_policy_loss.py)] | `algorithm_type: gspo` | -| TOPR [[论文](https://arxiv.org/pdf/2503.14286)] | [[GSM8K 例子](https://github.com/modelscope/Trinity-RFT/tree/main/examples/topr_gsm8k)] | [[代码](https://github.com/modelscope/Trinity-RFT/tree/main/trinity/algorithm/policy_loss_fn/topr_policy_loss.py)] | `algorithm_type: topr` | -| sPPO [[论文](https://arxiv.org/pdf/2108.05828)] | [[GSM8K 例子](https://github.com/modelscope/Trinity-RFT/tree/main/examples/sppo_gsm8k)] | [[代码](https://github.com/modelscope/Trinity-RFT/tree/main/trinity/algorithm/policy_loss_fn/sppo_loss_fn.py)] | `algorithm_type: sppo` | -| AsymRE [[论文](https://arxiv.org/pdf/2506.20520)] | [[GSM8K 
例子](https://github.com/modelscope/Trinity-RFT/tree/main/examples/asymre_gsm8k)] | [[代码](https://github.com/modelscope/Trinity-RFT/tree/main/trinity/algorithm/advantage_fn/asymre_advantage.py)] | `algorithm_type: asymre` | -| CISPO [[论文](https://arxiv.org/pdf/2506.13585)] | - | [[代码](https://github.com/modelscope/Trinity-RFT/tree/main/trinity/algorithm/policy_loss_fn/cispo_policy_loss.py)] | `algorithm_type: cispo` | -| SAPO [[论文](https://arxiv.org/pdf/2511.20347)] | - | [[代码](https://github.com/modelscope/Trinity-RFT/tree/main/trinity/algorithm/policy_loss_fn/sapo_policy_loss.py)] | `algorithm_type: sapo` | -| On-Policy Distillation [[博客](https://thinkingmachines.ai/blog/on-policy-distillation/)] [[论文](https://arxiv.org/pdf/2306.13649)] | [[GSM8K 示例](https://github.com/modelscope/Trinity-RFT/tree/main/examples/on_policy_distill)] | [[代码](https://github.com/modelscope/Trinity-RFT/tree/main/trinity/common/workflows/on_policy_distill_workflow.py)] | `algorithm_type: on_policy_distill` | +| PPO [[论文](https://arxiv.org/pdf/1707.06347)] | [[文档](https://agentscope-ai.github.io/Trinity-RFT/zh/main/tutorial/example_reasoning_basic.html)] [[Countdown 例子](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/ppo_countdown)] | [[代码](https://github.com/agentscope-ai/Trinity-RFT/tree/main/trinity/algorithm/policy_loss_fn/ppo_policy_loss.py)] | `algorithm_type: ppo` | +| GRPO [[论文](https://arxiv.org/pdf/2402.03300)] | [[文档](https://agentscope-ai.github.io/Trinity-RFT/zh/main/tutorial/example_reasoning_basic.html)] [[GSM8K 例子](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/grpo_gsm8k)]| [[代码](https://github.com/agentscope-ai/Trinity-RFT/tree/main/trinity/algorithm/advantage_fn/grpo_advantage.py)] | `algorithm_type: grpo` | +| CHORD 💡 [[论文](https://arxiv.org/pdf/2508.11408)] | [[文档](https://agentscope-ai.github.io/Trinity-RFT/zh/main/tutorial/example_mix_algo.html)] [[ToolACE 例子](https://github.com/agentscope-ai/Trinity-RFT/blob/main/examples/mix_chord/mix_chord_toolace.yaml)] | [[代码](https://github.com/agentscope-ai/Trinity-RFT/tree/main/trinity/algorithm/policy_loss_fn/chord_policy_loss.py)] | `algorithm_type: mix_chord` | +| REC Series 💡 [[论文](https://arxiv.org/pdf/2509.24203)] | [[GSM8K 例子](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/rec_gsm8k)] | [[代码](https://github.com/agentscope-ai/Trinity-RFT/tree/main/trinity/algorithm/policy_loss_fn/rec_policy_loss.py)] | `algorithm_type: rec` | +| RLOO [[论文](https://arxiv.org/pdf/2402.14740)] | - | [[代码](https://github.com/agentscope-ai/Trinity-RFT/tree/main/trinity/algorithm/advantage_fn/rloo_advantage.py)] | `algorithm_type: rloo` | +| REINFORCE++ [[论文](https://arxiv.org/pdf/2501.03262)] | - | [[代码](https://github.com/agentscope-ai/Trinity-RFT/tree/main/trinity/algorithm/advantage_fn/reinforce_advantage.py)] | `algorithm_type: reinforceplusplus` | +| GSPO [[论文](https://arxiv.org/pdf/2507.18071)] | - | [[代码](https://github.com/agentscope-ai/Trinity-RFT/tree/main/trinity/algorithm/policy_loss_fn/gspo_policy_loss.py)] | `algorithm_type: gspo` | +| TOPR [[论文](https://arxiv.org/pdf/2503.14286)] | [[GSM8K 例子](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/topr_gsm8k)] | [[代码](https://github.com/agentscope-ai/Trinity-RFT/tree/main/trinity/algorithm/policy_loss_fn/topr_policy_loss.py)] | `algorithm_type: topr` | +| sPPO [[论文](https://arxiv.org/pdf/2108.05828)] | [[GSM8K 例子](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/sppo_gsm8k)] | 
[[代码](https://github.com/agentscope-ai/Trinity-RFT/tree/main/trinity/algorithm/policy_loss_fn/sppo_loss_fn.py)] | `algorithm_type: sppo` | +| AsymRE [[论文](https://arxiv.org/pdf/2506.20520)] | [[GSM8K 例子](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/asymre_gsm8k)] | [[代码](https://github.com/agentscope-ai/Trinity-RFT/tree/main/trinity/algorithm/advantage_fn/asymre_advantage.py)] | `algorithm_type: asymre` | +| CISPO [[论文](https://arxiv.org/pdf/2506.13585)] | - | [[代码](https://github.com/agentscope-ai/Trinity-RFT/tree/main/trinity/algorithm/policy_loss_fn/cispo_policy_loss.py)] | `algorithm_type: cispo` | +| SAPO [[论文](https://arxiv.org/pdf/2511.20347)] | - | [[代码](https://github.com/agentscope-ai/Trinity-RFT/tree/main/trinity/algorithm/policy_loss_fn/sapo_policy_loss.py)] | `algorithm_type: sapo` | +| On-Policy Distillation [[博客](https://thinkingmachines.ai/blog/on-policy-distillation/)] [[论文](https://arxiv.org/pdf/2306.13649)] | [[GSM8K 示例](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/on_policy_distill)] | [[代码](https://github.com/agentscope-ai/Trinity-RFT/tree/main/trinity/common/workflows/on_policy_distill_workflow.py)] | `algorithm_type: on_policy_distill` | @@ -161,7 +161,7 @@ Trinity-RFT 面向不同背景和目标的用户提供相应功能: > > **没有 GPU?没问题!** 您仍然可以尝试使用: > 1. 按照安装步骤进行操作(可跳过 `flash-attn` 等 GPU 专用的软件包) -> 2. 运行 **[Tinker 训练示例](https://github.com/modelscope/Trinity-RFT/tree/main/examples/tinker)**,该示例专为仅使用 CPU 的系统设计。 +> 2. 运行 **[Tinker 训练示例](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/tinker)**,该示例专为仅使用 CPU 的系统设计。 ### 第一步:安装 @@ -179,7 +179,7 @@ Trinity-RFT 面向不同背景和目标的用户提供相应功能: ### 1. 克隆仓库 ```bash -git clone https://github.com/modelscope/Trinity-RFT +git clone https://github.com/agentscope-ai/Trinity-RFT cd Trinity-RFT ``` @@ -266,7 +266,7 @@ uv pip install trinity-rft uv pip install flash-attn==2.8.1 ``` -> 如需使用 **Megatron-LM** 进行训练,请参考 [Megatron-LM 支持](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_megatron.html) +> 如需使用 **Megatron-LM** 进行训练,请参考 [Megatron-LM 支持](https://agentscope-ai.github.io/Trinity-RFT/zh/main/tutorial/example_megatron.html) ### 第二步:准备数据集和模型 @@ -350,7 +350,7 @@ ray start --address= ``` (可选)您可以使用 [Wandb](https://docs.wandb.ai/quickstart/) / [TensorBoard](https://www.tensorflow.org/tensorboard) / [MLFlow](https://mlflow.org) 等工具,更方便地监控训练流程。 -相应的配置方法请参考 [这个文档](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/trinity_configs.html#monitor-configuration)。 +相应的配置方法请参考 [这个文档](https://agentscope-ai.github.io/Trinity-RFT/en/main/tutorial/trinity_configs.html#monitor-configuration)。 比如使用 Wandb 时,您需要先登录: ```shell diff --git a/benchmark/README.md b/benchmark/README.md index 15fcc5ec1e..9c65994836 100644 --- a/benchmark/README.md +++ b/benchmark/README.md @@ -70,7 +70,7 @@ python bench.py gsm8k --model_path /path/to/Qwen/Qwen2.5-1.5B-Instruct #### GSM8K Results -The chart below shows performance based on this [commit](https://github.com/modelscope/Trinity-RFT/tree/068da409d215bb2450d93b6b7a56740d4751669d). +The chart below shows performance based on this [commit](https://github.com/agentscope-ai/Trinity-RFT/tree/068da409d215bb2450d93b6b7a56740d4751669d). ![View Results](../docs/sphinx_doc/assets/gsm8k-bench.png) ### 2. Countdown @@ -83,7 +83,7 @@ python bench.py countdown --model_path /path/to/Qwen/Qwen2.5-1.5B-Instruct #### Countdown Results -The chart below shows performance based on this [commit](https://github.com/modelscope/Trinity-RFT/tree/068da409d215bb2450d93b6b7a56740d4751669d). 
+The chart below shows performance based on this [commit](https://github.com/agentscope-ai/Trinity-RFT/tree/068da409d215bb2450d93b6b7a56740d4751669d). ![View Results](../docs/sphinx_doc/assets/countdown-bench.png) ### 3. Guru-Math @@ -96,7 +96,7 @@ python bench.py guru_math --model_path /path/to/Qwen/Qwen2.5-7B #### Guru Results -The chart below shows performance based on this [commit](https://github.com/modelscope/Trinity-RFT/tree/fbf6c967bcd637bfd9f81fb4d7dd4961d7d5a407). +The chart below shows performance based on this [commit](https://github.com/agentscope-ai/Trinity-RFT/tree/fbf6c967bcd637bfd9f81fb4d7dd4961d7d5a407). ![View Results](../docs/sphinx_doc/assets/guru-bench.png) See [full report](./reports/guru_math.md) for details. @@ -111,7 +111,7 @@ python bench.py frozen_lake --model_path /path/to/Qwen/Qwen2.5-3B #### Frozen Lake Results -The chart below shows performance based on this [commit](https://github.com/modelscope/Trinity-RFT/tree/3861859cbd9c40de07429db2d9b19fd3d4d31703). +The chart below shows performance based on this [commit](https://github.com/agentscope-ai/Trinity-RFT/tree/3861859cbd9c40de07429db2d9b19fd3d4d31703). ![View Results](../docs/sphinx_doc/assets/bench_frozenlake_step.png) See [full report](./reports/frozenlake.md) for details. @@ -122,7 +122,7 @@ Please follow the instructions in [Alfworld report](./reports/alfworld.md) to ru #### ALFWorld Results -The chart below shows performance based on this [commit](https://github.com/modelscope/Trinity-RFT/tree/3861859cbd9c40de07429db2d9b19fd3d4d31703). +The chart below shows performance based on this [commit](https://github.com/agentscope-ai/Trinity-RFT/tree/3861859cbd9c40de07429db2d9b19fd3d4d31703). ![View Results](../docs/sphinx_doc/assets/bench_alfworld_step.png) diff --git a/benchmark/reports/alfworld.md b/benchmark/reports/alfworld.md index e9328662a6..ea9eb2ad3e 100644 --- a/benchmark/reports/alfworld.md +++ b/benchmark/reports/alfworld.md @@ -10,11 +10,11 @@ The environment is configured as follows: * Reward Structure: +1 for successfully completing the task, -0.1 otherwise * Maximum Steps: 30 (configurable via `max_env_steps`) -See the [documentation](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_multi_turn.html) for data preparation. +See the [documentation](https://agentscope-ai.github.io/Trinity-RFT/en/main/tutorial/example_multi_turn.html) for data preparation. ## 2. Experimental Settings -We evaluate the performance of the following methods in Trinity-RFT framework with version [0.3.3](https://github.com/modelscope/Trinity-RFT/releases/tag/v0.3.3) (verl==0.5.0, vllm==0.11.0) and compare against the latest release of rLLM with commit ID [ef6451f](https://github.com/rllm-org/rllm/commit/ef6451fbd7eba224c4a87e3fd944d7c0e2bcc0ea) (verl==0.5.0) as of Nov. 6, 2025. +We evaluate the performance of the following methods in Trinity-RFT framework with version [0.3.3](https://github.com/agentscope-ai/Trinity-RFT/releases/tag/v0.3.3) (verl==0.5.0, vllm==0.11.0) and compare against the latest release of rLLM with commit ID [ef6451f](https://github.com/rllm-org/rllm/commit/ef6451fbd7eba224c4a87e3fd944d7c0e2bcc0ea) (verl==0.5.0) as of Nov. 6, 2025. Since rLLM does not support ALFWorld environment yet, we implement this task in rLLM for comparison. In Trinity-RFT and rLLM, we respectively evaluate the performance using GRPO algorithm on this task. 
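For reference, the ALFWorld rollout settings summarized in the report above boil down to the sketch below; `max_env_steps` is the only field actually named in the report, and the remaining keys are illustrative assumptions rather than real Trinity-RFT configuration fields.

```yaml
# Sketch of the ALFWorld setup described in the report above.
# Only `max_env_steps` is named there; the other keys are illustrative
# assumptions, not actual Trinity-RFT configuration fields.
alfworld:
  max_env_steps: 30       # episodes are truncated after 30 environment steps
  reward_success: 1.0     # +1 for successfully completing the task
  reward_otherwise: -0.1  # -0.1 in all other cases
```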
diff --git a/benchmark/reports/frozenlake.md b/benchmark/reports/frozenlake.md index f9fb864c3c..cef501fe8a 100644 --- a/benchmark/reports/frozenlake.md +++ b/benchmark/reports/frozenlake.md @@ -21,7 +21,7 @@ To filter the unsolvable tasks, we restrict the game map to have a valid path wi ## 2. Experimental Settings -We evaluate the performance of the following methods in Trinity-RFT framework with version [0.3.3](https://github.com/modelscope/Trinity-RFT/releases/tag/v0.3.3) (verl==0.5.0, vllm==0.11.0) and compare against the latest release of rLLM with commit ID [ef6451f](https://github.com/rllm-org/rllm/commit/ef6451fbd7eba224c4a87e3fd944d7c0e2bcc0ea) (verl==0.5.0) as of Nov. 6, 2025. +We evaluate the performance of the following methods in Trinity-RFT framework with version [0.3.3](https://github.com/agentscope-ai/Trinity-RFT/releases/tag/v0.3.3) (verl==0.5.0, vllm==0.11.0) and compare against the latest release of rLLM with commit ID [ef6451f](https://github.com/rllm-org/rllm/commit/ef6451fbd7eba224c4a87e3fd944d7c0e2bcc0ea) (verl==0.5.0) as of Nov. 6, 2025. We fine-tune a Qwen2.5-3B-Instruct model using the training tasks with GRPO. For all experiments, we fix key parameters to `batch_size=64`, `repeat_times=8`, and `lr=1e-6`. We run each experiment for three times and report the average results. diff --git a/benchmark/reports/guru_math.md b/benchmark/reports/guru_math.md index 2eb251c8ec..1c0b5127f5 100644 --- a/benchmark/reports/guru_math.md +++ b/benchmark/reports/guru_math.md @@ -6,7 +6,7 @@ Guru-Math is the mathematics task derived from the [Guru](https://huggingface.co ## 2. Experimental Settings -We evaluate the performance of the following methods within the Trinity-RFT framework using version [0.3.3](https://github.com/modelscope/Trinity-RFT/releases/tag/v0.3.3) (verl==0.5.0, vllm==0.10.2). For comparison, we ported relevant code from [Reasoning360](https://github.com/LLM360/Reasoning360) to be compatible with verl==0.5.0. +We evaluate the performance of the following methods within the Trinity-RFT framework using version [0.3.3](https://github.com/agentscope-ai/Trinity-RFT/releases/tag/v0.3.3) (verl==0.5.0, vllm==0.10.2). For comparison, we ported relevant code from [Reasoning360](https://github.com/LLM360/Reasoning360) to be compatible with verl==0.5.0. Within both Trinity-RFT and veRL, we evaluate performance using the GRPO algorithm on this task. We fine-tune a base `Qwen2.5-7B` model that has not undergone prior fine-tuning. diff --git a/docs/sphinx_doc/source/conf.py b/docs/sphinx_doc/source/conf.py index 146588ff34..2e6307133c 100644 --- a/docs/sphinx_doc/source/conf.py +++ b/docs/sphinx_doc/source/conf.py @@ -90,7 +90,7 @@ def get_recent_tags(n: int) -> list: "article_header_end": "article_header_customized.html", "use_download_button": True, "use_fullscreen_button": True, - "repository_url": "https://github.com/modelscope/Trinity-RFT", + "repository_url": "https://github.com/agentscope-ai/Trinity-RFT", "use_repository_button": True, } diff --git a/docs/sphinx_doc/source/main.md b/docs/sphinx_doc/source/main.md index 93af224d95..5923f84b0a 100644 --- a/docs/sphinx_doc/source/main.md +++ b/docs/sphinx_doc/source/main.md @@ -28,11 +28,11 @@ Trinity-RFT provides functionalities for users with different backgrounds and ob | Category | Tutorial / Guideline | | --- | ----| | *Run diverse RFT modes* | + [Quick start: GRPO on GSM8k](/tutorial/example_reasoning_basic.md)
      + [Off-policy RFT](/tutorial/example_reasoning_advanced.md)
      + [Fully asynchronous RFT](/tutorial/example_async_mode.md)
      + [Offline learning by DPO or SFT](/tutorial/example_dpo.md) | -| *Multi-step agentic RL* | + [Concatenated multi-turn workflow](/tutorial/example_multi_turn.md)
      + [General multi-step workflow](/tutorial/example_step_wise.md)
      + [ReAct workflow with an agent framework](/tutorial/example_react.md)
      + [Example: train a web-search agent](https://github.com/modelscope/Trinity-RFT/tree/main/examples/agentscope_websearch) | -| *Full-lifecycle data pipelines* | + [Rollout task mixing and selection](/tutorial/develop_selector.md)
      + [Online task curriculum](https://github.com/modelscope/Trinity-RFT/tree/main/examples/bots) (📝 [paper](https://arxiv.org/pdf/2510.26374))
      + [Research project: learn-to-ask](https://github.com/modelscope/Trinity-RFT/tree/main/examples/learn_to_ask) (📝 [paper](https://arxiv.org/pdf/2510.25441))
      + [Experience replay with prioritization](https://github.com/modelscope/Trinity-RFT/tree/main/examples/ppo_countdown_exp_replay)
      + [Advanced data processing & human-in-the-loop](/tutorial/example_data_functionalities.md) | -| *Algorithm development* | + [RL algorithm development with Trinity-RFT](/tutorial/example_mix_algo.md) (📝 [paper](https://arxiv.org/pdf/2508.11408))
      + [Research project: R3L (reflect-then-retry RL)](https://github.com/shiweijiezero/R3L) (📝 [paper](https://arxiv.org/abs/2601.03715))
      + [Research project: group-relative REINFORCE](https://github.com/modelscope/Trinity-RFT/tree/main/examples/rec_gsm8k) (📝 [paper](https://arxiv.org/abs/2509.24203))
      + Non-verifiable domains: [RULER](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_gsm8k_ruler), [trainable RULER](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_gsm8k_trainable_ruler), [rubric-as-reward](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_rubric_as_reward) | -| *Benchmarks* | + [Benchmark toolkit (quick verification & experimentation)](https://github.com/modelscope/Trinity-RFT/tree/main/benchmark/README.md)
      + [Guru-Math benchmark & comparison with veRL](https://github.com/modelscope/Trinity-RFT/tree/main/benchmark/reports/guru_math.md)
      + [FrozenLake benchmark & comparison with rLLM](https://github.com/modelscope/Trinity-RFT/tree/main/benchmark/reports/frozenlake.md)
      + [Alfworld benchmark & comparison with rLLM](https://github.com/modelscope/Trinity-RFT/tree/main/benchmark/reports/alfworld.md) | -| *Going deeper into Trinity-RFT* | + [Full configurations](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/trinity_configs.html)
      + [GPU resource and training configuration guide](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/trinity_gpu_configs.html)
      + [Understand the coordination between explorer and trainer](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/synchronizer.html)
      + [How to align configuration with veRL](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/align_with_verl.html) | +| *Multi-step agentic RL* | + [Concatenated multi-turn workflow](/tutorial/example_multi_turn.md)
      + [General multi-step workflow](/tutorial/example_step_wise.md)
      + [ReAct workflow with an agent framework](/tutorial/example_react.md)
      + [Example: train a web-search agent](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/agentscope_websearch) | +| *Full-lifecycle data pipelines* | + [Rollout task mixing and selection](/tutorial/develop_selector.md)
      + [Online task curriculum](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/bots) (📝 [paper](https://arxiv.org/pdf/2510.26374))
      + [Research project: learn-to-ask](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/learn_to_ask) (📝 [paper](https://arxiv.org/pdf/2510.25441))
      + [Experience replay with prioritization](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/ppo_countdown_exp_replay)
      + [Advanced data processing & human-in-the-loop](/tutorial/example_data_functionalities.md) | +| *Algorithm development* | + [RL algorithm development with Trinity-RFT](/tutorial/example_mix_algo.md) (📝 [paper](https://arxiv.org/pdf/2508.11408))
      + [Research project: R3L (reflect-then-retry RL)](https://github.com/shiweijiezero/R3L) (📝 [paper](https://arxiv.org/abs/2601.03715))
      + [Research project: group-relative REINFORCE](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/rec_gsm8k) (📝 [paper](https://arxiv.org/abs/2509.24203))
      + Non-verifiable domains: [RULER](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/grpo_gsm8k_ruler), [trainable RULER](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/grpo_gsm8k_trainable_ruler), [rubric-as-reward](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/grpo_rubric_as_reward) | +| *Benchmarks* | + [Benchmark toolkit (quick verification & experimentation)](https://github.com/agentscope-ai/Trinity-RFT/tree/main/benchmark/README.md)
      + [Guru-Math benchmark & comparison with veRL](https://github.com/agentscope-ai/Trinity-RFT/tree/main/benchmark/reports/guru_math.md)
      + [FrozenLake benchmark & comparison with rLLM](https://github.com/agentscope-ai/Trinity-RFT/tree/main/benchmark/reports/frozenlake.md)
      + [Alfworld benchmark & comparison with rLLM](https://github.com/agentscope-ai/Trinity-RFT/tree/main/benchmark/reports/alfworld.md) | +| *Going deeper into Trinity-RFT* | + [Full configurations](https://agentscope-ai.github.io/Trinity-RFT/en/main/tutorial/trinity_configs.html)
      + [GPU resource and training configuration guide](https://agentscope-ai.github.io/Trinity-RFT/en/main/tutorial/trinity_gpu_configs.html)
      + [Understand the coordination between explorer and trainer](https://agentscope-ai.github.io/Trinity-RFT/en/main/tutorial/synchronizer.html)
      + [How to align configuration with veRL](https://agentscope-ai.github.io/Trinity-RFT/en/main/tutorial/align_with_verl.html) | @@ -70,23 +70,23 @@ Trinity-RFT provides functionalities for users with different backgrounds and ob ## 🔧 Supported Algorithms -We list some algorithms supported by Trinity-RFT in the following table. For more details, the concrete configurations are shown in the [Algorithm module](https://github.com/modelscope/Trinity-RFT/blob/main/trinity/algorithm/algorithm.py). You can also set up new algorithms by customizing different components, see [tutorial](/tutorial/develop_algorithm.md). +We list some algorithms supported by Trinity-RFT in the following table. For more details, the concrete configurations are shown in the [Algorithm module](https://github.com/agentscope-ai/Trinity-RFT/blob/main/trinity/algorithm/algorithm.py). You can also set up new algorithms by customizing different components, see [tutorial](/tutorial/develop_algorithm.md). | Algorithm | Doc / Example | Source Code | Key Configurations | |:-----------|:-----------|:---------------|:-----------| -| PPO [[Paper](https://arxiv.org/pdf/1707.06347)] | [[Doc](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_reasoning_basic.html)] [[Countdown Example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/ppo_countdown)] | [[Code](https://github.com/modelscope/Trinity-RFT/tree/main/trinity/algorithm/policy_loss_fn/ppo_policy_loss.py)] | `algorithm_type: ppo` | -| GRPO [[Paper](https://arxiv.org/pdf/2402.03300)] | [[Doc](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_reasoning_basic.html)] [[GSM8K Example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_gsm8k)]| [[Code](https://github.com/modelscope/Trinity-RFT/tree/main/trinity/algorithm/advantage_fn/grpo_advantage.py)] | `algorithm_type: grpo` | -| CHORD 💡 [[Paper](https://arxiv.org/pdf/2508.11408)] | [[Doc](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_mix_algo.html)] [[ToolACE Example](https://github.com/modelscope/Trinity-RFT/blob/main/examples/mix_chord/mix_chord_toolace.yaml)] | [[Code](https://github.com/modelscope/Trinity-RFT/tree/main/trinity/algorithm/policy_loss_fn/chord_policy_loss.py)] | `algorithm_type: mix_chord` | -| REC Series 💡 [[Paper](https://arxiv.org/pdf/2509.24203)] | [[GSM8K Example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/rec_gsm8k)] | [[Code](https://github.com/modelscope/Trinity-RFT/tree/main/trinity/algorithm/policy_loss_fn/rec_policy_loss.py)] | `algorithm_type: rec` | -| RLOO [[Paper](https://arxiv.org/pdf/2402.14740)] | - | [[Code](https://github.com/modelscope/Trinity-RFT/tree/main/trinity/algorithm/advantage_fn/rloo_advantage.py)] | `algorithm_type: rloo` | -| REINFORCE++ [[Paper](https://arxiv.org/pdf/2501.03262)] | - | [[Code](https://github.com/modelscope/Trinity-RFT/tree/main/trinity/algorithm/advantage_fn/reinforce_advantage.py)] | `algorithm_type: reinforceplusplus` | -| GSPO [[Paper](https://arxiv.org/pdf/2507.18071)] | - | [[Code](https://github.com/modelscope/Trinity-RFT/tree/main/trinity/algorithm/policy_loss_fn/gspo_policy_loss.py)] | `algorithm_type: gspo` | -| TOPR [[Paper](https://arxiv.org/pdf/2503.14286)] | [[GSM8K Example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/topr_gsm8k)] | [[Code](https://github.com/modelscope/Trinity-RFT/tree/main/trinity/algorithm/policy_loss_fn/topr_policy_loss.py)] | `algorithm_type: topr` | -| sPPO [[Paper](https://arxiv.org/pdf/2108.05828)] | [[GSM8K 
Example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/sppo_gsm8k)] | [[Code](https://github.com/modelscope/Trinity-RFT/tree/main/trinity/algorithm/policy_loss_fn/sppo_loss_fn.py)] | `algorithm_type: sppo` | -| AsymRE [[Paper](https://arxiv.org/pdf/2506.20520)] | [[GSM8K Example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/asymre_gsm8k)] | [[Code](https://github.com/modelscope/Trinity-RFT/tree/main/trinity/algorithm/advantage_fn/asymre_advantage.py)] | `algorithm_type: asymre` | -| CISPO [[Paper](https://arxiv.org/pdf/2506.13585)] | - | [[Code](https://github.com/modelscope/Trinity-RFT/tree/main/trinity/algorithm/policy_loss_fn/cispo_policy_loss.py)] | `algorithm_type: cispo` | -| SAPO [[Paper](https://arxiv.org/pdf/2511.20347)] | - | [[Code](https://github.com/modelscope/Trinity-RFT/tree/main/trinity/algorithm/policy_loss_fn/sapo_policy_loss.py)] | `algorithm_type: sapo` | -| On-Policy Distillation [[Blog](https://thinkingmachines.ai/blog/on-policy-distillation/)] [[Paper](https://arxiv.org/pdf/2306.13649)] | [[GSM8K Example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/on_policy_distill)] | [[Code](https://github.com/modelscope/Trinity-RFT/tree/main/trinity/common/workflows/on_policy_distill_workflow.py)] | `algorithm_type: on_policy_distill` | +| PPO [[Paper](https://arxiv.org/pdf/1707.06347)] | [[Doc](https://agentscope-ai.github.io/Trinity-RFT/en/main/tutorial/example_reasoning_basic.html)] [[Countdown Example](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/ppo_countdown)] | [[Code](https://github.com/agentscope-ai/Trinity-RFT/tree/main/trinity/algorithm/policy_loss_fn/ppo_policy_loss.py)] | `algorithm_type: ppo` | +| GRPO [[Paper](https://arxiv.org/pdf/2402.03300)] | [[Doc](https://agentscope-ai.github.io/Trinity-RFT/en/main/tutorial/example_reasoning_basic.html)] [[GSM8K Example](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/grpo_gsm8k)]| [[Code](https://github.com/agentscope-ai/Trinity-RFT/tree/main/trinity/algorithm/advantage_fn/grpo_advantage.py)] | `algorithm_type: grpo` | +| CHORD 💡 [[Paper](https://arxiv.org/pdf/2508.11408)] | [[Doc](https://agentscope-ai.github.io/Trinity-RFT/en/main/tutorial/example_mix_algo.html)] [[ToolACE Example](https://github.com/agentscope-ai/Trinity-RFT/blob/main/examples/mix_chord/mix_chord_toolace.yaml)] | [[Code](https://github.com/agentscope-ai/Trinity-RFT/tree/main/trinity/algorithm/policy_loss_fn/chord_policy_loss.py)] | `algorithm_type: mix_chord` | +| REC Series 💡 [[Paper](https://arxiv.org/pdf/2509.24203)] | [[GSM8K Example](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/rec_gsm8k)] | [[Code](https://github.com/agentscope-ai/Trinity-RFT/tree/main/trinity/algorithm/policy_loss_fn/rec_policy_loss.py)] | `algorithm_type: rec` | +| RLOO [[Paper](https://arxiv.org/pdf/2402.14740)] | - | [[Code](https://github.com/agentscope-ai/Trinity-RFT/tree/main/trinity/algorithm/advantage_fn/rloo_advantage.py)] | `algorithm_type: rloo` | +| REINFORCE++ [[Paper](https://arxiv.org/pdf/2501.03262)] | - | [[Code](https://github.com/agentscope-ai/Trinity-RFT/tree/main/trinity/algorithm/advantage_fn/reinforce_advantage.py)] | `algorithm_type: reinforceplusplus` | +| GSPO [[Paper](https://arxiv.org/pdf/2507.18071)] | - | [[Code](https://github.com/agentscope-ai/Trinity-RFT/tree/main/trinity/algorithm/policy_loss_fn/gspo_policy_loss.py)] | `algorithm_type: gspo` | +| TOPR [[Paper](https://arxiv.org/pdf/2503.14286)] | [[GSM8K 
Example](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/topr_gsm8k)] | [[Code](https://github.com/agentscope-ai/Trinity-RFT/tree/main/trinity/algorithm/policy_loss_fn/topr_policy_loss.py)] | `algorithm_type: topr` | +| sPPO [[Paper](https://arxiv.org/pdf/2108.05828)] | [[GSM8K Example](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/sppo_gsm8k)] | [[Code](https://github.com/agentscope-ai/Trinity-RFT/tree/main/trinity/algorithm/policy_loss_fn/sppo_loss_fn.py)] | `algorithm_type: sppo` | +| AsymRE [[Paper](https://arxiv.org/pdf/2506.20520)] | [[GSM8K Example](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/asymre_gsm8k)] | [[Code](https://github.com/agentscope-ai/Trinity-RFT/tree/main/trinity/algorithm/advantage_fn/asymre_advantage.py)] | `algorithm_type: asymre` | +| CISPO [[Paper](https://arxiv.org/pdf/2506.13585)] | - | [[Code](https://github.com/agentscope-ai/Trinity-RFT/tree/main/trinity/algorithm/policy_loss_fn/cispo_policy_loss.py)] | `algorithm_type: cispo` | +| SAPO [[Paper](https://arxiv.org/pdf/2511.20347)] | - | [[Code](https://github.com/agentscope-ai/Trinity-RFT/tree/main/trinity/algorithm/policy_loss_fn/sapo_policy_loss.py)] | `algorithm_type: sapo` | +| On-Policy Distillation [[Blog](https://thinkingmachines.ai/blog/on-policy-distillation/)] [[Paper](https://arxiv.org/pdf/2306.13649)] | [[GSM8K Example](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/on_policy_distill)] | [[Code](https://github.com/agentscope-ai/Trinity-RFT/tree/main/trinity/common/workflows/on_policy_distill_workflow.py)] | `algorithm_type: on_policy_distill` | diff --git a/docs/sphinx_doc/source/tutorial/align_with_verl.md b/docs/sphinx_doc/source/tutorial/align_with_verl.md index 6290b59be1..2a48798de4 100644 --- a/docs/sphinx_doc/source/tutorial/align_with_verl.md +++ b/docs/sphinx_doc/source/tutorial/align_with_verl.md @@ -24,7 +24,7 @@ Roughly speaking, the parameters in veRL are mapped to the following modules in | Some global configurations | `trainer` | `monitor`, `synchronizer`, `cluster`, etc | -In the following, we show how to map the parameters in veRL to the ones in Trinity-RFT. Please refer to the [documentation](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/trinity_configs.html) for the detailed parameter configuration of Trinity-RFT. +In the following, we show how to map the parameters in veRL to the ones in Trinity-RFT. Please refer to the [documentation](https://agentscope-ai.github.io/Trinity-RFT/en/main/tutorial/trinity_configs.html) for the detailed parameter configuration of Trinity-RFT. ```{note} To match the default training setup of veRL, we set `synchronizer.sync_style=fixed` and `synchronizer.sync_offset=0` in Trinity-RFT. @@ -142,7 +142,7 @@ explorer: max_response_tokens: 1024 max_model_len: 20480 ``` -Please refer to the [configuration](https://github.com/modelscope/Trinity-RFT/blob/main/examples/grpo_rubric_as_reward/rubric.yaml) and [workflow](https://github.com/modelscope/Trinity-RFT/blob/main/trinity/common/workflows/rubric_judge_workflow.py) with LLM-as-a-judge for more details. +Please refer to the [configuration](https://github.com/agentscope-ai/Trinity-RFT/blob/main/examples/grpo_rubric_as_reward/rubric.yaml) and [workflow](https://github.com/agentscope-ai/Trinity-RFT/blob/main/trinity/common/workflows/rubric_judge_workflow.py) with LLM-as-a-judge for more details. 
### Trainer diff --git a/docs/sphinx_doc/source/tutorial/develop_operator.md b/docs/sphinx_doc/source/tutorial/develop_operator.md index 640624775a..456d183aa7 100644 --- a/docs/sphinx_doc/source/tutorial/develop_operator.md +++ b/docs/sphinx_doc/source/tutorial/develop_operator.md @@ -6,7 +6,7 @@ In Trinity-RFT, the operator module is responsible for processing experience data in the buffer module. It supports existing data processing capabilities from [Data-Juicer](https://github.com/modelscope/data-juicer) naturally, and allows developers to implement their own operators as well. By customizing operators, developers can implement various data processing functionalities, such as data augmentation, filtering, and transformation. You can even implement advantages/returns calculation as operators, as shown in {ref}`Algorithms ` section. -- **DataJuicerOperator** ({class}`trinity.buffer.operators.DataJuicerOperator`): The operator that wraps the data processing operators from Data-Juicer. It provides a simple interface for developers to list the Data-Juicer operators they want to use. The full list of Data-Juicer operators can be found [here](https://modelscope.github.io/data-juicer/en/main/docs/Operators.html). +- **DataJuicerOperator** ({class}`trinity.buffer.operators.DataJuicerOperator`): The operator that wraps the data processing operators from Data-Juicer. It provides a simple interface for developers to list the Data-Juicer operators they want to use. The full list of Data-Juicer operators can be found [here](https://agentscope-ai.github.io/data-juicer/en/main/docs/Operators.html). - **ExperienceOperator** ({class}`trinity.buffer.operators.ExperienceOperator`): The base class for all operators used in experience data processing. It defines the interface and common functionalities that all operators should have. Each operator processes a batch of experience data and returns the processed data with metrics for logging. - **ExperiencePipeline** ({class}`trinity.buffer.pipelines.ExperiencePipeline`): The experience data processing pipeline that manages a sequence of operators. It takes raw experiences from the `Explorer`, passes them through each operator in the pipeline, and writes the final processed experiences into the input buffer of the `Trainer`. diff --git a/docs/sphinx_doc/source/tutorial/example_async_mode.md b/docs/sphinx_doc/source/tutorial/example_async_mode.md index fbecc00c68..5a7a7aaf8b 100644 --- a/docs/sphinx_doc/source/tutorial/example_async_mode.md +++ b/docs/sphinx_doc/source/tutorial/example_async_mode.md @@ -4,7 +4,7 @@ This example demonstrates how to run RFT in fully asynchronous mode using the GR Trinity-RFT supports Asynchronous RFT by running the trainer and explorer in separate processes. -For this purpose, we provide two main configuration files: [`explorer.yaml`](https://github.com/modelscope/Trinity-RFT/blob/main/examples/async_gsm8k/explorer.yaml) and [`trainer.yaml`](https://github.com/modelscope/Trinity-RFT/blob/main/examples/async_gsm8k/trainer.yaml). +For this purpose, we provide two main configuration files: [`explorer.yaml`](https://github.com/agentscope-ai/Trinity-RFT/blob/main/examples/async_gsm8k/explorer.yaml) and [`trainer.yaml`](https://github.com/agentscope-ai/Trinity-RFT/blob/main/examples/async_gsm8k/trainer.yaml). The primary difference between them is that in `explorer.yaml` we set `mode` as `explore`, while in `trainer.yaml` we set `mode` as `train`. 
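To make the difference concrete, here is a minimal sketch of the two files; only the `mode` field is taken from the description above, and every other setting in the real `explorer.yaml` and `trainer.yaml` is omitted:

```yaml
# explorer.yaml (sketch: only the field discussed above is shown)
mode: explore

# trainer.yaml (sketch: only the field discussed above is shown)
mode: train
```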
The model weights of the explorer and trainer are synchronized once every `sync_interval * batch_size` tasks. diff --git a/docs/sphinx_doc/source/tutorial/example_data_functionalities.md b/docs/sphinx_doc/source/tutorial/example_data_functionalities.md index 1ff321b451..7663bcd41a 100644 --- a/docs/sphinx_doc/source/tutorial/example_data_functionalities.md +++ b/docs/sphinx_doc/source/tutorial/example_data_functionalities.md @@ -4,7 +4,7 @@ ## Overview Trinity-RFT provides a unified data processor to process the raw dataset and experiences for the task pipeline and the experience pipeline. -- For tasks, the data processing capabilities come from [Data-Juicer](https://github.com/modelscope/data-juicer). You can use data processing operators from Data-Juicer. The full list of Data-Juicer operators can be found [here](https://modelscope.github.io/data-juicer/en/main/docs/Operators.html) +- For tasks, the data processing capabilities come from [Data-Juicer](https://github.com/modelscope/data-juicer). You can use data processing operators from Data-Juicer. The full list of Data-Juicer operators can be found [here](https://agentscope-ai.github.io/data-juicer/en/main/docs/Operators.html) - For experiences, in addition to Data-Juicer operators, Trinity-RFT provides several RFT-related operators and allows developers to implement their own operators. For implementing your own data processor, you can refer to this [document](trinity_programming_guide.md#operators-for-data-developers). @@ -73,7 +73,7 @@ It's worth noticing that we don't need to set the output path usually, cause it The data processing of Data-Juicer is maintained as a service. Thus we need to config the data-juicer service. Luckily, Trinity-RFT provides an auto-start way to start the data processor server automatically. All you need to do is to set the `auto_start` of `data-juicer` service to `true` in the `service` section. -All config items in the `data_processor` section can be found [here](trinity_configs.md). A prepared config file for this example of GSM8K can be found in [the config file](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_gsm8k_task_pipeline/gsm8k.yaml). +All config items in the `data_processor` section can be found [here](trinity_configs.md). A prepared config file for this example of GSM8K can be found in [the config file](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/grpo_gsm8k_task_pipeline/gsm8k.yaml). ```{note} Only when one of `xxx_pipeline` is provided, and one of `dj_process_desc` and `dj_config_path` in the pipeline config is provided, the data processor and the data active iterator will be activated. Otherwise, this part will be skipped and it will enter into the exploring stage directly. @@ -174,7 +174,7 @@ process: field_names: ["prompt", "response"] ``` -All config items in the `data_processor` section can be found [here](trinity_configs.md). A prepared config file for this example of GSM8K can be found in [the config file](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_gsm8k_experience_pipeline/gsm8k.yaml). +All config items in the `data_processor` section can be found [here](trinity_configs.md). A prepared config file for this example of GSM8K can be found in [the config file](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/grpo_gsm8k_experience_pipeline/gsm8k.yaml). 
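Before moving on, note that the Data-Juicer service auto-start described earlier amounts to roughly the following sketch; the key names and nesting under `service` are assumptions rather than copied from the real configs, so treat the linked example YAML files as authoritative:

```yaml
# Sketch only: key names under `service` are assumed, not copied from the real configs.
service:
  data_juicer:
    auto_start: true   # let Trinity-RFT start the Data-Juicer service automatically
```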
### Exploring & Training After preparing the config files of Trinity-RFT, you can start your ray cluster and run the RFT process including the data active iterator part with the following commands: @@ -253,7 +253,7 @@ The difference is that we use the data-juicer OP `human_preference_annotation_ma You can set more config items for this OP (e.g. notification when annotation is finished). For more details, please refer to this [doc](https://github.com/modelscope/data-juicer/tree/main/configs/annotation). -All config items in the `data_processor` section can be found [here](trinity_configs.md). A prepared config file for this example can be found in [the config file](https://github.com/modelscope/Trinity-RFT/tree/main/examples/dpo_human_in_the_loop/dpo.yaml). +All config items in the `data_processor` section can be found [here](trinity_configs.md). A prepared config file for this example can be found in [the config file](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/dpo_human_in_the_loop/dpo.yaml). ### Start Running diff --git a/docs/sphinx_doc/source/tutorial/example_dataset_perspective.md b/docs/sphinx_doc/source/tutorial/example_dataset_perspective.md index 635ab7fb23..521e1aa107 100644 --- a/docs/sphinx_doc/source/tutorial/example_dataset_perspective.md +++ b/docs/sphinx_doc/source/tutorial/example_dataset_perspective.md @@ -6,42 +6,42 @@ This guide provides an example list from the dataset perspective, where you can | Dataset | Algorithm | Use Case | References | |--------------------------------------------------------------------------------------------------------------| --- |----------------------------------------------------------------------------------------| --- | -| [openai/gsm8k](https://huggingface.co/datasets/openai/gsm8k) | GRPO | Regular RFT | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_gsm8k), [doc](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_reasoning_basic.html) | -| | GRPO | Asynchronous training | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/async_gsm8k), [doc](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_async_mode.html) | -| | Multi-Step GRPO | AgentScope ReAct agent training | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/agentscope_react), [doc](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_react.html) | -| | AsymRE | Regular RFT | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/asymre_gsm8k) | -| | CISPO | Regular RFT | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/cispo_gsm8k) | -| | GRPO | Training with prioritized tasks | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_gsm8k_task_pipeline), [doc](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_data_functionalities.html#example-data-processor-for-task-pipeline) | -| | GRPO | Training with reward reshaping on experiences | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_gsm8k_experience_pipeline), [doc](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_data_functionalities.html#example-data-processor-for-experience-pipeline) | -| | GRPO | Training with RULER (Relative Universal LLM-Elicited Rewards) | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_gsm8k_ruler) | -| | GRPO | Training a policy model as its own reward model | 
[example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_gsm8k_trainable_ruler) | -| | GRPO | Training using LoRA | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_lora_gsm8k) | -| | OPMD | Off-policy RFT | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/opmd_gsm8k), [doc](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_reasoning_advanced.html) | -| | REC | Training with group-relative reinforce variants | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/rec_gsm8k) | -| | sPPO | Training with sPPO algorithm | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/sppo_gsm8k) | -| | TOPR | Tapered off-policy RFT | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/topr_gsm8k) | -| Math category tasks | GRPO | Training with rewards from RM-Gallery | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_math) | -| | AsymRE | Regular RFT | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/asymre_math) | -| | MIX | Training with "expert" data generated by a more advanced LLM | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/mix_math), [doc](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_mix_algo.html) | -| [ALFWorld](https://github.com/alfworld/alfworld) | GRPO | Concatenated multi-turn RFT | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_alfworld), [doc](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_multi_turn.html) | -| | Multi-Step GRPO | General multi-step RFT | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_alfworld_general_multi_step), [doc](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_step_wise.html) | -| [SciWorld](https://github.com/allenai/ScienceWorld) | GRPO | Concatenated multi-turn RFT | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_sciworld) | -| [WebShop](https://github.com/princeton-nlp/WebShop) | GRPO | Concatenated multi-turn RFT | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_webshop), [doc](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_multi_turn.html) | -| [callanwu/WebWalkerQA](https://huggingface.co/datasets/callanwu/WebWalkerQA) | Multi-Step GRPO | Multi-turn web search agent training | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/agentscope_websearch) | -| [corbt/enron-emails](https://huggingface.co/datasets/corbt/enron-emails) | Multi-Step GRPO | Multi-turn email search agent training | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_email_search), [doc](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_search_email.html) | -| [open-r1/DAPO-Math-17k-Processed](https://huggingface.co/datasets/open-r1/DAPO-Math-17k-Processed) | GRPO | Regular RFT | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/dapo_math) | -| [LLM360/guru-RL-92k](https://huggingface.co/datasets/LLM360/guru-RL-92k) | GRPO | Training with bayesian online task selection | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/bots) | -| [Frozen Lake](https://gymnasium.farama.org/environments/toy_text/frozen_lake/) | GRPO | Concatenated multi-turn RFT | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_frozen_lake) | -| 
[anisha2102/RaR-Medicine](https://huggingface.co/datasets/anisha2102/RaR-Medicine) | GRPO | Training with rewards from LLM judge and rubrics for a non-verifiable medicine QA task | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_rubric_as_reward) | -| [Team-ACE/ToolACE](https://huggingface.co/datasets/Team-ACE/ToolACE) | GRPO | Regular RFT for tool calling | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_toolcall) | -| [hiyouga/geometry3k](https://huggingface.co/datasets/hiyouga/geometry3k) | GRPO | Regular RFT for VLM | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_vlm) | -| | MIX | Training with "expert" data generated by a more advanced LLM | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/mix_vlm) | -| [datajuicer/RealMedConv](https://huggingface.co/datasets/datajuicer/RealMedConv) | GRPO | Regular RFT for learning to ask in a proactive way | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/learn_to_ask) | -| [datajuicer/Trinity-ToolAce-RL-split](https://huggingface.co/datasets/datajuicer/Trinity-ToolAce-RL-split) | CHORD | Training with dynamic SFT + RL integration | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/mix_chord) | -| [datajuicer/Trinity-ToolAce-SFT-split](https://huggingface.co/datasets/datajuicer/Trinity-ToolAce-SFT-split) | CHORD | Training with dynamic SFT + RL integration | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/mix_chord) | -| [Jiayi-Pan/Countdown-Tasks-3to4](https://huggingface.co/datasets/Jiayi-Pan/Countdown-Tasks-3to4) | PPO | Training based on the critic model | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/ppo_countdown) | -| | PPO | Training with Megatron-LM as the backend. 
| [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/ppo_countdown_megatron) | -| | PPO | Training with experience replay | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/ppo_countdown_exp_replay) | -| [open-r1/Mixture-of-Thoughts](https://huggingface.co/datasets/open-r1/Mixture-of-Thoughts) | SFT | Regular SFT | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/sft_mot), [doc](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_dpo.html#configuration-for-sft) | -| [HumanLLMs/Human-Like-DPO-Dataset](https://huggingface.co/datasets/HumanLLMs/Human-Like-DPO-Dataset) | DPO | Training based on prepared human preferences | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/dpo_humanlike), [doc](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_dpo.html) | -| toy dataset | DPO | Training based on human-in-the-loop real-time preference annotation | [example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/dpo_human_in_the_loop), [doc](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_data_functionalities.html#example-human-in-the-loop) | +| [openai/gsm8k](https://huggingface.co/datasets/openai/gsm8k) | GRPO | Regular RFT | [example](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/grpo_gsm8k), [doc](https://agentscope-ai.github.io/Trinity-RFT/en/main/tutorial/example_reasoning_basic.html) | +| | GRPO | Asynchronous training | [example](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/async_gsm8k), [doc](https://agentscope-ai.github.io/Trinity-RFT/en/main/tutorial/example_async_mode.html) | +| | Multi-Step GRPO | AgentScope ReAct agent training | [example](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/agentscope_react), [doc](https://agentscope-ai.github.io/Trinity-RFT/en/main/tutorial/example_react.html) | +| | AsymRE | Regular RFT | [example](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/asymre_gsm8k) | +| | CISPO | Regular RFT | [example](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/cispo_gsm8k) | +| | GRPO | Training with prioritized tasks | [example](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/grpo_gsm8k_task_pipeline), [doc](https://agentscope-ai.github.io/Trinity-RFT/en/main/tutorial/example_data_functionalities.html#example-data-processor-for-task-pipeline) | +| | GRPO | Training with reward reshaping on experiences | [example](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/grpo_gsm8k_experience_pipeline), [doc](https://agentscope-ai.github.io/Trinity-RFT/en/main/tutorial/example_data_functionalities.html#example-data-processor-for-experience-pipeline) | +| | GRPO | Training with RULER (Relative Universal LLM-Elicited Rewards) | [example](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/grpo_gsm8k_ruler) | +| | GRPO | Training a policy model as its own reward model | [example](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/grpo_gsm8k_trainable_ruler) | +| | GRPO | Training using LoRA | [example](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/grpo_lora_gsm8k) | +| | OPMD | Off-policy RFT | [example](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/opmd_gsm8k), [doc](https://agentscope-ai.github.io/Trinity-RFT/en/main/tutorial/example_reasoning_advanced.html) | +| | REC | Training with group-relative reinforce variants | 
[example](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/rec_gsm8k) | +| | sPPO | Training with sPPO algorithm | [example](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/sppo_gsm8k) | +| | TOPR | Tapered off-policy RFT | [example](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/topr_gsm8k) | +| Math category tasks | GRPO | Training with rewards from RM-Gallery | [example](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/grpo_math) | +| | AsymRE | Regular RFT | [example](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/asymre_math) | +| | MIX | Training with "expert" data generated by a more advanced LLM | [example](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/mix_math), [doc](https://agentscope-ai.github.io/Trinity-RFT/en/main/tutorial/example_mix_algo.html) | +| [ALFWorld](https://github.com/alfworld/alfworld) | GRPO | Concatenated multi-turn RFT | [example](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/grpo_alfworld), [doc](https://agentscope-ai.github.io/Trinity-RFT/en/main/tutorial/example_multi_turn.html) | +| | Multi-Step GRPO | General multi-step RFT | [example](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/grpo_alfworld_general_multi_step), [doc](https://agentscope-ai.github.io/Trinity-RFT/en/main/tutorial/example_step_wise.html) | +| [SciWorld](https://github.com/allenai/ScienceWorld) | GRPO | Concatenated multi-turn RFT | [example](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/grpo_sciworld) | +| [WebShop](https://github.com/princeton-nlp/WebShop) | GRPO | Concatenated multi-turn RFT | [example](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/grpo_webshop), [doc](https://agentscope-ai.github.io/Trinity-RFT/en/main/tutorial/example_multi_turn.html) | +| [callanwu/WebWalkerQA](https://huggingface.co/datasets/callanwu/WebWalkerQA) | Multi-Step GRPO | Multi-turn web search agent training | [example](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/agentscope_websearch) | +| [corbt/enron-emails](https://huggingface.co/datasets/corbt/enron-emails) | Multi-Step GRPO | Multi-turn email search agent training | [example](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/grpo_email_search), [doc](https://agentscope-ai.github.io/Trinity-RFT/en/main/tutorial/example_search_email.html) | +| [open-r1/DAPO-Math-17k-Processed](https://huggingface.co/datasets/open-r1/DAPO-Math-17k-Processed) | GRPO | Regular RFT | [example](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/dapo_math) | +| [LLM360/guru-RL-92k](https://huggingface.co/datasets/LLM360/guru-RL-92k) | GRPO | Training with bayesian online task selection | [example](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/bots) | +| [Frozen Lake](https://gymnasium.farama.org/environments/toy_text/frozen_lake/) | GRPO | Concatenated multi-turn RFT | [example](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/grpo_frozen_lake) | +| [anisha2102/RaR-Medicine](https://huggingface.co/datasets/anisha2102/RaR-Medicine) | GRPO | Training with rewards from LLM judge and rubrics for a non-verifiable medicine QA task | [example](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/grpo_rubric_as_reward) | +| [Team-ACE/ToolACE](https://huggingface.co/datasets/Team-ACE/ToolACE) | GRPO | Regular RFT for tool calling | [example](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/grpo_toolcall) | +| 
[hiyouga/geometry3k](https://huggingface.co/datasets/hiyouga/geometry3k) | GRPO | Regular RFT for VLM | [example](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/grpo_vlm) | +| | MIX | Training with "expert" data generated by a more advanced LLM | [example](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/mix_vlm) | +| [datajuicer/RealMedConv](https://huggingface.co/datasets/datajuicer/RealMedConv) | GRPO | Regular RFT for learning to ask in a proactive way | [example](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/learn_to_ask) | +| [datajuicer/Trinity-ToolAce-RL-split](https://huggingface.co/datasets/datajuicer/Trinity-ToolAce-RL-split) | CHORD | Training with dynamic SFT + RL integration | [example](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/mix_chord) | +| [datajuicer/Trinity-ToolAce-SFT-split](https://huggingface.co/datasets/datajuicer/Trinity-ToolAce-SFT-split) | CHORD | Training with dynamic SFT + RL integration | [example](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/mix_chord) | +| [Jiayi-Pan/Countdown-Tasks-3to4](https://huggingface.co/datasets/Jiayi-Pan/Countdown-Tasks-3to4) | PPO | Training based on the critic model | [example](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/ppo_countdown) | +| | PPO | Training with Megatron-LM as the backend. | [example](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/ppo_countdown_megatron) | +| | PPO | Training with experience replay | [example](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/ppo_countdown_exp_replay) | +| [open-r1/Mixture-of-Thoughts](https://huggingface.co/datasets/open-r1/Mixture-of-Thoughts) | SFT | Regular SFT | [example](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/sft_mot), [doc](https://agentscope-ai.github.io/Trinity-RFT/en/main/tutorial/example_dpo.html#configuration-for-sft) | +| [HumanLLMs/Human-Like-DPO-Dataset](https://huggingface.co/datasets/HumanLLMs/Human-Like-DPO-Dataset) | DPO | Training based on prepared human preferences | [example](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/dpo_humanlike), [doc](https://agentscope-ai.github.io/Trinity-RFT/en/main/tutorial/example_dpo.html) | +| toy dataset | DPO | Training based on human-in-the-loop real-time preference annotation | [example](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/dpo_human_in_the_loop), [doc](https://agentscope-ai.github.io/Trinity-RFT/en/main/tutorial/example_data_functionalities.html#example-human-in-the-loop) | diff --git a/docs/sphinx_doc/source/tutorial/example_dpo.md b/docs/sphinx_doc/source/tutorial/example_dpo.md index 3376f43161..50a67775b1 100644 --- a/docs/sphinx_doc/source/tutorial/example_dpo.md +++ b/docs/sphinx_doc/source/tutorial/example_dpo.md @@ -50,7 +50,7 @@ For SFT, we download the `open-r1/Mixture-of-Thoughts` dataset to the local dire ### Configuration for DPO -We use the configurations in [`dpo.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/dpo_humanlike/dpo.yaml) for this experiment. Some important setups are listed in the following: +We use the configurations in [`dpo.yaml`](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/dpo_humanlike/dpo.yaml) for this experiment. Some important setups are listed in the following: We run the experiment in a train mode, as there is no Explorer. To enable this mode, we config `mode` to `train` and pass the data path to the trainer. 
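For illustration, the train-mode setup described above boils down to something like the sketch below; only `mode: train` is taken from the text, and the field used to pass the dataset path to the trainer is a hypothetical placeholder rather than the real configuration key (see the linked `dpo.yaml` for the actual layout):

```yaml
# Sketch only: `mode: train` comes from the text above;
# `dataset_path` is a hypothetical placeholder, not the real key.
mode: train
trainer:
  dataset_path: /path/to/human_preference_data
```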
@@ -105,7 +105,7 @@ For more configuration options, please refer to the {ref}`Configuration Guide diff --git a/docs/sphinx_doc/source/tutorial/example_megatron.md b/docs/sphinx_doc/source/tutorial/example_megatron.md index 8cdea8497f..b8d0831d2a 100644 --- a/docs/sphinx_doc/source/tutorial/example_megatron.md +++ b/docs/sphinx_doc/source/tutorial/example_megatron.md @@ -20,9 +20,10 @@ pip install -e ".[megatron]" # uv sync -extra megatron ``` -Then, install NVIDIA's Apex library for mixed-precision training: +Then, install mbridge and NVIDIA's Apex library for mixed-precision training: ```bash +pip install git+https://github.com/ISEEKYAN/mbridge.git@20e9ffbbe72ae7b1df83bfe1bc3c11f7382f2612 pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation \ --config-settings "--build-option=--cpp_ext" \ --config-settings "--build-option=--cuda_ext" \ diff --git a/docs/sphinx_doc/source/tutorial/example_mix_algo.md b/docs/sphinx_doc/source/tutorial/example_mix_algo.md index 7cd545c9f8..d0ac532633 100644 --- a/docs/sphinx_doc/source/tutorial/example_mix_algo.md +++ b/docs/sphinx_doc/source/tutorial/example_mix_algo.md @@ -267,7 +267,7 @@ class MIXPolicyLossFn(PolicyLossFn): With the above newly-defined classes and functions, we can run the experiments without modifying other process. An example showing some important configurations is shown below, including the weighting factor $\mu$ as `algorithm.policy_loss_fn_args['mu']` and the batch size of expert experiences $B'$, calculated as the product of `buffer.batch_size`, `algorithm.sample_strategy_args['expert_data_ratio']` and `algorithm.repeat_times`. -For the full configuration, please refer to [`mix_math.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/mix_math/mix_math.yaml). +For the full configuration, please refer to [`mix_math.yaml`](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/mix_math/mix_math.yaml). ```yaml algorithm: diff --git a/docs/sphinx_doc/source/tutorial/example_multi_turn.md b/docs/sphinx_doc/source/tutorial/example_multi_turn.md index e30c52c488..0136abe073 100644 --- a/docs/sphinx_doc/source/tutorial/example_multi_turn.md +++ b/docs/sphinx_doc/source/tutorial/example_multi_turn.md @@ -66,7 +66,7 @@ The task is described as an environment instead of a single prompt. ## Step 2: Config preparation and run the experiment -You can refer to [Quick Start](./example_reasoning_basic.md) to setup the config and others. The default config files are [`alfworld.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_alfworld/alfworld.yaml) and [`webshop.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_webshop/webshop.yaml), respectively. +You can refer to [Quick Start](./example_reasoning_basic.md) to setup the config and others. The default config files are [`alfworld.yaml`](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/grpo_alfworld/alfworld.yaml) and [`webshop.yaml`](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/grpo_webshop/webshop.yaml), respectively. You may revise the configurations properly and run the experiment! 
```bash diff --git a/docs/sphinx_doc/source/tutorial/example_reasoning_advanced.md b/docs/sphinx_doc/source/tutorial/example_reasoning_advanced.md index 3f30d48349..97db147b6b 100644 --- a/docs/sphinx_doc/source/tutorial/example_reasoning_advanced.md +++ b/docs/sphinx_doc/source/tutorial/example_reasoning_advanced.md @@ -5,7 +5,7 @@ Let's continue with the [previous GSM8k example](./example_reasoning_basic.md), but switch from on-policy to off-policy RFT. In this example, we consider an off-policy RL algorithm termed as OPMD (Online Policy Mirror Descent) in Trinity-RFT. The algorithm design and analysis can be found in Section 2.2 of [our paper](https://arxiv.org/abs/2509.24203). -The config file is [`opmd_gsm8k.yaml`](https://github.com/modelscope/Trinity-RFT/blob/main/examples/opmd_gsm8k/opmd_gsm8k.yaml). +The config file is [`opmd_gsm8k.yaml`](https://github.com/agentscope-ai/Trinity-RFT/blob/main/examples/opmd_gsm8k/opmd_gsm8k.yaml). To try out the OPMD algorithm: ```shell diff --git a/docs/sphinx_doc/source/tutorial/example_reasoning_basic.md b/docs/sphinx_doc/source/tutorial/example_reasoning_basic.md index 19e683dcee..78db3831f2 100644 --- a/docs/sphinx_doc/source/tutorial/example_reasoning_basic.md +++ b/docs/sphinx_doc/source/tutorial/example_reasoning_basic.md @@ -46,7 +46,7 @@ We run the experiment in a synchronous mode where the Explorer and Trainer opera ### Use GRPO Algorithm -We use the configurations in [`gsm8k.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_gsm8k/gsm8k.yaml) for this experiment. Some important setups of `gsm8k.yaml` are listed in the following: +We use the configurations in [`gsm8k.yaml`](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/grpo_gsm8k/gsm8k.yaml) for this experiment. Some important setups of `gsm8k.yaml` are listed in the following: ```yaml diff --git a/docs/sphinx_doc/source/tutorial/example_search_email.md b/docs/sphinx_doc/source/tutorial/example_search_email.md index 6be9e71320..8c56fe7d08 100644 --- a/docs/sphinx_doc/source/tutorial/example_search_email.md +++ b/docs/sphinx_doc/source/tutorial/example_search_email.md @@ -34,7 +34,7 @@ If you want to choose a new database path, you can modify the `DEFAULT_DB_PATH` ### Step 2: Run the Workflow -The config file is located in [`email_search.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_email_search/email_search.yaml). +The config file is located in [`email_search.yaml`](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/grpo_email_search/email_search.yaml). To run this example, you can run the following command: ```bash diff --git a/docs/sphinx_doc/source/tutorial/example_step_wise.md b/docs/sphinx_doc/source/tutorial/example_step_wise.md index ba31c9f91f..0fa8fe4373 100644 --- a/docs/sphinx_doc/source/tutorial/example_step_wise.md +++ b/docs/sphinx_doc/source/tutorial/example_step_wise.md @@ -177,7 +177,7 @@ The task is described as an environment instead of a single prompt. The task des ### Config preparation and run the experiment -The default config file is [`alfworld.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_alfworld_general_multi_step/alfworld.yaml). +The default config file is [`alfworld.yaml`](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/grpo_alfworld_general_multi_step/alfworld.yaml). You may revise the configurations properly and run the experiment! 
```bash diff --git a/docs/sphinx_doc/source/tutorial/example_tinker_backend.md b/docs/sphinx_doc/source/tutorial/example_tinker_backend.md index ec7158db87..5cc0b69f3e 100644 --- a/docs/sphinx_doc/source/tutorial/example_tinker_backend.md +++ b/docs/sphinx_doc/source/tutorial/example_tinker_backend.md @@ -62,7 +62,7 @@ trinity run --config tinker.yaml # Replace with your actual config file path 3. **Multiple stages training** is not supported currently, we will add support for this in the future. -> 💡 A complete example configuration file is available at [`tinker.yaml`](https://github.com/modelscope/Trinity-RFT/blob/main/examples/tinker/tinker.yaml). +> 💡 A complete example configuration file is available at [`tinker.yaml`](https://github.com/agentscope-ai/Trinity-RFT/blob/main/examples/tinker/tinker.yaml). ## Results on the Llama-3.2-3B Model diff --git a/docs/sphinx_doc/source/tutorial/faq.md b/docs/sphinx_doc/source/tutorial/faq.md index 6caec69368..8f7c8f51bb 100644 --- a/docs/sphinx_doc/source/tutorial/faq.md +++ b/docs/sphinx_doc/source/tutorial/faq.md @@ -105,7 +105,7 @@ ray start --head - For trainer, adjust `trainer.max_token_len_per_gpu` when `trainer.use_dynamic_bsz=false`; adjust `trainer.ppo_max_token_len_per_gpu` and `trainer.ulysses_sequence_parallel_size` when `trainer.use_dynamic_bsz=true`. Setting `trainer.trainer_config.actor_rollout_ref.actor.entropy_from_logits_with_chunking=true` may also help. - For explorer, adjust `explorer.rollout_model.tensor_parallel_size`. -Besides, Trinity-RFT provides [GPU related configuration guide](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/trinity_gpu_configs.html), which you may refer to for suggestions on adjusting the configurations. +Besides, Trinity-RFT provides [GPU related configuration guide](https://agentscope-ai.github.io/Trinity-RFT/en/main/tutorial/trinity_gpu_configs.html), which you may refer to for suggestions on adjusting the configurations. ## Part 3: Debugging Methods @@ -216,4 +216,4 @@ model.load_state_dict(load_fsdp_state_dict_from_verl_checkpoint(ckp_path)) - **Separation of Explorer and Trainer**: Trinity-RFT replaces the rollout model in veRL with a separate Explorer module, which handles agent-environment interactions. This separation allows for more flexible workflow designs and rollout-training scheduling. - **Full-lifecycle Data Pipeline**: Trinity-RFT adds a Buffer module between Explorer and Trainer, providing a complete data pipeline for experience storage, processing, and sampling. This design enables advanced data handling strategies, such as experience replay and prioritized sampling. -We also provide benchmarks comparing Trinity-RFT with veRL and systems built on veRL (e.g., [rLLM](https://github.com/rllm-org/rllm)), which show comparable or better performance and efficiency. Please refer to [Benchmark](https://github.com/modelscope/Trinity-RFT/tree/main/benchmark) for more details. +We also provide benchmarks comparing Trinity-RFT with veRL and systems built on veRL (e.g., [rLLM](https://github.com/rllm-org/rllm)), which show comparable or better performance and efficiency. Please refer to [Benchmark](https://github.com/agentscope-ai/Trinity-RFT/tree/main/benchmark) for more details. diff --git a/docs/sphinx_doc/source/tutorial/trinity_configs.md b/docs/sphinx_doc/source/tutorial/trinity_configs.md index 8f7f1d813a..e1f78a7c3b 100644 --- a/docs/sphinx_doc/source/tutorial/trinity_configs.md +++ b/docs/sphinx_doc/source/tutorial/trinity_configs.md @@ -53,7 +53,7 @@ stages: ... 
``` -Each of these sections will be explained in detail below. For additional details about specific parameters not covered here, please refer to the [source code](https://github.com/modelscope/Trinity-RFT/blob/main/trinity/common/config.py). +Each of these sections will be explained in detail below. For additional details about specific parameters not covered here, please refer to the [source code](https://github.com/agentscope-ai/Trinity-RFT/blob/main/trinity/common/config.py). ```{tip} Trinity-RFT uses [OmegaConf](https://omegaconf.readthedocs.io/en/latest/) to load YAML configuration files. diff --git a/docs/sphinx_doc/source/tutorial/trinity_installation.md b/docs/sphinx_doc/source/tutorial/trinity_installation.md index d86a8af903..34f9eec22b 100644 --- a/docs/sphinx_doc/source/tutorial/trinity_installation.md +++ b/docs/sphinx_doc/source/tutorial/trinity_installation.md @@ -29,7 +29,7 @@ This method is best if you plan to customize or contribute to Trinity-RFT. ### 1. Clone the Repository ```bash -git clone https://github.com/modelscope/Trinity-RFT +git clone https://github.com/agentscope-ai/Trinity-RFT cd Trinity-RFT ``` @@ -109,7 +109,7 @@ You can download the Trinity-RFT Docker image from Github Container Registry or ### Pull from GitHub Container Registry (Recommended for beginners) ```bash -git clone https://github.com/modelscope/Trinity-RFT +git clone https://github.com/agentscope-ai/Trinity-RFT cd Trinity-RFT docker pull ghcr.io/modelscope/trinity-rft:latest @@ -131,7 +131,7 @@ The image has include dependencies such as vllm, flash-attn and Megatron-LM, if ### Build Locally ```bash -git clone https://github.com/modelscope/Trinity-RFT +git clone https://github.com/agentscope-ai/Trinity-RFT cd Trinity-RFT # Build the Docker image @@ -156,4 +156,4 @@ For training with **Megatron-LM**, please refer to {ref}`Megatron-LM Backend + [Off-policy RFT](/tutorial/example_reasoning_advanced.md)
      + [全异步 RFT](/tutorial/example_async_mode.md)
      + [通过 DPO 或 SFT 进行离线学习](/tutorial/example_dpo.md) | -| *多轮智能体强化学习* | + [拼接多轮任务](/tutorial/example_multi_turn.md)
      + [通用多轮任务](/tutorial/example_step_wise.md)
      + [调用智能体框架中的 ReAct 工作流](/tutorial/example_react.md)
      + [例子:训练一个网络搜索智能体](https://github.com/modelscope/Trinity-RFT/tree/main/examples/agentscope_websearch) | -| *全生命周期的数据流水线* | + [Rollout 任务混合与选取](/tutorial/develop_selector.md)
      + [在线任务选择](https://github.com/modelscope/Trinity-RFT/tree/main/examples/bots) (📝 [论文](https://arxiv.org/pdf/2510.26374))
      + [研究项目:learn-to-ask](https://github.com/modelscope/Trinity-RFT/tree/main/examples/learn_to_ask) (📝 [论文](https://arxiv.org/pdf/2510.25441))
      + [经验回放机制](https://github.com/modelscope/Trinity-RFT/tree/main/examples/ppo_countdown_exp_replay)
      + [高级数据处理能力 & Human-in-the-loop](/tutorial/example_data_functionalities.md) | -| *强化学习算法开发* | + [使用 Trinity-RFT 进行 RL 算法开发](/tutorial/example_mix_algo.md) (📝 [论文](https://arxiv.org/pdf/2508.11408))
      + [研究项目: R3L (基于反思-重试的强化学习)](https://github.com/shiweijiezero/R3L) (📝 [论文](https://arxiv.org/abs/2601.03715))
      + [研究项目: group-relative REINFORCE](https://github.com/modelscope/Trinity-RFT/tree/main/examples/rec_gsm8k) (📝 [论文](https://arxiv.org/abs/2509.24203))
      + 不可验证的领域: [RULER](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_gsm8k_ruler), [可训练 RULER](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_gsm8k_trainable_ruler), [rubric-as-reward](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_rubric_as_reward) | -| *基准测试* | + [基准测试工具 (快速验证与实验)](https://github.com/modelscope/Trinity-RFT/tree/main/benchmark/README.md)
      + [Guru-Math 测试 & 对比 veRL](https://github.com/modelscope/Trinity-RFT/tree/main/benchmark/reports/guru_math.md)
      + [FrozenLake 测试 & 对比 rLLM](https://github.com/modelscope/Trinity-RFT/tree/main/benchmark/reports/frozenlake.md)
      + [Alfworld 测试 & 对比 rLLM](https://github.com/modelscope/Trinity-RFT/tree/main/benchmark/reports/alfworld.md) | -| *深入认识 Trinity-RFT* | + [完整配置指南](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/trinity_configs.html)
      + [GPU 资源与训练配置对应指南](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/trinity_gpu_configs.html)
      + [理解 explorer-trainer 同步逻辑](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/synchronizer.html)
      + [如何与 verl 对齐配置](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/align_with_verl.html) | +| *多轮智能体强化学习* | + [拼接多轮任务](/tutorial/example_multi_turn.md)
      + [通用多轮任务](/tutorial/example_step_wise.md)
      + [调用智能体框架中的 ReAct 工作流](/tutorial/example_react.md)
      + [例子:训练一个网络搜索智能体](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/agentscope_websearch) | +| *全生命周期的数据流水线* | + [Rollout 任务混合与选取](/tutorial/develop_selector.md)
      + [在线任务选择](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/bots) (📝 [论文](https://arxiv.org/pdf/2510.26374))
      + [研究项目:learn-to-ask](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/learn_to_ask) (📝 [论文](https://arxiv.org/pdf/2510.25441))
      + [经验回放机制](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/ppo_countdown_exp_replay)
      + [高级数据处理能力 & Human-in-the-loop](/tutorial/example_data_functionalities.md) | +| *强化学习算法开发* | + [使用 Trinity-RFT 进行 RL 算法开发](/tutorial/example_mix_algo.md) (📝 [论文](https://arxiv.org/pdf/2508.11408))
      + [研究项目: R3L (基于反思-重试的强化学习)](https://github.com/shiweijiezero/R3L) (📝 [论文](https://arxiv.org/abs/2601.03715))
      + [研究项目: group-relative REINFORCE](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/rec_gsm8k) (📝 [论文](https://arxiv.org/abs/2509.24203))
      + 不可验证的领域: [RULER](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/grpo_gsm8k_ruler), [可训练 RULER](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/grpo_gsm8k_trainable_ruler), [rubric-as-reward](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/grpo_rubric_as_reward) | +| *基准测试* | + [基准测试工具 (快速验证与实验)](https://github.com/agentscope-ai/Trinity-RFT/tree/main/benchmark/README.md)
      + [Guru-Math 测试 & 对比 veRL](https://github.com/agentscope-ai/Trinity-RFT/tree/main/benchmark/reports/guru_math.md)
      + [FrozenLake 测试 & 对比 rLLM](https://github.com/agentscope-ai/Trinity-RFT/tree/main/benchmark/reports/frozenlake.md)
      + [Alfworld 测试 & 对比 rLLM](https://github.com/agentscope-ai/Trinity-RFT/tree/main/benchmark/reports/alfworld.md) | +| *深入认识 Trinity-RFT* | + [完整配置指南](https://agentscope-ai.github.io/Trinity-RFT/zh/main/tutorial/trinity_configs.html)
      + [GPU 资源与训练配置对应指南](https://agentscope-ai.github.io/Trinity-RFT/zh/main/tutorial/trinity_gpu_configs.html)
      + [理解 explorer-trainer 同步逻辑](https://agentscope-ai.github.io/Trinity-RFT/zh/main/tutorial/synchronizer.html)
      + [如何与 verl 对齐配置](https://agentscope-ai.github.io/Trinity-RFT/zh/main/tutorial/align_with_verl.html) | ## 🌟 核心特性 @@ -66,23 +66,23 @@ Trinity-RFT 面向不同背景和目标的用户提供相应功能: ## 🔨 算法支持 -下表列出了 Trinity-RFT 支持的算法,更多算法请参考 [算法模块](https://github.com/modelscope/Trinity-RFT/blob/main/trinity/algorithm/algorithm.py)。您也可以通过自定义不同的模块来构建新算法,参见 [教程](/tutorial/develop_algorithm.md)。 +下表列出了 Trinity-RFT 支持的算法,更多算法请参考 [算法模块](https://github.com/agentscope-ai/Trinity-RFT/blob/main/trinity/algorithm/algorithm.py)。您也可以通过自定义不同的模块来构建新算法,参见 [教程](/tutorial/develop_algorithm.md)。 | 算法 | 文档/示例 | 核心代码 | 关键配置 | |:-----------|:-----------|:---------------|:-----------| -| PPO [[论文](https://arxiv.org/pdf/1707.06347)] | [[文档](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_reasoning_basic.html)] [[Countdown 例子](https://github.com/modelscope/Trinity-RFT/tree/main/examples/ppo_countdown)] | [[代码](https://github.com/modelscope/Trinity-RFT/tree/main/trinity/algorithm/policy_loss_fn/ppo_policy_loss.py)] | `algorithm_type: ppo` | -| GRPO [[论文](https://arxiv.org/pdf/2402.03300)] | [[文档](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_reasoning_basic.html)] [[GSM8K 例子](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_gsm8k)]| [[代码](https://github.com/modelscope/Trinity-RFT/tree/main/trinity/algorithm/advantage_fn/grpo_advantage.py)] | `algorithm_type: grpo` | -| CHORD 💡 [[论文](https://arxiv.org/pdf/2508.11408)] | [[文档](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_mix_algo.html)] [[ToolACE 例子](https://github.com/modelscope/Trinity-RFT/blob/main/examples/mix_chord/mix_chord_toolace.yaml)] | [[代码](https://github.com/modelscope/Trinity-RFT/tree/main/trinity/algorithm/policy_loss_fn/chord_policy_loss.py)] | `algorithm_type: mix_chord` | -| REC Series 💡 [[论文](https://arxiv.org/pdf/2509.24203)] | [[GSM8K 例子](https://github.com/modelscope/Trinity-RFT/tree/main/examples/rec_gsm8k)] | [[代码](https://github.com/modelscope/Trinity-RFT/tree/main/trinity/algorithm/policy_loss_fn/rec_policy_loss.py)] | `algorithm_type: rec` | -| RLOO [[论文](https://arxiv.org/pdf/2402.14740)] | - | [[代码](https://github.com/modelscope/Trinity-RFT/tree/main/trinity/algorithm/advantage_fn/rloo_advantage.py)] | `algorithm_type: rloo` | -| REINFORCE++ [[论文](https://arxiv.org/pdf/2501.03262)] | - | [[代码](https://github.com/modelscope/Trinity-RFT/tree/main/trinity/algorithm/advantage_fn/reinforce_advantage.py)] | `algorithm_type: reinforceplusplus` | -| GSPO [[论文](https://arxiv.org/pdf/2507.18071)] | - | [[代码](https://github.com/modelscope/Trinity-RFT/tree/main/trinity/algorithm/policy_loss_fn/gspo_policy_loss.py)] | `algorithm_type: gspo` | -| TOPR [[论文](https://arxiv.org/pdf/2503.14286)] | [[GSM8K 例子](https://github.com/modelscope/Trinity-RFT/tree/main/examples/topr_gsm8k)] | [[代码](https://github.com/modelscope/Trinity-RFT/tree/main/trinity/algorithm/policy_loss_fn/topr_policy_loss.py)] | `algorithm_type: topr` | -| sPPO [[论文](https://arxiv.org/pdf/2108.05828)] | [[GSM8K 例子](https://github.com/modelscope/Trinity-RFT/tree/main/examples/sppo_gsm8k)] | [[代码](https://github.com/modelscope/Trinity-RFT/tree/main/trinity/algorithm/policy_loss_fn/sppo_loss_fn.py)] | `algorithm_type: sppo` | -| AsymRE [[论文](https://arxiv.org/pdf/2506.20520)] | [[GSM8K 例子](https://github.com/modelscope/Trinity-RFT/tree/main/examples/asymre_gsm8k)] | [[代码](https://github.com/modelscope/Trinity-RFT/tree/main/trinity/algorithm/advantage_fn/asymre_advantage.py)] | `algorithm_type: asymre` | -| CISPO 
[[论文](https://arxiv.org/pdf/2506.13585)] | - | [[代码](https://github.com/modelscope/Trinity-RFT/tree/main/trinity/algorithm/policy_loss_fn/cispo_policy_loss.py)] | `algorithm_type: cispo` | -| SAPO [[论文](https://arxiv.org/pdf/2511.20347)] | - | [[代码](https://github.com/modelscope/Trinity-RFT/tree/main/trinity/algorithm/policy_loss_fn/sapo_policy_loss.py)] | `algorithm_type: sapo` | -| On-Policy Distillation [[博客](https://thinkingmachines.ai/blog/on-policy-distillation/)] [[论文](https://arxiv.org/pdf/2306.13649)] | [[GSM8K 示例](https://github.com/modelscope/Trinity-RFT/tree/main/examples/on_policy_distill)] | [[代码](https://github.com/modelscope/Trinity-RFT/tree/main/trinity/common/workflows/on_policy_distill_workflow.py)] | `algorithm_type: on_policy_distill` | +| PPO [[论文](https://arxiv.org/pdf/1707.06347)] | [[文档](https://agentscope-ai.github.io/Trinity-RFT/zh/main/tutorial/example_reasoning_basic.html)] [[Countdown 例子](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/ppo_countdown)] | [[代码](https://github.com/agentscope-ai/Trinity-RFT/tree/main/trinity/algorithm/policy_loss_fn/ppo_policy_loss.py)] | `algorithm_type: ppo` | +| GRPO [[论文](https://arxiv.org/pdf/2402.03300)] | [[文档](https://agentscope-ai.github.io/Trinity-RFT/zh/main/tutorial/example_reasoning_basic.html)] [[GSM8K 例子](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/grpo_gsm8k)]| [[代码](https://github.com/agentscope-ai/Trinity-RFT/tree/main/trinity/algorithm/advantage_fn/grpo_advantage.py)] | `algorithm_type: grpo` | +| CHORD 💡 [[论文](https://arxiv.org/pdf/2508.11408)] | [[文档](https://agentscope-ai.github.io/Trinity-RFT/zh/main/tutorial/example_mix_algo.html)] [[ToolACE 例子](https://github.com/agentscope-ai/Trinity-RFT/blob/main/examples/mix_chord/mix_chord_toolace.yaml)] | [[代码](https://github.com/agentscope-ai/Trinity-RFT/tree/main/trinity/algorithm/policy_loss_fn/chord_policy_loss.py)] | `algorithm_type: mix_chord` | +| REC Series 💡 [[论文](https://arxiv.org/pdf/2509.24203)] | [[GSM8K 例子](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/rec_gsm8k)] | [[代码](https://github.com/agentscope-ai/Trinity-RFT/tree/main/trinity/algorithm/policy_loss_fn/rec_policy_loss.py)] | `algorithm_type: rec` | +| RLOO [[论文](https://arxiv.org/pdf/2402.14740)] | - | [[代码](https://github.com/agentscope-ai/Trinity-RFT/tree/main/trinity/algorithm/advantage_fn/rloo_advantage.py)] | `algorithm_type: rloo` | +| REINFORCE++ [[论文](https://arxiv.org/pdf/2501.03262)] | - | [[代码](https://github.com/agentscope-ai/Trinity-RFT/tree/main/trinity/algorithm/advantage_fn/reinforce_advantage.py)] | `algorithm_type: reinforceplusplus` | +| GSPO [[论文](https://arxiv.org/pdf/2507.18071)] | - | [[代码](https://github.com/agentscope-ai/Trinity-RFT/tree/main/trinity/algorithm/policy_loss_fn/gspo_policy_loss.py)] | `algorithm_type: gspo` | +| TOPR [[论文](https://arxiv.org/pdf/2503.14286)] | [[GSM8K 例子](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/topr_gsm8k)] | [[代码](https://github.com/agentscope-ai/Trinity-RFT/tree/main/trinity/algorithm/policy_loss_fn/topr_policy_loss.py)] | `algorithm_type: topr` | +| sPPO [[论文](https://arxiv.org/pdf/2108.05828)] | [[GSM8K 例子](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/sppo_gsm8k)] | [[代码](https://github.com/agentscope-ai/Trinity-RFT/tree/main/trinity/algorithm/policy_loss_fn/sppo_loss_fn.py)] | `algorithm_type: sppo` | +| AsymRE [[论文](https://arxiv.org/pdf/2506.20520)] | [[GSM8K 例子](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/asymre_gsm8k)] | 
[[代码](https://github.com/agentscope-ai/Trinity-RFT/tree/main/trinity/algorithm/advantage_fn/asymre_advantage.py)] | `algorithm_type: asymre` | +| CISPO [[论文](https://arxiv.org/pdf/2506.13585)] | - | [[代码](https://github.com/agentscope-ai/Trinity-RFT/tree/main/trinity/algorithm/policy_loss_fn/cispo_policy_loss.py)] | `algorithm_type: cispo` | +| SAPO [[论文](https://arxiv.org/pdf/2511.20347)] | - | [[代码](https://github.com/agentscope-ai/Trinity-RFT/tree/main/trinity/algorithm/policy_loss_fn/sapo_policy_loss.py)] | `algorithm_type: sapo` | +| On-Policy Distillation [[博客](https://thinkingmachines.ai/blog/on-policy-distillation/)] [[论文](https://arxiv.org/pdf/2306.13649)] | [[GSM8K 示例](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/on_policy_distill)] | [[代码](https://github.com/agentscope-ai/Trinity-RFT/tree/main/trinity/common/workflows/on_policy_distill_workflow.py)] | `algorithm_type: on_policy_distill` | diff --git a/docs/sphinx_doc/source_zh/tutorial/align_with_verl.md b/docs/sphinx_doc/source_zh/tutorial/align_with_verl.md index b8a65f0f27..52751e76b0 100644 --- a/docs/sphinx_doc/source_zh/tutorial/align_with_verl.md +++ b/docs/sphinx_doc/source_zh/tutorial/align_with_verl.md @@ -23,7 +23,7 @@ Trinity-RFT 根据功能将强化微调的大量参数分为几个部分,例 | 一些全局配置 | `trainer` | `monitor`、`synchronizer`、`cluster` 等 | -在以下内容中,我们将展示如何将 veRL 中的参数映射到 Trinity-RFT 中的参数。有关 Trinity-RFT 的详细参数配置,请参考[文档](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/trinity_configs.html)。 +在以下内容中,我们将展示如何将 veRL 中的参数映射到 Trinity-RFT 中的参数。有关 Trinity-RFT 的详细参数配置,请参考[文档](https://agentscope-ai.github.io/Trinity-RFT/zh/main/tutorial/trinity_configs.html)。 ```{note} @@ -142,7 +142,7 @@ explorer: max_response_tokens: 1024 max_model_len: 20480 ``` -请参考使用 LLM-as-a-judge 的[配置](https://github.com/modelscope/Trinity-RFT/blob/main/examples/grpo_rubric_as_reward/rubric.yaml)和[工作流](https://github.com/modelscope/Trinity-RFT/blob/main/trinity/common/workflows/rubric_judge_workflow.py)了解更多详情。 +请参考使用 LLM-as-a-judge 的[配置](https://github.com/agentscope-ai/Trinity-RFT/blob/main/examples/grpo_rubric_as_reward/rubric.yaml)和[工作流](https://github.com/agentscope-ai/Trinity-RFT/blob/main/trinity/common/workflows/rubric_judge_workflow.py)了解更多详情。 ### Trainer diff --git a/docs/sphinx_doc/source_zh/tutorial/develop_operator.md b/docs/sphinx_doc/source_zh/tutorial/develop_operator.md index 6523ded045..14f21ccfa8 100644 --- a/docs/sphinx_doc/source_zh/tutorial/develop_operator.md +++ b/docs/sphinx_doc/source_zh/tutorial/develop_operator.md @@ -7,7 +7,7 @@ Operator 模块负责处理由 Explorer 所生成的轨迹数据(我们称之为 `Experience`)。它原生支持来自 [Data-Juicer](https://github.com/modelscope/data-juicer) 的数据处理功能,也允许开发者实现自己的算子。 通过自定义数据处理算子,开发者可以实现各种数据处理功能,如数据增强、过滤和转换。你甚至可以将优势值/回报值计算实现为 Operator,如 {ref}`算法 ` 部分所示。 -- **DataJuicerOperator** ({class}`trinity.buffer.operators.DataJuicerOperator`):封装后的 Data-Juicer 算子,使用时只需在配置文件中标明想要使用的 Data-Juicer 算子列表即可。完整的 Data-Juicer 算子列表请见 [此处](https://modelscope.github.io/data-juicer/en/main/docs/Operators.html)。 +- **DataJuicerOperator** ({class}`trinity.buffer.operators.DataJuicerOperator`):封装后的 Data-Juicer 算子,使用时只需在配置文件中标明想要使用的 Data-Juicer 算子列表即可。完整的 Data-Juicer 算子列表请见 [此处](https://agentscope-ai.github.io/data-juicer/en/main/docs/Operators.html)。 - **ExperienceOperator** ({class}`trinity.buffer.operators.ExperienceOperator`):用于 experience 数据处理的所有数据处理算子的基类。定义了所有数据处理算子应具备的接口和通用功能。每个算子处理一批 experience 数据,并返回处理后的数据及用于日志记录的指标。 - **ExperiencePipeline** ({class}`trinity.buffer.pipelines.ExperiencePipeline`):管理一系列数据处理算子的 experience 数据处理流水线。它从 `Explorer` 获取原始 
experience,通过流水线中的每个算子处理,最后将最终处理过的 experience 写入 `Trainer` 的输入缓冲区。 diff --git a/docs/sphinx_doc/source_zh/tutorial/example_async_mode.md b/docs/sphinx_doc/source_zh/tutorial/example_async_mode.md index 96befe34e8..e5350f702b 100644 --- a/docs/sphinx_doc/source_zh/tutorial/example_async_mode.md +++ b/docs/sphinx_doc/source_zh/tutorial/example_async_mode.md @@ -4,7 +4,7 @@ Trinity-RFT 支持通过在独立进程中运行 trainer 和 explorer 来实现异步 RFT。 -我们提供了两个主要配置文件:[`explorer.yaml`](https://github.com/modelscope/Trinity-RFT/blob/main/examples/async_gsm8k/explorer.yaml) 和 [`trainer.yaml`](https://github.com/modelscope/Trinity-RFT/blob/main/examples/async_gsm8k/trainer.yaml)。 +我们提供了两个主要配置文件:[`explorer.yaml`](https://github.com/agentscope-ai/Trinity-RFT/blob/main/examples/async_gsm8k/explorer.yaml) 和 [`trainer.yaml`](https://github.com/agentscope-ai/Trinity-RFT/blob/main/examples/async_gsm8k/trainer.yaml)。 两者之间的主要区别是:在 `explorer.yaml` 中将 `mode` 设置为 `explore`,而在 `trainer.yaml` 中将 `mode` 设置为 `train`。 Explorer 与 Trainer 的模型权重每处理 `sync_interval * batch_size` 个任务后同步一次。 diff --git a/docs/sphinx_doc/source_zh/tutorial/example_data_functionalities.md b/docs/sphinx_doc/source_zh/tutorial/example_data_functionalities.md index 673a428713..c1a9a42b93 100644 --- a/docs/sphinx_doc/source_zh/tutorial/example_data_functionalities.md +++ b/docs/sphinx_doc/source_zh/tutorial/example_data_functionalities.md @@ -4,7 +4,7 @@ ## 概述 Trinity-RFT 提供了一个统一的数据处理器,用于处理 task 流水线和 experience 流水线中的原始数据集及 experience 数据。 -- 对于任务,数据处理能力来源于 [Data-Juicer](https://github.com/modelscope/data-juicer)。你可以使用 Data-Juicer 提供的数据处理算子。完整的 Data-Juicer 算子列表可在 [此处](https://modelscope.github.io/data-juicer/en/main/docs/Operators.html) 查看。 +- 对于任务,数据处理能力来源于 [Data-Juicer](https://github.com/modelscope/data-juicer)。你可以使用 Data-Juicer 提供的数据处理算子。完整的 Data-Juicer 算子列表可在 [此处](https://agentscope-ai.github.io/data-juicer/en/main/docs/Operators.html) 查看。 - 对于 experience 数据,除了 Data-Juicer 算子外,Trinity-RFT 还提供了若干与 RFT 相关的算子,并允许开发者实现自定义算子。 如需实现自己的数据处理器,可参考[开发者指南](trinity_programming_guide.md#operators-for-data-developers)。 @@ -71,7 +71,7 @@ service: Data-Juicer 的数据处理以服务形式运行,因此需要配置 data-juicer 服务。幸运的是,Trinity-RFT 提供了自动启动方式,只需在 `service` 部分将 `data-juicer` 服务的 `auto_start` 设为 `true` 即可自动启动数据处理器服务。 -`data_processor` 部分的所有配置项详见 [此处](trinity_configs.md)。本示例对应的 GSM8K 配置文件可在 [该配置文件](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_gsm8k_task_pipeline/gsm8k.yaml) 中找到。 +`data_processor` 部分的所有配置项详见 [此处](trinity_configs.md)。本示例对应的 GSM8K 配置文件可在 [该配置文件](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/grpo_gsm8k_task_pipeline/gsm8k.yaml) 中找到。 ```{note} 只有当提供了任一 `xxx_pipeline`,且 pipeline 配置中提供了 `dj_process_desc` 或 `dj_config_path` 之一时,数据处理器和数据主动迭代器才会被激活。否则该部分将被跳过,直接进入探索阶段。 @@ -167,7 +167,7 @@ process: field_names: ["prompt", "response"] ``` -`data_processor` 部分的所有配置项详见 [此处](trinity_configs.md)。本示例对应的 GSM8K 配置文件可在 [该配置文件](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_gsm8k_experience_pipeline/gsm8k.yaml) 中找到。 +`data_processor` 部分的所有配置项详见 [此处](trinity_configs.md)。本示例对应的 GSM8K 配置文件可在 [该配置文件](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/grpo_gsm8k_experience_pipeline/gsm8k.yaml) 中找到。 ### 探索与训练 完成 Trinity-RFT 配置文件准备后,可启动 Ray 集群并运行包含数据主动迭代器部分的 RFT 流程: @@ -245,7 +245,7 @@ service: 你还可以为此算子设置更多配置项(例如标注完成时的通知)。更多细节请参考 [此文档](https://github.com/modelscope/data-juicer/tree/main/configs/annotation)。 -`data_processor` 部分的所有配置项详见 [此处](trinity_configs.md)。本示例的配置文件可在 
[该配置文件](https://github.com/modelscope/Trinity-RFT/tree/main/examples/dpo_human_in_the_loop/dpo.yaml) 中找到。 +`data_processor` 部分的所有配置项详见 [此处](trinity_configs.md)。本示例的配置文件可在 [该配置文件](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/dpo_human_in_the_loop/dpo.yaml) 中找到。 ### 开始运行 diff --git a/docs/sphinx_doc/source_zh/tutorial/example_dataset_perspective.md b/docs/sphinx_doc/source_zh/tutorial/example_dataset_perspective.md index 24b85906b3..88f4c59cb3 100644 --- a/docs/sphinx_doc/source_zh/tutorial/example_dataset_perspective.md +++ b/docs/sphinx_doc/source_zh/tutorial/example_dataset_perspective.md @@ -6,42 +6,42 @@ | 数据集 | 算法 | 使用场景 | 参考文档 | |--------------------------------------------------------------------------------------------------------------|-----------------|---------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| [openai/gsm8k](https://huggingface.co/datasets/openai/gsm8k) | GRPO | 常规 RFT | [样例位置](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_gsm8k), [相关文档](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_reasoning_basic.html) | -| | GRPO | 异步训练 | [样例位置](https://github.com/modelscope/Trinity-RFT/tree/main/examples/async_gsm8k), [相关文档](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_async_mode.html) | -| | Multi-Step GRPO | AgentScope ReAct 智能体训练 | [样例位置](https://github.com/modelscope/Trinity-RFT/tree/main/examples/agentscope_react), [相关文档](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_react.html) | -| | AsymRE | 常规 RFT | [样例位置](https://github.com/modelscope/Trinity-RFT/tree/main/examples/asymre_gsm8k) | -| | CISPO | 常规 RFT | [样例位置](https://github.com/modelscope/Trinity-RFT/tree/main/examples/cispo_gsm8k) | -| | GRPO | 使用优先级任务进行训练 | [样例位置](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_gsm8k_task_pipeline), [相关文档](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_data_functionalities.html#example-data-processor-for-task-pipeline) | -| | GRPO | 在经验上进行奖励重塑的训练 | [样例位置](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_gsm8k_experience_pipeline), [相关文档](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_data_functionalities.html#example-data-processor-for-experience-pipeline) | -| | GRPO | 使用 RULER (Relative Universal LLM-Elicited Rewards) 进行训练 | [样例位置](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_gsm8k_ruler) | -| | GRPO | 训练策略模型作为其自身的奖励模型 | [样例位置](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_gsm8k_trainable_ruler) | -| | GRPO | 使用 LoRA 进行训练 | [样例位置](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_lora_gsm8k) | -| | OPMD | 异策略 RFT | [样例位置](https://github.com/modelscope/Trinity-RFT/tree/main/examples/opmd_gsm8k), [相关文档](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_reasoning_advanced.html) | -| | REC | 使用组相对强化变体进行训练 | [样例位置](https://github.com/modelscope/Trinity-RFT/tree/main/examples/rec_gsm8k) | -| | sPPO | 使用 sPPO 算法进行训练 | [样例位置](https://github.com/modelscope/Trinity-RFT/tree/main/examples/sppo_gsm8k) | -| | TOPR | 渐减式异策略 RFT | [样例位置](https://github.com/modelscope/Trinity-RFT/tree/main/examples/topr_gsm8k) | -| 数学类型任务 | GRPO | 使用 RM-Gallery 的奖励进行训练 | 
[样例位置](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_math) | -| | AsymRE | 常规 RFT | [样例位置](https://github.com/modelscope/Trinity-RFT/tree/main/examples/asymre_math) | -| | MIX | 使用更先进大模型生成的“专家”数据进行训练 | [样例位置](https://github.com/modelscope/Trinity-RFT/tree/main/examples/mix_math), [相关文档](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_mix_algo.html) | -| [ALFWorld](https://github.com/alfworld/alfworld) | GRPO | 拼接多轮 RFT | [样例位置](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_alfworld), [相关文档](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_multi_turn.html) | -| | Multi-Step GRPO | 通用多轮 RFT | [样例位置](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_alfworld_general_multi_step), [相关文档](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_step_wise.html) | -| [SciWorld](https://github.com/allenai/ScienceWorld) | GRPO | 拼接多轮 RFT | [样例位置](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_sciworld) | -| [WebShop](https://github.com/princeton-nlp/WebShop) | GRPO | 拼接多轮 RFT | [样例位置](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_webshop), [相关文档](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_multi_turn.html) | -| [callanwu/WebWalkerQA](https://huggingface.co/datasets/callanwu/WebWalkerQA) | Multi-Step GRPO | 多轮网页搜索智能体训练 | [样例位置](https://github.com/modelscope/Trinity-RFT/tree/main/examples/agentscope_websearch) | -| [corbt/enron-emails](https://huggingface.co/datasets/corbt/enron-emails) | Multi-Step GRPO | 多轮邮件搜索智能体训练 | [样例位置](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_email_search), [相关文档](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_search_email.html) | -| [open-r1/DAPO-Math-17k-Processed](https://huggingface.co/datasets/open-r1/DAPO-Math-17k-Processed) | GRPO | 常规 RFT | [样例位置](https://github.com/modelscope/Trinity-RFT/tree/main/examples/dapo_math) | -| [LLM360/guru-RL-92k](https://huggingface.co/datasets/LLM360/guru-RL-92k) | GRPO | 使用贝叶斯在线任务选择进行训练 | [样例位置](https://github.com/modelscope/Trinity-RFT/tree/main/examples/bots) | -| [Frozen Lake](https://gymnasium.farama.org/environments/toy_text/frozen_lake/) | GRPO | 拼接多轮 RFT | [样例位置](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_frozen_lake) | -| [anisha2102/RaR-Medicine](https://huggingface.co/datasets/anisha2102/RaR-Medicine) | GRPO | 针对不可验证医学问答任务,使用大模型裁判和评分标准提供奖励进行训练 | [样例位置](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_rubric_as_reward) | -| [Team-ACE/ToolACE](https://huggingface.co/datasets/Team-ACE/ToolACE) | GRPO | 针对工具调用的常规 RFT | [样例位置](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_toolcall) | -| [hiyouga/geometry3k](https://huggingface.co/datasets/hiyouga/geometry3k) | GRPO | 针对视觉语言模型的常规 RFT | [样例位置](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_vlm) | -| | MIX | 使用更先进大模型生成的“专家”数据进行训练 | [样例位置](https://github.com/modelscope/Trinity-RFT/tree/main/examples/mix_vlm) | -| [datajuicer/RealMedConv](https://huggingface.co/datasets/datajuicer/RealMedConv) | GRPO | 学习主动提问的常规 RFT | [样例位置](https://github.com/modelscope/Trinity-RFT/tree/main/examples/learn_to_ask) | -| [datajuicer/Trinity-ToolAce-RL-split](https://huggingface.co/datasets/datajuicer/Trinity-ToolAce-RL-split) | CHORD | 动态 SFT 与 RL 联合训练 | [样例位置](https://github.com/modelscope/Trinity-RFT/tree/main/examples/mix_chord) | -| 
[datajuicer/Trinity-ToolAce-SFT-split](https://huggingface.co/datasets/datajuicer/Trinity-ToolAce-SFT-split) | CHORD | 动态 SFT 与 RL 联合训练 | [样例位置](https://github.com/modelscope/Trinity-RFT/tree/main/examples/mix_chord) | -| [Jiayi-Pan/Countdown-Tasks-3to4](https://huggingface.co/datasets/Jiayi-Pan/Countdown-Tasks-3to4) | PPO | 基于 critic 模型的训练 | [样例位置](https://github.com/modelscope/Trinity-RFT/tree/main/examples/ppo_countdown) | -| | PPO | 使用 Megatron-LM 作为训练后端 | [样例位置](https://github.com/modelscope/Trinity-RFT/tree/main/examples/ppo_countdown_megatron) | -| | PPO | 使用经验回放进行训练 | [样例位置](https://github.com/modelscope/Trinity-RFT/tree/main/examples/ppo_countdown_exp_replay) | -| [open-r1/Mixture-of-Thoughts](https://huggingface.co/datasets/open-r1/Mixture-of-Thoughts) | SFT | 常规 SFT | [样例位置](https://github.com/modelscope/Trinity-RFT/tree/main/examples/sft_mot), [相关文档](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_dpo.html#configuration-for-sft) | -| [HumanLLMs/Human-Like-DPO-Dataset](https://huggingface.co/datasets/HumanLLMs/Human-Like-DPO-Dataset) | DPO | 基于预设人类偏好的训练 | [样例位置](https://github.com/modelscope/Trinity-RFT/tree/main/examples/dpo_humanlike), [相关文档](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_dpo.html) | -| 示例数据 | DPO | 基于训练环路中人类实时偏好标注的训练 | [样例位置](https://github.com/modelscope/Trinity-RFT/tree/main/examples/dpo_human_in_the_loop), [相关文档](https://modelscope.github.io/Trinity-RFT/zh/main/tutorial/example_data_functionalities.html#example-human-in-the-loop) | +| [openai/gsm8k](https://huggingface.co/datasets/openai/gsm8k) | GRPO | 常规 RFT | [样例位置](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/grpo_gsm8k), [相关文档](https://agentscope-ai.github.io/Trinity-RFT/zh/main/tutorial/example_reasoning_basic.html) | +| | GRPO | 异步训练 | [样例位置](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/async_gsm8k), [相关文档](https://agentscope-ai.github.io/Trinity-RFT/zh/main/tutorial/example_async_mode.html) | +| | Multi-Step GRPO | AgentScope ReAct 智能体训练 | [样例位置](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/agentscope_react), [相关文档](https://agentscope-ai.github.io/Trinity-RFT/zh/main/tutorial/example_react.html) | +| | AsymRE | 常规 RFT | [样例位置](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/asymre_gsm8k) | +| | CISPO | 常规 RFT | [样例位置](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/cispo_gsm8k) | +| | GRPO | 使用优先级任务进行训练 | [样例位置](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/grpo_gsm8k_task_pipeline), [相关文档](https://agentscope-ai.github.io/Trinity-RFT/en/main/tutorial/example_data_functionalities.html#example-data-processor-for-task-pipeline) | +| | GRPO | 在经验上进行奖励重塑的训练 | [样例位置](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/grpo_gsm8k_experience_pipeline), [相关文档](https://agentscope-ai.github.io/Trinity-RFT/zh/main/tutorial/example_data_functionalities.html#example-data-processor-for-experience-pipeline) | +| | GRPO | 使用 RULER (Relative Universal LLM-Elicited Rewards) 进行训练 | [样例位置](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/grpo_gsm8k_ruler) | +| | GRPO | 训练策略模型作为其自身的奖励模型 | [样例位置](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/grpo_gsm8k_trainable_ruler) | +| | GRPO | 使用 LoRA 进行训练 | [样例位置](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/grpo_lora_gsm8k) | +| | OPMD | 异策略 RFT | [样例位置](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/opmd_gsm8k), 
[相关文档](https://agentscope-ai.github.io/Trinity-RFT/zh/main/tutorial/example_reasoning_advanced.html) | +| | REC | 使用组相对强化变体进行训练 | [样例位置](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/rec_gsm8k) | +| | sPPO | 使用 sPPO 算法进行训练 | [样例位置](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/sppo_gsm8k) | +| | TOPR | 渐减式异策略 RFT | [样例位置](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/topr_gsm8k) | +| 数学类型任务 | GRPO | 使用 RM-Gallery 的奖励进行训练 | [样例位置](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/grpo_math) | +| | AsymRE | 常规 RFT | [样例位置](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/asymre_math) | +| | MIX | 使用更先进大模型生成的“专家”数据进行训练 | [样例位置](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/mix_math), [相关文档](https://agentscope-ai.github.io/Trinity-RFT/zh/main/tutorial/example_mix_algo.html) | +| [ALFWorld](https://github.com/alfworld/alfworld) | GRPO | 拼接多轮 RFT | [样例位置](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/grpo_alfworld), [相关文档](https://agentscope-ai.github.io/Trinity-RFT/zh/main/tutorial/example_multi_turn.html) | +| | Multi-Step GRPO | 通用多轮 RFT | [样例位置](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/grpo_alfworld_general_multi_step), [相关文档](https://agentscope-ai.github.io/Trinity-RFT/zh/main/tutorial/example_step_wise.html) | +| [SciWorld](https://github.com/allenai/ScienceWorld) | GRPO | 拼接多轮 RFT | [样例位置](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/grpo_sciworld) | +| [WebShop](https://github.com/princeton-nlp/WebShop) | GRPO | 拼接多轮 RFT | [样例位置](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/grpo_webshop), [相关文档](https://agentscope-ai.github.io/Trinity-RFT/zh/main/tutorial/example_multi_turn.html) | +| [callanwu/WebWalkerQA](https://huggingface.co/datasets/callanwu/WebWalkerQA) | Multi-Step GRPO | 多轮网页搜索智能体训练 | [样例位置](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/agentscope_websearch) | +| [corbt/enron-emails](https://huggingface.co/datasets/corbt/enron-emails) | Multi-Step GRPO | 多轮邮件搜索智能体训练 | [样例位置](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/grpo_email_search), [相关文档](https://agentscope-ai.github.io/Trinity-RFT/zh/main/tutorial/example_search_email.html) | +| [open-r1/DAPO-Math-17k-Processed](https://huggingface.co/datasets/open-r1/DAPO-Math-17k-Processed) | GRPO | 常规 RFT | [样例位置](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/dapo_math) | +| [LLM360/guru-RL-92k](https://huggingface.co/datasets/LLM360/guru-RL-92k) | GRPO | 使用贝叶斯在线任务选择进行训练 | [样例位置](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/bots) | +| [Frozen Lake](https://gymnasium.farama.org/environments/toy_text/frozen_lake/) | GRPO | 拼接多轮 RFT | [样例位置](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/grpo_frozen_lake) | +| [anisha2102/RaR-Medicine](https://huggingface.co/datasets/anisha2102/RaR-Medicine) | GRPO | 针对不可验证医学问答任务,使用大模型裁判和评分标准提供奖励进行训练 | [样例位置](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/grpo_rubric_as_reward) | +| [Team-ACE/ToolACE](https://huggingface.co/datasets/Team-ACE/ToolACE) | GRPO | 针对工具调用的常规 RFT | [样例位置](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/grpo_toolcall) | +| [hiyouga/geometry3k](https://huggingface.co/datasets/hiyouga/geometry3k) | GRPO | 针对视觉语言模型的常规 RFT | [样例位置](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/grpo_vlm) | +| | MIX | 使用更先进大模型生成的“专家”数据进行训练 | 
[样例位置](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/mix_vlm) | +| [datajuicer/RealMedConv](https://huggingface.co/datasets/datajuicer/RealMedConv) | GRPO | 学习主动提问的常规 RFT | [样例位置](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/learn_to_ask) | +| [datajuicer/Trinity-ToolAce-RL-split](https://huggingface.co/datasets/datajuicer/Trinity-ToolAce-RL-split) | CHORD | 动态 SFT 与 RL 联合训练 | [样例位置](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/mix_chord) | +| [datajuicer/Trinity-ToolAce-SFT-split](https://huggingface.co/datasets/datajuicer/Trinity-ToolAce-SFT-split) | CHORD | 动态 SFT 与 RL 联合训练 | [样例位置](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/mix_chord) | +| [Jiayi-Pan/Countdown-Tasks-3to4](https://huggingface.co/datasets/Jiayi-Pan/Countdown-Tasks-3to4) | PPO | 基于 critic 模型的训练 | [样例位置](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/ppo_countdown) | +| | PPO | 使用 Megatron-LM 作为训练后端 | [样例位置](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/ppo_countdown_megatron) | +| | PPO | 使用经验回放进行训练 | [样例位置](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/ppo_countdown_exp_replay) | +| [open-r1/Mixture-of-Thoughts](https://huggingface.co/datasets/open-r1/Mixture-of-Thoughts) | SFT | 常规 SFT | [样例位置](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/sft_mot), [相关文档](https://agentscope-ai.github.io/Trinity-RFT/zh/main/tutorial/example_dpo.html#configuration-for-sft) | +| [HumanLLMs/Human-Like-DPO-Dataset](https://huggingface.co/datasets/HumanLLMs/Human-Like-DPO-Dataset) | DPO | 基于预设人类偏好的训练 | [样例位置](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/dpo_humanlike), [相关文档](https://agentscope-ai.github.io/Trinity-RFT/zh/main/tutorial/example_dpo.html) | +| 示例数据 | DPO | 基于训练环路中人类实时偏好标注的训练 | [样例位置](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/dpo_human_in_the_loop), [相关文档](https://agentscope-ai.github.io/Trinity-RFT/zh/main/tutorial/example_data_functionalities.html#example-human-in-the-loop) | diff --git a/docs/sphinx_doc/source_zh/tutorial/example_dpo.md b/docs/sphinx_doc/source_zh/tutorial/example_dpo.md index 4396bad063..b0117fec59 100644 --- a/docs/sphinx_doc/source_zh/tutorial/example_dpo.md +++ b/docs/sphinx_doc/source_zh/tutorial/example_dpo.md @@ -52,7 +52,7 @@ huggingface-cli download HumanLLMs/Human-Like-DPO-Dataset --repo-type dataset -- ### DPO 配置 -我们在实验中使用 [`dpo.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/dpo_humanlike/dpo.yaml) 中的配置。以下列出一些重要设置: +我们在实验中使用 [`dpo.yaml`](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/dpo_humanlike/dpo.yaml) 中的配置。以下列出一些重要设置: 我们在 train 模式下运行实验,因为没有使用 explorer。要启用此模式,需将 `mode` 设置为 `train`,并将数据路径传递给 trainer。 @@ -107,7 +107,7 @@ trainer: ### SFT 配置 -我们将 `algorithm_type` 设为 `sft` 来运行 SFT 流程,并对配置文件 [`examples/sft_mot/sft.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/sft_mot/sft.yaml) 进行如下修改: +我们将 `algorithm_type` 设为 `sft` 来运行 SFT 流程,并对配置文件 [`examples/sft_mot/sft.yaml`](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/sft_mot/sft.yaml) 进行如下修改: ```yaml project: diff --git a/docs/sphinx_doc/source_zh/tutorial/example_megatron.md b/docs/sphinx_doc/source_zh/tutorial/example_megatron.md index 597beb7372..f0eaa63279 100644 --- a/docs/sphinx_doc/source_zh/tutorial/example_megatron.md +++ b/docs/sphinx_doc/source_zh/tutorial/example_megatron.md @@ -20,9 +20,10 @@ pip install -e ".[megatron]" # uv sync -extra megatron ``` -另外还需要从源码安装 NVIDIA 的 Apex 
库以支持混合精度训练: +另外还需要从源码安装 mbridge 和 NVIDIA 的 Apex 库以支持混合精度训练: ```bash +pip install git+https://github.com/ISEEKYAN/mbridge.git@20e9ffbbe72ae7b1df83bfe1bc3c11f7382f2612 pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation \ --config-settings "--build-option=--cpp_ext" \ --config-settings "--build-option=--cuda_ext" \ diff --git a/docs/sphinx_doc/source_zh/tutorial/example_mix_algo.md b/docs/sphinx_doc/source_zh/tutorial/example_mix_algo.md index b5027bae56..2499048f1a 100644 --- a/docs/sphinx_doc/source_zh/tutorial/example_mix_algo.md +++ b/docs/sphinx_doc/source_zh/tutorial/example_mix_algo.md @@ -257,7 +257,7 @@ class MIXPolicyLossFn(PolicyLossFn): ## 步骤 4:运行实验 通过上述新定义的类和函数,我们可以无需修改其他流程即可运行实验。 -下面展示了一个包含关键配置的示例,包括权重因子 $\mu$(即 `algorithm.policy_loss_fn_args['mu']`)以及专家 experience 的批次大小 $B'$,其值等于 `buffer.batch_size`、`algorithm.sample_strategy_args['expert_data_ratio']` 和 `algorithm.repeat_times` 的乘积。完整配置请参考 [`mix_math.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/mix_math/mix_math.yaml)。 +下面展示了一个包含关键配置的示例,包括权重因子 $\mu$(即 `algorithm.policy_loss_fn_args['mu']`)以及专家 experience 的批次大小 $B'$,其值等于 `buffer.batch_size`、`algorithm.sample_strategy_args['expert_data_ratio']` 和 `algorithm.repeat_times` 的乘积。完整配置请参考 [`mix_math.yaml`](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/mix_math/mix_math.yaml)。 ```yaml algorithm: diff --git a/docs/sphinx_doc/source_zh/tutorial/example_multi_turn.md b/docs/sphinx_doc/source_zh/tutorial/example_multi_turn.md index a2efbce398..1e9311e510 100644 --- a/docs/sphinx_doc/source_zh/tutorial/example_multi_turn.md +++ b/docs/sphinx_doc/source_zh/tutorial/example_multi_turn.md @@ -66,7 +66,7 @@ python examples/grpo_webshop/get_webshop_data.py ## 第二步:配置准备并运行实验 -你可以参考 [快速开始](./example_reasoning_basic.md) 来设置配置和其他内容。默认配置文件分别为 [`alfworld.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_alfworld/alfworld.yaml) 和 [`webshop.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_webshop/webshop.yaml)。 +你可以参考 [快速开始](./example_reasoning_basic.md) 来设置配置和其他内容。默认配置文件分别为 [`alfworld.yaml`](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/grpo_alfworld/alfworld.yaml) 和 [`webshop.yaml`](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/grpo_webshop/webshop.yaml)。 你可以适当修改配置并运行实验! 
```bash diff --git a/docs/sphinx_doc/source_zh/tutorial/example_reasoning_advanced.md b/docs/sphinx_doc/source_zh/tutorial/example_reasoning_advanced.md index 5ae59a8c56..b50ee64033 100644 --- a/docs/sphinx_doc/source_zh/tutorial/example_reasoning_advanced.md +++ b/docs/sphinx_doc/source_zh/tutorial/example_reasoning_advanced.md @@ -5,7 +5,7 @@ 让我们继续使用 [之前的 GSM8k 例子](./example_reasoning_basic.md),区别在于从 on-policy 模式切换到 off-policy 模式。 在这个例子中,我们考虑一个名为 OPMD 的 off-policy 强化学习算法。 该算法的设计与分析详见[我们的论文](https://arxiv.org/abs/2509.24203)中的 Section 2.2。 -本例子对应的配置文件为 [`opmd_gsm8k.yaml`](https://github.com/modelscope/Trinity-RFT/blob/main/examples/opmd_gsm8k/opmd_gsm8k.yaml)。 +本例子对应的配置文件为 [`opmd_gsm8k.yaml`](https://github.com/agentscope-ai/Trinity-RFT/blob/main/examples/opmd_gsm8k/opmd_gsm8k.yaml)。 要尝试 OPMD 算法,请运行: ```shell diff --git a/docs/sphinx_doc/source_zh/tutorial/example_reasoning_basic.md b/docs/sphinx_doc/source_zh/tutorial/example_reasoning_basic.md index 1a80338b78..a0f51d1015 100644 --- a/docs/sphinx_doc/source_zh/tutorial/example_reasoning_basic.md +++ b/docs/sphinx_doc/source_zh/tutorial/example_reasoning_basic.md @@ -47,7 +47,7 @@ huggingface-cli download openai/gsm8k --repo-type dataset --local-dir $DATASET_P ### 使用 GRPO 算法 -本实验使用 [`gsm8k.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_gsm8k/gsm8k.yaml) 中的配置。以下是 `gsm8k.yaml` 中一些重要配置项: +本实验使用 [`gsm8k.yaml`](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/grpo_gsm8k/gsm8k.yaml) 中的配置。以下是 `gsm8k.yaml` 中一些重要配置项: ```yaml project: diff --git a/docs/sphinx_doc/source_zh/tutorial/example_search_email.md b/docs/sphinx_doc/source_zh/tutorial/example_search_email.md index 79660657ab..66ac15ce03 100644 --- a/docs/sphinx_doc/source_zh/tutorial/example_search_email.md +++ b/docs/sphinx_doc/source_zh/tutorial/example_search_email.md @@ -31,7 +31,7 @@ python trinity/common/workflows/envs/email_searcher/prepare_data.py ### 第二步:运行工作流 -配置文件位于 [`email_search.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_email_search/email_search.yaml)。 +配置文件位于 [`email_search.yaml`](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/grpo_email_search/email_search.yaml)。 要运行此示例,可执行以下命令: ```bash diff --git a/docs/sphinx_doc/source_zh/tutorial/example_step_wise.md b/docs/sphinx_doc/source_zh/tutorial/example_step_wise.md index 8030f6e06e..c7e49fd07d 100644 --- a/docs/sphinx_doc/source_zh/tutorial/example_step_wise.md +++ b/docs/sphinx_doc/source_zh/tutorial/example_step_wise.md @@ -171,7 +171,7 @@ python examples/grpo_alfworld/get_alfworld_data.py ### 配置准备并运行实验 -默认配置文件是 [`alfworld.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_alfworld_general_multi_step/alfworld.yaml)。 +默认配置文件是 [`alfworld.yaml`](https://github.com/agentscope-ai/Trinity-RFT/tree/main/examples/grpo_alfworld_general_multi_step/alfworld.yaml)。 你可以根据需要修改配置并运行实验! ```bash diff --git a/docs/sphinx_doc/source_zh/tutorial/example_tinker_backend.md b/docs/sphinx_doc/source_zh/tutorial/example_tinker_backend.md index 8f0a71e386..de13db9ae4 100644 --- a/docs/sphinx_doc/source_zh/tutorial/example_tinker_backend.md +++ b/docs/sphinx_doc/source_zh/tutorial/example_tinker_backend.md @@ -62,7 +62,7 @@ trinity run --config tinker.yaml # 请替换为你的实际配置文件路径 3. 
**暂不支持多阶段训练**,后续会添加该功能。 -> 💡 完整的示例配置文件见 [`tinker.yaml`](https://github.com/modelscope/Trinity-RFT/blob/main/examples/tinker/tinker.yaml)。 +> 💡 完整的示例配置文件见 [`tinker.yaml`](https://github.com/agentscope-ai/Trinity-RFT/blob/main/examples/tinker/tinker.yaml)。 ## Llama-3.2-3B 模型实验结果 diff --git a/docs/sphinx_doc/source_zh/tutorial/faq.md b/docs/sphinx_doc/source_zh/tutorial/faq.md index a04e6b5c66..a0e5c91c7c 100644 --- a/docs/sphinx_doc/source_zh/tutorial/faq.md +++ b/docs/sphinx_doc/source_zh/tutorial/faq.md @@ -104,7 +104,7 @@ ray start --head - 对于 trainer,当 `trainer.use_dynamic_bsz=false` 时,调整 `trainer.max_token_len_per_gpu`;当 `trainer.use_dynamic_bsz=true` 时,调整 `trainer.ppo_max_token_len_per_gpu` 和 `trainer.ulysses_sequence_parallel_size`。设置 `trainer.trainer_config.actor_rollout_ref.actor.entropy_from_logits_with_chunking=true` 也可能有帮助。 - 对于 explorer,调整 `explorer.rollout_model.tensor_parallel_size`。 -此外,Trinity-RFT 提供了[GPU 相关配置指南](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/trinity_gpu_configs.html),可参考其中建议。 +此外,Trinity-RFT 提供了[GPU 相关配置指南](https://agentscope-ai.github.io/Trinity-RFT/en/main/tutorial/trinity_gpu_configs.html),可参考其中建议。 ## 第三部分:调试方法 @@ -209,4 +209,4 @@ model.load_state_dict(load_fsdp_state_dict_from_verl_checkpoint(ckp_path)) - **Explorer 与 Trainer 分离**:Trinity-RFT 用独立 Explorer 模块替代 veRL 的 rollout model,专门负责 agent 与环境交互,支持更灵活的 workflow 设计和 rollout-training 调度。 - **全生命周期数据通路**:Trinity-RFT 在 Explorer 和 Trainer 之间增加 Buffer 模块,提供完整的数据存储、处理和采样通路,支持经验回放、优先采样等高级数据处理策略。 -我们还提供了 Trinity-RFT 与 veRL 及其衍生系统(如 [rLLM](https://github.com/rllm-org/rllm))的基准对比,详见 [Benchmark](https://github.com/modelscope/Trinity-RFT/tree/main/benchmark)。 +我们还提供了 Trinity-RFT 与 veRL 及其衍生系统(如 [rLLM](https://github.com/rllm-org/rllm))的基准对比,详见 [Benchmark](https://github.com/agentscope-ai/Trinity-RFT/tree/main/benchmark)。 diff --git a/docs/sphinx_doc/source_zh/tutorial/trinity_configs.md b/docs/sphinx_doc/source_zh/tutorial/trinity_configs.md index 8f00929da5..49301d638a 100644 --- a/docs/sphinx_doc/source_zh/tutorial/trinity_configs.md +++ b/docs/sphinx_doc/source_zh/tutorial/trinity_configs.md @@ -53,7 +53,7 @@ stages: ... ``` -每个部分将在下文详细说明。关于此处未涵盖的具体参数的更多细节,请参考[源码](https://github.com/modelscope/Trinity-RFT/blob/main/trinity/common/config.py)。 +每个部分将在下文详细说明。关于此处未涵盖的具体参数的更多细节,请参考[源码](https://github.com/agentscope-ai/Trinity-RFT/blob/main/trinity/common/config.py)。 ```{tip} Trinity-RFT 使用[OmegaConf](https://omegaconf.readthedocs.io/en/latest/) 来加载 YAML 配置文件。 diff --git a/docs/sphinx_doc/source_zh/tutorial/trinity_installation.md b/docs/sphinx_doc/source_zh/tutorial/trinity_installation.md index 1b5e52bc71..6b62808b98 100644 --- a/docs/sphinx_doc/source_zh/tutorial/trinity_installation.md +++ b/docs/sphinx_doc/source_zh/tutorial/trinity_installation.md @@ -29,7 +29,7 @@ ### 1. 
克隆仓库
```bash
-git clone https://github.com/modelscope/Trinity-RFT
+git clone https://github.com/agentscope-ai/Trinity-RFT
cd Trinity-RFT
```
@@ -91,7 +91,7 @@ uv sync --extra vllm --extra dev --extra flash_attn
### 从 Github 拉取预构建镜像 (推荐初学者使用该方法)
```bash
-git clone https://github.com/modelscope/Trinity-RFT
+git clone https://github.com/agentscope-ai/Trinity-RFT
cd Trinity-RFT
docker pull ghcr.io/modelscope/trinity-rft:latest
@@ -114,7 +114,7 @@ docker run -it \
```bash
-git clone https://github.com/modelscope/Trinity-RFT
+git clone https://github.com/agentscope-ai/Trinity-RFT
cd Trinity-RFT
# 构建 Docker 镜像
@@ -158,4 +158,4 @@ uv pip install flash-attn==2.8.1
## 常见问题
-如遇安装问题,请参考 FAQ 或 [GitHub Issues](https://github.com/modelscope/Trinity-RFT/issues)。
+如遇安装问题,请参考 FAQ 或 [GitHub Issues](https://github.com/agentscope-ai/Trinity-RFT/issues)。
diff --git a/examples/agentscope_react/README.md b/examples/agentscope_react/README.md
index 8b3a75f651..37216c96be 100644
--- a/examples/agentscope_react/README.md
+++ b/examples/agentscope_react/README.md
@@ -2,4 +2,4 @@ This example demonstrates how to train the [AgentScope](https://github.com/agentscope-ai/agentscope) built-in ReAct Agent using Trinity-RFT. We use the GSM8K dataset as an example. Developers can refer to this example to adapt Trinity-RFT's training to their own agent projects.
-Full documentation is available at: https://modelscope.github.io/Trinity-RFT/en/main/tutorial/example_react.html
+Full documentation is available at: https://agentscope-ai.github.io/Trinity-RFT/en/main/tutorial/example_react.html
diff --git a/examples/bots/README.md b/examples/bots/README.md
index 7ba9072d14..6c55ccaf1c 100644
--- a/examples/bots/README.md
+++ b/examples/bots/README.md
@@ -22,7 +22,7 @@ For unselected tasks, predicted counts (_implicit evidence_) are produced by a p
##### Step 1: Environment Preparation
-Ensure Trinity-RFT is well installed ([Installation Guide](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/trinity_installation.html)). No extra dependence is required.
+Ensure Trinity-RFT is well installed ([Installation Guide](https://agentscope-ai.github.io/Trinity-RFT/en/main/tutorial/trinity_installation.html)). No extra dependencies are required.
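A quick way to confirm this step is a minimal import check; the sketch below only assumes that the `trinity` package and the Ray dependency from the installation guide above are importable.

```python
# Minimal installation sanity check (sketch); needs nothing beyond Trinity-RFT itself.
import ray                    # Trinity-RFT runs on top of a Ray cluster
import trinity.common.config  # core configuration module shipped with the package

print("ray version:", ray.__version__)
print("trinity config module:", trinity.common.config.__file__)
```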
##### Step 2: Model & Dataset Preparation
diff --git a/examples/bots/README_zh.md b/examples/bots/README_zh.md
index 3a40e9d2b3..716b4d3f06 100644
--- a/examples/bots/README_zh.md
+++ b/examples/bots/README_zh.md
@@ -21,7 +21,7 @@ BOTS 以任务选择、模型训练和后验概率更新的连续循环运行。
##### 第一步:环境准备
-确保Trinity-RFT安装好了([安装指南](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/trinity_installation.html))。不需要额外的依赖。
+确保Trinity-RFT安装好了([安装指南](https://agentscope-ai.github.io/Trinity-RFT/en/main/tutorial/trinity_installation.html))。不需要额外的依赖。
##### 第二步:模型和数据准备
diff --git a/examples/grpo_frozen_lake/README.md b/examples/grpo_frozen_lake/README.md
index ec79861848..4b3280901d 100644
--- a/examples/grpo_frozen_lake/README.md
+++ b/examples/grpo_frozen_lake/README.md
@@ -5,7 +5,7 @@ This example shows the usage of GRPO on the [Frozen Lake](https://gymnasium.fara
## Data and Environment Preparation
-After setting up the basic environment following the [installation guidance](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/trinity_installation.html), you need to install the additional dependencies by running the following command:
+After setting up the basic environment following the [installation guidance](https://agentscope-ai.github.io/Trinity-RFT/en/main/tutorial/trinity_installation.html), you need to install the additional dependencies by running the following command:
```bash
pip install gymnasium[toy_text]
diff --git a/examples/learn_to_ask/README.md b/examples/learn_to_ask/README.md
index 7b72b774aa..5025bc745c 100644
--- a/examples/learn_to_ask/README.md
+++ b/examples/learn_to_ask/README.md
@@ -124,7 +124,7 @@ You may configure the settings then run the pipeline by launching:
python examples/learn_to_ask/data_prepare/3_rollout_then_evaluate.py --eval_model_path path/to/trained/model --grader_model_path path/to/qwen2.5-32b-instruct --test_file_path examples/learn_to_ask/data/test.jsonl --rollout_file_path path/to/rollout.jsonl --eval_file_path path/to/output.jsonl
```
-Note: `eval_model_path` is the location of the model you want to evaluate. This model must first be converted into the HuggingFace format. For instructions on converting FSDP checkpoints, see [this guide](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/faq.html).
+Note: `eval_model_path` is the location of the model you want to evaluate. This model must first be converted into the HuggingFace format. For instructions on converting FSDP checkpoints, see [this guide](https://agentscope-ai.github.io/Trinity-RFT/en/main/tutorial/faq.html).
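As a rough sketch of that conversion step, the snippet below reuses the `load_fsdp_state_dict_from_verl_checkpoint` call quoted from the FAQ earlier in this diff; the import path, base model id, and checkpoint paths are placeholders (assumptions, not verified API), so copy the exact import and usage from the FAQ.

```python
# Sketch: convert a verl/FSDP checkpoint into HuggingFace format so it can be passed
# as --eval_model_path. The helper call comes from the FAQ; the import path below is
# hypothetical, and the model id and paths are placeholders to adapt.
from transformers import AutoModelForCausalLM

from trinity.utils import load_fsdp_state_dict_from_verl_checkpoint  # hypothetical path; see the FAQ

base_model = "Qwen/Qwen2.5-1.5B-Instruct"               # placeholder: base model used during training
ckp_path = "/PATH/TO/CHECKPOINT/global_step_xxx/actor"  # placeholder: FSDP checkpoint directory

model = AutoModelForCausalLM.from_pretrained(base_model)
model.load_state_dict(load_fsdp_state_dict_from_verl_checkpoint(ckp_path))  # call shown in the FAQ
model.save_pretrained("path/to/trained/model")          # now usable as --eval_model_path above
```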
## Citation diff --git a/pyproject.toml b/pyproject.toml index b4342cc4d3..29970181b2 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -81,7 +81,7 @@ megatron = [ # "transformer_engine[pytorch]==2.8.0", # Install mbridge from main branch (unreleased version) - "mbridge @ git+https://github.com/ISEEKYAN/mbridge.git@20e9ffbbe72ae7b1df83bfe1bc3c11f7382f2612", + # "mbridge @ git+https://github.com/ISEEKYAN/mbridge.git@20e9ffbbe72ae7b1df83bfe1bc3c11f7382f2612", ] tinker = [ "tinker; python_version >= '3.11'", @@ -140,5 +140,5 @@ known_third_party = ["wandb"] flash-attn = ["torch", "numpy"] [project.urls] -"Homepage" = "https://github.com/modelscope/Trinity-RFT" -"Documentation" = "https://modelscope.github.io/Trinity-RFT/" +"Homepage" = "https://github.com/agentscope-ai/Trinity-RFT" +"Documentation" = "https://agentscope-ai.github.io/Trinity-RFT/" diff --git a/scripts/context_length_test/README.md b/scripts/context_length_test/README.md index de4c847fc6..505c1a7ee1 100644 --- a/scripts/context_length_test/README.md +++ b/scripts/context_length_test/README.md @@ -6,7 +6,7 @@ This script automates the process of determining the **maximum context length** ## 🧰 Requirements -Ensure Trinity-RFT is well installed ([Installation Guide](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/trinity_installation.html)). No extra dependence is required. +Ensure Trinity-RFT is well installed ([Installation Guide](https://agentscope-ai.github.io/Trinity-RFT/en/main/tutorial/trinity_installation.html)). No extra dependence is required. --- diff --git a/scripts/docker/Dockerfile.megatron b/scripts/docker/Dockerfile.megatron index c5362258a2..f4e1ae29c3 100644 --- a/scripts/docker/Dockerfile.megatron +++ b/scripts/docker/Dockerfile.megatron @@ -32,6 +32,7 @@ RUN pip install --upgrade pip \ && pip install flash_attn==2.8.1 --no-build-isolation \ && pip install -e .[megatron] \ && pip install transformer_engine[pytorch]==2.8.0 --no-build-isolation --no-cache-dir \ + && pip install git+https://github.com/ISEEKYAN/mbridge.git@20e9ffbbe72ae7b1df83bfe1bc3c11f7382f2612 \ && NVCC_APPEND_FLAGS="--threads 4" APEX_PARALLEL_BUILD=8 pip install -v \ --disable-pip-version-check --no-cache-dir --no-build-isolation \ --config-settings "--build-option=--cpp_ext" \ diff --git a/scripts/docker/Dockerfile.uv b/scripts/docker/Dockerfile.uv index 3aafc8c1a0..3b47393147 100644 --- a/scripts/docker/Dockerfile.uv +++ b/scripts/docker/Dockerfile.uv @@ -42,6 +42,7 @@ RUN . /opt/venv/bin/activate && \ RUN . /opt/venv/bin/activate && \ uv pip install -e .[megatron] && \ uv pip install flash_attn==2.8.1 --no-build-isolation && \ + uv pip install git+https://github.com/ISEEKYAN/mbridge.git@20e9ffbbe72ae7b1df83bfe1bc3c11f7382f2612 && \ uv pip install transformer_engine[pytorch]==2.8.0 --no-build-isolation --no-cache-dir && \ NVCC_APPEND_FLAGS="--threads 4" APEX_PARALLEL_BUILD=8 \ uv pip install -v --no-build-isolation \ diff --git a/scripts/multi_exps_plot/README.md b/scripts/multi_exps_plot/README.md index 899f72d22a..741c820510 100644 --- a/scripts/multi_exps_plot/README.md +++ b/scripts/multi_exps_plot/README.md @@ -6,7 +6,7 @@ Due to the stochastic nature of RFT results, multiple experimental runs are nece ## Usage -***Before running this script***, ensure your experiment results are available. 
For example, after running the [grpo_gsm8k](https://github.com/modelscope/Trinity-RFT/blob/main/examples/grpo_gsm8k/gsm8k.yaml) script **three times**, the result directories will be located under a path pattern such as `/PATH/TO/CHECKPOINT/Trinity-RFT-gsm8k/qwen2.5-1.5B-gsm8k-{1, 2, 3}`. The directory structure for a single run is expected to be as follows: +***Before running this script***, ensure your experiment results are available. For example, after running the [grpo_gsm8k](https://github.com/agentscope-ai/Trinity-RFT/blob/main/examples/grpo_gsm8k/gsm8k.yaml) script **three times**, the result directories will be located under a path pattern such as `/PATH/TO/CHECKPOINT/Trinity-RFT-gsm8k/qwen2.5-1.5B-gsm8k-{1, 2, 3}`. The directory structure for a single run is expected to be as follows: └── qwen2.5-1.5B-gsm8k-1 ├── buffer