
Commit d2be0b7

Implement On-Policy Distillation (#444)
1 parent e412fbe commit d2be0b7

40 files changed, +551 -80 lines

README.md

Lines changed: 2 additions & 1 deletion

@@ -130,6 +130,7 @@ We list some algorithms supported by Trinity-RFT in the following table. For mor
 | AsymRE [[Paper](https://arxiv.org/pdf/2506.20520)] | [[GSM8K Example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/asymre_gsm8k)] | [[Code](https://github.com/modelscope/Trinity-RFT/tree/main/trinity/algorithm/advantage_fn/asymre_advantage.py)] | `algorithm_type: asymre` |
 | CISPO [[Paper](https://arxiv.org/pdf/2506.13585)] | - | [[Code](https://github.com/modelscope/Trinity-RFT/tree/main/trinity/algorithm/policy_loss_fn/cispo_policy_loss.py)] | `algorithm_type: cispo` |
 | SAPO [[Paper](https://arxiv.org/pdf/2511.20347)] | - | [[Code](https://github.com/modelscope/Trinity-RFT/tree/main/trinity/algorithm/policy_loss_fn/sapo_policy_loss.py)] | `algorithm_type: sapo` |
+| On-Policy Distillation [[Blog](https://thinkingmachines.ai/blog/on-policy-distillation/)] [[Paper](https://arxiv.org/pdf/2306.13649)] | [[GSM8K Example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/on_policy_distill)] | [[Code](https://github.com/modelscope/Trinity-RFT/tree/main/trinity/common/workflows/on_policy_distill_workflow.py)] | `algorithm_type: on_policy_distill` |

@@ -142,7 +143,7 @@ We list some algorithms supported by Trinity-RFT in the following table. For mor
 - [Step 2: prepare dataset and model](#step-2-prepare-dataset-and-model)
 - [Step 3: configurations](#step-3-configurations)
 - [Step 4: run the RFT process](#step-4-run-the-rft-process)
-- [Contribution guide](#contribution-guide)
+- [Contribution Guide](#contribution-guide)
 - [Acknowledgements](#acknowledgements)
 - [Citation](#citation)
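
For readers new to the algorithm: per the linked blog post and paper, on-policy distillation samples responses from the student model itself and trains the student to minimize a per-token reverse KL divergence against a fixed teacher. Below is a minimal, illustrative PyTorch sketch of that loss; the function name, tensor shapes, and masking convention are assumptions for exposition, not the repository's actual `on_policy_distill` implementation (see the linked workflow code for that).

```python
import torch
import torch.nn.functional as F


def reverse_kl_distill_loss(
    student_logits: torch.Tensor,  # [batch, seq, vocab], student scores on its own samples
    teacher_logits: torch.Tensor,  # [batch, seq, vocab], teacher scores on the same tokens
    response_mask: torch.Tensor,   # [batch, seq], 1.0 on response tokens, 0.0 on prompt/padding
) -> torch.Tensor:
    student_logp = F.log_softmax(student_logits, dim=-1)
    teacher_logp = F.log_softmax(teacher_logits, dim=-1)
    # KL(student || teacher) at each position, summed over the vocabulary
    kl = (student_logp.exp() * (student_logp - teacher_logp)).sum(dim=-1)
    # average over response tokens only, so prompt tokens carry no gradient
    return (kl * response_mask).sum() / response_mask.sum().clamp(min=1.0)
```

Because the expectation is taken under the student's own samples, the teacher corrects the student exactly where it actually goes wrong, which is what distinguishes this from off-policy (teacher-sampled) distillation.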

README_zh.md

Lines changed: 1 addition & 0 deletions

@@ -129,6 +129,7 @@ Trinity-RFT provides corresponding features for users with different backgrounds and goals:
 | AsymRE [[Paper](https://arxiv.org/pdf/2506.20520)] | [[GSM8K Example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/asymre_gsm8k)] | [[Code](https://github.com/modelscope/Trinity-RFT/tree/main/trinity/algorithm/advantage_fn/asymre_advantage.py)] | `algorithm_type: asymre` |
 | CISPO [[Paper](https://arxiv.org/pdf/2506.13585)] | - | [[Code](https://github.com/modelscope/Trinity-RFT/tree/main/trinity/algorithm/policy_loss_fn/cispo_policy_loss.py)] | `algorithm_type: cispo` |
 | SAPO [[Paper](https://arxiv.org/pdf/2511.20347)] | - | [[Code](https://github.com/modelscope/Trinity-RFT/tree/main/trinity/algorithm/policy_loss_fn/sapo_policy_loss.py)] | `algorithm_type: sapo` |
+| On-Policy Distillation [[Blog](https://thinkingmachines.ai/blog/on-policy-distillation/)] [[Paper](https://arxiv.org/pdf/2306.13649)] | [[GSM8K Example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/on_policy_distill)] | [[Code](https://github.com/modelscope/Trinity-RFT/tree/main/trinity/common/workflows/on_policy_distill_workflow.py)] | `algorithm_type: on_policy_distill` |

benchmark/reports/gsm8k.md

Lines changed: 1 addition & 1 deletion

@@ -188,7 +188,7 @@ class VerlGSM8kWorkflow(Workflow):
         *,
         task: Task,
         model: ModelWrapper,
-        auxiliary_models: Optional[List[openai.OpenAI]] = None,
+        auxiliary_models: Optional[List[ModelWrapper]] = None,
     ):
         self.reset(task)
         super().__init__(

docs/sphinx_doc/assets/opd_acc.png

168 KB

docs/sphinx_doc/assets/opd_kl.png

169 KB

docs/sphinx_doc/source/main.md

Lines changed: 2 additions & 0 deletions

@@ -86,6 +86,8 @@ We list some algorithms supported by Trinity-RFT in the following table. For mor
 | AsymRE [[Paper](https://arxiv.org/pdf/2506.20520)] | [[GSM8K Example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/asymre_gsm8k)] | [[Code](https://github.com/modelscope/Trinity-RFT/tree/main/trinity/algorithm/advantage_fn/asymre_advantage.py)] | `algorithm_type: asymre` |
 | CISPO [[Paper](https://arxiv.org/pdf/2506.13585)] | - | [[Code](https://github.com/modelscope/Trinity-RFT/tree/main/trinity/algorithm/policy_loss_fn/cispo_policy_loss.py)] | `algorithm_type: cispo` |
 | SAPO [[Paper](https://arxiv.org/pdf/2511.20347)] | - | [[Code](https://github.com/modelscope/Trinity-RFT/tree/main/trinity/algorithm/policy_loss_fn/sapo_policy_loss.py)] | `algorithm_type: sapo` |
+| On-Policy Distillation [[Blog](https://thinkingmachines.ai/blog/on-policy-distillation/)] [[Paper](https://arxiv.org/pdf/2306.13649)] | [[GSM8K Example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/on_policy_distill)] | [[Code](https://github.com/modelscope/Trinity-RFT/tree/main/trinity/common/workflows/on_policy_distill_workflow.py)] | `algorithm_type: on_policy_distill` |
+

docs/sphinx_doc/source/tutorial/develop_workflow.md

Lines changed: 6 additions & 5 deletions

@@ -93,11 +93,12 @@ class Workflow(ABC):
         *,
         task: Task,
         model: ModelWrapper,
-        auxiliary_models: Optional[List[openai.OpenAI]] = None,
+        auxiliary_models: Optional[List[ModelWrapper]] = None,
     ):
         self.task = task
         self.model = model
-        self.auxiliary_models = auxiliary_models
+        self.auxiliary_model_wrappers = auxiliary_models
+        self.auxiliary_models = ...  # OpenAI clients auto-derived from ModelWrapper

     @abstractmethod
     def run(self) -> List[Experience]:

@@ -110,7 +111,7 @@ During initialization, `Workflow` receives the following parameters:

 - `task` ({class}`trinity.common.workflows.Task`): A single data item from the task dataset.
 - `model` ({class}`trinity.common.models.model.ModelWrapper`): The model being trained, which provides an interface similar to OpenAI, capable of receiving a list of conversation messages and returning content generated by the LLM (including reply text `response_text`, full-sequence token ids `tokens`, prompt token length `prompt_length`, and a list of output token logprobs `logprobs`).
-- `auxiliary_models` (`List[openai.OpenAI]`): A list of auxiliary models not involved in training. All are provided via OpenAI-compatible APIs.
+- `auxiliary_models` (`List[ModelWrapper]`): A list of auxiliary model wrappers. You can access OpenAI clients via `self.auxiliary_models` (auto-derived based on the workflow's `is_async`).

 ```{tip}
 You can switch to using the OpenAI API by setting `explorer.rollout_model.enable_openai_api` to `true` in your config file and calling `model.get_openai_client()` to get an `openai.OpenAI` instance in your workflow.

@@ -440,10 +441,10 @@ class MyWorkflow(Workflow):
         *,
         task: Task,
         model: ModelWrapper,
-        auxiliary_models: Optional[List[openai.OpenAI]] = None,
+        auxiliary_models: Optional[List[ModelWrapper]] = None,
     ):
         super().__init__(task=task, model=model, auxiliary_models=auxiliary_models)
-        self.judge_model = self.auxiliary_models[0]  # Use the first auxiliary model as the judge
+        self.judge_model = self.auxiliary_models[0]  # OpenAI client auto-derived from ModelWrapper

     def run(self) -> List[Experience]:
         response = self.do_something()
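
The practical upshot of this signature change: workflows now receive auxiliary models as `ModelWrapper`s, and OpenAI clients are derived from them automatically, so the same mechanism serves both LLM-as-a-judge and distillation-teacher use cases. A hedged sketch of a custom workflow under the new signature (the class name is hypothetical, and the `Experience`/`Workflow` import paths are assumptions based on the docs above):

```python
from typing import List, Optional

from trinity.common.experience import Experience  # import path assumed
from trinity.common.models.model import ModelWrapper
from trinity.common.workflows import Task, Workflow  # Workflow path assumed alongside Task


class TeacherGuidedWorkflow(Workflow):  # hypothetical example, not from this commit
    def __init__(
        self,
        *,
        task: Task,
        model: ModelWrapper,
        auxiliary_models: Optional[List[ModelWrapper]] = None,
    ):
        super().__init__(task=task, model=model, auxiliary_models=auxiliary_models)
        # After this commit, self.auxiliary_models holds OpenAI clients
        # auto-derived from the wrappers (sync or async depending on the
        # workflow's is_async); the raw ModelWrapper objects remain
        # available as self.auxiliary_model_wrappers.
        self.teacher = self.auxiliary_models[0]

    def run(self) -> List[Experience]:
        # Roll out with the student (self.model), score or relabel with
        # self.teacher via the standard OpenAI chat API, and return
        # experiences for training.
        ...
```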

docs/sphinx_doc/source/tutorial/example_react.md

Lines changed: 1 addition & 1 deletion

@@ -82,7 +82,7 @@ class AgentScopeReActWorkflow(Workflow):
         *,
         task: Task,
         model: ModelWrapper,
-        auxiliary_models: Optional[List[openai.OpenAI]] = None,
+        auxiliary_models: Optional[List[ModelWrapper]] = None,
     ):
         # initialize the agent
         self.agent = AgentScopeReActAgent(

docs/sphinx_doc/source_zh/main.md

Lines changed: 1 addition & 0 deletions

@@ -82,6 +82,7 @@ Trinity-RFT provides corresponding features for users with different backgrounds and goals:
 | AsymRE [[Paper](https://arxiv.org/pdf/2506.20520)] | [[GSM8K Example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/asymre_gsm8k)] | [[Code](https://github.com/modelscope/Trinity-RFT/tree/main/trinity/algorithm/advantage_fn/asymre_advantage.py)] | `algorithm_type: asymre` |
 | CISPO [[Paper](https://arxiv.org/pdf/2506.13585)] | - | [[Code](https://github.com/modelscope/Trinity-RFT/tree/main/trinity/algorithm/policy_loss_fn/cispo_policy_loss.py)] | `algorithm_type: cispo` |
 | SAPO [[Paper](https://arxiv.org/pdf/2511.20347)] | - | [[Code](https://github.com/modelscope/Trinity-RFT/tree/main/trinity/algorithm/policy_loss_fn/sapo_policy_loss.py)] | `algorithm_type: sapo` |
+| On-Policy Distillation [[Blog](https://thinkingmachines.ai/blog/on-policy-distillation/)] [[Paper](https://arxiv.org/pdf/2306.13649)] | [[GSM8K Example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/on_policy_distill)] | [[Code](https://github.com/modelscope/Trinity-RFT/tree/main/trinity/common/workflows/on_policy_distill_workflow.py)] | `algorithm_type: on_policy_distill` |

docs/sphinx_doc/source_zh/tutorial/develop_workflow.md

Lines changed: 6 additions & 5 deletions

@@ -92,11 +92,12 @@ class Workflow(ABC):
         *,
         task: Task,
         model: ModelWrapper,
-        auxiliary_models: Optional[List[openai.OpenAI]] = None,  # mainly for LLM-as-a-judge scenarios; can be ignored here
+        auxiliary_models: Optional[List[ModelWrapper]] = None,  # mainly for LLM-as-a-judge scenarios; can also serve as the teacher for distillation
     ):
         self.task = task
         self.model = model
-        self.auxiliary_models = auxiliary_models
+        self.auxiliary_model_wrappers = auxiliary_models
+        self.auxiliary_models = ...  # OpenAI clients auto-derived from ModelWrapper

     @abstractmethod
     def run(self) -> List[Experience]:

@@ -109,7 +110,7 @@ class Workflow(ABC):

 - `task` ({class}`trinity.common.workflows.Task`): A single task from the task dataset.
 - `model` ({class}`trinity.common.models.model.ModelWrapper`): The model being trained, which provides an OpenAI-like interface that accepts a list of conversation messages and returns LLM-generated content (including the reply text `response_text`, full-sequence token ids `tokens`, prompt token length `prompt_length`, and the list of output token logprobs `logprobs`).
-- `auxiliary_models` (`List[openai.OpenAI]`): A list of auxiliary models not involved in training. All are provided via OpenAI-compatible APIs, mainly for LLM-as-a-judge scenarios.
+- `auxiliary_models` (`List[ModelWrapper]`): A list of `ModelWrapper`s for auxiliary models. OpenAI clients can be accessed via `self.auxiliary_models` (auto-derived based on the workflow's `is_async`).

 The following is an example that initializes a simple workflow using only `raw_task` and `rollout_args`. In more complex cases, you can use `format_args` for further customization.

@@ -437,10 +438,10 @@ class MyWorkflow(Workflow):
         *,
         task: Task,
         model: ModelWrapper,
-        auxiliary_models: Optional[List[openai.OpenAI]] = None,
+        auxiliary_models: Optional[List[ModelWrapper]] = None,
     ):
         super().__init__(task=task, model=model, auxiliary_models=auxiliary_models)
-        self.judge_model = self.auxiliary_models[0]  # use the first auxiliary model as the judge
+        self.judge_model = self.auxiliary_models[0]  # OpenAI client auto-derived from ModelWrapper

     def run(self) -> List[Experience]:
         response = self.do_something()
