
Commit ba4b34e

update docs and apply reviews
1 parent f029cbd commit ba4b34e

File tree: 5 files changed, +23 −17 lines

- README.md
- README_zh.md
- examples/tinker/README.md
- trinity/common/models/tinker_model.py
- trinity/trainer/tinker_trainer.py

README.md

Lines changed: 5 additions & 0 deletions
@@ -42,6 +42,7 @@ Trinity-RFT provides functionalities for users with different backgrounds and ob
 
 ## 🚀 News
 
+* [2025-12] Trinity-RFT now supports the [tinker](https://thinkingmachines.ai/tinker/) training backend, which enables model training on devices **without GPUs**.
 * [2025-12] Trinity-RFT powers the medical and health business of "Taobao Shangou", enabling the AI agent to understand vague symptoms, proactively ask follow-up questions, and provide precise recommendations ([News](https://tech.china.com.cn/sx/20251201/411376.shtml)).
 * [2025-11] [[Release Notes](https://github.com/modelscope/Trinity-RFT/releases/tag/v0.3.3)] Trinity-RFT v0.3.3 released: bug fixes.
 * [2025-11] Introducing [Learn-to-Ask](https://github.com/modelscope/Trinity-RFT/tree/main/examples/learn_to_ask): a framework for training proactive dialogue agents from offline expert data ([paper](https://arxiv.org/pdf/2510.25441)).
@@ -154,6 +155,10 @@ We list some algorithms supported by Trinity-RFT in the following table. For mor
 
 > [!NOTE]
 > This project is currently under active development. Comments and suggestions are welcome!
+>
+> **No GPU? No problem!** You can still try it out:
+> 1. Follow the installation steps (feel free to skip GPU-specific packages like `flash-attn`)
+> 2. Run the **[Tinker training example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/tinker)**, which is specifically designed to work on CPU-only systems.
 
 ### Step 1: installation

README_zh.md

Lines changed: 5 additions & 0 deletions
@@ -41,6 +41,7 @@ Trinity-RFT provides functionalities for users with different backgrounds and objectives:
 
 ## 🚀 News
 
+* [2025-12] Trinity-RFT now supports the [tinker](https://thinkingmachines.ai/tinker/) training backend, which enables model training on devices **without GPUs**.
 * [2025-12] Trinity-RFT powers the medical and health business of "Taobao Shangou", enabling the AI agent to understand vague symptoms, proactively ask follow-up questions, and provide precise recommendations ([News](https://tech.china.com.cn/sx/20251201/411376.shtml)).
 * [2025-11] [[Release Notes](https://github.com/modelscope/Trinity-RFT/releases/tag/v0.3.3)] Trinity-RFT v0.3.3 released: bug fixes.
 * [2025-11] Introducing [Learn-to-Ask](https://github.com/modelscope/Trinity-RFT/tree/main/examples/learn_to_ask): a framework for training proactive dialogue agents from offline expert data ([paper](https://arxiv.org/pdf/2510.25441)).
@@ -79,6 +80,10 @@ Trinity-RFT provides functionalities for users with different backgrounds and objectives:
 
 > [!NOTE]
 > For more tutorials, please refer to the [Trinity-RFT documentation](https://modelscope.github.io/Trinity-RFT/)
+>
+> No GPU? No problem! You can still try the following:
+> 1. Follow the installation steps (GPU-specific packages such as `flash-attn` can be skipped)
+> 2. Run the **[Tinker training example](https://github.com/modelscope/Trinity-RFT/tree/main/examples/tinker)**, which is specifically designed for CPU-only systems.

examples/tinker/README.md

Lines changed: 10 additions & 13 deletions
@@ -1,6 +1,7 @@
 # Trinity with Tinker Backend
 
-This example demonstrates how to use Trinity with the [Tinker](https://thinkingmachines.ai/tinker/) backend, which enables model training on devices without GPUs.
+> [!NOTE]
+> This example demonstrates how to use Trinity with the [Tinker](https://thinkingmachines.ai/tinker/) backend, which enables model training on devices **without GPUs**.
 
 ## Setup Instructions
 
@@ -28,7 +29,7 @@ model:
 
 ### 3. Configuration Parameters Explained
 
-- **`tinker`**: Tinker-specific configuration section. **Important**: When Tinker is enabled, any LoRA configuration settings will be ignored.
+- **`tinker`**: Tinker-specific configuration section. **Important**: When Tinker is enabled, any LoRA configuration settings (`model.lora_configs`) will be ignored.
 - **`enable`**: Whether to activate the Tinker backend. Default: `false`
 - **`base_model`**: Path to the base model for Tinker. If not specified (`null`), it defaults to the `model_path` defined elsewhere in your config
 - **`rank`**: The LoRA rank that controls the size of the adaptation matrices. Default: `32`
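
For orientation, here is a minimal sketch of how these parameters might sit in the YAML config. It is not taken from the commit: placing the `tinker` block under `model` is inferred from the `model:` hunk context above, and the model path is only a placeholder; the `tinker.yaml` example referenced below remains the authoritative version.

```yaml
# Hedged sketch only -- key placement inferred from the parameter list above;
# see examples/tinker/tinker.yaml for the real, complete configuration.
model:
  model_path: meta-llama/Llama-3.2-3B  # placeholder model; any supported path works
  tinker:
    enable: true      # activate the Tinker backend (default: false)
    base_model: null  # null falls back to model_path above
    rank: 32          # LoRA rank of the adaptation matrices (default: 32)
```
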
@@ -50,10 +51,12 @@ trinity run --config tinker.yaml # Replace with your actual config file path
 
 1. **Entropy loss** is not consistent with the veRL backend.
 2. **Algorithms requiring `compute_advantage_in_trainer=true` are NOT supported**, including:
-   - `PPOAlgorithm`
-   - `ReinforcePlusPlusAlgorithm`
-   - `RLOOAlgorithm`
-   - `OnPolicyDistillAlgorithm`
+   - PPO (`algorithm.algorithm_type=ppo`)
+   - Reinforce++ (`algorithm.algorithm_type=reinforceplusplus`)
+   - RLOO (`algorithm.algorithm_type=rloo`)
+   - On-policy distillation (`algorithm.algorithm_type=on_policy_distill`)
+
+   Algorithms like `algorithm.algorithm_type=grpo` are supported.
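
As an illustration of the constraint above (not part of the commit), a config running on the Tinker backend would pick one of the supported algorithm types, e.g. GRPO, via the `algorithm.algorithm_type` key shown in the list:

```yaml
# Hedged sketch: grpo is named above as supported with the Tinker backend;
# ppo, reinforceplusplus, rloo, and on_policy_distill require
# compute_advantage_in_trainer=true and are therefore not supported.
algorithm:
  algorithm_type: grpo
```
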
 
 > 💡 A complete example configuration file is available at [`tinker.yaml`](tinker.yaml).
 
@@ -197,10 +200,6 @@ buffer:
       storage_type: queue
       replay_buffer:
         enable: false
-        priority_fn: linear_decay
-        reuse_cooldown_time: null
-        priority_fn_args:
-          decay: 2.0
 explorer:
   runner_per_model: 16
   rollout_model:
@@ -240,8 +239,6 @@ synchronizer:
 
 ### Observations
 
-Since **Llama-3.2-3B** is a base (non-instruct-tuned) model, it has limited ability to follow formatting instructions. Additionally, we trained for only **one epoch**. As a result, both backends achieved final rewards just slightly above **0.1**.
-
-However, the training curves clearly show an **upward trend in reward**, indicating successful learning. The results are visualized below:
+Since Llama-3.2-3B is a base (non-instruct-tuned) model, it has limited ability to follow formatting instructions. Additionally, we trained for only **one epoch**. As a result, both backends achieved final rewards just slightly above 0.1. Nonetheless, the training curves show a clear upward trend in reward, indicating successful learning. The results are visualized below:
 
 ![Training Rewards on GSM8K](../../docs/sphinx_doc/assets/tinker-gsm8k.png)

trinity/common/models/tinker_model.py

Lines changed: 2 additions & 3 deletions
@@ -195,10 +195,9 @@ def get_model_version(self) -> int:
 
     def get_api_server_url(self) -> Optional[str]:
         """Get the API server URL if available."""
-        # TODO
+        # TODO: tinker will support openai api later
         return None
 
     def get_model_path(self) -> Optional[str]:
         """Get the model path"""
-        # TODO
-        return None
+        return self.config.model_path  # type: ignore [return-value]

trinity/trainer/tinker_trainer.py

Lines changed: 1 addition & 1 deletion
@@ -125,7 +125,7 @@ async def prepare(self):
                 f"global_step_{self.latest_remote_sampler_step}",
                 "remote_sampler_path.txt",
             )
-            with open(sampler_file_path, "r"):
+            with open(sampler_file_path, "r") as f:
                 self.latest_remote_sampler_path = f.read().strip()
         else:
             self.latest_remote_sampler_step = 0
