
Commit 9a4927f

Add document and apply reviews
1 parent 10dcadf commit 9a4927f


10 files changed: +191 additions, -30 deletions


docs/sphinx_doc/source/tutorial/trinity_configs.md

Lines changed: 32 additions & 1 deletion
@@ -164,9 +164,21 @@ model:
   max_response_tokens: 16384
   min_response_tokens: 1
   enable_prompt_truncation: true
+  repetition_penalty: 1.0
+  lora_configs: null
+  rope_scaling: null
+  rope_theta: null
+  tinker:
+    enable: false
+    base_model: null
+    rank: 32
+    seed: null
+    train_mlp: true
+    train_attn: true
+    train_unembed: true
 ```
 
-- `model_path`: Path to the model being trained.
+- `model_path`: Path to the model being trained. If `tinker` is enabled, this is the path to the local tokenizer.
 - `critic_model_path`: Optional path to a separate critic model. If empty, defaults to `model_path`.
 - `custom_chat_template`: Optional custom chat template in string format. If not specified, the system will use the default chat template from the tokenizer.
 - `chat_template_path`: Optional path to the chat template file in jinja2 format; overrides `custom_chat_template` if set. If not specified, the system will use the default chat template from the tokenizer.
@@ -175,6 +187,25 @@ model:
 - `max_prompt_tokens`: Maximum number of tokens allowed in prompts. Only for `chat` and `generate` methods in `InferenceModel`.
 - `min_response_tokens`: Minimum number of tokens allowed in generated responses. Only for `chat` and `generate` methods in `InferenceModel`. Default is `1`. It must be less than `max_response_tokens`.
 - `enable_prompt_truncation`: Whether to truncate the prompt. Default is `true`. If set to `true`, the prompt will be truncated to `max_prompt_tokens` tokens; if set to `false`, the prompt will not be truncated and there is a risk that the prompt length plus response length exceeds `max_model_len`. This feature does not work in OpenAI API mode.
+- `repetition_penalty`: Repetition penalty factor. Default is `1.0`.
+- `lora_configs`: Optional LoRA configuration. If not specified, defaults to `null`. Currently, only one LoRA configuration is supported.
+  - `name`: Name of the LoRA adapter. Default is `None`.
+  - `path`: Path to the LoRA adapter. Default is `None`.
+  - `base_model_name`: Name of the base model for the LoRA adapter. If not specified, defaults to `None`.
+  - `lora_rank`: Rank of the LoRA adapter. Default is `32`.
+  - `lora_alpha`: Alpha value of the LoRA adapter. Default is `32`.
+  - `lora_dtype`: Data type of the LoRA adapter. Default is `auto`.
+  - `target_modules`: List of target modules for LoRA. Default is `all-linear`.
+- `rope_scaling`: Optional RoPE scaling configuration in JSON format. If not specified, defaults to `null`.
+- `rope_theta`: Optional RoPE theta value. If not specified, defaults to `null`.
+- `tinker`: Optional Tinker configuration. Note: the LoRA configuration will be ignored if Tinker is enabled.
+  - `enable`: Whether to enable Tinker. Default is `false`.
+  - `base_model`: Path to the base model for Tinker. If not specified, defaults to `model_path`.
+  - `rank`: LoRA rank controlling the size of the adaptation matrices. Default is `32`.
+  - `seed`: Random seed for Tinker. If not specified, defaults to `null`.
+  - `train_mlp`: Whether to train the MLP layers. Default is `true`.
+  - `train_attn`: Whether to train the attention layers. Default is `true`.
+  - `train_unembed`: Whether to train the unembedding layer. Default is `true`.
 
 ```{tip}
 If you are using the OpenAI API provided by Explorer, only `max_model_len` will take effect, and the values of `max_response_tokens`, `max_prompt_tokens`, and `min_response_tokens` will be ignored. When `max_tokens` is not independently specified, each API call will generate up to `max_model_len - prompt_length` tokens. Therefore, please ensure that the prompt length is less than `max_model_len` when using the API.
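
To see how the `lora_configs` fields documented above fit together, here is a minimal sketch of a filled-in entry (not part of this commit): the adapter name and path are hypothetical placeholders, the one-element list form is an assumption about the schema, and the remaining values simply restate the documented defaults.

```yaml
model:
  lora_configs:
    - name: example_lora              # hypothetical adapter name
      path: /path/to/example_lora     # hypothetical local adapter path
      base_model_name: null
      lora_rank: 32
      lora_alpha: 32
      lora_dtype: auto
      target_modules: all-linear
```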

docs/sphinx_doc/source_zh/tutorial/trinity_configs.md

Lines changed: 31 additions & 0 deletions
@@ -164,6 +164,18 @@ model:
   max_response_tokens: 16384
   min_response_tokens: 1
   enable_prompt_truncation: true
+  repetition_penalty: 1.0
+  lora_configs: null
+  rope_scaling: null
+  rope_theta: null
+  tinker:
+    enable: false
+    base_model: null
+    rank: 32
+    seed: null
+    train_mlp: true
+    train_attn: true
+    train_unembed: true
 ```
 
 - `model_path`: Path to the model being trained.
@@ -175,6 +187,25 @@ model:
 - `max_response_tokens`: Maximum number of tokens allowed in model-generated responses. Only effective for the `chat` and `generate` methods in `InferenceModel`.
 - `min_response_tokens`: Minimum number of tokens allowed in model-generated responses. Only effective for the `chat` and `generate` methods in `InferenceModel`.
 - `enable_prompt_truncation`: Whether to truncate the prompt. Default is `true`. If set to `true`, the prompt is truncated to `max_prompt_tokens` tokens; if set to `false`, the prompt is not truncated, with the risk that the combined prompt and response length exceeds `max_model_len`. Not effective in OpenAI API mode.
+- `repetition_penalty`: Repetition penalty factor. Default is `1.0`.
+- `lora_configs`: Optional LoRA configuration. If not specified, defaults to `null`. Currently, only one LoRA configuration is supported.
+  - `name`: Name of the LoRA adapter. Default is `None`.
+  - `path`: Path to the LoRA adapter. Default is `None`.
+  - `base_model_name`: Name of the base model the LoRA adapter is based on. If not specified, defaults to `None`.
+  - `lora_rank`: Rank of the LoRA adapter. Default is `32`.
+  - `lora_alpha`: Alpha value of the LoRA adapter. Default is `32`.
+  - `lora_dtype`: Data type of the LoRA adapter. Default is `auto`.
+  - `target_modules`: List of target modules for LoRA. Default is `all-linear`.
+- `rope_scaling`: Optional RoPE scaling configuration in JSON format. If not specified, defaults to `null`.
+- `rope_theta`: Optional RoPE theta value. If not specified, defaults to `null`.
+- `tinker`: Optional Tinker configuration. Note: if Tinker is enabled, the LoRA configuration is ignored.
+  - `enable`: Whether to enable Tinker. Default is `false`.
+  - `base_model`: Path to the base model used by Tinker. If not specified, defaults to `model_path`.
+  - `rank`: LoRA rank controlling the size of the adaptation matrices. Default is `32`.
+  - `seed`: Random seed used by Tinker. If not specified, defaults to `null`.
+  - `train_mlp`: Whether to train the MLP layers. Default is `true`.
+  - `train_attn`: Whether to train the attention layers. Default is `true`.
+  - `train_unembed`: Whether to train the unembedding layer. Default is `true`.
 
 ```{tip}
 If you are using the OpenAI API provided by Explorer, only `max_model_len` takes effect; the values of `max_response_tokens`, `max_prompt_tokens`, and `min_response_tokens` are ignored. When `max_tokens` is not independently specified, each API call generates at most `max_model_len - prompt_length` tokens, so please ensure that the prompt length is less than `max_model_len` when using the API.

examples/tinker/README.md

Lines changed: 46 additions & 0 deletions
@@ -0,0 +1,46 @@
+# Trinity with Tinker Backend
+
+This example demonstrates how to use Trinity with the [Tinker](https://thinkingmachines.ai/tinker/) backend, which enables model training on devices without GPUs.
+
+## Setup Instructions
+
+### 1. API Key Configuration
+Before starting Ray, you must set the `TRINITY_API_KEY` environment variable to your Tinker API key to enable proper access to Tinker's API:
+
+```bash
+export TRINITY_API_KEY=your_tinker_api_key
+```
+
+### 2. Configuration File
+Configure the Tinker backend in your YAML configuration file by setting the `model.tinker` parameters as shown below:
+
+```yaml
+model:
+  tinker:
+    enable: true
+    base_model: null
+    rank: 32
+    seed: null
+    train_mlp: true
+    train_attn: true
+    train_unembed: true
+```
+
+### 3. Configuration Parameters Explained
+
+- **`tinker`**: Optional Tinker-specific configuration section. **Important**: When Tinker is enabled, any LoRA configuration settings will be ignored.
+  - **`enable`**: Whether to activate the Tinker backend. Default: `false`
+  - **`base_model`**: Path to the base model for Tinker. If not specified (`null`), it defaults to the `model_path` defined elsewhere in your config
+  - **`rank`**: The LoRA rank that controls the size of the adaptation matrices. Default: `32`
+  - **`seed`**: Random seed for reproducible Tinker operations. If not specified (`null`), no specific seed is set
+  - **`train_mlp`**: Whether to train the MLP (feed-forward) layers. Default: `true`
+  - **`train_attn`**: Whether to train the attention layers. Default: `true`
+  - **`train_unembed`**: Whether to train the unembedding (output) layer. Default: `true`
+
+## Usage Notes
+
+Once configured, Trinity works with the Tinker backend just like it does with the standard veRL training backend, with two important limitations:
+1. **Entropy loss** is not consistent with the veRL backend
+2. Algorithms that require **`compute_advantage_in_trainer=true`** are **not supported**
+
+The complete configuration file can be found at [`tinker.yaml`](tinker.yaml).
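
As a usage sketch (not part of this commit), assuming the usual Trinity-RFT command-line entry point, the example might be launched roughly as follows; the exact CLI invocation is an assumption here, so check the Trinity-RFT documentation for the authoritative command.

```bash
# Assumed invocation (hypothetical; verify against the Trinity-RFT docs)
export TRINITY_API_KEY=your_tinker_api_key
trinity run --config examples/tinker/tinker.yaml
```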

examples/tinker/tinker.yaml

Lines changed: 68 additions & 0 deletions
@@ -0,0 +1,68 @@
+mode: both
+project: Trinity-RFT-gsm8k
+name: tinker-Qwen3-4B
+checkpoint_root_dir: ${oc.env:TRINITY_CHECKPOINT_ROOT_DIR,./checkpoints}
+algorithm:
+  algorithm_type: grpo
+  repeat_times: 8
+  sample_strategy: default
+  kl_loss_fn_args:
+    kl_coef: 0.0
+  optimizer:
+    lr: 1.0e-05
+    lr_warmup_steps_ratio: 0.0
+    warmup_style: constant
+data_processor: {}
+model:
+  model_path: Qwen/Qwen3-4B-Instruct-2507
+  max_prompt_tokens: 1024
+  max_response_tokens: 2048
+  tinker:
+    enable: true
+    base_model: Qwen/Qwen3-4B-Instruct-2507
+buffer:
+  batch_size: 96
+  total_epochs: 1
+  explorer_input:
+    taskset:
+      name: taskset
+      storage_type: file
+      path: openai/gsm8k
+      split: train
+      subset_name: main
+      format:
+        prompt_key: question
+        response_key: answer
+      rollout_args:
+        temperature: 1.0
+        logprobs: 0
+    eval_tasksets: []
+    default_workflow_type: math_workflow
+  trainer_input:
+    experience_buffer:
+      name: experience_buffer
+      storage_type: queue
+      replay_buffer:
+        enable: false
+explorer:
+  runner_per_model: 8
+  rollout_model:
+    engine_num: 4
+    seed: 42
+  auxiliary_models: []
+  eval_interval: 1000
+trainer:
+  trainer_type: verl
+  save_interval: 100
+  enable_preview: true
+  grad_clip: 1.0
+  max_token_len_per_gpu: 16384
+monitor:
+  monitor_type: tensorboard
+synchronizer:
+  sync_method: memory
+  sync_style: fixed
+  sync_interval: 2
+  sync_timeout: 1200
+log:
+  level: INFO

trinity/common/config.py

Lines changed: 1 addition & 2 deletions
@@ -1530,12 +1530,11 @@ def check_and_update(self) -> Config: # noqa: C901
                     f"Invalid trainer.save_hf_checkpoint: {self.trainer.save_hf_checkpoint}, "
                     "must be one of 'last', 'always', or 'never'."
                 )
+            self.trainer.trainer_config.synchronize_config(self)
         elif self.trainer.trainer_type == "tinker":
             self.trainer.trainer_config = None
         else:
             raise ValueError(f"Invalid trainer type: {self.trainer_type}")
-        if self.trainer.trainer_config:
-            self.trainer.trainer_config.synchronize_config(self)

         # check service
         if self.service.data_juicer is not None:

trinity/manager/synchronizer.py

Lines changed: 4 additions & 2 deletions
@@ -83,9 +83,12 @@ async def _find_latest_state_dict(self) -> None:
             await self._find_verl_latest_state_dict()
         elif self.config.trainer.trainer_type == "tinker":
             await self._find_tinker_latest_state_dict()
+        else:
+            self.logger.warning(
+                "Synchronizer does not support this trainer type. Please use `verl` or `tinker`."
+            )

     async def _find_verl_latest_state_dict(self) -> None:
-        assert self.config.trainer.trainer_type == "verl"
         default_local_dir = self.config.checkpoint_job_dir
         local_latest_state_dict_iteration = os.path.join(
             default_local_dir, "latest_state_dict_iteration.txt"
@@ -119,7 +122,6 @@ async def _find_verl_latest_state_dict(self) -> None:
             await asyncio.sleep(1)

     async def _find_tinker_latest_state_dict(self) -> None:
-        assert self.config.trainer.trainer_type == "tinker"
         default_local_dir = self.config.checkpoint_job_dir
         local_latest_state_dict_iteration = os.path.join(
             default_local_dir, "latest_state_dict_iteration.txt"

trinity/trainer/tinker/utils.py

Lines changed: 1 addition & 16 deletions
@@ -1,27 +1,12 @@
 from logging import Logger
-from typing import Any, List, Tuple, Union
+from typing import Any, List, Tuple

 import torch
 from tinker import types

 from trinity.common.experience import Experience, split_dpo_experience_to_single_turn


-def pad_to_length(
-    tensor: torch.tensor, length: int, pad_value: Union[int, float] = 0
-) -> torch.tensor:
-    pad_value = torch.tensor(pad_value, dtype=tensor.dtype)
-    assert len(tensor) <= length, f"Tensor length {len(tensor)} is longer than length {length}."
-    if len(tensor) == length:
-        return tensor
-    return torch.concat(
-        [
-            torch.full((length - len(tensor),), pad_value),
-            tensor,
-        ]
-    )
-
-
 def to_tinker_input(
     experiences: List[Experience], logger: Logger
 ) -> Tuple[List[types.Datum], List[types.ModelInput], List[dict]]:

trinity/trainer/tinker_trainer.py

Lines changed: 4 additions & 4 deletions
@@ -277,8 +277,8 @@ def save_checkpoint(self, block_until_saved: bool = False, save_as_hf: bool = Fa
             f"global_step_{self.train_step_num}",
         )
         os.makedirs(local_path, exist_ok=True)
-        remote_path_file = os.path.join(local_path, "remote_checkpoint_path.txt")
-        with open(remote_path_file, "w") as f:
+        remote_checkpoint_path = os.path.join(local_path, "remote_checkpoint_path.txt")
+        with open(remote_checkpoint_path, "w") as f:
             f.write(self.latest_remote_checkpoint_path)

         with open(self.local_latest_checkpointed_iteration, "w") as f:
@@ -311,8 +311,8 @@ def save_state_dict(self) -> None:
             f"global_step_{self.train_step_num}",
         )
         os.makedirs(local_path, exist_ok=True)
-        remote_path_file = os.path.join(local_path, "remote_sampler_path.txt")
-        with open(remote_path_file, "w") as f:
+        remote_sampler_path = os.path.join(local_path, "remote_sampler_path.txt")
+        with open(remote_sampler_path, "w") as f:
             f.write(self.latest_remote_sampler_path)

         with open(self.local_latest_state_dict_iteration, "w") as f:

trinity/trainer/verl/utils.py

Lines changed: 3 additions & 4 deletions
@@ -102,7 +102,7 @@ def to_data_proto(
     return DataProto.from_single_dict(batch_dict)


-def compute_data_metrics(batch: DataProto, use_critic: bool = False) -> dict:
+def compute_data_metrics(batch: DataProto) -> dict:
     """
     Computes various metrics from a batch of data for PPO training.
     Modified from verl.trainer.ppo.metric_utils.compute_data_metrics
@@ -113,16 +113,15 @@ def compute_data_metrics(batch: DataProto, use_critic: bool = False) -> dict:

     Args:
         batch: A DataProto object containing batch data with token-level scores, rewards, advantages, etc.
-        use_critic: Whether to include critic-specific metrics. Defaults to True.

     Returns:
         A dictionary of metrics including:
             - critic/score/mean, max, min: Statistics about sequence scores
             - critic/rewards/mean, max, min: Statistics about sequence rewards
             - critic/advantages/mean, max, min: Statistics about advantages
             - critic/returns/mean, max, min: Statistics about returns
-            - critic/values/mean, max, min: Statistics about critic values (if use_critic=True)
-            - critic/vf_explained_var: Explained variance of the value function (if use_critic=True)
+            - critic/values/mean, max, min: Statistics about critic values
+            - critic/vf_explained_var: Explained variance of the value function
             - response_length/mean, max, min, clip_ratio: Statistics about response lengths
             - prompt_length/mean, max, min, clip_ratio: Statistics about prompt lengths
     """

trinity/trainer/verl_trainer.py

Lines changed: 1 addition & 1 deletion
@@ -472,7 +472,7 @@ async def train_step(self, batch_exps: List[Experience]) -> Dict: # noqa C901
         metrics.update(actor_output_metrics)

         # collect metrics
-        metrics.update(compute_data_metrics(batch=batch, use_critic=self.use_critic))
+        metrics.update(compute_data_metrics(batch=batch))
         timing_metrics = compute_timing_metrics(batch=batch, timing_raw=timing_raw)
         metrics.update({k.replace("timing_s/", "time/"): v for k, v in timing_metrics.items()})
         n_gpus = self.resource_pool_manager.get_n_gpus()
