59 commits
22b2785
new file: docs_roll/docs/User Guides/Configuration/infer_correctio…
millioniron Dec 3, 2025
3d291b1
fixed_llm_proxy_mode_rollout_pipeline
WeepCat Dec 4, 2025
6ca3d10
fixed some typos
WeepCat Dec 4, 2025
77c1e5d
Reworked the overall layout and abstracted out a class to allow freer customization
millioniron Dec 7, 2025
a0ea354
(fix): update math rule reward worker.
Oct 28, 2025
055ef9b
(feat): set RAY_CGRAPH_get_timeout=600.
PanAndy Oct 29, 2025
29a7610
(fix): vllm 0.11.0 import
emiedon Oct 29, 2025
bcc5818
(fix): fix train infer ratio/diff mean & add train infer ratio/diff t…
ToasterSC Nov 5, 2025
a252c2e
(feat): support vllm beam_search.
PanAndy Nov 5, 2025
d8e5c94
(fix): ensure compatibility with transformers version check for causa…
chocoded Nov 5, 2025
77325f7
(feat): support pytorch280 docker.
PanAndy Dec 5, 2025
accefed
(fix): fix agentic val get_batch state in redundancy env.
PanAndy Nov 7, 2025
8629e85
(feat): Add support for Qwen-3-next on AMD GPUs.
Nov 18, 2025
41fe274
fix: fix tokenizer usage in llm judge reward worker.
guoshengCS Nov 28, 2025
38bfc2e
(feat): add vlm option.
PanAndy Dec 5, 2025
8e4bf7c
(feat): agentic-spec actor worker.
Oct 30, 2025
79af5c3
(feat): agentic_filter_task.
PanAndy Dec 2, 2025
7c261c8
(refactor): agentic pipeline modify.
Oct 31, 2025
28c3edd
(fix): update error logging for image loading failure.
chocoded Oct 31, 2025
7038040
(fix): fix max_len_mask key.
Oct 31, 2025
aa6ad59
(feat): add infer_log_probs in agentic.
PanAndy Dec 2, 2025
48c2253
(feat): update mcore_adapter.
PanAndy Dec 5, 2025
1747871
(fix): fix bugs in data fetching for face embeddings.
Nov 5, 2025
d353266
(feat): add agentic chunk.
PanAndy Dec 2, 2025
2a73b0a
(feat): add sglang 0.4.6.post5.
PanAndy Dec 5, 2025
e9ba131
(feat): support offload nccl to save gpu memory.
xuehuanran Nov 7, 2025
796603c
(feat): support pytorch280 docker.
PanAndy Dec 5, 2025
6f5b8f7
(fix): fix vllm 0110 model_config.
PanAndy Nov 10, 2025
29301e7
(refactor): refactor agentic norm.
Nov 11, 2025
5d92cc0
(feat): add agentic profile metrics.
PanAndy Dec 2, 2025
307924e
(feat): sglang 054 patch.
emiedon Nov 11, 2025
bc0cd9d
(feat): add enable_reference option.
PanAndy Nov 11, 2025
31feaf6
(fix): fix agentic reference.
PanAndy Nov 12, 2025
7c29858
(feat): add flash-linear-attention.
PanAndy Dec 5, 2025
c001d6c
(fix): vllm _generate_standard missing prompt_token_ids input args in…
HuangJoJo Nov 13, 2025
85a081c
(fix): sglang 054post2 tp worker init wrong.
emiedon Nov 13, 2025
346a406
(fix): vllm add missing argument is_lora in function update_parameter.
hydrozhao Nov 14, 2025
a98f4ce
(feat): update mcore_adapter.
PanAndy Dec 5, 2025
d5f07c9
(fix): fix get_cached_module_file.
PanAndy Dec 5, 2025
5266468
(fix): fix bugs with metrics recording in the DPO pipeline.
Schnabel-8 Nov 17, 2025
0e47311
(feat): add enable_old_logprobs, opt old log probs by cache.
PanAndy Nov 17, 2025
3de37e1
(fix): update image loading logic for byte data in rlvr_vlm_pipeline.py
chocoded Nov 18, 2025
5caa55c
(feat): mcore_adapter support qwen3vl.
liu-zichen Nov 18, 2025
b5cd1ea
(fix): add force_vit flags for image and video processing in Qwen3 VL…
chocoded Nov 18, 2025
e30bb72
(feat): add qwen3-vl example.
PanAndy Dec 5, 2025
2f9f2df
(feat): mock infer.
Nov 21, 2025
3e4633e
(feat): add qwen3-vl 32B example.
PanAndy Dec 5, 2025
3657919
(feat): add sequence packing for sft pipeline and distill pipeline, o…
Schnabel-8 Nov 24, 2025
4a68470
(feat): add alive check.
PanAndy Nov 24, 2025
21460df
(feat): sglang support dp-attention.
emiedon Nov 25, 2025
1c45b7a
(fix): set broadcast_non_tensor_batch for old_logprobs.
PanAndy Dec 3, 2025
24374f1
(fix): fix vllm get_metrics exception.
PanAndy Dec 4, 2025
61a544a
(fix): fix vllm 0110.
PanAndy Dec 4, 2025
e1695f2
(fix): fix AgenticAcotrWorker import.
PanAndy Dec 4, 2025
a595ec3
new file: docs_roll/docs/User Guides/Configuration/infer_correctio…
millioniron Dec 3, 2025
e98c4ce
Reworked the overall layout and abstracted out a class to allow freer customization
millioniron Dec 7, 2025
7613dbd
Merge branch 'main' of https://github.com/millioniron/ROLL
millioniron Dec 8, 2025
c3d1121
modified: roll/pipeline/agentic/env_manager/step_env_manager.py
millioniron Dec 8, 2025
515bf39
Removed the original upstream train-infer implementation
millioniron Dec 8, 2025
@@ -233,15 +233,15 @@ def formulate_rollouts(self, rollout_cache: RolloutCache):
lm_input.non_tensor_batch["episode_score"] = np.array([episode_score], dtype=object)

# Configure database field types
- colummns_config = [
+ columns_config = [
["task_idx", "bigint"],
["model_name", "string"],
["stop_reason", "string"],
["episode_score", "double"],
["mode", "string"],
["save_content", "string"],
]
- lm_input.meta_info["COLUMMNS_CONFIG"] = colummns_config
+ lm_input.meta_info["COLUMNS_CONFIG"] = columns_config

return lm_input
```
142 changes: 142 additions & 0 deletions docs_roll/docs/User Guides/Configuration/infer_correction.md
@@ -0,0 +1,142 @@
# Train-Infer Discrepancy Correction


## Introduction

The train-infer discrepancy arises in RL training because the trainer and the generator run on different backends (vLLM vs. SGLang vs. FSDP vs. Megatron) and at different precisions (FP8 vs. FP16 vs. BF16 vs. FP32). This creates an off-policy-style gap between the two policies, which can destabilize training and cause policy collapse.


## How It Works


Correcting the train-infer discrepancy falls broadly into two approaches: (1) apply a policy correction between the trainer and the generator; (2) use infer_log_probs directly in place of the trainer's old_log_probs when computing the PPO ratio. The second approach is straightforward, so we focus on the first; a sketch of approach (2) follows.
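As a minimal sketch of approach (2), with illustrative names rather than ROLL's actual API, the PPO ratio simply treats the generator as the behaviour policy:

```python
import torch

def ppo_ratio(log_probs: torch.Tensor, behaviour_log_probs: torch.Tensor) -> torch.Tensor:
    """PPO importance ratio pi_theta / pi_behaviour, computed in log space.

    For approach (2), pass the generator's infer_log_probs as
    behaviour_log_probs instead of the trainer's old_log_probs.
    """
    return (log_probs - behaviour_log_probs).exp()
```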

### Policy Correction Between Trainer and Generator

#### IS Weight Correction
Importance-sampling (IS) correction between the trainer's old_log_probs and the generator's infer_log_probs bridges the train-infer gap. As in off-policy algorithms, the IS weights can be computed at the token level or at the sequence level, but only one of the two may be chosen; a sketch of both follows.
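A minimal sketch of both granularities, assuming hypothetical `[batch, seq_len]` tensors and illustrative function and argument names rather than ROLL's actual implementation:

```python
from typing import Optional

import torch

def is_weights(old_log_probs: torch.Tensor,
               infer_log_probs: torch.Tensor,
               mode: str = "token",
               threshold_min: float = 0.0,
               threshold_max: float = 2.0,
               response_mask: Optional[torch.Tensor] = None) -> torch.Tensor:
    """Token- or sequence-level IS weights between trainer and generator."""
    log_ratio = old_log_probs - infer_log_probs  # log(pi_train / pi_infer)
    if response_mask is not None:
        log_ratio = log_ratio * response_mask  # zero out non-response positions
    if mode == "token":
        # Each token gets its own importance weight
        weights = log_ratio.exp()
    elif mode == "sequence":
        # Sum the log-ratio over the response and broadcast to every token
        weights = log_ratio.sum(dim=-1, keepdim=True).exp().expand_as(log_ratio)
    else:
        # "none": no IS weighting, weight is identically 1
        return torch.ones_like(log_ratio)
    # Clip to [threshold_min, threshold_max] to control variance
    return weights.clamp(threshold_min, threshold_max).detach()
```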

#### Masking Out Non-Conforming Samples
Unlike IS weight correction, this approach directly masks samples whose ratio crosses the thresholds, filtering out the non-conforming ones. The variants are: (1) token level: filter out individual out-of-range tokens; (2) catastrophic tokens: filter out any sequence that contains a token with a catastrophically severe deviation; (3) sequence level: compute an IS ratio over the whole sequence and filter out non-conforming sequences; (4) sequence level with a geometric mean for the IS weight, which makes the metric more sensitive. A sketch of these filters is shown below.
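A minimal sketch of the four filters, again with hypothetical `[batch, seq_len]` tensors and illustrative names rather than ROLL's actual implementation:

```python
import torch

def rejection_mask(old_log_probs: torch.Tensor,
                   infer_log_probs: torch.Tensor,
                   response_mask: torch.Tensor,
                   token_min: float = 0.0, token_max: float = 2.0,
                   catastrophic_min: float = 1e-4,
                   seq_mode: str = "sequence",
                   seq_min: float = 0.1, seq_max: float = 10.0) -> torch.Tensor:
    """Combined token-level, catastrophic, and sequence-level rejection mask."""
    log_ratio = (old_log_probs - infer_log_probs) * response_mask
    ratio = log_ratio.exp()  # masked positions get ratio = exp(0) = 1
    mask = response_mask.clone().float()

    # (1) Token-level reject: drop tokens whose ratio leaves [token_min, token_max]
    mask = mask * ((ratio >= token_min) & (ratio <= token_max)).float()

    # (2) Catastrophic reject: drop the whole sequence if any valid token's
    # ratio collapses below catastrophic_min
    catastrophic = ((ratio < catastrophic_min) & response_mask.bool()).any(dim=-1)
    mask = mask * (~catastrophic).float().unsqueeze(-1)

    # (3)/(4) Sequence-level reject: aggregate the log-ratio by sum ("sequence")
    # or by mean ("geometric", the geometric-mean variant), then threshold
    if seq_mode == "geometric":
        seq_ratio = (log_ratio.sum(-1) / response_mask.sum(-1).clamp(min=1)).exp()
    else:
        seq_ratio = log_ratio.sum(-1).exp()
    keep = (seq_ratio >= seq_min) & (seq_ratio <= seq_max)
    return mask * keep.float().unsqueeze(-1)
```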


## Key Parameter Configuration

Whether the generator returns infer_log_probs is controlled in GeneratingArguments:

```yaml
actor_infer:
  model_args:
    disable_gradient_checkpointing: true
    dtype: fp16
  generating_args:
    max_new_tokens: ${response_length}
    top_p: 0.99
    top_k: 100
    num_beams: 1
    temperature: 0.99
    num_return_sequences: ${num_return_sequences_in_group}
    logprobs: 1
```
When `logprobs` is greater than 0, infer_log_probs is returned.


-------

These parameters are configured in PPOConfig; the key settings are as follows:

```yaml
infer_correction: true

infer_is_mode: token # options: token, sequence
infer_is_threshold_min: 0.0
infer_is_threshold_max: 2.0 # 1.5~5.0

enable_token_reject: true
infer_token_mask_threshold_min: 0.0
infer_token_mask_threshold_max: 2.0 # 2~10

enable_catastrophic_reject: true
infer_catastrophic_threshold: 1e-4

enable_seq_reject: sequence # options: null, sequence, geometric
infer_seq_mask_threshold_min: 0.1
infer_seq_mask_threshold_max: 10
```


### infer_correction
- **Meaning**: Controls whether the train-infer discrepancy correction mechanism is enabled. When enabled, the system uses `infer_log_probs` to correct the policy gradient.
- **Default**: `false`

### infer_is_mode
- **Meaning**: Specifies the granularity at which importance sampling (IS) weights are computed.
- **Options**:
  - `"token"`: an IS weight is computed independently for each token
  - `"sequence"`: the log-ratio is summed over the whole response and broadcast to all tokens
  - `"none"` (or unset): no IS weighting; the weight is identically 1
- **Default**: `"token"` if unset
- **Note**: The modes are mutually exclusive; only one may be selected.

### infer_is_threshold_min
- **Meaning**: Lower clipping threshold for IS weights, used to clip overly small weights and control variance.
- **Default**: `0.0`
- **Recommendation**: Usually keep it at `0.0` to preserve an unbiased lower bound

### infer_is_threshold_max
- **Meaning**: Upper clipping threshold for IS weights, preventing extremely large weights from dominating the gradient.
- **Default**: `2.0`
- **Recommendation**: `1.5 ~ 5.0` at the `"token"` level; `2.0 ~ 10.0` at the `"sequence"` level

### enable_token_reject
- **Meaning**: Whether to enable the token-level rejection mechanism.
- **Default**: `false`
- **Effect**: Together with `infer_token_mask_threshold_min/max`, masks out tokens whose IS ratio falls outside the legal range.

### infer_token_mask_threshold_min
- **Meaning**: Lower bound of the token-level IS ratio, i.e. `exp(old_log_probs - infer_log_probs)`.
- **Default**: `0.0`
- **Typical value**: `0.0`; it can usually be set to `1/max` (the reciprocal of the upper threshold)

### infer_token_mask_threshold_max
- **Meaning**: Upper bound of the token-level IS ratio for masking.
- **Default**: `2.0`
- **Typical range**: `1.5 ~ 5.0`

### enable_catastrophic_reject
- **Meaning**: Whether to enable "catastrophic deviation" detection, which rejects the entire sequence.
- **Default**: `false`
- **Trigger condition**: If any token in the sequence satisfies `ratio < infer_catastrophic_threshold`, the whole sequence is masked.

### infer_catastrophic_threshold
- **Meaning**: Lower ratio threshold for catastrophic rejection.
- **Default**: `1e-4`
- **Explanation**: When `infer_log_probs` is much larger than `old_log_probs` (i.e. the generator is overly "confident"), `ratio = exp(old - infer)` becomes extremely small; for example, a log-prob gap of `old - infer = -10` gives a ratio of about `4.5e-5`, below the default threshold.

### enable_seq_reject
- **Meaning**: Whether to enable the sequence-level rejection mechanism, and which aggregation to use.
- **Options**:
  - `null` / `false`: disabled
  - `"sequence"`: compute the sequence IS ratio from the **sum** of log-ratios
  - `"geometric"`: compute it from the **mean** of log-ratios (equivalent to the geometric mean of per-token ratios)
- **Default**: `null`

### infer_seq_mask_threshold_min
- **Meaning**: Lower bound of the sequence-level IS ratio for masking.
- **Default**: `0.1`
- **Typical value**: Usually `1/max`; when using `"geometric"`, it is best to force it to `1/max`


### infer_seq_mask_threshold_max
- **Meaning**: Upper bound of the sequence-level IS ratio for masking.
- **Default**: `10.0`
- **Typical range**: With `"sequence"`, `2.0 ~ 10.0` is recommended, and the bound can be relaxed as sequence length grows. With `"geometric"`, values around `1.0001 ~ 1.001` are recommended.
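The geometric thresholds sit so close to 1 because geometric aggregation divides the summed log-ratio by the sequence length. A quick check with illustrative numbers:

```python
import math

# Suppose a 1000-token response accumulates a total log-ratio of 1.0
seq_len, total_log_ratio = 1000, 1.0
sum_ratio = math.exp(total_log_ratio)            # ~2.72: "sequence" aggregation
geo_ratio = math.exp(total_log_ratio / seq_len)  # ~1.001: "geometric" aggregation
print(f"sequence={sum_ratio:.3f}, geometric={geo_ratio:.4f}")
```

A geometric ratio of about 1.001 over a 1000-token response therefore corresponds to a summed sequence-level ratio of roughly e ≈ 2.72.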



## Usage Recommendations

1. In practice, `old_log_probs` is usually much smaller than `infer_log_probs`, so the lower thresholds matter most. Sequence-level IS or masking is generally not recommended.

14 changes: 14 additions & 0 deletions docs_roll/docs/User Guides/Configuration/vllm.md
@@ -74,6 +74,20 @@ In the configuration example, we can see:

This design allows different components to choose the most suitable inference engine according to their needs.

### beam_search Configuration
RLVRPipeline supports vLLM's beam_search generation method; configure it as follows:
```yaml
generate_opt_level: 0 # Falls back to the batch_generate generation mode; generate_opt_level=1 is the prompt-level parallel mode
num_return_sequences_in_group: 8
actor_infer:
  generating_args:
    num_beams: ${num_return_sequences_in_group}
    num_return_sequences: ${num_return_sequences_in_group}
```
Note:
- generating_args.num_beams and generating_args.num_return_sequences must be set to the same value.
- The generating_args used in validate is configured the same way.

## Performance Optimization Recommendations

1. **Memory Management**:
@@ -234,15 +234,15 @@ def formulate_rollouts(self, rollout_cache: RolloutCache):
lm_input.non_tensor_batch["episode_score"] = np.array([episode_score], dtype=object)

# Configure database field types
- colummns_config = [
+ columns_config = [
["task_idx", "bigint"],
["model_name", "string"],
["stop_reason", "string"],
["episode_score", "double"],
["mode", "string"],
["save_content", "string"],
]
- lm_input.meta_info["COLUMMNS_CONFIG"] = colummns_config
+ lm_input.meta_info["COLUMNS_CONFIG"] = columns_config

return lm_input
```
@@ -74,6 +74,21 @@ actor_infer:

This design allows different components to choose the most suitable inference engine for their needs.

### beam_search Configuration
RLVRPipeline supports vLLM's beam_search generation method; configure it as follows:
```yaml
generate_opt_level: 0 # Falls back to the batch_generate generation mode; generate_opt_level=1 is the prompt-level parallel mode
num_return_sequences_in_group: 8
actor_infer:
  generating_args:
    num_beams: ${num_return_sequences_in_group}
    num_return_sequences: ${num_return_sequences_in_group}
```
Note:
- generating_args.num_beams and generating_args.num_return_sequences must be set to the same value.
- The generating_args used in validate is configured the same way.


## Performance Optimization Recommendations

1. **Memory Management**:
5 changes: 2 additions & 3 deletions examples/qwen2.5-0.5B-agentic/agent_val_frozen_lake_amd.yaml
@@ -107,7 +107,7 @@ actor_infer:
strategy_args:
strategy_name: vllm
strategy_config:
- gpu_memory_utilization: 0.4
+ gpu_memory_utilization: 0.6
block_size: 16
load_format: auto
device_mapping: list(range(0,8))
@@ -131,7 +131,6 @@ reward_normalization:
method: mean_std # asym_clip / identity / mean_std

train_env_manager:
- format_penalty: -0.15 # sokoban env penalty_for_step=-0.1
max_env_num_per_worker: 16
num_env_groups: 128
# under the same group, the env config and env seed are ensured to be equal
@@ -163,8 +162,8 @@ custom_envs:
${custom_env.FrozenLakeThink}
FrozenLakeLocallyDefineExamples: # Can import from unified envs config or define dict locally
env_type: frozen_lake
max_steps: ${max_actions_per_traj}
max_tokens_per_step: ${max_tokens_per_step}
user_prompt_format: ${user_prompt_think_format}
env_manager_cls: ${env_manager_cls}
use_thread_lock: true
env_config: