59 commits
22b2785
new file: docs_roll/docs/User Guides/Configuration/infer_correctio…
millioniron Dec 3, 2025
3d291b1
fixed_llm_proxy_mode_rollout_pipeline
WeepCat Dec 4, 2025
6ca3d10
fixed some typos
WeepCat Dec 4, 2025
77c1e5d
Reworked the overall layout and abstracted out a class to allow freer customization
millioniron Dec 7, 2025
a0ea354
(fix): update math rule reward worker.
Oct 28, 2025
055ef9b
(feat): set RAY_CGRAPH_get_timeout=600.
PanAndy Oct 29, 2025
29a7610
(fix): vllm 0.11.0 import
emiedon Oct 29, 2025
bcc5818
(fix): fix train infer ratio/diff mean & add train infer ratio/diff t…
ToasterSC Nov 5, 2025
a252c2e
(feat): support vllm beam_search.
PanAndy Nov 5, 2025
d8e5c94
(fix): ensure compatibility with transformers version check for causa…
chocoded Nov 5, 2025
77325f7
(feat): support pytorch280 docker.
PanAndy Dec 5, 2025
accefed
(fix): fix agentic val get_batch state in redundancy env.
PanAndy Nov 7, 2025
8629e85
(feat): Add support for Qwen-3-next on AMD GPUs.
Nov 18, 2025
41fe274
fix: fix tokenizer usage in llm judge reward worker.
guoshengCS Nov 28, 2025
38bfc2e
(feat): add vlm option.
PanAndy Dec 5, 2025
8e4bf7c
(feat): agentic-spec actor worker.
Oct 30, 2025
79af5c3
(feat): agentic_filter_task.
PanAndy Dec 2, 2025
7c261c8
(refactor): agentic pipeline modify.
Oct 31, 2025
28c3edd
(fix): update error logging for image loading failure.
chocoded Oct 31, 2025
7038040
(fix): fix max_len_mask key.
Oct 31, 2025
aa6ad59
(feat): add infer_log_probs in agentic.
PanAndy Dec 2, 2025
48c2253
(feat): update mcore_adapter.
PanAndy Dec 5, 2025
1747871
(fix): fix bugs in data fetching for face embeddings.
Nov 5, 2025
d353266
(feat): add agentic chunk.
PanAndy Dec 2, 2025
2a73b0a
(feat): add sglang 0.4.6.post5.
PanAndy Dec 5, 2025
e9ba131
(feat): support offload nccl to save gpu memory.
xuehuanran Nov 7, 2025
796603c
(feat): support pytorch280 docker.
PanAndy Dec 5, 2025
6f5b8f7
(fix): fix vllm 0110 model_config.
PanAndy Nov 10, 2025
29301e7
(refactor): refactor agentic norm.
Nov 11, 2025
5d92cc0
(feat): add agentic profile metrics.
PanAndy Dec 2, 2025
307924e
(feat): sglang 054 patch.
emiedon Nov 11, 2025
bc0cd9d
(feat): add enable_reference option.
PanAndy Nov 11, 2025
31feaf6
(fix): fix agentic reference.
PanAndy Nov 12, 2025
7c29858
(feat): add flash-linear-attention.
PanAndy Dec 5, 2025
c001d6c
(fix): vllm _generate_standard missing prompt_token_ids input args in…
HuangJoJo Nov 13, 2025
85a081c
(fix): sglang 054post2 tp worker init wrong.
emiedon Nov 13, 2025
346a406
(fix): vllm add missing argument is_lora in function update_parameter.
hydrozhao Nov 14, 2025
a98f4ce
(feat): update mcore_adapter.
PanAndy Dec 5, 2025
d5f07c9
(fix): fix get_cached_module_file.
PanAndy Dec 5, 2025
5266468
(fix): fix bugs with metrics recording in the DPO pipeline.
Schnabel-8 Nov 17, 2025
0e47311
(feat): add enable_old_logprobs, opt old log probs by cache.
PanAndy Nov 17, 2025
3de37e1
(fix): update image loading logic for byte data in rlvr_vlm_pipeline.py
chocoded Nov 18, 2025
5caa55c
(feat): mcore_adapter support qwen3vl.
liu-zichen Nov 18, 2025
b5cd1ea
(fix): add force_vit flags for image and video processing in Qwen3 VL…
chocoded Nov 18, 2025
e30bb72
(feat): add qwen3-vl example.
PanAndy Dec 5, 2025
2f9f2df
(feat): mock infer.
Nov 21, 2025
3e4633e
(feat): add qwen3-vl 32B example.
PanAndy Dec 5, 2025
3657919
(feat): add sequence packing for sft pipeline and distill pipeline, o…
Schnabel-8 Nov 24, 2025
4a68470
(feat): add alive check.
PanAndy Nov 24, 2025
21460df
(feat): sglang support dp-attention.
emiedon Nov 25, 2025
1c45b7a
(fix): set broadcast_non_tensor_batch for old_logprobs.
PanAndy Dec 3, 2025
24374f1
(fix): fix vllm get_metrics exception.
PanAndy Dec 4, 2025
61a544a
(fix): fix vllm 0110.
PanAndy Dec 4, 2025
e1695f2
(fix): fix AgenticAcotrWorker import.
PanAndy Dec 4, 2025
a595ec3
new file: docs_roll/docs/User Guides/Configuration/infer_correctio…
millioniron Dec 3, 2025
e98c4ce
Reworked the overall layout and abstracted out a class to allow freer customization
millioniron Dec 7, 2025
7613dbd
Merge branch 'main' of https://github.com/millioniron/ROLL
millioniron Dec 8, 2025
c3d1121
modified: roll/pipeline/agentic/env_manager/step_env_manager.py
millioniron Dec 8, 2025
515bf39
Removed the original upstream train-infer implementation
millioniron Dec 8, 2025
@@ -233,15 +233,15 @@ def formulate_rollouts(self, rollout_cache: RolloutCache):
lm_input.non_tensor_batch["episode_score"] = np.array([episode_score], dtype=object)

# Configure database field types
- colummns_config = [
+ columns_config = [
["task_idx", "bigint"],
["model_name", "string"],
["stop_reason", "string"],
["episode_score", "double"],
["mode", "string"],
["save_content", "string"],
]
- lm_input.meta_info["COLUMMNS_CONFIG"] = colummns_config
+ lm_input.meta_info["COLUMNS_CONFIG"] = columns_config

return lm_input
```
142 changes: 142 additions & 0 deletions docs_roll/docs/User Guides/Configuration/infer_correction.md
@@ -0,0 +1,142 @@
# Train-Infer Discrepancy Correction


## Introduction

The train-infer discrepancy arises in RL training because the trainer and the generator run on different backends (vLLM vs. SGLang vs. FSDP vs. Megatron) and at different precisions (FP8 vs. FP16 vs. BF16 vs. FP32). This creates an off-policy-style gap between the two policies, which can destabilize training and cause policy collapse.


## How It Works


Correcting the train-infer discrepancy falls broadly into two approaches: (1) apply a policy correction between the trainer and the generator; (2) use infer_log_probs directly in place of the trainer's old_log_probs when computing the PPO ratio. The second approach is straightforward, so we focus on the first; a sketch of approach (2) follows.
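As a minimal sketch of approach (2), with illustrative names rather than ROLL's actual API, the PPO ratio simply treats the generator as the behaviour policy:

```python
import torch

def ppo_ratio(log_probs: torch.Tensor, behaviour_log_probs: torch.Tensor) -> torch.Tensor:
    """PPO importance ratio pi_theta / pi_behaviour, computed in log space.

    For approach (2), pass the generator's infer_log_probs as
    behaviour_log_probs instead of the trainer's old_log_probs.
    """
    return (log_probs - behaviour_log_probs).exp()
```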

### Policy Correction Between Trainer and Generator

#### IS Weight Correction
Importance-sampling (IS) correction between the trainer's old_log_probs and the generator's infer_log_probs bridges the train-infer gap. As in off-policy algorithms, the IS weights can be computed at the token level or at the sequence level, but only one of the two may be chosen; a sketch of both follows.
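A minimal sketch of both granularities, assuming hypothetical `[batch, seq_len]` tensors and illustrative function and argument names rather than ROLL's actual implementation:

```python
from typing import Optional

import torch

def is_weights(old_log_probs: torch.Tensor,
               infer_log_probs: torch.Tensor,
               mode: str = "token",
               threshold_min: float = 0.0,
               threshold_max: float = 2.0,
               response_mask: Optional[torch.Tensor] = None) -> torch.Tensor:
    """Token- or sequence-level IS weights between trainer and generator."""
    log_ratio = old_log_probs - infer_log_probs  # log(pi_train / pi_infer)
    if response_mask is not None:
        log_ratio = log_ratio * response_mask  # zero out non-response positions
    if mode == "token":
        # Each token gets its own importance weight
        weights = log_ratio.exp()
    elif mode == "sequence":
        # Sum the log-ratio over the response and broadcast to every token
        weights = log_ratio.sum(dim=-1, keepdim=True).exp().expand_as(log_ratio)
    else:
        # "none": no IS weighting, weight is identically 1
        return torch.ones_like(log_ratio)
    # Clip to [threshold_min, threshold_max] to control variance
    return weights.clamp(threshold_min, threshold_max).detach()
```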

#### Masking Out Non-Conforming Samples
Unlike IS weight correction, this approach directly masks samples whose ratio crosses the thresholds, filtering out the non-conforming ones. The variants are: (1) token level: filter out individual out-of-range tokens; (2) catastrophic tokens: filter out any sequence that contains a token with a catastrophically severe deviation; (3) sequence level: compute an IS ratio over the whole sequence and filter out non-conforming sequences; (4) sequence level with a geometric mean for the IS weight, which makes the metric more sensitive. A sketch of these filters is shown below.
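A minimal sketch of the four filters, again with hypothetical `[batch, seq_len]` tensors and illustrative names rather than ROLL's actual implementation:

```python
import torch

def rejection_mask(old_log_probs: torch.Tensor,
                   infer_log_probs: torch.Tensor,
                   response_mask: torch.Tensor,
                   token_min: float = 0.0, token_max: float = 2.0,
                   catastrophic_min: float = 1e-4,
                   seq_mode: str = "sequence",
                   seq_min: float = 0.1, seq_max: float = 10.0) -> torch.Tensor:
    """Combined token-level, catastrophic, and sequence-level rejection mask."""
    log_ratio = (old_log_probs - infer_log_probs) * response_mask
    ratio = log_ratio.exp()  # masked positions get ratio = exp(0) = 1
    mask = response_mask.clone().float()

    # (1) Token-level reject: drop tokens whose ratio leaves [token_min, token_max]
    mask = mask * ((ratio >= token_min) & (ratio <= token_max)).float()

    # (2) Catastrophic reject: drop the whole sequence if any valid token's
    # ratio collapses below catastrophic_min
    catastrophic = ((ratio < catastrophic_min) & response_mask.bool()).any(dim=-1)
    mask = mask * (~catastrophic).float().unsqueeze(-1)

    # (3)/(4) Sequence-level reject: aggregate the log-ratio by sum ("sequence")
    # or by mean ("geometric", the geometric-mean variant), then threshold
    if seq_mode == "geometric":
        seq_ratio = (log_ratio.sum(-1) / response_mask.sum(-1).clamp(min=1)).exp()
    else:
        seq_ratio = log_ratio.sum(-1).exp()
    keep = (seq_ratio >= seq_min) & (seq_ratio <= seq_max)
    return mask * keep.float().unsqueeze(-1)
```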


## Key Parameter Configuration

Whether the generator returns infer_log_probs is controlled in GeneratingArguments:

```yaml
actor_infer:
  model_args:
    disable_gradient_checkpointing: true
    dtype: fp16
  generating_args:
    max_new_tokens: ${response_length}
    top_p: 0.99
    top_k: 100
    num_beams: 1
    temperature: 0.99
    num_return_sequences: ${num_return_sequences_in_group}
    logprobs: 1
```
When `logprobs` is greater than 0, infer_log_probs is returned.


-------

These parameters are configured in PPOConfig; the key settings are as follows:

```yaml
infer_correction: true

infer_is_mode: token # options: token, sequence
infer_is_threshold_min: 0.0
infer_is_threshold_max: 2.0 # 1.5~5.0

enable_token_reject: true
infer_token_mask_threshold_min: 0.0
infer_token_mask_threshold_max: 2.0 # 2~10

enable_catastrophic_reject: true
infer_catastrophic_threshold: 1e-4

enable_seq_reject: sequence # options: null, sequence, geometric
infer_seq_mask_threshold_min: 0.1
infer_seq_mask_threshold_max: 10
```


### infer_correction
- **Meaning**: Controls whether the train-infer discrepancy correction mechanism is enabled. When enabled, the system uses `infer_log_probs` to correct the policy gradient.
- **Default**: `false`

### infer_is_mode
- **Meaning**: Specifies the granularity at which importance sampling (IS) weights are computed.
- **Options**:
  - `"token"`: an IS weight is computed independently for each token
  - `"sequence"`: the log-ratio is summed over the whole response and broadcast to all tokens
  - `"none"` (or unset): no IS weighting; the weight is identically 1
- **Default**: `"token"` if unset
- **Note**: The modes are mutually exclusive; only one may be selected.

### infer_is_threshold_min
- **Meaning**: Lower clipping threshold for IS weights, used to clip overly small weights and control variance.
- **Default**: `0.0`
- **Recommendation**: Usually keep it at `0.0` to preserve an unbiased lower bound

### infer_is_threshold_max
- **Meaning**: Upper clipping threshold for IS weights, preventing extremely large weights from dominating the gradient.
- **Default**: `2.0`
- **Recommendation**: `1.5 ~ 5.0` at the `"token"` level; `2.0 ~ 10.0` at the `"sequence"` level

### enable_token_reject
- **Meaning**: Whether to enable the token-level rejection mechanism.
- **Default**: `false`
- **Effect**: Together with `infer_token_mask_threshold_min/max`, masks out tokens whose IS ratio falls outside the legal range.

### infer_token_mask_threshold_min
- **Meaning**: Lower bound of the token-level IS ratio, i.e. `exp(old_log_probs - infer_log_probs)`.
- **Default**: `0.0`
- **Typical value**: `0.0`; it can usually be set to `1/max` (the reciprocal of the upper threshold)

### infer_token_mask_threshold_max
- **Meaning**: Upper bound of the token-level IS ratio for masking.
- **Default**: `2.0`
- **Typical range**: `1.5 ~ 5.0`

### enable_catastrophic_reject
- **Meaning**: Whether to enable "catastrophic deviation" detection, which rejects the entire sequence.
- **Default**: `false`
- **Trigger condition**: If any token in the sequence satisfies `ratio < infer_catastrophic_threshold`, the whole sequence is masked.

### infer_catastrophic_threshold
- **Meaning**: Lower ratio threshold for catastrophic rejection.
- **Default**: `1e-4`
- **Explanation**: When `infer_log_probs` is much larger than `old_log_probs` (i.e. the generator is overly "confident"), `ratio = exp(old - infer)` becomes extremely small; for example, a log-prob gap of `old - infer = -10` gives a ratio of about `4.5e-5`, below the default threshold.

### enable_seq_reject
- **Meaning**: Whether to enable the sequence-level rejection mechanism, and which aggregation to use.
- **Options**:
  - `null` / `false`: disabled
  - `"sequence"`: compute the sequence IS ratio from the **sum** of log-ratios
  - `"geometric"`: compute it from the **mean** of log-ratios (equivalent to the geometric mean of per-token ratios)
- **Default**: `null`

### infer_seq_mask_threshold_min
- **Meaning**: Lower bound of the sequence-level IS ratio for masking.
- **Default**: `0.1`
- **Typical value**: Usually `1/max`; when using `"geometric"`, it is best to force it to `1/max`


### infer_seq_mask_threshold_max
- **Meaning**: Upper bound of the sequence-level IS ratio for masking.
- **Default**: `10.0`
- **Typical range**: With `"sequence"`, `2.0 ~ 10.0` is recommended, and the bound can be relaxed as sequence length grows. With `"geometric"`, values around `1.0001 ~ 1.001` are recommended.
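The geometric thresholds sit so close to 1 because geometric aggregation divides the summed log-ratio by the sequence length. A quick check with illustrative numbers:

```python
import math

# Suppose a 1000-token response accumulates a total log-ratio of 1.0
seq_len, total_log_ratio = 1000, 1.0
sum_ratio = math.exp(total_log_ratio)            # ~2.72: "sequence" aggregation
geo_ratio = math.exp(total_log_ratio / seq_len)  # ~1.001: "geometric" aggregation
print(f"sequence={sum_ratio:.3f}, geometric={geo_ratio:.4f}")
```

A geometric ratio of about 1.001 over a 1000-token response therefore corresponds to a summed sequence-level ratio of roughly e ≈ 2.72.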



## Usage Recommendations

1. In practice, `old_log_probs` is usually much smaller than `infer_log_probs`, so the lower thresholds matter most. Sequence-level IS or masking is generally not recommended.

14 changes: 14 additions & 0 deletions docs_roll/docs/User Guides/Configuration/vllm.md
@@ -74,6 +74,20 @@ In the configuration example, we can see:

This design allows different components to choose the most suitable inference engine according to their needs.

### beam_search Configuration
RLVRPipeline supports vLLM's beam_search generation method; configure it as follows:
```yaml
generate_opt_level: 0 # Falls back to the batch_generate generation mode; generate_opt_level=1 is the prompt-level parallel mode
num_return_sequences_in_group: 8
actor_infer:
  generating_args:
    num_beams: ${num_return_sequences_in_group}
    num_return_sequences: ${num_return_sequences_in_group}
```
Note:
- generating_args.num_beams and generating_args.num_return_sequences must be set to the same value.
- The generating_args used in validate is configured the same way.

## Performance Optimization Recommendations

1. **Memory Management**:
@@ -234,15 +234,15 @@ def formulate_rollouts(self, rollout_cache: RolloutCache):
lm_input.non_tensor_batch["episode_score"] = np.array([episode_score], dtype=object)

# Configure database field types
- colummns_config = [
+ columns_config = [
["task_idx", "bigint"],
["model_name", "string"],
["stop_reason", "string"],
["episode_score", "double"],
["mode", "string"],
["save_content", "string"],
]
- lm_input.meta_info["COLUMMNS_CONFIG"] = colummns_config
+ lm_input.meta_info["COLUMNS_CONFIG"] = columns_config

return lm_input
```
@@ -74,6 +74,21 @@ actor_infer:

This design allows different components to choose the most suitable inference engine for their needs.

### beam_search Configuration
RLVRPipeline supports vLLM's beam_search generation method; configure it as follows:
```yaml
generate_opt_level: 0 # Falls back to the batch_generate generation mode; generate_opt_level=1 is the prompt-level parallel mode
num_return_sequences_in_group: 8
actor_infer:
  generating_args:
    num_beams: ${num_return_sequences_in_group}
    num_return_sequences: ${num_return_sequences_in_group}
```
Note:
- generating_args.num_beams and generating_args.num_return_sequences must be set to the same value.
- The generating_args used in validate is configured the same way.


## Performance Optimization Recommendations

1. **Memory Management**:
5 changes: 2 additions & 3 deletions examples/qwen2.5-0.5B-agentic/agent_val_frozen_lake_amd.yaml
@@ -107,7 +107,7 @@ actor_infer:
strategy_args:
strategy_name: vllm
strategy_config:
- gpu_memory_utilization: 0.4
+ gpu_memory_utilization: 0.6
block_size: 16
load_format: auto
device_mapping: list(range(0,8))
@@ -131,7 +131,6 @@ reward_normalization:
method: mean_std # asym_clip / identity / mean_std

train_env_manager:
- format_penalty: -0.15 # sokoban env penalty_for_step=-0.1
max_env_num_per_worker: 16
num_env_groups: 128
# under the same group, the env config and env seed are ensured to be equal
@@ -163,8 +162,8 @@ custom_envs:
${custom_env.FrozenLakeThink}
FrozenLakeLocallyDefineExamples: # Can import from unified envs config or define dict locally
env_type: frozen_lake
max_steps: ${max_actions_per_traj}
max_tokens_per_step: ${max_tokens_per_step}
user_prompt_format: ${user_prompt_think_format}
env_manager_cls: ${env_manager_cls}
use_thread_lock: true
env_config: