Conversation

@xiongjyu xiongjyu commented Nov 20, 2025

This PR primarily refines the PriorZero implementation and development workflow, fixes several critical issues affecting training correctness and stability, and systematically strengthens the training logic, loss computation, and data collection.

Work completed in this PR
• Fixed multiple critical bugs in the PriorZero training pipeline, including errors in game segment construction, loss computation, log-prob alignment, and action handling.
• Improved the REINFORCE / RFT-style policy optimization: old_logprob is now correctly stored in the buffer and used during updates, ensuring correct policy updates (see the sketch after this list).
• Added and standardized training statistics, including KL divergence and policy entropy, for better monitoring of training status.
• Optimized the data flow between the Collector and the Replay Buffer, improving data consistency and sampling stability and reducing implicit errors.
• Introduced and verified the vLLM weight synchronization mechanism in the single-GPU setting.
• vLLM weight synchronization and stability verification in the multi-GPU / multi-node setting.
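
A minimal sketch of the REINFORCE / RFT-style update that uses the old_logprob stored in the buffer. The function and argument names (rft_policy_loss, advantage) are illustrative, not the repository's actual API:

```python
import torch

def rft_policy_loss(new_logprob: torch.Tensor,
                    old_logprob: torch.Tensor,
                    advantage: torch.Tensor) -> torch.Tensor:
    """Off-policy REINFORCE-style surrogate loss (illustrative sketch).

    new_logprob: log pi_theta(a|s) under the current policy.
    old_logprob: log pi_old(a|s) stored in the replay buffer at collection time.
    advantage:   return / advantage estimate for the taken actions.
    """
    # Importance weight correcting for the gap between the behaviour policy
    # (whose log-probs were stored in the buffer) and the current policy.
    ratio = torch.exp(new_logprob - old_logprob)
    # Maximise the importance-weighted advantage; negate to get a loss.
    return -(ratio * advantage).mean()
```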

@xiongjyu xiongjyu deleted the branch opendilab:dev-multitask-balance-clean-rft November 24, 2025 14:28
@xiongjyu xiongjyu closed this Nov 24, 2025
@xiongjyu xiongjyu deleted the dev-multitask-balance-clean-rft branch November 24, 2025 14:28
@xiongjyu xiongjyu reopened this Nov 24, 2025
@puyuan1996 puyuan1996 added the research Research work in progress label Nov 28, 2025
for i in range(num_engines):
    bundle_indices = None
    if tensor_parallel_size > 1:
        bundle_indices = get_bundle_indices(shared_pg, i, tensor_parallel_size)
Collaborator:

Is this adapted from Ray's official improvements here?

Collaborator (Author):

This vllm_engine is basically the same as the corresponding part in OpenRLHF; however, we currently use only a single vLLM instance with tensor_parallel_size = 1, since GPU memory is sufficient.
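
For context, a sketch of what the bundle-index computation typically does when tensor_parallel_size > 1 (the helper name below is illustrative, not the actual get_bundle_indices from the Ray utilities): each engine claims a contiguous slice of bundles from the shared placement group so its tensor-parallel workers are co-scheduled. With one engine and tensor_parallel_size = 1, this path is skipped entirely.

```python
def get_bundle_indices_sketch(engine_index: int, tensor_parallel_size: int) -> list:
    # Engine i claims bundles [i * tp, (i + 1) * tp) of the shared placement
    # group, so all of its tensor-parallel workers land on co-located resources.
    start = engine_index * tensor_parallel_size
    return list(range(start, start + tensor_parallel_size))

# Example: with 2 engines and tensor_parallel_size = 4, engine 1 would use
# bundles [4, 5, 6, 7].
```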

…ple for world-model training; train LLM only on latest trajectories
for action in actions:
    prior.append(llm_prior_logprob[idx][action])
policy_priors.append(prior)
policy_priors = self.pad_to_fixed_length(data=policy_priors, target_len=self.cfg.model.action_space_size, pad_val=-1e9)
Collaborator:

Note: check that the ordering of valid_actions_list here matches its correspondence with action_mask.

Collaborator (Author):

I checked this; it is correct.
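
A small hedged check along the lines of the comment above (valid_actions_list and action_mask follow the surrounding code; the assertion itself is an illustrative addition, not existing code):

```python
def check_action_mask_alignment(valid_actions_list, action_mask) -> None:
    # For every sample i, the actions in valid_actions_list[i] should be exactly
    # the indices where action_mask[i] == 1, in ascending order, so that
    # llm_prior_logprob[idx][action] is gathered for the right actions.
    for i, valid_actions in enumerate(valid_actions_list):
        masked = [a for a, m in enumerate(action_mask[i]) if m == 1]
        assert list(valid_actions) == masked, (
            f"valid_actions_list[{i}]={list(valid_actions)} does not match "
            f"action_mask[{i}] positions {masked}"
        )
```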

# ============ LLM Loss Metrics ============
'llm_sft_loss', # Supervised fine-tuning loss
'llm_rft_loss', # Reinforcement fine-tuning loss
Collaborator:

_forward_learn does not compute these statistics at the moment, right?

Collaborator (Author):

I have changed this part: these statistics are now computed inside the LLM component; they are indeed not computed here.
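
A minimal sketch of how KL divergence and policy entropy can be derived from the same log-probs used for the RFT loss, assuming per-sample logits over the discrete action space (function and metric key names are illustrative):

```python
import torch
import torch.nn.functional as F

def llm_policy_stats(new_logits: torch.Tensor,
                     old_logprob: torch.Tensor,
                     actions: torch.Tensor) -> dict:
    # Current-policy log-probs over the full action space, shape (B, A).
    new_logprob_all = F.log_softmax(new_logits, dim=-1)
    # Log-probs of the actions actually taken, shape (B,).
    new_logprob = new_logprob_all.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    # Simple k1 estimator of KL(old || new) on the taken actions.
    approx_kl = (old_logprob - new_logprob).mean()
    # Entropy of the current policy distribution, averaged over the batch.
    entropy = -(new_logprob_all.exp() * new_logprob_all).sum(dim=-1).mean()
    return {'llm_kl_divergence': approx_kl.item(), 'llm_policy_entropy': entropy.item()}
```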

return samples
T = len(raw_obs_list[0])

for b in range(B):
Collaborator:

TODO: a more efficient construction method.
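
One possible direction for this TODO, assuming raw_obs_list holds B trajectories of equal length T with array-like observations (a sketch, not the repository's construction code): stack with NumPy instead of nested Python loops.

```python
import numpy as np

def build_obs_batch(raw_obs_list):
    # np.asarray stacks the nested list into a (B, T, ...) array in native
    # code, replacing the explicit per-b / per-t Python loops; ragged
    # trajectories would still need padding before this step.
    return np.asarray(raw_obs_list)
```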

if num_of_transitions >= replay_buffer.replay_buffer_size:
    all_data = replay_buffer.sample(batch_size=replay_buffer.replay_buffer_size, policy=policy)
    replay_buffer._clear()
    trainer.train_rft_from_priorzero_batch(all_data)
Collaborator:

TODO: allow controlling how off-policy the training is.

Collaborator:

TODO: is LLM training too slow?

Collaborator:

The world model is also trained by sampling from the replay_buffer, right? Only the data used for LLM training should be cleared, not the data used for world-model training; world-model training needs a fairly large buffer size.
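
A hedged sketch of how the three comments above could be addressed together: trigger RFT once enough fresh transitions have accumulated, sample only the most recent ones to bound how off-policy the update is, and leave the world-model buffer intact (sample_recent, last_rft_count, and the constants are hypothetical, not existing APIs or config fields):

```python
# Assumed knobs, not existing config fields.
RFT_TRIGGER_TRANSITIONS = 256 * 10   # fresh transitions needed to trigger one RFT round
OFF_POLICY_REUSE = 1                 # 1 = near on-policy; >1 reuses older data

if num_of_transitions - last_rft_count >= RFT_TRIGGER_TRANSITIONS:
    # Hypothetical API: sample only the newest transitions for the LLM,
    # instead of draining and clearing the whole buffer.
    rft_batch = replay_buffer.sample_recent(
        batch_size=RFT_TRIGGER_TRANSITIONS * OFF_POLICY_REUSE, policy=policy)
    trainer.train_rft_from_priorzero_batch(rft_batch)
    last_rft_count = num_of_transitions
    # Note: no replay_buffer._clear() here; the world model keeps sampling
    # from the full buffer, which needs a large capacity.
```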

if coordinator.can_collect():
    logger.info(f"\n[Iter {learner.train_iter}] Starting async collect...")

    async def collect_fn():
Collaborator:

Has the asynchronous collect-and-train path passed testing yet?

Collaborator (Author):

I have never used this asynchronous path; shouldn't training wait for collection to finish before sampling? I don't think this part can be asynchronous.
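
For reference, a sketch of the two orderings being discussed, assuming collect_fn is an awaitable and train_on_buffer is a hypothetical synchronous training call: training indeed has to wait for the data it samples, but an asynchronous design can still overlap the next collection with training on already-collected data.

```python
import asyncio

async def sequential_iteration(collect_fn, train_on_buffer):
    # What the author describes: train only after collection has finished.
    new_segments = await collect_fn()
    train_on_buffer(new_segments)

async def overlapped_iteration(collect_fn, train_on_buffer, previous_segments):
    # What async could still buy: start the next rollout, then train on the
    # data collected in the previous iteration while the rollout runs.
    collect_task = asyncio.create_task(collect_fn())
    train_on_buffer(previous_segments)
    return await collect_task   # hand these segments to the next iteration
```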

self.log_state_to_tb()

def _broadcast_to_vllm(self):
    use_prefix_cache = getattr(self.strategy.args, "enable_prefix_caching", False)
Collaborator:

Weight updates are currently serial; they could be switched to asynchronous updates via @ray.remote. Also, there should be no need to update after every LLM training round; a sync frequency could be configured.

Collaborator (Author):

OK.

Collaborator (Author):

But right now each LLM training round trains on 256 * 10 samples, which should already be quite a lot; with batch_size = 64, that is 40 parameter updates per round. Doesn't that already warrant a sync?
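
A minimal sketch of the frequency gate being suggested (trainer, train_round, and sync_interval are illustrative names): sync to vLLM only every sync_interval RFT rounds; with roughly 40 optimizer updates per round, an interval of 1 is defensible, and larger values trade weight freshness for wall-clock speed.

```python
def maybe_sync_vllm(trainer, train_round: int, sync_interval: int = 1) -> None:
    # Broadcast updated LLM weights to the vLLM engine only every
    # `sync_interval` RFT rounds instead of unconditionally after each one.
    if train_round % sync_interval == 0:
        trainer._broadcast_to_vllm()   # method shown in the diff above
```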

@xiongjyu xiongjyu changed the title feature(xjy): Fixed the accumulate_steps, game_segment/weighted_total_loss bugs and refine prompts, compute_llm_prior, and SFT loss, and added cprofile functionality. feature(xjy): priorzero Dec 19, 2025
@xiongjyu xiongjyu changed the title feature(xjy): priorzero feature(xjy): Refine PriorZero Implementation Dec 19, 2025