Conversation

@xiongjyu (Collaborator) commented Nov 7, 2025

No description provided.

…d AdamW; add value_priority, adaptive policy entropy control, encoder-clip, label smoothing, latent representation analysis option, and cosine similarity loss.
```python
train_data_augmented.append(learner.train_iter)

log_vars = learner.train(train_data_augmented, collector.envstep)
reward_model.train_with_policy_batch(train_data)
```
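As written, the snippet updates the RND reward model in the same iteration as the UniZero learner, so the fused reward shifts every step. A minimal sketch of the alternative schedule, pretraining RND for a fixed budget before any policy update; `RND_PRETRAIN_ITERS`, `reward_model.estimate`, and the object interfaces are hypothetical stand-ins, not the actual LightZero API:

```python
# Hypothetical two-phase schedule. The method names (train_with_policy_batch,
# estimate, train) mirror the snippet above but are illustrative stand-ins.

RND_PRETRAIN_ITERS = 1000  # assumed warm-up budget for the RND predictor


def training_step(learner, reward_model, collector, train_data):
    if learner.train_iter < RND_PRETRAIN_ITERS:
        # Phase 1: fit only the RND network; no policy update yet, so the
        # fused reward used later comes from an already-trained predictor.
        reward_model.train_with_policy_batch(train_data)
        learner.train_iter += 1  # phase 1 advances the counter manually here
        return None
    # Phase 2: augment the batch with intrinsic rewards from the (now
    # stable) RND network, then update the UniZero learner.
    train_data_augmented = reward_model.estimate(train_data)
    train_data_augmented.append(learner.train_iter)
    return learner.train(train_data_augmented, collector.envstep)
```

This keeps the reward targets seen by UniZero fixed-distribution during early learning, at the cost of delaying intrinsic-reward feedback by the warm-up budget.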
@puyuan1996 (Collaborator) commented Nov 10, 2025

Shouldn't the reward_model be trained for some iterations first, and then UniZero use the trained RND network to estimate the fused reward before training UniZero's own network? In the current version the fused reward effectively changes at every iteration, which seems too unstable for UniZero's learning?

@xiongjyu (Collaborator, Author) replied:

Yes. I've now added the adaptive parameter we discussed earlier: it starts at 0 and slowly ramps up after a while. So in the initial phase we are effectively only training the RND network without actually using the intrinsic reward.
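A minimal sketch of the adaptive coefficient described above (zero during the initial RND-only phase, then a slow ramp); the warm-up cutoff, ramp length, cap value, and the weighted-sum fusion rule are all hypothetical placeholders, not values from this PR:

```python
def intrinsic_reward_weight(train_iter,
                            warmup_iters=10_000,  # hypothetical: RND-only phase
                            ramp_iters=40_000,    # hypothetical: ramp duration
                            max_weight=0.1):      # hypothetical: final coefficient
    """Intrinsic-reward coefficient: 0 while only the RND network is being
    trained, then a linear ramp from 0 up to max_weight."""
    if train_iter < warmup_iters:
        return 0.0
    progress = min(1.0, (train_iter - warmup_iters) / ramp_iters)
    return max_weight * progress


def fused_reward(extrinsic, intrinsic, train_iter):
    # One common fusion rule (assumed here): extrinsic plus a scaled
    # intrinsic bonus. During warm-up the bonus contributes nothing.
    w = intrinsic_reward_weight(train_iter)
    return extrinsic + w * intrinsic
```

With this schedule the fused reward equals the extrinsic reward for the whole warm-up phase, so UniZero's targets stay stable while the RND predictor is still fitting.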

Collaborator:

Are all the new runs using this approach?

@puyuan1996 added the `research` (Research work in progress) label Nov 10, 2025
@xiongjyu changed the title from "Dev rnd" to "feature(xjy): add the rnd-related features" Nov 10, 2025