feature(pu): add atari/dmc multitask and balance pipeline in ScaleZero paper #417
Open

puyuan1996 wants to merge 51 commits into main from dev-multitask-balance-clean
Conversation
…er and fix solved gpu batch-size bug
…dilab/LightZero into dev-multitask-balance-clean
…arnableScale in balance pipeline
…_curriculum_to_encoder option
…ty, fix _reset_collect/eval, add adaptive policy entropy control
…ation option in unizero.py
Labels
config
New or improved configuration
enhancement
New feature or request
research
Research work in progress
This pull request implements the core components of the ScaleZero paper by introducing a multi-task, balanced training pipeline for Atari and DeepMind Control (DMC) environments.
To enhance stability and performance in this new multi-task setting, several key improvements and bug fixes were made. We replaced BatchNorm with the more robust LayerNorm, corrected a critical bug that caused the kv_cache to be improperly overwritten, and fixed the state reset logic in _reset_eval() and _reset_collect() to ensure accurate evaluation.
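The motivation for the BatchNorm-to-LayerNorm swap can be illustrated with a minimal NumPy sketch (illustrative only, not the repository's actual torch modules): LayerNorm computes statistics per sample over the feature dimension, so its output does not depend on which other tasks' samples share the batch, whereas BatchNorm mixes statistics across the batch.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each sample over its feature dimension (last axis);
    # statistics are per-sample, independent of batch composition.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def batch_norm(x, eps=1e-5):
    # Normalize each feature over the batch axis; statistics mix
    # samples, which couples tasks when a batch spans multiple tasks.
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

x = np.random.randn(8, 16)  # 8 samples, 16 features
out = layer_norm(x)         # each row now has ~zero mean, ~unit variance
```

With LayerNorm, a sample's normalization is identical whether it is batched with Atari or DMC data, which is one reason it tends to be more robust in multi-task training.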
Additionally, the PR introduces target-entropy control for better policy optimization, makes the number of MCTS simulations configurable for evaluation, and integrates relevant updates from the longrun PR to maintain code consistency.
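The target-entropy control mentioned above can be sketched as a simple feedback loop (hypothetical names, not LightZero's actual API): an entropy coefficient is raised when policy entropy falls below a target and lowered when it overshoots, similar in spirit to SAC-style temperature tuning.

```python
import numpy as np

def policy_entropy(probs, eps=1e-12):
    # Shannon entropy of a discrete action distribution.
    return -np.sum(probs * np.log(probs + eps), axis=-1)

class EntropyCoefController:
    """Illustrative sketch of adaptive entropy-coefficient control;
    class and parameter names are assumptions, not the PR's code."""
    def __init__(self, target_entropy, lr=0.01, init_coef=0.01):
        self.log_coef = np.log(init_coef)  # optimize in log space to stay positive
        self.target = target_entropy
        self.lr = lr

    def update(self, entropy):
        # Raise the coefficient when entropy is below target
        # (encourage exploration), lower it when entropy overshoots.
        self.log_coef += self.lr * (self.target - entropy)
        return np.exp(self.log_coef)

ctrl = EntropyCoefController(target_entropy=1.0, lr=0.1)
coef = ctrl.update(policy_entropy(np.ones(4) / 4))  # uniform policy, high entropy
```

Driving entropy toward a fixed target rather than using a hand-tuned constant bonus is useful in the multi-task setting, since different tasks may otherwise collapse to deterministic policies at different rates.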