
Commit 81db0b2

Authored by puyuan1996, zjowowen, tAnGjIa520, wangshulun
feature/polish(pu): add atari/dmc multitask and balance pipeline from the ScaleZero paper and fix MuZero/UniZero long-run performance (#451)
* feature(pu): add unizero/muzero multitask pipeline and net plasticity related metrics
* fix(pu): fix some adaptation bugs
* feature(pu): add unizero multitask balance pipeline for atari and dmc
* fix(pu): fix some adaptation bugs
* feature(pu): add vit encoder for unizero
* polish(pu): polish moe layer in transformer
* feature(pu): add eval norm mean/median for atari
* fix(pu): fix atari norm mean/median, fix collect in balance pipeline
* polish(pu): polish config
* fix(pu): fix dmc multitask to be compatible with timestep (which is used in rope)
* polish(pu): polish config
* fix(pu): fix task_id bug in balance pipeline, and polish benchmark_name option
* fix(pu): fix benchmark_name option
* polish(pu): fix norm score computation, adapt config to aliyun
* polish(pu): polish unizero_mt balance pipeline to use CurriculumController and fix solved gpu batch-size bug
* tmp
* tmp
* tmp
* test(pu): add vit moe test
* polish(pu): add adapter_scales to tb
* feature(pu): add atari uz balance config
* polish(pu): add stable_adaptor_scale
* tmp
* sync code
* polish(pu): use freeze_non_lora_parameters in transformer; do not use LearnableScale in balance pipeline
* feature(pu): add vit-encoder lora in balance pipeline
* polish(pu): fix reanalyze index bug, fix global_solved bug, add apply_curriculum_to_encoder option
* polish(pu): add collect/eval_num_simulations option
* polish(pu): polish comments and style in entry of scalezero
* polish(pu): polish comments and style of ctree/tree_search/buffer/common.py
* polish(pu): polish comments and style of files in lzero.model
* polish(pu): polish comments and style of files in lzero.model.unizero_world_models
* polish(pu): polish comments and style of unizero_world_models
* polish(pu): polish comments and style of files in policy/
* polish(pu): polish comments and style of files in worker
* polish(pu): polish comments and style of files in configs
* fix(pu): fix some merge typos
* fix(pu): fix ln norm_type, fix kv_cache rewrite bug, add value_priority, fix _reset_collect/eval, add adaptive policy entropy control
* fix(pu): fix unizero_mt
* polish(pu): add LN in head, polish init_weight, polish adamw weight-decay
* fix(pu): fix configure_optimizer_unizero in unizero_mt
* feature(pu): add encoder-clip, label smooth, analyze_latent_representation option in unizero.py
* feature(pu): add encoder-clip, label smooth option in unizero_multitask.py
* fix(pu): fix tb log when gpu_num < task_num, fix total_loss += bug, polish alpha_loss
* polish(pu): polish config
* fix(pu): fix encoder-clip bug and num_channel/res bug
* polish(pu): polish scale_factor in DPS
* tmp
* feature(pu): add some analysis metrics in tensorboard for unizero and unizero-mt
* polish(pu): abstract a KVCacheManager for world model
* tmp
* polish(pu): polish unizero obs_loss to cos_sim loss
* tmp
* polish(pu): polish monitor-log and adapt to ale/xxx-v5 style games
* feature(pu): add decode_loss for unizero atari
* test(pu): test unizero-mt
* fix(pu): fix deep-copy-before-storage bug when using KVCacheManager
* sync code
* feature(pu): add iter_policy_evaluation demo in grid-world
* polish(pu): polish atari uz config
* polish(pu): polish policy logits stability
* sync code
* polish(pu): polish policy logits stability
* fix(pu): fix exp_name and task_id bug in dmc pipeline, fix some configs
* feature(pu): add head-clip manager
* fix(pu): fix head-clip log
* tmp
* polish(pu): polish comments and code styles
* polish(pu): polish comments and code styles in entry/mcts/model
* polish(pu): polish comments and code styles in policy/config
* polish(pu): polish comments and code styles in config
* polish(pu): polish comments and code styles in atari env
* fix(pu): fix comments of worker in ddp mode, fix device bug in evaluator for unizero_multitask pipeline
* fix(pu): fix unizero_multitask ddp barrier bug
* fix(pu): add policy_logits_clip_method option
* fix(pu): add policy_logits_clip_method option
* polish(pu): polish comments, docstring, readme
* polish(pu): polish atari unizero configs and default configs in unizero.py
* polish(pu): update to macos-15
* fix(pu): fix gymnasium[atari] version
* fix(pu): fix import bug
* polish(pu): polish comments, docstrings, and some minor redundancy
* polish(pu): optimize import order
* refactor(pu): move some reusable common variables and the safe_eval() method to lzero/entry/utils.py
* fix(pu): fix Optional import bug
* fix(pu): fix prediction network
* fix(pu): add brew install swig in test.yml
* fix(pu): fix import bug in test
* fix(pu): fix type lint bug
* fix(pu): fix import bug in test
* fix(pu): fix import bug in test
* fix(pu): fix test
* fix(pu): fix some args bugs
* polish(pu): add some comments and minor polish
* fix(pu): fix 2 tests
* fix(pu): fix not_enough_data ddp bug
* fix(pu): fix final_norm_option and predict_latent_loss_type default config bug

Co-authored-by: puyuan <puyuan1996@qq.com>
Co-authored-by: zjowowen <zjowowen@outlook.com>
Co-authored-by: jasper <1157507000@qq.com>
Co-authored-by: wangshulun <wangshulun@vivi-x.ai>
1 parent 8ec0169 commit 81db0b2

96 files changed (+21064 additions, -3437 deletions)


.github/workflows/release.yml

Lines changed: 12 additions & 12 deletions
```diff
@@ -53,7 +53,7 @@ jobs:
       matrix:
         os:
           - 'ubuntu-20.04'
-          - 'macos-13'
+          - 'macos-15'
         python:
           - '3.7'
           - '3.8'
@@ -73,11 +73,11 @@ jobs:
             architecture: x86
           - os: ubuntu-20.04
             architecture: AMD64
-          - os: macos-13
+          - os: macos-15
             architecture: aarch64
-          - os: macos-13
+          - os: macos-15
             architecture: x86
-          - os: macos-13
+          - os: macos-15
             architecture: AMD64
 
       steps:
@@ -167,25 +167,25 @@ jobs:
           name: build-artifacts-wheels-ubuntu-20.04-3.11-aarch64
           path: aggregated_wheels_all
 
-      - name: Download wheel macos-13, 3.7, x86_64
+      - name: Download wheel macos-15, 3.7, x86_64
         uses: actions/download-artifact@v4
         with:
-          name: build-artifacts-wheels-macos-13-3.7-x86_64
+          name: build-artifacts-wheels-macos-15-3.7-x86_64
           path: aggregated_wheels_all
-      - name: Download wheel macos-13, 3.8, x86_64
+      - name: Download wheel macos-15, 3.8, x86_64
         uses: actions/download-artifact@v4
         with:
-          name: build-artifacts-wheels-macos-13-3.8-x86_64
+          name: build-artifacts-wheels-macos-15-3.8-x86_64
           path: aggregated_wheels_all
-      - name: Download wheel macos-13, 3.7, arm64
+      - name: Download wheel macos-15, 3.7, arm64
         uses: actions/download-artifact@v4
         with:
-          name: build-artifacts-wheels-macos-13-3.7-arm64
+          name: build-artifacts-wheels-macos-15-3.7-arm64
           path: aggregated_wheels_all
-      - name: Download wheel macos-13, 3.8, arm64
+      - name: Download wheel macos-15, 3.8, arm64
        uses: actions/download-artifact@v4
        with:
-          name: build-artifacts-wheels-macos-13-3.8-arm64
+          name: build-artifacts-wheels-macos-15-3.8-arm64
           path: aggregated_wheels_all
 
       - name: Upload unified wheels artifact
```

.github/workflows/release_test.yml

Lines changed: 4 additions & 4 deletions
```diff
@@ -56,7 +56,7 @@ jobs:
       matrix:
         os:
           - 'ubuntu-20.04'
-          - 'macos-13'
+          - 'macos-15'
         python:
           - '3.7.17'
           - '3.8.17'
@@ -76,11 +76,11 @@ jobs:
             architecture: x86
           - os: ubuntu-20.04
             architecture: AMD64
-          - os: macos-13
+          - os: macos-15
             architecture: aarch64
-          - os: macos-13
+          - os: macos-15
             architecture: x86
-          - os: macos-13
+          - os: macos-15
             architecture: AMD64
           - python: '3.7.17'
             architecture: arm64
```

.github/workflows/test.yml

Lines changed: 2 additions & 2 deletions
```diff
@@ -20,7 +20,7 @@ jobs:
       matrix:
         os:
           - 'self-hosted'
-          - 'macos-13'
+          - 'macos-15'
         python-version:
           - '3.8'
           - '3.9'
@@ -61,7 +61,7 @@ jobs:
         if: ${{ env.OS_NAME == 'MacOS' }}
         shell: bash
         run: |
-          brew install tree cloc wget curl make zip graphviz
+          brew install tree cloc wget curl make zip graphviz swig
           brew install llvm # Install llvm (which includes clang)
           brew install opencv # Install OpenCV
           echo 'export PATH="/usr/local/opt/llvm/bin:$PATH"' >> $GITHUB_ENV # update PATH
```

.gitignore

Lines changed: 1 addition & 1 deletion
```diff
@@ -1453,4 +1453,4 @@ events.*
 !/assets/pooltool/**
 lzero/mcts/ctree/ctree_alphazero/pybind11
 
-zoo/jericho/envs/z-machine-games-master
+zoo/jericho/envs/z-machine-games-master
```

(The removed and added lines are textually identical; the change is presumably whitespace-only, e.g. a trailing newline, which the extracted diff does not show.)

lzero/entry/README.md

Lines changed: 156 additions & 0 deletions
# LightZero Entry Functions

English | [中文](./README_zh.md)

This directory contains the training and evaluation entry functions for the various algorithms in the LightZero framework. These entry functions serve as the main interfaces for launching different types of reinforcement learning experiments.

## 📁 Directory Structure

### 🎯 Training Entries

#### AlphaZero Family

- **`train_alphazero.py`** - Training entry for the AlphaZero algorithm
  - Suitable for perfect-information board games (e.g., Go, Chess)
  - No environment model needed; learns through self-play
  - Uses Monte Carlo Tree Search (MCTS) for policy improvement

#### MuZero Family

- **`train_muzero.py`** - Standard training entry for the MuZero algorithm
  - Supports the MuZero, EfficientZero, Sampled EfficientZero, and Gumbel MuZero variants
  - Learns an implicit model of the environment (dynamics model)
  - Suitable for single-task reinforcement learning scenarios

- **`train_muzero_segment.py`** - MuZero training with segment collector and buffer reanalyze
  - Uses `MuZeroSegmentCollector` for data collection
  - Supports the buffer reanalyze trick for improved sample efficiency
  - Supported algorithms: MuZero, EfficientZero, Sampled MuZero, Sampled EfficientZero, Gumbel MuZero, StochasticMuZero

- **`train_muzero_with_gym_env.py`** - MuZero training adapted for Gym environments
  - Specifically designed for OpenAI Gym-style environments
  - Simplifies environment interface adaptation

- **`train_muzero_with_reward_model.py`** - MuZero training with a reward model
  - Integrates an external reward model
  - Suitable for scenarios that require learning complex reward functions

- **`train_muzero_multitask_segment_ddp.py`** - MuZero multi-task distributed training
  - Supports multi-task learning
  - Uses DDP (Distributed Data Parallel) for distributed training
  - Uses the segment collector

#### UniZero Family

- **`train_unizero.py`** - Training entry for the UniZero algorithm
  - Based on the paper "UniZero: Generalized and Efficient Planning with Scalable Latent World Models"
  - Enhanced planning capabilities for better long-term dependency capture
  - Uses scalable latent world models
  - Paper: https://arxiv.org/abs/2406.10667

- **`train_unizero_segment.py`** - UniZero training with segment collector
  - Uses `MuZeroSegmentCollector` for efficient data collection
  - Supports the buffer reanalyze trick

- **`train_unizero_multitask_segment_ddp.py`** - UniZero/ScaleZero multi-task distributed training
  - Supports multi-task learning and distributed training
  - Includes benchmark score definitions (e.g., Atari human-normalized scores)
  - Supports curriculum learning strategies
  - Uses DDP for training acceleration

- **`train_unizero_multitask_balance_segment_ddp.py`** - UniZero/ScaleZero balanced multi-task distributed training
  - Implements balanced sampling across tasks in multi-task training
  - Dynamically adjusts batch sizes for different tasks
  - Suitable for scenarios with large variations in task difficulty

- **`train_unizero_multitask_segment_eval.py`** - UniZero/ScaleZero multi-task evaluation training
  - Specialized for training with periodic evaluation in multi-task scenarios
  - Includes detailed evaluation metric statistics

- **`train_unizero_with_loss_landscape.py`** - UniZero training with loss landscape visualization
  - For training with loss landscape visualization
  - Helps understand the model's optimization process and generalization performance
  - Integrates the `loss_landscapes` library

#### ReZero Family

- **`train_rezero.py`** - Training entry for the ReZero algorithm
  - Supports ReZero-MuZero and ReZero-EfficientZero
  - Improves training stability through residual connections
  - Paper: https://arxiv.org/pdf/2404.16364

### 🎓 Evaluation Entries

- **`eval_alphazero.py`** - Evaluation entry for AlphaZero
  - Loads trained AlphaZero models for evaluation
  - Can play against other agents for performance testing

- **`eval_muzero.py`** - Evaluation entry for the MuZero family
  - Supports evaluation of all MuZero variants
  - Provides detailed performance statistics (a minimal usage sketch follows this list)

- **`eval_muzero_with_gym_env.py`** - MuZero evaluation for Gym environments (not recently maintained)
  - Specialized for evaluating models trained in Gym environments

## 📖 Usage Guide
93+
94+
### Basic Usage Pattern
95+
96+
All training entry functions follow a similar calling pattern:
97+
98+
```python
99+
from lzero.entry import train_muzero
100+
101+
# Prepare configuration
102+
cfg = dict(...) # User configuration
103+
create_cfg = dict(...) # Creation configuration
104+
105+
# Start training
106+
policy = train_muzero(
107+
input_cfg=(cfg, create_cfg),
108+
seed=0,
109+
model=None, # Optional: pre-initialized model
110+
model_path=None, # Optional: pretrained model path
111+
max_train_iter=int(1e10), # Maximum training iterations
112+
max_env_step=int(1e10), # Maximum environment steps
113+
)
114+
```
115+
116+
### Choosing the Right Entry Function
117+
118+
1. **Single-Task Learning**:
119+
- Board games → `train_alphazero`
120+
- General RL tasks → `train_muzero` or `train_unizero`
121+
- Gym environments → `train_muzero_with_gym_env` (not recently maintained)
122+
123+
2. **Multi-Task Learning**:
124+
- Standard multi-task → `train_unizero_multitask_segment_ddp`
125+
- Balanced task sampling → `train_unizero_multitask_balance_segment_ddp`
126+
127+
3. **Distributed Training**:
128+
- All entry functions with `_ddp` suffix support distributed training
129+
130+
4. **Special Requirements**:
131+
- Loss landscape visualization → `train_unizero_with_loss_landscape`
132+
- External reward model → `train_muzero_with_reward_model`
133+
- Improved training stability → `train_rezero`
134+
135+
## 🔗 Related Resources
136+
137+
- **AlphaZero**: [Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm](https://arxiv.org/abs/1712.01815)
138+
- **MuZero**: [Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model](https://arxiv.org/abs/1911.08265)
139+
- **EfficientZero**: [Mastering Atari Games with Limited Data](https://arxiv.org/abs/2111.00210)
140+
- **UniZero**: [Generalized and Efficient Planning with Scalable Latent World Models](https://arxiv.org/abs/2406.10667)
141+
- **ReZero**: [Boosting MCTS-based Algorithms by Reconstructing the Terminal Reward](https://arxiv.org/abs/2404.16364)
142+
- **ScaleZero**: [One Model for All Tasks: Leveraging Efficient World Models in Multi-Task Planning](https://arxiv.org/abs/2509.07945)
143+
144+
## 💡 Tips
145+
146+
- Recommended to start with standard `train_muzero` or `train_unizero`
147+
- For large-scale experiments, consider using DDP versions for faster training
148+
- Using `_segment` versions can achieve better sample efficiency (via reanalyze trick)
149+
- Check configuration examples in `zoo/` directory to learn how to set up each algorithm
150+
151+
## 📝 Notes
152+
153+
1. All path parameters should use **absolute paths**
154+
2. Pretrained model paths typically follow format: `exp_name/ckpt/ckpt_best.pth.tar`
155+
3. When using distributed training, ensure `CUDA_VISIBLE_DEVICES` environment variable is set correctly
156+
4. Some entry functions have specific algorithm type requirements - check function documentation
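
As an illustration of notes 1 and 2, a resume-from-checkpoint call is sketched below. The experiment directory is hypothetical, and the config placeholders stand in for the configuration prepared in the basic usage pattern above.

```python
import os

from lzero.entry import train_muzero

# Hypothetical experiment directory; per note 1, always pass an absolute path.
exp_dir = '/data/experiments/atari_muzero_pong_seed0'
ckpt_path = os.path.join(exp_dir, 'ckpt', 'ckpt_best.pth.tar')  # layout from note 2

# Placeholder configs; fill these in as in the basic usage pattern above.
cfg = dict()
create_cfg = dict()

policy = train_muzero(
    input_cfg=(cfg, create_cfg),
    seed=0,
    model_path=ckpt_path,   # resume/fine-tune from the pretrained checkpoint
    max_env_step=int(1e6),
)
```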
