A commented and [documented](https://github.com/werner-duvaud/muzero-general/wiki/MuZero-Documentation) implementation of MuZero based on the Google DeepMind [paper](https://arxiv.org/abs/1911.08265) (Schrittwieser et al., Nov 2019) and the associated [pseudocode](https://arxiv.org/src/1911.08265v2/anc/pseudocode.py).
It is designed to be easily adaptable to any game or reinforcement learning environment (like [gym](https://github.com/openai/gym)). You only need to add a [game file](https://github.com/werner-duvaud/muzero-general/tree/master/games) with the hyperparameters and the game class. Please refer to the [documentation](https://github.com/werner-duvaud/muzero-general/wiki/MuZero-Documentation) and the [example](https://github.com/werner-duvaud/muzero-general/blob/master/games/cartpole.py).
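For intuition, here is a minimal sketch of the shape such a game class can take. It is modeled loosely on the cartpole example; the method names (`reset`, `step`, `legal_actions`) and the toy "reach 3" game itself are illustrative assumptions, not the project's exact interface.

```python
# Sketch of a custom game class, loosely modeled on games/cartpole.py.
# The method names are assumptions taken from that example, not an
# authoritative interface; the game itself is a made-up toy.
class Game:
    """Toy 'reach 3' game: action 1 increments a counter, action 0 resets it.
    The episode ends (with reward 1) once the counter reaches 3."""

    def __init__(self, seed=None):
        self.counter = 0

    def legal_actions(self):
        return [0, 1]

    def reset(self):
        self.counter = 0
        return [self.counter]  # observation

    def step(self, action):
        self.counter = self.counter + 1 if action == 1 else 0
        done = self.counter >= 3
        reward = 1 if done else 0
        return [self.counter], reward, done


game = Game()
obs = game.reset()
for action in [1, 1, 1]:
    obs, reward, done = game.step(action)
```

A real game file would pair a class like this with a `MuZeroConfig` holding the hyperparameters for that environment.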
This implementation is primarily for educational purposes.\
[Explanatory video of MuZero](https://youtu.be/We20YSAJZSE)
MuZero is a state-of-the-art RL algorithm for board games (Chess, Go, ...) and Atari games.
It is the successor to [AlphaZero](https://arxiv.org/abs/1712.01815), but without any knowledge of the environment's underlying dynamics. MuZero learns a model of the environment and uses an internal representation that contains only the information useful for predicting the reward, value, policy and transitions. MuZero is also close to [Value prediction networks](https://arxiv.org/abs/1707.03497). See [How it works](https://github.com/werner-duvaud/muzero-general/wiki/How-MuZero-works).
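MuZero's learned model decomposes into three functions: a representation function (observation to hidden state), a dynamics function (hidden state and action to next hidden state and reward), and a prediction function (hidden state to policy and value). The toy sketch below uses plain `numpy` linear maps as stand-ins (in MuZero each is a neural network; the shapes and names here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for MuZero's three functions (the real ones are neural nets):
#   h: representation  observation -> hidden state
#   g: dynamics        (hidden state, action) -> (next hidden state, reward)
#   f: prediction      hidden state -> (policy logits, value)
HIDDEN, N_ACTIONS, OBS_DIM = 4, 2, 3

W_h = rng.normal(size=(OBS_DIM, HIDDEN))
W_g = rng.normal(size=(HIDDEN + N_ACTIONS, HIDDEN))
W_r = rng.normal(size=(HIDDEN,))
W_p = rng.normal(size=(HIDDEN, N_ACTIONS))
W_v = rng.normal(size=(HIDDEN,))

def h(observation):
    return np.tanh(observation @ W_h)

def g(state, action):
    x = np.concatenate([state, np.eye(N_ACTIONS)[action]])
    next_state = np.tanh(x @ W_g)
    return next_state, float(next_state @ W_r)

def f(state):
    return state @ W_p, float(state @ W_v)

# Planning happens entirely in latent space: after the initial h(), the
# environment is never queried again while rolling a trajectory forward.
s = h(np.array([0.1, -0.2, 0.3]))
for action in [0, 1, 0]:
    policy_logits, value = f(s)  # guides the MCTS search at this node
    s, reward = g(s, action)     # imagined transition, no env.step()
```

The search tree built by MCTS expands nodes with `g` and evaluates them with `f`, exactly the loop sketched above.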
* [ ] Windows support (Experimental / Workaround: Use the [notebook](https://github.com/werner-duvaud/muzero-general/blob/master/notebook.ipynb) in [Google Colab](https://colab.research.google.com))
### Further improvements
Here is a list of features that could be interesting to add but are not in MuZero's paper. We are open to contributions and other ideas.
* [x] [Tool to understand the learned model](https://github.com/werner-duvaud/muzero-general/blob/master/diagnose_model.py)
* [ ] Support of stochastic environments
* [ ] Batch MCTS
* [ ] Support for games with more than two players
* [ ] RL tricks (Never Give Up, Adaptive Exploration, ...)
## Demo
You can adapt the configuration of each game by editing the `MuZeroConfig` class of the respective file in the [games folder](https://github.com/werner-duvaud/muzero-general/tree/master/games).
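For example, a quick local experiment might shrink a couple of hyperparameters. The stand-in class below mirrors a few attribute names from `games/cartpole.py`; treat the exact set of attributes as an assumption, since each real config carries many more:

```python
# Hypothetical, stripped-down stand-in for a game's MuZeroConfig class;
# the real classes in the games folder define many more attributes.
class MuZeroConfig:
    def __init__(self):
        self.seed = 0                      # Seed for numpy, torch and the game
        self.batch_size = 1024             # Number of parts of games per training step
        self.training_steps = int(1000e3)  # Total number of weight updates

# Editing the class in the game file (or the instance, as here)
# tunes a run, e.g. scaling it down for a quick smoke test:
config = MuZeroConfig()
config.batch_size = 128
config.training_steps = 10_000
```

Since the attributes are plain Python values, any experiment script can override them before launching training.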
## Related work
* [EfficientZero](https://arxiv.org/abs/2111.00210) (Weirui Ye, Shaohuai Liu, Thanard Kurutach, Pieter Abbeel, Yang Gao)
* [Sampled MuZero](https://arxiv.org/abs/2104.06303) (Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Mohammadamin Barekatain, Simon Schmitt, David Silver)
**`games/atari.py`** (8 additions, 8 deletions)

```diff
@@ -1,5 +1,5 @@
 import datetime
-import os
+import pathlib

 import gym
 import numpy
@@ -15,6 +15,7 @@
 class MuZeroConfig:
     def __init__(self):
+        # fmt: off
         # More information is available here: https://github.com/werner-duvaud/muzero-general/wiki/Hyperparameter-Optimization

         self.seed = 0  # Seed for numpy, torch and the game
@@ -78,7 +79,7 @@ def __init__(self):
         ### Training
-        self.results_path = os.path.join(os.path.dirname(os.path.realpath(__file__)), "../results", os.path.basename(__file__)[:-3], datetime.datetime.now().strftime("%Y-%m-%d--%H-%M-%S"))  # Path to store the model weights and TensorBoard logs
+        self.results_path = pathlib.Path(__file__).resolve().parents[1] / "results" / pathlib.Path(__file__).stem / datetime.datetime.now().strftime("%Y-%m-%d--%H-%M-%S")  # Path to store the model weights and TensorBoard logs
         self.save_model = True  # Save the checkpoint in results_path as model.checkpoint
         self.training_steps = int(1000e3)  # Total number of training steps (ie weights update according to a batch)
         self.batch_size = 1024  # Number of parts of games to train on at each training step
@@ -114,7 +115,7 @@ def __init__(self):
         self.self_play_delay = 0  # Number of seconds to wait after each played game
         self.training_delay = 0  # Number of seconds to wait after each training step
         self.ratio = None  # Desired training steps per self played step ratio. Equivalent to a synchronous version, training can take much longer. Set it to None to disable it
```
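The `results_path` change above swaps `os.path` string joining for `pathlib`. Once the old result is normalized (its `"../results"` component collapsed), the two spellings produce the same path. The sketch below checks that on a hypothetical file location; `/repo/games/atari.py` and the timestamp are made up for illustration:

```python
import os
import pathlib

# Hypothetical location of a game file, standing in for games/atari.py.
game_file = "/repo/games/atari.py"
stamp = "2024-01-01--12-00-00"

# Old style: os.path joining with a "../" component.
old = os.path.join(os.path.dirname(game_file), "../results",
                   os.path.basename(game_file)[:-3], stamp)

# New style: pathlib, using parents[1] instead of "../" and .stem
# instead of slicing off the ".py" suffix.
new = (pathlib.Path(game_file).resolve().parents[1]
       / "results" / pathlib.Path(game_file).stem / stamp)

# Both name the same directory once the old form is normalized.
assert os.path.normpath(old) == str(new)
```

Besides avoiding the `[:-3]` suffix slicing, the `pathlib` form yields a `Path` object that downstream code can use directly (e.g. `results_path / "model.checkpoint"`).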