examples/reinforcement_learning/README.md (1 addition, 19 deletions)
@@ -56,7 +56,6 @@ The tutorial algorithms follow the same basic structure, as shown in file: [`./t
| Prioritized Experience Replay | Discrete | Pong, CartPole | [Prioritized experience replay. Schaul et al. 2015.](https://arxiv.org/abs/1511.05952) |
| Dueling DQN | Discrete | Pong, CartPole | [Dueling network architectures for deep reinforcement learning. Wang et al. 2015.](https://arxiv.org/abs/1511.06581) |
| Double DQN | Discrete | Pong, CartPole | [Deep reinforcement learning with double Q-learning. van Hasselt et al. 2016.](https://arxiv.org/abs/1509.06461) |
-| Retrace | Discrete | Pong, CartPole | [Safe and efficient off-policy reinforcement learning. Munos et al. 2016.](https://arxiv.org/pdf/1606.02647.pdf) |
| Noisy DQN | Discrete | Pong, CartPole | [Noisy networks for exploration. Fortunato et al. 2017.](https://arxiv.org/pdf/1706.10295.pdf) |
| Distributed DQN (C51) | Discrete | Pong, CartPole | [A distributional perspective on reinforcement learning. Bellemare et al. 2017.](https://arxiv.org/pdf/1707.06887.pdf) |
| **policy-based** | | | |
@@ -170,23 +169,6 @@ The tutorial algorithms follow the same basic structure, as shown in file: [`./t
```
-
-**Retrace(lambda) DQN**
-
-<u>Code</u>: `./tutorial_Retrace.py`
-
-<u>Paper</u>: [Safe and Efficient Off-Policy Reinforcement Learning](https://arxiv.org/abs/1606.02647)
-
-<u>Description:</u>
-
-```
-Retrace(lambda) is an off-policy algorithm that extends the idea of eligibility traces. It applies an importance sampling ratio truncated at 1 to data collected under several behaviour policies, which avoids the variance explosion of standard IS and leads to safe and efficient learning.
-```
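For concreteness, here is a minimal NumPy sketch of how Retrace(lambda) targets can be computed for a single off-policy trajectory, assuming the per-step quantities (taken-action Q-values, expected next-state Q-values, and target/behaviour action probabilities) are precomputed. The function name and array layout are illustrative, not the tutorial's API; the key step is the truncated ratio c_t = lambda * min(1, pi(a_t|x_t) / mu(a_t|x_t)) from the Munos et al. paper.

```python
import numpy as np

def retrace_targets(q_taken, q_next_exp, rewards, pi_probs, mu_probs,
                    gamma=0.99, lam=1.0):
    """Retrace(lambda) targets for one off-policy trajectory of length T.

    q_taken    : Q(x_t, a_t) for the actions actually taken, shape [T]
    q_next_exp : E_{a~pi} Q(x_{t+1}, a); pass 0 at a terminal step, shape [T]
    rewards    : r_t, shape [T]
    pi_probs   : pi(a_t | x_t) under the target policy, shape [T]
    mu_probs   : mu(a_t | x_t) under the behaviour policy, shape [T]
    """
    T = len(rewards)
    # Truncated importance ratios c_t = lam * min(1, pi/mu).
    # Clipping at 1 is what avoids the variance explosion of plain IS.
    c = lam * np.minimum(1.0, pi_probs / mu_probs)
    # One-step TD errors evaluated under the target policy.
    delta = rewards + gamma * q_next_exp - q_taken
    targets = np.empty(T)
    acc = 0.0  # running correction G_t = delta_t + gamma * c_{t+1} * G_{t+1}
    for t in reversed(range(T)):
        next_c = c[t + 1] if t + 1 < T else 0.0
        acc = delta[t] + gamma * next_c * acc
        targets[t] = q_taken[t] + acc  # Q^ret(x_t, a_t) = Q(x_t, a_t) + G_t
    return targets
```

The backward recursion just accumulates the discounted, ratio-weighted sum of future TD errors, so each target equals the current Q-value plus its Retrace correction.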