We can then define two neural networks: one for computing the value $V_i'$ and another for the policy.
Experiments show that this technique can be quite sample-efficient; however, it may need a relatively large number of training steps on each batch of samples. This is because the value and policy networks learn together from one equation, and the policy network can learn incorrect behavior while the value network has not yet converged.
### As a regularization method (Stable Entropy)
One problem with the Policy Annealing methods described above is that they define their own loss functions, which aren't obviously similar to the Policy Gradients method. This means that Policy Annealing, in its previous forms, is hard to apply to state-of-the-art algorithms.
To remedy this, we can define an "entropy-regularized reward" (ERR) as:
```
R_{i}' = R_{i} - T \cdot \log(p_{i})
```

where $p_{i}$ is the probability the policy assigned to the action it took at step $i$.

We can then see that our core equation is simply the sum of these new rewards:

```
V_{i}' = R_{i}' + R_{i+1}' + ... + R_{i+k}'
```
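As a minimal sketch of this bookkeeping (plain Python; the function names are ours, and we assume each regularized reward has the form $R_i - T \cdot \log(p_i)$, with $p_i$ the probability of the action actually taken):

```python
import math

def entropy_regularized_rewards(rewards, action_probs, temperature):
    """Turn raw rewards R_i into R_i' = R_i - T * log(p_i), where p_i is
    the probability the policy assigned to the action actually taken."""
    return [r - temperature * math.log(p)
            for r, p in zip(rewards, action_probs)]

def regularized_return(rewards, action_probs, temperature):
    """V' is simply the plain sum of the entropy-regularized rewards."""
    return sum(entropy_regularized_rewards(rewards, action_probs, temperature))
```

With a deterministic policy ($p_i = 1$) the log terms vanish and the regularized return equals the ordinary return.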
Thus, we can treat these entropy-regularized rewards just as we would treat standard rewards in any RL method (for both value estimation and policy updates) and still, at least theoretically, arrive at the same Boltzmann distribution. At the same time, we can use the bells and whistles that have been invented for algorithms like PPO.
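For instance, a vanilla policy-gradient (REINFORCE-style) update can consume these rewards unchanged. A sketch under that assumption, with hypothetical names, for a softmax policy over logits:

```python
import math

def softmax(logits):
    """Convert logits to a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def reinforce_step_with_err(logits, action, reward, temperature, lr):
    """One REINFORCE update where the raw reward is replaced by the
    entropy-regularized reward R' = R - T * log(p_action); everything
    else is the standard policy-gradient update."""
    probs = softmax(logits)
    err = reward - temperature * math.log(probs[action])
    # Gradient of log pi(action) w.r.t. the logits is (one_hot - probs).
    return [
        logit + lr * err * ((1.0 if i == action else 0.0) - probs[i])
        for i, logit in enumerate(logits)
    ]
```

The only change from plain REINFORCE is the one line computing `err`; the rest of the update is untouched, which is the point of the reformulation.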
**Note:** this is a work in progress and can't be run from `scripts/main.py`, however there are some quick experiments [showing much better stability properties](#policy-annealing-based-regularization-seems-to-be-more-stable) compared to other regularization methods such as Entropy Bonus.
## Comparison to Entropy Bonus
Entropy bonus regularization is a common technique in modern deep reinforcement learning. The goal is to encourage exploration by adding the policy's entropy, scaled by a temperature parameter $\alpha$ (equivalent to $T$ in our model), to the reward signal. The objective is to find a policy $\pi$ that maximizes the average of $R + \alpha \cdot H(\pi)$, where $H$ is the policy's entropy.
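For a categorical policy this objective is easy to write down explicitly; a small illustrative sketch (the function names are ours):

```python
import math

def entropy(probs):
    """Shannon entropy H(pi) of a categorical policy."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def entropy_bonus_objective(expected_reward, probs, alpha):
    """The quantity Entropy Bonus methods maximize: E[R] + alpha * H(pi)."""
    return expected_reward + alpha * entropy(probs)
```

Since the uniform distribution maximizes the bonus term, a large $\alpha$ drags the optimum toward uniform regardless of the rewards.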
### Policy Annealing based regularization seems to be more stable
Both theoretical arguments and toy simulations show that Policy Annealing based algorithms have much better stability characteristics than Entropy Bonus.
For example, let's take an RL environment that has no input but three possible actions: $A$ (low reward), $B$ (low reward), and $C$ (high reward). Each episode only lasts one action.
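This toy environment fits in a few lines; a sketch with hypothetical reward values (the exact numbers are not specified in the text):

```python
import random

# Hypothetical reward values: A and B pay little, C pays the most.
REWARDS = {"A": 0.1, "B": 0.1, "C": 1.0}
ACTIONS = list(REWARDS)

def run_episode(policy_probs, rng=random):
    """One-step episode: sample an action from the policy, return (action, reward)."""
    action = rng.choices(ACTIONS, weights=policy_probs)[0]
    return action, REWARDS[action]
```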
The simulation below shows how the action probabilities change for three methods: Policy Gradients without regularization (blue), Policy Gradients with Entropy Bonus (red), and Policy Gradients with Policy Annealing Regularization (green, proposed algorithm).
Note that Policy Annealing regularization allows the policy to become stationary at an optimal distribution, while the Entropy Bonus policy never truly settles. For Policy Annealing, at the optimal distribution, the rewards are fully compensated by the $-T \cdot \log(p)$ terms, resulting in zero advantage and a stationary policy. Entropy Bonus, by contrast, tries to raise the probabilities even of actions it did not choose (and thus has no reward data for), so there will always be unnecessary fluctuations in the output probabilities.
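The stationarity claim is easy to check numerically: at the Boltzmann distribution $p_a \propto \exp(R_a / T)$, the quantity $R_a - T \cdot \log(p_a)$ equals the same constant $T \log Z$ for every action, so no action has an advantage. A sketch with hypothetical reward values:

```python
import math

rewards = [0.1, 0.1, 1.0]   # hypothetical rewards for actions A, B, C
T = 0.5                     # temperature

# Boltzmann distribution: p_a proportional to exp(R_a / T).
weights = [math.exp(r / T) for r in rewards]
Z = sum(weights)
probs = [w / Z for w in weights]

# At this distribution every action's regularized reward is identical,
# so the advantage is zero and the policy stops moving.
regularized = [r - T * math.log(p) for r, p in zip(rewards, probs)]
```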