Commit 6f3760e

Clarify demo and README
1 parent 72de56e commit 6f3760e

2 files changed (+15, -15 lines)

README.md

Lines changed: 11 additions & 11 deletions

@@ -67,7 +67,7 @@ https://github.com/user-attachments/assets/7df2edd7-20dc-49d8-97dc-27ac8d1c4313
 Here's a much more stable setup that almost always converges:
 
 ```bash
-python scripts/main.py --value-function grouped --env-name CartPole-v1 --num-episode-batches 10 --temp-start 0.5 --temp-end 0.5
+python scripts/main.py --value-function grouped --batch-size 32 --group-size 8 --env-name CartPole-v1 --num-episode-batches 10 --temp-start 0.5 --temp-end 0.5
 ```
 
 The final solution seems more convincing:

@@ -186,11 +186,9 @@ We can then define two neural networks: one for computing the value $V_i'$ and a
 
 Experiments show that this technique can be quite sample-efficient; however, it may need a relatively large number of training steps on each batch of samples. This is because the value and policy networks learn together from one equation, but the policy network can learn wrong behavior while the value network has not yet converged.
 
-### As a regularization method (Better Entropy)
+### As a regularization method (Stable Entropy)
 
-**Note:** this is currently work in progress and can't be run from `scripts/main.py`, however there are some quick experiments [showing much better stability properties](#policy-annealing-seems-to-be-more-stable) compared to other regularization methods.
-
-One problem with the methods described above is that they define their own loss functions, which aren't obviously similar to the Policy Gradients method. This means that Policy Annealing, in its previous forms, is hard to apply to state-of-the-art algorithms like PPO.
+One problem with the Policy Annealing methods described above is that they define their own loss functions, which aren't obviously similar to the Policy Gradients method. This means that Policy Annealing, in its previous forms, is hard to apply to state-of-the-art algorithms.
 
 To remedy this, we can define an "entropy-regularized reward" (ERR) as:
 

@@ -204,23 +202,25 @@ We can then see that our core equation is simply the sum of these new rewards:
 V_{i}' = R_{i}' + R_{i+1}' + ... + R_{i+k}'
 ```
 
-Thus, we can treat these entropy-regularized rewards just as we would treat standard rewards in any RL method (for both value estimation and policy updates) and still, theoretically, arrive at the same Boltzmann distribution.
+Thus, we can treat these entropy-regularized rewards just as we would treat standard rewards in any RL method (for both value estimation and policy updates) and still, at least theoretically, arrive at the same Boltzmann distribution. At the same time, we can use the bells and whistles that have been invented for algorithms like PPO.
+
+**Note:** this is a work in progress and can't be run from `scripts/main.py`; however, there are some quick experiments [showing much better stability properties](#policy-annealing-based-regularization-seems-to-be-more-stable) compared to other regularization methods such as Entropy Bonus.
 
 ## Comparison to Entropy Bonus
 
-Entropy bonus regularization is a common technique in modern deep reinforcement learning. The goal is to encourage exploration by adding the policy's entropy, scaled by a temperature parameter $\alpha$ (equivalent to T in our model), to the reward signal. The objective is to find a policy $\pi$ that maximizes the expectation of $R + \alpha \cdot H(\pi)$, where $H$ is the policy's entropy.
+Entropy bonus regularization is a common technique in modern deep reinforcement learning. The goal is to encourage exploration by adding the policy's entropy, scaled by a temperature parameter $\alpha$ (equivalent to $T$ in our model), to the reward signal. The objective is to find a policy $\pi$ that maximizes the average of $R + \alpha \cdot H(\pi)$, where $H$ is the policy's entropy.
 
-### Policy Annealing seems to be more stable
+### Policy Annealing based regularization seems to be more stable
 
 Both theoretical arguments and toy simulations show that Policy Annealing based algorithms have much better stability characteristics than Entropy Bonus.
 
 For example, let's take an RL environment that has no input but three possible actions: $A$ (low reward), $B$ (low reward), and $C$ (high reward). Each episode only lasts one action.
 
-The simulation below shows how the action probabilities change for three methods: Policy Gradients without regularization (blue), Policy Gradients with Entropy Bonus (red), and Policy Gradients with Policy Annealing Regularization (green).
+The simulation below shows how the action probabilities change for three methods: Policy Gradients without regularization (blue), Policy Gradients with Entropy Bonus (red), and Policy Gradients with Policy Annealing Regularization (green, the proposed algorithm).
 
-https://github.com/user-attachments/assets/3c387fd8-7f02-49be-bbcb-fdeed156a2de
+https://github.com/user-attachments/assets/b6b15c2d-8df1-4901-a1c2-025c4911db5f
 
-Note that Policy Annealing regularization allows the policy to become stationary at an optimal distribution, while the Entropy Bonus policy never truly settles. For Policy Annealing, at the optimal distribution, the rewards are fully compensated by the $-T \cdot \log(p)$ terms, resulting in zero advantage and a stationary policy. For Entropy Bonus, the loss function constantly fluctuates as it tries to balance different rewards with a uniform exploration drive, leading to an unstable final policy.
+Note that Policy Annealing regularization allows the policy to become stationary at an optimal distribution, while the Entropy Bonus policy never truly settles. For Policy Annealing, at the optimal distribution, the rewards are fully compensated by the $-T \cdot \log(p)$ terms, resulting in zero advantage and a stationary policy. However, because Entropy Bonus tries to equalize the probabilities of actions it didn't even choose (and thus has no reward data for), there will always be unnecessary fluctuations in the output probabilities.
 
 ## Codebase

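To make the "entropy-regularized reward" construction from the README diff above concrete, here is a minimal sketch. The function names and the sample rewards/probabilities are illustrative only and are not part of this repository; the sketch assumes the ERR definition $R_i' = R_i - T \cdot \log(p_i)$, where $p_i$ is the probability the policy assigned to the action it took.

```python
import math

def entropy_regularized_rewards(rewards, action_probs, temperature):
    # R'_i = R_i - T * log(p_i): each step's reward gains a "surprise" bonus,
    # so low-probability actions are rewarded for exploring.
    return [r - temperature * math.log(p) for r, p in zip(rewards, action_probs)]

def returns_to_go(err):
    # V'_i = R'_i + R'_{i+1} + ...: plain suffix sums, exactly as for ordinary
    # rewards -- nothing downstream needs to know the rewards were regularized.
    out, total = [], 0.0
    for r in reversed(err):
        total += r
        out.append(total)
    return list(reversed(out))

# Illustrative episode: three steps, with the probabilities pi(a_i | s_i)
# that the policy assigned to the actions actually taken.
rewards = [1.0, 1.0, 0.0]
probs = [0.9, 0.5, 0.2]
err = entropy_regularized_rewards(rewards, probs, temperature=0.5)
values = returns_to_go(err)
```

Because `values` are ordinary suffix sums of `err`, any standard value estimator or policy-gradient update can consume them unchanged, which is the point of the ERR reformulation.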
demos/bonus_vs_better.py

Lines changed: 4 additions & 4 deletions

@@ -17,7 +17,7 @@
 TEMPERATURE = 0.5
 LEARNING_RATE = 1.0
 BATCH_SIZE = 4
-NUM_TRAINING_STEPS = 200
+NUM_TRAINING_STEPS = 300
 
 
 # --- Policy Model ---

@@ -89,8 +89,8 @@ def run_simulation() -> dict[str, np.ndarray]:
     # fmt: off
     algorithms: list[dict[str, Any]] = [
         {"name": "Policy Gradients without regularization", "loss_fn": calculate_pg_eb_loss, "history": [], "kwargs": {"use_entropy": False}},
-        {"name": "Policy Gradients + Entropy Bonus", "loss_fn": calculate_pg_eb_loss, "history": [], "kwargs": {"use_entropy": True}},
-        {"name": "Policy Gradients + Policy Annealing Regularization", "loss_fn": calculate_annealing_based_loss, "history": [], "kwargs": {"use_policy_gradient": True}},
+        {"name": "Policy Gradients + Entropy Bonus (Industry standard)", "loss_fn": calculate_pg_eb_loss, "history": [], "kwargs": {"use_entropy": True}},
+        {"name": "Policy Gradients + Stable Entropy (Proposed algorithm)", "loss_fn": calculate_annealing_based_loss, "history": [], "kwargs": {"use_policy_gradient": True}},
         # {"name": "Grouped Policy Annealing", "loss_fn": calculate_annealing_based_loss, "history": [], "kwargs": {"use_policy_gradient": False}},
     ]
     # fmt: on

@@ -207,7 +207,7 @@ def create_animation(histories: dict[str, np.ndarray]):
     ]
 
     fig.update_layout(
-        title="Policy Annealing variants vs Entropy Bonus",
+        title="Entropy Bonus vs Stable Entropy (Proposed algorithm)",
         xaxis_title="Action",
         yaxis_title="Probability",
         yaxis_range=[0, 1],

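As a quick sanity check of the stationarity claim in the README (separate from `demos/bonus_vs_better.py`; the reward values and temperature below are made up for illustration), one can verify that at the Boltzmann distribution $p(a) \propto \exp(R(a)/T)$ the entropy-regularized rewards $R(a) - T \cdot \log(p(a))$ are identical across actions, so advantages vanish and the policy has no gradient pressure to move:

```python
import math

# Toy setup from the README: one-step episodes, three actions, fixed rewards.
# A and B are low-reward, C is high-reward (values are illustrative).
REWARDS = {"A": 0.1, "B": 0.1, "C": 1.0}
T = 0.5  # temperature

# Boltzmann distribution p(a) = exp(R(a)/T) / Z -- the claimed stationary policy.
z = sum(math.exp(r / T) for r in REWARDS.values())
policy = {a: math.exp(r / T) / z for a, r in REWARDS.items()}

# Entropy-regularized reward at this policy. Algebraically,
# R(a) - T * (R(a)/T - log Z) = T * log Z for every action,
# so all actions look equally good and the policy stops moving.
err = {a: REWARDS[a] - T * math.log(policy[a]) for a in REWARDS}
```

This is exactly the "zero advantage" argument from the README: once the policy reaches the Boltzmann distribution, the $-T \cdot \log(p)$ terms compensate the rewards, whereas an Entropy Bonus objective keeps pushing probabilities toward uniform and never settles.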