Commit d97ab8a

Fix last commit
1 parent 9f47a21 commit d97ab8a

File tree: 1 file changed, +6 −9 lines


README.md

Lines changed: 6 additions & 9 deletions
@@ -2,12 +2,9 @@

This repository implements a new class of thermodynamics-inspired algorithms for reinforcement learning, together with experiments demonstrating its usability. Just as particles in nature generally prefer to occupy states/locations with lower total energy (there are more air particles down here than 100 km above the Earth), the algorithm enforces that action sequences that earn high rewards have high total probabilities. Total probability is defined here as the product of the probabilities of all actions taken: $p\_{total}=p_1 p_2 \ldots p_n$.
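As an illustrative aside (a sketch, not code from this repository), the total probability of an action sequence is usually accumulated in log space so that long episodes don't underflow:

```python
import math

def total_log_prob(action_probs):
    """log(p_total) where p_total = p_1 * p_2 * ... * p_n.

    Summing logs instead of multiplying probabilities avoids
    floating-point underflow on long action sequences.
    """
    return sum(math.log(p) for p in action_probs)

# Hypothetical per-step probabilities of the actions actually taken.
probs = [0.9, 0.8, 0.95]
p_total = math.exp(total_log_prob(probs))  # same as 0.9 * 0.8 * 0.95
print(p_total)
```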

-<figure>
-<img src="https://github.com/user-attachments/assets/9cecaade-91cd-4dbe-b2c1-19a632622d43" alt="">
-<figcaption>
-A neural network trained the Policy Annealing algorithm taking a lunar lander to a safe stop. For training, see "Usage".
-</figcaption>
-</figure>
+As a quick demo, here's a neural network trained using the Policy Annealing algorithm to take a lunar lander to a safe stop. For training, see [Usage/Lunar Lander](#lunar-lander).
+
+![Lunar lander is landing!](https://github.com/user-attachments/assets/9cecaade-91cd-4dbe-b2c1-19a632622d43)

## Installation

@@ -49,15 +46,15 @@ Below is a list of environments together with hyperparameters and visualizations

### Cart Pole

-Cart Pole environment requires the agent to learn to balance a vertical pole by accelerating the cart left or right. Reward is given for each time step as long as the pole is vertical within 15 degrees accuracy and does not veer out of the image.
+The Cart Pole environment requires the agent to learn to balance a vertical pole by accelerating the cart left or right. A reward is given for each time step as long as the pole stays within 15 degrees of vertical and the cart does not veer out of the image. There is a maximum time limit of 500 simulation steps.
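A toy sketch of that reward rule (a hypothetical helper, not the repository's code; the 2.4 cart-position bound is assumed from CartPole-v1's standard limits):

```python
import math

MAX_STEPS = 500                  # CartPole-v1 time limit
ANGLE_LIMIT = math.radians(15)   # pole must stay within 15 degrees of vertical
X_LIMIT = 2.4                    # cart-position bound before it leaves the image

def episode_return(states):
    """+1 reward per step while the pole is upright and the cart in bounds,
    truncated at the 500-step time limit."""
    total = 0
    for angle, x in states[:MAX_STEPS]:
        if abs(angle) > ANGLE_LIMIT or abs(x) > X_LIMIT:
            break
        total += 1
    return total

# Three upright steps, then the pole tips past 15 degrees.
print(episode_return([(0.0, 0.0), (0.1, 0.2), (0.2, 0.5), (0.5, 0.5)]))  # 3
```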

Here's a very aggressive setup for training the CartPole agent. The training can be extremely fast if we're lucky, but is quite unstable and often does not converge.

```bash
python scripts/main.py --value-function direct --batch-size 4 --num-episode-batches 10 --learning-rate 0.005 --num-optim-steps 300 --env-name CartPole-v1 --render full
```

-A relatively lucky run visualized:
+A relatively lucky from-scratch training run:

https://github.com/user-attachments/assets/7df2edd7-20dc-49d8-97dc-27ac8d1c4313

@@ -71,7 +68,7 @@ The final solution seems more convincing:

https://github.com/user-attachments/assets/50dae70d-556a-4041-9a1d-7fb8ff77ad46

-With validation mode on, it's even getting boring:
+With validation mode on, the solution gets so good that it's even boring:

https://github.com/user-attachments/assets/d861589b-2d54-40f7-b0c2-839c1039b4fb
