This repository implements a new class of thermodynamics-inspired algorithms for reinforcement learning, together with experiments demonstrating its usability. Just as particles in nature tend to occupy states and locations with lower total energy (there are more air particles down here than 100 km above the Earth's surface), the algorithm enforces that action sequences that earn high rewards have high total probabilities. Total probability is defined here as the product of the probabilities of all actions taken: $p_{\text{total}} = p_1 p_2 \cdots p_n$.
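As an illustrative sketch of this definition (not the repository's actual API; the function name is hypothetical), the total probability of an action sequence is just the product of the per-action probabilities, typically accumulated in log space for numerical stability:

```python
import math

def total_probability(action_probs):
    """Product of per-action probabilities: p_total = p_1 * p_2 * ... * p_n.

    Accumulated in log space so that long sequences do not underflow to zero.
    """
    log_total = sum(math.log(p) for p in action_probs)
    return math.exp(log_total)

# Even a sequence of individually likely actions has a small total probability.
print(total_probability([0.9, 0.8, 0.9, 0.7]))  # 0.9 * 0.8 * 0.9 * 0.7 = 0.4536
```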
As a quick demo, here's a neural network trained with the Policy Annealing algorithm to bring a lunar lander to a safe stop. For training, see [Usage/Lunar Lander](#lunar-lander).
## Installation
Below is a list of environments together with hyperparameters and visualizations.
### Cart Pole
The Cart Pole environment requires the agent to learn to balance a vertical pole by accelerating the cart left or right. A reward is given for each time step as long as the pole stays within 15 degrees of vertical and the cart does not veer out of the image. There is a maximum time limit of 500 simulation steps.
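The per-step reward rule described above can be sketched as a small check (a hedged sketch: the 15-degree limit comes from this README, while the cart position bound of ±2.4 and the function itself are illustrative assumptions, not the environment's source):

```python
def step_reward(pole_angle_deg, cart_x, t,
                angle_limit=15.0, x_limit=2.4, max_steps=500):
    """Return 1.0 while the episode is alive, else 0.0.

    Alive means: pole within angle_limit degrees of vertical, cart within
    x_limit of center (assumed bound), and fewer than max_steps elapsed.
    """
    alive = (abs(pole_angle_deg) <= angle_limit
             and abs(cart_x) <= x_limit
             and t < max_steps)
    return 1.0 if alive else 0.0

print(step_reward(5.0, 0.3, 120))   # 1.0: pole nearly vertical, cart in frame
print(step_reward(20.0, 0.3, 120))  # 0.0: pole tipped past 15 degrees
print(step_reward(5.0, 0.3, 500))   # 0.0: 500-step time limit reached
```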
Here's a very aggressive setup for training the CartPole agent. Training can be extremely fast if we're lucky, but it is quite unstable and often fails to converge.
```bash
python scripts/main.py --value-function direct --batch-size 4 --num-episode-batches 10 --learning-rate 0.005 --num-optim-steps 300 --env-name CartPole-v1 --render full
```