This repository implements a new class of thermodynamics-inspired algorithms for reinforcement learning, together with experiments demonstrating its usability. Just as particles in nature tend to occupy states and locations with lower total energy (there are more air particles down here than 100 km above the Earth's surface), the algorithm enforces that action sequences that earn high rewards have high total probabilities. Total probability is defined here as the product of the probabilities of all actions taken: $p_{\text{total}} = p_1 p_2 \cdots p_n$.
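As an illustrative sketch of this definition (not the repository's actual API; the function name is hypothetical), the total probability of an action sequence is just the product of the per-action probabilities, typically accumulated in log space for numerical stability:

```python
import math

def total_probability(action_probs):
    """Product of per-action probabilities: p_total = p_1 * p_2 * ... * p_n.

    Accumulated in log space so that long sequences do not underflow to zero.
    """
    log_total = sum(math.log(p) for p in action_probs)
    return math.exp(log_total)

# Even a sequence of individually likely actions has a small total probability.
print(total_probability([0.9, 0.8, 0.9, 0.7]))  # 0.9 * 0.8 * 0.9 * 0.7 = 0.4536
```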
As a quick demo, here's a neural network trained with the Policy Annealing algorithm to bring a lunar lander to a safe stop. For training, see [Usage/Lunar Lander](#lunar-lander).
## Installation
Below is a list of environments together with hyperparameters and visualizations.
### Cart Pole
The Cart Pole environment requires the agent to learn to balance a vertical pole by accelerating the cart left or right. A reward is given for each time step as long as the pole stays within 15 degrees of vertical and the cart does not veer out of the image. There is a maximum time limit of 500 simulation steps.
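The per-step reward rule described above can be sketched as a small check (a hedged sketch: the 15-degree limit comes from this README, while the cart position bound of ±2.4 and the function itself are illustrative assumptions, not the environment's source):

```python
def step_reward(pole_angle_deg, cart_x, t,
                angle_limit=15.0, x_limit=2.4, max_steps=500):
    """Return 1.0 while the episode is alive, else 0.0.

    Alive means: pole within angle_limit degrees of vertical, cart within
    x_limit of center (assumed bound), and fewer than max_steps elapsed.
    """
    alive = (abs(pole_angle_deg) <= angle_limit
             and abs(cart_x) <= x_limit
             and t < max_steps)
    return 1.0 if alive else 0.0

print(step_reward(5.0, 0.3, 120))   # 1.0: pole nearly vertical, cart in frame
print(step_reward(20.0, 0.3, 120))  # 0.0: pole tipped past 15 degrees
print(step_reward(5.0, 0.3, 500))   # 0.0: 500-step time limit reached
```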
Here's a very aggressive setup for training the CartPole agent. Training can be extremely fast if we're lucky, but it is quite unstable and often fails to converge.
```bash
python scripts/main.py --value-function direct --batch-size 4 --num-episode-batches 10 --learning-rate 0.005 --num-optim-steps 300 --env-name CartPole-v1 --render full
```