Commit 793c479

Fix linting requirements + small fixes in README
1 parent: 3be3265

File tree

2 files changed (+8, -13 lines)

.pre-commit-config.yaml

Lines changed: 1 addition & 0 deletions
@@ -57,6 +57,7 @@ repos:
       - mdformat-gfm
       - mdformat-tables
       - mdformat_frontmatter
+      - mdformat_myst
 
 # word spelling linter
 - repo: https://github.com/codespell-project/codespell

README.md

Lines changed: 7 additions & 13 deletions
@@ -73,32 +73,26 @@ p \propto e^{-E/kT}
 where
 
 - $p$ is the probability of the particle being at a certain location
-- $e \\approx 2.718$
+- $e \approx 2.718$
 - $E$ is the energy of the particle due to being at that location
 - $k$ is the Boltzmann constant
 - $T$ is the absolute temperature of the system
-- $\\propto$ signifies that $p$ varies proportionally to the right hand side
+- $\propto$ signifies that $p$ varies proportionally to the right hand side
 
 Essentially, particles are always more concentrated to lower energy states/locations, with the probability distribution depending on the temperature.
 
-In a deep reinforcement learning setting, we want the neural network to have a high probability of emitting actions that lead to high rewards and a low probability of actions that lead to low rewards. If we substitute negative energy with reward $R = -E$, and set Boltzmann constant $k=1$ ($k$ is just some physical constant which we don't need in the algorithm), we get:
+In a deep reinforcement learning setting, we want the neural network to have a high probability of emitting actions that lead to high rewards and a low probability of actions that lead to low rewards. If we substitute negative energy with reward $R = -E$ (high reward = high likelihood = low energy), and set Boltzmann constant $k=1$ ($k$ is just some physical constant which we don't need in the algorithm), we get:
 
 ```math
-p \propto e^{R/T} \Rightarrow\\
-```
-
-Or, in other words:
-
-```math
-p = c\ e^{R/T}
+p \propto e^{R/T} \Rightarrow p = c\ e^{R/T}
 ```
 
 where $c$ is a constant for a particular distribution.
 
-Now let's keep in mind that the total probability of a trajectory is the product of the probabilities of all actions taken at all steps:
+Now let's keep in mind that the total probability of a trajectory is the product of the probabilities of all actions taken on all steps:
 
 ```math
-p = p_1 p_2 ... p_n\\
+p = p_1 p_2 ... p_n
 ```
 
 And the reward is the sum of all rewards at all steps:
@@ -185,4 +179,4 @@ The following coding conventions are used:
 
 ## Future
 
-Current evidence does not support policy annealing to be state of the art. However, it does
+Current experiments are promising, but as of now, there's no evidence for anything state of the art.
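The README text edited above describes a Boltzmann distribution $p = c\,e^{R/T}$ over rewards and a trajectory probability that is the product of per-step action probabilities. A minimal sketch of both ideas (function and variable names are illustrative, not taken from the repository's code):

```python
import math

def boltzmann_probs(rewards, temperature):
    """Boltzmann/softmax distribution over per-action rewards.

    Implements p_i = c * exp(R_i / T), where the constant c simply
    normalizes the probabilities to sum to 1. Rewards are shifted by
    their maximum for numerical stability; the shift cancels out in
    the normalization.
    """
    m = max(rewards)
    weights = [math.exp((r - m) / temperature) for r in rewards]
    c = 1.0 / sum(weights)
    return [c * w for w in weights]

def trajectory_prob(step_probs):
    """Total trajectory probability: p = p_1 p_2 ... p_n."""
    p = 1.0
    for p_i in step_probs:
        p *= p_i
    return p

# Lower temperature concentrates probability on the highest-reward action;
# higher temperature flattens the distribution toward uniform.
probs_cold = boltzmann_probs([1.0, 2.0, 3.0], temperature=0.1)
probs_hot = boltzmann_probs([1.0, 2.0, 3.0], temperature=10.0)
```

Annealing in this picture means gradually lowering `temperature`, moving the policy from exploratory (near-uniform) toward greedy (mass on the highest-reward action).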
