README.md
```math
p \propto e^{-E/kT}
```

where
- $p$ is the probability of the particle being at a certain location
- $e \approx 2.718$
- $E$ is the energy of the particle due to being at that location
- $k$ is the Boltzmann constant
- $T$ is the absolute temperature of the system
- $\propto$ signifies that $p$ varies proportionally to the right hand side
Essentially, particles are always more concentrated in lower-energy states/locations, with the probability distribution depending on the temperature.
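As a quick numerical illustration of this relation (the energy levels and temperatures below are made-up values, not from the text):

```python
import math

def boltzmann_probs(energies, T, k=1.0):
    """Normalized Boltzmann probabilities, p_i proportional to e^{-E_i / (k*T)}."""
    weights = [math.exp(-E / (k * T)) for E in energies]
    total = sum(weights)
    return [w / total for w in weights]

energies = [0.0, 1.0, 2.0]  # lower energy -> higher probability
cold = boltzmann_probs(energies, T=0.5)   # concentrates on the lowest energy
hot = boltzmann_probs(energies, T=10.0)   # spreads toward uniform
```

At low temperature the distribution collapses onto the lowest-energy state; at high temperature it approaches uniform, which is exactly the behavior described above.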

In a deep reinforcement learning setting, we want the neural network to assign a high probability to actions that lead to high rewards and a low probability to actions that lead to low rewards. If we substitute reward for negative energy, $R = -E$ (high reward = high likelihood = low energy), and set the Boltzmann constant $k = 1$ ($k$ is just a physical constant that the algorithm doesn't need), we get:

```math
p \propto e^{R/T} \Rightarrow p = c\ e^{R/T}
```

where $c$ is a constant for a particular distribution.
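The substitution can be checked numerically. In the sketch below (made-up rewards; `normalized` is a hypothetical helper), computing $p = c\ e^{R/T}$ directly and via the physics form $p = c\ e^{-E/kT}$ with $E = -R$ and $k = 1$ yields the same distribution:

```python
import math

def normalized(weights):
    """Divide by the partition sum Z, i.e. multiply by the constant c = 1/Z."""
    Z = sum(weights)
    return [w / Z for w in weights]

# Hypothetical per-action rewards and temperature.
rewards = [1.0, 2.0, 0.5]
T = 1.5

# p = c * e^{R/T}
p_reward = normalized([math.exp(R / T) for R in rewards])

# Same distribution from p = c * e^{-E/kT} with E = -R, k = 1.
p_energy = normalized([math.exp(-(-R) / (1.0 * T)) for R in rewards])

assert all(abs(a - b) < 1e-12 for a, b in zip(p_reward, p_energy))
```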

Now let's keep in mind that the total probability of a trajectory is the product of the probabilities of the actions taken at each step:

```math
p = p_1 p_2 ... p_n
```
And the reward is the sum of all rewards at all steps:
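These two identities can be sketched with made-up per-step numbers (the probabilities and rewards below are illustrative, not from the text):

```python
import math

# Hypothetical per-step action probabilities and rewards for one trajectory.
step_probs = [0.9, 0.5, 0.8]
step_rewards = [1.0, -0.5, 2.0]

# Trajectory probability: product of per-step action probabilities.
p = math.prod(step_probs)  # ~0.36

# Trajectory reward: sum of per-step rewards.
R = sum(step_rewards)  # 2.5

# In log space the product becomes a sum, matching the additive reward.
log_p = sum(math.log(q) for q in step_probs)
assert abs(math.exp(log_p) - p) < 1e-12
```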
## Future

Current experiments are promising, but as of now there is no evidence that policy annealing is state of the art.