@@ -55,10 +55,9 @@ \subsection{Bellman Equation}%
 The Bellman Equation offers a way to describe the utility of each state in an \ac{MDP}. For this, it defines the
 utility of a state as the reward for the current state plus the expected sum of all future rewards discounted by $\gamma$.

-\[
+\begin{equation}
 U(s) = R(s) + \gamma \max_{a\in\mathcal{A}(s)} \sum_{s'}{P(s' \mid s,a)U(s')}
-% TODO numbers on equation?
-\]
+\end{equation}

 In the above equation, the \emph{max} operation selects the optimal action among all possible actions. The
 Bellman equation explicitly targets \emph{discrete} state spaces. If the state transition graph is a cyclic graph
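For intuition only, the right-hand side of the Bellman equation can be evaluated directly once $P$, $R$ and $\gamma$ are known. The following sketch does so for a tiny, made-up two-state MDP; the transition probabilities, rewards and $\gamma = 0.9$ are invented for illustration and do not come from the text:

```python
# Hypothetical two-state MDP: P[s][a] maps successor states to probabilities,
# R gives the per-state reward. All numbers are made up for illustration.
P = {
    "s0": {"stay": {"s0": 0.8, "s1": 0.2}, "go": {"s0": 0.1, "s1": 0.9}},
    "s1": {"stay": {"s1": 1.0},            "go": {"s0": 0.5, "s1": 0.5}},
}
R = {"s0": 0.0, "s1": 1.0}
gamma = 0.9

def bellman_rhs(s, U):
    """Right-hand side of the Bellman equation for state s, given the current
    utility estimates U: R(s) + gamma * max_a sum_s' P(s'|s,a) U(s')."""
    return R[s] + gamma * max(
        sum(p * U[s2] for s2, p in P[s][a].items()) for a in P[s]
    )

U = {"s0": 0.0, "s1": 0.0}   # an arbitrary utility estimate
print(bellman_rhs("s0", U))  # reward of s0 plus discounted best expected successor utility
```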
@@ -96,9 +95,9 @@ \subsection{Value and Policy Iteration}%
 assuming both the transition function $P(s' \mid s,a)\ \forall s \in \mathcal{S}$ and the reward function $R(s)$ are
 known to the agent.
 In the algorithm, the utility of each state is updated based on the \emph{Bellman update} rule:
-\[
+\begin{equation}
 U_{i+1}(s) \gets R(s) + \gamma \max_{a \in \mathcal{A}(s)} \sum_{s'}{P(s' \mid s,a) U_i(s')}
-\]
+\end{equation}
 This needs to be performed for \emph{each} state during \emph{each} iteration. It is clear how quickly this becomes
 intractable, especially when $\gamma$ is reasonably close to 1, meaning that long-term rewards are also taken into
 consideration.
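As an illustrative sketch (not from the text), the Bellman update can be turned into value iteration by sweeping over every state in every iteration until the utilities stop changing; `P` and `R` are assumed to be dictionaries in the same layout as in the previous sketch:

```python
def value_iteration(P, R, gamma=0.9, eps=1e-6):
    """Synchronous value iteration: apply the Bellman update to *each* state
    in *each* sweep until the largest change falls below eps.
    Assumes the model (P, R) is fully known, as stated above."""
    U = {s: 0.0 for s in P}
    while True:
        U_next = {
            s: R[s] + gamma * max(
                sum(p * U[s2] for s2, p in P[s][a].items()) for a in P[s]
            )
            for s in P
        }
        delta = max(abs(U_next[s] - U[s]) for s in P)
        U = U_next
        if delta < eps:
            return U
```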
@@ -133,10 +132,9 @@ \subsection{Temporal Difference Learning}%
 is called a trial
 \footnote{in newer \ac{RL} literature this is also called a \emph{trajectory} \citep{proximalpolicyopt, heess2017emergence}}.
 The update rule for the utility of each state is as follows:
-\[
+\begin{equation}
 U^\pi(s) \gets U^\pi(s) + \alpha (R(s) + \gamma U^\pi(s') - U^\pi(s))
-\]
-
+\end{equation}
 where $\alpha$ is the learning rate and $U^\pi$ the utility under execution of $\pi(s)$ in state $s$. This only
 updates the utilities based on the observed transitions, so if the unknown transition function sometimes leads to
 extremely negative rewards through rare transitions, this is unlikely to be captured. However, with sufficiently many
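A minimal sketch of this update rule as code, assuming the agent only sees observed transitions $(s, r, s')$ while executing its policy; the function name and the default values of $\alpha$ and $\gamma$ are illustrative:

```python
def td_update(U, s, r, s_next, alpha=0.1, gamma=0.9):
    """One temporal-difference update for an observed transition s -> s_next
    with reward r: U(s) <- U(s) + alpha * (R(s) + gamma * U(s') - U(s))."""
    U.setdefault(s, 0.0)
    U.setdefault(s_next, 0.0)
    U[s] += alpha * (r + gamma * U[s_next] - U[s])

# Applying the rule to the transitions of one trial (trajectory):
U = {}
for s, r, s_next in [("s0", 0.0, "s1"), ("s1", 1.0, "s0")]:
    td_update(U, s, r, s_next)
```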
@@ -148,9 +146,9 @@ \subsection{Exploration}%

 The above learning approach has one weakness: it is only based on observed utilities. If $\pi$ follows the pattern of
 always choosing the action that leads to the highest expected $U_{i+1}$, i.e.
-\[
+\begin{equation}
 \pi(s) = \mathop{\mathrm{arg\,max}}_{a \in \mathcal{A}(s)} \sum_{s'}{P(s' \mid s, a)U(s')}
-\]
+\end{equation}
 then it will never explore possible alternatives and will very quickly get stuck on a rigid action
 pattern mapping each state to a resulting action. To avoid this, the concept of \emph{exploration} has been introduced.
 There are many approaches to encourage exploration. The simplest is to define a factor $\epsilon$ which defines the
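The hunk cuts the sentence off here, but the factor $\epsilon$ is presumably the probability of taking a random action instead of the greedy one ($\epsilon$-greedy selection). A small sketch under that assumption; `expected_value` stands for whatever estimate the agent currently uses, e.g. $\sum_{s'} P(s' \mid s,a)U(s')$:

```python
import random

def epsilon_greedy(s, actions, expected_value, epsilon=0.1):
    """With probability epsilon explore (pick a uniformly random action);
    otherwise exploit the action with the highest current estimate for s."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: expected_value(s, a))
```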
@@ -184,18 +182,18 @@ \subsection{Q Learning}%
 (i.e. learn what a good policy is), this becomes problematic if the transition function is not known. An alternative
 model is called \emph{Q-Learning}, which is a form of Temporal Difference Learning. It learns an action-utility value
 instead of simply the state value. The relationship between this \emph{Q-Value} and the former value of a state is simply
-\[
+\begin{equation}
 U(s) = \max_{a}Q(s,a)
-\]
+\end{equation}
 so the value of a state is that of its highest Q-Value. The benefit of this approach is that it does not require a model
 of how the world works; it is therefore called a \emph{model-free} method. The update rule for the Q-Values is simply
 the Bellman equation with $U(s)$ and $U(s')$ replaced with $Q(s,a)$ and $Q(s',a')$ respectively.

 The update rules for the Q-Value approach are related to the Temporal Difference Learning rules but include a $\max$
 operator
-\[
+\begin{equation}
 Q(s,a) \gets Q(s,a) + \alpha (R(s) + \gamma \max_{a'}Q(s', a') - Q(s,a))
-\]
+\end{equation}
 An alternative version is obtained by removing the $\max$ operator from the above equation. This results in the
 \emph{actual} action being considered instead of the one that the policy believes to be the best. Q-Learning is
 \emph{off-policy} while the latter version, called \ac{SARSA}, is \emph{on-policy}. The distinction has a significant
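A compact sketch of the two update rules next to each other (illustrative code, not from the text); the only difference is whether the bootstrap term uses the maximising action or the action $a'$ that was actually taken next:

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Off-policy: bootstrap from the best action available in s_next."""
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (r + gamma * best_next - Q.get((s, a), 0.0))

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy (SARSA): bootstrap from the action a_next actually chosen in s_next."""
    target = r + gamma * Q.get((s_next, a_next), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))
```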
@@ -233,9 +231,9 @@ \subsection{Policy Search and Policy Gradient Methods}%
 $\hat{A}_t$ to create an estimator for the policy gradient:


-\begin{equation*}
+\begin{equation}
 \hat{g} = \hat{\mathbb{E}}_{t} \left[ \nabla_{\theta}\log \pi_{\theta}(a_{t} \mid s_{t})\hat{A}_{t} \right]
-\end{equation*}
+\end{equation}

 where $\hat{A}_t$ describes the advantage of taking one action over another in a given state. It can therefore be
 described as an \emph{actor-critic architecture}, because $A(a_t, s_t) = Q(a_t,s_t) - V(s_t)$, meaning that the
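For a concrete (if simplified) picture, the estimator $\hat{g}$ can be computed in closed form for a tabular softmax policy, where $\partial \log \pi_\theta(a \mid s) / \partial \theta_{s,a'} = \mathbb{1}[a' = a] - \pi_\theta(a' \mid s)$. The following sketch assumes that representation and that advantage estimates $\hat{A}_t$ are already available; none of the names come from the text:

```python
import math

def softmax_probs(theta, s, actions):
    """pi_theta(. | s) for a tabular softmax policy with logits theta[(s, a)]."""
    logits = [theta.get((s, a), 0.0) for a in actions]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return {a: e / z for a, e in zip(actions, exps)}

def policy_gradient_estimate(theta, trajectory, actions):
    """g_hat = mean_t [ grad_theta log pi_theta(a_t | s_t) * A_hat_t ],
    where trajectory is a list of (s_t, a_t, advantage_t) tuples."""
    g = {}
    for s, a, adv in trajectory:
        probs = softmax_probs(theta, s, actions)
        for a2 in actions:
            grad_log = (1.0 if a2 == a else 0.0) - probs[a2]
            g[(s, a2)] = g.get((s, a2), 0.0) + grad_log * adv
    return {k: v / len(trajectory) for k, v in g.items()}
```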