|
1 | | -\ac {RL} is an interesting intersection between supervised and unsupervised learning concepts. On the one hand it does not require large amounts of labelled data to generate successful systems. But it does require some form of feedback and it uses the feedback of an \emph{environment} received by an acting \emph{agent}. |
2 | | -The general principle of \ac {RL} therefore includes an agent and the environment in which it performs actions. The function that determines the action $a$ taken by the agent in a given state $s$ is called its policy, short $\pi$. The environment then reacts to the action of the agent by returning a new state $s'$ which is evaluated and a corresponding reward $r$ is given to the agent. |
3 | | -%TODO russel |
4 | | -%\citep[](russell2016artificial) |
| 1 | +The previous chapters have introduced the concepts of \ac {SL}, \ac {NN}, backpropagation and \ac {RNN} for time-embedded
| 2 | +learning tasks. \ac {RL} can be described as an intersection between supervised and unsupervised learning concepts, and
| 3 | +Deep \ac {RL} is the use of \ac {NN}, especially those with many layers, to perform \ac {RL}.
5 | 4 |
|
6 | | -As an example, an agent active in the environment of playing Super Mario may receive rewards corresponding to the game score. It may perform all valid moves permitted by the game and the goal is to improve its score. |
| 5 | +On the one hand, \ac {RL} does not require large amounts of labelled data to generate successful systems, which is
| 6 | +beneficial for areas where such data is either expensive to acquire or difficult to clearly label. On the other hand, it
| 7 | +requires some form of feedback. Generally, \ac {RL} \emph{agents} use feedback received from an \emph{environment}. The |
| 8 | +general principle of \ac {RL} therefore includes an agent and the environment in which it performs actions. The function |
| 9 | +that determines the action $a$ taken by the agent in a given state $s$ is called its policy, usually represented by |
| 10 | +$\pi$. The environment reacts to each action of the agent by returning a new state $s'$, which is evaluated, and a
| 11 | +corresponding reward $r$ is given to the agent. The reward gives the agent information about how well it performed |
| 12 | +\citep[p.830f.]{russell2016artificial}. |
7 | 13 |
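| | +To make this interaction loop concrete, the following minimal sketch shows how an agent and an environment could
| | +interact in code. It is an illustrative assumption rather than the interface of a specific library: the hypothetical
| | +\texttt{env} object with its \texttt{reset} and \texttt{step} methods and the \texttt{policy} function stand in for the
| | +environment and the policy $\pi$ described above.
| | +
| | +\begin{verbatim}
| | +# Minimal sketch of the agent-environment loop (all names are illustrative
| | +# placeholders, not the API of a specific library).
| | +def run_episode(env, policy):
| | +    state = env.reset()                         # initial state s
| | +    total_reward = 0.0
| | +    done = False
| | +    while not done:
| | +        action = policy(state)                  # a = pi(s)
| | +        state, reward, done = env.step(action)  # environment returns s' and r
| | +        total_reward += reward                  # collect the rewards r
| | +    return total_reward
| | +\end{verbatim}
| | +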
|
8 | | -To train an agent, the task is usually performed several times and the environment is reset after each iteration to allow for a new learning step. |
| 14 | +This chapter will first introduce the concept of a \ac {MDP}, then present different concepts of \ac {RL} agents,
| 15 | +describe approaches that encourage agents to explore their options, and finally describe how \ac {NN} can be used to create
| 16 | +state-of-the-art agents that can solve complex tasks. The majority of the chapter is based on
| 17 | +chapters 17 and 21 of \citet{russell2016artificial} unless otherwise marked.
| 18 | + |
| 19 | +\subsection{Markovian Decision Processes}% |
| 20 | +\label{ssub:markovian_decision_processes} |
| 21 | + |
| 22 | +A common model describing the conceptual process of states and actions followed by new states and new actions of an |
| 23 | +agent and its environment is called a \acf {MDP}. In fact, \ac {RL} is an approach for solving such \ac {MDP} problems |
| 24 | +optimally.%
| 25 | +\footnote{Although \ac {RL} can also be applied to non-sequential decision problems, the field has largely focused on
| 26 | +sequential problems.}
| 27 | + |
| 28 | +An \ac {MDP} is usually defined by the following components (a small example is sketched after the list):
| 29 | + |
| 30 | +\begin{itemize} |
| 31 | + \item $\mathcal{A}$: Finite set of allowed actions |
| 32 | + \item $\mathcal{S}$: Finite set of states |
| 33 | +    \item $P(s' \mid s,a)\ \forall s \in \mathcal{S}, a \in \mathcal{A}$: Probability of transitioning from state
| 34 | + $s$ to state $s'$ when action $a$ is taken |
| 35 | + \item $\gamma$: Discount factor for each time step, discounting future rewards to allow for long-term and |
| 36 | + short-term focus |
| 37 | + \item $R(s)$: Reward function that defines the reward received for transitioning into state $s$ |
| 38 | +\end{itemize} |
| 39 | + |
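| | +As a small, purely illustrative example (the numbers are chosen arbitrarily and are not taken from the later
| | +application), consider an \ac {MDP} with $\mathcal{S} = \{s_1, s_2\}$, $\mathcal{A} = \{\mathit{stay}, \mathit{move}\}$,
| | +$\gamma = 0.9$ and rewards $R(s_1) = 0$, $R(s_2) = 1$. The transition model could state that
| | +$P(s_2 \mid s_1, \mathit{move}) = 0.8$ and $P(s_1 \mid s_1, \mathit{move}) = 0.2$, while
| | +$P(s_1 \mid s_1, \mathit{stay}) = 1$. An agent in $s_1$ that wants to reach the reward in $s_2$ should therefore prefer
| | +the action $\mathit{move}$ despite its uncertain outcome.
| | +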
| 40 | +To solve such a problem, an agent needs to be equipped with a policy $\pi$ that assigns a corresponding action to each
| 41 | +of the states. Policies can further be distinguished into \emph{stationary} and \emph{nonstationary}
| 42 | +policies. The former type refers to policies that recommend the same action for the same state independent of the
| 43 | +time step. The latter describes policies for finite-horizon problems, in which an
| 44 | +agent might act differently once the remaining time becomes scarce. However, an infinite-horizon \ac {MDP} can also have
| 45 | +terminal states, which conceptually mark the end of the process.
| 46 | + |
| 47 | +A more complex form of \ac {MDP} is the \ac {POMDP}, in which agents base their actions on a belief about the
| 48 | +current state. However, as the later practical application to the \ac {PowerTAC} competition can be mapped to a \ac {MDP}
| 49 | +whose transition probabilities implicitly represent the partial observability \citep{tactexurieli2016mdp}, this model will not be discussed further.
| 50 | + |
| 51 | +\subsection{Bellman Equation}% |
| 52 | +\label{ssub:bellman_equation} |
| 53 | + |
| 54 | +The Bellman Equation offers a way to describe the utility of each state in an \ac {MDP}. For this, it assumes that the |
| 55 | +utility of a state is the reward for the current state plus the expected sum of all future rewards, discounted by $\gamma$ at each time step, assuming optimal actions.
| 56 | + |
| 57 | +\[ |
| 58 | +U(s) = R(s) + \gamma \max_{a\in\mathcal{A}(s)} \sum_{s'}{P(s' \mid s,a)U(s')} |
| 59 | +%TODO numbers on equation? |
| 60 | +\] |
| 61 | + |
| 62 | +In the above equation, the \emph{max} operation selects the action with the highest expected utility. In a
| 63 | +discrete action space this is a simple maximisation over all possible actions; in a continuous action space, however, it can
| 64 | +become more complex. \ac {NN} based \ac {RL} agents simply invoke their policy network to retrieve the action which the
| 65 | +agent believes to be the one with the highest utility \citep{mnih2013playing}.
| 66 | + |
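| | +The Bellman Equation also suggests a simple iterative way of approximating the utilities: starting from arbitrary
| | +values, the right-hand side is applied repeatedly until the utilities converge, a procedure known as value iteration.
| | +The following sketch illustrates this update for the small two-state \ac {MDP} used as an example above; all
| | +transition probabilities and rewards are illustrative assumptions.
| | +
| | +\begin{verbatim}
| | +# Value iteration sketch for the illustrative two-state MDP from above
| | +# (all numbers are assumptions chosen for demonstration purposes).
| | +GAMMA = 0.9
| | +STATES = ["s1", "s2"]
| | +ACTIONS = ["stay", "move"]
| | +REWARD = {"s1": 0.0, "s2": 1.0}
| | +P = {  # P[(s, a)] maps successor states s' to probabilities
| | +    ("s1", "stay"): {"s1": 1.0},
| | +    ("s1", "move"): {"s1": 0.2, "s2": 0.8},
| | +    ("s2", "stay"): {"s2": 1.0},
| | +    ("s2", "move"): {"s2": 0.2, "s1": 0.8},
| | +}
| | +
| | +U = {s: 0.0 for s in STATES}   # arbitrary initial utilities
| | +for _ in range(100):           # repeat until approximately converged
| | +    U = {s: REWARD[s] + GAMMA * max(
| | +             sum(p * U[s2] for s2, p in P[(s, a)].items())
| | +             for a in ACTIONS)
| | +         for s in STATES}
| | +print(U)  # approximates U(s) = R(s) + gamma * max_a sum_s' P(s'|s,a) U(s')
| | +\end{verbatim}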
| 67 | + |
| 68 | +%As an example, an agent active in the environment of playing Super Mario may receive rewards corresponding to the game |
| 69 | +%score. It may perform all valid moves permitted by the game and the goal is to improve its score. |
| 70 | + |
| 71 | +%To train an agent, the task is usually performed several times and the environment is reset after each iteration to |
| 72 | +%allow for a new learning step. |
9 | 73 | %TODO add atari games reference |
10 | | -When thinking about such an agent, it becomes obvious that without some explicit incentive to explore new alternatives, it may be contempt with whatever success it achieves and then always perform the same action. To avoid this, the agent can either be forced to try new alternative actions (through forcing random actions in a certain percentage of cases) or through explicit rewards for random actions. |
| 74 | +When thinking about such an agent, it becomes obvious that without some explicit incentive to explore new alternatives,
| 75 | +it may be content with whatever success it achieves and then always perform the same action. To avoid this, the agent
| 76 | +can either be forced to try alternative actions (by taking random actions in a certain percentage of cases) or be given
| 77 | +explicit rewards for exploratory actions.
11 | 78 | %\citep() %A3C TODO |
12 | 79 |
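| | +A common way to implement the first of these two options is an $\epsilon$-greedy action selection: with a small
| | +probability $\epsilon$ the agent picks a random action, otherwise it follows its current policy. The following sketch
| | +illustrates the idea; the \texttt{policy} function and the list of actions are hypothetical placeholders.
| | +
| | +\begin{verbatim}
| | +import random
| | +
| | +# Epsilon-greedy action selection (illustrative sketch; `policy` and `actions`
| | +# are placeholders for the agent's policy and its set of allowed actions).
| | +def select_action(state, policy, actions, epsilon=0.1):
| | +    if random.random() < epsilon:    # explore: forced random action
| | +        return random.choice(actions)
| | +    return policy(state)             # exploit: follow the current policy
| | +\end{verbatim}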
|
13 | | -%TODO already described \ac {MDP}? |
14 | | -The model describing this process of subsequent states and actions is commonly modelled as a \ac {MDP}. In fact, \ac {RL} is a concept of solving such \ac {MDP} problems assuming no internal model of the environment and without the agent knowing \emph{why} it gets a reward. The concept of \ac {RL} therefore teaches the agent to perform optimally in a given state based on previous experiences. |
15 | 80 |
|
16 | | -There are several forms of learning to solve a \ac {MDP} using \ac {RL} but for the purpose of this thesis I will focus on explaining how policy search algorithms work. Their goal is to find a good policy $\pi$ given a certain state $s$. Alternatives to this approach are concepts that try to learn expected future reward for future possible states, for available actions and others. %TODO not clean paragraph... |
| 81 | +There are several forms of learning to solve a \ac {MDP} using \ac {RL}, but for the purpose of this thesis I will focus
| 82 | +on explaining how policy search algorithms work. Their goal is to directly find a good policy $\pi$ that selects an appropriate action for each state $s$.
| 83 | +Alternative approaches instead learn the expected future reward of possible successor states or of the
| 84 | +available actions in a state and derive a policy from these estimates. %TODO not clean paragraph...
17 | 85 |
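| | +In its simplest form, a policy can be searched for directly by perturbing the parameters of a parameterised policy and
| | +keeping a perturbation whenever it improves the observed return, which amounts to a basic form of hill climbing. The
| | +following sketch illustrates this idea; the \texttt{evaluate\_return} function, which would run one or more episodes
| | +with the given parameters and report the collected reward, is a hypothetical placeholder.
| | +
| | +\begin{verbatim}
| | +import random
| | +
| | +# Simplistic policy search by random perturbation (hill climbing).
| | +# `evaluate_return(theta)` is a hypothetical placeholder that runs episodes
| | +# with a policy parameterised by `theta` and returns the total reward.
| | +def hill_climb_policy_search(evaluate_return, theta, steps=1000, noise=0.1):
| | +    best_return = evaluate_return(theta)
| | +    for _ in range(steps):
| | +        candidate = [t + random.gauss(0.0, noise) for t in theta]  # perturb
| | +        candidate_return = evaluate_return(candidate)
| | +        if candidate_return > best_return:   # keep only improvements
| | +            theta, best_return = candidate, candidate_return
| | +    return theta
| | +\end{verbatim}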
|
18 | 86 |
|
19 | | -\section{Policy Search} |
20 | | -\section{Deep Reinforcement Learning} |
| 87 | +\section{Policy Search}
| | +\section{Deep Reinforcement Learning}
21 | 88 |
|
22 | 89 | %TODO paper deep \ac{RL} > mixing R.L and deep \ac{NN} |
23 | 90 | \section{Proximal Policy Optimization OR(TBD) Deep Q learning} |