
Commit 257d8e9

added some more RL chaps stuff
1 parent ba7c5ef commit 257d8e9

4 files changed: 96 additions, 13 deletions


src/bibliography.bib

Lines changed: 7 additions & 0 deletions
@@ -17,6 +17,13 @@ @book{russell2016artificial
  publisher = {Pearson Education, Limited}
}

+@article{mnih2013playing,
+  title   = {Playing {Atari} with deep reinforcement learning},
+  author  = {Mnih, Volodymyr and Kavukcuoglu, Koray and Silver, David and Graves, Alex and Antonoglou, Ioannis and Wierstra, Daan and Riedmiller, Martin},
+  journal = {arXiv preprint arXiv:1312.5602},
+  year    = {2013}
+}
+

@article{cognition1999,
  author = {Walker, S},

src/chaps/body.tex

Lines changed: 1 addition & 0 deletions
@@ -28,6 +28,7 @@ \section{Recurrent Neural Networks}%
\chapter{Reinforcement Learning}
\input{chaps/reinforcement.tex}

+%TODO still needed after paper by DeepMind? --> showed that learning from teacher helps
\chapter{Animal Cognition}
\section{Recognition}
\section{Memory}

src/chaps/implementation.tex

Lines changed: 9 additions & 1 deletion
@@ -65,7 +65,7 @@ \section{Connecting Python agents to PowerTAC}
\footnote{\url{https://github.com/powertac/broker-adapter-grpc} }.

Because the programming language is different from that of the supplied sample-broker, many of the domain objects need to be redefined and some code redeveloped. The classes in \ac {PowerTAC} which are transferred between the client and the server are all annotated so that the \ac {XML} serializer can translate between the \ac {XML} and object representations without errors. This helps to recreate similar functionality for the needed classes in the Python environment. If the project were started again today, it might have been simpler to first define a set of message types in a language such as Protocol Buffers, the underlying technology of \ac {GRPC}, but because all current systems rely on \ac {JMI} communication, it is better to manually recreate these translators. The \ac {XML} parsing libraries provided by Python can be used to parse the \ac {XML} that is received.
-\section{Paralleling environments with Kubernetes}
+\section{Parallelizing environments with Kubernetes}

\section{Agent Models}
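As an illustration of the paragraph above, here is a minimal sketch of such a manually recreated translator. It assumes Python's standard-library XML parser and a simplified, hypothetical message layout; the real PowerTAC schema and class names will differ.

    # Sketch: turning a received XML message into a Python-side domain object.
    # The <cash-balance> layout and the CashBalance class are illustrative
    # stand-ins, not the actual PowerTAC message definitions.
    import xml.etree.ElementTree as ET
    from dataclasses import dataclass

    @dataclass
    class CashBalance:
        broker: str
        balance: float

    def parse_cash_balance(xml_string: str) -> CashBalance:
        # e.g. '<cash-balance broker="me" balance="1234.5"/>'
        root = ET.fromstring(xml_string)
        return CashBalance(broker=root.attrib["broker"],
                           balance=float(root.attrib["balance"]))

    msg = parse_cash_balance('<cash-balance broker="me" balance="1234.5"/>')

One such parser per message type mirrors the annotated classes on the server side and keeps the Python broker independent of the serializer used there.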

@@ -170,4 +170,12 @@ \subsubsection{Customer demand estimation}%


\subsection{Wholesale Market}
+
+The wholesale market is modelled as an \ac {MDP}.
+
+Strictly speaking, this \ac {MDP} has an infinite state space, but for the analytical concept this is irrelevant. What
+matters is that the states are continuous and the actions are continuous (with some rounding to the nearest 0.02).
+
+Theoretically it is a nonstationary \ac {MDP}, because an episode is limited to 24 state transitions before termination (t-0).
+
\subsection{Balancing Market}
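To make the notes above concrete, here is a minimal sketch of one wholesale-market episode. The policy/env interface is hypothetical and not part of the actual broker code; it only illustrates rounding continuous actions to the nearest 0.02 and terminating after at most 24 state transitions.

    # Sketch of the wholesale-market decision process described above
    # (hypothetical interface, not the actual broker implementation).
    def round_action(value: float, step: float = 0.02) -> float:
        """Round a continuous action to the nearest multiple of `step`."""
        return round(value / step) * step

    def run_episode(policy, env, max_transitions: int = 24):
        """Roll out one episode of at most 24 state transitions (until t-0)."""
        state = env.reset()
        total_reward = 0.0
        for t in range(max_transitions):
            action = round_action(policy(state, t))  # nonstationary: policy may depend on t
            state, reward, done = env.step(action)
            total_reward += reward
            if done:
                break
        return total_reward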

src/chaps/reinforcement.tex

Lines changed: 79 additions & 12 deletions
@@ -1,23 +1,90 @@
-\ac {RL} is an interesting intersection between supervised and unsupervised learning concepts. On the one hand it does not require large amounts of labelled data to generate successful systems. But it does require some form of feedback and it uses the feedback of an \emph{environment} received by an acting \emph{agent}.
-The general principle of \ac {RL} therefore includes an agent and the environment in which it performs actions. The function that determines the action $a$ taken by the agent in a given state $s$ is called its policy, short $\pi$. The environment then reacts to the action of the agent by returning a new state $s'$ which is evaluated and a corresponding reward $r$ is given to the agent.
-%TODO russel
-%\citep[](russell2016artificial)
+The previous chapters have introduced the concepts of \ac {SL}, \ac {NN}, backpropagation and \ac {RNN} for time-embedded
+learning tasks. \ac {RL} can be described as an intersection between supervised and unsupervised learning concepts, and
+Deep \ac {RL} is the use of \ac {NN}, especially those with many layers, to perform \ac {RL}.

-As an example, an agent active in the environment of playing Super Mario may receive rewards corresponding to the game score. It may perform all valid moves permitted by the game and the goal is to improve its score.
+On the one hand, \ac {RL} does not require large amounts of labelled data to generate successful systems, which is
+beneficial for areas where such data is either expensive to acquire or difficult to label clearly. On the other hand, it
+requires some form of feedback. Generally, \ac {RL} \emph{agents} use feedback received from an \emph{environment}. The
+general principle of \ac {RL} therefore includes an agent and the environment in which it performs actions. The function
+that determines the action $a$ taken by the agent in a given state $s$ is called its policy, usually represented by
+$\pi$. The environment reacts to the actions of the agent by returning new states $s'$, which are evaluated, and a
+corresponding reward $r$ is given to the agent. The reward gives the agent information about how well it performed
+\citep[p.830f.]{russell2016artificial}.

-To train an agent, the task is usually performed several times and the environment is reset after each iteration to allow for a new learning step.
+This chapter will first introduce the concept of a \ac {MDP}, then introduce different kinds of \ac {RL} agents,
+describe approaches to encourage exploration of their options and finally describe how \ac {NN} can be used to create
+state-of-the-art agents that can solve complex tasks. The majority of the chapter is based on
+chapters 17 and 21 of \citet[]{russell2016artificial} unless otherwise marked.
+
+\subsection{Markovian Decision Processes}%
+\label{ssub:markovian_decision_processes}
+
+A common model describing the conceptual process of states and actions followed by new states and new actions of an
+agent and its environment is called a \acf {MDP}. In fact, \ac {RL} is an approach for solving such \ac {MDP} problems
+optimally%
+\footnote{Although \ac {RL} can also be applied to non-sequential decision problems, the field has largely focused on
+sequential problems.}.
+
+A \ac {MDP} is usually defined by the following components:
+
+\begin{itemize}
+    \item $\mathcal{A}$: Finite set of allowed actions
+    \item $\mathcal{S}$: Finite set of states
+    \item $P(s' \mid s,a)\ \forall s \in \mathcal{S}, a \in \mathcal{A}$: Probability of transitioning from state
+        $s$ to state $s'$ when action $a$ is taken
+    \item $\gamma$: Discount factor for each time step, discounting future rewards to allow for long-term and
+        short-term focus
+    \item $R(s)$: Reward function that defines the reward received for transitioning into state $s$
+\end{itemize}
+
+To solve such a problem, an agent needs to be equipped with a policy $\pi$ that provides a corresponding action for each
+of the states. Policies can further be distinguished into \emph{stationary} and \emph{nonstationary}
+policies. The former type refers to policies that recommend the same action for the same state independent of the
+time step. The latter describes policies for finite-horizon problems, where an
+agent might act differently once time becomes scarce. However, infinite-horizon \ac {MDP}s can also have
+terminal states, which conceptually mean that the process has ended.
+
+A more complex form of \ac {MDP} is the \ac {POMDP}, which involves agents basing their actions on a belief about the
+current state. As the later practical application to the \ac {PowerTAC} competition can be mapped to a \ac {MDP}
+in which the transition probability implicitly represents the partial observability \citep{tactexurieli2016mdp}, \ac {POMDP} will not be discussed further.
+
+\subsection{Bellman Equation}%
+\label{ssub:bellman_equation}
+
+The Bellman Equation offers a way to describe the utility of each state in an \ac {MDP}. It defines the
+utility of a state as the reward for the current state plus the expected sum of all future rewards, discounted by
+$\gamma$, assuming the agent acts optimally.
+
+\[
+    U(s) = R(s) + \gamma \max_{a\in\mathcal{A}(s)} \sum_{s'}{P(s' \mid s,a)U(s')}
+    %TODO numbers on equation?
+\]
+
+In the above equation, the \emph{max} operation selects the optimal action with regard to all possible actions. In a
+discrete action space this is a selection over all possible actions; in a continuous action space, however, it can
+become more complex. \ac {NN} based \ac {RL} agents simply invoke their policy network to retrieve the action which the
+agent believes to be the one with the highest utility \citep{mnih2013playing}.
+
+%As an example, an agent active in the environment of playing Super Mario may receive rewards corresponding to the game
+%score. It may perform all valid moves permitted by the game and the goal is to improve its score.
+
+%To train an agent, the task is usually performed several times and the environment is reset after each iteration to
+%allow for a new learning step.
%TODO add atari games reference
-When thinking about such an agent, it becomes obvious that without some explicit incentive to explore new alternatives, it may be contempt with whatever success it achieves and then always perform the same action. To avoid this, the agent can either be forced to try new alternative actions (through forcing random actions in a certain percentage of cases) or through explicit rewards for random actions.
+When thinking about such an agent, it becomes obvious that without some explicit incentive to explore new alternatives,
+it may be content with whatever success it achieves and then always perform the same action. To avoid this, the agent
+can either be forced to try new alternative actions (by forcing random actions in a certain percentage of cases) or
+be given explicit rewards for random actions.
%\citep() %A3C TODO

-%TODO already described \ac {MDP}?
-The model describing this process of subsequent states and actions is commonly modelled as a \ac {MDP}. In fact, \ac {RL} is a concept of solving such \ac {MDP} problems assuming no internal model of the environment and without the agent knowing \emph{why} it gets a reward. The concept of \ac {RL} therefore teaches the agent to perform optimally in a given state based on previous experiences.

-There are several forms of learning to solve a \ac {MDP} using \ac {RL} but for the purpose of this thesis I will focus on explaining how policy search algorithms work. Their goal is to find a good policy $\pi$ given a certain state $s$. Alternatives to this approach are concepts that try to learn expected future reward for future possible states, for available actions and others. %TODO not clean paragraph...
+There are several forms of learning to solve a \ac {MDP} using \ac {RL}, but for the purpose of this thesis I will focus
+on explaining how policy search algorithms work. Their goal is to find a good policy $\pi$ given a certain state $s$.
+Alternatives to this approach are concepts that try to learn the expected future reward of possible future states, of
+available actions, and others. %TODO not clean paragraph...


-\section{Policy Search}
-\section{Deep Reinforcement Learning}
+\section{Policy Search}
+\section{Deep Reinforcement Learning}

%TODO paper deep \ac{RL} > mixing R.L and deep \ac{NN}
\section{Proximal Policy Optimization OR(TBD) Deep Q learning}
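To illustrate the \ac {MDP} components and the Bellman equation added above: the sketch below runs plain value iteration in Python, repeatedly applying U(s) = R(s) + gamma * max_a sum_s' P(s'|s,a) * U(s'). The two-state transition table is a made-up toy example for illustration only, not taken from the thesis.

    # Value iteration: repeatedly apply the Bellman equation until the
    # utilities stop changing. The toy MDP below is made up for illustration.
    GAMMA = 0.9
    STATES = ["s0", "s1"]
    ACTIONS = ["stay", "move"]
    R = {"s0": 0.0, "s1": 1.0}                      # reward for being in a state
    P = {                                           # P[(s, a)] -> {s': probability}
        ("s0", "stay"): {"s0": 1.0},
        ("s0", "move"): {"s0": 0.2, "s1": 0.8},
        ("s1", "stay"): {"s1": 1.0},
        ("s1", "move"): {"s0": 0.8, "s1": 0.2},
    }

    U = {s: 0.0 for s in STATES}
    for _ in range(100):                            # enough iterations to converge here
        U = {s: R[s] + GAMMA * max(sum(p * U[s2] for s2, p in P[(s, a)].items())
                                   for a in ACTIONS)
             for s in STATES}

    print(U)   # approximate utilities of s0 and s1 under optimal behaviour

A deep \ac {RL} agent replaces the explicit utility table with a neural network and the enumeration over actions with the output of its policy or value network.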

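The exploration idea above (forcing random actions in a certain percentage of cases) is commonly implemented as an epsilon-greedy strategy. A minimal sketch of the agent-environment loop with epsilon-greedy action selection, again assuming a hypothetical env and q_values interface rather than any concrete library:

    import random

    EPSILON = 0.1          # fraction of steps in which a random action is forced

    def epsilon_greedy(q_values, actions, state):
        """Pick a random action with probability EPSILON, otherwise the greedy one."""
        if random.random() < EPSILON:
            return random.choice(actions)
        return max(actions, key=lambda a: q_values.get((state, a), 0.0))

    def interact(env, q_values, actions, episodes=10, max_steps=100):
        """Basic agent-environment loop: observe state, act, receive reward."""
        for _ in range(episodes):
            state = env.reset()                 # environment is reset for each episode
            for _ in range(max_steps):
                action = epsilon_greedy(q_values, actions, state)
                state, reward, done = env.step(action)
                # a learning rule (e.g. a policy-search update) would use
                # (state, action, reward) here to improve the agent
                if done:
                    break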