Commit 9e4f0f5

terrace coding session
1 parent 3e8e033 commit 9e4f0f5

2 files changed: +24, -5 lines

src/chaps/body.tex

Lines changed: 1 addition & 5 deletions
@@ -26,11 +26,7 @@ \section{Recurrent Neural Networks}%
 
 %TODO is this part of AI?
 \chapter{Reinforcement Learning}
-\section{Policy Search}
-\section{Deep Reinforcement Learning}
-
-%TODO paper deep \ac{RL} > mixing R.L and deep \ac{NN}
-\section{Proximal Policy Optimization OR(TBD) Deep Q learning}
+\input{chaps/reinforcement.tex}
 
 \chapter{Animal Cognition}
 \section{Recognition}

src/chaps/reinforcement.tex

Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
\ac{RL} is an interesting intersection between supervised and unsupervised learning concepts. On the one hand, it does not require large amounts of labelled data to produce successful systems; on the other hand, it does require some form of feedback, namely the feedback an acting \emph{agent} receives from its \emph{environment}.

The general principle of \ac{RL} therefore comprises an agent and the environment in which it performs actions. The function that determines the action $a$ taken by the agent in a given state $s$ is called its policy, denoted $\pi$. The environment then reacts to the agent's action by returning a new state $s'$, which is evaluated, and a corresponding reward $r$ is given to the agent.
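In symbols, a minimal rendering of the definitions just given (the deterministic-policy notation and the symbols for the state and action sets are my assumptions, not notation fixed by the thesis):

\[
  \pi : \mathcal{S} \rightarrow \mathcal{A}, \qquad a = \pi(s), \qquad
  (s, a) \longmapsto (s', r)
\]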
%TODO russel
%\citep[](russell2016artificial)

As an example, an agent acting in the environment of playing Super Mario may receive rewards corresponding to the game score. It may perform any valid move permitted by the game, and its goal is to improve its score.

To train an agent, the task is usually performed several times, and the environment is reset after each iteration to allow for a new learning step.
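As a concrete illustration of this interaction loop and of resetting the environment between iterations, here is a minimal Python sketch (not part of the thesis sources); the ToyEnv class, its reset()/step() interface, and the random placeholder policy are all hypothetical:

import random

class ToyEnv:
    """Hypothetical toy environment: move right from cell 0 to reach cell 4."""

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action: 0 = move left, 1 = move right
        self.state = max(0, min(4, self.state + (1 if action == 1 else -1)))
        reward = 1.0 if self.state == 4 else 0.0  # feedback from the environment
        done = self.state == 4                    # episode ends at the goal
        return self.state, reward, done

def policy(state):
    # placeholder for pi(s): here simply a random choice between the two actions
    return random.choice([0, 1])

env = ToyEnv()
for episode in range(10):      # the task is performed several times
    state = env.reset()        # the environment is reset for each new iteration
    done, ret = False, 0.0
    while not done:
        action = policy(state)                   # a = pi(s)
        state, reward, done = env.step(action)   # environment returns s' and r
        ret += reward
    print(f"episode {episode}: return {ret}")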
%TODO add atari games reference
When thinking about such an agent, it becomes obvious that without some explicit incentive to explore new alternatives, it may be content with whatever success it already achieves and then always perform the same action. To avoid this, the agent can either be forced to try alternative actions (by taking random actions in a certain percentage of cases) or be encouraged to explore through explicit rewards.
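The first of these two options, forcing random actions in a certain percentage of cases, is commonly implemented as epsilon-greedy action selection. A minimal sketch, assuming a hypothetical greedy_action helper and an exploration rate of 10%; neither is taken from the thesis:

import random

EPSILON = 0.1        # assumed fraction of actions that are forced to be random
ACTIONS = [0, 1]

def greedy_action(state):
    # hypothetical stand-in for "the action the agent currently believes is best"
    return ACTIONS[0]

def select_action(state):
    if random.random() < EPSILON:
        return random.choice(ACTIONS)   # exploration: forced random action
    return greedy_action(state)         # exploitation: act on current knowledge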
%\citep() %A3C TODO

%TODO already described \ac{MDP}?
This process of successive states and actions is commonly modelled as a \ac{MDP}. In fact, \ac{RL} can be seen as an approach to solving such \ac{MDP} problems that assumes no internal model of the environment and does not require the agent to know \emph{why} it receives a reward. \ac{RL} therefore teaches the agent to act optimally in a given state based on previous experience.
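For reference, a common textbook formalisation of a \ac{MDP} is the tuple below; the exact notation is my assumption and may differ from whatever the russell2016artificial citation above settles on:

\[
  \mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R, \gamma), \qquad
  P(s' \mid s, a), \qquad R(s, a), \qquad \gamma \in [0, 1)
\]

Here $\mathcal{S}$ and $\mathcal{A}$ are the sets of states and actions, $P$ the transition probabilities, $R$ the reward function, and $\gamma$ the discount factor; model-free \ac{RL} assumes the agent knows neither $P$ nor $R$ and learns from interaction alone.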
There are several ways of solving a \ac{MDP} with \ac{RL}, but for the purpose of this thesis I will focus on explaining how policy search algorithms work. Their goal is to directly find a good policy $\pi$, i.e.\ a good choice of action for any given state $s$. Alternatives to this approach instead try to learn the expected future reward of possible future states or of the available actions. %TODO not clean paragraph...
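To make the contrast concrete, a sketch of the two ideas in formulas (the symbols $\theta$, $J$ and $Q$ are standard choices, not notation taken from the thesis): policy search directly optimises the parameters of $\pi$, whereas the value-based alternatives learn the expected future reward of state-action pairs.

\[
  J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\sum_{t \ge 0} \gamma^{t} r_t\right],
  \qquad \theta^{*} = \arg\max_{\theta} J(\theta)
\]
\[
  Q(s, a) = \mathbb{E}\left[\, r + \gamma \max_{a'} Q(s', a') \;\middle|\; s, a \,\right]
\]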
\section{Policy Search}
\section{Deep Reinforcement Learning}

%TODO paper deep \ac{RL} > mixing R.L and deep \ac{NN}
\section{Proximal Policy Optimization OR(TBD) Deep Q learning}
