
Report on project1-Navigation

In this project, the DQN learning algorithm has been used to solve the Navigation problem. Other learning algorithms such as Double DQN, Prioritized Experience Replay, and Dueling DQN will be added later.

The report describes the learning algorithm, the hyper-parameters used, and the architecture of the neural network.

Training Code

The code is written in Python 3 with PyTorch and executed in a Jupyter Notebook.

  • Navigation.ipynb : Main Instruction file
  • dqn_agent.py : Agent and ReplayBuffer Class
  • model.py : Build QNetwork and train function
  • vanila_dqn_checkpoint.pth : Saved Model Weights

Learning Algorithm

Deep Q-Network

Q-learning is a value-based reinforcement learning algorithm that finds the optimal action-selection policy using a Q-function, Q(s,a).

Its goal is to maximize the value function Q,

which is the maximum sum of rewards rt, discounted by γ at each timestep t, achievable by a behaviour policy π=P(a|s) after making an observation (s) and taking an action (a).

The following is pseudocode of the Q-learning algorithm.

  1. Initialize Q-values Q(s,a) arbitrarily for all state-action pairs.

  2. For i=1 to num_episodes:
    Choose an action At in the current state St based on current Q-value estimates (e.g. ε-greedy).
    Take action At and observe the reward and next state, Rt+1, St+1. Update Q(St, At).
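The update step above can be sketched in plain Python as a tabular example (a hypothetical illustration with made-up hyper-parameter values, not the project's code):

```python
import random
from collections import defaultdict

ALPHA = 0.1    # step size for the Q update (hypothetical value)
GAMMA = 0.99   # discount factor
EPSILON = 0.1  # exploration rate for epsilon-greedy

ACTIONS = [0, 1, 2, 3]
Q = defaultdict(float)  # Q[(state, action)] -> value, initialised to 0

def epsilon_greedy(state):
    """Choose a random action with probability EPSILON, else the greedy one."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def q_update(state, action, reward, next_state):
    """Q(St,At) <- Q(St,At) + alpha * (Rt+1 + gamma * max_a Q(St+1,a) - Q(St,At))"""
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    td_target = reward + GAMMA * best_next
    Q[(state, action)] += ALPHA * (td_target - Q[(state, action)])
```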

Q-networks approximate the Q-function with a neural network that, given a state, outputs Q-values for each action.
Q(s, a, θ) is a neural network whose objective function is the mean-squared error in Q-values.

To find the optimum parameters θ, optimise by SGD, following the gradient ∂L(θ)/∂θ.
This naive algorithm diverges because successive states are correlated and the targets are non-stationary.
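For a single transition (s, a, r, s'), the objective is the squared TD error, L(θ) = (r + γ max_a' Q(s', a') − Q(s, a))². A minimal numeric sketch of this computation (hypothetical helper names and values, not the project's code):

```python
GAMMA = 0.99  # discount factor, as in the hyper-parameters below

def td_error(q_values, next_q_values, action, reward):
    """Return the TD error: r + gamma * max_a' Q(s',a') - Q(s,a)."""
    target = reward + GAMMA * max(next_q_values)
    return target - q_values[action]

def mse_loss(errors):
    """Mean-squared error over a minibatch of TD errors."""
    return sum(e * e for e in errors) / len(errors)
```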

DQN - Experience Replay
To deal with correlated states, the agent builds a dataset of experiences and then draws random samples from it.
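The idea can be sketched as a fixed-size buffer with uniform sampling (a simplified sketch; the ReplayBuffer in dqn_agent.py additionally converts samples to PyTorch tensors):

```python
import random
from collections import deque, namedtuple

Experience = namedtuple("Experience",
                        ["state", "action", "reward", "next_state", "done"])

class SimpleReplayBuffer:
    """Fixed-size buffer that stores experiences and samples them uniformly."""

    def __init__(self, buffer_size, batch_size):
        self.memory = deque(maxlen=buffer_size)  # oldest experiences drop off
        self.batch_size = batch_size

    def add(self, state, action, reward, next_state, done):
        self.memory.append(Experience(state, action, reward, next_state, done))

    def sample(self):
        # Uniform random sampling breaks the correlation between
        # consecutive states.
        return random.sample(self.memory, k=self.batch_size)

    def __len__(self):
        return len(self.memory)
```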

DQN - Fixed Target
The agent also keeps a separate set of fixed parameters θ⁻ for computing the targets, and updates them only with some frequency.
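With soft updates (as in this project's TAU hyper-parameter), the target parameters track the local ones slowly: θ⁻ ← τθ + (1−τ)θ⁻. A plain-list sketch of that formula (the project applies it to PyTorch parameter tensors):

```python
TAU = 1e-3  # interpolation factor, matching the TAU hyper-parameter below

def soft_update(local_params, target_params, tau=TAU):
    """Return tau * theta_local + (1 - tau) * theta_target, element-wise."""
    return [tau * l + (1.0 - tau) * t
            for l, t in zip(local_params, target_params)]
```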


Neural Network Architecture
The state space has 37 dimensions and there are 4 actions per state,
so the network has 37 input features and an output size of 4.
The number and sizes of the hidden layers are configurable in this project:
you can pass a list of hidden-layer sizes as one of the input parameters when creating an agent.
The hidden layers used in this project are [64, 32], i.e. 2 layers with 64 and 32 neurons respectively.

Number of features

  • Input layer: 37
  • Hidden layer 1: 64
  • Hidden layer 2: 32
  • Output layer : 4
QNetwork(
  (layers): ModuleList(
    (0): Linear(in_features=37, out_features=64, bias=True)
    (1): Linear(in_features=64, out_features=32, bias=True)
  )
  (output): Linear(in_features=32, out_features=4, bias=True)
)
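As a sanity check, the parameter count implied by this printout can be computed by hand (an illustrative calculation, not part of the project's code):

```python
# Layer sizes from the architecture above: 37 -> 64 -> 32 -> 4
sizes = [37, 64, 32, 4]

# Each Linear layer has in_features * out_features weights
# plus out_features biases.
params_per_layer = [n_in * n_out + n_out
                    for n_in, n_out in zip(sizes, sizes[1:])]
total_params = sum(params_per_layer)  # 2432 + 2080 + 132 = 4644
```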

Hyper-parameters

  • BUFFER_SIZE = int(1e5) # replay buffer size
  • BATCH_SIZE = 64 # minibatch size
  • GAMMA = 0.99 # discount factor
  • TAU = 1e-3 # for soft update of target parameters
  • LR = 5e-4 # learning rate
  • UPDATE_EVERY = 4 # how often to update the network

Note: the learning rate is also configurable; you can specify it when creating an agent.

Plot of Rewards

A plot of rewards per episode:

  • the plot shows the average reward over the last 100 episodes
  • it shows that this agent solves the environment in 169 episodes
Episode 100	Average Score: 2.41
Episode 200	Average Score: 8.68
Episode 269	Average Score: 13.00
Environment solved in 169 episodes!	Average Score: 13.00
Total training time 0:03:53 s

The following video shows how the trained agent collects bananas, and the final score.

https://youtu.be/GxIUse16NSs

Ideas for Future Work

This project used the vanilla DQN, focusing on understanding the algorithm and its implementation.
As future work, improved algorithms such as double DQN, dueling DQN, and prioritized experience replay can be applied, and fine-tuned hyper-parameters that improve the overall performance can be explored.