| 1 | + |
| 2 | +# Deep Reinforcement Learning Nanodegree |
| 3 | +## P2 - Continuous Control Report |
| 4 | +This report outlines my implementation for Udacity's Deep Reinforcement Learning Nanodegree's second project on the [Reacher](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Learning-Environment-Examples.md#reacher) environment. |
| 5 | + |
| 6 | + |
| 7 | + |
| 8 | +In this environment, a double-jointed arm can move to target locations. A reward of +0.1 is provided for each step that the agent's hand is in the goal location. Thus, the goal of your agent is to maintain its position at the target location for as many time steps as possible. |
| 9 | + |
| 10 | +The observation space consists of 33 variables corresponding to position, rotation, velocity, and angular velocities of the arm. Each action is a vector with four numbers, corresponding to torque applicable to two joints. Every entry in the action vector should be a number between -1 and 1. |
| 11 | + |
| 12 | +### Implementation |
| 13 | +The algorithm used is **Deep Deterministic Policy Gradients (DDPG)**. |
| 14 | + |
| 15 | +The final hyper-parameters used were as follows (n_episodes=1000, max_t=1000). |
| 16 | +```python |
| 17 | +SEED = 13 # random seed for python, numpy & torch |
| 18 | +episodes = 1000 # max episodes to run |
| 19 | +max_t = 1000 # max steps in episode |
| 20 | +solved_threshold = 30 # finish training when avg. score in 100 episodes crosses this threshold |
| 21 | + |
| 22 | +BUFFER_SIZE = int(1e6) # replay buffer size |
| 23 | +BATCH_SIZE = 128 # minibatch size |
| 24 | +GAMMA = 0.99 # discount factor |
| 25 | +TAU = 1e-3 # for soft update of target parameters |
| 26 | +LR_ACTOR = 3e-4 # learning rate of the actor |
| 27 | +LR_CRITIC = 3e-4 # learning rate of the critic |
| 28 | +WEIGHT_DECAY = 0 # L2 weight decay |
| 29 | + |
| 30 | +LEARN_EVERY = 20 # learning timestep interval |
| 31 | +LEARN_NUM = 10 # number of learning passes |
| 32 | +GRAD_CLIPPING = 1.0 # Gradient Clipping |
| 33 | + |
| 34 | +# Ornstein-Uhlenbeck noise parameters |
| 35 | +OU_SIGMA = 0.1 |
| 36 | +OU_THETA = 0.15 |
| 37 | +EPSILON = 1.0 # for epsilon in the noise process (act step) |
| 38 | +EPSILON_DECAY = 1e-6 |
| 39 | + |
| 40 | +``` |
| 41 | + |
| 42 | +#### Deep Deterministic Policy Gradient (DDPG) |
| 43 | +This algorithm is outlined in [this paper](https://arxiv.org/pdf/1509.02971.pdf), _Continuous Control with Deep Reinforcement Learning_, by researchers at Google DeepMind. In this paper, the authors present "a model-free, off-policy actor-critic algorithm using deep function approximators that can learn policies in high-dimensional, continuous action spaces." They highlight that DDPG can be viewed as an extension of Deep Q-learning for continuous tasks. |
| 44 | + |
| 45 | +#### Actor-Critic Method |
| 46 | +Actor-critic methods leverage the strengths of both policy-based and value-based methods. |
| 47 | + |
| 48 | +Using a policy-based approach, the agent (actor) learns how to act by directly estimating the optimal policy and maximizing reward through gradient ascent. Meanwhile, employing a value-based approach, the agent (critic) learns how to estimate the value (i.e., the future cumulative reward) of different state-action pairs. Actor-critic methods combine these two approaches in order to accelerate the learning process. Actor-critic agents are also more stable than value-based agents, while requiring fewer training samples than policy-based agents. |
| 49 | + |
| 50 | +You can find the actor-critic logic implemented in the file **`ddpg_agent.py`**. The actor-critic models can be found via their respective **`Actor()`** and **`Critic()`** classes in **`model.py`**. |
| 51 | + |
| 52 | +In the algorithm, local and target networks are implemented separately for both the actor and the critic. |
| 53 | + |
| 54 | +```python |
| 55 | + # Actor Network (w/ Target Network) |
| 56 | + self.actor_local = Actor(state_size, action_size, random_seed).to(device) |
| 57 | + self.actor_target = Actor(state_size, action_size, random_seed).to(device) |
| 58 | + self.actor_optimizer = optim.Adam(self.actor_local.parameters(), lr=LR_ACTOR) |
| 59 | + |
| 60 | + # Critic Network (w/ Target Network) |
| 61 | + self.critic_local = Critic(state_size, action_size, random_seed).to(device) |
| 62 | + self.critic_target = Critic(state_size, action_size, random_seed).to(device) |
| 63 | + self.critic_optimizer = optim.Adam(self.critic_local.parameters(), lr=LR_CRITIC, weight_decay=WEIGHT_DECAY) |
| 64 | +``` |
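The `TAU` hyperparameter listed earlier controls how quickly each target network tracks its local counterpart. The following is a minimal sketch of that soft update, not the code from `ddpg_agent.py`: it assumes parameters are plain lists of floats (in the agent they would be PyTorch tensors), and the function name `soft_update` is illustrative.

```python
def soft_update(local_params, target_params, tau):
    """Blend local parameters into the target parameters.

    Computes theta_target = tau * theta_local + (1 - tau) * theta_target
    element by element and returns the new target parameters.
    """
    return [tau * l + (1.0 - tau) * t for l, t in zip(local_params, target_params)]


TAU = 1e-3  # same value as in the hyperparameter list

local = [1.0, -2.0, 0.5]   # stand-in for the local network's weights
target = [0.0, 0.0, 0.0]   # stand-in for the target network's weights
target = soft_update(local, target, TAU)
```

With a small `TAU`, the target moves only a fraction of the way toward the local weights on each update, which keeps the learning targets slowly changing.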
| 65 | + |
| 66 | +#### Exploration vs Exploitation |
| 67 | +One challenge is choosing which action to take while the agent is still learning the optimal policy. Should the agent choose an action based on the rewards observed thus far? Or, should the agent try a new action in hopes of earning a higher reward? This is known as the **exploration-exploitation dilemma**. |
| 68 | + |
| 69 | +For this project, we'll use the **Ornstein-Uhlenbeck process**, as suggested in the previously mentioned [paper by Google DeepMind](https://arxiv.org/pdf/1509.02971.pdf) (see bottom of page 4). The Ornstein-Uhlenbeck process adds a certain amount of noise to the action values at each timestep. This noise is correlated to previous noise, and therefore tends to stay in the same direction for longer durations without canceling itself out. This allows the arm to maintain velocity and explore the action space with more continuity. |
| 70 | + |
| 71 | +You can find the Ornstein-Uhlenbeck process implemented in the **`OUNoise`** class in **`ddpg_agent.py`**. |
| 72 | + |
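| 73 | +In total, there are five hyperparameters related to this noise process: the three parameters of the Ornstein-Uhlenbeck process itself, plus `EPSILON` and `EPSILON_DECAY`, which scale the noise added at the act step and reduce it over time. |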
| 74 | + |
| 75 | +The Ornstein-Uhlenbeck process itself has three hyperparameters that determine the noise characteristics and magnitude: |
| 76 | +- mu: the long-running mean |
| 77 | +- theta: the speed of mean reversion |
| 78 | +- sigma: the volatility parameter |
| 79 | + |
| 80 | +The final noise parameters were set as follows: |
| 81 | + |
| 82 | +```python |
| 83 | +OU_SIGMA = 0.1 # Ornstein-Uhlenbeck noise parameter |
| 84 | +OU_THETA = 0.15 # Ornstein-Uhlenbeck noise parameter |
| 85 | +EPSILON = 1.0 # explore->exploit noise process added to act step |
| 86 | +EPSILON_DECAY = 1e-6 # decay rate for noise process |
| 87 | +``` |
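To make the roles of these parameters concrete, here is a minimal sketch of an Ornstein-Uhlenbeck noise source using only the standard library. It is not the `OUNoise` class from `ddpg_agent.py`: the value `mu=0.0`, the helper `noisy_action`, and the subtractive decay of `EPSILON` are assumptions made for illustration.

```python
import random


class OUNoise:
    """Ornstein-Uhlenbeck process: x <- x + theta * (mu - x) + sigma * N(0, 1)."""

    def __init__(self, size, seed, mu=0.0, theta=0.15, sigma=0.1):
        self.mu = [mu] * size          # long-running mean (assumed to be 0)
        self.theta = theta             # speed of mean reversion
        self.sigma = sigma             # volatility
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        """Reset the internal state to the mean."""
        self.state = list(self.mu)

    def sample(self):
        """Advance the process one step and return the new state."""
        self.state = [
            x + self.theta * (m - x) + self.sigma * self.rng.gauss(0.0, 1.0)
            for x, m in zip(self.state, self.mu)
        ]
        return list(self.state)


def noisy_action(action, noise, epsilon):
    """Add epsilon-scaled noise and clip every entry to [-1, 1]."""
    return [max(-1.0, min(1.0, a + epsilon * n)) for a, n in zip(action, noise.sample())]


OU_THETA, OU_SIGMA = 0.15, 0.1
EPSILON, EPSILON_DECAY = 1.0, 1e-6

noise = OUNoise(size=4, seed=13, theta=OU_THETA, sigma=OU_SIGMA)
action = noisy_action([0.0, 0.0, 0.0, 0.0], noise, EPSILON)
EPSILON -= EPSILON_DECAY  # assumed linear decay toward pure exploitation
```

Because each sample starts from the previous state, consecutive noise values are correlated, which is the property the paragraph above relies on.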
| 88 | + |
| 89 | +#### Experience Replay |
| 90 | +Experience replay allows the RL agent to learn from past experience. |
| 91 | + |
| 92 | +DDPG also utilizes a replay buffer to gather experiences from each agent. The replay buffer contains a collection of experience tuples with the state, action, reward, and next state `(s, a, r, s')`. Each agent samples from this buffer as part of the learning step. Experiences are sampled randomly, so that the data is uncorrelated. This prevents action values from oscillating or diverging catastrophically, since a naive algorithm could otherwise become biased by correlations between sequential experience tuples. |
| 93 | + |
| 94 | +Also, experience replay improves learning through repetition. By doing multiple passes over the data, our agents have multiple opportunities to learn from a single experience tuple. This is particularly useful for state-action pairs that occur infrequently within the environment. |
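The buffer described above can be sketched in a few lines. This is an illustrative version rather than the project's implementation: the names `ReplayBuffer` and `Experience` are assumptions, and the tuple is kept to the `(s, a, r, s')` fields named in the text.

```python
import random
from collections import deque, namedtuple

# One experience tuple (s, a, r, s')
Experience = namedtuple("Experience", ["state", "action", "reward", "next_state"])


class ReplayBuffer:
    """Fixed-size buffer that stores experience tuples and samples them uniformly."""

    def __init__(self, buffer_size, batch_size, seed):
        self.memory = deque(maxlen=buffer_size)  # oldest entries are dropped when full
        self.batch_size = batch_size
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state):
        """Store one experience tuple."""
        self.memory.append(Experience(state, action, reward, next_state))

    def sample(self):
        """Draw a random batch without replacement, breaking temporal correlation."""
        return self.rng.sample(list(self.memory), k=self.batch_size)

    def __len__(self):
        return len(self.memory)


# Small sizes for illustration; the report uses BUFFER_SIZE=int(1e6), BATCH_SIZE=128.
buffer = ReplayBuffer(buffer_size=5, batch_size=3, seed=13)
for step in range(8):
    buffer.add(state=step, action=0.0, reward=0.1, next_state=step + 1)
batch = buffer.sample()
```

Copying the deque into a list on every call keeps the sketch simple; a large buffer would likely want a structure with cheaper random access.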
| 95 | + |
| 96 | +#### Neural Network |
| 97 | +As implemented in the file **`model.py`**, both **Actor** and **Critic** (and local & target for each) consist of three (3) fully-connected (**Linear**) layers. The **input to fc1 is state_size**, while the **output of fc3 is action_size**. There are **400 and 300 hidden units** in fc1 and fc2, respectively, and **batch normalization (BatchNorm1d)** is applied to fc1. **ReLU activation is applied to fc1 and fc2**, while **tanh is applied to fc3**. |
| 98 | + |
| 99 | +##### |
| 100 | + |
| 101 | +**NOTE**: The files **`ddpg_agent.py`** and **`model.py`** were taken *almost verbatim* from the **Deep Deterministic Policy Gradients (DDPG)** Coding Exercise in **3. Policy-Based Methods, Lesson 5. Actor-Critic Methods.** Specifically, from **DDPG.ipynb** running the **'Pendulum-v0'** gym environment. |
| 102 | + |
| 103 | +##### |
| 104 | + |
| 105 | +## Plot of Rewards |
| 106 | + |
| 107 | +The best result (DDPG) was an agent that solved the environment in ***296 episodes!*** |
| 108 | + |
| 109 | + |
| 110 | + |
| 111 | +##### |
| 112 | + |
| 113 | +## Ideas for Future Work |
| 114 | +1. Do **hyper-parameter tuning** on the current DDPG model. |
| 115 | +2. Try **Trust Region Policy Optimization (TRPO)** and **Truncated Natural Policy Gradient (TNPG)**, as these two algorithms have been reported to outperform DDPG on some continuous control benchmarks. |
| 116 | +3. Try the (very!) recent **Distributed Distributional Deterministic Policy Gradients (D4PG)** algorithm as another method for adapting DDPG for continuous control. |
| 117 | +4. Try the **(Optional) Challenge: Crawl**. |
| 118 | + |