# Deep Reinforcement Learning Nanodegree
## P3 - Collaboration and Competition Report
This report outlines my implementation for the third project of Udacity's Deep Reinforcement Learning Nanodegree, which uses the [Tennis](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Learning-Environment-Examples.md#tennis) environment.

In this environment, two agents control rackets to bounce a ball over a net. If an agent hits the ball over the net, it receives a reward of +0.1. If an agent lets the ball hit the ground or hits the ball out of bounds, it receives a reward of -0.01. Thus, the goal of each agent is to keep the ball in play.

The observation space consists of 8 variables corresponding to the position and velocity of the ball and racket. Each agent receives its own local observation. Two continuous actions are available, corresponding to movement toward (or away from) the net, and jumping.

The task is episodic, and in order to solve the environment, the agents must achieve an average score of +0.5 over 100 consecutive episodes, after taking the maximum over both agents. Specifically:

- After each episode, we add up the rewards that each agent received (without discounting) to get a score for each agent. This yields 2 (potentially different) scores. We then take the maximum of these 2 scores, which gives a single score for the episode (see the sketch below).
- The environment is considered solved when the average of those scores over 100 consecutive episodes is at least **+0.5**.
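To make the scoring rule concrete, here is a minimal sketch of how the per-episode score and the 100-episode moving average could be computed. This is illustrative only; the function and variable names are not taken from the project code.

```python
from collections import deque

import numpy as np

def episode_score(rewards_per_step):
    """rewards_per_step: list of [reward_agent_1, reward_agent_2] for every step of one episode."""
    totals = np.sum(np.asarray(rewards_per_step), axis=0)  # undiscounted sum of rewards per agent
    return float(np.max(totals))                           # keep the better agent's total

scores_window = deque(maxlen=100)  # rolling window of the last 100 episode scores

# Hypothetical use inside a training loop:
#   scores_window.append(episode_score(rewards_this_episode))
#   solved = len(scores_window) == 100 and np.mean(scores_window) >= 0.5
```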

### Implementation
The algorithm used is **Deep Deterministic Policy Gradients (DDPG)**.

The final hyper-parameters used were as follows.
```python
SEED = 1                # random seed for python, numpy & torch
episodes = 1000         # max episodes to run
max_t = 1000            # max steps in episode
solved_threshold = 0.5  # finish training when avg. score in 100 episodes crosses this threshold

BUFFER_SIZE = int(1e6)  # replay buffer size
BATCH_SIZE = 512        # minibatch size
GAMMA = 0.99            # discount factor
TAU = 2e-3              # for soft update of target parameters
LR_ACTOR = 3e-4         # learning rate of the actor
LR_CRITIC = 3e-3        # learning rate of the critic
WEIGHT_DECAY = 0        # L2 weight decay

LEARN_EVERY = 20        # learning timestep interval
LEARN_NUM = 10          # number of learning passes
GRAD_CLIPPING = 0.8     # gradient clipping

# Ornstein-Uhlenbeck noise parameters
OU_SIGMA = 0.1
OU_THETA = 0.15
EPSILON = 1.0           # for epsilon in the noise process (act step)
EPSILON_DECAY = 1e-6
```

#### Deep Deterministic Policy Gradient (DDPG)
This algorithm is outlined in [this paper](https://arxiv.org/pdf/1509.02971.pdf), _Continuous Control with Deep Reinforcement Learning_, by researchers at Google DeepMind. In it, the authors present "a model-free, off-policy actor-critic algorithm using deep function approximators that can learn policies in high-dimensional, continuous action spaces." They highlight that DDPG can be viewed as an extension of Deep Q-Learning to continuous tasks.

#### Actor-Critic Method
Actor-critic methods leverage the strengths of both policy-based and value-based methods.

Using a policy-based approach, the agent (actor) learns how to act by directly estimating the optimal policy and maximizing reward through gradient ascent. Meanwhile, employing a value-based approach, the agent (critic) learns how to estimate the value (i.e., the expected future cumulative reward) of different state-action pairs. Actor-critic methods combine these two approaches to accelerate the learning process. Actor-critic agents are also more stable than value-based agents, while requiring fewer training samples than policy-based agents.

You can find the actor-critic logic implemented in the file **`ddpg_agent.py`**. The actor and critic models are defined in their respective **`Actor()`** and **`Critic()`** classes in **`model.py`**.

In the algorithm, local and target networks are implemented separately for both the actor and the critic:

```python
# Actor Network (w/ Target Network)
self.actor_local = Actor(state_size, action_size, random_seed).to(device)
self.actor_target = Actor(state_size, action_size, random_seed).to(device)
self.actor_optimizer = optim.Adam(self.actor_local.parameters(), lr=LR_ACTOR)

# Critic Network (w/ Target Network)
self.critic_local = Critic(state_size, action_size, random_seed).to(device)
self.critic_target = Critic(state_size, action_size, random_seed).to(device)
self.critic_optimizer = optim.Adam(self.critic_local.parameters(), lr=LR_CRITIC, weight_decay=WEIGHT_DECAY)
```
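
The target networks change slowly via soft updates, which stabilizes the bootstrapped critic targets. As a rough outline of one learning step (written as a method of the agent class), the update looks like the sketch below. This is not the exact contents of `ddpg_agent.py`; the argument names and the use of `GRAD_CLIPPING` on the critic's gradients are assumptions based on the hyperparameters above.

```python
import torch
import torch.nn.functional as F

def learn(self, experiences, gamma=GAMMA):
    """One DDPG update from a sampled minibatch (sketch)."""
    states, actions, rewards, next_states, dones = experiences

    # ---- update critic ----
    actions_next = self.actor_target(next_states)
    q_targets_next = self.critic_target(next_states, actions_next)
    q_targets = rewards + gamma * q_targets_next * (1 - dones)   # bootstrap; zero for terminal states
    q_expected = self.critic_local(states, actions)
    critic_loss = F.mse_loss(q_expected, q_targets)

    self.critic_optimizer.zero_grad()
    critic_loss.backward()
    torch.nn.utils.clip_grad_norm_(self.critic_local.parameters(), GRAD_CLIPPING)  # stabilize training
    self.critic_optimizer.step()

    # ---- update actor ----
    actions_pred = self.actor_local(states)
    actor_loss = -self.critic_local(states, actions_pred).mean()  # ascend the estimated Q-value

    self.actor_optimizer.zero_grad()
    actor_loss.backward()
    self.actor_optimizer.step()

    # ---- soft-update target networks toward the local networks ----
    for target, local in ((self.critic_target, self.critic_local),
                          (self.actor_target, self.actor_local)):
        for t_param, l_param in zip(target.parameters(), local.parameters()):
            t_param.data.copy_(TAU * l_param.data + (1.0 - TAU) * t_param.data)
```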

#### Exploration vs Exploitation
One challenge is choosing which action to take while the agent is still learning the optimal policy. Should the agent choose an action based on the rewards observed thus far? Or should the agent try a new action in the hope of earning a higher reward? This is known as the **exploration-exploitation dilemma**.

For this project, we'll use the **Ornstein-Uhlenbeck process**, as suggested in the previously mentioned [paper by Google DeepMind](https://arxiv.org/pdf/1509.02971.pdf) (see bottom of page 4). The Ornstein-Uhlenbeck process adds a certain amount of noise to the action values at each timestep. This noise is correlated with previous noise, and therefore tends to stay in the same direction for longer durations without canceling itself out. This allows the agent to maintain velocity and explore the action space with more continuity.

You can find the Ornstein-Uhlenbeck process implemented in the **`OUNoise`** class in **`ddpg_agent.py`**.

In total, there are five hyperparameters related to this noise process.

The Ornstein-Uhlenbeck process itself has three hyperparameters that determine the noise characteristics and magnitude:
- mu: the long-running mean
- theta: the speed of mean reversion
- sigma: the volatility parameter

The remaining two, epsilon and its decay rate, control how much of this noise is added to the actions and how that contribution shrinks over training. The final noise parameters were set as follows:

```python
OU_SIGMA = 0.1        # Ornstein-Uhlenbeck noise parameter
OU_THETA = 0.15       # Ornstein-Uhlenbeck noise parameter
EPSILON = 1.0         # explore->exploit noise process added to act step
EPSILON_DECAY = 1e-6  # decay rate for noise process
```
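
For illustration, a minimal sketch of an OU noise generator and how epsilon could scale it in the act step is shown below. The actual `OUNoise` class lives in `ddpg_agent.py`, so details such as sampling from a standard normal and clipping the action to [-1, 1] are assumptions here.

```python
import numpy as np

class OUNoise:
    """Ornstein-Uhlenbeck process: temporally correlated exploration noise (sketch)."""

    def __init__(self, size, seed, mu=0.0, theta=OU_THETA, sigma=OU_SIGMA):
        self.mu = mu * np.ones(size)
        self.theta = theta
        self.sigma = sigma
        self.rng = np.random.default_rng(seed)
        self.reset()

    def reset(self):
        """Reset the internal state to the long-running mean."""
        self.state = self.mu.copy()

    def sample(self):
        """Drift back toward mu plus random volatility; successive samples are correlated."""
        dx = self.theta * (self.mu - self.state) + self.sigma * self.rng.standard_normal(self.state.shape)
        self.state = self.state + dx
        return self.state

# Hypothetical use in the act step:
#   action = np.clip(actor_output + epsilon * noise.sample(), -1, 1)
#   epsilon = max(epsilon - EPSILON_DECAY, 0.0)
```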

#### Experience Replay
Experience replay allows the RL agent to learn from past experience.

DDPG also utilizes a replay buffer to gather experiences from each agent. The replay buffer contains a collection of experience tuples with the state, action, reward, and next state `(s, a, r, s')`. Each agent samples from this buffer as part of the learning step. Experiences are sampled randomly, so that the data is uncorrelated. This prevents action values from oscillating or diverging catastrophically, since a naive algorithm could otherwise become biased by correlations between sequential experience tuples.

Also, experience replay improves learning through repetition. By doing multiple passes over the data, our agents have multiple opportunities to learn from a single experience tuple. This is particularly useful for state-action pairs that occur infrequently within the environment.
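
As a point of reference, a bare-bones uniform-sampling replay buffer might look like the sketch below. This is not the project's implementation (which would also need to convert sampled batches to torch tensors); the class and field names are assumptions.

```python
import random
from collections import deque, namedtuple

Experience = namedtuple("Experience", ["state", "action", "reward", "next_state", "done"])

class ReplayBuffer:
    """Fixed-size buffer that stores experience tuples and samples them uniformly at random (sketch)."""

    def __init__(self, buffer_size=BUFFER_SIZE, batch_size=BATCH_SIZE, seed=SEED):
        self.memory = deque(maxlen=buffer_size)  # oldest experiences are evicted automatically
        self.batch_size = batch_size
        random.seed(seed)

    def add(self, state, action, reward, next_state, done):
        self.memory.append(Experience(state, action, reward, next_state, done))

    def sample(self):
        # uniform random sampling breaks the temporal correlation between consecutive tuples
        return random.sample(self.memory, k=self.batch_size)

    def __len__(self):
        return len(self.memory)
```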

#### Neural Network
As implemented in the file **`model.py`**, both the **Actor** and **Critic** networks (with local and target copies of each) consist of three fully-connected (**Linear**) layers. The **input to fc1 is `state_size`**, while the **output of fc3 is `action_size`**. There are **256 and 128 hidden units** in fc1 and fc2, respectively, and **batch normalization (BatchNorm1d)** is applied to fc1. **ReLU activations are applied to fc1 and fc2**, while **tanh is applied to fc3**.
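
Based on that description, the actor would look roughly like the PyTorch sketch below. The exact code is in `model.py`; whether batch norm is applied before or after the ReLU, and the critic's precise layout, are assumptions not taken from this report.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Actor(nn.Module):
    """Policy network mapping states to continuous actions (sketch of the described architecture)."""

    def __init__(self, state_size, action_size, seed, fc1_units=256, fc2_units=128):
        super().__init__()
        torch.manual_seed(seed)
        self.fc1 = nn.Linear(state_size, fc1_units)
        self.bn1 = nn.BatchNorm1d(fc1_units)      # batch normalization on the first layer's output
        self.fc2 = nn.Linear(fc1_units, fc2_units)
        self.fc3 = nn.Linear(fc2_units, action_size)

    def forward(self, state):
        x = F.relu(self.bn1(self.fc1(state)))
        x = F.relu(self.fc2(x))
        return torch.tanh(self.fc3(x))            # actions bounded in [-1, 1]
```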

#####

**NOTE**: The files **`ddpg_agent.py`** and **`model.py`** were taken *almost verbatim* from **Project 2: Continuous Control**. I used the same algorithm with a few changes and some small hyperparameter tuning. You can find the Project 2 files in my GitHub repo:
[https://github.com/aadimator/drl-nd/tree/master/p2_continuous-control](https://github.com/aadimator/drl-nd/tree/master/p2_continuous-control)

#####

## Plot of Rewards

The best result (DDPG) was an agent that solved the environment in ***549 episodes***!

#####

## Ideas for Future Work
1. Do **hyper-parameter tuning** on the current DDPG model.
2. Research and try other **Multi-Agent Reinforcement Learning (MARL)** algorithms.
3. Research and implement/apply stability improvements to the **Multi-Agent DDPG (MADDPG)** model and see if I can actually reduce the variance in the number of episodes needed to solve the environment (between training runs).
4. Try the (very!) recent **Distributed Distributional Deterministic Policy Gradients (D4PG)** algorithm as another method for adapting DDPG for continuous control.
5. Try the **(Optional) Challenge: Soccer**.