
Commit 6099c40

committed
feat: Complete P3 report
1 parent da8b6bf commit 6099c40

File tree

2 files changed: +129 -3 lines changed


p3_collab-compet/Report.md

Lines changed: 126 additions & 0 deletions
@@ -0,0 +1,126 @@
# Deep Reinforcement Learning Nanodegree
## P3 - Collaboration and Competition Report

This report outlines my implementation for Udacity's Deep Reinforcement Learning Nanodegree's third project on the [Tennis](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Learning-Environment-Examples.md#tennis) environment.

![Trained Agent](https://i.imgur.com/z901EXq.gifv)

In this environment, two agents control rackets to bounce a ball over a net. If an agent hits the ball over the net, it receives a reward of +0.1. If an agent lets a ball hit the ground or hits the ball out of bounds, it receives a reward of -0.01. Thus, the goal of each agent is to keep the ball in play.

The observation space consists of 8 variables corresponding to the position and velocity of the ball and racket. Each agent receives its own, local observation. Two continuous actions are available, corresponding to movement toward (or away from) the net, and jumping.

The task is episodic, and in order to solve the environment, your agents must get an average score of +0.5 (over 100 consecutive episodes, after taking the maximum over both agents). Specifically,

- After each episode, we add up the rewards that each agent received (without discounting), to get a score for each agent. This yields 2 (potentially different) scores. We then take the maximum of these 2 scores.
- This yields a single score for each episode.

The environment is considered solved when the average (over 100 episodes) of those scores is at least **+0.5**.

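To make the scoring concrete, here is a minimal sketch of how the per-episode score and the 100-episode moving average can be computed (variable names are illustrative, not taken from the notebook):

```python
from collections import deque

import numpy as np

scores_window = deque(maxlen=100)      # rolling window of the last 100 episode scores

# After an episode finishes, `agent_rewards` holds each agent's undiscounted
# reward sum for that episode (illustrative values shown here).
agent_rewards = np.array([0.09, 0.10])
episode_score = np.max(agent_rewards)  # single episode score: the maximum over both agents
scores_window.append(episode_score)

# The environment counts as solved once this rolling average reaches +0.5.
moving_average = np.mean(scores_window)
```
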
### Implementation

The algorithm used is **Deep Deterministic Policy Gradients (DDPG)**.

The final hyper-parameters used were as follows (n_episodes=1000, max_t=1000).

```python
SEED = 1                 # random seed for python, numpy & torch
episodes = 1000          # max episodes to run
max_t = 1000             # max steps in episode
solved_threshold = 0.5   # finish training when avg. score in 100 episodes crosses this threshold

BUFFER_SIZE = int(1e6)   # replay buffer size
BATCH_SIZE = 512         # minibatch size
GAMMA = 0.99             # discount factor
TAU = 2e-3               # for soft update of target parameters
LR_ACTOR = 3e-4          # learning rate of the actor
LR_CRITIC = 3e-3         # learning rate of the critic
WEIGHT_DECAY = 0         # L2 weight decay

LEARN_EVERY = 20         # learning timestep interval
LEARN_NUM = 10           # number of learning passes
GRAD_CLIPPING = 0.8      # gradient clipping

# Ornstein-Uhlenbeck noise parameters
OU_SIGMA = 0.1
OU_THETA = 0.15
EPSILON = 1.0            # for epsilon in the noise process (act step)
EPSILON_DECAY = 1e-6
```

#### Deep Deterministic Policy Gradient (DDPG)

This algorithm is outlined in [this paper](https://arxiv.org/pdf/1509.02971.pdf), _Continuous Control with Deep Reinforcement Learning_, by researchers at Google DeepMind. In the paper, the authors present "a model-free, off-policy actor-critic algorithm using deep function approximators that can learn policies in high-dimensional, continuous action spaces." They highlight that DDPG can be viewed as an extension of Deep Q-learning to continuous tasks.

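To make this concrete, here is a minimal sketch of the core DDPG update under the usual formulation: the critic regresses toward a bootstrapped target computed with the target networks, and the actor follows the gradient of the critic. It paraphrases the paper rather than the actual `learn()` method in `ddpg_agent.py`, whose details may differ; the `agent` attributes assumed here match the instantiation snippet shown later in this report.

```python
import torch
import torch.nn.functional as F


def ddpg_learn(agent, experiences, gamma=0.99, grad_clip=0.8):
    """One DDPG update (gamma ~ GAMMA, grad_clip ~ GRAD_CLIPPING above)."""
    states, actions, rewards, next_states, dones = experiences

    # --- Critic update: regress Q(s, a) toward r + gamma * Q'(s', mu'(s')) ---
    actions_next = agent.actor_target(next_states)
    q_targets_next = agent.critic_target(next_states, actions_next)
    q_targets = rewards + gamma * q_targets_next * (1 - dones)
    q_expected = agent.critic_local(states, actions)
    critic_loss = F.mse_loss(q_expected, q_targets)

    agent.critic_optimizer.zero_grad()
    critic_loss.backward()
    torch.nn.utils.clip_grad_norm_(agent.critic_local.parameters(), grad_clip)
    agent.critic_optimizer.step()

    # --- Actor update: maximize Q(s, mu(s)) by minimizing its negative ---
    actor_loss = -agent.critic_local(states, agent.actor_local(states)).mean()

    agent.actor_optimizer.zero_grad()
    actor_loss.backward()
    agent.actor_optimizer.step()
```
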
#### Actor-Critic Method

Actor-critic methods leverage the strengths of both policy-based and value-based methods.

Using a policy-based approach, the agent (actor) learns how to act by directly estimating the optimal policy and maximizing reward through gradient ascent. Meanwhile, employing a value-based approach, the agent (critic) learns how to estimate the value (i.e., the future cumulative reward) of different state-action pairs. Actor-critic methods combine these two approaches in order to accelerate the learning process. Actor-critic agents are also more stable than value-based agents, while requiring fewer training samples than policy-based agents.

You can find the actor-critic logic implemented in the file **`ddpg_agent.py`**. The actor-critic models can be found via their respective **`Actor()`** and **`Critic()`** classes in **`model.py`**.

In the algorithm, local and target networks are implemented separately for both the actor and the critic.

```python
# Actor Network (w/ Target Network)
self.actor_local = Actor(state_size, action_size, random_seed).to(device)
self.actor_target = Actor(state_size, action_size, random_seed).to(device)
self.actor_optimizer = optim.Adam(self.actor_local.parameters(), lr=LR_ACTOR)

# Critic Network (w/ Target Network)
self.critic_local = Critic(state_size, action_size, random_seed).to(device)
self.critic_target = Critic(state_size, action_size, random_seed).to(device)
self.critic_optimizer = optim.Adam(self.critic_local.parameters(), lr=LR_CRITIC, weight_decay=WEIGHT_DECAY)
```

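After each learning pass, the target networks are nudged toward the local networks using the `TAU` hyper-parameter above. A minimal sketch of that soft update (the helper in `ddpg_agent.py` may be named or structured slightly differently):

```python
def soft_update(local_model, target_model, tau=2e-3):
    """Soft update: theta_target = tau * theta_local + (1 - tau) * theta_target."""
    for target_param, local_param in zip(target_model.parameters(), local_model.parameters()):
        target_param.data.copy_(tau * local_param.data + (1.0 - tau) * target_param.data)
```
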
#### Exploration vs Exploitation

One challenge is choosing which action to take while the agent is still learning the optimal policy. Should the agent choose an action based on the rewards observed thus far? Or, should the agent try a new action in hopes of earning a higher reward? This is known as the **exploration-exploitation dilemma**.

For this project, we'll use the **Ornstein-Uhlenbeck process**, as suggested in the previously mentioned [paper by Google DeepMind](https://arxiv.org/pdf/1509.02971.pdf) (see bottom of page 4). The Ornstein-Uhlenbeck process adds a certain amount of noise to the action values at each timestep. This noise is correlated with previous noise, and therefore tends to stay in the same direction for longer durations without canceling itself out. This allows the agents to maintain velocity and explore the action space with more continuity.

You can find the Ornstein-Uhlenbeck process implemented in the **`OUNoise`** class in **`ddpg_agent.py`**.

In total, there are five hyperparameters related to this noise process.

The Ornstein-Uhlenbeck process itself has three hyperparameters that determine the noise characteristics and magnitude:

- mu: the long-running mean
- theta: the speed of mean reversion
- sigma: the volatility parameter

The remaining two, `EPSILON` and `EPSILON_DECAY`, scale how much of this noise is added to the actions and how that amount decays over the course of training.

The final noise parameters were set as follows:

```python
OU_SIGMA = 0.1          # Ornstein-Uhlenbeck noise parameter
OU_THETA = 0.15         # Ornstein-Uhlenbeck noise parameter
EPSILON = 1.0           # explore->exploit noise process added to act step
EPSILON_DECAY = 1e-6    # decay rate for noise process
```

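For reference, a minimal sketch of an Ornstein-Uhlenbeck noise generator using these parameters (the actual `OUNoise` class in `ddpg_agent.py` may differ in details such as its random-number source):

```python
import copy

import numpy as np


class OUNoise:
    """Ornstein-Uhlenbeck process: temporally correlated noise for exploration."""

    def __init__(self, size, seed, mu=0.0, theta=0.15, sigma=0.1):
        self.mu = mu * np.ones(size)
        self.theta = theta
        self.sigma = sigma
        self.rng = np.random.default_rng(seed)
        self.reset()

    def reset(self):
        """Reset the internal state to the long-running mean."""
        self.state = copy.copy(self.mu)

    def sample(self):
        """Update the internal state and return it as a noise sample."""
        x = self.state
        dx = self.theta * (self.mu - x) + self.sigma * self.rng.standard_normal(len(x))
        self.state = x + dx
        return self.state
```

At acting time, something like `action += EPSILON * noise.sample()` would add the exploration noise, with `EPSILON` decayed by `EPSILON_DECAY` over the course of training.
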
#### Experience Replay

Experience replay allows the RL agent to learn from past experience.

DDPG also utilizes a replay buffer to gather experiences from each agent. The replay buffer contains a collection of experience tuples with the state, action, reward, and next state `(s, a, r, s')`. Each agent samples from this buffer as part of the learning step. Experiences are sampled randomly, so that the data is uncorrelated. This prevents action values from oscillating or diverging catastrophically, since a naive algorithm could otherwise become biased by correlations between sequential experience tuples.

Also, experience replay improves learning through repetition. By doing multiple passes over the data, our agents have multiple opportunities to learn from a single experience tuple. This is particularly useful for state-action pairs that occur infrequently within the environment.

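A minimal sketch of such a buffer, using uniform random sampling over a fixed-size `deque` (the `ReplayBuffer` class in `ddpg_agent.py` likely also stacks the sampled experiences into torch tensors; that step is omitted here):

```python
import random
from collections import deque, namedtuple

Experience = namedtuple("Experience", ["state", "action", "reward", "next_state", "done"])


class ReplayBuffer:
    """Fixed-size buffer that stores experience tuples and samples them uniformly."""

    def __init__(self, buffer_size=int(1e6), batch_size=512, seed=1):
        self.memory = deque(maxlen=buffer_size)  # oldest experiences are evicted automatically
        self.batch_size = batch_size
        random.seed(seed)

    def add(self, state, action, reward, next_state, done):
        """Store a single experience tuple."""
        self.memory.append(Experience(state, action, reward, next_state, done))

    def sample(self):
        """Draw a random (decorrelated) minibatch of experiences."""
        return random.sample(self.memory, k=self.batch_size)

    def __len__(self):
        return len(self.memory)
```
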
#### Neural Network

As implemented in the file **`model.py`**, both **Actor** and **Critic** (and local & target for each) consist of three (3) fully-connected (**Linear**) layers. The **input to fc1 is state_size**, while the **output of fc3 is action_size**. There are **256 and 128 hidden units** in fc1 and fc2, respectively, and **batch normalization (BatchNorm1d)** is applied to fc1. **ReLU activation is applied to fc1 and fc2**, while **tanh is applied to fc3**.

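Based on that description, a sketch of the **Actor** network might look as follows (weight initialization and the **Critic** are omitted, and the real `model.py` may differ in such details):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Actor(nn.Module):
    """Actor (policy) network: maps a state to a continuous action in [-1, 1]."""

    def __init__(self, state_size, action_size, seed, fc1_units=256, fc2_units=128):
        super().__init__()
        torch.manual_seed(seed)
        self.fc1 = nn.Linear(state_size, fc1_units)   # input layer: state_size -> 256
        self.bn1 = nn.BatchNorm1d(fc1_units)          # batch normalization on fc1
        self.fc2 = nn.Linear(fc1_units, fc2_units)    # 256 -> 128
        self.fc3 = nn.Linear(fc2_units, action_size)  # output layer: 128 -> action_size

    def forward(self, state):
        x = F.relu(self.bn1(self.fc1(state)))  # ReLU (with batch norm) on fc1
        x = F.relu(self.fc2(x))                # ReLU on fc2
        return torch.tanh(self.fc3(x))         # tanh bounds the action values
```
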
#####  

**NOTE**: The files **`ddpg_agent.py`** and **`model.py`** were taken *almost verbatim* from **Project 2: Continuous Control**. I used the same algorithm with a few changes and minor hyperparameter tuning. You can find the Project 2 files in my GitHub repo:
[https://github.com/aadimator/drl-nd/tree/master/p2_continuous-control](https://github.com/aadimator/drl-nd/tree/master/p2_continuous-control)

#####  

## Plot of Rewards

The best result (DDPG) was an agent able to solve the environment in ***549 episodes***!

![Best Agent](https://i.imgur.com/hjuoonS.png)

#####  

## Ideas for Future Work

1. Do **hyper-parameter tuning** on the current DDPG model.
2. Research and try other **Multi-Agent Reinforcement Learning (MARL)** algorithms.
3. Research and apply stability improvements to the **Multi-Agent DDPG (MADDPG)** model, and see whether I can reduce the variance in the number of episodes needed to solve the environment (between training runs).
4. Try the (very!) recent **Distributed Distributional Deterministic Policy Gradients (D4PG)** algorithm as another method for adapting DDPG for continuous control.
5. Try the **(Optional) Challenge: Soccer**.

p3_collab-compet/Tennis.ipynb

Lines changed: 3 additions & 3 deletions
```diff
@@ -407,7 +407,7 @@
 "agent = Agent(state_size=state_size, action_size=action_size, random_seed=SEED)\n",
 "scores, avgs = ddpg()\n",
 "\n",
-"# Environment solved in 296 episodes!\tAverage Score: 30.03"
+"# Environment solved in 549 episodes!\tAverage Score: 0.53"
 ]
 },
 {
@@ -427,7 +427,7 @@
 "source": [
 "end = time.time()\n",
 "elapsed = (end - start) / 60.0 # in minutes\n",
-"print(\"\\nElapsed Time: {0:3.2f} mins.\".format(elapsed)) # 223.58 mins."
+"print(\"\\nElapsed Time: {0:3.2f} mins.\".format(elapsed)) # 6.78 mins."
 ]
 },
 {
@@ -469,7 +469,7 @@
 },
 {
 "cell_type": "code",
-"execution_count": null,
+"execution_count": 82,
 "metadata": {},
 "outputs": [],
 "source": [
```
