
Commit 9424d96

Clarification in report
1 parent eb86864 commit 9424d96

File tree

1 file changed: +1 −1 lines changed


REPORT.md

Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
# Tennis Game Report

## Learning Algorithm
- I've based my implementation on the multi-agent deep deterministic policy gradient (MADDPG) method, [see paper](https://arxiv.org/abs/1706.02275). MADDPG is an extension of the single-agent DDPG algorithm I used during the [reacher project](https://github.com/MathiasGruber/ReacherAgent-PyTorch). DDPG is an actor-critic method, which has been shown to perform well in environments with a continuous action space, which are not well handled by the DQN algorithm and its various extensions. The algorithm consists of two neural networks, the *actor* and the *critic*, where the *actor* is used to approximate the optimal deterministic policy, while the *critic* learns to evaluate the optimal action-value function based on actions from the *actor*. The idea is thus that the *actor* is used for specifying actions *a*, while the *critic* calculates a temporal difference (TD) error that criticizes the actions made by the *actor*. In the case of multiple agents, i.e. the MADDPG algorithm, the critic is given the states and actions of all agents, whereas each actor is only given information pertaining to its own agent, meaning that at inference time each agent can act independently of the others.
+ I've based my implementation on the multi-agent deep deterministic policy gradient (MADDPG) method, [see paper](https://arxiv.org/abs/1706.02275). MADDPG is an extension of the single-agent DDPG algorithm I used during the [reacher project](https://github.com/MathiasGruber/ReacherAgent-PyTorch). DDPG is an actor-critic method, which has been shown to perform well in environments with a continuous action space, which are not well handled by the DQN algorithm and its various extensions. The algorithm consists of two neural networks, the *actor* and the *critic*, where the *actor* is used to approximate the optimal deterministic policy, while the *critic* learns to evaluate the optimal action-value function based on actions from the *actor*. The idea is thus that the *actor* is used for specifying actions *a*, while the *critic* calculates a temporal difference (TD) error that criticizes the actions made by the *actor*. In the case of multiple agents, i.e. the MADDPG algorithm, each agent runs its own DDPG algorithm, with the critic being given the states and actions of all agents, whereas each actor is only given information pertaining to its own agent, meaning that at inference time each agent can act independently of the others.

Following the same convergence arguments as found in the DQN algorithm, DDPG actually employs four neural networks: a local actor, a target actor, a local critic and a target critic. In addition, we also need a replay buffer to break up temporal correlations between samples, and we use a so-called 'soft update' mechanism, in which the target networks are continuously and slowly nudged towards the parameters of the local networks. In this solution, I've also implemented prioritized experience replay for DDPG, which works in the same way as the implementation for DQN.

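The added paragraph describes the core MADDPG structure: each agent keeps its own actor and critic, the actor acts from that agent's local observation only, while the critic is trained on the joint observations and actions of all agents. Below is a minimal PyTorch sketch of that structure; it is not the repository's code, and the class names, network sizes and the Tennis-like dimensions (24 observations and 2 actions per agent) are assumptions for illustration.

```python
# Sketch only: illustrates decentralized actors with a centralized critic (MADDPG),
# not the repository's actual implementation.
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Maps a single agent's observation to a continuous action."""
    def __init__(self, obs_size, action_size, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_size, hidden), nn.ReLU(),
            nn.Linear(hidden, action_size), nn.Tanh(),  # actions bounded in [-1, 1]
        )

    def forward(self, obs):
        return self.net(obs)

class CentralizedCritic(nn.Module):
    """Estimates Q(s_1..s_N, a_1..a_N) from the joint observations and actions."""
    def __init__(self, obs_size, action_size, num_agents, hidden=128):
        super().__init__()
        joint_size = num_agents * (obs_size + action_size)
        self.net = nn.Sequential(
            nn.Linear(joint_size, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, all_obs, all_actions):
        # all_obs: (batch, num_agents * obs_size); all_actions: (batch, num_agents * action_size)
        return self.net(torch.cat([all_obs, all_actions], dim=-1))

num_agents, obs_size, action_size = 2, 24, 2  # assumed Tennis-like dimensions
actors = [Actor(obs_size, action_size) for _ in range(num_agents)]
critics = [CentralizedCritic(obs_size, action_size, num_agents) for _ in range(num_agents)]

obs = torch.randn(num_agents, obs_size)                     # one observation per agent
actions = torch.stack([a(o) for a, o in zip(actors, obs)])  # each actor sees only its own observation
q_values = [c(obs.reshape(1, -1), actions.reshape(1, -1)) for c in critics]  # each critic sees everything
```

Because only the critics depend on the joint information, they can be dropped after training, and each actor acts from its own observation alone, exactly as the paragraph notes for inference time.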
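Line 6 of the report mentions the four networks per agent and the 'soft update' of the target networks. The sketch below shows how such a soft update is commonly implemented in PyTorch, blending the target parameters towards the local parameters by a small factor tau; the function name, the tau value and the stand-in linear layers are illustrative assumptions rather than the author's exact code, and the sketch does not cover the prioritized experience replay also mentioned there.

```python
# Sketch only: the standard soft-update step, theta_target <- tau*theta_local + (1 - tau)*theta_target.
import copy
import torch

def soft_update(local_net, target_net, tau=1e-3):
    """Nudge the target network's parameters towards the local network's parameters."""
    with torch.no_grad():
        for target_param, local_param in zip(target_net.parameters(), local_net.parameters()):
            target_param.mul_(1.0 - tau).add_(tau * local_param)

# DDPG keeps four networks: local/target actor and local/target critic
# (simple linear layers stand in for the real networks here).
actor_local = torch.nn.Linear(24, 2)
actor_target = copy.deepcopy(actor_local)   # targets start as copies of the local networks
critic_local = torch.nn.Linear(24 + 2, 1)
critic_target = copy.deepcopy(critic_local)

# ...after each learning step on the local networks:
soft_update(actor_local, actor_target)
soft_update(critic_local, critic_target)
```

Keeping tau small means the targets change slowly, which is what gives the TD targets the stability argued for in the DQN convergence discussion the paragraph refers to.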