I've based my implementation on the multi-agent deep deterministic policy gradient (MADDPG) algorithm, [see paper](https://arxiv.org/abs/1706.02275). MADDPG is an extension of the single-agent DDPG algorithm I used in the [reacher project](https://github.com/MathiasGruber/ReacherAgent-PyTorch). DDPG is an actor-critic method that has been shown to perform well in environments with a continuous action space, which are not handled well by the DQN algorithm and its various extensions. The algorithm consists of two neural networks, the *actor* and the *critic*: the *actor* approximates the optimal deterministic policy, while the *critic* learns to evaluate the optimal action-value function based on the actions taken by the *actor*. The idea is thus that the *actor* specifies actions *a*, while the *critic* computes a temporal-difference error (TD error) that criticizes the actions made by the *actor*.

In the multi-agent case, i.e. the MADDPG algorithm, each agent has its own DDPG learner, with the critic being given the observations and actions of all agents, whereas each actor is only given information pertaining to its own agent, meaning that at inference time it can act independently of the other agents.
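To make the centralized-critic / decentralized-actor structure concrete, here is a minimal sketch of what the two networks could look like in PyTorch. The class names, layer sizes, and activations are illustrative assumptions, not the exact networks used in this repository; the point is simply that the actor consumes a single agent's observation, while the critic consumes the concatenated observations and actions of all agents.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Actor(nn.Module):
    """Deterministic policy: maps one agent's observation to its action."""

    def __init__(self, obs_size, action_size, hidden=128):
        super().__init__()
        self.fc1 = nn.Linear(obs_size, hidden)
        self.fc2 = nn.Linear(hidden, hidden)
        self.fc3 = nn.Linear(hidden, action_size)

    def forward(self, obs):
        x = F.relu(self.fc1(obs))
        x = F.relu(self.fc2(x))
        return torch.tanh(self.fc3(x))  # actions bounded to [-1, 1]


class CentralizedCritic(nn.Module):
    """Action-value estimator conditioned on observations and actions of all agents."""

    def __init__(self, obs_size, action_size, num_agents, hidden=128):
        super().__init__()
        full_input = num_agents * (obs_size + action_size)
        self.fc1 = nn.Linear(full_input, hidden)
        self.fc2 = nn.Linear(hidden, hidden)
        self.fc3 = nn.Linear(hidden, 1)

    def forward(self, all_obs, all_actions):
        # all_obs / all_actions: concatenated across agents, shape (batch, num_agents * size)
        x = torch.cat([all_obs, all_actions], dim=-1)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)
```

Because only the critic depends on the other agents' information, it can be discarded after training, and each trained actor acts on its local observation alone.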