# Multi-agent Reinforcement Learning

Previous sections discussed reinforcement learning involving only one
agent. However, researchers are becoming increasingly interested in
multiagent reinforcement learning. Consider the framework of
single-agent reinforcement learning shown in Figure
:numref:`ch011/ch11-rl`. This framework considers the impact of only a
single agent's action on the environment, and the reward feedback from
the environment applies only to this agent. If we extend the
single-agent setting to multiple agents, we obtain at least two
multiagent reinforcement learning frameworks, as shown in Figure
:numref:`ch011/ch11-marl`. Figure :numref:`ch011/ch11-marl`(a) shows a
scenario where multiple agents perform actions at the same time. The
agents cannot observe one another's actions, and their actions have a
joint impact on the environment. Each agent receives an individual
reward for its actions. Figure :numref:`ch011/ch11-marl`(b) shows a
scenario where multiple agents perform actions in sequence. Each agent
can observe the actions of the agents that acted before it. Their
actions have a joint impact on the environment, and each agent receives
an individual or team reward. Aside from these two frameworks, other
frameworks may involve more complex mechanisms of observation,
communication, cooperation, and competition among agents. The simplest
setting assumes that each agent observes the full environment state,
but this is rarely the case in practice: agents usually have different
partial observations of the environment.

:label:`ch011/ch11-marl`
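
To make the two frameworks concrete, the sketch below contrasts a
simultaneous step, in which agents act without seeing one another, with
a sequential step, in which each agent observes the actions taken before
its turn. The `RandomAgent` and `ToyEnv` classes and the two step
functions are illustrative placeholders, not a real library API.

```python
import random

# Minimal sketch of the two multiagent interaction frameworks.
# All class and function names here are illustrative placeholders.

class RandomAgent:
    def __init__(self, n_actions):
        self.n_actions = n_actions

    def act(self, state, observed=()):
        # A real agent would condition on the state (and, in the
        # sequential framework, on the actions it has observed).
        return random.randrange(self.n_actions)

class ToyEnv:
    def step(self, joint_action):
        # Returns the next state and one reward per agent.
        next_state = sum(joint_action)
        rewards = [float(a == max(joint_action)) for a in joint_action]
        return next_state, rewards

def simultaneous_step(env, agents, state):
    # Framework (a): all agents act at once, without observing others.
    joint_action = [agent.act(state) for agent in agents]
    return env.step(joint_action)

def sequential_step(env, agents, state):
    # Framework (b): agents act in turn and observe earlier actions.
    actions = []
    for agent in agents:
        actions.append(agent.act(state, observed=tuple(actions)))
    return env.step(actions)

agents = [RandomAgent(3) for _ in range(2)]
print(simultaneous_step(ToyEnv(), agents, state=0))
print(sequential_step(ToyEnv(), agents, state=0))
```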

## Multi-agent RL

Based on the Markov decision process used in single-agent reinforcement
learning, we can define its multiagent counterpart as a tuple
$(\mathcal{S}, N, \boldsymbol{\mathcal{A}}, \mathbf{R}, \mathcal{T}, \gamma)$.
In the tuple, $N$ indicates the number of agents, and $\mathcal{S}$ and
$\boldsymbol{\mathcal{A}}=(\mathcal{A}_1, \mathcal{A}_2, ..., \mathcal{A}_N)$
are the environment state space and the multiagent action space,
respectively, where $\mathcal{A}_i$ is the action space of the $i$th
agent. $\mathbf{R}=(R_1, R_2, ..., R_N)$:
$\mathcal{S}\times \boldsymbol{\mathcal{A}}\rightarrow \mathbb{R}^N$ is
the multiagent reward function, where $\mathbf{R}(s,\mathbf{a})$ denotes
the reward vector with respect to the state $s\in\mathcal{S}$ and the
multiagent action $\mathbf{a}\in\boldsymbol{\mathcal{A}}$, and $R_i$ is
the reward for the $i$th agent. The transition function $\mathcal{T}$:
$\mathcal{S}\times\boldsymbol{\mathcal{A}}\times\mathcal{S}\rightarrow \mathbb{R}_+$
gives the probability $\mathcal{T}(s^\prime|s,\mathbf{a})$ of
transitioning from the current state $s$ under the multiagent action
$\mathbf{a}$ to the next state $s^\prime$. $\gamma\in (0,1)$ is the
reward discount factor[^1]. In addition to maximizing the expected
cumulative reward $\mathbb{E}[\sum_t \gamma^t r^i_t], i\in[N]$, for each
agent, multiagent reinforcement learning involves other objectives, such
as reaching a Nash equilibrium or maximizing the team reward, that do
not arise in single-agent reinforcement learning.
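
As a rough illustration, this tuple can be written down as a small data
structure. The following sketch is only illustrative: the field names,
the callable reward and transition functions, and the toy instance are
assumptions made for the example, not a standard API.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

# Sketch of the multiagent MDP tuple (S, N, A, R, T, gamma).
@dataclass
class MarkovGame:
    states: Sequence                              # state space S
    n_agents: int                                 # number of agents N
    action_spaces: Sequence[Sequence]             # A = (A_1, ..., A_N)
    reward_fn: Callable[..., Tuple[float, ...]]   # R(s, a) -> (r_1, ..., r_N)
    transition_fn: Callable[..., float]           # T(s' | s, a) -> probability
    gamma: float                                  # shared discount factor

# A trivial one-state, two-agent instance: both agents are rewarded
# when their actions match.
game = MarkovGame(
    states=[0],
    n_agents=2,
    action_spaces=[[0, 1], [0, 1]],
    reward_fn=lambda s, a: (float(a[0] == a[1]), float(a[0] == a[1])),
    transition_fn=lambda s_next, s, a: 1.0,       # single state, so certain
    gamma=0.99,
)
print(game.reward_fn(0, (1, 1)))                  # (1.0, 1.0)
```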

We can therefore conclude that multiagent reinforcement learning is more
complex than single-agent reinforcement learning, and that its
complexity is not simply the sum of each agent's decision complexity.
Closely related to the classical field of game theory, research on
multiagent systems has a long history that predates the popularity of
reinforcement learning, and many open theoretical problems remain. A
typical one is that computing a Nash equilibrium of a two-player
non-zero-sum game is believed to be computationally intractable[^2]. We
will not delve too deeply into such problems due to limited space.
Instead, we provide a simple example to explain why a multiagent
learning problem cannot be directly solved using a single-agent
reinforcement learning algorithm.

## Game Example

Consider the rock-paper-scissors game. In this game, the win-lose
relationship is scissors \< rock \< paper \< scissors, where `<` means
that the latter pure strategy beats the former: the loser receives a
reward of --1 and the winner receives a reward of +1. If both players
choose the same pure strategy, both are rewarded 0. The payoff table of
the game is provided in Table :numref:`ch11-marl`. The row and column
headings indicate the strategies of Player 1 and Player 2,
respectively, and each entry in the table is the pair of rewards
(Player 1's, Player 2's) for the corresponding joint action.

:Payoff table of the rock-paper-scissors game

| Player 1 \ Player 2 | Scissors   | Rock       | Paper      |
|---------------------|------------|------------|------------|
| Scissors            | (0, 0)     | (--1, +1)  | (+1, --1)  |
| Rock                | (+1, --1)  | (0, 0)     | (--1, +1)  |
| Paper               | (--1, +1)  | (+1, --1)  | (0, 0)     |
:label:`ch11-marl`
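
For the numerical sketches that follow, the payoff table can be encoded
as a pair of reward matrices, one per player. The names `R1` and `R2`
are illustrative, and numpy is assumed to be available.

```python
import numpy as np

# Rows index Player 1's action, columns index Player 2's action,
# in the order (scissors, rock, paper); entries are Player 1's rewards.
R1 = np.array([[ 0, -1,  1],
               [ 1,  0, -1],
               [-1,  1,  0]])
R2 = -R1   # zero-sum game: Player 2's reward is the negative of Player 1's

# Example: Player 1 plays scissors (0), Player 2 plays rock (1).
print(R1[0, 1], R2[0, 1])   # -1 1
```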

Due to the antisymmetric nature of this payoff matrix, the Nash
equilibrium strategy is the same for both players, with a strategy
distribution of $(\frac{1}{3}, \frac{1}{3}, \frac{1}{3})$. This means
that each player chooses paper, rock, or scissors with probability
$\frac{1}{3}$. If we treat the Nash equilibrium strategy as the
objective of multiagent reinforcement learning, we can conclude that
this strategy cannot be obtained simply through single-agent
reinforcement learning. Assume that the two players are randomly
initialized with pure strategies; for example, Player 1 chooses scissors
and Player 2 chooses rock. Also assume that the strategy of Player 2 is
fixed, so that it can be considered part of the environment. This allows
us to use single-agent reinforcement learning to improve the strategy of
Player 1 in order to maximize its reward. In this case, Player 1
converges to the pure strategy of paper. If we then fix this strategy
for Player 1 and train Player 2, Player 2 converges to the pure strategy
of scissors. Continuing in this way, Player 1 and Player 2 enter a cycle
over the three pure strategies, and neither of them obtains the correct
Nash equilibrium strategy.
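
This failure mode is easy to check numerically. The short sketch below
is only an illustration: it reuses the hypothetical `R1`/`R2` matrices
introduced above and computes exact best responses in place of a learned
policy.

```python
import numpy as np

# Player 1's payoffs; rows/columns are ordered (scissors, rock, paper).
R1 = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]])
R2 = -R1   # Player 2's payoffs (zero-sum)

# Fix Player 2 on rock: Player 1's best response is paper.
print(np.argmax(R1[:, 1]))   # 2 -> paper
# Now fix Player 1 on paper: Player 2's best response is scissors.
print(np.argmax(R2[2, :]))   # 0 -> scissors
# Iterating these best responses cycles forever through the three
# pure strategies instead of reaching the mixed equilibrium.

# The uniform strategy (1/3, 1/3, 1/3) is the Nash equilibrium: against
# it, every pure strategy earns the same expected reward (zero), so
# neither player can gain by deviating.
uniform = np.ones(3) / 3
print(R1 @ uniform)    # [0. 0. 0.]
print(uniform @ R2)    # [0. 0. 0.]
```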

## Self-play

The learning method used in the preceding example is called
*self-play*, as shown in Figure :numref:`ch011/ch11-marl-sp`. It is one
of the most basic multiagent reinforcement learning methods. In
self-play, given the fixed strategy of Player 1, the strategy of
Player 2 is optimized by maximizing its own reward using single-agent
reinforcement learning. The resulting strategy, referred to as the
best-response strategy, is then fixed for Player 2 while the strategy of
Player 1 is optimized in the same way, and the cycle repeats
indefinitely. In some cases, however, self-play fails to converge to the
objective we expect. Because cycles such as the one above can arise, we
need more sophisticated training methods designed specifically for
multiagent learning in order to achieve our objective.

:label:`ch011/ch11-marl-sp`
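
A bare-bones sketch of this loop on the rock-paper-scissors game is
given below. The `train_best_response` function is a hypothetical
stand-in for an arbitrary single-agent reinforcement learning routine;
with the opponent fixed, it simply returns an exact best response.

```python
import numpy as np

R1 = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]])   # Player 1's payoffs
R2 = -R1                                              # Player 2's payoffs

def train_best_response(payoff, opponent_strategy):
    # Stand-in for a single-agent RL routine: with the opponent fixed
    # (treated as part of the environment), the learner maximizes its
    # expected reward, which here reduces to an exact best response.
    expected = payoff @ opponent_strategy
    best = np.zeros(len(expected))
    best[np.argmax(expected)] = 1.0
    return best

# Self-play: alternately fix one player's strategy and train the other.
p1 = np.array([1.0, 0.0, 0.0])   # Player 1 starts with pure scissors
p2 = np.array([0.0, 1.0, 0.0])   # Player 2 starts with pure rock
for _ in range(6):
    p1 = train_best_response(R1, p2)      # Player 2 is fixed
    p2 = train_best_response(R2.T, p1)    # Player 1 is fixed
    print(p1, p2)
# The pure strategies keep cycling instead of converging to (1/3, 1/3, 1/3).
```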

Generally, multiagent reinforcement learning is more complex than
single-agent reinforcement learning. In self-play, a single-agent
reinforcement learning process may be considered a subtask of multiagent
reinforcement learning. In the game discussed above, when the strategy
of Player 1 is fixed, Player 1 plus the game environment constitute the
learning environment of Player 2, which can maximize its reward using
single-agent reinforcement learning. Likewise, when the strategy of
Player 2 is fixed, Player 1 can perform single-agent reinforcement
learning, and the cycle repeats indefinitely. This is why single-agent
reinforcement learning can be considered a subtask of multiagent
reinforcement learning. Another learning method is *fictitious
self-play*, as shown in Figure :numref:`ch011/ch11-marl-fsp`, whereby
each agent chooses a best response to its opponent's historical average
strategy rather than to its current strategy. In this manner, the
players' average strategies can converge to the Nash equilibrium
strategy in games such as rock-paper-scissors.

:label:`ch011/ch11-marl-fsp`
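
The sketch below illustrates this idea under the same assumptions as
before: the hypothetical `R1`/`R2` matrices and an exact best-response
step stand in for a learned policy. Each player responds to the
opponent's empirical average strategy, and the averages approach the
uniform Nash equilibrium.

```python
import numpy as np

R1 = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]])   # Player 1's payoffs
R2 = -R1                                              # Player 2's payoffs

def best_response(payoff, opponent_avg):
    # Best pure response to the opponent's historical average strategy.
    pure = np.zeros(3)
    pure[np.argmax(payoff @ opponent_avg)] = 1.0
    return pure

counts1 = np.array([1.0, 0.0, 0.0])   # Player 1 has played scissors once
counts2 = np.array([0.0, 1.0, 0.0])   # Player 2 has played rock once
for _ in range(10000):
    avg1 = counts1 / counts1.sum()
    avg2 = counts2 / counts2.sum()
    counts1 += best_response(R1, avg2)    # respond to the opponent's average
    counts2 += best_response(R2.T, avg1)
print(counts1 / counts1.sum())   # close to [1/3, 1/3, 1/3]
print(counts2 / counts2.sum())   # close to [1/3, 1/3, 1/3]
```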

[^1]: We assume that all agents use the same reward discount factor.

[^2]: Computing a Nash equilibrium of a two-player game is a
    PPAD-complete problem (PPAD: Polynomial Parity Arguments on Directed
    graphs). For details, see Xi Chen et al., Settling the Complexity of
    Computing Two-Player Nash Equilibria.