# Multi-agent Reinforcement Learning

Previous sections discussed reinforcement learning involving only one
agent. However, researchers are becoming increasingly interested in
multiagent reinforcement learning. Consider the framework of
single-agent reinforcement learning shown in Figure
:numref:`ch011/ch11-rl`. This framework considers the impact of only a
single agent's action on the environment, and the reward feedback from
the environment applies only to this agent. If we extend the
single-agent setting to multiple agents, we obtain at least two
multiagent reinforcement learning frameworks, as shown in Figure
:numref:`ch011/ch11-marl`. Figure :numref:`ch011/ch11-marl`(a) shows a
scenario where multiple agents act simultaneously: the agents cannot
observe one another's actions, their joint action determines the effect
on the environment, and each agent receives an individual reward for
its actions. Figure :numref:`ch011/ch11-marl`(b) shows a scenario where
the agents act in sequence: each agent can observe the actions of the
agents that acted before it, the joint action again determines the
effect on the environment, and each agent receives an individual or
team reward. Beyond these two frameworks, others may involve more
complex mechanisms of observation, communication, cooperation, and
competition among agents. The simplest assumption is that each agent
observes the full environment state, but this rarely holds in the real
world; in practice, agents usually have different partial observations
of the environment.

![Two possible multiagent reinforcement learning frameworks: (a) Synchronous multiagent decision-making; (b) Asynchronous multiagent decision-making](../img/ch11/ch11-marl.pdf)
:label:`ch011/ch11-marl`

## Multi-agent RL

Based on the Markov decision process used in single-agent reinforcement
learning, we can define its multiagent counterpart as a tuple
$(\mathcal{S}, N, \boldsymbol{\mathcal{A}}, \mathbf{R}, \mathcal{T}, \gamma)$.
In the tuple, $N$ is the number of agents, and $\mathcal{S}$ and
$\boldsymbol{\mathcal{A}}=(\mathcal{A}_1, \mathcal{A}_2, ..., \mathcal{A}_N)$
are the environment state space and the multiagent action space,
respectively, where $\mathcal{A}_i$ is the action space of the $i$th
agent. $\mathbf{R}=(R_1, R_2, ..., R_N)$ is the multiagent reward
function
$\mathbf{R}: \mathcal{S}\times \boldsymbol{\mathcal{A}}\rightarrow \mathbb{R}^N$,
where $\mathbf{R}(s,\mathbf{a})$ denotes the reward vector for state
$s\in\mathcal{S}$ and multiagent action
$\mathbf{a}\in\boldsymbol{\mathcal{A}}$, and $R_i$ is the reward for the
$i$th agent. The transition function
$\mathcal{T}: \mathcal{S}\times\boldsymbol{\mathcal{A}}\times\mathcal{S}\rightarrow \mathbb{R}_+$
gives the probability $\mathcal{T}(s^\prime|s,\mathbf{a})$ of moving
from the current state and multiagent action to the next state.
$\gamma\in (0,1)$ is the reward discount factor[^1]. In addition to
maximizing the expected cumulative reward
$\mathbb{E}[\sum_t \gamma^t r^i_t]$ for each agent $i\in[N]$, multiagent
reinforcement learning involves other objectives, such as reaching a
Nash equilibrium or maximizing the team reward, that do not arise in
single-agent reinforcement learning.
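
To make the tuple concrete, the sketch below encodes a one-shot
two-agent matrix game as a minimal multiagent environment in Python.
The class `MatrixGameEnv` and the coordination payoffs are illustrative
assumptions rather than a standard API; the point is only that `step`
consumes a joint action $\mathbf{a}=(a_1, a_2)$ and returns a reward
vector with one entry per agent, mirroring $\mathcal{T}$ and
$\mathbf{R}$.

```python
from typing import List, Tuple

class MatrixGameEnv:
    """A minimal two-agent environment: a one-shot matrix game.

    There is a single dummy state, so the transition kernel T is
    trivial; step() maps the joint action to one reward per agent.
    """

    def __init__(self, payoff: List[List[Tuple[float, float]]]):
        self.payoff = payoff          # payoff[a1][a2] -> (r_1, r_2)
        self.n_agents = 2

    def reset(self) -> int:
        return 0                      # the only state

    def step(self, joint_action: Tuple[int, int]):
        a1, a2 = joint_action
        rewards = self.payoff[a1][a2]     # reward vector R(s, a)
        return 0, rewards, True           # next state, rewards, done


# Illustrative coordination game: both agents get +1 if their actions
# match and -1 otherwise (the payoff values are arbitrary).
coordination = [[(+1, +1), (-1, -1)],
                [(-1, -1), (+1, +1)]]

env = MatrixGameEnv(coordination)
env.reset()
_, rewards, _ = env.step((0, 1))
print(rewards)    # (-1, -1): the agents failed to coordinate
```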

We can therefore conclude that multiagent reinforcement learning is more
complex than single-agent reinforcement learning, and that its
complexity is not simply the sum of each agent's decision complexity.
Closely related to the classical field of game theory, research on
multiagent systems has a long history that predates the popularity of
reinforcement learning. Such systems have been studied extensively, and
many theoretical problems remain open. A typical one is the hardness of
computing a Nash equilibrium in a two-player non-zero-sum game[^2]. We
will not delve too deeply into such problems due to limited space.
Instead, we provide a simple example to explain why a multiagent
learning problem cannot be directly solved using a single-agent
reinforcement learning algorithm.

## Game Example

Consider the rock-paper-scissors game. In this game, the win-lose
relationship is cyclic: scissors \< rock \< paper \< scissors, where
`<` means that the latter pure strategy beats the former. The loser
receives a reward of -1 and the winner a reward of +1; if both players
choose the same pure strategy, both are rewarded 0. The payoff table of
the game is provided in Table :numref:`ch11-marl`. The row and column
headings indicate the pure strategies of Player 1 and Player 2,
respectively, and each cell contains the pair of rewards (Player 1's,
Player 2's) for the corresponding pair of actions.

:Payoff table of the rock-paper-scissors game

| Reward   | Scissors  | Rock      | Paper     |
|----------|-----------|-----------|-----------|
| Scissors | (0, 0)    | (-1, +1)  | (+1, -1)  |
| Rock     | (+1, -1)  | (0, 0)    | (-1, +1)  |
| Paper    | (-1, +1)  | (+1, -1)  | (0, 0)    |
:label:`ch11-marl`
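
For later use, the payoff table can be encoded as a single reward
matrix. In the short snippet below (the names `A` and `B` are just for
this sketch), `A` holds Player 1's payoffs; because the game is
zero-sum, Player 2's payoffs are `-A`, and the matrix is antisymmetric,
$A^\top = -A$, a property used in the next paragraph.

```python
import numpy as np

# Player 1's payoffs (rows: Player 1's action, columns: Player 2's action).
# Action order: 0 = scissors, 1 = rock, 2 = paper.
A = np.array([[ 0, -1, +1],
              [+1,  0, -1],
              [-1, +1,  0]])
B = -A                               # zero-sum: Player 2's payoff matrix

print(np.array_equal(A.T, -A))       # True: the matrix is antisymmetric

a1, a2 = 0, 1                        # Player 1 plays scissors, Player 2 plays rock
print(A[a1, a2], B[a1, a2])          # -1 1, matching the (-1, +1) cell of the table
```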

Because this matrix is antisymmetric, the Nash equilibrium strategy is
the same for both players: the mixed strategy
$(\frac{1}{3}, \frac{1}{3}, \frac{1}{3})$, meaning each player chooses
scissors, rock, or paper with probability $\frac{1}{3}$. If we treat the
Nash equilibrium strategy as the objective of multiagent reinforcement
learning, this strategy cannot be obtained simply by running
single-agent reinforcement learning. Suppose the two players are
randomly initialized with pure strategies, say scissors for Player 1 and
rock for Player 2, and that the strategy of Player 2 is held fixed. The
fixed Player 2 can then be treated as part of the environment, which
allows us to use single-agent reinforcement learning to improve the
strategy of Player 1 so as to maximize its reward. In this case, Player
1 converges to the pure strategy of paper. If we then fix this strategy
for Player 1 and train Player 2, Player 2 converges to the pure strategy
of scissors. Continuing in this way, the two players enter a cycle over
the three pure strategies, and neither of them ever obtains the correct
Nash equilibrium strategy.
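
This cycle can be reproduced numerically. The sketch below is a
simplification in which each "training" phase is replaced by an exact
best response to the opponent's frozen pure strategy; alternately
optimizing the two players never reaches the mixed equilibrium, and the
strategies simply rotate through the three pure strategies. The first
two lines also verify that every pure reply earns an expected reward of
0 against the uniform strategy, which is what makes
$(\frac{1}{3}, \frac{1}{3}, \frac{1}{3})$ an equilibrium.

```python
import numpy as np

# Player 1's payoffs (rows: Player 1, columns: Player 2); Player 2's are -A.
# Action order: 0 = scissors, 1 = rock, 2 = paper.
A = np.array([[ 0, -1, +1],
              [+1,  0, -1],
              [-1, +1,  0]])
names = ["scissors", "rock", "paper"]

uniform = np.full(3, 1 / 3)
print(A @ uniform)        # [0. 0. 0.]: no pure reply beats the uniform strategy

p1, p2 = 0, 1             # initial pure strategies: scissors vs. rock
for step in range(6):
    p1 = int(np.argmax(A[:, p2]))      # "train" Player 1 against frozen Player 2
    p2 = int(np.argmax(-A[p1, :]))     # "train" Player 2 against frozen Player 1
    print(f"step {step}: Player 1 -> {names[p1]}, Player 2 -> {names[p2]}")
```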

## Self-play

The learning method used in the preceding example is called *self-play*,
as shown in Figure :numref:`ch011/ch11-marl-sp`. It is one of the most
basic multiagent reinforcement learning methods. In self-play, given the
fixed strategy of Player 1, the strategy of Player 2 is optimized to
maximize its own reward using single-agent learning methods. The
resulting strategy, referred to as the best response, is then fixed for
Player 2 in order to optimize the strategy of Player 1, and the cycle
repeats indefinitely. In some cases, however, self-play fails to
converge to the objective we expect. Because such cycles can arise, we
need more sophisticated training methods designed specifically for
multiagent learning to achieve our objective.

![Self-play algorithm](../img/ch11/ch11-marl-sp.png)
:label:`ch011/ch11-marl-sp`

Generally, multiagent reinforcement learning is more complex than
single-agent reinforcement learning, and in self-play a single-agent
reinforcement learning process can be regarded as a subtask of the
multiagent problem. In the game discussed above, when the strategy of
Player 1 is fixed, Player 1 together with the game environment
constitutes the learning environment of Player 2, which can then
maximize its reward using single-agent reinforcement learning. Likewise,
when the strategy of Player 2 is fixed, Player 1 can perform
single-agent reinforcement learning, and the cycle repeats indefinitely.
This is why single-agent reinforcement learning can be viewed as a
subtask of multiagent reinforcement learning. Another learning method is
*fictitious self-play*, shown in Figure
:numref:`ch011/ch11-marl-fsp`, whereby each agent chooses a best
response to its opponent's historical average strategy rather than to
its latest strategy. In this manner, the players can converge to the
Nash equilibrium strategy in games such as rock-paper-scissors.

![Fictitious self-play algorithm](../img/ch11/ch11-marl-fsp.pdf)
:label:`ch011/ch11-marl-fsp`
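
As a rough illustration (classic discrete fictitious play, rather than
the exact algorithm in the figure), the sketch below has each player
best-respond to the empirical average of the opponent's past actions in
rock-paper-scissors; the empirical action frequencies approach the
$(\frac{1}{3}, \frac{1}{3}, \frac{1}{3})$ equilibrium.

```python
import numpy as np

# Player 1's payoffs (action order: scissors, rock, paper); Player 2's are -A.
A = np.array([[ 0, -1, +1],
              [+1,  0, -1],
              [-1, +1,  0]])

counts1 = np.ones(3)      # Player 1's action counts (uniform pseudo-count prior)
counts2 = np.ones(3)      # Player 2's action counts

for _ in range(20000):
    avg1 = counts1 / counts1.sum()     # Player 1's historical average strategy
    avg2 = counts2 / counts2.sum()     # Player 2's historical average strategy
    a1 = int(np.argmax(A @ avg2))      # best response to the opponent's average
    a2 = int(np.argmax(-A.T @ avg1))   # Player 2's payoff matrix is -A
    counts1[a1] += 1
    counts2[a2] += 1

print(counts1 / counts1.sum())   # both frequencies approach [1/3, 1/3, 1/3]
print(counts2 / counts2.sum())
```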

[^1]: We assume that all agents use the same reward discount factor.

[^2]: Computing a Nash equilibrium of a two-player game is complete for
    the complexity class PPAD (Polynomial Parity Arguments on Directed
    graphs). For details, see "Settling the Complexity of Computing
    Two-Player Nash Equilibria" by Xi Chen et al.
