# Distributed Reinforcement Learning System

The distributed reinforcement learning system is more powerful than the single-node reinforcement learning system discussed earlier. It can process multiple models in multiple environments in parallel, meaning that it can update multiple models on multiple computing systems at the same time. This significantly accelerates the learning process and improves the overall performance of the reinforcement learning system. This section focuses on common algorithms and systems in distributed reinforcement learning.

## Distributed RL Algorithm -- A3C

Asynchronous Advantage Actor-Critic (A3C) was proposed by DeepMind researchers in 2016. The algorithm updates networks on multiple computing devices in parallel. Unlike a single-node reinforcement learning system, A3C creates a group of workers, allocates them to different computing devices, and creates an interactive environment for each worker so that sampling and model updates run in parallel. In addition, it uses a master node to update the actor networks (policy networks) and critic networks (value networks), which correspond to the policy and value functions in reinforcement learning, respectively. Under this design, each worker sends the gradients computed from its collected samples to the master node in real time in order to update the parameters on the master node; the updated parameters are then transferred back to each worker in real time for model synchronization. Each worker can perform its computation on a GPU, so the entire algorithm updates the model in parallel on a GPU cluster. Figure :numref:`ch011/ch11-a3c` depicts the algorithm structure. Research shows that, in addition to accelerating model learning, distributed reinforcement learning helps stabilize learning performance, because the gradients are computed from samples collected in the environments of multiple nodes.

![A3C distributed algorithm architecture](../img/ch11/ch11-a3c.pdf)
:label:`ch011/ch11-a3c`
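
To make the update flow concrete, here is a minimal sketch of an A3C-style asynchronous update loop in PyTorch. The network, the random tensors that stand in for environment rollouts, and the `worker` function are all illustrative assumptions; a production implementation would also share the optimizer state across workers and interact with real environments.

```python
import torch
import torch.nn as nn
import torch.multiprocessing as mp


class ActorCritic(nn.Module):
    """Tiny actor-critic network: shared body, policy head, value head."""

    def __init__(self, obs_dim=4, n_actions=2):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU())
        self.policy = nn.Linear(64, n_actions)  # actor head (logits)
        self.value = nn.Linear(64, 1)           # critic head

    def forward(self, obs):
        h = self.body(obs)
        return self.policy(h), self.value(h)


def worker(global_net, steps=50):
    """Each worker keeps a local copy of the network and its own environment."""
    local_net = ActorCritic()
    # The optimizer updates the *shared* parameters of the master network.
    optimizer = torch.optim.Adam(global_net.parameters(), lr=1e-3)
    for _ in range(steps):
        # 1. Pull the latest parameters from the master network.
        local_net.load_state_dict(global_net.state_dict())

        # 2. Collect a small batch from the worker's own environment
        #    (random tensors stand in for real observations and returns).
        obs = torch.randn(8, 4)
        returns = torch.randn(8, 1)
        logits, values = local_net(obs)
        actions = torch.distributions.Categorical(logits=logits).sample()
        advantage = returns - values

        # 3. Compute the actor-critic loss and local gradients.
        log_probs = torch.log_softmax(logits, dim=-1).gather(1, actions.unsqueeze(1))
        loss = -(log_probs * advantage.detach()).mean() + advantage.pow(2).mean()
        local_net.zero_grad()
        loss.backward()

        # 4. Push the local gradients onto the shared parameters and update them.
        for lp, gp in zip(local_net.parameters(), global_net.parameters()):
            gp._grad = lp.grad
        optimizer.step()


if __name__ == "__main__":
    master = ActorCritic()
    master.share_memory()  # master parameters live in shared memory
    workers = [mp.Process(target=worker, args=(master,)) for _ in range(4)]
    for p in workers:
        p.start()
    for p in workers:
        p.join()
```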

## Distributed RL Algorithm -- IMPALA

Importance Weighted Actor-Learner Architecture (IMPALA) is a reinforcement learning framework proposed by Lasse Espeholt et al. in 2018 to implement clustered multi-machine training. Figure :numref:`ch011/ch11-impala` depicts this architecture. Like A3C, IMPALA computes gradients on multiple GPUs in parallel. In IMPALA, multiple actors and learners run in parallel. Each actor has a policy network and collects samples by interacting with its own environment. The actors send the collected sample trajectories to their respective learners for gradient computation. Among the learners, there is a master learner that communicates with the other learners to obtain their computed gradients and update its model. After the model is updated, it is delivered to the other learners and actors for a new round of sampling and gradient computation. As a distributed computing architecture, IMPALA has proved to be more efficient than A3C. It benefits from a specially designed gradient computation in the learners that uses the V-trace target, which stabilizes training with importance weights. Because the V-trace technique is not related to our area of focus here, we will not elaborate on it; interested readers can learn more from the original paper.

![IMPALA distributed algorithm architecture](../img/ch11/ch11-impala.pdf)
:label:`ch011/ch11-impala`
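
The key architectural difference from A3C is that IMPALA's actors ship whole sample trajectories rather than gradients. The following toy sketch, using plain Python threads and a queue, only illustrates this data flow; the environment, network, and V-trace correction are stubbed out and the names are made up for this example.

```python
import queue
import random
import threading


def actor(actor_id, traj_queue, policy_version, n_trajs=5):
    """Each actor rolls out its own environment copy and ships trajectories."""
    for _ in range(n_trajs):
        trajectory = {
            "actor_id": actor_id,
            "policy_version": policy_version[0],  # version used for sampling
            "steps": [(random.random(), random.randint(0, 1), random.random())
                      for _ in range(16)],        # (obs, action, reward) stubs
        }
        traj_queue.put(trajectory)


def learner(traj_queue, policy_version, n_updates=20):
    """The (master) learner consumes trajectories and updates the policy.

    In real IMPALA the learner corrects for the lag between the behaviour
    policy and the current policy with V-trace importance weights; here we
    only track how stale each trajectory is.
    """
    for _ in range(n_updates):
        traj = traj_queue.get()
        staleness = policy_version[0] - traj["policy_version"]
        # ... compute V-trace-corrected gradients from traj["steps"] here ...
        policy_version[0] += 1                    # "broadcast" new parameters


if __name__ == "__main__":
    traj_queue = queue.Queue(maxsize=64)
    policy_version = [0]                          # stand-in for shared parameters
    threads = [threading.Thread(target=actor, args=(i, traj_queue, policy_version))
               for i in range(4)]
    threads.append(threading.Thread(target=learner, args=(traj_queue, policy_version)))
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```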

## Other Algorithms

Apart from A3C and IMPALA, researchers have proposed other algorithms in recent studies, such as SEED and Ape-X, which further improve the efficiency of distributed reinforcement learning. Readers can find out more about these algorithms in the corresponding papers. Next, we move on to some typical distributed reinforcement learning algorithm libraries.

## Distributed RL System -- RLlib

RLlib is an open-source reinforcement learning framework oriented to industrial applications. It is built on Ray, a distributed computing framework initiated by several researchers from UC Berkeley, and is designed specifically for reinforcement learning. RLlib includes a library of reinforcement learning algorithms, which makes it convenient for users who are less experienced in reinforcement learning.

Figure :numref:`ch011/ch11-rllib-arch` shows the architecture of RLlib. Its bottom layer is built on Ray's basic components for distributed computing and communication. For reinforcement learning, basic components such as Trainer, Environment, and Policy are abstracted at the Python layer. Built-in implementations are provided for these abstract components, and users can extend the components according to the requirements of their algorithms. With these built-in and customized algorithm components, researchers can quickly implement specific reinforcement learning algorithms.

![RLlib architecture](../img/ch11/ch11-rllib-arch.png)
:label:`ch011/ch11-rllib-arch`
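
As an illustration of extending the Environment component, the sketch below registers a toy custom environment with RLlib. The `GridEnv` class and the name `grid_env` are made up for this example, and it assumes the classic `gym.Env` interface; exact APIs differ across Ray and Gym versions.

```python
import gym
import numpy as np
from ray.tune.registry import register_env


class GridEnv(gym.Env):
    """Toy 1-D grid world: walk right until reaching the goal cell."""

    def __init__(self, env_config=None):
        self.size = (env_config or {}).get("size", 5)
        self.observation_space = gym.spaces.Box(0.0, float(self.size),
                                                shape=(1,), dtype=np.float32)
        self.action_space = gym.spaces.Discrete(2)  # 0 = left, 1 = right
        self.pos = 0

    def reset(self):
        self.pos = 0
        return np.array([self.pos], dtype=np.float32)

    def step(self, action):
        self.pos = max(0, self.pos + (1 if action == 1 else -1))
        done = self.pos >= self.size
        reward = 1.0 if done else -0.01
        return np.array([self.pos], dtype=np.float32), reward, done, {}


# Register the environment under a name that RLlib configurations can refer to.
register_env("grid_env", lambda cfg: GridEnv(cfg))
```

Once registered, the string `"grid_env"` can be used as the environment setting of an RLlib trainer configuration in place of a built-in environment name.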

RLlib supports distributed reinforcement learning training under different paradigms. Figure :numref:`ch011/ch11-rllib-distributed` shows the distributed training architecture of a reinforcement learning algorithm based on synchronous sampling. Each rollout worker is an independent process that interacts with its corresponding environment to collect experience, and multiple rollout workers can interact with their environments in parallel. Trainers are responsible for coordinating the rollout workers, optimizing the policy, and synchronizing the updated policy back to the rollout workers.

![RLlib distributed training](../img/ch11/ch11-rllib-distributed.png)
:label:`ch011/ch11-rllib-distributed`

Reinforcement learning is usually based on deep neural networks. For distributed learning with such networks, we can combine RLlib with a deep learning framework such as PyTorch or TensorFlow. In this approach, the deep learning framework is responsible for training and updating the policy network, while RLlib handles the computation of the reinforcement learning algorithm. RLlib also supports interaction with parallel vectorized environments and pluggable simulators, as well as offline reinforcement learning.
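
The following minimal sketch shows how these pieces fit together, roughly following RLlib's pre-2.0 Python API (import paths and configuration keys have changed in newer Ray releases, so treat it as illustrative): `num_workers` controls how many rollout workers sample in parallel, and `framework` selects the deep learning library that trains the policy network.

```python
import ray
from ray.rllib.agents.ppo import PPOTrainer

ray.init()
trainer = PPOTrainer(
    env="CartPole-v1",
    config={
        "num_workers": 4,      # parallel rollout workers collecting experience
        "framework": "torch",  # let PyTorch handle the policy network
    },
)
for i in range(5):
    result = trainer.train()   # one synchronous sample/optimize iteration
    print(i, result["episode_reward_mean"])
ray.shutdown()
```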

## Distributed RL System -- Reverb and Acme

When it comes to managing the experience replay buffer, Reverb is an essential topic. At the beginning of this chapter, we introduced concepts such as state, action, and reward in reinforcement learning. In real-world applications, the data used for training comes from the samples stored in the experience buffer, and the operations performed on the data may vary with the data format. Common data operations include concatenation, truncation, product, transposition, partial product, and taking the mean or extreme values. These operations may be performed on different dimensions of the data, which poses a challenge for existing reinforcement learning frameworks. To make data of different formats flexible to use in reinforcement learning training, Reverb introduces the concept of the *chunk*: all training data is stored as chunks in the buffer for management and scheduling. This design takes advantage of the fact that the data consists of multidimensional tensors, and makes data usage faster and more flexible.
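
The sketch below, loosely following the usage shown in Reverb's documentation, starts a replay server with a single table, inserts one item, and samples it back; the table name, sizes, and inserted values are illustrative, and API details may differ across dm-reverb versions.

```python
import reverb

# A replay server with one table: uniform sampling, FIFO eviction.
server = reverb.Server(tables=[
    reverb.Table(
        name="replay_buffer",
        sampler=reverb.selectors.Uniform(),       # how samples are drawn
        remover=reverb.selectors.Fifo(),          # how items are evicted
        max_size=1000,
        rate_limiter=reverb.rate_limiters.MinSize(1),
    ),
])

client = reverb.Client(f"localhost:{server.port}")
# Insert an (observation, action, reward) item; Reverb stores the underlying
# tensors as chunks internally.
client.insert([0.0, 1, 0.5], priorities={"replay_buffer": 1.0})
print(list(client.sample("replay_buffer", num_samples=1)))
```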

DeepMind recently proposed a distributed reinforcement learning framework called Acme, which is designed for both academic research and industrial applications. It provides a faster distributed reinforcement learning solution based on a distributed sampling structure and Reverb's sample buffer management. Reverb solves the efficiency problem of data management and transfer, allowing Acme to fully leverage the efficiency of distributed computing. Researchers have used Acme to achieve significant speed gains on many reinforcement learning benchmarks.

# Introduction to Reinforcement Learning

## Background

As a branch of machine learning, reinforcement learning has attracted increasing attention in recent years. DeepMind proposed deep Q-learning in 2013, enabling AI to learn how to play video games from images. Since then, scientific institutions led by DeepMind have made remarkable achievements in reinforcement learning --- a representative example is AlphaGo, which defeated the world's top Go player Lee Sedol in 2016. Other significant achievements include AlphaStar (a StarCraft agent), OpenAI Five (a Dota 2 agent), Pluribus (an agent for Texas hold'em poker, a multi-player zero-sum game), and motion control algorithms for robot dogs. These achievements have been made possible by the rapid iteration and progress of reinforcement learning algorithms over the past few years. Data-hungry deep neural networks can fit well on the large amounts of data generated by simulators, thereby fully leveraging the capabilities of reinforcement learning algorithms and learning to perform comparably to, or even better than, human experts. Although originally applied to video games, reinforcement learning has since been gradually applied in a wider range of realistic and meaningful fields, including robot control, dexterous manipulation, energy system scheduling, network load distribution, and automatic trading of stocks or futures. Such applications have impacted traditional control methods and heuristic decision-making theory.

## Reinforcement Learning Components

The core of reinforcement learning is the process of continuously interacting with the environment to optimize the policy, with the aim of increasing the reward. This process is manifested as the selection of an *action* based on a specific *state*. The object that makes the decision is called the *agent*, and the impact of the decision is reflected in the *environment*. More specifically, the *state transition* and *reward* in the environment vary depending on the decision. The state transition, which can be either deterministic or stochastic, is a function that specifies how the environment moves from the current state to the next state. The reward, which is generally a scalar, is the environment's feedback on the agent's action. Figure :numref:`ch011/ch11-rl` shows this abstract process, which is the most common model description of reinforcement learning in the literature.

![Framework of reinforcement learning](../img/ch11/ch11-rl.pdf)
:label:`ch011/ch11-rl`
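
The interaction loop above can be summarized in a few lines of code. The sketch below uses a made-up `ToyEnv` stub and a random agent purely to illustrate the state, action, and reward cycle.

```python
import random


class ToyEnv:
    """Environment stub: the state is a counter, and the episode ends after 10 steps."""

    def reset(self):
        self.state = 0
        return self.state                      # initial observation

    def step(self, action):
        self.state += 1                        # state transition
        reward = 1.0 if action == 1 else 0.0   # scalar feedback from the environment
        done = self.state >= 10
        return self.state, reward, done


env = ToyEnv()
obs, done, total_reward = env.reset(), False, 0.0
while not done:
    action = random.choice([0, 1])             # the agent selects an action
    obs, reward, done = env.step(action)       # the environment responds
    total_reward += reward
print("episode return:", total_reward)
```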

Take video gaming as an example. A gamer needs to gradually become familiar with the game controls in order to achieve better results, and the process from getting started to gradually mastering the game is similar to the reinforcement learning process. At any given moment after the game starts, the game is in a specific state. From this state the gamer obtains an *observation* (e.g., the images on the screen of the game console), based on which the gamer performs an action (e.g., firing bullets) that changes the game state and moves the game to the next state (e.g., defeating a monster). Furthermore, the gamer learns the effect of the current action (e.g., defeating a monster yields a positive score, whereas losing to a monster yields a negative score). The gamer then selects a new action based on the observation of the next state, and repeats this process until the game ends. Through these repeated operations and observations, the gamer gradually masters the skills of the game. A reinforcement learning agent learns to play the game in a similar way.

However, several key points should be noted in this process. (1) The observation may not be equal to the state. It is generally a function of the state, and the mapping from state to observation may lose information. The environment is *fully observable* if the observation is equal to the state or if the state of the environment can be completely recovered from the observation; in all other cases, it is *partially observable*. (2) An action performed by the gamer may not produce immediate feedback; its effect may be delayed by many steps. Reinforcement learning models allow such delayed feedback. (3) In human learning, the feedback may not be a scalar. To convert the feedback received by the reinforcement learning agent into a scalar, called the reward value, we perform a mathematical abstraction on it. The reward value can be a function of the state, or a function of the state and action. The existence of a reward value is a basic assumption of reinforcement learning, and is also a major difference between reinforcement learning and supervised learning.

## Markov Decision Process

In reinforcement learning, the decision-making process is generally described by a Markov decision process[^1] and can be represented by a tuple $(\mathcal{S}, \mathcal{A}, R, \mathcal{T}, \gamma)$. $\mathcal{S}$ and $\mathcal{A}$ denote the state space and action space, respectively. $R$ denotes the reward function: $R(s,a)$: $\mathcal{S}\times \mathcal{A}\rightarrow \mathbb{R}$ gives the reward value for the current state $s\in\mathcal{S}$ and the current action $a\in\mathcal{A}$. The probability of transitioning from the current state and action to the next state is defined as $\mathcal{T}(s^\prime|s,a)$: $\mathcal{S}\times\mathcal{A}\times\mathcal{S}\rightarrow \mathbb{R}_+$. $\gamma\in(0,1)$ denotes the discount factor[^2] for the reward. Reinforcement learning aims to maximize the expected cumulative reward value ($\mathbb{E}[\sum_t \gamma^t r_t]$) received by the agent.
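
For instance, assuming every reward is bounded by $|r_t| \le R_{\max}$, the discount factor guarantees that the cumulative reward is finite (see footnote 2):

$$\begin{aligned}
\Big|\sum_{t=0}^{\infty} \gamma^t r_t\Big| \le \sum_{t=0}^{\infty} \gamma^t R_{\max} = \frac{R_{\max}}{1-\gamma}
\end{aligned}$$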

The Markov property in a Markov decision process is defined as follows:

$$\begin{aligned}
\mathcal{T}(s_{t+1}|s_t) = \mathcal{T}(s_{t+1}|s_0, s_1, s_2, \dots, s_t)
\end{aligned}$$

That is, the transition to the next state depends only on the current state (it does not depend on earlier states). We can omit the action $a$ in the state transition function $\mathcal{T}$ here because the Markov property is a property of the environment's transition process and is independent of the decision process.
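
The following small numerical sketch illustrates the Markov property with a made-up three-state transition matrix `T`, where `T[s, s_next]` plays the role of $\mathcal{T}(s^\prime|s)$: sampling the next state requires only the current state, not the earlier history.

```python
import numpy as np

T = np.array([[0.9, 0.1, 0.0],   # each row sums to 1 and defines T(s' | s)
              [0.2, 0.7, 0.1],
              [0.0, 0.3, 0.7]])

rng = np.random.default_rng(0)
state, history = 0, [0]
for _ in range(5):
    state = rng.choice(3, p=T[state])  # depends on the current `state` only
    history.append(int(state))
print(history)
```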

Based on the Markov property, we can further deduce that the optimal decision at any given moment depends only on the latest state --- it does not depend on the entire decision history. This conclusion is of great significance for the design of reinforcement learning algorithms because it simplifies the search for the optimal policy.

[^1]: A Markov decision process is a decision process in which the next state depends only on the current state and action (it does not depend on historical states).

[^2]: The reward value at each subsequent step is multiplied by an increasing power of the discount factor, so that an infinite sequence of rewards has a finite sum.
