# Distributed Reinforcement Learning System

The distributed reinforcement learning system is more powerful than the single-node reinforcement learning system discussed earlier. It can process multiple models in multiple environments in parallel, meaning that it can update multiple models on multiple computing systems at the same time. This significantly accelerates the learning process and improves the overall performance of the reinforcement learning system. This section focuses on common algorithms and systems in distributed reinforcement learning.

## Distributed RL Algorithm -- A3C

Asynchronous Advantage Actor-Critic (A3C) was proposed by DeepMind researchers in 2016. The algorithm updates networks on multiple computing devices in parallel. Unlike a single-node reinforcement learning system, A3C creates a group of workers, allocates them to different computing devices, and creates an interactive environment for each worker so that sampling and model updates run in parallel. In addition, it uses a master node to update the actor networks (policy networks) and critic networks (value networks), which correspond to the policy and value functions in reinforcement learning, respectively. Under this design, each worker sends the gradients computed from its collected samples to the master node in real time in order to update the parameters on the master node; the updated parameters are then transferred back to each worker in real time for model synchronization. Each worker can perform its computation on a GPU, so the entire algorithm updates the model in parallel on a GPU cluster. Figure :numref:`ch011/ch11-a3c` depicts the algorithm structure. Research shows that, in addition to accelerating model learning, distributed reinforcement learning helps stabilize learning performance, because the gradients are computed from samples collected in the environments of multiple nodes.

![A3C distributed algorithm architecture](../img/ch11/ch11-a3c.pdf)
:label:`ch011/ch11-a3c`
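
To make the update flow concrete, here is a minimal sketch of an A3C-style asynchronous update loop in PyTorch. The network, the random tensors that stand in for environment rollouts, and the `worker` function are all illustrative assumptions; a production implementation would also share the optimizer state across workers and interact with real environments.

```python
import torch
import torch.nn as nn
import torch.multiprocessing as mp


class ActorCritic(nn.Module):
    """Tiny actor-critic network: shared body, policy head, value head."""

    def __init__(self, obs_dim=4, n_actions=2):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU())
        self.policy = nn.Linear(64, n_actions)  # actor head (logits)
        self.value = nn.Linear(64, 1)           # critic head

    def forward(self, obs):
        h = self.body(obs)
        return self.policy(h), self.value(h)


def worker(global_net, steps=50):
    """Each worker keeps a local copy of the network and its own environment."""
    local_net = ActorCritic()
    # The optimizer updates the *shared* parameters of the master network.
    optimizer = torch.optim.Adam(global_net.parameters(), lr=1e-3)
    for _ in range(steps):
        # 1. Pull the latest parameters from the master network.
        local_net.load_state_dict(global_net.state_dict())

        # 2. Collect a small batch from the worker's own environment
        #    (random tensors stand in for real observations and returns).
        obs = torch.randn(8, 4)
        returns = torch.randn(8, 1)
        logits, values = local_net(obs)
        actions = torch.distributions.Categorical(logits=logits).sample()
        advantage = returns - values

        # 3. Compute the actor-critic loss and local gradients.
        log_probs = torch.log_softmax(logits, dim=-1).gather(1, actions.unsqueeze(1))
        loss = -(log_probs * advantage.detach()).mean() + advantage.pow(2).mean()
        local_net.zero_grad()
        loss.backward()

        # 4. Push the local gradients onto the shared parameters and update them.
        for lp, gp in zip(local_net.parameters(), global_net.parameters()):
            gp._grad = lp.grad
        optimizer.step()


if __name__ == "__main__":
    master = ActorCritic()
    master.share_memory()  # master parameters live in shared memory
    workers = [mp.Process(target=worker, args=(master,)) for _ in range(4)]
    for p in workers:
        p.start()
    for p in workers:
        p.join()
```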

## Distributed RL Algorithm -- IMPALA

Importance Weighted Actor-Learner Architecture (IMPALA) is a reinforcement learning framework proposed by Lasse Espeholt et al. in 2018 to implement clustered multi-machine training. Figure :numref:`ch011/ch11-impala` depicts this architecture. Like A3C, IMPALA computes gradients on multiple GPUs in parallel. In IMPALA, multiple actors and learners run in parallel. Each actor has a policy network and collects samples by interacting with its own environment. The actors send the collected sample trajectories to their respective learners for gradient computation. Among the learners, there is a master learner that communicates with the other learners to obtain their computed gradients and update its model. After the model is updated, it is delivered to the other learners and actors for a new round of sampling and gradient computation. As a distributed computing architecture, IMPALA has proved to be more efficient than A3C. It benefits from a specially designed gradient computation in the learners that uses the V-trace target, which stabilizes training with importance weights. Because the V-trace technique is not related to our area of focus here, we will not elaborate on it; interested readers can learn more from the original paper.

![IMPALA distributed algorithm architecture](../img/ch11/ch11-impala.pdf)
:label:`ch011/ch11-impala`
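
The key architectural difference from A3C is that IMPALA's actors ship whole sample trajectories rather than gradients. The following toy sketch, using plain Python threads and a queue, only illustrates this data flow; the environment, network, and V-trace correction are stubbed out and the names are made up for this example.

```python
import queue
import random
import threading


def actor(actor_id, traj_queue, policy_version, n_trajs=5):
    """Each actor rolls out its own environment copy and ships trajectories."""
    for _ in range(n_trajs):
        trajectory = {
            "actor_id": actor_id,
            "policy_version": policy_version[0],  # version used for sampling
            "steps": [(random.random(), random.randint(0, 1), random.random())
                      for _ in range(16)],        # (obs, action, reward) stubs
        }
        traj_queue.put(trajectory)


def learner(traj_queue, policy_version, n_updates=20):
    """The (master) learner consumes trajectories and updates the policy.

    In real IMPALA the learner corrects for the lag between the behaviour
    policy and the current policy with V-trace importance weights; here we
    only track how stale each trajectory is.
    """
    for _ in range(n_updates):
        traj = traj_queue.get()
        staleness = policy_version[0] - traj["policy_version"]
        # ... compute V-trace-corrected gradients from traj["steps"] here ...
        policy_version[0] += 1                    # "broadcast" new parameters


if __name__ == "__main__":
    traj_queue = queue.Queue(maxsize=64)
    policy_version = [0]                          # stand-in for shared parameters
    threads = [threading.Thread(target=actor, args=(i, traj_queue, policy_version))
               for i in range(4)]
    threads.append(threading.Thread(target=learner, args=(traj_queue, policy_version)))
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```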

## Other Algorithms

Apart from A3C and IMPALA, researchers have proposed other algorithms in recent studies, such as SEED and Ape-X, which further improve the efficiency of distributed reinforcement learning. Readers can find out more about these algorithms in the corresponding papers. Next, we move on to some typical distributed reinforcement learning algorithm libraries.

## Distributed RL System -- RLlib

RLlib is an open-source reinforcement learning framework oriented to industrial applications. It is built on Ray, a distributed computing framework initiated by several researchers from UC Berkeley, and is designed specifically for reinforcement learning. RLlib includes a library of reinforcement learning algorithms, which makes it convenient for users who are less experienced in reinforcement learning.

Figure :numref:`ch011/ch11-rllib-arch` shows the architecture of RLlib. Its bottom layer is built on Ray's basic components for distributed computing and communication. For reinforcement learning, basic components such as Trainer, Environment, and Policy are abstracted at the Python layer. Built-in implementations are provided for these abstract components, and users can extend the components according to the requirements of their algorithms. With these built-in and customized algorithm components, researchers can quickly implement specific reinforcement learning algorithms.

![RLlib architecture](../img/ch11/ch11-rllib-arch.png)
:label:`ch011/ch11-rllib-arch`
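
As an illustration of extending the Environment component, the sketch below registers a toy custom environment with RLlib. The `GridEnv` class and the name `grid_env` are made up for this example, and it assumes the classic `gym.Env` interface; exact APIs differ across Ray and Gym versions.

```python
import gym
import numpy as np
from ray.tune.registry import register_env


class GridEnv(gym.Env):
    """Toy 1-D grid world: walk right until reaching the goal cell."""

    def __init__(self, env_config=None):
        self.size = (env_config or {}).get("size", 5)
        self.observation_space = gym.spaces.Box(0.0, float(self.size),
                                                shape=(1,), dtype=np.float32)
        self.action_space = gym.spaces.Discrete(2)  # 0 = left, 1 = right
        self.pos = 0

    def reset(self):
        self.pos = 0
        return np.array([self.pos], dtype=np.float32)

    def step(self, action):
        self.pos = max(0, self.pos + (1 if action == 1 else -1))
        done = self.pos >= self.size
        reward = 1.0 if done else -0.01
        return np.array([self.pos], dtype=np.float32), reward, done, {}


# Register the environment under a name that RLlib configurations can refer to.
register_env("grid_env", lambda cfg: GridEnv(cfg))
```

Once registered, the string `"grid_env"` can be used as the environment setting of an RLlib trainer configuration in place of a built-in environment name.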

RLlib supports distributed reinforcement learning training under different paradigms. Figure :numref:`ch011/ch11-rllib-distributed` shows the distributed training architecture of a reinforcement learning algorithm based on synchronous sampling. Each rollout worker is an independent process that interacts with its corresponding environment to collect experience, and multiple rollout workers can interact with their environments in parallel. Trainers are responsible for coordinating the rollout workers, optimizing the policy, and synchronizing the updated policy back to the rollout workers.

![RLlib distributed training](../img/ch11/ch11-rllib-distributed.png)
:label:`ch011/ch11-rllib-distributed`

Reinforcement learning is usually based on deep neural networks. For distributed learning with such networks, we can combine RLlib with a deep learning framework such as PyTorch or TensorFlow. In this approach, the deep learning framework is responsible for training and updating the policy network, while RLlib handles the computation of the reinforcement learning algorithm. RLlib also supports interaction with parallel vectorized environments and pluggable simulators, as well as offline reinforcement learning.
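
The following minimal sketch shows how these pieces fit together, roughly following RLlib's pre-2.0 Python API (import paths and configuration keys have changed in newer Ray releases, so treat it as illustrative): `num_workers` controls how many rollout workers sample in parallel, and `framework` selects the deep learning library that trains the policy network.

```python
import ray
from ray.rllib.agents.ppo import PPOTrainer

ray.init()
trainer = PPOTrainer(
    env="CartPole-v1",
    config={
        "num_workers": 4,      # parallel rollout workers collecting experience
        "framework": "torch",  # let PyTorch handle the policy network
    },
)
for i in range(5):
    result = trainer.train()   # one synchronous sample/optimize iteration
    print(i, result["episode_reward_mean"])
ray.shutdown()
```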

## Distributed RL System -- Reverb and Acme

When it comes to managing the experience replay buffer, Reverb is an essential topic. At the beginning of this chapter, we introduced concepts such as state, action, and reward in reinforcement learning. In real-world applications, the data used for training comes from the samples stored in the experience buffer, and the operations performed on the data may vary with the data format. Common data operations include concatenation, truncation, product, transposition, partial product, and taking the mean or extreme values. These operations may be performed on different dimensions of the data, which poses a challenge for existing reinforcement learning frameworks. To make data of different formats flexible to use in reinforcement learning training, Reverb introduces the concept of the *chunk*: all training data is stored as chunks in the buffer for management and scheduling. This design takes advantage of the fact that the data consists of multidimensional tensors, and makes data usage faster and more flexible.
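
The sketch below, loosely following the usage shown in Reverb's documentation, starts a replay server with a single table, inserts one item, and samples it back; the table name, sizes, and inserted values are illustrative, and API details may differ across dm-reverb versions.

```python
import reverb

# A replay server with one table: uniform sampling, FIFO eviction.
server = reverb.Server(tables=[
    reverb.Table(
        name="replay_buffer",
        sampler=reverb.selectors.Uniform(),       # how samples are drawn
        remover=reverb.selectors.Fifo(),          # how items are evicted
        max_size=1000,
        rate_limiter=reverb.rate_limiters.MinSize(1),
    ),
])

client = reverb.Client(f"localhost:{server.port}")
# Insert an (observation, action, reward) item; Reverb stores the underlying
# tensors as chunks internally.
client.insert([0.0, 1, 0.5], priorities={"replay_buffer": 1.0})
print(list(client.sample("replay_buffer", num_samples=1)))
```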

DeepMind recently proposed a distributed reinforcement learning framework called Acme, which is designed for both academic research and industrial applications. It provides a faster distributed reinforcement learning solution based on a distributed sampling structure and Reverb's sample buffer management. Reverb solves the efficiency problem of data management and transfer, allowing Acme to fully leverage the efficiency of distributed computing. Researchers have used Acme to achieve significant speed gains on many reinforcement learning benchmarks.

# Introduction to Reinforcement Learning

## Background

As a branch of machine learning, reinforcement learning has attracted increasing attention in recent years. DeepMind proposed deep Q-learning in 2013, enabling AI to learn how to play video games from images. Since then, scientific institutions led by DeepMind have made remarkable achievements in reinforcement learning --- a representative example is AlphaGo, which defeated the world's top Go player Lee Sedol in 2016. Other significant achievements include AlphaStar (a StarCraft agent), OpenAI Five (a Dota 2 agent), Pluribus (an agent for Texas hold'em poker, a multi-player zero-sum game), and motion control algorithms for robot dogs. These achievements have been made possible by the rapid iteration and progress of reinforcement learning algorithms over the past few years. Data-hungry deep neural networks can fit well on the large amounts of data generated by simulators, thereby fully leveraging the capabilities of reinforcement learning algorithms and learning to perform comparably to, or even better than, human experts. Although originally applied to video games, reinforcement learning has since been gradually applied in a wider range of realistic and meaningful fields, including robot control, dexterous manipulation, energy system scheduling, network load distribution, and automatic trading of stocks or futures. Such applications have impacted traditional control methods and heuristic decision-making theory.

## Reinforcement Learning Components

The core of reinforcement learning is the process of continuously interacting with the environment to optimize the policy, with the aim of increasing the reward. This process is manifested as the selection of an *action* based on a specific *state*. The object that makes the decision is called the *agent*, and the impact of the decision is reflected in the *environment*. More specifically, the *state transition* and *reward* in the environment vary depending on the decision. The state transition, which can be either deterministic or stochastic, is a function that specifies how the environment moves from the current state to the next state. The reward, which is generally a scalar, is the environment's feedback on the agent's action. Figure :numref:`ch011/ch11-rl` shows this abstract process, which is the most common model description of reinforcement learning in the literature.

![Framework of reinforcement learning](../img/ch11/ch11-rl.pdf)
:label:`ch011/ch11-rl`
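
The interaction loop above can be summarized in a few lines of code. The sketch below uses a made-up `ToyEnv` stub and a random agent purely to illustrate the state, action, and reward cycle.

```python
import random


class ToyEnv:
    """Environment stub: the state is a counter, and the episode ends after 10 steps."""

    def reset(self):
        self.state = 0
        return self.state                      # initial observation

    def step(self, action):
        self.state += 1                        # state transition
        reward = 1.0 if action == 1 else 0.0   # scalar feedback from the environment
        done = self.state >= 10
        return self.state, reward, done


env = ToyEnv()
obs, done, total_reward = env.reset(), False, 0.0
while not done:
    action = random.choice([0, 1])             # the agent selects an action
    obs, reward, done = env.step(action)       # the environment responds
    total_reward += reward
print("episode return:", total_reward)
```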

Take video gaming as an example. A gamer needs to gradually become familiar with the game controls in order to achieve better results, and the process from getting started to gradually mastering the game is similar to the reinforcement learning process. At any given moment after the game starts, the game is in a specific state. From this state the gamer obtains an *observation* (e.g., the images on the screen of the game console), based on which the gamer performs an action (e.g., firing bullets) that changes the game state and moves the game to the next state (e.g., defeating a monster). Furthermore, the gamer learns the effect of the current action (e.g., defeating a monster yields a positive score, whereas losing to a monster yields a negative score). The gamer then selects a new action based on the observation of the next state, and repeats this process until the game ends. Through these repeated operations and observations, the gamer gradually masters the skills of the game. A reinforcement learning agent learns to play the game in a similar way.

However, several key points should be noted in this process. (1) The observation may not be equal to the state. It is generally a function of the state, and the mapping from state to observation may lose information. The environment is *fully observable* if the observation is equal to the state or if the state of the environment can be completely recovered from the observation; in all other cases, it is *partially observable*. (2) An action performed by the gamer may not produce immediate feedback; its effect may be delayed by many steps. Reinforcement learning models allow such delayed feedback. (3) In human learning, the feedback may not be a scalar. To convert the feedback received by the reinforcement learning agent into a scalar, called the reward value, we perform a mathematical abstraction on it. The reward value can be a function of the state, or a function of the state and action. The existence of a reward value is a basic assumption of reinforcement learning, and is also a major difference between reinforcement learning and supervised learning.

## Markov Decision Process

In reinforcement learning, the decision-making process is generally described by a Markov decision process[^1] and can be represented by a tuple $(\mathcal{S}, \mathcal{A}, R, \mathcal{T}, \gamma)$. $\mathcal{S}$ and $\mathcal{A}$ denote the state space and action space, respectively. $R$ denotes the reward function: $R(s,a)$: $\mathcal{S}\times \mathcal{A}\rightarrow \mathbb{R}$ gives the reward value for the current state $s\in\mathcal{S}$ and the current action $a\in\mathcal{A}$. The probability of transitioning from the current state and action to the next state is defined as $\mathcal{T}(s^\prime|s,a)$: $\mathcal{S}\times\mathcal{A}\times\mathcal{S}\rightarrow \mathbb{R}_+$. $\gamma\in(0,1)$ denotes the discount factor[^2] for the reward. Reinforcement learning aims to maximize the expected cumulative reward value ($\mathbb{E}[\sum_t \gamma^t r_t]$) received by the agent.
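
For instance, assuming every reward is bounded by $|r_t| \le R_{\max}$, the discount factor guarantees that the cumulative reward is finite (see footnote 2):

$$\begin{aligned}
\Big|\sum_{t=0}^{\infty} \gamma^t r_t\Big| \le \sum_{t=0}^{\infty} \gamma^t R_{\max} = \frac{R_{\max}}{1-\gamma}
\end{aligned}$$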

The Markov property in a Markov decision process is defined as follows:

$$\begin{aligned}
\mathcal{T}(s_{t+1}|s_t) = \mathcal{T}(s_{t+1}|s_0, s_1, s_2, \dots, s_t)
\end{aligned}$$

That is, the transition to the next state depends only on the current state (it does not depend on earlier states). We can omit the action $a$ in the state transition function $\mathcal{T}$ here because the Markov property is a property of the environment's transition process and is independent of the decision process.
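
The following small numerical sketch illustrates the Markov property with a made-up three-state transition matrix `T`, where `T[s, s_next]` plays the role of $\mathcal{T}(s^\prime|s)$: sampling the next state requires only the current state, not the earlier history.

```python
import numpy as np

T = np.array([[0.9, 0.1, 0.0],   # each row sums to 1 and defines T(s' | s)
              [0.2, 0.7, 0.1],
              [0.0, 0.3, 0.7]])

rng = np.random.default_rng(0)
state, history = 0, [0]
for _ in range(5):
    state = rng.choice(3, p=T[state])  # depends on the current `state` only
    history.append(int(state))
print(history)
```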

Based on the Markov property, we can further deduce that the optimal decision at any given moment depends only on the latest state --- it does not depend on the entire decision history. This conclusion is of great significance for the design of reinforcement learning algorithms because it simplifies the search for the optimal policy.

[^1]: A Markov decision process is a decision process in which the next state depends only on the current state and action (it does not depend on historical states).

[^2]: The reward value at each subsequent step is multiplied by an increasing power of the discount factor, so that an infinite sequence of rewards has a finite sum.
