Commit b03fa88

Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into fix_test_sendrecv_portbind

2 parents b853ac8 + fb8c1cf

19 files changed: +304 −66 lines

Lines changed: 94 additions & 1 deletion
@@ -1,3 +1,96 @@
# Recurrent Group Tutorial

## Overview

Sequential data is common in natural language processing.

A sentence is a sequence of words, and many sentences in turn form a paragraph. A paragraph can therefore be viewed as a nested sequence with two levels, where each element of the sequence is itself a sequence. That is to say, sequential data can be recursive. An example of two-level recursive sequential data is an article, which is composed of a sequence of sentences, each sentence in turn being a sequence of words.

PaddlePaddle and PaddlePaddle v2 support two-level recursive sequential data. A two-level sequence is a very flexible data form that helps us better describe more complex language data, such as paragraphs and multi-round dialogues. Based on two-level sequence input, we can design and build a flexible, hierarchical RNN model that encodes input data at both the word and sentence levels. For support of arbitrary nesting levels, please refer to PaddlePaddle Fluid.

In PaddlePaddle, `recurrent_group` is an arbitrarily complex RNN unit. The user only needs to define the calculation that the RNN completes in one time step; PaddlePaddle takes care of propagating information and errors through the time series.

Furthermore, `recurrent_group` can be extended to handle two-level sequences. By defining two nested `recurrent_group` operations, one at the clause level and one at the word level, a hierarchical and complex RNN is achieved.

Currently in PaddlePaddle, `recurrent_group` and a number of Layers can process double-layer (two-level) sequences. For details, refer to the document: <a href="hierarchical_layer_en.html">Layers for supporting double-layer sequences as input</a>.
## Related Concepts

### Basic Principle

`recurrent_group` is an arbitrarily complex RNN unit supported by PaddlePaddle. The user only needs to focus on the calculations that the RNN is designed to complete within a single time step; PaddlePaddle is responsible for propagating information and gradients through time.

In PaddlePaddle, a simple call to `recurrent_group` is as follows:

``` python
recurrent_group(step, input, reverse)
```

- step: a callable that defines the calculations the RNN unit completes within one time step
- input: the input, which must be a single-level or two-level sequence
- reverse: whether to process the input sequence in reverse order

The core of using `recurrent_group` is designing the logic of the step function. The step function can freely combine the various layers supported by PaddlePaddle to implement arbitrary computational logic. The input of `recurrent_group` becomes the input of the step function. Since the step function only concerns itself with the calculation for one time step of the RNN, `recurrent_group` handles splitting the original input data for us.
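The splitting-and-looping behaviour described above can be modelled with a short, purely conceptual sketch in plain Python (this is not the PaddlePaddle API; `step`, `input`, and `reverse` simply mirror the parameters listed above, and memory is ignored here):

``` python
def recurrent_group(step, input, reverse=False):
    # Conceptual model only: split the sequence into time steps,
    # apply the user-defined step function to each one, and
    # collect the per-step outputs into the output sequence.
    seq = list(reversed(input)) if reverse else list(input)
    return [step(x) for x in seq]

# A toy step function; any per-step logic could go here.
outputs = recurrent_group(lambda w: w.upper(), ["a", "b", "c"])
```

Here the user writes only the per-step logic; the splitting of `input` and the merging of per-step outputs happen inside `recurrent_group`, just as the text describes.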
### Input

The input sequences processed by `recurrent_group` fall into three types:

- **Input Data**: a two-level sequence fed into `recurrent_group` is disassembled into single-level sequences, and a single-level sequence is disassembled into non-sequence elements, before being passed to the step function. This process is completely transparent to the user. There are two possible sources: 1) user input via `data_layer`; 2) the output of another layer.

- **Read-only Memory Input**: `StaticInput` defines a read-only memory. Input specified by `StaticInput` is not disassembled by `recurrent_group`; every time step of the `recurrent_group` loop can always reference the entire input. It may be a non-sequence or a single-level sequence.

- **Input of a Sequence Generation Task**: `GeneratedInput` is only used to specify input data in a sequence generation task.
### Input Example

Sequence generation tasks mostly follow the encoder-decoder architecture. The encoder and decoder can be any neural network units capable of processing sequences, and RNNs are the most popular choice.

Given the encoder output and the current word, the decoder predicts the most likely next word at each step. In this structure, the decoder accepts two inputs:

- The target sequence to be generated: an input of the decoder and the basis of the decoder loop. `recurrent_group` will disassemble this input.

- The encoder output, a non-sequence or single-level sequence: an unbounded memory. Each time step of the decoder loop references the entire result, so it should not be disassembled. This type of input must be specified via `StaticInput`. For more discussion of unbounded memory, please refer to the paper [Neural Turing Machines](https://arxiv.org/abs/1410.5401).

In a sequence generation task, the decoder RNN always takes the word vector of the word predicted at the previous step as its current input. `GeneratedInput` automates this process.
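To make the two input roles concrete, here is a hypothetical plain-Python decoder loop (not PaddlePaddle code): `encoder_out` plays the role of the read-only `StaticInput` memory that every step sees in full, while feeding the previous prediction back in is the bookkeeping that `GeneratedInput` automates:

``` python
def greedy_decode(step, encoder_out, bos, eos, max_len=10):
    # step(prev_word, encoder_out) -> next word.
    # encoder_out is never disassembled: each iteration references it whole.
    word, result = bos, []
    for _ in range(max_len):
        word = step(word, encoder_out)  # previous prediction fed back in
        if word == eos:
            break
        result.append(word)
    return result
```

A toy `step` that simply walks a vocabulary list is enough to exercise the loop; in the real task `step` would be the decoder RNN's forward computation.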
### Output

The `step` function must return the output of one or more Layers. The outputs of these Layers become the final output of the entire `recurrent_group`. During output, `recurrent_group` concatenates the output of each time step, which is likewise transparent to the user.

### Memory

Memory can only be defined and used inside `recurrent_group`. Memory cannot exist independently; it must point to a layer defined by PaddlePaddle. Memory is referenced to fetch that layer's output at a previous moment, so memory can be understood as a delay operation.

The user can explicitly specify the output of a layer to initialize the memory. When not specified, memory is initialized to 0 by default.
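The delay interpretation can be sketched in plain Python (a toy model under the stated assumptions, not the actual layer mechanics): the memory carries the previous step's output into the current step, starting from the boot value:

``` python
def run_with_memory(step, xs, boot=0):
    # step(x, mem) -> output; the memory is a one-step delay of the
    # output, initialised from boot (0 by default, like an unset boot_layer).
    mem, outs = boot, []
    for x in xs:
        mem = step(x, mem)  # the current output becomes the next step's memory
        outs.append(mem)
    return outs
```

With `step = lambda x, m: x + m`, the memory turns an elementwise step into a running sum over the sequence.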
## Sequence-level RNN Introduction

`recurrent_group` helps us split the input sequence, merge the outputs, and loop our computational logic over the sequence.

Using this feature, two nested `recurrent_group` calls can handle nested two-level sequences, implementing a sequence-level RNN structure at both the word and sentence levels.

- Word-level RNN: each state corresponds to a word.
- Sequence-level RNN: a sequence-level RNN consists of multiple word-level RNNs. Each word-level RNN (i.e., each state of the sequence-level RNN) handles one subsequence.

For convenience of description, the following takes an NLP task as an example: a paragraph containing subsequences is defined as a two-level sequence, a sentence containing words is defined as a single-level sequence, and a word is then a zero-level sequence.
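Conceptually (again a plain-Python sketch rather than the real API), nesting two groups means the outer loop splits the two-level sequence into sentences and the inner loop splits each sentence into words:

``` python
def nested_group(inner_step, paragraph):
    # paragraph: a two-level sequence (a list of sentences).
    # The outer group hands each sentence (a single-level sequence)
    # to the inner group, which loops inner_step over its words.
    return [[inner_step(word) for word in sentence] for sentence in paragraph]

lengths = nested_group(len, [["deep", "learning"], ["is", "fun"]])
```

The nesting mirrors the data: one loop level per sequence level, with each word (the zero-level element) finally handled by the inner step.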
## Usage of Sequence-level RNN

### Usage in Training

Using `recurrent_group` requires the following conventions:

- **Single-level input, single-level output**: both input and output are single-level sequences.
  - If there are multiple inputs, the numbers of words in the different input sequences must be exactly equal.
  - The output is a single-level sequence with the same number of words as the input sequence.
  - memory: in the step function, define a memory that points to a layer; referencing the memory fetches that layer's output at a previous moment, forming a recurrent connection. The `is_seq` parameter of the memory must be `false`. If no memory is defined, the operations within each time step are independent.
  - boot_layer: the initial state of the memory, 0 by default. The `is_seq` parameter of the memory must be `false`.

- **Two-level input, two-level output**: both input and output are two-level sequences.
  - If there are multiple input sequences, the numbers of subsequences contained in the different inputs must be strictly equal, but the numbers of words in those subsequences may differ.
  - The output is a two-level sequence; its numbers of subsequences and of words are the same as those of a specified input sequence (the first input by default).
  - memory: in the step function, define a memory that points to a layer; referencing the memory fetches that layer's output at a previous moment, forming a recurrent connection. A memory defined in the step function of the outer `recurrent_group` records the state of the previous subsequence; it can be either a single-level sequence (only as read-only memory) or a word. If no memory is defined, the operations on different subsequences are independent.
  - boot_layer: the initial state of the memory, either a single-level sequence (only as read-only memory) or a vector. By default it is not set, i.e. the initial state is 0.

- **Two-level input, single-level output**: not supported for now; it fails with the error "In hierachical RNN, all out links should be from sequences now".
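The multi-input convention for the two-level case can be expressed as a small validation helper (an illustrative sketch, not part of PaddlePaddle): subsequence counts must match across inputs, while word counts inside the subsequences may differ:

``` python
def check_two_level_inputs(inputs):
    # inputs: a list of two-level sequences (each a list of subsequences).
    counts = {len(seq) for seq in inputs}
    if len(counts) != 1:
        raise ValueError("all inputs must have the same number of subsequences")
    return True
```

Note that the check deliberately ignores the lengths of the subsequences themselves, matching the convention above.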
### Usage in Generation

Using `beam_search` requires the following conventions:

- Word-level RNN: generate the next word from the current word.
- Sequence-level RNN: the subsequences generated by the single-level RNN are concatenated into a new two-level sequence. Semantically, there is no case in which a subsequence directly generates the next subsequence.

paddle/fluid/framework/details/nccl_all_reduce_op_handle.cc

Lines changed: 1 addition & 1 deletion
```diff
@@ -76,7 +76,7 @@ void NCCLAllReduceOpHandle::RunImpl() {
   }
 }

-std::string NCCLAllReduceOpHandle::Name() const { return "NCCL AllReduce"; }
+std::string NCCLAllReduceOpHandle::Name() const { return "nccl_all_reduce"; }
 }  // namespace details
 }  // namespace framework
 }  // namespace paddle
```

paddle/fluid/framework/details/nccl_all_reduce_op_handle.h

Lines changed: 7 additions & 0 deletions
```diff
@@ -14,6 +14,9 @@

 #pragma once

+#include <string>
+#include <vector>
+
 #include "paddle/fluid/framework/details/op_handle_base.h"
 #include "paddle/fluid/framework/lod_tensor.h"
 #include "paddle/fluid/framework/scope.h"
@@ -34,6 +37,10 @@ struct NCCLAllReduceOpHandle : public OpHandleBase {

   std::string Name() const override;

+  // Delay and buffer nccl_all_reduce together can significantly increase
+  // performance. Disable this feature by returning false.
+  bool IsMultiDeviceTransfer() override { return true; };
+
 protected:
   void RunImpl() override;
 };
```

paddle/fluid/framework/details/op_handle_base.h

Lines changed: 6 additions & 0 deletions
```diff
@@ -13,6 +13,8 @@
 // limitations under the License.

 #pragma once
+#include <string>
+#include <vector>

 #include "paddle/fluid/framework/details/var_handle.h"
 #include "paddle/fluid/platform/device_context.h"
@@ -53,6 +55,10 @@ class OpHandleBase {

   void AddOutput(VarHandleBase *out);

+  // If the Op involves data transfer of multiple devices that
+  // will likely block other computations.
+  virtual bool IsMultiDeviceTransfer() { return false; }
+
 protected:
   virtual void RunImpl() = 0;
 };
```

paddle/fluid/framework/details/threaded_ssa_graph_executor.cc

Lines changed: 50 additions & 12 deletions
```diff
@@ -23,22 +23,36 @@ ThreadedSSAGraphExecutor::ThreadedSSAGraphExecutor(
     size_t num_threads, bool use_event,
     const std::vector<Scope *> &local_scopes,
     const std::vector<platform::Place> &places,
-    std::unique_ptr<SSAGraph> &&graph)
+    std::unique_ptr<SSAGraph> &&graph, bool allow_op_delay)
     : SSAGraphExecutor(std::move(graph)),
       pool_(num_threads >= 2 ? new ::ThreadPool(num_threads) : nullptr),
       local_scopes_(local_scopes),
       places_(places),
       fetch_ctxs_(places),
-      use_event_(use_event) {}
+      use_event_(use_event),
+      running_ops_(0),
+      allow_op_delay_(allow_op_delay) {}
+
+void ThreadedSSAGraphExecutor::RunDelayedOps(
+    const std::unordered_set<OpHandleBase *> &delayed_ops) {
+  for (auto op : delayed_ops) {
+    op->Run(use_event_);
+  }
+}

 FeedFetchList ThreadedSSAGraphExecutor::Run(
     const std::vector<std::string> &fetch_tensors) {
   std::unordered_map<OpHandleBase *, size_t> pending_ops;
   std::unordered_set<VarHandleBase *> pending_vars;
-
   BlockingQueue<VarHandleBase *> ready_vars;
-
   std::unordered_set<OpHandleBase *> ready_ops;
+  // For ops (e.g. nccl_all_reduce) that need to coordinate multiple
+  // streams from multiple GPUs, it's faster to buffer them and schedule
+  // together since we currently cannot overlap computation and memcpy streams.
+  // Should revisit it if overlapping is available.
+  std::unordered_set<OpHandleBase *> delayed_ops;
+  std::unordered_set<OpHandleBase *> blocked_by_delayed_ops;
+  std::unordered_set<VarHandleBase *> delayed_vars;

   auto InsertPendingVar = [&pending_vars, &ready_vars](VarHandleBase &var) {
     pending_vars.insert(&var);
@@ -106,7 +120,14 @@ FeedFetchList ThreadedSSAGraphExecutor::Run(

   auto run_all_ready_ops = [&] {
     for (auto *op : ready_ops) {
-      RunOp(ready_vars, op);
+      if (op->IsMultiDeviceTransfer() && allow_op_delay_) {
+        delayed_ops.insert(op);
+        delayed_vars.insert(op->outputs_.begin(), op->outputs_.end());
+        ready_vars.Extend(op->outputs_);
+        continue;
+      }
+      running_ops_++;
+      RunOp(&ready_vars, op);
     }
     ready_ops.clear();
   };
@@ -118,13 +139,13 @@ FeedFetchList ThreadedSSAGraphExecutor::Run(
   }

   // Step 3. Execution
-  while (!pending_vars.empty()) {
+  while (!pending_vars.empty() || !ready_ops.empty() || !delayed_ops.empty()) {
     // 1. Run All Ready ops
     run_all_ready_ops();

     // 2. Find ready variable
     bool timeout;
-    auto cur_ready_vars = ready_vars.PopAll(1000, &timeout);
+    auto cur_ready_vars = ready_vars.PopAll(1, &timeout);

     if (timeout) {
       if (exception_) {
@@ -141,13 +162,29 @@ FeedFetchList ThreadedSSAGraphExecutor::Run(
         auto &deps = pending_ops[op];
         --deps;
         if (deps == 0) {
-          ready_ops.insert(op);
+          if (delayed_vars.find(ready_var) != delayed_vars.end()) {
+            blocked_by_delayed_ops.insert(op);
+          } else {
+            ready_ops.insert(op);
+          }
         }
       }
     }
+    // When there are no other ops to schedule, schedule buffered delayed
+    // ops and unblock other ops.
+    if (ready_ops.empty() && !delayed_ops.empty() && running_ops_ == 0) {
+      RunDelayedOps(delayed_ops);
+      delayed_ops.clear();
+      for (auto *op : blocked_by_delayed_ops) {
+        ready_ops.insert(op);
+      }
+      blocked_by_delayed_ops.clear();
+    }
     // Keep loop until all vars are ready.
   }
-
+  PADDLE_ENFORCE(ready_ops.empty());
+  PADDLE_ENFORCE(delayed_ops.empty());
+  PADDLE_ENFORCE(blocked_by_delayed_ops.empty());
   ++computation_count_;

   auto sync_computation = [&] {
@@ -182,12 +219,13 @@ FeedFetchList ThreadedSSAGraphExecutor::Run(
 }

 void ThreadedSSAGraphExecutor::RunOp(
-    BlockingQueue<VarHandleBase *> &ready_var_q, details::OpHandleBase *op) {
-  auto op_run = [&ready_var_q, op, this] {
+    BlockingQueue<VarHandleBase *> *ready_var_q, details::OpHandleBase *op) {
+  auto op_run = [ready_var_q, op, this] {
     try {
       VLOG(10) << op->Name() << " : " << op->DebugString();
       op->Run(use_event_);
-      ready_var_q.Extend(op->outputs_);
+      running_ops_--;
+      ready_var_q->Extend(op->outputs_);
     } catch (platform::EnforceNotMet ex) {
       exception_.reset(new platform::EnforceNotMet(ex));
     } catch (...) {
```
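The scheduling change in this file can be summarised with a toy sequential model (an assumption-heavy sketch: the real executor is multithreaded and event-driven, and only the ordering logic is kept here). Ops flagged as multi-device transfers are buffered and flushed together once nothing else can make progress, mirroring `delayed_ops` and `allow_op_delay_`:

``` python
def run_graph(ops, deps, allow_op_delay=True):
    # ops:  {name: {"multi_device": bool}}; deps: {name: set of prerequisites}.
    # Returns the order in which ops would run under the delay policy.
    done, delayed, order = set(), [], []
    pending = dict(ops)
    while pending or delayed:
        ready = [n for n in pending if deps.get(n, set()) <= done]
        progressed = False
        for n in ready:
            if allow_op_delay and pending[n]["multi_device"]:
                delayed.append(n)  # buffer multi-device transfer ops
            else:
                order.append(n)
                done.add(n)
                progressed = True
            del pending[n]
        if not progressed and delayed:
            # Nothing else can run: flush the buffered transfers together.
            for n in delayed:
                order.append(n)
                done.add(n)
            delayed.clear()
    return order
```

With two transfer ops that both depend on `a`, the model schedules them back-to-back only after all other runnable work, which is the batching effect the comment in the diff describes.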
paddle/fluid/framework/details/threaded_ssa_graph_executor.h

Lines changed: 13 additions & 3 deletions
```diff
@@ -14,7 +14,12 @@

 #pragma once

-#include <chrono>
+#include <deque>
+#include <string>
+#include <unordered_set>
+#include <utility>
+#include <vector>
+
 #include <functional>
 #include "ThreadPool.h"  // ThreadPool in thrird party
 #include "paddle/fluid/framework/details/ssa_graph_executor.h"
@@ -70,7 +75,8 @@ class ThreadedSSAGraphExecutor : public SSAGraphExecutor {
   ThreadedSSAGraphExecutor(size_t num_threads, bool use_event,
                           const std::vector<Scope *> &local_scopes,
                           const std::vector<platform::Place> &places,
-                          std::unique_ptr<SSAGraph> &&graph);
+                          std::unique_ptr<SSAGraph> &&graph,
+                          bool allow_op_delay);

   // Run a SSAGraph by a thread pool
   // Use topological sort algorithm
@@ -79,16 +85,20 @@ class ThreadedSSAGraphExecutor : public SSAGraphExecutor {
   ~ThreadedSSAGraphExecutor() {}

  private:
-  void RunOp(BlockingQueue<VarHandleBase *> &ready_var_q,
+  void RunOp(BlockingQueue<VarHandleBase *> *ready_var_q,
             details::OpHandleBase *op);

+  void RunDelayedOps(const std::unordered_set<OpHandleBase *> &delayed_ops);
+
  private:
   std::unique_ptr<::ThreadPool> pool_;
   std::vector<Scope *> local_scopes_;
   std::vector<platform::Place> places_;
   platform::DeviceContextPool fetch_ctxs_;
   const bool use_event_;
   std::unique_ptr<platform::EnforceNotMet> exception_;
+  std::atomic<int> running_ops_;
+  bool allow_op_delay_;

   size_t computation_count_{0};
   size_t max_async_computation{100};
```

paddle/fluid/framework/executor.cc

Lines changed: 15 additions & 0 deletions
```diff
@@ -279,6 +279,21 @@ std::unique_ptr<ExecutorPrepareContext> Executor::Prepare(
   return std::unique_ptr<ExecutorPrepareContext>(ctx);
 }

+std::vector<std::shared_ptr<ExecutorPrepareContext>> Executor::Prepare(
+    const ProgramDesc& program, const std::vector<int>& block_ids) {
+  std::vector<std::shared_ptr<ExecutorPrepareContext>> result;
+  for (auto& bid : block_ids) {
+    auto* ctx = new ExecutorPrepareContext(program, bid);
+    PADDLE_ENFORCE_LT(static_cast<size_t>(bid), program.Size());
+    auto& block = program.Block(bid);
+    for (auto& op_desc : block.AllOps()) {
+      ctx->ops_.push_back(OpRegistry::CreateOp(*op_desc));
+    }
+    result.push_back(std::shared_ptr<ExecutorPrepareContext>(ctx));
+  }
+  return result;
+}
+
 void Executor::RunPreparedContext(ExecutorPrepareContext* ctx, Scope* scope,
                                   bool create_local_scope, bool create_vars) {
   auto& block = ctx->prog_.Block(ctx->block_id_);
```

paddle/fluid/framework/executor.h

Lines changed: 3 additions & 0 deletions
```diff
@@ -61,6 +61,9 @@ class Executor {
   static std::unique_ptr<ExecutorPrepareContext> Prepare(
       const ProgramDesc& program, int block_id);

+  static std::vector<std::shared_ptr<ExecutorPrepareContext>> Prepare(
+      const ProgramDesc& program, const std::vector<int>& block_ids);
+
   void RunPreparedContext(ExecutorPrepareContext* ctx, Scope* scope,
                           bool create_local_scope = true,
                           bool create_vars = true);
```

paddle/fluid/framework/parallel_executor.cc

Lines changed: 5 additions & 3 deletions
```diff
@@ -13,6 +13,7 @@ See the License for the specific language governing permissions and
 limitations under the License. */

 #include "paddle/fluid/framework/parallel_executor.h"
+#include "paddle/fluid/platform/profiler.h"

 #include <string>
 #include <vector>
@@ -47,7 +48,7 @@ ParallelExecutor::ParallelExecutor(
     const std::vector<platform::Place> &places,
     const std::unordered_set<std::string> &params,
     const ProgramDesc &startup_program, const ProgramDesc &main_program,
-    const std::string &loss_var_name, Scope *scope)
+    const std::string &loss_var_name, Scope *scope, bool allow_op_delay)
     : member_(new ParallelExecutorPrivate(places)) {
   member_->global_scope_ = scope;

@@ -82,8 +83,8 @@ ParallelExecutor::ParallelExecutor(
   auto graph = builder.Build(main_program);

   member_->executor_.reset(new details::ThreadedSSAGraphExecutor(
-      num_threads, use_event, member_->local_scopes_, places,
-      std::move(graph)));
+      num_threads, use_event, member_->local_scopes_, places, std::move(graph),
+      allow_op_delay));

   // Step 3. Create vars in each scope;
   for (auto *scope : member_->local_scopes_) {
@@ -151,6 +152,7 @@ void ParallelExecutor::BCastParamsToGPUs(

 void ParallelExecutor::Run(const std::vector<std::string> &fetch_tensors,
                            const std::string &fetched_var_name) {
+  platform::RecordBlock b(0);
   auto fetch_data = member_->executor_->Run(fetch_tensors);
   *member_->global_scope_->Var(fetched_var_name)->GetMutable<FeedFetchList>() =
       fetch_data;
```
