Commit 1a9f4e5

update reader doc
1 parent 6ea4582 commit 1a9f4e5

3 files changed

+78
-19
lines changed

doc/design/cpp_data_feeding.md

Lines changed: 78 additions & 19 deletions
@@ -2,11 +2,15 @@
22

33
While using Paddle V2 API for training, data feeding completely depends on the Python code. To get rid of the Python environment and achieve the goal of "wrapping the whole training by a while loop op" in Paddle Fluid, a C++ data feeding mechanism is required.
44

5-
In this document we show the fundamental design of a C++ data feeding process, which includes data reading, shuffling and batching.
5+
In this document, we show the fundamental design of a C++ data feeding process, which includes data reading, shuffling and batching.
6+
7+
## Overview
8+
9+
![](images/readers.pdf)
610

711
## Reader
812

9-
In order to handle the above mentioned problem, a new concept called 'Reader' is introduced. `Reader` is a series of inherited classes which can be held by our `Variable` and they are used to read or process file data.
13+
To handle the above-mentioned problem, a new concept called `Reader` is introduced. `Reader` is a family of derived classes that can be held by our `Variable` and that are used to read or process file data.
1014

1115

1216
### `ReaderBase`
@@ -26,7 +30,7 @@ class ReaderBase {
2630
// Reinitializes the reader and reads the file from the beginning.
2731
virtual void ReInit() = 0;
2832

29-
virtual ~ReaderBase() {}
33+
virtual ~ReaderBase();
3034
};
3135
```
3236
@@ -37,24 +41,21 @@ class ReaderBase {
3741
```cpp
3842
class FileReader : public ReaderBase {
3943
public:
40-
explicit FileReader(const std::vector<DDim>& shapes) : shapes_(shapes) {}
41-
42-
void ReadNext(std::vector<LoDTensor>* out) override final {
43-
ReadNextImpl(out);
44-
CheckShapes(out);
45-
}
44+
explicit FileReader(const std::vector<DDim>& dims);
4645
47-
virtual void ReadNextImpl(std::vector<LoDTensor>* out) = 0;
46+
void ReadNext(std::vector<LoDTensor>* out) override;
4847
4948
protected:
50-
// Checks whether the out shapes is consistent with shapes_
51-
CheckShape(const std::vector<LoDTensor>* out);
49+
virtual void ReadNextImpl(std::vector<LoDTensor>* out) = 0;
5250
53-
std::vector<DDim> shapes_;
51+
private:
52+
std::vector<DDim> dims_;
5453
};
5554
```
5655

57-
A file reader binds with a single file, and reads one instance of data from the file at a time. Each type of file reader shall implement its own `ReadNextImpl()`, `HasNext()` and `ReInit()`.
56+
A file reader binds with a single file and reads one instance of data from the file at a time. Each type of file reader shall implement its own `ReadNextImpl()`, `HasNext()` and `ReInit()`.
57+
58+
`ReadNextImpl()` is invoked by `ReadNext()`. Besides invoking `ReadNextImpl()`, `ReadNext()` is also in charge of checking the output, making sure that the shape of each `LoDTensor` in `*out` is consistent with the corresponding entry in `dims_`.
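The relationship between `ReadNext()` and `ReadNextImpl()` can be illustrated with a minimal, self-contained sketch. It is not the Paddle implementation: `Dims` and `Tensor` are simplified stand-ins for `DDim` and `LoDTensor`, `ToyFileReader` is a hypothetical in-memory reader, and reporting a mismatch by throwing `std::runtime_error` is an assumption made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <utility>
#include <vector>

// Stand-ins for Paddle's DDim and LoDTensor (assumptions, not the real types).
using Dims = std::vector<int64_t>;
struct Tensor {
  Dims dims;
  std::vector<float> data;
};

class ReaderBase {
 public:
  virtual void ReadNext(std::vector<Tensor>* out) = 0;
  virtual bool HasNext() const = 0;
  virtual void ReInit() = 0;
  virtual ~ReaderBase() = default;
};

class FileReader : public ReaderBase {
 public:
  explicit FileReader(std::vector<Dims> dims) : dims_(std::move(dims)) {}

  // Delegates the actual read to the subclass, then validates the result.
  void ReadNext(std::vector<Tensor>* out) override {
    assert(out != nullptr);
    ReadNextImpl(out);
    if (out->size() != dims_.size()) {
      throw std::runtime_error("unexpected number of output tensors");
    }
    for (std::size_t i = 0; i < dims_.size(); ++i) {
      if ((*out)[i].dims != dims_[i]) {
        throw std::runtime_error("output shape does not match dims_");
      }
    }
  }

 protected:
  virtual void ReadNextImpl(std::vector<Tensor>* out) = 0;

 private:
  std::vector<Dims> dims_;
};

// Hypothetical reader producing `total` in-memory instances of shape {2}.
class ToyFileReader : public FileReader {
 public:
  explicit ToyFileReader(int total)
      : FileReader(std::vector<Dims>{Dims{2}}), total_(total) {}
  bool HasNext() const override { return next_ < total_; }
  void ReInit() override { next_ = 0; }

 protected:
  void ReadNextImpl(std::vector<Tensor>* out) override {
    const float v = static_cast<float>(next_++);
    out->assign(1, Tensor{Dims{2}, {v, v}});
  }

 private:
  int total_;
  int next_ = 0;
};
```

Keeping the check in the base class means a concrete reader only has to produce data; it cannot skip the shape validation by accident.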
5859

5960
### DecoratedReader
6061

@@ -63,23 +64,34 @@ A decorated reader takes another reader(both file reader and decorated reader ar
6364
```cpp
6465
class DecoratedReader : public ReaderBase {
6566
public:
66-
explicit DecoratedReader(ReaderBase* reader) : reader_(reader) {
67+
explicit DecoratedReader(ReaderBase* reader) : ReaderBase(), reader_(reader) {
6768
PADDLE_ENFORCE_NOT_NULL(reader_);
6869
}
6970

7071
void ReInit() override { reader_->ReInit(); }
7172

73+
bool HasNext() const override { return reader_->HasNext(); }
74+
7275
protected:
7376
ReaderBase* reader_;
7477
};
7578
```
7679
7780
All `FileReader`s and `DecoratedReader`s share exactly the same interfaces as defined in `ReaderBase`, so a reader can be decorated more than once: we can **shuffle** a reader's outputs and then **batch** the shuffled outputs. The interface consistency also allows related ops to use readers without knowing exactly what they are.
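Because every decorator is itself a `ReaderBase`, decorators can be stacked freely. The sketch below shows the stacking with a batching decorator only (shuffling is omitted for brevity). `Tensor` is a simplified stand-in for `LoDTensor`, and `RangeReader` and this `BatchReader` are hypothetical classes written for the example, not the ones in the code base.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for LoDTensor: a flat float buffer (an assumption for brevity).
using Tensor = std::vector<float>;

class ReaderBase {
 public:
  virtual void ReadNext(std::vector<Tensor>* out) = 0;
  virtual bool HasNext() const = 0;
  virtual void ReInit() = 0;
  virtual ~ReaderBase() = default;
};

class DecoratedReader : public ReaderBase {
 public:
  explicit DecoratedReader(ReaderBase* reader) : reader_(reader) {
    assert(reader_ != nullptr);
  }
  void ReInit() override { reader_->ReInit(); }
  bool HasNext() const override { return reader_->HasNext(); }

 protected:
  ReaderBase* reader_;  // not owned
};

// Hypothetical source: yields the instances 0, 1, ..., total - 1.
class RangeReader : public ReaderBase {
 public:
  explicit RangeReader(int total) : total_(total) {}
  void ReadNext(std::vector<Tensor>* out) override {
    out->assign(1, Tensor{static_cast<float>(next_++)});
  }
  bool HasNext() const override { return next_ < total_; }
  void ReInit() override { next_ = 0; }

 private:
  int total_;
  int next_ = 0;
};

// Hypothetical decorator: concatenates up to `batch_size` reads per slot.
class BatchReader : public DecoratedReader {
 public:
  BatchReader(ReaderBase* reader, int batch_size)
      : DecoratedReader(reader), batch_size_(batch_size) {}

  void ReadNext(std::vector<Tensor>* out) override {
    out->clear();
    std::vector<Tensor> instance;
    for (int i = 0; i < batch_size_ && reader_->HasNext(); ++i) {
      reader_->ReadNext(&instance);
      if (out->empty()) out->resize(instance.size());
      for (std::size_t slot = 0; slot < instance.size(); ++slot) {
        (*out)[slot].insert((*out)[slot].end(), instance[slot].begin(),
                            instance[slot].end());
      }
    }
  }

 private:
  int batch_size_;
};
```

In this sketch a `BatchReader` wrapped around another `BatchReader` works without any special casing, since the outer one only sees the `ReaderBase` interface.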
7881
79-
### ThreadedReader
82+
### MultipleReader
8083
84+
Every `FileReader` is bound to a single file and is single-threaded. However, sometimes we need to read data from more than one file. In this case, it's not enough to only have `FileReader` and `DecoratedReader`.
8185
82-
### `ReaderHolder`
86+
So `MultipleReader` is introduced. It is also derived from `ReaderBase`. A `MultipleReader` holds several prefetching `FileReader`s, and these readers run concurrently. Another pivotal part of a `MultipleReader` is a buffer channel. The channel collects the data yielded by all prefetching readers, so that subsequent ops or decorated readers can fetch data without worrying about how the multiple readers are scheduled.
87+
88+
![](images/multiple_reader.pdf)
89+
90+
This graph shows how a `MultipleReader` works with three prefetching file readers and two GPUs. There is a queue of files waiting to be read. Whenever a prefetching file reader becomes free (it has finished reading one file), it fetches a new file from the queue. Each prefetching file reader runs in a separate prefetch thread and dumps its outputs to the same channel.
91+
92+
To the subsequent two decorated readers, the `MultipleReader` is **a single reader**. They don't need to care about how the prefetching readers are scheduled. They only need to invoke `MultipleReader::ReadNext()` to get the next data from the buffer channel.
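The scheduling described above (a shared file queue, several prefetch threads, one channel) might be sketched as follows. This is a toy model under several assumptions: a "file" is an in-memory `std::vector` of strings, the channel is an unbounded `std::queue` guarded by a mutex (a real buffer channel would likely be bounded), and `ReadNext()` returns `false` at end of data instead of pairing with `HasNext()`. The names `Instance`, `File`, `Prefetch` and `thread_num` are hypothetical.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <utility>
#include <vector>

using Instance = std::string;        // stand-in for std::vector<LoDTensor>
using File = std::vector<Instance>;  // stand-in for a file on disk

class MultipleReader {
 public:
  MultipleReader(std::vector<File> files, std::size_t thread_num)
      : files_(std::move(files)), active_(thread_num) {
    workers_.reserve(thread_num);
    for (std::size_t i = 0; i < thread_num; ++i) {
      workers_.emplace_back([this] { Prefetch(); });
    }
  }

  ~MultipleReader() {
    for (std::thread& worker : workers_) worker.join();
  }

  // Blocks until data arrives; returns false once every file is drained.
  bool ReadNext(Instance* out) {
    assert(out != nullptr);
    std::unique_lock<std::mutex> lock(mu_);
    cv_.wait(lock, [this] { return !channel_.empty() || active_ == 0; });
    if (channel_.empty()) return false;
    *out = std::move(channel_.front());
    channel_.pop();
    return true;
  }

 private:
  // Body of one prefetch thread: claim a file, push its instances, repeat.
  void Prefetch() {
    for (;;) {
      std::size_t idx;
      {
        std::lock_guard<std::mutex> lock(mu_);
        if (next_file_ == files_.size()) {
          --active_;  // no files left: this worker retires
          cv_.notify_all();
          return;
        }
        idx = next_file_++;
      }
      for (const Instance& inst : files_[idx]) {
        std::lock_guard<std::mutex> lock(mu_);
        channel_.push(inst);
        cv_.notify_one();
      }
    }
  }

  std::mutex mu_;
  std::condition_variable cv_;
  std::queue<Instance> channel_;
  std::vector<File> files_;  // read-only after construction
  std::size_t next_file_ = 0;
  std::size_t active_;
  std::vector<std::thread> workers_;
};
```

The consumer sees one stream of instances; the order in which instances from different files interleave depends on thread scheduling and is not deterministic.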
93+
94+
### ReaderHolder
8395
8496
Different readers belong to different class types. This leads to a problem: How can we drop them into `Variable`s and fetch them out by a unified method? For example, if a Variable holds a `BatchReader`, we cannot get it by the following code:
8597
@@ -101,10 +113,57 @@ To solve this problem, we introduce `ReaderHolder` as a wrapper. It acts as an e
101113

102114
To create and invoke readers, some new ops are introduced:
103115

104-
### `CreateReaderOp`
116+
### CreateReaderOp
105117

106118
Each reader has its own creation op. File readers' creation ops have no input and yield the created file reader as their output. Decorated readers' creation ops take the underlying readers as inputs and then yield new decorated readers.
107119

108-
### `ReadOp`
120+
However, direct usage of file readers' creation ops is not recommended because a file reader can only read one file via a single thread. Using `OpenFilesOp` is a better choice.
121+
122+
### OpenFilesOp
123+
124+
The `OpenFilesOp` is the creation op of `MultipleReader`. It takes no input but requires a list of file names as one of its attributes. The newly created `MultipleReader` then creates corresponding prefetching readers according to file formats.
125+
126+
### HasNextOp
127+
128+
`HasNextOp` is used to check whether the next data batch exists via the reader's `HasNext()` interface.
129+
130+
### ResetOp
131+
132+
`ResetOp` is used to reset a reader via its `ReInit()` interface.
133+
134+
### ReadOp
109135

110136
A reader is only a Variable. It cannot trigger the reading process by itself. So we add the `ReadOp` to execute it. A `ReadOp` takes a reader Variable as its input. Each time it runs, it invokes the reader's `ReadNext()` function and gets a new batch of data (or only one instance of data, if we use a file reader directly). The output data of a reader are in the form of `std::vector<LoDTensor>`, so the `ReadOp` also needs to split the vector and move the LoDTensors to their respective output Variables.
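The "split and move" step of `ReadOp` can be pictured with the following sketch. It is only an illustration: `Tensor` and `Variable` are simplified stand-ins for the Paddle types, `PairReader` is a hypothetical reader that yields a feature and a label, and `RunReadOp` is a free function invented for the example rather than the operator's real entry point.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <utility>
#include <vector>

using Tensor = std::vector<float>;  // stand-in for LoDTensor
struct Variable {                   // stand-in for Paddle's Variable
  Tensor tensor;
};

class ReaderBase {
 public:
  virtual void ReadNext(std::vector<Tensor>* out) = 0;
  virtual bool HasNext() const = 0;
  virtual void ReInit() = 0;
  virtual ~ReaderBase() = default;
};

// Hypothetical reader yielding two slots per call: a feature and a label.
class PairReader : public ReaderBase {
 public:
  void ReadNext(std::vector<Tensor>* out) override {
    out->clear();
    out->push_back(Tensor{1.0f, 2.0f});
    out->push_back(Tensor{0.0f});
    done_ = true;
  }
  bool HasNext() const override { return !done_; }
  void ReInit() override { done_ = false; }

 private:
  bool done_ = false;
};

// Reads once, then moves each tensor into its own output variable.
void RunReadOp(ReaderBase* reader, const std::vector<Variable*>& outputs) {
  assert(reader != nullptr);
  std::vector<Tensor> data;
  reader->ReadNext(&data);
  if (data.size() != outputs.size()) {
    throw std::runtime_error("reader output count does not match op outputs");
  }
  for (std::size_t i = 0; i < data.size(); ++i) {
    outputs[i]->tensor = std::move(data[i]);  // move, do not copy
  }
}
```

For instance, calling `RunReadOp(&reader, {&feature, &label})` on a `PairReader` leaves the two-element tensor in `feature` and the one-element tensor in `label`.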
137+
138+
## Program with Readers
139+
140+
A `Program` holds readers as its persistable variables. These variables are created by `CreateReaderOp` or `OpenFilesOp`. These ops shall run only once, so they shall be placed in the `startup_program`. `HasNextOp`, `ResetOp` and `ReadOp` are required by the training loop, so they shall be in the `main_program`.
141+
142+
The ops of a `startup_program` with readers would look like this:
143+
144+
```
145+
multiple_reader = open_files_op(...)
146+
batch_reader = create_batch_reader_op(multiple_reader)
147+
double_buffer_reader = create_double_buffer_op(batch_reader)
148+
... (other initializers)
149+
```
150+
151+
The forward ops of the corresponding `main_program` would look like this:
152+
153+
```
154+
while_op {
155+
has_next = has_next_op(double_buffer_reader)
156+
if_else_op(has_next) {
157+
batch_data = read_op(double_buffer_reader)
158+
... (subsequent training ops)
159+
} else {
160+
reset_op(double_buffer_reader)
161+
}
162+
}
163+
```
164+
165+
Two things are worth mentioning when considering these two programs:
166+
167+
1. The `multiple_reader` is the `batch_reader`'s underlying reader, and the `batch_reader` is the `double_buffer_reader`'s underlying reader. `read_op`, `has_next_op` and other reader-related ops will only invoke the top-most reader. In this case, it's the `double_buffer_reader`.
168+
169+
2. All readers exist in both `startup_program` and `main_program`, and they are persistable.
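The control flow of the `main_program` loop can also be mimicked in plain C++ to make the roles of `has_next_op`, `read_op` and `reset_op` concrete. This is a rough analogy, not how Fluid executes a program: `Tensor` is a stand-in for `LoDTensor`, `RangeReader` is a hypothetical top-most reader, and the `max_passes` bound is an assumption added so that the sketch terminates.

```cpp
#include <cassert>
#include <vector>

using Tensor = std::vector<float>;  // stand-in for LoDTensor

class ReaderBase {
 public:
  virtual void ReadNext(std::vector<Tensor>* out) = 0;
  virtual bool HasNext() const = 0;
  virtual void ReInit() = 0;
  virtual ~ReaderBase() = default;
};

// Hypothetical top-most reader: yields `total` batches per pass.
class RangeReader : public ReaderBase {
 public:
  explicit RangeReader(int total) : total_(total) {}
  void ReadNext(std::vector<Tensor>* out) override {
    out->assign(1, Tensor{static_cast<float>(next_++)});
  }
  bool HasNext() const override { return next_ < total_; }
  void ReInit() override { next_ = 0; }

 private:
  int total_;
  int next_ = 0;
};

// Mirrors the while_op body: read while data remains, otherwise reset.
// Returns the number of training steps that were run.
int Train(ReaderBase* top_reader, int max_passes) {
  assert(top_reader != nullptr);
  int steps = 0;
  std::vector<Tensor> batch;
  for (int pass = 0; pass < max_passes;) {
    if (top_reader->HasNext()) {     // has_next_op
      top_reader->ReadNext(&batch);  // read_op
      ++steps;                       // subsequent training ops would go here
    } else {
      top_reader->ReInit();          // reset_op
      ++pass;
    }
  }
  return steps;
}
```

Only the top-most reader is passed to `Train`; any underlying readers are reached through it, which matches the first point above.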

doc/design/images/multiple_reader.pdf

38 KB
Binary file not shown.

doc/design/images/readers.pdf

264 KB
Binary file not shown.
