Commit 60314ee

Merge pull request #9079 from JiayiFeng/dev_update_reader_doc
Update cpp reader doc
2 parents 70e500b + f617ffb commit 60314ee

3 files changed

+118
-25
lines changed

doc/design/images/multiple_reader.png

160 KB

doc/design/images/readers.png

347 KB
Lines changed: 118 additions & 25 deletions
@@ -1,53 +1,97 @@
11
# C++ Data Feeding
22

3-
While using Paddle V2 API for Training, data feeding completely depends on the Python code. To get rid of the Python environment and achieve the goal of "wrapping the whole training by a while loop op" in Paddle Fluid, a C++ data feeding mechanism is required.
3+
While using Paddle V2 API for training, data feeding completely depends on the Python code. To get rid of the Python environment and achieve the goal of "wrapping the whole training by a while loop op" in Paddle Fluid, a C++ data feeding mechanism is required.
44

5-
In this document we show the fundamental design of a C++ data feeding process, which includes data reading, shuffling and batching.
5+
In this document, we show the fundamental design of a C++ data feeding process, which includes data reading, shuffling and batching.
6+
7+
## Overview
8+
9+
![](images/readers.png)
610

711
## Reader
812

9-
In order to handle the above mentioned problem, a new concept called 'Reader' is introduced. `Reader` is a series of inherited classes which can be held by our `Variable` and they are used to read or process file data.
13+
In order to handle the above-mentioned problem, a new concept called 'Reader' is introduced. `Reader` is a series of inherited classes which can be held by our `Variable` and they are used to read or process file data.
1014

1115

12-
### `ReaderBase`
16+
### ReaderBase
1317

1418
`ReaderBase` is the abstract base class for all readers. It defines the interface for all readers.
1519

1620
```cpp
1721
class ReaderBase {
1822
public:
19-
explicit ReaderBase(const std::vector<DDim>& shapes) : shapes_(shapes) {
20-
PADDLE_ENFORCE(!shapes_.empty());
21-
}
22-
// Read the next batch of data. (A 'batch' can be only one instance)
23-
// If the next batch doesn't exist, '*out' will be an empty std::vector.
23+
// Reads the next batch of data. (A 'batch' can be only one instance)
24+
// If the next batch doesn't exist, it throws an exception
2425
virtual void ReadNext(std::vector<LoDTensor>* out) = 0;
2526

26-
// Reinitialize the reader and read the file from the beginning.
27-
virtual void ReInit() = 0;
27+
// Checks whether the next instance exists.
28+
virtual bool HasNext() = 0;
2829

29-
// Get a certain read in data's shape.
30-
DDim shape(size_t idx) const;
31-
// Get shapes of all read in data.
32-
std::vector<DDim> shapes() const { return shapes_; }
33-
// Set shapes of read in data.
34-
void set_shapes(const std::vector<DDim>& shapes) { shapes_ = shapes; }
30+
// Reinitializes the reader and reads the file from the beginning.
31+
virtual void ReInit() = 0;
32+
33+
virtual ~ReaderBase();
34+
};
35+
```
36+
37+
### FileReader
38+
39+
`FileReader` is derived from `ReaderBase`. It is still an abstract class and is further derived by readers for each specific file format.
40+
41+
```cpp
42+
class FileReader : public ReaderBase {
43+
public:
44+
explicit FileReader(const std::vector<DDim>& dims);
45+
46+
void ReadNext(std::vector<LoDTensor>* out) override;
47+
48+
protected:
49+
virtual void ReadNextImpl(std::vector<LoDTensor>* out) = 0;
50+
51+
private:
52+
std::vector<DDim> dims_;
53+
};
54+
```
55+
56+
A file reader binds with a single file and reads one data instance at a time. Each type of file reader shall implement its own `ReadNextImpl()`, `HasNext()` and `ReInit()`.
57+
58+
The `ReadNextImpl()` is invoked by `ReadNext()`. Besides invoking `ReadNextImpl()`, `ReadNext()` is also responsible for checking the output, making sure that each shape of `LoDTensor` in `*out` is consistent with the one in `dims_`.
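
The split between `ReadNext()` and `ReadNextImpl()` can be illustrated with a minimal sketch. It is not the Paddle implementation: `Dim` and `Tensor` are hypothetical stand-ins for `DDim` and `LoDTensor`, `SketchFileReader` and `ConstantShapeReader` are made-up names, and reporting a mismatch by throwing `std::runtime_error` is an assumption (the design does not say how a failed check is reported).

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical stand-ins for DDim and LoDTensor, used only in this sketch.
using Dim = std::vector<int>;
struct Tensor {
  Dim dims;
};

class SketchFileReader {
 public:
  explicit SketchFileReader(const std::vector<Dim>& dims) : dims_(dims) {}
  virtual ~SketchFileReader() = default;

  // Non-virtual wrapper: delegates to ReadNextImpl(), then checks every
  // output shape against the shapes declared at construction time.
  void ReadNext(std::vector<Tensor>* out) {
    ReadNextImpl(out);
    if (out->size() != dims_.size()) {
      throw std::runtime_error("unexpected number of tensors");
    }
    for (std::size_t i = 0; i < dims_.size(); ++i) {
      if ((*out)[i].dims != dims_[i]) {
        throw std::runtime_error("tensor shape mismatch");
      }
    }
  }

 protected:
  // Format-specific readers only implement the actual reading.
  virtual void ReadNextImpl(std::vector<Tensor>* out) = 0;

 private:
  std::vector<Dim> dims_;
};

// Toy format-specific reader that always yields one tensor of a fixed shape.
class ConstantShapeReader : public SketchFileReader {
 public:
  ConstantShapeReader(const std::vector<Dim>& declared, const Dim& actual)
      : SketchFileReader(declared), actual_(actual) {}

 protected:
  void ReadNextImpl(std::vector<Tensor>* out) override {
    out->clear();
    out->push_back(Tensor{actual_});
  }

 private:
  Dim actual_;
};
```

Keeping the check in the base class means a new file format only has to supply `ReadNextImpl()` and gets the shape validation for free.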
59+
60+
### DecoratedReader
61+
62+
A decorated reader takes another reader (either a file reader or a decorated reader) as its 'underlying reader'. It gets data from its underlying reader, processes it (shuffling, batching or something else), then yields the processed data. The output data of a decorated reader can be a single instance or a batch. `ShuffleReader` and `BatchReader` are both decorated readers.
63+
64+
```cpp
65+
class DecoratedReader : public ReaderBase {
66+
public:
67+
explicit DecoratedReader(ReaderBase* reader) : ReaderBase(), reader_(reader) {
68+
PADDLE_ENFORCE_NOT_NULL(reader_);
69+
}
70+
71+
void ReInit() override { reader_->ReInit(); }
3572

36-
virtual ~ReaderBase() {}
73+
bool HasNext() override { return reader_->HasNext(); }
3774

3875
protected:
39-
std::vector<DDim> shapes_;
76+
ReaderBase* reader_;
4077
};
4178
```
4279
43-
### `FileReader` and `DecoratedReader`
80+
Both the `FileReader` and `DecoratedReader` share exactly the same interface as defined in `ReaderBase`, so readers can be decorated multiple times: we can **shuffle** a reader's outputs and then **batch** the shuffled outputs. The interface consistency also allows related ops to use readers without knowing their underlying type.
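
A minimal sketch of such a shuffle-then-batch chain is given here, under simplifying assumptions: each instance is a single `int`, a `Batch` is a `std::vector<int>`, and `Reader`, `VectorReader`, and the two decorators are illustrative classes rather than Paddle's. In particular, this `ShuffleReader` buffers the whole underlying reader before shuffling, which is only one possible strategy.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

using Batch = std::vector<int>;

// Same three-method interface as ReaderBase, over the simplified Batch type.
class Reader {
 public:
  virtual ~Reader() = default;
  virtual void ReadNext(Batch* out) = 0;
  virtual bool HasNext() = 0;
  virtual void ReInit() = 0;
};

// Stands in for a file reader: yields one integer instance at a time.
class VectorReader : public Reader {
 public:
  explicit VectorReader(std::vector<int> data) : data_(std::move(data)) {}
  void ReadNext(Batch* out) override { out->assign(1, data_.at(pos_++)); }
  bool HasNext() override { return pos_ < data_.size(); }
  void ReInit() override { pos_ = 0; }

 private:
  std::vector<int> data_;
  std::size_t pos_ = 0;
};

// Decorator: drains the underlying reader once, then yields in random order.
class ShuffleReader : public Reader {
 public:
  ShuffleReader(Reader* reader, unsigned seed) : reader_(reader), rng_(seed) {}
  void ReadNext(Batch* out) override {
    Fill();
    *out = buffer_.at(pos_++);
  }
  bool HasNext() override {
    Fill();
    return pos_ < buffer_.size();
  }
  void ReInit() override {
    reader_->ReInit();
    buffer_.clear();
    pos_ = 0;
    filled_ = false;
  }

 private:
  void Fill() {
    if (filled_) return;
    Batch item;
    while (reader_->HasNext()) {
      reader_->ReadNext(&item);
      buffer_.push_back(item);
    }
    std::shuffle(buffer_.begin(), buffer_.end(), rng_);
    filled_ = true;
  }

  Reader* reader_;
  std::mt19937 rng_;
  std::vector<Batch> buffer_;
  std::size_t pos_ = 0;
  bool filled_ = false;
};

// Decorator: concatenates up to batch_size outputs of the underlying reader.
class BatchReader : public Reader {
 public:
  BatchReader(Reader* reader, std::size_t batch_size)
      : reader_(reader), batch_size_(batch_size) {}
  void ReadNext(Batch* out) override {
    out->clear();
    Batch item;
    for (std::size_t i = 0; i < batch_size_ && reader_->HasNext(); ++i) {
      reader_->ReadNext(&item);
      out->insert(out->end(), item.begin(), item.end());
    }
  }
  bool HasNext() override { return reader_->HasNext(); }
  void ReInit() override { reader_->ReInit(); }

 private:
  Reader* reader_;
  std::size_t batch_size_;
};
```

Because every class exposes only the `Reader` interface, a consumer that holds the outermost `BatchReader` never needs to know that a `ShuffleReader` and a `VectorReader` sit underneath it.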
81+
82+
### MultipleReader
83+
84+
Every `FileReader` is bound to a single file and is single-threaded. However, sometimes we need to read data from more than one file. In this case, having only `FileReader` and `DecoratedReader` is not enough.
4485
45-
These two classes are derived from the `ReaderBase` and will further be derived by more specific readers. Thus, in our design, there are two kinds of readers: file readers and decorated readers. A file reader reads from a file of some specific format, and yield only one instance of data at a time. For example, RecordIO reader, jpg reader, .... A decorated reader takes another reader(both file reader and decorated reader are OK) as its 'underlying reader'. It gets data from its underlying reader, does some processing on them(shuffling, or batching), then yields processed data. The output data of a decorated reader can be a single instance or a batch. `ShuffleReader` and `BatchReader` are both decorated readers.
86+
So `MultipleReader` is introduced. It is also derived from `ReaderBase`. A `MultipleReader` holds several prefetching `FileReader`s, which run concurrently. Another pivotal part of a `MultipleReader` is a buffer channel. The channel collects data yielded by all prefetching readers, so that subsequent ops or decorated readers can fetch data without caring about how the multiple readers are scheduled.
4687
47-
All the readers share exactly the same interface as defined in `ReaderBase`. So they can be decorated for more than one time: We can **shuffle** a reader's outputs and then **batch** the shuffle outputs. The interface consistency also allows related ops use readers without knowing what they are exactly.
88+
![](images/multiple_reader.png)
4889
90+
This graph shows how a `MultipleReader` works with three prefetching file readers and two GPUs. There is a queue of files waiting to be read. Whenever a prefetching file reader becomes free (it has finished reading one file), it fetches a new file from the queue. Each prefetching file reader runs in a separate prefetch thread and dumps its outputs to the same channel.
4991
50-
### `ReaderHolder`
92+
To the two subsequent decorated readers, the `MultipleReader` is **a single reader**. They don't need to care how the prefetching readers are scheduled. They only need to invoke `MultipleReader::ReadNext()` to get the next data from the buffer channel.
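
The scheduling described above (a shared file queue, one thread per prefetching reader, one common channel) can be sketched with the standard threading library. Everything here is an assumption made for illustration: a "file" is modelled as a vector of integers, the channel is an unbounded `std::queue` guarded by a mutex, and `ReadNext()` returns `false` at the end of data instead of pairing with a separate `HasNext()`. A real channel would likely be bounded.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical model: a "file" is just a vector of integer instances.
using File = std::vector<int>;

class SketchMultipleReader {
 public:
  SketchMultipleReader(std::vector<File> files, std::size_t thread_num)
      : files_(std::move(files)) {
    active_workers_ = thread_num;
    for (std::size_t i = 0; i < thread_num; ++i) {
      workers_.emplace_back([this] { PrefetchLoop(); });
    }
  }

  ~SketchMultipleReader() {
    for (auto& worker : workers_) worker.join();
  }

  // Blocks until an instance is available. Returns false once every file
  // has been read and the channel has been drained.
  bool ReadNext(int* out) {
    std::unique_lock<std::mutex> lock(mu_);
    cv_.wait(lock,
             [this] { return !channel_.empty() || active_workers_ == 0; });
    if (channel_.empty()) return false;
    *out = channel_.front();
    channel_.pop();
    return true;
  }

 private:
  // Body of one prefetch thread: take the next unread file from the queue,
  // push all of its instances into the shared channel, repeat.
  void PrefetchLoop() {
    for (;;) {
      File file;
      {
        std::lock_guard<std::mutex> lock(mu_);
        if (next_file_ == files_.size()) {
          --active_workers_;
          cv_.notify_all();
          return;
        }
        file = files_[next_file_++];
      }
      for (int instance : file) {
        std::lock_guard<std::mutex> lock(mu_);
        channel_.push(instance);
        cv_.notify_one();
      }
    }
  }

  std::vector<File> files_;
  std::size_t next_file_ = 0;
  std::size_t active_workers_ = 0;
  std::mutex mu_;
  std::condition_variable cv_;
  std::queue<int> channel_;
  std::vector<std::thread> workers_;
};
```

The consumer sees one stream of instances. The order in which instances from different files interleave depends on thread scheduling, which is one reason a shuffle or batch decorator is usually placed on top.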
93+
94+
### ReaderHolder
5195
5296
Different readers belong to different class types. This leads to a problem: how can we drop them into `Variable`s and fetch them out by a unified method? For example, if a `Variable` holds a `BatchReader`, we cannot get it by the following code:
5397
@@ -69,10 +113,59 @@ To solve this problem, we introduce `ReaderHolder` as a wrapper. It acts as an e
69113

70114
To create and invoke readers, some new ops are introduced:
71115

72-
### `CreateReaderOp`
116+
### CreateReaderOp
73117

74118
Each reader has its own creation op. File readers' creation ops have no input and yield the created file reader as their output. Decorated readers' creation ops take the underlying readers as inputs and then yield new decorated readers.
75119

76-
### `ReadOp`
120+
However, direct usage of file readers' creation ops is not recommended because a file reader can only read one file via a single thread. Using `OpenFilesOp` is a better choice.
121+
122+
### OpenFilesOp
123+
124+
The `OpenFilesOp` is the creation op of `MultipleReader`. It takes no input but requires a list of file names as one of its attributes. The newly created `MultipleReader` then creates its own prefetching readers according to given file names.
125+
126+
To make sure that created prefetching readers match file formats, we need a name prefix rule to append file format tags to file names, as well as a file reader registry mechanism to map file format tags to their corresponding file readers' constructors.
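
One way such a registry could look is sketched below. The design does not fix the naming rule, so the `"<tag>:<path>"` convention used here is purely an assumption, as are the names `ReaderRegistry`, `SketchReader`, and `RecordIOSketchReader`.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical reader interface, reduced to what the sketch needs.
struct SketchReader {
  virtual ~SketchReader() = default;
  virtual std::string Format() const = 0;
};

struct RecordIOSketchReader : SketchReader {
  explicit RecordIOSketchReader(std::string p) : path(std::move(p)) {}
  std::string Format() const override { return "recordio"; }
  std::string path;
};

using ReaderCreator =
    std::function<std::unique_ptr<SketchReader>(const std::string&)>;

// Maps a file format tag to the function that constructs a matching reader.
class ReaderRegistry {
 public:
  void Register(const std::string& tag, ReaderCreator creator) {
    creators_[tag] = std::move(creator);
  }

  // Assumed naming rule: "<tag>:<path>", e.g. "recordio:train.data".
  std::unique_ptr<SketchReader> Create(const std::string& tagged_name) const {
    const std::size_t sep = tagged_name.find(':');
    if (sep == std::string::npos) {
      throw std::invalid_argument("missing format tag");
    }
    auto it = creators_.find(tagged_name.substr(0, sep));
    if (it == creators_.end()) {
      throw std::invalid_argument("unknown format tag");
    }
    return it->second(tagged_name.substr(sep + 1));
  }

 private:
  std::map<std::string, ReaderCreator> creators_;
};
```

With a registry like this, an op such as `OpenFilesOp` would only need the list of tagged file names; supporting a new file format would come down to one extra `Register()` call.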
127+
128+
### HasNextOp
129+
130+
`HasNextOp` is used to check whether the next data batch exists via the reader's `HasNext()` interface.
131+
132+
### ResetOp
133+
134+
`ResetOp` is used to reset a reader via its `ReInit()` interface.
135+
136+
### ReadOp
77137

78138
A reader is only a Variable. It cannot trigger the reading process by itself. So we add the `ReadOp` to execute it. A `ReadOp` takes a reader Variable as its input. Each time it runs, it invokes the reader's `ReadNext()` function and gets a new batch of data (or only one instance of data, if we use a file reader directly). The output data of a reader are in the form of `std::vector<LoDTensor>`, so the `ReadOp` also needs to split the vector and move LoDTensors to their respective output Variables.
139+
140+
## Program with Readers
141+
142+
A `Program` holds readers as its persistable variables. These variables are created by `CreateReaderOp` or `OpenFilesOp`. These ops shall run only once, so they shall be placed in the `startup_program`. `HasNextOp`, `ResetOp` and `ReadOp` are required by the training loop, so they shall be in the `main_program`.
143+
144+
The ops of a `startup_program` with readers would be like this:
145+
146+
```
147+
multiple_reader = open_files_op(...)
148+
batch_reader = create_batch_reader_op(multiple_reader)
149+
double_buffer_reader = create_double_buffer_op(batch_reader)
150+
... (other initializers)
151+
```
152+
153+
The forwarding ops of the corresponding `main_program` would be like this:
154+
155+
```
156+
while_op {
157+
has_next = has_next_op(double_buffer_reader)
158+
if_else_op(has_next) {
159+
batch_data = read_op(double_buffer_reader)
160+
... (subsequent training ops)
161+
} else {
162+
reset_op(double_buffer_reader)
163+
}
164+
}
165+
```
166+
167+
Two important considerations for these programs are as follows:
168+
169+
1. The multiple\_reader is the batch\_reader's underlying reader, and the batch\_reader is the double\_buffer\_reader's underlying reader. `read_op`, `has_next_op` and other reader-related ops will only invoke the top-most reader. In this case, it's the double\_buffer\_reader.
170+
171+
2. All readers exist in both `startup_program` and `main_program`. And they are persistable.
