File: doc/design/reader/README.md (37 additions & 33 deletions)
@@ -1,25 +1,25 @@
 # Python Data Reader Design Doc

-At training and testing time, PaddlePaddle programs need to read data. To ease the users' work to write data reading code, we define that
+During the training and testing phases, PaddlePaddle programs need to read data. To help users write code that reads input data, we define the following:

-- A *reader* is a function that reads data (from file, network, random number generator, etc) and yields data items.
-- A *reader creator* is a function that returns a reader function.
-- A *reader decorator* is a function, which accepts one or more readers, and returns a reader.
-- A *batch reader* is a function that reads data (from *reader*, file, network, random number generator, etc) and yields a batch of data items.
+- A *reader*: A function that reads data (from file, network, random number generator, etc.) and yields the data items.
+- A *reader creator*: A function that returns a reader function.
+- A *reader decorator*: A function that takes in one or more readers and returns a reader.
+- A *batch reader*: A function that reads data (from *reader*, file, network, random number generator, etc.) and yields a batch of data items.

-and provide function which converts reader to batch reader, frequently used reader creators and reader decorators.
+We also provide a function that converts a reader to a batch reader, along with frequently used reader creators and reader decorators.

 ## Data Reader Interface

-Indeed, *data reader* doesn't have to be a function that reads and yields data items. It can be any function with no parameter that creates a iterable (anything can be used in `for x in iterable`):
+A *data reader* doesn't have to be a function that reads and yields data items. It can be any function without parameters that creates an iterable (anything that can be used in `for x in iterable`), as follows:

 ```
 iterable = data_reader()
 ```

-Element produced from the iterable should be a **single** entry of data, **not** a mini batch. That entry of data could be a single item, or a tuple of items. Item should be of [supported type](http://www.paddlepaddle.org/doc/ui/data_provider/pydataprovider2.html?highlight=dense_vector#input-types) (e.g., numpy 1d array of float32, int, list of int)
+The item produced from the iterable should be a **single** entry of data and **not** a mini batch. The entry of data could be a single item or a tuple of items. Each item should be of one of the [supported types](http://www.paddlepaddle.org/doc/ui/data_provider/pydataprovider2.html?highlight=dense_vector#input-types) (e.g., numpy 1d array of float32, int, list of int, etc.).

-An example implementation for single item data reader creator:
+An example implementation of a single-item data reader creator is as follows:
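The example code itself falls outside the lines shown in this diff. As a stand-in, here is a minimal sketch of what such reader creators could look like. The function names, the image dimensions, the `num_samples` and `seed` parameters, and the use of plain lists of integers instead of numpy arrays are all assumptions made for illustration; they are not taken from the PaddlePaddle code base.

```python
import random


def reader_creator_random_image(width, height, num_samples=4, seed=0):
    """Reader creator: returns a reader that yields one flattened image per entry."""
    def reader():
        rng = random.Random(seed)  # fresh generator, so every pass yields the same data
        for _ in range(num_samples):
            # a single item: a flat list of width * height integer pixel values
            yield [rng.randint(0, 255) for _ in range(width * height)]
    return reader


def reader_creator_random_image_and_label(width, height, num_samples=4, seed=0):
    """Reader creator: returns a reader that yields (image, label) tuples."""
    def reader():
        rng = random.Random(seed)
        for _ in range(num_samples):
            image = [rng.randint(0, 255) for _ in range(width * height)]
            label = rng.randint(0, 9)
            yield image, label  # a tuple of items is still a single entry
    return reader


image_reader = reader_creator_random_image(2, 2)
first_pass = list(image_reader())   # calling the reader creates a new iterable
second_pass = list(image_reader())  # the reader can be called again for another pass
```

Because the reader takes no parameters, all configuration lives in the reader creator, and the training code only ever needs to call `reader()`.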
-*batch reader* can be any function with no parameter that creates a iterable (anything can be used in `for x in iterable`). The output of the iterable should be a batch (list) of data items. Each item inside the list must be a tuple.
+A *batch reader* can be any function without parameters that creates an iterable (anything that can be used in `for x in iterable`). The output of the iterable should be a batch (list) of data items. Each item inside the list must be a tuple.
+
+Here are some valid outputs:

-Here are valid outputs:
 ```python
 # a mini batch of three data items. Each data item consists of three columns of data, each of which is 1.
 [(1, 1, 1),
@@ -58,20 +59,22 @@ Here are valid outputs:
 Please note that each item inside the list must be a tuple. Below is an invalid output:
 ```python
 # wrong, [1,1,1] needs to be inside a tuple: ([1,1,1],).
-# Otherwise it's ambiguous whether [1,1,1] means a single column of data [1, 1, 1],
-# or three column of datas, each of which is 1.
+# Otherwise it is ambiguous whether [1,1,1] means a single column of data [1, 1, 1],
+# or three columns of data, each of which is 1.
 [[1,1,1],
  [2,2,2],
  [3,3,3]]
 ```

-It's easy to convert from reader to batch reader:
+It is easy to convert from a reader to a batch reader:
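The conversion code is not part of the lines shown in this diff. The idea can be sketched as follows. Note that `batch` here is a hypothetical stand-alone helper written for illustration, not the actual implementation of `paddle.batch`, and `counting_reader` is a made-up reader.

```python
def batch(reader, batch_size):
    """Wrap a single-entry reader into a batch reader (illustrative sketch)."""
    def batch_reader():
        current = []
        for entry in reader():
            current.append(entry)
            if len(current) == batch_size:
                yield current
                current = []
        if current:  # emit the final, possibly smaller, batch
            yield current
    return batch_reader


def counting_reader():
    # hypothetical reader: every entry is a (feature, label) tuple
    for i in range(5):
        yield i, i % 2


batches = list(batch(counting_reader, 2)())
```

In this sketch the last batch is smaller than `batch_size` when the number of entries is not a multiple of it; dropping that remainder instead would be an equally reasonable design choice.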
-*Data reader decorator* takes a single or multiple data reader, returns a new data reader. It is similar to a [python decorator](https://wiki.python.org/moin/PythonDecorators), but it does not use `@` syntax.
+The *data reader decorator* takes in a single data reader or multiple data readers and returns a new data reader. It is similar to a [python decorator](https://wiki.python.org/moin/PythonDecorators), but it does not use the `@` syntax.

-Since we have a strict interface for data readers (no parameter, return a single data item). Data reader can be used flexiable via data reader decorators. Following are a few examples:
+Since we have a strict interface for data readers (no parameters and return a single data item), a data reader can be used in a flexible way using data reader decorators. The following are a few examples:

 ### Prefetch Data

-Since reading data may take time and training can not proceed without data. It is generally a good idea to prefetch data.
+Since reading data may take some time and training cannot proceed without data, it is generally a good idea to prefetch the data.
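One possible way to prefetch is a reader decorator that fills a bounded queue from a background thread. The sketch below uses only the Python standard library; the name `buffered`, its `size` parameter, and `slow_reader` are assumptions for illustration and are not presented as PaddlePaddle's API. Propagating errors from the background thread is omitted for brevity.

```python
import queue
import threading


def buffered(reader, size):
    """Reader decorator: prefetch up to `size` entries in a background thread."""
    end = object()  # sentinel marking the end of the underlying reader

    def buffered_reader():
        q = queue.Queue(maxsize=size)

        def fill():
            for entry in reader():
                q.put(entry)  # blocks while the buffer is full
            q.put(end)

        threading.Thread(target=fill, daemon=True).start()
        while True:
            entry = q.get()
            if entry is end:
                return
            yield entry

    return buffered_reader


def slow_reader():
    # hypothetical reader standing in for slow disk or network access
    for i in range(10):
        yield i


prefetched = list(buffered(slow_reader, 3)())
```

The decorated reader keeps the same interface (no parameters, one entry at a time), so training code does not need to know that prefetching is happening.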
-For example, we want to use a source of real images (reusing mnist dataset), and a source of random images as input for [Generative Adversarial Networks](https://arxiv.org/abs/1406.2661).
+For example, we may want to use a source of real images (say, reusing the mnist dataset) and a source of random images as input for [Generative Adversarial Networks](https://arxiv.org/abs/1406.2661).
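A decorator that combines two such sources could look like the sketch below. The helper `compose` and both readers are hypothetical stand-ins written for this illustration (tiny lists instead of real images), and stopping at the shortest reader is an assumption of the sketch.

```python
import random


def compose(*readers):
    """Reader decorator: zip several readers into one that yields combined tuples."""
    def as_tuple(entry):
        return entry if isinstance(entry, tuple) else (entry,)

    def composed_reader():
        # zip stops as soon as the shortest reader is exhausted
        for entries in zip(*(r() for r in readers)):
            yield sum((as_tuple(e) for e in entries), ())
    return composed_reader


def real_image_reader():
    # hypothetical stand-in for a dataset reader yielding (image, label)
    for label in range(3):
        yield [label] * 4, label


def random_image_reader():
    # hypothetical endless noise source yielding a single item per entry
    rng = random.Random(0)
    while True:
        yield [rng.random() for _ in range(4)]


gan_entries = list(compose(real_image_reader, random_image_reader)())
```

Each composed entry is a flat tuple `(real_image, label, random_image)`, which is still a single entry in the sense of the reader interface.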
-Given shuffle buffer size `n`, `paddle.reader.shuffle`will return a data reader that buffers `n` data entries and shuffle them before a data entry is read.
+Given the shuffle buffer size `n`, `paddle.reader.shuffle` returns a data reader that buffers `n` data entries and shuffles them before a data entry is read.
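The buffering behavior described above can be sketched as follows. This is an illustrative stand-alone function, not the source of `paddle.reader.shuffle`; the `seed` parameter and `ordered_reader` are assumptions added to make the example reproducible.

```python
import random


def shuffle(reader, buf_size, seed=None):
    """Reader decorator: shuffle entries within a buffer of `buf_size` entries."""
    def shuffled_reader():
        rng = random.Random(seed)
        buf = []
        for entry in reader():
            buf.append(entry)
            if len(buf) >= buf_size:
                rng.shuffle(buf)
                yield from buf
                buf = []
        if buf:  # shuffle and flush whatever is left at the end
            rng.shuffle(buf)
            yield from buf
    return shuffled_reader


def ordered_reader():
    # hypothetical reader yielding entries in a fixed order
    for i in range(10):
        yield i


shuffled = list(shuffle(ordered_reader, 4, seed=0)())
```

With this approach entries are only reordered within each buffer of `n` entries, so a larger buffer gives a more thorough shuffle at the cost of more memory.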
-### Why reader return only a single entry, but not a mini batch?
+### Why does a reader return only a single entry, and not a mini batch?

-Always returning a single entry make reusing existing data readers much easier (e.g., if existing reader return not a single entry but 3 entries, training code will be more complex because it need to handle cases like batch size 2).
+Returning a single entry makes reusing existing data readers much easier (for example, if an existing reader returns 3 entries instead of a single entry, the training code will be more complicated because it needs to handle cases like a batch size of 2).

-We provide function `paddle.batch` to turn (single entry) reader into batch reader.
+We provide a function, `paddle.batch`, to turn a (single entry) reader into a batch reader.

-### Why do we need batch reader, isn't train take reader and batch_size as arguments sufficient?
+### Why do we need a batch reader, isn't it sufficient to give the reader and batch_size as arguments during training?

-In most of the case, train taking reader and batch_size as arguments would be sufficent. However sometimes user want to customize order of data entries inside a mini batch. Or even change batch size dynamically.
+In most cases, it would be sufficient to give the reader and batch_size as arguments to the train method. However, sometimes the user wants to customize the order of data entries inside a mini batch, or even change the batch size dynamically. For these cases, using a batch reader is helpful.
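To make the two cases concrete, here is a sketch of a custom batch reader that both reorders entries inside each mini batch and changes the batch size as it goes. The function name, the doubling schedule, and the sort-by-label ordering are hypothetical choices made only for this illustration.

```python
def growing_batch_reader_creator(reader, initial_size, max_size):
    """Batch reader creator whose batch size doubles until it reaches max_size."""
    def batch_reader():
        size = initial_size
        current = []
        for entry in reader():
            current.append(entry)
            if len(current) == size:
                # custom ordering inside the mini batch: sort by label
                yield sorted(current, key=lambda e: e[1])
                current = []
                size = min(size * 2, max_size)
        if current:  # flush the remaining entries as a final batch
            yield sorted(current, key=lambda e: e[1])
    return batch_reader


def labeled_reader():
    # hypothetical reader yielding (feature, label) entries
    for i in range(10):
        yield i, (10 - i) % 3


custom_batches = list(growing_batch_reader_creator(labeled_reader, 1, 4)())
batch_sizes = [len(b) for b in custom_batches]
```

Neither behavior could be expressed by passing a fixed `batch_size` to the training call, which is why the batch reader is kept as a separate concept.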

-### Why use a dictionary but not a list to provide mapping?
+### Why use a dictionary instead of a list to provide mapping?

-We decided to use dictionary (`{"image":0, "label":1}`) instead of list (`["image", "label"]`) is because that user can easily resue item (e.g., using `{"image_a":0, "image_b":0, "label":1}`) or skip item (e.g., using `{"image_a":0, "label":2}`).
+Using a dictionary (`{"image":0, "label":1}`) instead of a list (`["image", "label"]`) gives the advantage that the user can easily reuse the items (e.g., using `{"image_a":0, "image_b":0, "label":1}`) or even skip an item (e.g., using `{"image_a":0, "label":2}`).
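The reuse and skip cases can be illustrated with a few lines of Python. The helper `select_by_name` and the sample entry are hypothetical and exist only to show how a name-to-index mapping is applied to one data entry.

```python
def select_by_name(entry, mapping):
    """Build named inputs from one data entry using a name -> index mapping."""
    return {name: entry[index] for name, index in mapping.items()}


entry = ("pixels", 7, "extra")  # hypothetical entry with three items

# reuse item 0 under two different names
reused = select_by_name(entry, {"image_a": 0, "image_b": 0, "label": 1})

# skip item 1 entirely by mapping "label" to index 2
skipped = select_by_name(entry, {"image_a": 0, "label": 2})
```

A list such as `["image", "label"]` can only assign names position by position, so it has no way to point two names at the same item or to leave an item out.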