Commit ed67165

author: yuyang18

PyDataProvider English Document
Thanks to caoying for checking the English grammar. ISSUE=4598269 git-svn-id: https://svn.baidu.com/idl/trunk/paddle@1462 1ad973e4-5ce8-4261-8a94-b56d1f490c56
1 parent cecdede commit ed67165

File tree

6 files changed: +294 −176 lines


doc/ui/api/py_data_provider_wrapper.rst

Lines changed: 0 additions & 6 deletions
This file was deleted.

doc/ui/data_provider/index.md

Lines changed: 0 additions & 55 deletions
This file was deleted.

doc/ui/data_provider/index.rst

Lines changed: 42 additions & 0 deletions
@@ -0,0 +1,42 @@
PaddlePaddle DataProvider Introduction
======================================

DataProvider is a module that loads training or testing data into CPU or GPU
memory for the subsequent training or testing process.

For simple usage, users can use the Python :code:`PyDataProvider` to
dynamically read the original data in any format or form, and then transform
it into the data format PaddlePaddle requires. The process is extremely
flexible and highly customizable, sacrificing only a little efficiency. This
is extremely useful when you have to dynamically generate certain kinds of
data according to, for example, the training performance.

Besides, users can also customize a C++ :code:`DataProvider` for more complex
usage or higher efficiency.

The following parameters are required in the PaddlePaddle network
configuration file (trainer_config.py): which DataProvider to use, and its
specific parameters, including the training file list (train.list) and the
testing file list (test.list).

train.list and test.list are simply two plain text files which define the
paths of the training or testing data. It is recommended to place them
directly in the training directory and reference them by relative paths
(relative to the PaddlePaddle program).
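
As an illustration, such a train.list can be generated with a few lines of
Python; the data file names below are hypothetical:

```python
# Illustrative only: write a train.list with one data-file path per line,
# relative to the training directory. The file names are hypothetical.
from pathlib import Path

data_files = ["data/train_part_000.txt", "data/train_part_001.txt"]
Path("train.list").write_text("\n".join(data_files) + "\n")
```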

Testing or evaluating will not be performed during training if test.list is
not set or is set to None. Otherwise, PaddlePaddle will evaluate the trained
model on the specified testing data while training, once every testing period
(a user-defined command line parameter in PaddlePaddle), to prevent
over-fitting.

Each line of train.list and test.list is an absolute or relative path
(relative to the PaddlePaddle program runtime) of a data file. Each line can
even be an HDFS file path or a SQL connection string, as long as the user
ensures that the DataProvider knows how to access each file.

Please refer to the following articles for more information about the
detailed usage of DataProvider and how to implement a new DataProvider:

.. toctree::

   pydataprovider2.rst
   write_new_dataprovider.rst
doc/ui/data_provider/pydataprovider2.rst

Lines changed: 250 additions & 0 deletions
@@ -0,0 +1,250 @@
How to use PyDataProvider2
==========================

We highly recommend users use PyDataProvider2 to provide training or testing
data to PaddlePaddle. The user only needs to focus on how to read a single
sample from the original data file; PyDataProvider2 handles all of the trivial
work, including transferring data into CPU/GPU memory, shuffling, and binary
serialization. PyDataProvider2 uses multithreading and a simple but effective
cache strategy to optimize the efficiency of the data providing process.

DataProvider for the non-sequential model
-----------------------------------------

Here we use the MNIST handwriting recognition data as an example to illustrate
how to write a simple PyDataProvider.

MNIST is a handwriting classification data set. It contains 70,000 digit
grayscale images. Labels of the training samples range from 0 to 9. All the
images have been size-normalized and centered into images of the same size,
28 x 28 pixels.

A small part of the original data is shown below as an example:

.. literalinclude:: ../../../doc_cn/ui/data_provider/mnist_train.txt

Each line of the data contains two parts, separated by ';'. The first part is
the label of an image. The second part contains the 28x28 pixel float values.

Write the path of the above data file into train.list. It looks like this:

.. literalinclude:: ../../../doc_cn/ui/data_provider/train.list

The corresponding dataprovider is shown below:

.. literalinclude:: ../../../doc_cn/ui/data_provider/mnist_provider.py
   :linenos:

The first line imports the PyDataProvider2 package. The main function is the
process function, which has two parameters. The first parameter is settings,
which is not used in this example. The second parameter is filename, which is
exactly one line of train.list. This parameter is passed to the process
function by PaddlePaddle.

:code:`@provider` is a Python
`Decorator <http://www.learnpython.org/en/Decorators>`_.
It sets some properties of the DataProvider and constructs a real PaddlePaddle
DataProvider from a very simple user-implemented python function. It does not
matter if you are not familiar with `Decorator`_. You can keep it simple by
just taking :code:`@provider` as a fixed mark above the provider function you
implemented.

`input_types`_ defines the data format that a DataProvider returns. In this
example, it is set to a 784-dimensional (28x28) dense vector and an integer
scalar whose value ranges from 0 to 9. `input_types`_ can be set to several
kinds of input formats; please refer to the documentation of `input_types`_
for more details.

The process method is the core part of constructing a real DataProvider in
PaddlePaddle. It implements how to open the text file, how to read one sample
from the original text file and convert it into `input_types`_, and how to
give it back to the PaddlePaddle process, at line 23. Note that the data
yielded by the process function must follow the same order in which
`input_types`_ are defined.

With the help of PyDataProvider2, the user can focus on how to generate ONE
training sample by using the keyword :code:`yield`. :code:`yield` is a python
keyword; a closely related concept is the :code:`generator`.
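
To make the flow concrete, here is a framework-free sketch of such a process
generator for the text format described above (one sample per line, label and
pixels separated by ';'); space-separated pixel values are an assumption, and
the real provider additionally carries the :code:`@provider` decorator:

```python
# Illustrative sketch only, not the actual PaddlePaddle provider.
# Format assumed: "<label>;<space-separated pixel floats>" per line.
def process(settings, filename):
    with open(filename) as f:
        for line in f:
            label, pixels = line.strip().split(';')
            # Yield one sample in the same order as input_types:
            # first the dense pixel vector, then the integer label.
            yield [float(v) for v in pixels.split()], int(label)
```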

Only a few lines of code need to be added to the training configuration file;
you can take this as an example.

.. literalinclude:: ../../../doc_cn/ui/data_provider/mnist_config.py

Here we specify the training data by 'train.list', and no testing data is
specified.

Now, this simple example of using PyDataProvider is finished. The only thing
the user needs to know is how to generate **one sample** from **one data
file**. PaddlePaddle does all of the rest:

* Form a training batch
* Shuffle the training data
* Read data with multithreading
* Cache the training data (optional)
* CPU-to-GPU double buffering

Is this cool?
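
The shuffling and batching steps above can be sketched, purely as an
illustration and not as PaddlePaddle's actual implementation, as a small
pool-based batcher (the pool size corresponds to the pool_size idea discussed
in the reference section):

```python
import random

def batches(sample_iter, batch_size, pool_size, rng=random.Random(0)):
    """Toy sketch: keep up to pool_size samples in memory, shuffle the
    pool, and emit fixed-size batches. Illustration only."""
    pool = []
    for sample in sample_iter:
        pool.append(sample)
        if len(pool) >= pool_size:
            rng.shuffle(pool)
            while len(pool) >= batch_size:
                yield pool[:batch_size]
                pool = pool[batch_size:]
    # Flush whatever remains at the end of the pass.
    rng.shuffle(pool)
    while pool:
        yield pool[:batch_size]
        pool = pool[batch_size:]
```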

DataProvider for the sequential model
-------------------------------------

A sequence model takes sequences as its input. A sequence is made up of
several timesteps. A so-called timestep does not necessarily have anything to
do with 'time'; it simply means that the order of the data is taken into
consideration in model design and training. For example, a sentence can be
interpreted as a kind of sequence data in NLP tasks.

Here is an example of a data provider for English sentiment classification
data. The original input data are simple English text, labeled as positive or
negative sentiment (marked by 0 and 1 respectively).

A small part of the original data is shown below as an example:

.. literalinclude:: ../../../doc_cn/ui/data_provider/sentimental_train.txt

The corresponding data provider is shown below:

.. literalinclude:: ../../../doc_cn/ui/data_provider/sentimental_provider.py

This data provider for a sequential model is a little more complex than the
one for the MNIST dataset. A new initialization method is introduced here.
The method :code:`on_init` is bound to the DataProvider by :code:`@provider`'s
:code:`init_hook` parameter, and it is invoked once the DataProvider is
initialized. The :code:`on_init` function has the following parameters:

* The first parameter is the settings object.
* The rest of the parameters are passed as keyword arguments. Some of them
  are passed by PaddlePaddle; see the reference for `init_hook`_.

The :code:`dictionary` object is a python dict passed from the trainer
configuration file; it maps each word string to a word id.

To pass these parameters to the DataProvider, the following lines should be
added to the trainer configuration file.

.. literalinclude:: ../../../doc_cn/ui/data_provider/sentimental_config.py

The definition is basically the same as in the MNIST example, except that
here we:

* Load the dictionary in this configuration
* Pass it as a parameter to the DataProvider

The `input_types`_ are configured in the :code:`on_init` method. This has the
same effect as configuring them through :code:`@provider`'s
:code:`input_types` parameter. However, here :code:`input_types` is set at
runtime, so we can set it to different types according to the input data. The
input of the neural network is a sequence of word ids, so we set
:code:`seq_type` to :code:`integer_value_sequence`.

During :code:`on_init`, we save the :code:`dictionary` variable to
:code:`settings`, and it is used later in :code:`process`. Note that the
settings parameter of the process function and of the on_init function are
the same object.

The basic processing logic is the same as in MNIST's :code:`process` method:
each sample in the data file is given back to the PaddlePaddle process.
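
As an illustration only, the core of such a process function can be sketched
without the framework; the tab-separated line format and the plain settings
object here are assumptions made for the sketch (see sentimental_provider.py
for the real implementation):

```python
# Illustrative sketch: map each word of a sentence to its id via the
# dictionary saved on settings, and yield a (word_id_sequence, label) pair.
# Line format assumed: "<label>\t<space-separated words>".
def process(settings, filename):
    with open(filename) as f:
        for line in f:
            label, text = line.strip().split('\t')
            word_ids = [settings.dictionary[w]
                        for w in text.split()
                        if w in settings.dictionary]
            yield word_ids, int(label)
```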

This concludes the basic usage of PyDataProvider. Please refer to the
following reference section for details.

Reference
---------

.. _@provider:

@provider
+++++++++

:code:`@provider` is a Python `Decorator`_ that constructs a PyDataProvider
in PaddlePaddle from a user-defined function. Its parameters are:

* `input_types`_ defines the format of the input data.
* should_shuffle defines whether to shuffle the data or not. By default, it
  is set to true during training and false during testing.
* pool_size is the memory pool size (in number of samples) in the
  DataProvider. -1 means no limit.
* can_over_batch_size defines whether PaddlePaddle may store a few more
  samples than pool_size. It is better to set it to True to avoid some
  deadlocks.
* calc_batch_size is a function that defines how to calculate the batch size.
  This is useful in sequential models, where the batch size may be counted in
  sequences or in tokens. By default, each sample or sequence counts as 1
  when calculating the batch size.
* cache is the data cache strategy; see `cache`_.
* init_hook is a function invoked once the data provider is initialized;
  see `init_hook`_.
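
For instance, a calc_batch_size function for a sequential model might count
tokens instead of sequences; these two illustrative functions sketch the
difference (the names are hypothetical):

```python
# Hypothetical calc_batch_size functions, sketching the two counting
# policies described above.
def count_by_sequence(word_ids):
    return 1              # one sequence contributes 1 to the batch size

def count_by_token(word_ids):
    return len(word_ids)  # one sequence contributes its number of tokens
```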

.. _input_types:

input_types
+++++++++++

PaddlePaddle has four data types and three sequence types. The four data
types are:

* dense_vector represents a dense float vector.
* sparse_binary_vector represents a sparse binary vector: most of the values
  are 0, and the non-zero elements are fixed to 1.
* sparse_float_vector represents a sparse float vector: most of the values
  are 0, and the non-zero elements can be any float value, given by the user.
* integer represents an integer scalar, typically used for a label or a word
  index.

The three sequence types are:

* SequenceType.NO_SEQUENCE means the sample is not a sequence.
* SequenceType.SEQUENCE means the sample is a sequence.
* SequenceType.SUB_SEQUENCE means it is a nested sequence: each timestep of
  the input sequence is itself a sequence.

Each input type has a different input format. The formats are shown in the
table below.

+----------------------+---------------------+-----------------------------------+------------------------------------------------+
|                      | NO_SEQUENCE         | SEQUENCE                          | SUB_SEQUENCE                                   |
+======================+=====================+===================================+================================================+
| dense_vector         | [f, f, ...]         | [[f, ...], [f, ...], ...]         | [[[f, ...], ...], [[f, ...], ...],...]         |
+----------------------+---------------------+-----------------------------------+------------------------------------------------+
| sparse_binary_vector | [i, i, ...]         | [[i, ...], [i, ...], ...]         | [[[i, ...], ...], [[i, ...], ...],...]         |
+----------------------+---------------------+-----------------------------------+------------------------------------------------+
| sparse_float_vector  | [(i,f), (i,f), ...] | [[(i,f), ...], [(i,f), ...], ...] | [[[(i,f), ...], ...], [[(i,f), ...], ...],...] |
+----------------------+---------------------+-----------------------------------+------------------------------------------------+
| integer_value        | i                   | [i, i, ...]                       | [[i, ...], [i, ...], ...]                      |
+----------------------+---------------------+-----------------------------------+------------------------------------------------+

where f represents a float value and i represents an integer value.
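
As plain Python literals, a few of these formats look like this (the concrete
values are illustrative):

```python
# Plain Python literals matching the table above (f = float, i = int).
dense_no_seq = [0.1, 0.9]                 # dense_vector, NO_SEQUENCE
dense_seq = [[0.1, 0.9], [0.3, 0.7]]      # dense_vector, SEQUENCE
sparse_bin = [2, 5, 8]                    # sparse_binary_vector: non-zero indices
sparse_float = [(2, 0.5), (8, 1.25)]      # sparse_float_vector: (index, value)
label = 4                                 # integer_value, NO_SEQUENCE
label_seq = [4, 1, 0]                     # integer_value, SEQUENCE
```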

.. _init_hook:
.. _settings:

init_hook
+++++++++

init_hook is a function that is invoked once the data provider is initialized.
Its parameters are as follows:

* The first parameter is the settings object, the same as :code:`settings`
  in the :code:`process` method. The object contains several attributes,
  including:

  * settings.input_types: the input types, see `input_types`_.
  * settings.logger: a logging object.

* The rest of the parameters are keyword arguments, made up of
  PaddlePaddle-predefined parameters and user-defined parameters.

  * PaddlePaddle-predefined parameters include:

    * is_train: a bool parameter indicating whether the DataProvider is used
      for training or testing.
    * file_list: the list of all files.

  * User-defined parameters (args) can be set in the training configuration.

Note that PaddlePaddle reserves the right to add predefined parameters, so
please use :code:`**kwargs` in init_hook to ensure compatibility by accepting
the parameters that your init_hook does not use.
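
A minimal on_init following this advice might look like the sketch below; the
settings object here is a plain stand-in for PaddlePaddle's real one:

```python
# Illustrative sketch of an init_hook. The dictionary argument is the
# user-defined parameter described above; **kwargs absorbs is_train,
# file_list, and any future predefined parameters for compatibility.
def on_init(settings, dictionary, **kwargs):
    settings.dictionary = dictionary  # used later by the process function
    # The real provider would also set settings.input_types here
    # (e.g. a sequence of word ids for the sentiment example).
```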

.. _cache:

cache
+++++

DataProvider provides two simple cache strategies:

* CacheType.NO_CACHE: do not cache any data; the data is read at runtime by
  the user-implemented python module in every pass.
* CacheType.CACHE_PASS_IN_MEM: the first pass reads the data through the
  user-implemented python module, and the following passes read the data
  directly from memory.
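
The CACHE_PASS_IN_MEM behaviour can be illustrated with a small
framework-free sketch (class and method names here are hypothetical):

```python
# Toy sketch of the CACHE_PASS_IN_MEM strategy: the first pass pulls
# samples from the user's reader and stores them; later passes replay the
# stored list without touching the reader again.
class PassInMemCache:
    def __init__(self, reader):
        self.reader = reader   # user-implemented sample generator factory
        self.cache = None

    def read_pass(self):
        if self.cache is None:
            self.cache = list(self.reader())  # first pass: read and cache
        return iter(self.cache)               # later passes: memory only
```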
