How to use PyDataProvider2
==========================

We highly recommend that users use PyDataProvider2 to provide training or
testing data to PaddlePaddle. By using PyDataProvider2, the user only needs to
focus on how to read a single sample from the original data file, leaving all
of the trivial work, including transferring data into CPU/GPU memory,
shuffling, and binary serialization, to PyDataProvider2. PyDataProvider2 uses
multithreading and a fascinating but simple cache strategy to optimize the
efficiency of the data providing process.

DataProvider for the non-sequential model
-----------------------------------------

Here we use the MNIST handwriting recognition data as an example to illustrate
how to write a simple PyDataProvider.

MNIST is a handwriting classification data set. It contains 70,000 grayscale
images of handwritten digits. The label of each training sample ranges from 0
to 9. All the images have been size-normalized and centered to the same size
of 28 x 28 pixels.

A small part of the original data, shown as an example, can be found in the
path below:

.. literalinclude:: ../../../doc_cn/ui/data_provider/mnist_train.txt

Each line of the data contains two parts, separated by ';'. The first part is
the label of the image. The second part contains the 28x28 pixel float values.

Write the path of the above data file into train.list. It looks like this:

.. literalinclude:: ../../../doc_cn/ui/data_provider/train.list

The corresponding dataprovider can be found in the path below:

.. literalinclude:: ../../../doc_cn/ui/data_provider/mnist_provider.py
   :linenos:

The first line imports the PyDataProvider2 package.
The main function is the process function, which has two parameters.
The first parameter is settings, which is not used in this example.
The second parameter is the filename, which is exactly one line of train.list.
This parameter is passed to the process function by PaddlePaddle.

:code:`@provider` is a Python
`Decorator <http://www.learnpython.org/en/Decorators>`_ .
It sets some properties of the DataProvider and constructs a real PaddlePaddle
DataProvider from a very simple user-implemented Python function. It does not
matter if you are not familiar with `Decorator`_. You can keep it simple by
just taking :code:`@provider` as a fixed mark above the provider function you
implemented.

`input_types`_ defines the data format that a DataProvider returns.
In this example, it is set to a 28x28-dimensional dense vector and an integer
scalar whose value ranges from 0 to 9.
`input_types`_ can be set to several kinds of input formats; please refer to
the documentation of `input_types`_ for more details.


The process method is the core part of constructing a real DataProvider in
PaddlePaddle. It implements how to open the text file, how to read one sample
from the original text file, convert it into `input_types`_, and give it back
to the PaddlePaddle process at line 23.
Note that the data yielded by the process function must follow the same order
in which `input_types`_ are defined.
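
To make the structure concrete, here is a minimal sketch of what such a
provider could look like. The real code is the mnist_provider.py included
above; the exact field parsing here (splitting the pixel part on spaces) is an
assumption based on the data description.

.. code-block:: python

    from paddle.trainer.PyDataProvider2 import provider, dense_vector, integer_value

    # Declare the two inputs: a 784-dimensional dense vector and a label in [0, 9].
    @provider(input_types=[dense_vector(28 * 28), integer_value(10)])
    def process(settings, filename):  # settings is not used in this example
        with open(filename, 'r') as f:
            for line in f:
                # Each line is "label;pixel pixel ...", as described above.
                label, pixels = line.split(';')
                pixels_float = [float(x) for x in pixels.split()]
                # Yield one sample in the same order as input_types.
                yield pixels_float, int(label)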


With the help of PyDataProvider2, the user can focus on how to generate ONE
training sample by using the keyword :code:`yield`.
:code:`yield` is a Python keyword; a closely related concept is the
:code:`generator`.

Only a few lines of code need to be added to the training configuration file;
you can take this as an example.

.. literalinclude:: ../../../doc_cn/ui/data_provider/mnist_config.py

Here we specify training data by 'train.list', and no testing data is
specified.
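
For reference, the included mnist_config.py wires the provider into the
trainer with :code:`define_py_data_sources2`. A minimal sketch, in which the
layer names and sizes are illustrative assumptions, might look like this:

.. code-block:: python

    from paddle.trainer_config_helpers import *

    # Point PaddlePaddle at train.list and the provider module/function above.
    define_py_data_sources2(train_list='train.list',
                            test_list=None,
                            module='mnist_provider',
                            obj='process')

    # The data layers should match the order and sizes declared in input_types.
    img = data_layer(name='pixel', size=28 * 28)
    label = data_layer(name='label', size=10)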

Now, this simple example of using PyDataProvider is finished.
The only thing that the user needs to know is how to generate **one sample**
from **one data file**.
PaddlePaddle will do all of the rest:

* Form a training batch
* Shuffle the training data
* Read data with multithreading
* Cache the training data (optional)
* CPU-to-GPU double buffering

Is this cool?

DataProvider for the sequential model
-------------------------------------
A sequence model takes sequences as its input. A sequence is made up of
several timesteps. A timestep does not necessarily have anything to do with
'time'; it can also be understood as meaning that the order of the data is
taken into consideration in model design and training.
For example, a sentence can be interpreted as a kind of sequence data in NLP
tasks.

Here is an example of a data provider for English sentiment classification
data. The original input data are simple English text, labeled as positive or
negative sentiment (marked by 0 and 1 respectively).

A small part of the original data, shown as an example, can be found in the
path below:

.. literalinclude:: ../../../doc_cn/ui/data_provider/sentimental_train.txt

The corresponding data provider can be found in the path below:

.. literalinclude:: ../../../doc_cn/ui/data_provider/sentimental_provider.py

This data provider for the sequential model is a little more complex than the
one for the MNIST dataset.
A new initialization method is introduced here.
The method :code:`on_init` is attached to the DataProvider by :code:`@provider`'s
:code:`init_hook` parameter, and it is invoked once the DataProvider is
initialized. The :code:`on_init` function has the following parameters
(a sketch of such a provider follows this list):

* The first parameter is the settings object.
* The rest of the parameters are passed as keyword arguments. Some of them are
  passed by PaddlePaddle; see the reference for `init_hook`_.
  The :code:`dictionary` object is a Python dict object passed from the trainer
  configuration file, and it maps word strings to word ids.
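
A minimal sketch of such a provider is shown below. It follows the structure
of the included sentimental_provider.py; the exact field separator of the text
file and the dictionary lookup policy are assumptions.

.. code-block:: python

    from paddle.trainer.PyDataProvider2 import (provider, integer_value,
                                                integer_value_sequence)

    def on_init(settings, dictionary, **kwargs):
        # Save the dictionary passed from the trainer configuration so that
        # process() can map word strings to word ids.
        settings.dictionary = dictionary
        # Declare input types at runtime: a word-id sequence and a 0/1 label.
        settings.input_types = [
            integer_value_sequence(len(dictionary)),
            integer_value(2)
        ]

    @provider(init_hook=on_init)
    def process(settings, filename):
        with open(filename, 'r') as f:
            for line in f:
                # Assumed format: "label<TAB>sentence"; adjust to the real data.
                label, sentence = line.strip().split('\t')
                word_ids = [settings.dictionary[w] for w in sentence.split()
                            if w in settings.dictionary]
                yield word_ids, int(label)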

To pass these parameters into the DataProvider, the following lines should be
added to the trainer configuration file.

.. literalinclude:: ../../../doc_cn/ui/data_provider/sentimental_config.py

The definition is basically the same as in the MNIST example, except
(a configuration sketch follows this list):

* Load the dictionary in this configuration
* Pass it as a parameter to the DataProvider
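
A sketch of the corresponding configuration is shown below. The dictionary
file name and its format are assumptions; the essential part is passing the
dictionary to the provider through :code:`args`.

.. code-block:: python

    from paddle.trainer_config_helpers import *

    # Build the word -> id dictionary (file name and format are assumed here).
    dictionary = dict()
    with open('dict.txt', 'r') as f:
        for i, word in enumerate(f):
            dictionary[word.strip()] = i

    # Pass the dictionary to on_init through the args parameter.
    define_py_data_sources2(train_list='train.list',
                            test_list=None,
                            module='sentimental_provider',
                            obj='process',
                            args={'dictionary': dictionary})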

The `input_types`_ are configured in the :code:`on_init` method. This has the
same effect as configuring them via :code:`@provider`'s :code:`input_types`
parameter. However, here :code:`input_types` is set at runtime, so we can set
it to different types according to the input data. The input of the neural
network is a sequence of word ids, so :code:`seq_type` is set to
:code:`integer_value_sequence`.

During :code:`on_init`, we save the :code:`dictionary` variable to
:code:`settings`, and it will be used in :code:`process`. Note that the
settings parameter of the process function and that of the on_init function
are the same object.

The basic processing logic is the same as in MNIST's :code:`process` method.
Each sample in the data file is given back to the PaddlePaddle process.

This covers the basic usage of PyDataProvider.
Please refer to the following reference section for details.

Reference
---------

.. _@provider:

@provider
+++++++++

:code:`@provider` is a Python `Decorator`_ that constructs a PyDataProvider in
PaddlePaddle from a user-defined function. Its parameters are listed below;
a short usage sketch follows the list.

* `input_types`_ defines the format of the data input.
* should_shuffle defines whether to shuffle data or not. By default, it is set
  to true during training and false during testing.
* pool_size is the memory pool size (in number of samples) in the DataProvider.
  -1 means no limit.
* can_over_batch_size defines whether PaddlePaddle can store a few more samples
  than pool_size. It is better to set it to True to avoid some deadlocks.
* calc_batch_size is a function that defines how to calculate the batch size.
  This is useful in sequential models, where the batch size can be counted in
  sequences or tokens. By default, each sample or sequence counts as 1 when
  calculating the batch size.
* cache is the data cache strategy, see `cache`_.
* init_hook is a function invoked once the data provider is initialized,
  see `init_hook`_.
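
The sketch below combines several of these parameters. It assumes that
:code:`calc_batch_size` receives one yielded sample and returns the size that
sample contributes to a batch; the dimensions, pool size, and data format are
illustrative.

.. code-block:: python

    from paddle.trainer.PyDataProvider2 import (provider, integer_value,
                                                integer_value_sequence)

    def count_tokens(sample):
        # Count a sequence sample by its number of tokens instead of by 1.
        # (Assumes the first element of the yielded sample is the word-id list.)
        return len(sample[0])

    @provider(input_types=[integer_value_sequence(10000), integer_value(2)],
              should_shuffle=True,
              pool_size=1024,
              can_over_batch_size=True,
              calc_batch_size=count_tokens)
    def process(settings, filename):
        with open(filename, 'r') as f:
            for line in f:
                # Assumed format: "label<TAB>id id id ..."
                label, ids = line.strip().split('\t')
                yield [int(w) for w in ids.split()], int(label)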

.. _input_types:

input_types
+++++++++++

PaddlePaddle has four data types and three sequence types.
The four data types are:

* dense_vector represents a dense float vector.
* sparse_binary_vector represents a sparse binary vector: most of the values
  are 0, and the non-zero elements are fixed to 1.
* sparse_float_vector represents a sparse float vector: most of the values
  are 0, and the non-zero elements can be any float value given by the user.
* integer_value represents an integer scalar, which is especially used for
  labels or word indices.


The three sequence types are:

* SequenceType.NO_SEQUENCE means the sample is not a sequence.
* SequenceType.SEQUENCE means the sample is a sequence.
* SequenceType.SUB_SEQUENCE means it is a nested sequence, in which each
  timestep of the input sequence is also a sequence.

Different input types have different input formats. Their formats are shown
in the table below.

+----------------------+---------------------+-----------------------------------+------------------------------------------------+
|                      | NO_SEQUENCE         | SEQUENCE                          | SUB_SEQUENCE                                   |
+======================+=====================+===================================+================================================+
| dense_vector         | [f, f, ...]         | [[f, ...], [f, ...], ...]         | [[[f, ...], ...], [[f, ...], ...],...]         |
+----------------------+---------------------+-----------------------------------+------------------------------------------------+
| sparse_binary_vector | [i, i, ...]         | [[i, ...], [i, ...], ...]         | [[[i, ...], ...], [[i, ...], ...],...]         |
+----------------------+---------------------+-----------------------------------+------------------------------------------------+
| sparse_float_vector  | [(i,f), (i,f), ...] | [[(i,f), ...], [(i,f), ...], ...] | [[[(i,f), ...], ...], [[(i,f), ...], ...],...] |
+----------------------+---------------------+-----------------------------------+------------------------------------------------+
| integer_value        | i                   | [i, i, ...]                       | [[i, ...], [i, ...], ...]                      |
+----------------------+---------------------+-----------------------------------+------------------------------------------------+

where f represents a float value and i represents an integer value.
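
In the Python API, these types are declared with helper functions from
PyDataProvider2. A brief sketch (the dimensions are arbitrary, and only one of
the sequence variants is shown):

.. code-block:: python

    from paddle.trainer.PyDataProvider2 import (dense_vector,
                                                sparse_binary_vector,
                                                sparse_float_vector,
                                                integer_value,
                                                integer_value_sequence)

    input_types = [
        dense_vector(784),              # NO_SEQUENCE dense vector: [f, f, ...]
        sparse_binary_vector(10000),    # indices of the non-zero (=1) elements
        sparse_float_vector(10000),     # (index, value) pairs
        integer_value(10),              # a single integer label
        integer_value_sequence(10000),  # SEQUENCE of integers: [i, i, ...]
    ]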

.. _init_hook:
.. _settings:

init_hook
+++++++++

init_hook is a function that is invoked once the data provider is initialized.
Its parameters are listed as follows:

* The first parameter is the settings object, which is the same as
  :code:`settings` in the :code:`process` method. The object contains several
  attributes, including:

  * settings.input_types: the input types, see `input_types`_.
  * settings.logger: a logging object.

* The rest of the parameters are keyword arguments. They are made up of
  PaddlePaddle pre-defined parameters and user-defined parameters.

  * PaddlePaddle-defined parameters include:

    * is_train is a bool parameter that indicates whether the DataProvider is
      used for training or testing.
    * file_list is the list of all files.

  * User-defined parameters (args) can be set in the training configuration.

Note that PaddlePaddle reserves the right to add pre-defined parameters, so
please use :code:`**kwargs` in init_hook to ensure compatibility by accepting
the parameters which your init_hook does not use.
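
A minimal init_hook following this advice could look like the sketch below;
:code:`dictionary` stands for a user-defined argument passed via :code:`args`.

.. code-block:: python

    def on_init(settings, is_train, file_list, dictionary, **kwargs):
        # is_train and file_list are supplied by PaddlePaddle; dictionary is a
        # user-defined argument. **kwargs absorbs any parameters added later.
        settings.dictionary = dictionary
        settings.logger.info('init with %d files, is_train=%s' %
                             (len(file_list), is_train))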

.. _cache:

cache
+++++
DataProvider provides two simple cache strategies
(a usage sketch follows this list):

* CacheType.NO_CACHE means no data is cached; the data is read at runtime by
  the user-implemented Python module in every pass.
* CacheType.CACHE_PASS_IN_MEM means the first pass reads data through the
  user-implemented Python module, and the following passes read the data
  directly from memory.
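
Selecting a cache strategy is just another :code:`@provider` argument. A
sketch, reusing the MNIST provider shape from above:

.. code-block:: python

    from paddle.trainer.PyDataProvider2 import (provider, CacheType,
                                                dense_vector, integer_value)

    # Read the text file once; later passes are served from memory.
    @provider(input_types=[dense_vector(28 * 28), integer_value(10)],
              cache=CacheType.CACHE_PASS_IN_MEM)
    def process(settings, filename):
        with open(filename, 'r') as f:
            for line in f:
                label, pixels = line.split(';')
                yield [float(x) for x in pixels.split()], int(label)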