Now the question is: **Can the data be saved to the disk without changing the
data format?**

For this we need a **file format** that can easily store our **data structure**.

.. admonition:: Data type vs. data structure vs. file format
   :class: dropdown

   - **Data type:** Type of a single piece of data (integer, string,
     float, ...).
   - **Data structure:** How the data is organized in memory (individual
     columns, 2D-array, nested dictionaries, ...).
   - **File format:** How the data is organized when it is saved to the disk
     (columns of strings, block of binary data, ...).

   For example, a black and white image stored as a .png-file (**file format**)
   might be stored in memory as an NxM array (**data structure**) of integers
   (**data type**) with each entry representing the color value of the pixel.
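
To make the distinction concrete, here is a minimal sketch that uses only the
Python standard library. The tiny 2x3 "image" and both on-disk layouts are
illustrative assumptions, not part of the lesson's data; the point is only that
one data structure can be written in more than one file format:

```python
import csv
import io

# Data type: each pixel is an integer between 0 (black) and 255 (white).
# Data structure: a 2x3 "image" held in memory as a nested list
# (N rows, M columns).
image = [
    [0, 128, 255],
    [255, 128, 0],
]

# File format 1: text, one CSV row per image row.
text_buffer = io.StringIO()
csv.writer(text_buffer, lineterminator="\n").writerows(image)
as_text = text_buffer.getvalue()

# File format 2: binary, one byte per pixel in row-major order.
as_binary = bytes(pixel for row in image for pixel in row)

print(repr(as_text))   # '0,128,255\n255,128,0\n'
print(len(as_binary))  # 6
```

Note that the binary layout no longer records where one row ends and the next
begins, so a real binary format would also have to store the image shape.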

What to look for in a file format?
----------------------------------

When deciding which file format you should use for your program, you should
remember the following:

**There is no file format that is good for every use case.**

@@ -98,15 +106,23 @@ There are, indeed, various standard file formats for various use cases:

Source: `xkcd #927 <https://xkcd.com/927/>`__.

Usually, you'll want to consider the following things when choosing a file
format:

1. Is the file format good for my data structure (is it fast/space
   efficient/easy to use)?
2. Is everybody else / leading authorities in my field recommending a certain
   format?
3. Do I need a human-readable format or is it enough to work on it using code?
4. Do I want to archive / share the data or do I just want to store it while
   I'm working?

Pandas supports
`many file formats <https://pandas.pydata.org/docs/user_guide/io.html>`__
for tidy data and NumPy supports
`some file formats <https://numpy.org/doc/stable/reference/routines.io.html>`__
for array data. However, there are many other file formats that can be used
through other libraries.

The table below describes some data formats:

@@ -214,7 +230,8 @@ Table below describes some data formats:
- ❌ : Bad


A more in-depth analysis of the file formats mentioned above can be found
:doc:`here <data-formats>`.

Pros and cons
-------------
@@ -224,90 +241,113 @@ Let's have a general look at pros and cons of some types of file formats
Binary file formats
~~~~~~~~~~~~~~~~~~~

Good things
+++++++++++

- Can represent floating point numbers with full precision.
- Can potentially save lots of space, especially when storing numbers.
- Data reading and writing is usually much faster than loading from text files,
  since the format contains information about the data structure, and thus
  memory allocation can be done more efficiently.
- More explicit specification for storing multiple data sets and metadata in
  the same file.
- Many binary formats allow for partial loading of the data.
  This makes it possible to work with datasets that are larger than your
  computer's memory.
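
The precision and partial-loading points above can be illustrated with a bare
binary file written through the standard-library ``struct`` module. This is
only a sketch under simplifying assumptions: the file name ``values.bin`` and
the fixed record layout (little-endian 64-bit floats, no header) are made up
for the example, and real binary formats add headers and metadata on top of
this idea:

```python
import os
import struct
import tempfile

values = [0.1 * i for i in range(1000)]

# Write the numbers as raw little-endian 64-bit floats, 8 bytes each.
path = os.path.join(tempfile.mkdtemp(), "values.bin")
with open(path, "wb") as f:
    f.write(struct.pack(f"<{len(values)}d", *values))

# Partial loading: jump straight to entry 500 and read only that number.
index = 500
with open(path, "rb") as f:
    f.seek(index * 8)
    (value,) = struct.unpack("<d", f.read(8))

print(value == values[index])  # True: the bit pattern is preserved exactly
print(os.path.getsize(path))   # 8000
```

Because every record has the same size, the position of any entry can be
computed without reading the entries before it.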

Bad things
++++++++++

- Commonly requires the use of a specific library to read and write the data.
- Library specific formats can be version dependent.
- Not human readable.
- Sharing can be more difficult (requires some expertise to be able to
  read the data).
- Might require more documentation efforts.

Textual formats
~~~~~~~~~~~~~~~

Good things
+++++++++++

- Human readable.
- Easy to check for (structural) errors.
- Supported by many tools out of the box.
- Easily shared.

Bad things
++++++++++

- Can be slow to read and write.
- High potential to increase required disk space substantially (e.g. when
  storing floating point numbers as text).
- Prone to losing precision when storing floating point numbers.
- Multi-dimensional data can be hard to represent.
- While the data format might be specified, the data structure might not be
  clear when starting to read the data.
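
The disk-space and precision points above can be seen with a single number.
The sketch below uses only built-in string formatting; the choice of six
decimals is an assumption standing in for whatever format string a text export
happens to use:

```python
import math

value = math.pi

# A commonly used fixed number of decimals silently drops information.
short_text = f"{value:.6f}"
print(short_text)                  # 3.141593
print(float(short_text) == value)  # False

# repr() writes enough digits to round-trip exactly, but needs more characters
# than the 8 bytes a 64-bit float occupies in a binary file.
full_text = repr(value)
print(full_text)                   # 3.141592653589793
print(float(full_text) == value)   # True
print(len(full_text))              # 17
```

So a text file either loses digits or spends roughly twice the space per
number, before counting separators and line breaks.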

Further considerations
~~~~~~~~~~~~~~~~~~~~~~

- The closer your stored data is to the code, the more likely it depends on the
  environment you are working in. If you, for example, ``pickle`` a generated
  model, you can only be sure that the model will work as intended if you load
  it in an environment that has the same versions of all libraries the model
  depends on.
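
One possible way to make this dependency visible is to store a note about the
environment next to the pickled object. The helper names
``dump_with_environment`` and ``load_with_check`` below are hypothetical, and
the sketch records only the Python version; a real project would likely also
record the versions of the libraries the model depends on:

```python
import pickle
import platform


def dump_with_environment(obj):
    """Pickle ``obj`` together with a note about the environment that made it."""
    payload = {
        "python_version": platform.python_version(),
        "object": obj,
    }
    return pickle.dumps(payload)


def load_with_check(data):
    """Unpickle and report whether the Python version matches the stored one."""
    payload = pickle.loads(data)
    same_version = payload["python_version"] == platform.python_version()
    return payload["object"], same_version


model = {"weights": [0.5, -1.25], "bias": 0.1}  # stand-in for a real model
restored, same_version = load_with_check(dump_with_environment(model))
print(restored == model, same_version)          # True True
```

A matching version note does not guarantee that the model behaves identically,
but a mismatch is a cheap early warning.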


Exercise
--------

.. challenge::

   You have a model that you have been training for a while.
   Let's assume it's a relatively simple neural network (consisting of a
   network structure and its associated weights).

   Let's consider two scenarios:

   A: You have a different project that is supposed to take this model and
   do some processing with it to determine its efficiency after different
   times of training.

   B: You want to publish the model and make it available to others.

   What are good options to store the model in each of these scenarios?

.. solution::

   A:

      Some export into a binary format that can be easily read, e.g. pickle
      or a specific export function from the library you use.

      It also depends on whether you intend to make the intermediary steps
      available to others. If you do, you might also want to consider storing
      structure and weights separately or use a format specific for the
      type of model you are training to keep the data independent of the
      library.

   B:

      You might want to consider a more general format that is supported by
      many libraries, e.g. ONNX, or a format that is specifically designed
      for the type of model you are training.

      You might also want to consider additionally storing the model in a way
      that is easily readable by humans, to make it easier for others to
      understand the model.
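
The idea of storing structure and weights separately, mentioned in the solution
above, could look roughly like the following sketch. The toy model, the file
names ``structure.json`` and ``weights.bin``, and the flat weight layout are
all assumptions made for illustration; they do not correspond to the export
format of any particular machine learning library:

```python
import array
import json
import os
import tempfile

# Hypothetical toy model: layer sizes (structure) and a flat list of weights.
structure = {"layers": [2, 3, 1], "activation": "relu"}
weights = array.array("d", [0.1, -0.2, 0.3, 0.4, -0.5, 0.6, 0.7, -0.8, 0.9])

folder = tempfile.mkdtemp()
structure_path = os.path.join(folder, "structure.json")
weights_path = os.path.join(folder, "weights.bin")

# Structure: small and worth reading by eye, so store it as text.
with open(structure_path, "w") as f:
    json.dump(structure, f, indent=2)

# Weights: many numbers, so store them as raw 64-bit floats
# (in the byte order of the machine that writes the file).
with open(weights_path, "wb") as f:
    weights.tofile(f)

# Loading needs only the standard library, not the training framework.
with open(structure_path) as f:
    loaded_structure = json.load(f)
loaded_weights = array.array("d")
with open(weights_path, "rb") as f:
    loaded_weights.fromfile(f, len(weights))

print(loaded_structure == structure, loaded_weights == weights)  # True True
```

The human-readable structure file doubles as documentation, while the weights
stay compact and exact.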


Efficient use of untidy data
----------------------------

Many data analysis tools (like Pandas) require tidy data, but some data is not
in a suitable format. What we have often seen in the past is that people then
do not use these powerful tools, but instead write complex scripts that extract
individual pieces from the data each time they need to do a calculation.

Example of a "questionable pipeline":
length_array = []
@@ -330,13 +370,22 @@ Things to remember
Things to remember
------------------

1. **There is no file format that is good for every use case.**
2. Usually, your research question determines which libraries you want to use
   to solve it. Similarly, the data format you have determines the file format
   you want to use.
3. However, if you're using a previously existing framework or tools or you
   work in a specific field, you should prioritize using the formats that are
   used in said framework/tools/field.
4. When you're starting your project, it's a good idea to take your initial
   data, clean it, and store the results in a good binary format that works as
   a starting point for your future analysis. If you've written the cleaning
   procedure as a script, you can always reproduce it.
5. Throughout your work, you should use code to turn important data to
   a human-readable format (e.g. plots, averages,
   :meth:`pandas.DataFrame.head`), not to keep your full data in a
   human-readable format.
6. Once you've finished, you should store the data in a format that can be
   easily shared with other people.