Commit 571e74c

work-with-data: Internal cleanup (mainly splitting wrapping lines).
1 parent ddca570 commit 571e74c

1 file changed

+116
-67
lines changed

content/work-with-data.rst

Lines changed: 116 additions & 67 deletions
Original file line number | Diff line number | Diff line change
@@ -8,7 +8,7 @@ Working with Data
88

99
.. objectives::
1010

11-
- Learn benefits/drawbacks of common data formats.
11+
- Learn benefits/drawbacks of common data formats.
1212
- Learn how you can read and write data in a variety of formats.
1313

1414

@@ -22,8 +22,10 @@ What is a data format?
2222

2323
Data format can mean two different things
2424

25-
1. `data structure <https://en.wikipedia.org/wiki/Data_structure>`__ or how you're storing the data in memory while you're working on it;
26-
2. `file format <https://en.wikipedia.org/wiki/File_format>`__ or the way you're storing the data in the disk.
25+
1. `data structure <https://en.wikipedia.org/wiki/Data_structure>`__ or how
26+
you're storing the data in memory while you're working on it;
27+
2. `file format <https://en.wikipedia.org/wiki/File_format>`__ or the way you're
28+
storing the data on the disk.
2729

2830
Let's consider this randomly generated DataFrame with various columns::
2931

@@ -44,8 +46,8 @@ Let's consider this randomly generated DataFrame with various columns::
4446
dataset.info()
4547

4648
This DataFrame is structured in the *tidy data* format.
47-
In tidy data we have multiple columns of data that are collected in a Pandas DataFrame, where each column
48-
represents a value of a specific type.
49+
In tidy data we have multiple columns of data that are collected in a Pandas
50+
DataFrame, where each column represents a value of a specific type.
4951

5052
.. image:: img/pandas/tidy_data.png
5153

@@ -58,33 +60,39 @@ Let's consider another example::
5860

5961

6062
Here we have a different data structure: we have a two-dimensional array of numbers.
61-
This is different to a Pandas DataFrame as data is stored as one contiguous block instead of individual columns.
62-
This also means that the whole array must have one data type.
63+
This is different to a Pandas DataFrame as data is stored as one contiguous block
64+
instead of individual columns. This also means that the whole array must have one
65+
data type.
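The difference between the two data structures can be checked directly. The following is a minimal sketch; the column names and values are made up for illustration and are not the dataset used in the lesson.

```python
import numpy as np
import pandas as pd

# Hypothetical tidy data: every column holds values of one specific type.
df = pd.DataFrame({
    "name": ["a", "b", "c"],
    "count": [1, 2, 3],
    "value": [0.5, 1.5, 2.5],
})
# Each column keeps its own data type.
column_dtypes = {col: str(dtype) for col, dtype in df.dtypes.items()}

# A NumPy array is one contiguous block with a single data type:
# mixing integers and floats upcasts every element to float.
array = np.array([[1, 2.5], [3, 4]])

print(column_dtypes)
print(array.dtype, array.flags["C_CONTIGUOUS"])
```

The DataFrame reports a separate type per column, while the array reports one type for all of its elements.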
6366

6467

6568
.. figure:: https://github.com/elegant-scipy/elegant-scipy/raw/master/figures/NumPy_ndarrays_v2.png
6669

6770
Source: `Elegant Scipy <https://github.com/elegant-scipy/elegant-scipy>`__
6871

69-
Now the question is: **Can the data be saved to the disk without changing the data format?**
72+
Now the question is: **Can the data be saved to the disk without changing the
73+
data format?**
7074

7175
For this we need a **file format** that can easily store our **data structure**.
7276

7377
.. admonition:: Data type vs. data structure vs. file format
7478
:class: dropdown
7579

76-
- **Data type:** Type of a single piece of data (integer, string, float, ...).
77-
- **Data structure:** How the data is organized in memory (individual columns, 2D-array, nested dictionaries, ...).
78-
- **File format:** How the data is organized when it is saved to the disk (columns of strings, block of binary data, ...).
80+
- **Data type:** Type of a single piece of data (integer, string,
81+
float, ...).
82+
- **Data structure:** How the data is organized in memory (individual
83+
columns, 2D-array, nested dictionaries, ...).
84+
- **File format:** How the data is organized when it is saved to the disk
85+
(columns of strings, block of binary data, ...).
7986

8087
For example, a black and white image stored as a .png-file (**file format**)
81-
might be stored in memory as an NxM array (**data structure**) of integers (**data type**) with each entry representing
82-
the color value of the pixel.
88+
might be stored in memory as an NxM array (**data structure**) of integers
89+
(**data type**) with each entry representing the color value of the pixel.
8390

8491
What to look for in a file format?
8592
----------------------------------
8693

87-
When deciding which file format you should use for your program, you should remember the following:
94+
When deciding which file format you should use for your program, you should
95+
remember the following:
8896

8997
**There is no file format that is good for every use case.**
9098

@@ -98,15 +106,23 @@ There are, indeed, various standard file formats for various use cases:
98106

99107
Source: `xkcd #927 <https://xkcd.com/927/>`__.
100108

101-
Usually, you'll want to consider the following things when choosing a file format:
109+
Usually, you'll want to consider the following things when choosing a file
110+
format:
102111

103-
1. Is the file format good for my data structure (is it fast/space efficient/easy to use)?
104-
2. Is everybody else / leading authorities in my field recommending a certain format?
112+
1. Is the file format good for my data structure (is it fast/space
113+
efficient/easy to use)?
114+
2. Is everybody else / leading authorities in my field recommending a certain
115+
format?
105116
3. Do I need a human-readable format or is it enough to work on it using code?
106-
4. Do I want to archive / share the data or do I just want to store it while I'm working?
117+
4. Do I want to archive / share the data or do I just want to store it while
118+
I'm working?
107119

108-
Pandas supports `many file formats <https://pandas.pydata.org/docs/user_guide/io.html>`__ for tidy data and Numpy supports `some file formats <https://numpy.org/doc/stable/reference/routines.io.html>`__ for array data.
109-
However, there are many other file formats that can be used through other libraries.
120+
Pandas supports
121+
`many file formats <https://pandas.pydata.org/docs/user_guide/io.html>`__
122+
for tidy data and Numpy supports
123+
`some file formats <https://numpy.org/doc/stable/reference/routines.io.html>`__
124+
for array data. However, there are many other file formats that can be used
125+
through other libraries.
110126

111127
The table below describes some data formats:
112128

@@ -214,7 +230,8 @@ Table below describes some data formats:
214230
- ❌ : Bad
215231

216232

217-
A More in depth analysis of the file formats mentioned above, can be found `here <work-with-data>``
233+
A more in-depth analysis of the file formats mentioned above can be found
234+
:doc:`here <data-formats>`.
218235

219236
Pros and cons
220237
-------------
@@ -224,90 +241,113 @@ Let's have a general look at pros and cons of some types of file formats
224241
Binary File formats
225242
~~~~~~~~~~~~~~~~~~~
226243

227-
Good things
244+
Good things
228245
+++++++++++
229246

230247
- Can represent floating point numbers with full precision.
231248
- Can potentially save lots of space, especially when storing numbers.
232-
- Data reading and writing is usually much faster than loading from text files, since the format contains information
233-
about the data structure, and thus memory allocation can be done more efficiently.
234-
- More explicit specification for storing multiple data sets and metadata in the same file.
249+
- Reading and writing data is usually much faster than with text files,
250+
since the format contains information
251+
about the data structure, and thus memory allocation can be done more
252+
efficiently.
253+
- More explicit specification for storing multiple data sets and metadata in
254+
the same file.
235255
- Many binary formats allow for partial loading of the data.
236-
This makes it possible to work with datasets that are larger than your computer's memory.
256+
This makes it possible to work with datasets that are larger than your
257+
computer's memory.
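Partial loading, as mentioned in the last point above, can be sketched with NumPy's own ``.npy`` format and memory mapping. The file name and array size here are arbitrary choices for illustration; other binary formats offer partial loading through their own libraries.

```python
import os
import tempfile

import numpy as np

with tempfile.TemporaryDirectory() as tmpdir:
    # Hypothetical file name, written to a temporary directory.
    path = os.path.join(tmpdir, "big_array.npy")
    np.save(path, np.arange(100_000, dtype=np.float64))

    # mmap_mode="r" maps the file instead of reading it fully into memory.
    mapped = np.load(path, mmap_mode="r")
    is_memmap = isinstance(mapped, np.memmap)

    # Only the requested slice is actually read from the disk.
    first_chunk = np.array(mapped[:10])

    # Release the mapping before the temporary directory is removed.
    del mapped

print(is_memmap, first_chunk)
```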
237258

238259
Bad things
239260
++++++++++
240261

241-
- Commonly requires the use of a specific library to read and write the data
242-
- Library specific formats can be version dependent
243-
- Not human readable
244-
- Sharing can be more difficult ( requires some expertise to be able to read the data )
245-
- Might require more documentation efforts
262+
- Commonly requires the use of a specific library to read and write the data.
263+
- Library-specific formats can be version-dependent.
264+
- Not human readable.
265+
- Sharing can be more difficult (requires some expertise to be able to
266+
read the data).
267+
- Might require more documentation efforts.
246268

247269
Textual formats
248270
~~~~~~~~~~~~~~~
249271

250272
Good things
251273
+++++++++++
252274

253-
- Human readable
254-
- Easy to check for (structural) errors
255-
- Supported by many tool out of the box
256-
- Easily shared
275+
- Human readable.
276+
- Easy to check for (structural) errors.
277+
- Supported by many tools out of the box.
278+
- Easily shared.
257279

258280
Bad things
259281
++++++++++
260282

261-
- Can be slow to read and write
262-
- high potential to increase required disk space substantially (e.g. when storing floating point numbers as text)
263-
- Prone to loosing precision when storing floating point numbers
264-
- Muli-dimensional data can be hard to represent
265-
- While the data format might be specified, the data structure might not be clear when starting to read the data.
283+
- Can be slow to read and write.
284+
- High potential to increase required disk space substantially (e.g. when
285+
storing floating point numbers as text).
286+
- Prone to losing precision when storing floating point numbers.
287+
- Multi-dimensional data can be hard to represent.
288+
- While the data format might be specified, the data structure might not be
289+
clear when starting to read the data.
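The precision point above can be illustrated with a small round trip. This is a sketch under one assumption: the text file is written with a fixed six-decimal format, which is a choice made here for illustration (text writers can be configured to emit more digits, at the cost of larger files).

```python
import io

import numpy as np

values = np.array([0.1, 1.0 / 3.0, np.pi])

# Text round trip with six decimals: digits beyond that are dropped.
text_buffer = io.StringIO()
np.savetxt(text_buffer, values, fmt="%.6f")
text_buffer.seek(0)
from_text = np.loadtxt(text_buffer)

# Binary round trip: the bytes of each float are stored unchanged.
binary_buffer = io.BytesIO()
np.save(binary_buffer, values)
binary_buffer.seek(0)
from_binary = np.load(binary_buffer)

text_is_exact = np.array_equal(values, from_text)
binary_is_exact = np.array_equal(values, from_binary)
print(text_is_exact, binary_is_exact)
```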
266290

267291
Further considerations
268292
~~~~~~~~~~~~~~~~~~~~~~
269293

270-
- The closer your stored data is to the code, the more likely it depends on the environment you are working in.
271-
If you e.g. `pickle` a generated model, you can only be sure, that the model will work as intended, if you
272-
load it in an environment, that has the same versions of all libraries the model depends on.
294+
- The closer your stored data is to the code, the more likely it depends on the
295+
environment you are working in. If you, for example, ``pickle`` a generated model,
296+
you can only be sure that the model will work as intended if you load it in
297+
an environment that has the same versions of all libraries the model depends
298+
on.
273299

274300

275301
Exercise
276302
--------
277303

278304
.. challenge::
279305

280-
You have a model that you have been training for a while.
281-
Lets assume it's a relatively simple neural network (consisting of a network structure and it's associated weights).
282-
306+
You have a model that you have been training for a while.
307+
Let's assume it's a relatively simple neural network (consisting of a
308+
network structure and its associated weights).
309+
283310
Let's consider two scenarios:
284311

285-
A: You have a different project, that is supposed to take this model, and do some processing with it to determine
286-
it's efficiency after different times of training.
312+
A: You have a different project that is supposed to take this model and
313+
do some processing with it to determine its efficiency after different
314+
times of training.
287315

288-
B: You want to publish the model and make it available to others.
316+
B: You want to publish the model and make it available to others.
289317

290318
What are good options to store the model in each of these scenarios?
291319

292320
.. solution::
293321

294-
A: Some export into a binary format that can be easily read. E.g. pickle or a specific export function from the libbrary you use.
295-
It also depends, on whether you intend to make the intermediary steps available to others.
296-
If you do, you might also want to consider storing structure and weights separately or use a format specific for the
297-
type of model you are training, to keep the data independent of the library.
322+
A:
323+
324+
Export the model into a binary format that can be easily read, e.g. pickle
325+
or a specific export function from the library you use.
326+
327+
It also depends on whether you intend to make the intermediary steps
328+
available to others. If you do, you might also want to consider storing
329+
structure and weights separately or use a format specific for the
330+
type of model you are training to keep the data independent of the
331+
library.
332+
333+
B:
334+
335+
You might want to consider a more general format that is supported by
336+
many libraries, e.g. ONNX, or a format that is specifically designed
337+
for the type of model you are training.
298338

299-
B: You might want to consider a more general format, that is supported by many libraries, e.g. ONNX, or a format that is
300-
specifically designed for the type of model you are training.
301-
You might also want to consider additionally storing the model in a way that is easily readable by humans, to make it easier for others
302-
to understand the model.
339+
You might also want to consider additionally storing the model in a way
340+
that is easily readable by humans, to make it easier for others to
341+
understand the model.
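Storing structure and weights separately, as suggested for scenario A, might look like the sketch below. The network description, the key names and the random weights are all hypothetical stand-ins for a trained model, and JSON plus NumPy's ``.npz`` archive is only one possible combination, not a recommendation from the lesson.

```python
import io
import json

import numpy as np

# Hypothetical model: a small structure description and random weights.
structure = {"layers": [4, 8, 2], "activation": "relu"}
rng = np.random.default_rng(seed=0)
weights = {
    "w0": rng.normal(size=(4, 8)),
    "w1": rng.normal(size=(8, 2)),
}

# The structure goes into a human-readable text format.
structure_text = json.dumps(structure, indent=2)

# The weights go into a binary archive (in memory here, normally a file).
weights_buffer = io.BytesIO()
np.savez(weights_buffer, **weights)

# Loading both parts again does not depend on the training library.
restored_structure = json.loads(structure_text)
weights_buffer.seek(0)
with np.load(weights_buffer) as archive:
    restored_weights = {name: archive[name] for name in archive.files}

print(restored_structure, sorted(restored_weights))
```

The text part stays readable for humans, while the weights keep their full precision.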
303342

304343

305344
Efficient use of untidy data
306345
----------------------------
307346

308-
Many data analysis tools (like Pandas) require tidy data, but some data is not in a suitable format.
309-
What we have seen often in the past is people then not using the powerful tools, but write comple scripts that
310-
extract individual pieces from the data each time they need to do a calculation.
347+
Many data analysis tools (like Pandas) require tidy data, but some data is not
348+
in a suitable format. What we have seen often in the past is people then not
349+
using the powerful tools, but writing complex scripts that extract individual pieces
350+
from the data each time they need to do a calculation.
311351

312352
Example of "questionable pipeline":
313353
length_array = []
@@ -330,13 +370,22 @@ Things to remember
330370
------------------
331371

332372
1. **There is no file format that is good for every use case.**
333-
2. Usually, your research question determines which libraries you want to use to solve it.
334-
Similarly, the data format you have determines file format you want to use.
335-
3. However, if you're using a previously existing framework or tools or you work in a specific field, you should prioritize using the formats that are used in said framework/tools/field.
336-
4. When you're starting your project, it's a good idea to take your initial data, clean it, and store the results in a good binary format that works as a starting point for your future analysis.
337-
If you've written the cleaning procedure as a script, you can always reproduce it.
338-
5. Throughout your work, you should use code to turn important data to human-readable format (e.g. plots, averages, :meth:`pandas.DataFrame.head`), not to keep your full data in a human-readable format.
339-
6. Once you've finished, you should store the data in a format that can be easily shared to other people.
373+
2. Usually, your research question determines which libraries you want to use
374+
to solve it. Similarly, the data format you have determines the file format you
375+
want to use.
376+
3. However, if you're using a previously existing framework or tools or you
377+
work in a specific field, you should prioritize using the formats that are
378+
used in said framework/tools/field.
379+
4. When you're starting your project, it's a good idea to take your initial
380+
data, clean it, and store the results in a good binary format that works as
381+
a starting point for your future analysis. If you've written the cleaning
382+
procedure as a script, you can always reproduce it.
383+
5. Throughout your work, you should use code to turn important data into
384+
a human-readable format (e.g. plots, averages,
385+
:meth:`pandas.DataFrame.head`), not to keep your full data in a
386+
human-readable format.
387+
6. Once you've finished, you should store the data in a format that can be
388+
easily shared with other people.
340389

341390

342391
See also
