Commit b030116

Merge pull request #310 from AaltoSciComp/work-with-data
Work with data: Modification to data formats page
2 parents 06bd002 + f7c926e commit b030116

3 files changed

+472
-274
lines changed

content/data-formats.rst

Lines changed: 3 additions & 272 deletions
@@ -1,206 +1,7 @@
-Data formats with Pandas and Numpy
-==================================
-
-.. questions::
-
-- How do you store your data right now?
-- Are you doing data cleaning / preprocessing every time you load the data?
-
-.. objectives::
-
-- Learn the distinguishing characteristics of different data formats.
-- Learn how you can read and write data in a variety of formats.
-
-What is a data format?
-----------------------
-
-Data format can mean two different things
-
-1. `data structure <https://en.wikipedia.org/wiki/Data_structure>`__ or how you're storing the data in memory while you're working on it;
-2. `file format <https://en.wikipedia.org/wiki/File_format>`__ or the way you're storing the data in the disk.
-
-Let's consider this randomly generated DataFrame with various columns::
-
-import pandas as pd
-import numpy as np
-
-n_rows = 100000
-
-dataset = pd.DataFrame(
-data={
-'string': np.random.choice(('apple', 'banana', 'carrot'), size=n_rows),
-'timestamp': pd.date_range("20130101", periods=n_rows, freq="s"),
-'integer': np.random.choice(range(0,10), size=n_rows),
-'float': np.random.uniform(size=n_rows),
-},
-)
-
-dataset.info()
-
-This DataFrame is structured in the tidy data format.
-In tidy data format we have multiple columns of data that are collected in a Pandas DataFrame.
-
-.. image:: img/pandas/tidy_data.png
-
-Let's consider another example::
-
-n = 1000
-
-data_array = np.random.uniform(size=(n,n))
-np.info(data_array)
-
-
-Here we have a different data structure: we have a two-dimensional array of numbers.
-This is different to a Pandas DataFrame as data is stored as one contiguous block instead of individual columns.
-This also means that the whole array must have one data type.
-
-
-.. figure:: https://github.com/elegant-scipy/elegant-scipy/raw/master/figures/NumPy_ndarrays_v2.png
-
-Source: `Elegant Scipy <https://github.com/elegant-scipy/elegant-scipy>`__
-
-Now the question is: **Can the data be saved to the disk without changing the data format?**
-
-For this we need a **file format** that can easily store our **data structure**.
-
-.. admonition:: Data type vs. data structure vs. file format
-:class: dropdown
-
-- **Data type:** Type of a single piece of data (integer, string, float, ...).
-- **Data structure:** How the data is organized in memory (individual columns, 2D-array, nested dictionaries, ...).
-- **File format:** How the data is organized when it is saved to the disk (columns of strings, block of binary data, ...).
-
-For example, a black and white image stored as a .png-file (**file format**)
-might be stored in memory as an NxM array (**data structure**) of integers (**data type**).
-
-What to look for in a file format?
-----------------------------------
-
-When deciding which file format you should use for your program, you should remember the following:
-
-**There is no file format that is good for every use case.**
-
-Instead, there are various standard file formats for various use cases:
-
-.. figure:: https://imgs.xkcd.com/comics/standards.png
-
-Source: `xkcd #927 <https://xkcd.com/927/>`__.
-
-Usually, you'll want to consider the following things when choosing a file format:
-
-1. Is the file format good for my data structure (is it fast/space efficient/easy to use)?
-2. Is everybody else / leading authorities in my field recommending a certain format?
-3. Do I need a human-readable format or is it enough to work on it using code?
-4. Do I want to archive / share the data or do I just want to store it while I'm working?
-
-Pandas supports `many file formats <https://pandas.pydata.org/docs/user_guide/io.html>`__ for tidy data and Numpy supports `some file formats <https://numpy.org/doc/stable/reference/routines.io.html>`__ for array data.
-However, there are many other file formats that can be used through other libraries.
-
-Table below describes some data formats:
-
-.. list-table::
-:header-rows: 1
-
-* - | Name:
-- | Human
-| readable:
-- | Space
-| efficiency:
-- | Arbitrary
-| data:
-- | Tidy
-| data:
-- | Array
-| data:
-- | Long term
-| storage/sharing:
-
-* - :ref:`Pickle <pickle>`
-- ❌
-- 🟨
-- ✅
-- 🟨
-- 🟨
-- ❌
-
-* - :ref:`CSV <csv>`
-- ✅
-- ❌
-- ❌
-- ✅
-- 🟨
-- ✅
-
-* - :ref:`Feather <feather>`
-- ❌
-- ✅
-- ❌
-- ✅
-- ❌
-- ❌
-
-* - :ref:`Parquet <parquet>`
-- ❌
-- ✅
-- 🟨
-- ✅
-- 🟨
-- ✅
-
-* - :ref:`npy <npy>`
-- ❌
-- 🟨
-- ❌
-- ❌
-- ✅
-- ❌
-
-* - :ref:`HDF5 <hdf5>`
-- ❌
-- ✅
-- ❌
-- ❌
-- ✅
-- ✅
-
-* - :ref:`NetCDF4 <netcdf4>`
-- ❌
-- ✅
-- ❌
-- ❌
-- ✅
-- ✅
-
-* - :ref:`JSON <json>`
-- ✅
-- ❌
-- 🟨
-- ❌
-- ❌
-- ✅
-
-* - :ref:`Excel <excel>`
-- ❌
-- ❌
-- ❌
-- 🟨
-- ❌
-- ✅
-
-* - :ref:`Graph formats <graph>`
-- 🟨
-- 🟨
-- ❌
-- ❌
-- ❌
-- 🟨
-
-.. important::
-
-- ✅ : Good
-- 🟨 : Ok / depends on a case
-- ❌ : Bad
+In depth analysis of some selected file formats
+===============================================
 
+Here is a selection of file formats that are commonly used in data science. They are somewhat ordered by their intended use.
 
 Storing arbitrary Python objects
 --------------------------------
@@ -548,8 +349,6 @@ You can create a HDF5 file with :external+pandas:ref:`to_hdf- and read_parquet-f
 dataset.to_hdf('dataset.h5', key='dataset', mode='w')
 dataset_hdf5 = pd.read_hdf('dataset.h5')
 
-PyTables comes installed with the default Anaconda installation.
-
 For writing data that is not a table, you can use the excellent `h5py-package <https://docs.h5py.org/en/stable/>`__::
 
 import h5py
@@ -572,8 +371,6 @@ For writing data that is not a table, you can use the excellent `h5py-package <h
 # Close file
 h5_file.close()
 
-h5py comes with Anaconda as well.
-
 
 .. _netcdf4:
 
@@ -750,69 +547,3 @@ One can use functions in libraries such as
 `igraph <https://igraph.readthedocs.io/en/stable/tutorial.html#igraph-and-the-outside-world>`__
 to read and write graphs.
 
-
-
-Benefits of binary file formats
--------------------------------
-
-Binary files come with various benefits compared to text files.
-
-1. They can represent floating point numbers with full precision.
-2. Storing data in binary format can potentially save lots of space.
-This is because you do not need to write numbers as characters.
-Additionally some file formats support compression of the data.
-3. Data loading from binary files is usually much faster than loading from text files.
-This is because memory can be allocated for the data before data is loaded as the type of data in columns is known.
-4. You can often store multiple datasets and metadata to the same file.
-5. Many binary formats allow for partial loading of the data.
-This makes it possible to work with datasets that are larger than your computer's memory.
-
-**Performance with tidy dataset:**
-
-For the tidy ``dataset`` we had, we can test the performance of the different file formats:
-
-.. csv-table::
-:file: format_comparison_tidy.csv
-:header-rows: 1
-
-The relatively poor performance of HDF5-based formats in this case is due to the data being mostly one dimensional columns full of character strings.
-
-
-**Performance with data array:**
-
-For the array-shaped ``data_array`` we had, we can test the performance of the different file formats:
-
-
-.. csv-table::
-:file: format_comparison_array.csv
-:header-rows: 1
-
-For this kind of a data, HDF5-based formats perform much better.
-
-
-Things to remember
-------------------
-
-1. **There is no file format that is good for every use case.**
-2. Usually, your research question determines which libraries you want to use to solve it.
-Similarly, the data format you have determines file format you want to use.
-3. However, if you're using a previously existing framework or tools or you work in a specific field, you should prioritize using the formats that are used in said framework/tools/field.
-4. When you're starting your project, it's a good idea to take your initial data, clean it, and store the results in a good binary format that works as a starting point for your future analysis.
-If you've written the cleaning procedure as a script, you can always reproduce it.
-5. Throughout your work, you should use code to turn important data to human-readable format (e.g. plots, averages, :meth:`pandas.DataFrame.head`), not to keep your full data in a human-readable format.
-6. Once you've finished, you should store the data in a format that can be easily shared to other people.
-
-
-See also
---------
-
-- `Pandas' IO tools <https://pandas.pydata.org/docs/user_guide/io.html>`__
-- `Tidy data comparison notebook <https://github.com/AaltoSciComp/python-for-scicomp/tree/master/extras/data-formats-comparison-tidy.ipynb>`__
-- `Array data comparison notebook <https://github.com/AaltoSciComp/python-for-scicomp/tree/master/extras/data-formats-comparison-array.ipynb>`__
-
-
-.. keypoints::
-
-- Pandas can read and write a variety of data formats.
-- There are many good, standard formats, and you don't need to create your own.
-- There are plenty of other libraries dedicated to various formats.

content/index.rst

Lines changed: 3 additions & 2 deletions
@@ -76,7 +76,7 @@ to learn yourself as you need to.
 30 min ; :doc:`xarray`
 60 min ; :doc:`plotting-matplotlib`
 60 min ; :doc:`plotting-vega-altair`
-30 min ; :doc:`data-formats`
+30 min ; :doc:`work-with-data`
 60 min ; :doc:`scripts`
 40 min ; :doc:`profiling`
 20 min ; :doc:`productivity`
@@ -102,7 +102,7 @@ to learn yourself as you need to.
 xarray
 plotting-matplotlib
 plotting-vega-altair
-data-formats
+work-with-data
 scripts
 profiling
 productivity
@@ -122,6 +122,7 @@ to learn yourself as you need to.
 quick-reference
 exercises
 guide
+data-formats
 
 
 .. _learner-personas:
