Merge pull request #310 from AaltoSciComp/work-with-data

eglerean · web-flow · commit b030116c8d40 · 2024-11-06T00:04:23.000+02:00
Work with data: Modification to data formats page
diff --git a/content/data-formats.rst b/content/data-formats.rst
@@ -1,206 +1,7 @@
-Data formats with Pandas and Numpy
-==================================
-
-.. questions::
-
-   - How do you store your data right now?
-   - Are you doing data cleaning / preprocessing every time you load the data?
-
-.. objectives::
-
-   - Learn the distinguishing characteristics of different data formats.
-   - Learn how you can read and write data in a variety of formats.
-
-What is a data format?
-----------------------
-
-Data format can mean two different things
-
-1. `data structure <https://en.wikipedia.org/wiki/Data_structure>`__ or how you're storing the data in memory while you're working on it;
-2. `file format <https://en.wikipedia.org/wiki/File_format>`__ or the way you're storing the data in the disk.
-
-Let's consider this randomly generated DataFrame with various columns::
-
-    import pandas as pd
-    import numpy as np
-
-    n_rows = 100000
-
-    dataset = pd.DataFrame(
-        data={
-            'string': np.random.choice(('apple', 'banana', 'carrot'), size=n_rows),
-            'timestamp': pd.date_range("20130101", periods=n_rows, freq="s"),
-            'integer': np.random.choice(range(0,10), size=n_rows),
-            'float': np.random.uniform(size=n_rows),
-        },
-    )
-
-    dataset.info()
-
-This DataFrame is structured in the tidy data format.
-In tidy data format we have multiple columns of data that are collected in a Pandas DataFrame.
-
-..  image:: img/pandas/tidy_data.png
-
-Let's consider another example::
-
-    n = 1000
-
-    data_array = np.random.uniform(size=(n,n))
-    np.info(data_array)
-
-
-Here we have a different data structure: we have a two-dimensional array of numbers.
-This is different to a Pandas DataFrame as data is stored as one contiguous block instead of individual columns.
-This also means that the whole array must have one data type.
-
-
-..  figure:: https://github.com/elegant-scipy/elegant-scipy/raw/master/figures/NumPy_ndarrays_v2.png
-
-    Source: `Elegant Scipy <https://github.com/elegant-scipy/elegant-scipy>`__
-
-Now the question is: **Can the data be saved to the disk without changing the data format?**
-
-For this we need a **file format** that can easily store our **data structure**.
-
-.. admonition:: Data type vs. data structure vs. file format
-   :class: dropdown
-
-   - **Data type:** Type of a single piece of data (integer, string, float, ...).
-   - **Data structure:** How the data is organized in memory (individual columns, 2D-array, nested dictionaries, ...).
-   - **File format:** How the data is organized when it is saved to the disk (columns of strings, block of binary data, ...).
-
-   For example, a black and white image stored as a .png-file (**file format**)
-   might be stored in memory as an NxM array (**data structure**) of integers (**data type**).
-
-What to look for in a file format?
-----------------------------------
-
-When deciding which file format you should use for your program, you should remember the following:
-
-**There is no file format that is good for every use case.**
-
-Instead, there are various standard file formats for various use cases:
-
-.. figure:: https://imgs.xkcd.com/comics/standards.png
-
-   Source: `xkcd #927 <https://xkcd.com/927/>`__.
-
-Usually, you'll want to consider the following things when choosing a file format:
-
-1. Is the file format good for my data structure (is it fast/space efficient/easy to use)?
-2. Is everybody else / leading authorities in my field recommending a certain format?
-3. Do I need a human-readable format or is it enough to work on it using code?
-4. Do I want to archive / share the data or do I just want to store it while I'm working?
-
-Pandas supports `many file formats <https://pandas.pydata.org/docs/user_guide/io.html>`__ for tidy data and Numpy supports `some file formats <https://numpy.org/doc/stable/reference/routines.io.html>`__ for array data.
-However, there are many other file formats that can be used through other libraries.
-
-Table below describes some data formats:
-
-.. list-table::
-   :header-rows: 1
-
-   * - | Name:
-     - | Human
-       | readable:
-     - | Space
-       | efficiency:
-     - | Arbitrary
-       | data:
-     - | Tidy
-       | data:
-     - | Array
-       | data:
-     - | Long term
-       | storage/sharing:
-
-   * - :ref:`Pickle <pickle>`
-     - ❌
-     - 🟨
-     - ✅
-     - 🟨
-     - 🟨
-     - ❌
-
-   * - :ref:`CSV <csv>`
-     - ✅
-     - ❌
-     - ❌
-     - ✅
-     - 🟨
-     - ✅
-
-   * - :ref:`Feather <feather>`
-     - ❌
-     - ✅
-     - ❌
-     - ✅
-     - ❌
-     - ❌
-
-   * - :ref:`Parquet <parquet>`
-     - ❌
-     - ✅
-     - 🟨
-     - ✅
-     - 🟨
-     - ✅
-
-   * - :ref:`npy <npy>`
-     - ❌
-     - 🟨
-     - ❌
-     - ❌
-     - ✅
-     - ❌
-
-   * - :ref:`HDF5 <hdf5>`
-     - ❌
-     - ✅
-     - ❌
-     - ❌
-     - ✅
-     - ✅
-
-   * - :ref:`NetCDF4 <netcdf4>`
-     - ❌
-     - ✅
-     - ❌
-     - ❌
-     - ✅
-     - ✅
-
-   * - :ref:`JSON <json>`
-     - ✅
-     - ❌
-     - 🟨
-     - ❌
-     - ❌
-     - ✅
-
-   * - :ref:`Excel <excel>`
-     - ❌
-     - ❌
-     - ❌
-     - 🟨
-     - ❌
-     - ✅
-
-   * - :ref:`Graph formats <graph>`
-     - 🟨
-     - 🟨
-     - ❌
-     - ❌
-     - ❌
-     - 🟨
-
-.. important::
-
-    - ✅ : Good
-    - 🟨 : Ok / depends on a case
-    - ❌ : Bad
+In-depth analysis of some selected file formats
+===============================================
 
+Here is a selection of file formats commonly used in data science, ordered roughly by intended use.
 
 Storing arbitrary Python objects
 --------------------------------
@@ -548,8 +349,6 @@ You can create a HDF5 file with :external+pandas:ref:`to_hdf- and read_parquet-f
     dataset.to_hdf('dataset.h5', key='dataset', mode='w')
     dataset_hdf5 = pd.read_hdf('dataset.h5')
 
-PyTables comes installed with the default Anaconda installation.
-
 For writing data that is not a table, you can use the excellent `h5py-package <https://docs.h5py.org/en/stable/>`__::
 
     import h5py
@@ -572,8 +371,6 @@ For writing data that is not a table, you can use the excellent `h5py-package <h
     # Close file
     h5_file.close()
 
-h5py comes with Anaconda as well.
-
 
 .. _netcdf4:
 
@@ -750,69 +547,3 @@ One can use functions in libraries such as
 `igraph <https://igraph.readthedocs.io/en/stable/tutorial.html#igraph-and-the-outside-world>`__
 to read and write graphs.
 
-
-
-Benefits of binary file formats
--------------------------------
-
-Binary files come with various benefits compared to text files.
-
-1. They can represent floating point numbers with full precision.
-2. Storing data in binary format can potentially save lots of space.
-   This is because you do not need to write numbers as characters.
-   Additionally some file formats support compression of the data.
-3. Data loading from binary files is usually much faster than loading from text files.
-   This is because memory can be allocated for the data before data is loaded as the type of data in columns is known.
-4. You can often store multiple datasets and metadata to the same file.
-5. Many binary formats allow for partial loading of the data.
-   This makes it possible to work with datasets that are larger than your computer's memory.
-
-**Performance with tidy dataset:**
-
-For the tidy ``dataset`` we had, we can test the performance of the different file formats:
-
-.. csv-table::
-   :file: format_comparison_tidy.csv
-   :header-rows: 1
-
-The relatively poor performance of HDF5-based formats in this case is due to the data being mostly one dimensional columns full of character strings.
-
-
-**Performance with data array:**
-
-For the array-shaped ``data_array`` we had, we can test the performance of the different file formats:
-
-
-.. csv-table::
-   :file: format_comparison_array.csv
-   :header-rows: 1
-
-For this kind of a data, HDF5-based formats perform much better.
-
-
-Things to remember
-------------------
-
-1. **There is no file format that is good for every use case.**
-2. Usually, your research question determines which libraries you want to use to solve it.
-   Similarly, the data format you have determines file format you want to use.
-3. However, if you're using a previously existing framework or tools or you work in a specific field, you should prioritize using the formats that are used in said framework/tools/field.
-4. When you're starting your project, it's a good idea to take your initial data, clean it, and store the results in a good binary format that works as a starting point for your future analysis.
-   If you've written the cleaning procedure as a script, you can always reproduce it.
-5. Throughout your work, you should use code to turn important data to human-readable format (e.g. plots, averages, :meth:`pandas.DataFrame.head`), not to keep your full data in a human-readable format.
-6. Once you've finished, you should store the data in a format that can be easily shared to other people.
-
-
-See also
---------
-
-- `Pandas' IO tools <https://pandas.pydata.org/docs/user_guide/io.html>`__
-- `Tidy data comparison notebook <https://github.com/AaltoSciComp/python-for-scicomp/tree/master/extras/data-formats-comparison-tidy.ipynb>`__
-- `Array data comparison notebook <https://github.com/AaltoSciComp/python-for-scicomp/tree/master/extras/data-formats-comparison-array.ipynb>`__
-
-
-.. keypoints::
-
-   - Pandas can read and write a variety of data formats.
-   - There are many good, standard formats, and you don't need to create your own.
-   - There are plenty of other libraries dedicated to various formats.
diff --git a/content/index.rst b/content/index.rst
@@ -76,7 +76,7 @@ to learn yourself as you need to.
    30 min   ; :doc:`xarray`
    60 min   ; :doc:`plotting-matplotlib`
    60 min   ; :doc:`plotting-vega-altair`
-   30 min   ; :doc:`data-formats`
+   30 min   ; :doc:`work-with-data`
    60 min   ; :doc:`scripts`
    40 min   ; :doc:`profiling`
    20 min   ; :doc:`productivity`
@@ -102,7 +102,7 @@ to learn yourself as you need to.
    xarray
    plotting-matplotlib
    plotting-vega-altair
-   data-formats
+   work-with-data
    scripts
    profiling
    productivity
@@ -122,6 +122,7 @@ to learn yourself as you need to.
    quick-reference
    exercises
    guide
+   data-formats
 
 
 .. _learner-personas:
diff --git a/content/work-with-data.rst b/content/work-with-data.rst