Usually, you'll want to consider the following things when choosing a file format:

1. Is the file format good for my data structure (is it fast/space efficient/easy to use)?
2. Do others or leading authorities in my field recommend a certain format?
3. Do I need a human-readable format or is it enough to work on it using code?
4. Do I want to archive / share the data or do I just want to store it while I'm working?

Pandas supports `many file formats <https://pandas.pydata.org/docs/user_guide/io.html>`__ for tidy data and NumPy supports `some file formats <https://numpy.org/doc/stable/reference/routines.io.html>`__ for array data.
However, there are many other file formats that can be used through other libraries.

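As a rough illustration of the two libraries' built-in I/O, the sketch below writes a small tidy table with pandas and a small array with NumPy, then reads both back. The data and the file names ``tidy.csv`` and ``array.npy`` are made up for this example.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

# Hypothetical example data: a small tidy table and a small array.
tidy = pd.DataFrame({"name": ["a", "b", "c"], "value": [1.5, 2.5, 3.5]})
array = np.arange(6, dtype=np.float64).reshape(2, 3)

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "tidy.csv"
    npy_path = Path(tmp) / "array.npy"

    # pandas: write the tidy data as text (CSV) and read it back.
    tidy.to_csv(csv_path, index=False)
    tidy_loaded = pd.read_csv(csv_path)

    # NumPy: write the array in the binary npy format and read it back.
    np.save(npy_path, array)
    array_loaded = np.load(npy_path)

print(tidy_loaded)
print(array_loaded)
```
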
The table below describes some data formats:

.. list-table::
   :header-rows: 1

   * - | Name:
     - | Human
       | readable:
     - | Space
       | efficiency:
     - | Arbitrary
       | data:
     - | Tidy
       | data:
     - | Array
       | data:
     - | Long term
       | storage/sharing:

   * - :ref:`Pickle <pickle>`
     - ❌
     - 🟨
     - ✅
     - 🟨
     - 🟨
     - ❌

   * - :ref:`CSV <csv>`
     - ✅
     - ❌
     - ❌
     - ✅
     - 🟨
     - ✅

   * - :ref:`Feather <feather>`
     - ❌
     - ✅
     - ❌
     - ✅
     - ❌
     - ❌

   * - :ref:`Parquet <parquet>`
     - ❌
     - ✅
     - 🟨
     - ✅
     - 🟨
     - ✅

   * - :ref:`npy <npy>`
     - ❌
     - 🟨
     - ❌
     - ❌
     - ✅
     - ❌

   * - :ref:`HDF5 <hdf5>`
     - ❌
     - ✅
     - ❌
     - ❌
     - ✅
     - ✅

   * - :ref:`NetCDF4 <netcdf4>`
     - ❌
     - ✅
     - ❌
     - ❌
     - ✅
     - ✅

   * - :ref:`JSON <json>`
     - ✅
     - ❌
     - 🟨
     - ❌
     - ❌
     - ✅

   * - :ref:`Excel <excel>`
     - ❌
     - ❌
     - ❌
     - 🟨
     - ❌
     - ✅

   * - :ref:`Graph formats <graph>`
     - 🟨
     - 🟨
     - ❌
     - ❌
     - ❌
     - 🟨

.. important::

   - ✅ : Good
   - 🟨 : OK / depends on the case
   - ❌ : Bad

In-depth analysis of some selected file formats
===============================================

Here is a selection of file formats that are commonly used in data science. They are somewhat ordered by their intended use.

Storing arbitrary Python objects
--------------------------------
@@ -548,8 +349,6 @@ You can create a HDF5 file with :external+pandas:ref:`to_hdf- and read_parquet-f
Binary files come with various benefits compared to text files.

1. They can represent floating-point numbers with full precision.
2. Storing data in binary format can potentially save lots of space.
   This is because you do not need to write numbers as characters.
   Additionally, some file formats support compression of the data.
3. Data loading from binary files is usually much faster than loading from text files.
   This is because the type of data in each column is known, so memory can be allocated before the data is loaded.
4. You can often store multiple datasets and metadata in the same file.
5. Many binary formats allow for partial loading of the data.
   This makes it possible to work with datasets that are larger than your computer's memory.

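Some of the points above can be checked with a small NumPy experiment. The sketch below is only an illustration with made-up random data: it writes the same values as short text, as full-precision text, and as a binary npy file, then compares precision and file size and reads a slice through a memory map.

```python
import os
import tempfile

import numpy as np

# Made-up data: 10 000 random float64 values in [0, 1).
values = np.random.default_rng(seed=0).random(10_000)

with tempfile.TemporaryDirectory() as tmp:
    short_txt = os.path.join(tmp, "short.txt")
    full_txt = os.path.join(tmp, "full.txt")
    npy_path = os.path.join(tmp, "values.npy")

    # Text with 6 significant digits: compact, but digits are dropped.
    np.savetxt(short_txt, values, fmt="%.6g")
    # Text with NumPy's default format ("%.18e"): many characters per number.
    np.savetxt(full_txt, values)
    # Binary: every float64 is stored as its 8 raw bytes.
    np.save(npy_path, values)

    from_short = np.loadtxt(short_txt)
    from_npy = np.load(npy_path)

    full_txt_size = os.path.getsize(full_txt)
    npy_size = os.path.getsize(npy_path)

    # Partial loading: memory-map the file and copy only the first ten values.
    mapped = np.load(npy_path, mmap_mode="r")
    first_ten = np.array(mapped[:10])
    del mapped  # release the memory map before the directory is removed

print("short text round trip is exact:", np.array_equal(from_short, values))
print("binary round trip is exact:", np.array_equal(from_npy, values))
print("full-precision text size:", full_txt_size, "bytes")
print("binary size:", npy_size, "bytes")
```

The ``mmap_mode="r"`` argument lets NumPy read only the parts of the file that are actually indexed, which is one way to work with arrays that do not fit in memory.
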
**Performance with tidy dataset:**

For the tidy ``dataset`` we had, we can test the performance of the different file formats:

.. csv-table::
   :file: format_comparison_tidy.csv
   :header-rows: 1

The relatively poor performance of HDF5-based formats in this case is due to the data being mostly one-dimensional columns full of character strings.

**Performance with data array:**

For the array-shaped ``data_array`` we had, we can test the performance of the different file formats:

.. csv-table::
   :file: format_comparison_array.csv
   :header-rows: 1

For this kind of data, HDF5-based formats perform much better.

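The benchmark behind these tables is not shown here. A minimal sketch of how such a comparison could be timed is given below; the helper ``time_roundtrip`` and the stand-in array are assumptions made for this example, and only two formats that NumPy supports out of the box are compared.

```python
import os
import tempfile
import time

import numpy as np


def time_roundtrip(write, read, path):
    """Return (write seconds, read seconds, file size in bytes) for one format."""
    start = time.perf_counter()
    write(path)
    write_s = time.perf_counter() - start

    start = time.perf_counter()
    read(path)
    read_s = time.perf_counter() - start

    return write_s, read_s, os.path.getsize(path)


# Stand-in for the lesson's ``data_array``; the real array is not shown here.
data_array = np.random.default_rng(seed=0).random((200, 200))

results = {}
with tempfile.TemporaryDirectory() as tmp:
    results["csv (text)"] = time_roundtrip(
        lambda p: np.savetxt(p, data_array, delimiter=","),
        lambda p: np.loadtxt(p, delimiter=","),
        os.path.join(tmp, "data_array.csv"),
    )
    results["npy (binary)"] = time_roundtrip(
        lambda p: np.save(p, data_array),
        lambda p: np.load(p),
        os.path.join(tmp, "data_array.npy"),
    )

for name, (write_s, read_s, size) in results.items():
    print(f"{name:>12}: write {write_s:.4f} s, read {read_s:.4f} s, {size} bytes")
```

Timings from a single run like this vary between machines and runs, so repeating the measurement several times gives a more reliable picture.
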
Things to remember
------------------

1. **There is no file format that is good for every use case.**
2. Usually, your research question determines which libraries you want to use to solve it.
   Similarly, the data format you have determines the file format you want to use.
3. However, if you're using a previously existing framework or tools or you work in a specific field, you should prioritize using the formats that are used in said framework/tools/field.
4. When you're starting your project, it's a good idea to take your initial data, clean it, and store the results in a good binary format that works as a starting point for your future analysis.
   If you've written the cleaning procedure as a script, you can always reproduce it.
5. Throughout your work, you should use code to turn important data into a human-readable form (e.g. plots, averages, :meth:`pandas.DataFrame.head`) rather than keeping your full data in a human-readable format.
6. Once you've finished, you should store the data in a format that can be easily shared with other people.
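
Points 4 and 5 can be sketched as a short script. The raw data, the column names and the file name ``cleaned.pkl`` below are hypothetical. Pickle is used only because it needs no extra dependencies; it is a working format, not one for long-term storage or sharing.

```python
import io
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical raw input; in practice this would be a file you were given.
raw_csv = io.StringIO(
    "sample,temperature\n"
    "a,20.5\n"
    "b,\n"
    "c,22.0\n"
)

# Cleaning step, kept as code so that it can be rerun at any time.
raw = pd.read_csv(raw_csv)
cleaned = raw.dropna(subset=["temperature"]).reset_index(drop=True)

with tempfile.TemporaryDirectory() as tmp:
    # Binary working copy. Pickle is an assumption made to avoid extra
    # dependencies; it is not meant for long-term storage or sharing.
    working_copy = Path(tmp) / "cleaned.pkl"
    cleaned.to_pickle(working_copy)
    reloaded = pd.read_pickle(working_copy)

# Human-readable views are produced on demand instead of being stored.
print(reloaded.head())
print(reloaded["temperature"].mean())
```
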