Commit 8dcf47c

content/work-with-data: Added example on tidying data
1 parent 571e74c commit 8dcf47c


content/work-with-data.rst

Lines changed: 78 additions & 15 deletions
@@ -247,9 +247,8 @@ Good things
 - Can represent floating point numbers with full precision.
 - Can potentially save lots of space, especially, when storing numbers.
 - Data reading and writing is usually much faster than loading from text files,
-  since the format contains information.
-  about the data structure, and thus memory allocation can be done more
-  efficiently.
+  since the format contains information about the data structure, and thus
+  memory allocation can be done more efficiently.
 - More explicit specification for storing multiple data sets and metadata in
   the same file.
 - Many binary formats allow for partial loading of the data.
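As a concrete illustration of the last point (partial loading), here is a minimal sketch assuming the h5py library and a hypothetical HDF5 file ``data.h5`` that contains a dataset named ``measurements``:

.. code-block:: python

   # Minimal sketch: read only part of a dataset from a hypothetical HDF5 file.
   # Assumes h5py is installed and data.h5 contains a dataset "measurements".
   import h5py

   with h5py.File('data.h5', 'r') as f:
       first_rows = f['measurements'][:100]   # only these rows are read from disk

   print(first_rows.shape)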
@@ -341,29 +340,93 @@ Exercise
 understand the model.
 
 
-Efficient use of untidy data
-----------------------------
+Case study: Converting untidy data to tidy data
+-----------------------------------------------
 
-Many data analysis tools (like Pandas) require tidy data, but some data is not
-in a suitable format. What we have seen often in the past is people then not
-using the powerful tools, but write comple scripts that extract individual pieces
-from the data each time they need to do a calculation.
+Many data analysis tools (like Pandas) are designed to work with tidy data,
+but some data is not in a suitable format. What we have seen often in the
+past is people then not using the powerful tools, but writing complicated
+scripts that extract individual pieces from the data each time they need
+to do a calculation.
 
-Example of "questionable pipeline":
-length_array = []
+As an example, let's see how we can use country data from an example REST API
+endpoint (for more information on how to work with web APIs, see
+:doc:`this page <web-apis>`). Let's get the data with the following piece
+of code:
 
-for entry in data:
-    length_array.append(len(entry['length']))
-...
+.. code-block:: python
 
+   import json
+   import requests
 
+   url = 'https://api.sampleapis.com/countries/countries'
 
+   response = requests.get(url)
 
-Example of pipeline with initial conversion to pandas e.g. via json_normalize
+   countries_json = json.loads(response.content)
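Before tidying anything, it can help to look at one raw record to see what the untidy structure actually looks like. A quick, illustrative way to do this (the exact fields depend on what the API returns):

.. code-block:: python

   # Pretty-print the first record to see which fields are available
   # (the field names depend on the API response).
   print(json.dumps(countries_json[0], indent=2))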
 
+Let's try to find the country with the largest population.
 
+An example of a "questionable" way of solving this problem would be something
+like the following piece of code that is written in pure Python:
 
+.. code-block:: python
 
+   max_population = 0
+   top_population_country = ''
+
+   for country in countries_json:
+       if country.get('population', 0) > max_population:
+           top_population_country = country['name']
+           max_population = country.get('population', 0)
+
+   print(top_population_country)
+
+This is a very natural way of writing a solution for the problem, but it has
+major caveats:
+
+1. We throw all of the other data away, so we cannot answer any
+   follow-up questions.
+2. For bigger data, this would be very slow and inefficient.
+3. We have to write lots of code to do a simple thing.
+
+Another typical solution would be something like the following code,
+which picks some of the data and creates a Pandas dataframe out of it:
+
+.. code-block:: python
+
+   import pandas as pd
+
+   countries_list = []
+
+   for country in countries_json:
+       countries_list.append([country['name'], country.get('population', 0)])
+
+   countries_df = pd.DataFrame(countries_list, columns=['name', 'population'])
+
+   print(countries_df.nlargest(1, 'population')['name'].values[0])
+
+This solution has many of the same problems as the previous one, but now we can
+use Pandas to do follow-up analysis.
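As an illustration of what such follow-up analysis can look like, here are a couple of one-liners on the dataframe built above (an illustrative sketch, not part of the lesson code):

.. code-block:: python

   # Five most populous countries from the two-column dataframe
   print(countries_df.nlargest(5, 'population'))

   # Summary statistics for the population column
   print(countries_df['population'].describe())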
+
+A better solution would be to use Pandas'
+`pandas.DataFrame.from_dict <https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.from_dict.html>`__
+or `pandas.json_normalize <https://pandas.pydata.org/docs/reference/api/pandas.json_normalize.html>`__
+to read the full data in:
+
+.. code-block:: python
+
+   countries_df = pd.DataFrame.from_dict(countries_json)
+   print(countries_df.nlargest(1, 'population')['name'].values[0])
+
+   countries_df = pd.json_normalize(countries_json)
+   print(countries_df.nlargest(1, 'population')['name'].values[0])
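Because ``DataFrame.from_dict`` and ``json_normalize`` keep every field of the records, follow-up questions no longer require going back to the raw JSON. A small sketch (the exact column names depend on what the API returns):

.. code-block:: python

   # All fields of the JSON records are now columns in the dataframe
   print(countries_df.columns.tolist())

   # Example follow-up question: how many entries have no population value?
   print(countries_df['population'].isna().sum())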
+
+.. admonition:: Key points
+
+   - Convert your data to a format where it is easy to do analysis on it.
+   - Check whether the tools you are using have an existing feature that can
+     help you read the data in.
 
 
 Things to remember
