@@ -247,9 +247,8 @@ Good things
 - Can represent floating point numbers with full precision.
 - Can potentially save lots of space, especially, when storing numbers.
 - Data reading and writing is usually much faster than loading from text files,
-  since the format contains information.
-  about the data structure, and thus memory allocation can be done more
-  efficiently.
+  since the format contains information about the data structure, and thus
+  memory allocation can be done more efficiently.
 - More explicit specification for storing multiple data sets and metadata in
   the same file.
 - Many binary formats allow for partial loading of the data.
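+
+To illustrate the last point about partial loading, here is a minimal sketch
+(the array and file name are only examples) using NumPy's binary ``.npy``
+format, where memory-mapping lets us read just the slice we need instead of
+the whole file:
+
+.. code-block:: python
+
+   import numpy as np
+
+   # Save an array in a binary format, then memory-map it on load so that
+   # only the parts we actually index are read from disk.
+   data = np.random.rand(1000, 1000)
+   np.save('example.npy', data)
+
+   mapped = np.load('example.npy', mmap_mode='r')
+   subset = mapped[:10, :10]  # only this block is pulled into memory
+   print(subset.shape)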
@@ -341,29 +340,93 @@ Exercise
 understand the model.
 
 
-Efficient use of untidy data
-----------------------------
+Case study: Converting untidy data to tidy data
+-----------------------------------------------
 
-Many data analysis tools (like Pandas) require tidy data, but some data is not
-in a suitable format. What we have seen often in the past is people then not
-using the powerful tools, but write comple scripts that extract individual pieces
-from the data each time they need to do a calculation.
+Many data analysis tools (like Pandas) are designed to work with tidy data,
+but some data is not in a suitable format. What we have often seen in the
+past is that people then do not use the powerful tools, but instead write
+complicated scripts that extract individual pieces from the data each time
+they need to do a calculation.
 
-Example of "questionable pipeline":
-  length_array = []
+As an example, let's see how we can use country data from an example REST API
+endpoint (for more information on how to work with web APIs, see
+:doc:`this page <web-apis>`). Let's get the data with the following piece
+of code:
 
-  for entry in data:
-      length_array.append(len(entry['length']))
-  ...
+.. code-block:: python
 
+   import json
+   import requests
 
+   url = 'https://api.sampleapis.com/countries/countries'
 
+   response = requests.get(url)
 
-Example of pipeline with initial conversion to pandas e.g. via json_normalize
+   countries_json = json.loads(response.content)
 
+Let's try to find the country with the largest population.
 
+An example of a "questionable" way of solving this problem would be something
+like the following piece of code, written in pure Python:
 
+.. code-block:: python
 
+   max_population = 0
+   top_population_country = ''
+
+   # Loop over the raw JSON records and keep track of the running maximum
+   for country in countries_json:
+       if country.get('population', 0) > max_population:
+           top_population_country = country['name']
+           max_population = country.get('population', 0)
+
+   print(top_population_country)
+
+This is a very natural way of writing a solution for the problem, but it has
+major caveats:
+
+1. We throw all of the other data out, so we cannot answer any
+   follow-up questions.
+2. For bigger data, this would be very slow and inefficient.
+3. We have to write lots of code to do a simple thing.
+
+Another typical solution would be something like the following code,
+which picks some of the data and creates a Pandas dataframe out of it:
+
+.. code-block:: python
+
+   import pandas as pd
+
+   countries_list = []
+
+   # Keep only the name and population of each country
+   for country in countries_json:
+       countries_list.append([country['name'], country.get('population', 0)])
+
+   countries_df = pd.DataFrame(countries_list, columns=['name', 'population'])
+
+   print(countries_df.nlargest(1, 'population')['name'].values[0])
+
+This solution has many of the same problems as the previous one, but now we
+can use Pandas to do follow-up analysis.
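+
+For instance, once the dataframe exists, a follow-up question such as finding
+the five most populous countries becomes a one-liner (a minimal sketch reusing
+the ``countries_df`` built above):
+
+.. code-block:: python
+
+   # Five largest countries by population, from the dataframe built above
+   print(countries_df.nlargest(5, 'population'))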
+
+A better solution would be to use Pandas'
+`pandas.DataFrame.from_dict <https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.from_dict.html>`__
+or `pandas.json_normalize <https://pandas.pydata.org/docs/reference/api/pandas.json_normalize.html>`__
+to read the full data in:
+
+.. code-block:: python
+
+   countries_df = pd.DataFrame.from_dict(countries_json)
+   print(countries_df.nlargest(1, 'population')['name'].values[0])
+
+   countries_df = pd.json_normalize(countries_json)
+   print(countries_df.nlargest(1, 'population')['name'].values[0])
+
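+In practice the difference between the two shows up with nested data:
+``from_dict`` keeps a nested dictionary as a single object column, while
+``json_normalize`` flattens it into separate dot-separated columns. A minimal
+sketch with a made-up nested record:
+
+.. code-block:: python
+
+   # Hypothetical record with a nested dictionary, only to show the difference
+   records = [{'name': 'A', 'location': {'lat': 60.2, 'lon': 24.9}}]
+
+   print(pd.DataFrame.from_dict(records).columns.tolist())
+   # ['name', 'location']
+   print(pd.json_normalize(records).columns.tolist())
+   # ['name', 'location.lat', 'location.lon']
+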
+.. admonition:: Key points
+
+   - Convert your data to a format where it is easy to do analysis on it.
+   - Check whether the tools you're using already have a feature that can
+     help you read the data in.
 
 
 Things to remember