diff --git a/index.ipynb b/index.ipynb
index 5344767..ba954f8 100644
--- a/index.ipynb
+++ b/index.ipynb
@@ -1 +1 @@
-{"cells": [{"cell_type": "markdown", "metadata": {}, "source": ["# Data Serialization Formats - Cumulative Lab\n", "\n", "## Introduction\n", "\n", "Now that you have learned about CSV and JSON file formats individually, it's time to bring them together with a cumulative lab! Even as a junior data scientist, you can often produce novel, interesting analyses by combining multiple datasets that haven't been combined before.\n", "\n", "## Objectives\n", "\n", "You will be able to:\n", "\n", "* Practice reading serialized JSON and CSV data from files into Python objects\n", "* Practice extracting information from nested data structures\n", "* Practice cleaning data (filtering, normalizing locations, converting types)\n", "* Combine data from multiple sources into a single data structure\n", "* Interpret descriptive statistics and data visualizations to present your findings\n", "\n", "## Your Task: Analyze the Relationship between Population and World Cup Performance\n", "\n", "\n", "\n", "Photo by Fauzan Saari on Unsplash "]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Business Understanding\n", "\n", "#### What is the relationship between the population of a country and their performance in the 2018 FIFA World Cup?\n", "\n", "Intuitively, we might assume that countries with larger populations would have better performance in international sports competitions. While this has been demonstrated to be [true for the Olympics](https://www.researchgate.net/publication/308513557_Medals_at_the_Olympic_Games_The_Relationship_Between_Won_Medals_Gross_Domestic_Product_Population_Size_and_the_Weight_of_Sportive_Practice), the results for the FIFA World Cup are more mixed:\n", "\n", "
[Figure omitted. Image license: CC BY-SA 3.0]
\n", "\n", "In this analysis, we are going to look specifically at the sample of World Cup games in 2018 and the corresponding 2018 populations of the participating nations, to determine the relationship between population and World Cup performance for this year."]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Data Understanding\n", "\n", "The data sources for this analysis will be pulled from two separate files.\n", "\n", "#### `world_cup_2018.json`\n", "\n", "* **Source**: This dataset comes from [`football.db`](http://openfootball.github.io/), a \"free and open public domain football database & schema for use in any (programming) language\"\n", "* **Contents**: Data about all games in the 2018 World Cup, including date, location (city and stadium), teams, goals scored (and by whom), and tournament group\n", "* **Format**: Nested JSON data (dictionary containing a list of rounds, each of which contains a list of matches, each of which contains information about the teams involved and the points scored)\n", "\n", "#### `country_populations.csv`\n", "\n", "* **Source**: This dataset comes from a curated collection by [DataHub.io](https://datahub.io/core/population), originally sourced from the World Bank\n", "* **Contents**: Data about populations by country for all available years from 1960 to 2018\n", "* **Format**: CSV data, where each row contains a country name, a year, and a population"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Requirements\n", "\n", "#### 1. List of Teams in 2018 World Cup\n", "\n", "Create an alphabetically-sorted list of teams who competed in the 2018 FIFA World Cup.\n", "\n", "#### 2. Associating Countries with 2018 World Cup Performance\n", "\n", "Create a data structure that connects a team name (country name) to its performance in the 2018 FIFA World Cup. 
We'll use the count of games won in the entire tournament (group stage as well as knockout stage) to represent the performance.\n", "\n", "This will help create visualizations to help the reader understand the distribution of games won and the performance of each team.\n", "\n", "#### 3. Associating Countries with 2018 Population\n", "\n", "Add to the existing data structure so that it also connects each country name to its 2018 population, and create visualizations comparable to those from step 2.\n", "\n", "#### 4. Analysis of Population vs. Performance\n", "\n", "Choose an appropriate statistical measure to analyze the relationship between population and performance, and create a visualization representing this relationship."]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Checking for Understanding\n", "\n", "Before moving on to the next step, pause and think about the strategy for this analysis.\n", "\n", "Remember, our business question is:\n", "\n", "> What is the relationship between the population of a country and their performance in the 2018 FIFA World Cup?\n", "\n", "#### Unit of Analysis\n", "\n", "First, what is our **unit of analysis**, and what is the **unique identifier**? In other words, what will one record in our final data structure represent, and what attribute uniquely describes it?\n", "\n", ".\n", "\n", ".\n", "\n", ".\n", "\n", "*Answer:* \n", "\n", "> What is the relationship between the population of a **country** and their performance in the 2018 FIFA World Cup?\n", "\n", "*Our unit of analysis is a* ***country*** *and the unique identifier we'll use is the* ***country name***\n", "\n", "#### Features\n", "\n", "Next, what **features** are we analyzing? 
In other words, what attributes of each country are we interested in?\n", "\n", ".\n", "\n", ".\n", "\n", ".\n", "\n", "*Answer:* \n", "\n", "> What is the relationship between the **population** of a country and their **performance in the 2018 FIFA World Cup**?\n", "\n", "*Our features are* ***2018 population*** *and* ***count of wins in the 2018 World Cup***\n", "\n", "#### Dataset to Start With\n", "\n", "Finally, which dataset should we **start** with? In this case, any record with missing data is not useful to us, so we want to start with the smaller dataset.\n", "\n", ".\n", "\n", ".\n", "\n", ".\n", "\n", "*Answer: Only 32 countries competed in the 2018 World Cup, compared to hundreds of countries in the world, so we should start with the* ***2018 World Cup*** *dataset. Then we can join it with the relevant records from the country population dataset.*"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Getting the Data\n", "\n", "Below we import the `json` and `csv` modules, which will be used for reading from `world_cup_2018.json` and `country_populations.csv`, respectively."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["# Run this cell without changes\n", "import json\n", "import csv"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Next, we open the relevant files."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["# Run this cell without changes\n", "world_cup_file = open(\"data/world_cup_2018.json\", encoding=\"utf8\")\n", "population_file = open(\"data/country_populations.csv\")"]}, {"cell_type": "markdown", "metadata": {}, "source": ["**Hint:** if your code below is not working, (e.g. 
`ValueError: I/O operation on closed file.`, or you get an empty list or dictionary) try re-running the cell above to reopen the files, then re-run your code.\n", "\n", "### 2018 World Cup Data\n", "\n", "In the cell below, use the `json` module to load the data from `world_cup_file` into a dictionary called `world_cup_data`"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["# Replace None with appropriate code\n", "world_cup_data = None\n", "\n", "# Close the file now that we're done reading from it\n", "world_cup_file.close()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Make sure the `assert` passes, ensuring that `world_cup_data` has the correct type."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["# Run this cell without changes\n", "\n", "# Check that the overall data structure is a dictionary\n", "assert type(world_cup_data) == dict\n", "\n", "# Check that the dictionary has 2 keys, 'name' and 'rounds'\n", "assert list(world_cup_data.keys()) == [\"name\", \"rounds\"]"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Population Data\n", "\n", "Now use the `csv` module to load the data from `population_file` into a list of dictionaries called `population_data`\n", "\n", "(Recall that you can convert a `csv.DictReader` object into a list of dictionaries using the built-in `list()` function.)"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["# Replace None with appropriate code\n", "population_data = None\n", "\n", "# Close the file now that we're done reading from it\n", "population_file.close()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Make sure the `assert`s pass, ensuring that `population_data` has the correct type."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["# Run this cell without changes\n", "\n", "# Check that the overall data 
structure is a list\n", "assert type(population_data) == list\n", "\n", "# Check that the 0th element is a dictionary\n", "# (csv.DictReader interface differs slightly by Python version;\n", "# either a dict or an OrderedDict is fine here)\n", "from collections import OrderedDict\n", "\n", "assert type(population_data[0]) == dict or type(population_data[0]) == OrderedDict"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## 1. List of Teams in 2018 World Cup\n", "\n", "> Create an alphabetically-sorted list of teams who competed in the 2018 FIFA World Cup.\n", "\n", "This will take several steps, some of which have been completed for you.\n", "\n", "### Exploring the Structure of the World Cup Data JSON\n", "\n", "Let's start by exploring the structure of `world_cup_data`. Here is a pretty-printed preview of its contents:\n", "\n", "```\n", "{\n", " \"name\": \"World Cup 2018\",\n", " \"rounds\": [\n", " {\n", " \"name\": \"Matchday 1\",\n", " \"matches\": [\n", " {\n", " \"num\": 1,\n", " \"date\": \"2018-06-14\",\n", " \"time\": \"18:00\",\n", " \"team1\": { \"name\": \"Russia\", \"code\": \"RUS\" },\n", " \"team2\": { \"name\": \"Saudi Arabia\", \"code\": \"KSA\" },\n", " \"score1\": 5,\n", " \"score2\": 0,\n", " \"score1i\": 2,\n", " \"score2i\": 0,\n", " \"goals1\": [\n", " { \"name\": \"Gazinsky\", \"minute\": 12, \"score1\": 1, \"score2\": 0 },\n", " { \"name\": \"Cheryshev\", \"minute\": 43, \"score1\": 2, \"score2\": 0 },\n", " { \"name\": \"Dzyuba\", \"minute\": 71, \"score1\": 3, \"score2\": 0 },\n", " { \"name\": \"Cheryshev\", \"minute\": 90, \"offset\": 1, \"score1\": 4, \"score2\": 0 },\n", " { \"name\": \"Golovin\", \"minute\": 90, \"offset\": 4, \"score1\": 5, \"score2\": 0 }\n", " ],\n", " \"goals2\": [],\n", " \"group\": \"Group A\",\n", " \"stadium\": { \"key\": \"luzhniki\", \"name\": \"Luzhniki Stadium\" },\n", " \"city\": \"Moscow\",\n", " \"timezone\": \"UTC+3\"\n", " }\n", " ]\n", " },\n", " {\n", " \"name\": \"Matchday 2\",\n", " 
\"matches\": [\n", " {\n", " \"num\": 2,\n", " \"date\": \"2018-06-15\",\n", " \"time\": \"17:00\",\n", " \"team1\": { \"name\": \"Egypt\", \"code\": \"EGY\" },\n", " \"team2\": { \"name\": \"Uruguay\", \"code\": \"URU\" },\n", " \"score1\": 0,\n", " \"score2\": 1,\n", " \"score1i\": 0,\n", " \"score2i\": 0,\n", " \"goals1\": [],\n", " \"goals2\": [\n", " { \"name\": \"Gim\u00e9nez\", \"minute\": 89, \"score1\": 0, \"score2\": 1 }\n", " ],\n", " \"group\": \"Group A\",\n", " \"stadium\": { \"key\": \"ekaterinburg\", \"name\": \"Ekaterinburg Arena\" }, \n", " \"city\": \"Ekaterinburg\",\n", " \"timezone\": \"UTC+5\"\n", " },\n", " ...\n", " ],\n", " },\n", " ], \n", "}\n", "```\n", "\n", "As noted previously, `world_cup_data` is a dictionary with two keys, 'name' and 'rounds'."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["# Run this cell without changes\n", "world_cup_data.keys()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["The value associated with the 'name' key is simply identifying the dataset."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["# Run this cell without changes\n", "world_cup_data[\"name\"]"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Extracting Rounds\n", "\n", "The value associated with the 'rounds' key is a list containing all of the actual information about the rounds and the matches within those rounds."]}, {"cell_type": "code", "execution_count": null, "metadata": {"scrolled": false}, "outputs": [], "source": ["# Run this cell without changes\n", "rounds = world_cup_data[\"rounds\"]\n", "\n", "print(\"type(rounds):\", type(rounds))\n", "print(\"len(rounds):\", len(rounds))\n", "print(\"type(rounds[3])\", type(rounds[3]))\n", "print(\"rounds[3]:\")\n", "rounds[3]"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Translating this output into English:\n", "\n", "Starting with the original `world_cup_data` dictionary, 
we used the key `\"rounds\"` to extract a list of rounds, which we assigned to the variable `rounds`.\n", "\n", "`rounds` is a list of dictionaries. Each dictionary inside of `rounds` contains a name (e.g. `\"Matchday 4\"`) as well as a list of matches."]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Extracting Matches\n", "\n", "Now we can go one level deeper and extract all of the matches in the tournament. Because the round is irrelevant for this analysis, we can loop over all rounds and combine all of their matches into a single list.\n", "\n", "**Hint:** This is a good use case for using the `.extend` list method rather than `.append`, since we want to combine several lists of dictionaries into a single list of dictionaries, not a list of lists of dictionaries. [Documentation here.](https://docs.python.org/3/tutorial/datastructures.html#more-on-lists)"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["# Replace None with appropriate code\n", "matches = []\n", "\n", "# \"round\" is a built-in function in Python so we use \"round_\" instead\n", "for round_ in rounds:\n", " # Extract the list of matches for this round\n", " round_matches = None\n", " # Add them to the overall list of matches\n", " None\n", "\n", "matches[0]"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Make sure the `assert`s pass before moving on to the next step."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["# Run this cell without changes\n", "\n", "# There should be 64 matches. If the length is 20, that means\n", "# you have a list of lists instead of a list of dictionaries\n", "assert len(matches) == 64\n", "\n", "# Each match in the list should be a dictionary\n", "assert type(matches[0]) == dict"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Extracting Teams\n", "\n", "Each match has a `team1` and a `team2`. 
"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["# Run this cell without changes\n", "print(matches[0][\"team1\"])\n", "print(matches[0][\"team2\"])"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Create a list of all unique team names by looping over every match in `matches` and adding the `\"name\"` values associated with both `team1` and `team2`. (Same as before when creating a list of matches, it doesn't matter right now whether a given team was \"team1\" or \"team2\", we just add everything to `teams`.)\n", "\n", "We'll use a `set` data type ([documentation here](https://docs.python.org/3/library/stdtypes.html#set-types-set-frozenset)) to ensure unique teams, then convert it to a sorted list at the end."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["# Replace None with appropriate code\n", "teams_set = set()\n", "\n", "for match in matches:\n", " # Add team1 name value to teams_set\n", " None\n", " # Add team2 name value to teams_set\n", " None\n", "\n", "teams = sorted(list(teams_set))\n", "print(teams)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Make sure the `assert`s pass before moving on to the next step."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["# Run this cell without changes\n", "\n", "# teams should be a list, not a set\n", "assert type(teams) == list\n", "\n", "# 32 teams competed in the 2018 World Cup\n", "assert len(teams) == 32\n", "\n", "# Each element of teams should be a string\n", "# (the name), not a dictionary\n", "assert type(teams[0]) == str"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Step 1 complete. We have unique identifiers (names) for each of our records (countries) that we will be able to use to connect 2018 World Cup performance to 2018 population."]}, {"cell_type": "markdown", "metadata": {}, "source": ["## 2. 
Associating Countries with 2018 World Cup Performance\n", "\n", "> Create a data structure that connects a team name (country name) to its performance in the 2018 FIFA World Cup. We'll use the count of games won in the entire tournament (group stage as well as knockout stage) to represent the performance.\n", "\n", "> Also, create visualizations to help the reader understand the distribution of games won and the performance of each team.\n", "\n", "So, we are building a **data structure** that connects a country name to the number of wins. There is no universal correct format for a data structure with this purpose, but we are going to use a format that resembles the \"dataframe\" format that will be introduced later in the course.\n", "\n", "Specifically, we'll build a **dictionary** where each key is the name of a country, and each value is a nested dictionary containing information about the number of wins and the 2018 population.\n", "\n", "The final result will look something like this:\n", "```\n", "{\n", " 'Argentina': { 'wins': 1, 'population': 44494502 },\n", " ...\n", " 'Uruguay': { 'wins': 4, 'population': 3449299 }\n", "}\n", "```\n", "\n", "For the current step (step 2), we'll build a data structure that looks something like this:\n", "```\n", "{\n", " 'Argentina': { 'wins': 1 },\n", " ...\n", " 'Uruguay': { 'wins': 4 }\n", "}\n", "```\n", "\n", "### Initializing with Wins Set to Zero\n", "\n", "Start by initializing a dictionary called `combined_data` containing:\n", "\n", "* Keys: the strings from `teams`\n", "* Values: each value the same, a dictionary containing the key `'wins'` with the associated value `0`. 
However, note that each value should be a distinct dictionary object in memory, not the same dictionary linked as a value in multiple places.\n", "\n", "Initially `combined_data` will look something like this:\n", "```\n", "{\n", " 'Argentina': { 'wins': 0 },\n", " ...\n", " 'Uruguay': { 'wins': 0 }\n", "}\n", "```"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["# Replace None with appropriate code\n", "\n", "# Create the variable combined_data as described above\n", "None"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Check that the `assert`s pass."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["# Run this cell without changes\n", "\n", "# combined_data should be a dictionary\n", "assert type(combined_data) == dict\n", "\n", "# the keys should be strings\n", "assert type(list(combined_data.keys())[0]) == str\n", "\n", "# the values should be dictionaries\n", "assert combined_data[\"Japan\"] == {\"wins\": 0}"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Adding Wins from Matches\n", "\n", "Now it's time to revisit the `matches` list from earlier, in order to associate a team with the number of times it has won a match.\n", "\n", "This time, let's write some functions to help organize our logic.\n", "\n", "Write a function `find_winner` that takes in a `match` dictionary, and returns the name of the team that won the match. 
Recall that a match is structured like this:\n", "\n", "```\n", "{\n", " 'num': 1,\n", " 'date': '2018-06-14',\n", " 'time': '18:00',\n", " 'team1': { 'name': 'Russia', 'code': 'RUS' },\n", " 'team2': { 'name': 'Saudi Arabia', 'code': 'KSA' },\n", " 'score1': 5,\n", " 'score2': 0,\n", " 'score1i': 2,\n", " 'score2i': 0,\n", " 'goals1': [\n", " { 'name': 'Gazinsky', 'minute': 12, 'score1': 1, 'score2': 0 },\n", " { 'name': 'Cheryshev', 'minute': 43, 'score1': 2, 'score2': 0 },\n", " { 'name': 'Dzyuba', 'minute': 71, 'score1': 3, 'score2': 0 },\n", " { 'name': 'Cheryshev', 'minute': 90, 'offset': 1, 'score1': 4, 'score2': 0 },\n", " { 'name': 'Golovin', 'minute': 90, 'offset': 4, 'score1': 5, 'score2': 0 }\n", " ],\n", " 'goals2': [],\n", " 'group': 'Group A',\n", " 'stadium': { 'key': 'luzhniki', 'name': 'Luzhniki Stadium' },\n", " 'city': 'Moscow',\n", " 'timezone': 'UTC+3'\n", "}\n", "```\n", "\n", "The winner is determined by comparing the values associated with the `'score1'` and `'score2'` keys. If score 1 is larger, then the name associated with the `'team1'` key is the winner. If score 2 is larger, then the name associated with the `'team2'` key is the winner. If the values are the same, there is no winner, so return `None`. 
(Unlike the group round of the World Cup, we are only counting *wins* as our \"performance\" construct, not 3 points for a win and 1 point for a tie.)"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["# Replace None with appropriate code\n", "\n", "\n", "def find_winner(match):\n", " \"\"\"\n", " Given a dictionary containing information about a match,\n", " return the name of the winner (or None in the case of a tie)\n", " \"\"\"\n", " None"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["# Run this cell without changes\n", "assert find_winner(matches[0]) == \"Russia\"\n", "assert find_winner(matches[1]) == \"Uruguay\"\n", "assert find_winner(matches[2]) == None"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Now that we have this helper function, loop over every match in `matches`, find the winner, and add 1 to the associated count of wins in `combined_data`. If the winner is `None`, skip adding it to the dictionary."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["# Replace None with appropriate code\n", "\n", "for match in matches:\n", " # Get the name of the winner\n", " winner = None\n", " # Only proceed to the next step if there was\n", " # a winner\n", " if winner:\n", " # Add 1 to the associated count of wins\n", " None\n", "\n", "# Visually inspect the output to ensure the wins are\n", "# different for different countries\n", "combined_data"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Analysis of Wins\n", "\n", "While we could try to understand all 32 of those numbers just by scanning through them, let's use some descriptive statistics and data visualizations instead\n", "\n", "#### Statistical Summary of Wins\n", "\n", "The code below calculates the mean, median, and standard deviation of the number of wins. 
If it doesn't work, that is an indication that something went wrong with the creation of the `combined_data` variable, and you might want to look at the solution branch and fix your code before proceeding."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["# Run this cell without changes\n", "import numpy as np\n", "\n", "wins = [val[\"wins\"] for val in combined_data.values()]\n", "\n", "print(\"Mean number of wins:\", np.mean(wins))\n", "print(\"Median number of wins:\", np.median(wins))\n", "print(\"Standard deviation of number of wins:\", np.std(wins))"]}, {"cell_type": "markdown", "metadata": {}, "source": ["#### Visualizations of Wins\n", "\n", "In addition to those numbers, let's make a histogram (showing the distributions of the number of wins) and a bar graph (showing the number of wins by country)."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["# Run this cell without changes\n", "import matplotlib.pyplot as plt\n", "\n", "# Set up figure and axes\n", "fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(12, 7))\n", "fig.set_tight_layout(True)\n", "\n", "# Histogram of Wins and Frequencies\n", "ax1.hist(x=wins, bins=range(8), align=\"left\", color=\"green\")\n", "ax1.set_xticks(range(7))\n", "ax1.set_xlabel(\"Wins in 2018 World Cup\")\n", "ax1.set_ylabel(\"Frequency\")\n", "ax1.set_title(\"Distribution of Wins\")\n", "\n", "# Horizontal Bar Graph of Wins by Country\n", "ax2.barh(teams[::-1], wins[::-1], color=\"green\")\n", "ax2.set_xlabel(\"Wins in 2018 World Cup\")\n", "ax2.set_title(\"Wins by Country\");"]}, {"cell_type": "markdown", "metadata": {}, "source": ["#### Interpretation of Win Analysis\n", "\n", "Before we move to looking at the relationship between wins and population, it's useful to understand the distribution of wins alone. 
A few notes of interpretation:\n", "\n", "* The number of wins is skewed and looks like a [negative binomial distribution](https://en.wikipedia.org/wiki/Negative_binomial_distribution), which makes sense conceptually\n", "* The \"typical\" value here is 1 (both the median and the highest point of the histogram), meaning a typical team that qualifies for the World Cup wins once\n", "* There are a few teams we might consider outliers: Belgium and France, with 6x the wins of the \"typical\" team and 1.5x the wins of the next \"runner-up\" (Uruguay, with 4 wins)\n", "* This is a fairly small dataset, something that becomes more noticeable with such a \"spiky\" (not smooth) histogram\n"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## 3. Associating Countries with 2018 Population\n", "\n", "> Add to the existing data structure so that it also connects each country name to its 2018 population, and create visualizations comparable to those from step 2.\n", "\n", "Now we're ready to add the 2018 population to `combined_data`, finally using the CSV file\n", "\n", "Recall that `combined_data` currently looks something like this:\n", "```\n", "{\n", " 'Argentina': { 'wins': 1 },\n", " ...\n", " 'Uruguay': { 'wins': 4 }\n", "}\n", "```\n", "\n", "And the goal is for it to look something like this:\n", "```\n", "{\n", " 'Argentina': { 'wins': 1, 'population': 44494502 },\n", " ...\n", " 'Uruguay': { 'wins': 4, 'population': 3449299 }\n", "}\n", "```\n", "\n", "To do that, we need to extract the 2018 population information from the CSV data.\n", "\n", "### Exploring the Structure of the Population Data CSV\n", "\n", "Recall that previously we loaded information from a CSV containing population data into a list of dictionaries called `population_data`."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["# Run this cell without changes\n", "len(population_data)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["12,695 is 
a very large number of rows to print out, so let's look at some samples instead."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["# Run this cell without changes\n", "np.random.seed(42)\n", "population_record_samples = np.random.choice(population_data, size=10)\n", "population_record_samples"]}, {"cell_type": "markdown", "metadata": {}, "source": ["There are **2 filtering tasks**, **1 data normalization task**, and **1 type conversion task** to be completed, based on what we can see in this sample. We'll walk through each of them below.\n", "\n", "(In a more realistic data cleaning environment, you most likely won't happen to get a sample that demonstrates all of the data cleaning steps needed, but this sample was chosen carefully for example purposes.)\n", "\n", "### Filtering Population Data\n", "\n", "We already should have suspected that this dataset would require some filtering, since there are 32 records in our current `combined_data` dataset and 12,695 records in `population_data`. Now that we have looked at this sample, we can identify 2 features we'll want to use in order to filter down the `population_data` records to just 32. Try to identify them before looking at the answer below.\n", "\n", ".\n", "\n", ".\n", "\n", ".\n", "\n", "*Answer: the two features to filter on are* ***`'Country Name'`*** *and* ***`'Year'`***. *We can see from the sample above that there are countries in `population_data` that are not present in `combined_data` (e.g. Malta) and there are years present that are not 2018.*\n", "\n", "In the cell below, create a new variable `population_data_filtered` that only includes relevant records from `population_data`. 
Relevant records are records where the country name is one of the countries in the `teams` list, and the year is \"2018\".\n", "\n", "(It's okay to leave 2018 as a string since we are not performing any math operations on it, just make sure you check for `\"2018\"` and not `2018`.)"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["# Replace None with appropriate code\n", "\n", "population_data_filtered = []\n", "\n", "for record in population_data:\n", " # Add record to population_data_filtered if relevant\n", " None\n", "\n", "len(population_data_filtered) # 27"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Hmm...what went wrong? Why do we only have 27 records, and not 32?\n", "\n", "Did we really get a dataset with 12k records that's missing 5 of the data points we need?\n", "\n", "Let's take a closer look at the population data samples again, specifically the third one:"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["# Run this cell without changes\n", "population_record_samples[2]"]}, {"cell_type": "markdown", "metadata": {}, "source": ["And compare that with the value for Iran in `teams`:"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["# Run this cell without changes\n", "teams[13]"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Ohhhh...we have a data normalization issue. One dataset refers to this country as `'Iran, Islamic Rep.'`, while the other refers to it as `'Iran'`. 
This is a common issue we face when using data about countries and regions, where there is no universally-accepted naming convention.\n", "\n", "### Normalizing Locations in Population Data\n", "\n", "Sometimes data normalization can be a very, very time-consuming task where you need to find \"crosswalk\" data that can link the two formats together, or you need to write advanced regex formulas to line everything up.\n", "\n", "For this task, there are only 5 missing, so we'll just go ahead and give you a function that makes the appropriate substitutions."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["# Run this cell without changes\n", "def normalize_location(country_name):\n", " \"\"\"\n", " Given a country name, return the name that the\n", " country uses when playing in the FIFA World Cup\n", " \"\"\"\n", " name_sub_dict = {\n", " \"Russian Federation\": \"Russia\",\n", " \"Egypt, Arab Rep.\": \"Egypt\",\n", " \"Iran, Islamic Rep.\": \"Iran\",\n", " \"Korea, Rep.\": \"South Korea\",\n", " \"United Kingdom\": \"England\",\n", " }\n", " # The .get method returns the corresponding value from\n", " # the dict if present, otherwise returns country_name\n", " return name_sub_dict.get(country_name, country_name)\n", "\n", "\n", "# Example where normalized location is different\n", "print(normalize_location(\"Russian Federation\"))\n", "# Example where normalized location is the same\n", "print(normalize_location(\"Argentina\"))"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Now, write new code to create `population_data_filtered` with normalized country names."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["# Replace None with appropriate code\n", "\n", "population_data_filtered = []\n", "\n", "for record in population_data:\n", " # Get normalized country name\n", " None\n", " # Add record to population_data_filtered if relevant\n", " if None:\n", " # Replace the 
country name in the record\n", " None\n", " # Append to list\n", " None\n", "\n", "len(population_data_filtered) # 32"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Great, now we should have 32 records instead of 27.\n", "\n", "### Type Conversion of Population Data\n", "\n", "We need to do one more thing before we'll have population data that is usable for analysis. Take a look at this record from `population_data_filtered` to see if you can spot it:"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["# Run this cell without changes\n", "population_data_filtered[0]"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Every value has the same data type (`str`), including the population. In this example, it's `'44494502'`, when it needs to be `44494502` if we want to be able to compute statistics with it.\n", "\n", "In the cell below, loop over `population_data_filtered` and convert the data type of the value associated with the `\"Value\"` key from a string to an integer, using the built-in `int()` function."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["# Replace None with appropriate code\n", "for record in population_data_filtered:\n", " # Convert the population value from str to int\n", " None\n", "\n", "# Look at the last record to make sure the population\n", "# value is an int\n", "population_data_filtered[-1]"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Check that it worked with the assert statement below:"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["# Run this cell without changes\n", "assert type(population_data_filtered[-1][\"Value\"]) == int"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Adding Population Data\n", "\n", "Now it's time to add the population data to `combined_data`. 
Recall that the data structure currently looks like this:"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["# Run this cell without changes\n", "combined_data"]}, {"cell_type": "markdown", "metadata": {}, "source": ["The goal is for it to be structured like this:\n", "```\n", "{\n", " 'Argentina': { 'wins': 1, 'population': 44494502 },\n", " ...\n", " 'Uruguay': { 'wins': 4, 'population': 3449299 }\n", "}\n", "```"]}, {"cell_type": "markdown", "metadata": {}, "source": ["In the cell below, loop over `population_data_filtered` and add information about population to each country in `combined_data`:"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["# Replace None with appropriate code\n", "for record in population_data_filtered:\n", " # Extract the country name from the record\n", " country = None\n", " # Extract the population value from the record\n", " population = None\n", " # Add this information to combined_data\n", " None\n", "\n", "# Look at combined_data\n", "combined_data"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Check that the types are correct with these assert statements:"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["# Run this cell without changes\n", "assert type(combined_data[\"Uruguay\"]) == dict\n", "assert type(combined_data[\"Uruguay\"][\"population\"]) == int"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Analysis of Population\n", "\n", "Let's perform the same analysis for population that we performed for count of wins.\n", "\n", "#### Statistical Analysis of Population"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["# Run this cell without changes\n", "populations = [val[\"population\"] for val in combined_data.values()]\n", "\n", "print(\"Mean population:\", np.mean(populations))\n", "print(\"Median population:\", 
np.median(populations))\n", "print(\"Standard deviation of population:\", np.std(populations))"]}, {"cell_type": "markdown", "metadata": {}, "source": ["#### Visualizations of Population"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["# Run this cell without changes\n", "\n", "# Set up figure and axes\n", "fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(12, 7))\n", "fig.set_tight_layout(True)\n", "\n", "# Histogram of Populations and Frequencies\n", "ax1.hist(x=populations, color=\"blue\")\n", "ax1.set_xlabel(\"2018 Population\")\n", "ax1.set_ylabel(\"Frequency\")\n", "ax1.set_title(\"Distribution of Population\")\n", "\n", "# Horizontal Bar Graph of Population by Country\n", "ax2.barh(teams[::-1], populations[::-1], color=\"blue\")\n", "ax2.set_xlabel(\"2018 Population\")\n", "ax2.set_title(\"Population by Country\");"]}, {"cell_type": "markdown", "metadata": {}, "source": ["#### Interpretation of Population Analysis\n", "\n", "* Similar to the distribution of the number of wins, the distribution of population is skewed.\n", "* It's hard to choose a single \"typical\" value here because there is so much variation.\n", "* The countries with the largest populations (Brazil, Nigeria, and Russia) do not overlap with the countries with the most wins (Belgium, France, and Uruguay)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## 4. Analysis of Population vs. Performance\n", "\n", "> Choose an appropriate statistical measure to analyze the relationship between population and performance, and create a visualization representing this relationship.\n", "\n", "### Statistical Measure\n", "So far we have learned about only two statistics for understanding the *relationship* between variables: **covariance** and **correlation**. 
We will use correlation here, because that provides a more standardized, interpretable metric."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["# Run this cell without changes\n", "np.corrcoef(wins, populations)[0][1]"]}, {"cell_type": "markdown", "metadata": {}, "source": ["In the cell below, interpret this number. What direction is this correlation? Is it strong or weak?"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["# Replace None with appropriate text\n", "\"\"\"\n", "None\n", "\"\"\""]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Data Visualization\n", "\n", "A **scatter plot** is the most sensible form of data visualization for showing this relationship, because we have two dimensions of data, but there is no \"increasing\" variable (e.g. time) that would indicate we should use a line graph."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["# Run this cell without changes\n", "\n", "# Set up figure\n", "fig, ax = plt.subplots(figsize=(8, 5))\n", "\n", "# Basic scatter plot\n", "ax.scatter(x=populations, y=wins, color=\"gray\", alpha=0.5, s=100)\n", "ax.set_xlabel(\"2018 Population\")\n", "ax.set_ylabel(\"2018 World Cup Wins\")\n", "ax.set_title(\"Population vs. 
World Cup Wins\")\n", "\n", "# Add annotations for specific points of interest\n", "highlighted_points = {\n", " \"Belgium\": 2, # Numbers are the index of that\n", " \"Brazil\": 3, # country in populations & wins\n", " \"France\": 10,\n", " \"Nigeria\": 17,\n", "}\n", "for country, index in highlighted_points.items():\n", " # Get x and y position of data point\n", " x = populations[index]\n", " y = wins[index]\n", " # Move each point slightly down and to the left\n", " # (numbers were chosen by manually tweaking)\n", " xtext = x - (1.25e6 * len(country))\n", " ytext = y - 0.5\n", " # Annotate with relevant arguments\n", " ax.annotate(text=country, xy=(x, y), xytext=(xtext, ytext))"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Data Visualization Interpretation\n", "\n", "Interpret this plot in the cell below. Does this align with the findings from the statistical measure (correlation), as well as the map shown at the beginning of this lab (showing the best results by country)?"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["# Replace None with appropriate text\n", "\"\"\"\n", "None\n", "\"\"\""]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Final Analysis\n", "\n", "> What is the relationship between the population of a country and their performance in the 2018 FIFA World Cup?\n", "\n", "Overall, we found a very weakly positive relationship between the population of a country and their performance in the 2018 FIFA World Cup, as demonstrated by both the correlation between populations and wins, and the scatter plot.\n", "\n", "In the cell below, write down your thoughts on these questions:\n", "\n", " - What are your thoughts on why you may see this result?\n", " - What would you research next?"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["# Replace None with appropriate text\n", "\"\"\"\n", "None\n", "\"\"\""]}, {"cell_type": "markdown", 
"metadata": {}, "source": ["## Summary\n", "\n", "That was a long lab, pulling together a lot of material. You read data into Python, extracted the relevant information, cleaned the data, and combined the data into a new format to be used in analysis. While we will continue to introduce new tools and techniques, these essential steps will be present for the rest of your data science projects from here on out."]}], "metadata": {"kernelspec": {"display_name": "Python (learn-env)", "language": "python", "name": "learn-env"}, "language_info": {"codemirror_mode": {"name": "ipython", "version": 3}, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5"}}, "nbformat": 4, "nbformat_minor": 4}
\ No newline at end of file
+{"cells":[{"cell_type":"markdown","metadata":{},"source":["# Data Serialization Formats - Cumulative Lab\n","\n","## Introduction\n","\n","Now that you have learned about CSV and JSON file formats individually, it's time to bring them together with a cumulative lab! Even as a junior data scientist, you can often produce novel, interesting analyses by combining multiple datasets that haven't been combined before.\n","\n","## Objectives\n","\n","You will be able to:\n","\n","* Practice reading serialized JSON and CSV data from files into Python objects\n","* Practice extracting information from nested data structures\n","* Practice cleaning data (filtering, normalizing locations, converting types)\n","* Combine data from multiple sources into a single data structure\n","* Interpret descriptive statistics and data visualizations to present your findings\n","\n","## Your Task: Analyze the Relationship between Population and World Cup Performance\n","\n","\n","\n","Photo by Fauzan Saari on Unsplash "]},{"cell_type":"markdown","metadata":{},"source":["### Business Understanding\n","\n","#### What is the relationship between the population of a country and their performance in the 2018 FIFA World Cup?\n","\n","Intuitively, we might assume that countries with larger populations would have better performance in international sports competitions. While this has been demonstrated to be [true for the Olympics](https://www.researchgate.net/publication/308513557_Medals_at_the_Olympic_Games_The_Relationship_Between_Won_Medals_Gross_Domestic_Product_Population_Size_and_the_Weight_of_Sportive_Practice), the results for the FIFA World Cup are more mixed:\n","\n","CC BY-SA 3.0 , Link
\n","\n","In this analysis, we are going to look specifically at the sample of World Cup games in 2018 and the corresponding 2018 populations of the participating nations, to determine the relationship between population and World Cup performance for this year."]},{"cell_type":"markdown","metadata":{},"source":["### Data Understanding\n","\n","The data sources for this analysis will be pulled from two separate files.\n","\n","#### `world_cup_2018.json`\n","\n","* **Source**: This dataset comes from [`football.db`](http://openfootball.github.io/), a \"free and open public domain football database & schema for use in any (programming) language\"\n","* **Contents**: Data about all games in the 2018 World Cup, including date, location (city and stadium), teams, goals scored (and by whom), and tournament group\n","* **Format**: Nested JSON data (dictionary containing a list of rounds, each of which contains a list of matches, each of which contains information about the teams involved and the points scored)\n","\n","#### `country_populations.csv`\n","\n","* **Source**: This dataset comes from a curated collection by [DataHub.io](https://datahub.io/core/population), originally sourced from the World Bank\n","* **Contents**: Data about populations by country for all available years from 1960 to 2018\n","* **Format**: CSV data, where each row contains a country name, a year, and a population"]},{"cell_type":"markdown","metadata":{},"source":["### Requirements\n","\n","#### 1. List of Teams in 2018 World Cup\n","\n","Create an alphabetically-sorted list of teams who competed in the 2018 FIFA World Cup.\n","\n","#### 2. Associating Countries with 2018 World Cup Performance\n","\n","Create a data structure that connects a team name (country name) to its performance in the 2018 FIFA World Cup. 
We'll use the count of games won in the entire tournament (group stage as well as knockout stage) to represent the performance.\n","\n","This will help create visualizations to help the reader understand the distribution of games won and the performance of each team.\n","\n","#### 3. Associating Countries with 2018 Population\n","\n","Add to the existing data structure so that it also connects each country name to its 2018 population, and create visualizations comparable to those from step 2.\n","\n","#### 4. Analysis of Population vs. Performance\n","\n","Choose an appropriate statistical measure to analyze the relationship between population and performance, and create a visualization representing this relationship."]},{"cell_type":"markdown","metadata":{},"source":["### Checking for Understanding\n","\n","Before moving on to the next step, pause and think about the strategy for this analysis.\n","\n","Remember, our business question is:\n","\n","> What is the relationship between the population of a country and their performance in the 2018 FIFA World Cup?\n","\n","#### Unit of Analysis\n","\n","First, what is our **unit of analysis**, and what is the **unique identifier**? In other words, what will one record in our final data structure represent, and what attribute uniquely describes it?\n","\n",".\n","\n",".\n","\n",".\n","\n","*Answer:* \n","\n","> What is the relationship between the population of a **country** and their performance in the 2018 FIFA World Cup?\n","\n","*Our unit of analysis is a* ***country*** *and the unique identifier we'll use is the* ***country name***\n","\n","#### Features\n","\n","Next, what **features** are we analyzing? 
In other words, what attributes of each country are we interested in?\n","\n",".\n","\n",".\n","\n",".\n","\n","*Answer:* \n","\n","> What is the relationship between the **population** of a country and their **performance in the 2018 FIFA World Cup**?\n","\n","*Our features are* ***2018 population*** *and* ***count of wins in the 2018 World Cup***\n","\n","#### Dataset to Start With\n","\n","Finally, which dataset should we **start** with? In this case, any record with missing data is not useful to us, so we want to start with the smaller dataset.\n","\n",".\n","\n",".\n","\n",".\n","\n","*Answer: Only 32 countries competed in the 2018 World Cup, compared to roughly 200 countries in the world, so we should start with the* ***2018 World Cup*** *dataset. Then we can join it with the relevant records from the country population dataset.*"]},{"cell_type":"markdown","metadata":{},"source":["## Getting the Data\n","\n","Below we import the `json` and `csv` modules, which will be used for reading from `world_cup_2018.json` and `country_populations.csv`, respectively."]},{"cell_type":"code","execution_count":205,"metadata":{},"outputs":[],"source":["# Run this cell without changes\n","import json\n","import csv"]},{"cell_type":"markdown","metadata":{},"source":["Next, we open the relevant files."]},{"cell_type":"code","execution_count":211,"metadata":{},"outputs":[],"source":["# Run this cell without changes\n","world_cup_file = open(\"data/world_cup_2018.json\", encoding=\"utf8\")\n","population_file = open(\"data/country_populations.csv\")"]},{"cell_type":"markdown","metadata":{},"source":["**Hint:** if your code below is not working, (e.g. 
`ValueError: I/O operation on closed file.`, or you get an empty list or dictionary) try re-running the cell above to reopen the files, then re-run your code.\n","\n","### 2018 World Cup Data\n","\n","In the cell below, use the `json` module to load the data from `world_cup_file` into a dictionary called `world_cup_data`"]},{"cell_type":"code","execution_count":208,"metadata":{},"outputs":[{"data":{"text/plain":["_io.TextIOWrapper"]},"execution_count":208,"metadata":{},"output_type":"execute_result"}],"source":["type(world_cup_file)"]},{"cell_type":"code","execution_count":212,"metadata":{},"outputs":[],"source":["# Replace None with appropriate code\n","world_cup_data = json.load(world_cup_file)\n","\n","# Close the file now that we're done reading from it\n","world_cup_file.close()"]},{"cell_type":"markdown","metadata":{},"source":["Make sure the `assert` passes, ensuring that `world_cup_data` has the correct type."]},{"cell_type":"code","execution_count":213,"metadata":{},"outputs":[],"source":["# Run this cell without changes\n","\n","# Check that the overall data structure is a dictionary\n","assert type(world_cup_data) == dict\n","\n","# Check that the dictionary has 2 keys, 'name' and 'rounds'\n","assert list(world_cup_data.keys()) == [\"name\", \"rounds\"]"]},{"cell_type":"markdown","metadata":{},"source":["### Population Data\n","\n","Now use the `csv` module to load the data from `population_file` into a list of dictionaries called `population_data`\n","\n","(Recall that you can convert a `csv.DictReader` object into a list of dictionaries using the built-in `list()` function.)"]},{"cell_type":"code","execution_count":214,"metadata":{},"outputs":[{"data":{"text/plain":["_io.TextIOWrapper"]},"execution_count":214,"metadata":{},"output_type":"execute_result"}],"source":["type(population_file)"]},{"cell_type":"code","execution_count":215,"metadata":{},"outputs":[],"source":["# Replace None with appropriate code\n","population_data = 
list(csv.DictReader(population_file))\n","\n","# Close the file now that we're done reading from it\n","population_file.close()"]},{"cell_type":"markdown","metadata":{},"source":["Make sure the `assert`s pass, ensuring that `population_data` has the correct type."]},{"cell_type":"code","execution_count":216,"metadata":{},"outputs":[],"source":["# Run this cell without changes\n","\n","# Check that the overall data structure is a list\n","assert type(population_data) == list\n","\n","# Check that the 0th element is a dictionary\n","# (csv.DictReader interface differs slightly by Python version;\n","# either a dict or an OrderedDict is fine here)\n","from collections import OrderedDict\n","\n","assert type(population_data[0]) == dict or type(population_data[0]) == OrderedDict"]},{"cell_type":"markdown","metadata":{},"source":["## 1. List of Teams in 2018 World Cup\n","\n","> Create an alphabetically-sorted list of teams who competed in the 2018 FIFA World Cup.\n","\n","This will take several steps, some of which have been completed for you.\n","\n","### Exploring the Structure of the World Cup Data JSON\n","\n","Let's start by exploring the structure of `world_cup_data`. 
Here is a pretty-printed preview of its contents:\n","\n","```\n","{\n"," \"name\": \"World Cup 2018\",\n"," \"rounds\": [\n"," {\n"," \"name\": \"Matchday 1\",\n"," \"matches\": [\n"," {\n"," \"num\": 1,\n"," \"date\": \"2018-06-14\",\n"," \"time\": \"18:00\",\n"," \"team1\": { \"name\": \"Russia\", \"code\": \"RUS\" },\n"," \"team2\": { \"name\": \"Saudi Arabia\", \"code\": \"KSA\" },\n"," \"score1\": 5,\n"," \"score2\": 0,\n"," \"score1i\": 2,\n"," \"score2i\": 0,\n"," \"goals1\": [\n"," { \"name\": \"Gazinsky\", \"minute\": 12, \"score1\": 1, \"score2\": 0 },\n"," { \"name\": \"Cheryshev\", \"minute\": 43, \"score1\": 2, \"score2\": 0 },\n"," { \"name\": \"Dzyuba\", \"minute\": 71, \"score1\": 3, \"score2\": 0 },\n"," { \"name\": \"Cheryshev\", \"minute\": 90, \"offset\": 1, \"score1\": 4, \"score2\": 0 },\n"," { \"name\": \"Golovin\", \"minute\": 90, \"offset\": 4, \"score1\": 5, \"score2\": 0 }\n"," ],\n"," \"goals2\": [],\n"," \"group\": \"Group A\",\n"," \"stadium\": { \"key\": \"luzhniki\", \"name\": \"Luzhniki Stadium\" },\n"," \"city\": \"Moscow\",\n"," \"timezone\": \"UTC+3\"\n"," }\n"," ]\n"," },\n"," {\n"," \"name\": \"Matchday 2\",\n"," \"matches\": [\n"," {\n"," \"num\": 2,\n"," \"date\": \"2018-06-15\",\n"," \"time\": \"17:00\",\n"," \"team1\": { \"name\": \"Egypt\", \"code\": \"EGY\" },\n"," \"team2\": { \"name\": \"Uruguay\", \"code\": \"URU\" },\n"," \"score1\": 0,\n"," \"score2\": 1,\n"," \"score1i\": 0,\n"," \"score2i\": 0,\n"," \"goals1\": [],\n"," \"goals2\": [\n"," { \"name\": \"Giménez\", \"minute\": 89, \"score1\": 0, \"score2\": 1 }\n"," ],\n"," \"group\": \"Group A\",\n"," \"stadium\": { \"key\": \"ekaterinburg\", \"name\": \"Ekaterinburg Arena\" }, \n"," \"city\": \"Ekaterinburg\",\n"," \"timezone\": \"UTC+5\"\n"," },\n"," ...\n"," ],\n"," },\n"," ], \n","}\n","```\n","\n","As noted previously, `world_cup_data` is a dictionary with two keys, 'name' and 
'rounds'."]},{"cell_type":"code","execution_count":217,"metadata":{},"outputs":[{"data":{"text/plain":["dict_keys(['name', 'rounds'])"]},"execution_count":217,"metadata":{},"output_type":"execute_result"}],"source":["# Run this cell without changes\n","world_cup_data.keys()"]},{"cell_type":"markdown","metadata":{},"source":["The value associated with the 'name' key simply identifies the dataset."]},{"cell_type":"code","execution_count":218,"metadata":{},"outputs":[{"data":{"text/plain":["'World Cup 2018'"]},"execution_count":218,"metadata":{},"output_type":"execute_result"}],"source":["# Run this cell without changes\n","world_cup_data[\"name\"]"]},{"cell_type":"markdown","metadata":{},"source":["### Extracting Rounds\n","\n","The value associated with the 'rounds' key is a list containing all of the actual information about the rounds and the matches within those rounds."]},{"cell_type":"code","execution_count":219,"metadata":{"scrolled":false},"outputs":[{"name":"stdout","output_type":"stream","text":["type(rounds): <class 'list'>\n","len(rounds): 20\n","type(rounds[3]) <class 'dict'>\n","rounds[3]:\n"]},{"data":{"text/plain":["{'name': 'Matchday 4',\n"," 'matches': [{'num': 9,\n"," 'date': '2018-06-17',\n"," 'time': '21:00',\n"," 'team1': {'name': 'Brazil', 'code': 'BRA'},\n"," 'team2': {'name': 'Switzerland', 'code': 'SUI'},\n"," 'score1': 1,\n"," 'score2': 1,\n"," 'score1i': 1,\n"," 'score2i': 0,\n"," 'goals1': [{'name': 'Coutinho', 'minute': 20, 'score1': 1, 'score2': 0}],\n"," 'goals2': [{'name': 'Zuber', 'minute': 50, 'score1': 1, 'score2': 1}],\n"," 'group': 'Group E',\n"," 'stadium': {'key': 'rostov', 'name': 'Rostov Arena'},\n"," 'city': 'Rostov-on-Don',\n"," 'timezone': 'UTC+3'},\n"," {'num': 10,\n"," 'date': '2018-06-17',\n"," 'time': '16:00',\n"," 'team1': {'name': 'Costa Rica', 'code': 'CRC'},\n"," 'team2': {'name': 'Serbia', 'code': 'SRB'},\n"," 'score1': 0,\n"," 'score2': 1,\n"," 'score1i': 0,\n"," 'score2i': 0,\n"," 'goals1': [],\n"," 'goals2': [{'name': 'Kolarov', 
'minute': 56, 'score1': 0, 'score2': 1}],\n"," 'group': 'Group E',\n"," 'stadium': {'key': 'samara', 'name': 'Samara Arena'},\n"," 'city': 'Samara',\n"," 'timezone': 'UTC+4'},\n"," {'num': 11,\n"," 'date': '2018-06-17',\n"," 'time': '18:00',\n"," 'team1': {'name': 'Germany', 'code': 'GER'},\n"," 'team2': {'name': 'Mexico', 'code': 'MEX'},\n"," 'score1': 0,\n"," 'score2': 1,\n"," 'score1i': 0,\n"," 'score2i': 1,\n"," 'goals1': [],\n"," 'goals2': [{'name': 'Lozano', 'minute': 35, 'score1': 0, 'score2': 1}],\n"," 'group': 'Group F',\n"," 'stadium': {'key': 'luzhniki', 'name': 'Luzhniki Stadium'},\n"," 'city': 'Moscow',\n"," 'timezone': 'UTC+3'}]}"]},"execution_count":219,"metadata":{},"output_type":"execute_result"}],"source":["# Run this cell without changes\n","rounds = world_cup_data[\"rounds\"]\n","\n","print(\"type(rounds):\", type(rounds))\n","print(\"len(rounds):\", len(rounds))\n","print(\"type(rounds[3])\", type(rounds[3]))\n","print(\"rounds[3]:\")\n","rounds[3]"]},{"cell_type":"markdown","metadata":{},"source":["Translating this output into English:\n","\n","Starting with the original `world_cup_data` dictionary, we used the key `\"rounds\"` to extract a list of rounds, which we assigned to the variable `rounds`.\n","\n","`rounds` is a list of dictionaries. Each dictionary inside of `rounds` contains a name (e.g. `\"Matchday 4\"`) as well as a list of matches."]},{"cell_type":"markdown","metadata":{},"source":["### Extracting Matches\n","\n","Now we can go one level deeper and extract all of the matches in the tournament. Because the round is irrelevant for this analysis, we can loop over all rounds and combine all of their matches into a single list.\n","\n","**Hint:** This is a good use case for using the `.extend` list method rather than `.append`, since we want to combine several lists of dictionaries into a single list of dictionaries, not a list of lists of dictionaries. 
[Documentation here.](https://docs.python.org/3/tutorial/datastructures.html#more-on-lists)"]},{"cell_type":"code","execution_count":220,"metadata":{},"outputs":[{"data":{"text/plain":["[{'num': 1,\n"," 'date': '2018-06-14',\n"," 'time': '18:00',\n"," 'team1': {'name': 'Russia', 'code': 'RUS'},\n"," 'team2': {'name': 'Saudi Arabia', 'code': 'KSA'},\n"," 'score1': 5,\n"," 'score2': 0,\n"," 'score1i': 2,\n"," 'score2i': 0,\n"," 'goals1': [{'name': 'Gazinsky', 'minute': 12, 'score1': 1, 'score2': 0},\n"," {'name': 'Cheryshev', 'minute': 43, 'score1': 2, 'score2': 0},\n"," {'name': 'Dzyuba', 'minute': 71, 'score1': 3, 'score2': 0},\n"," {'name': 'Cheryshev', 'minute': 90, 'offset': 1, 'score1': 4, 'score2': 0},\n"," {'name': 'Golovin', 'minute': 90, 'offset': 4, 'score1': 5, 'score2': 0}],\n"," 'goals2': [],\n"," 'group': 'Group A',\n"," 'stadium': {'key': 'luzhniki', 'name': 'Luzhniki Stadium'},\n"," 'city': 'Moscow',\n"," 'timezone': 'UTC+3'}]"]},"execution_count":220,"metadata":{},"output_type":"execute_result"}],"source":["rounds[0]['matches']"]},{"cell_type":"code","execution_count":221,"metadata":{},"outputs":[{"data":{"text/plain":["{'num': 1,\n"," 'date': '2018-06-14',\n"," 'time': '18:00',\n"," 'team1': {'name': 'Russia', 'code': 'RUS'},\n"," 'team2': {'name': 'Saudi Arabia', 'code': 'KSA'},\n"," 'score1': 5,\n"," 'score2': 0,\n"," 'score1i': 2,\n"," 'score2i': 0,\n"," 'goals1': [{'name': 'Gazinsky', 'minute': 12, 'score1': 1, 'score2': 0},\n"," {'name': 'Cheryshev', 'minute': 43, 'score1': 2, 'score2': 0},\n"," {'name': 'Dzyuba', 'minute': 71, 'score1': 3, 'score2': 0},\n"," {'name': 'Cheryshev', 'minute': 90, 'offset': 1, 'score1': 4, 'score2': 0},\n"," {'name': 'Golovin', 'minute': 90, 'offset': 4, 'score1': 5, 'score2': 0}],\n"," 'goals2': [],\n"," 'group': 'Group A',\n"," 'stadium': {'key': 'luzhniki', 'name': 'Luzhniki Stadium'},\n"," 'city': 'Moscow',\n"," 'timezone': 
'UTC+3'}"]},"execution_count":221,"metadata":{},"output_type":"execute_result"}],"source":["# Replace None with appropriate code\n","matches = []\n","\n","# \"round\" is a built-in function in Python so we use \"round_\" instead\n","for round_ in rounds:\n"," # Extract the list of matches for this round\n"," round_matches = round_['matches']\n"," # Add them to the overall list of matches\n"," matches.extend(round_matches)\n"," \n","#round_\n","matches[0]"]},{"cell_type":"markdown","metadata":{},"source":["Make sure the `assert`s pass before moving on to the next step."]},{"cell_type":"code","execution_count":222,"metadata":{},"outputs":[],"source":["# Run this cell without changes\n","\n","# There should be 64 matches. If the length is 20, that means\n","# you have a list of lists instead of a list of dictionaries\n","assert len(matches) == 64\n","\n","# Each match in the list should be a dictionary\n","assert type(matches[0]) == dict"]},{"cell_type":"markdown","metadata":{},"source":["### Extracting Teams\n","\n","Each match has a `team1` and a `team2`. "]},{"cell_type":"code","execution_count":223,"metadata":{},"outputs":[{"name":"stdout","output_type":"stream","text":["{'name': 'Russia', 'code': 'RUS'}\n","{'name': 'Saudi Arabia', 'code': 'KSA'}\n"]}],"source":["# Run this cell without changes\n","print(matches[0][\"team1\"])\n","print(matches[0][\"team2\"])"]},{"cell_type":"code","execution_count":224,"metadata":{},"outputs":[{"data":{"text/plain":["'Russia'"]},"execution_count":224,"metadata":{},"output_type":"execute_result"}],"source":["matches[0]['team1']['name']"]},{"cell_type":"markdown","metadata":{},"source":["Create a list of all unique team names by looping over every match in `matches` and adding the `\"name\"` values associated with both `team1` and `team2`. 
(Same as before when creating a list of matches, it doesn't matter right now whether a given team was \"team1\" or \"team2\", we just add everything to `teams`.)\n","\n","We'll use a `set` data type ([documentation here](https://docs.python.org/3/library/stdtypes.html#set-types-set-frozenset)) to ensure unique teams, then convert it to a sorted list at the end."]},{"cell_type":"code","execution_count":225,"metadata":{},"outputs":[{"name":"stdout","output_type":"stream","text":["['Argentina', 'Australia', 'Belgium', 'Brazil', 'Colombia', 'Costa Rica', 'Croatia', 'Denmark', 'Egypt', 'England', 'France', 'Germany', 'Iceland', 'Iran', 'Japan', 'Mexico', 'Morocco', 'Nigeria', 'Panama', 'Peru', 'Poland', 'Portugal', 'Russia', 'Saudi Arabia', 'Senegal', 'Serbia', 'South Korea', 'Spain', 'Sweden', 'Switzerland', 'Tunisia', 'Uruguay']\n"]}],"source":["# Replace None with appropriate code\n","teams_set = set()\n","\n","for match in matches:\n"," # Add team1 name value to teams_set\n"," teams_set.add(match[\"team1\"]['name']) #None\n"," # Add team2 name value to teams_set\n"," teams_set.add(match[\"team2\"]['name']) #None\n","\n","teams = sorted(list(teams_set))\n","print(teams)"]},{"cell_type":"markdown","metadata":{},"source":["Make sure the `assert`s pass before moving on to the next step."]},{"cell_type":"code","execution_count":226,"metadata":{},"outputs":[],"source":["# Run this cell without changes\n","\n","# teams should be a list, not a set\n","assert type(teams) == list\n","\n","# 32 teams competed in the 2018 World Cup\n","assert len(teams) == 32\n","\n","# Each element of teams should be a string\n","# (the name), not a dictionary\n","assert type(teams[0]) == str"]},{"cell_type":"markdown","metadata":{},"source":["Step 1 complete. We have unique identifiers (names) for each of our records (countries) that we will be able to use to connect 2018 World Cup performance to 2018 population."]},{"cell_type":"markdown","metadata":{},"source":["## 2. 
Associating Countries with 2018 World Cup Performance\n","\n","> Create a data structure that connects a team name (country name) to its performance in the 2018 FIFA World Cup. We'll use the count of games won in the entire tournament (group stage as well as knockout stage) to represent the performance.\n","\n","> Also, create visualizations to help the reader understand the distribution of games won and the performance of each team.\n","\n","So, we are building a **data structure** that connects a country name to the number of wins. There is no universal correct format for a data structure with this purpose, but we are going to use a format that resembles the \"dataframe\" format that will be introduced later in the course.\n","\n","Specifically, we'll build a **dictionary** where each key is the name of a country, and each value is a nested dictionary containing information about the number of wins and the 2018 population.\n","\n","The final result will look something like this:\n","```\n","{\n"," 'Argentina': { 'wins': 1, 'population': 44494502 },\n"," ...\n"," 'Uruguay': { 'wins': 4, 'population': 3449299 }\n","}\n","```\n","\n","For the current step (step 2), we'll build a data structure that looks something like this:\n","```\n","{\n"," 'Argentina': { 'wins': 1 },\n"," ...\n"," 'Uruguay': { 'wins': 4 }\n","}\n","```\n","\n","### Initializing with Wins Set to Zero\n","\n","Start by initializing a dictionary called `combined_data` containing:\n","\n","* Keys: the strings from `teams`\n","* Values: each value the same, a dictionary containing the key `'wins'` with the associated value `0`. 
However, note that each value should be a distinct dictionary object in memory, not the same dictionary linked as a value in multiple places.\n","\n","Initially `combined_data` will look something like this:\n","```\n","{\n"," 'Argentina': { 'wins': 0 },\n"," ...\n"," 'Uruguay': { 'wins': 0 }\n","}\n","```"]},{"cell_type":"code","execution_count":227,"metadata":{},"outputs":[{"data":{"text/plain":["'Argentina'"]},"execution_count":227,"metadata":{},"output_type":"execute_result"}],"source":["teams.index('Japan')\n","teams[0]"]},{"cell_type":"code","execution_count":228,"metadata":{},"outputs":[],"source":["# Replace None with appropriate code\n","\n","# Create the variable combined_data as described above\n","combined_data ={}\n","for x in teams:\n"," combined_data.update({x:{'wins':0}})"]},{"cell_type":"markdown","metadata":{},"source":["Check that the `assert`s pass."]},{"cell_type":"code","execution_count":229,"metadata":{},"outputs":[{"data":{"text/plain":["dict_keys(['Argentina', 'Australia', 'Belgium', 'Brazil', 'Colombia', 'Costa Rica', 'Croatia', 'Denmark', 'Egypt', 'England', 'France', 'Germany', 'Iceland', 'Iran', 'Japan', 'Mexico', 'Morocco', 'Nigeria', 'Panama', 'Peru', 'Poland', 'Portugal', 'Russia', 'Saudi Arabia', 'Senegal', 'Serbia', 'South Korea', 'Spain', 'Sweden', 'Switzerland', 'Tunisia', 'Uruguay'])"]},"execution_count":229,"metadata":{},"output_type":"execute_result"}],"source":["combined_data.keys()"]},{"cell_type":"code","execution_count":230,"metadata":{},"outputs":[],"source":["# Run this cell without changes\n","\n","# combined_data should be a dictionary\n","assert type(combined_data) == dict\n","\n","# the keys should be strings\n","assert type(list(combined_data.keys())[0]) == str\n","\n","# the values should be dictionaries\n","assert combined_data[\"Japan\"] == {\"wins\": 0}"]},{"cell_type":"markdown","metadata":{},"source":["### Adding Wins from Matches\n","\n","Now it's time to revisit the `matches` list from earlier, in order to 
associate a team with the number of times it has won a match.\n","\n","This time, let's write some functions to help organize our logic.\n","\n","Write a function `find_winner` that takes in a `match` dictionary, and returns the name of the team that won the match. Recall that a match is structured like this:\n","\n","```\n","{\n"," 'num': 1,\n"," 'date': '2018-06-14',\n"," 'time': '18:00',\n"," 'team1': { 'name': 'Russia', 'code': 'RUS' },\n"," 'team2': { 'name': 'Saudi Arabia', 'code': 'KSA' },\n"," 'score1': 5,\n"," 'score2': 0,\n"," 'score1i': 2,\n"," 'score2i': 0,\n"," 'goals1': [\n"," { 'name': 'Gazinsky', 'minute': 12, 'score1': 1, 'score2': 0 },\n"," { 'name': 'Cheryshev', 'minute': 43, 'score1': 2, 'score2': 0 },\n"," { 'name': 'Dzyuba', 'minute': 71, 'score1': 3, 'score2': 0 },\n"," { 'name': 'Cheryshev', 'minute': 90, 'offset': 1, 'score1': 4, 'score2': 0 },\n"," { 'name': 'Golovin', 'minute': 90, 'offset': 4, 'score1': 5, 'score2': 0 }\n"," ],\n"," 'goals2': [],\n"," 'group': 'Group A',\n"," 'stadium': { 'key': 'luzhniki', 'name': 'Luzhniki Stadium' },\n"," 'city': 'Moscow',\n"," 'timezone': 'UTC+3'\n","}\n","```\n","\n","The winner is determined by comparing the values associated with the `'score1'` and `'score2'` keys. If score 1 is larger, then the name associated with the `'team1'` key is the winner. If score 2 is larger, then the name associated with the `'team2'` key is the winner. If the values are the same, there is no winner, so return `None`. 
(Unlike the group round of the World Cup, we are only counting *wins* as our \"performance\" construct, not 3 points for a win and 1 point for a tie.)"]},{"cell_type":"code","execution_count":231,"metadata":{},"outputs":[],"source":["# Replace None with appropriate code\n","\n","\n","def find_winner(match):\n","    \"\"\"\n","    Given a dictionary containing information about a match,\n","    return the name of the winner (or None in the case of a tie)\n","    \"\"\"\n","    if match['score1'] > match['score2']:\n","        return match['team1']['name']\n","    if match['score2'] > match['score1']:\n","        return match['team2']['name']\n","    # Scores are equal, so there is no winner\n","    return None"]},{"cell_type":"code","execution_count":232,"metadata":{},"outputs":[{"data":{"text/plain":["'Spain'"]},"execution_count":232,"metadata":{},"output_type":"execute_result"}],"source":["find_winner(matches[2])"]},{"cell_type":"code","execution_count":233,"metadata":{},"outputs":[{"data":{"text/plain":["5"]},"execution_count":233,"metadata":{},"output_type":"execute_result"}],"source":["len(matches)\n","type(matches)\n","matches[0]['score1']"]},{"cell_type":"code","execution_count":234,"metadata":{},"outputs":[],"source":["# Run this cell without changes\n","assert find_winner(matches[0]) == \"Russia\"\n","assert find_winner(matches[1]) == \"Uruguay\"\n","assert find_winner(matches[2]) is None"]},{"cell_type":"markdown","metadata":{},"source":["Now that we have this helper function, loop over every match in `matches`, find the winner, and add 1 to the associated count of wins in `combined_data`. 
If the winner is `None`, skip adding it to the dictionary."]},{"cell_type":"code","execution_count":235,"metadata":{},"outputs":[{"data":{"text/plain":["'Russia'"]},"execution_count":235,"metadata":{},"output_type":"execute_result"}],"source":["combined_data['Argentina']['wins']+1\n","find_winner(matches[0])"]},{"cell_type":"code","execution_count":236,"metadata":{},"outputs":[{"data":{"text/plain":["{'Argentina': {'wins': 1},\n"," 'Australia': {'wins': 1},\n"," 'Belgium': {'wins': 6},\n"," 'Brazil': {'wins': 3},\n"," 'Colombia': {'wins': 2},\n"," 'Costa Rica': {'wins': 1},\n"," 'Croatia': {'wins': 4},\n"," 'Denmark': {'wins': 2},\n"," 'Egypt': {'wins': 0},\n"," 'England': {'wins': 5},\n"," 'France': {'wins': 7},\n"," 'Germany': {'wins': 1},\n"," 'Iceland': {'wins': 1},\n"," 'Iran': {'wins': 1},\n"," 'Japan': {'wins': 1},\n"," 'Mexico': {'wins': 2},\n"," 'Morocco': {'wins': 1},\n"," 'Nigeria': {'wins': 1},\n"," 'Panama': {'wins': 0},\n"," 'Peru': {'wins': 1},\n"," 'Poland': {'wins': 1},\n"," 'Portugal': {'wins': 2},\n"," 'Russia': {'wins': 3},\n"," 'Saudi Arabia': {'wins': 1},\n"," 'Senegal': {'wins': 2},\n"," 'Serbia': {'wins': 1},\n"," 'South Korea': {'wins': 1},\n"," 'Spain': {'wins': 2},\n"," 'Sweden': {'wins': 3},\n"," 'Switzerland': {'wins': 2},\n"," 'Tunisia': {'wins': 1},\n"," 'Uruguay': {'wins': 4}}"]},"execution_count":236,"metadata":{},"output_type":"execute_result"}],"source":["# Replace None with appropriate code\n","\n","for match in matches:\n"," # Get the name of the winner\n"," winner = find_winner(match)\n"," # Only proceed to the next step if there was\n"," # a winner\n"," if winner:\n"," # Add 1 to the associated count of wins\n"," combined_data[winner]['wins'] +=1\n","\n","# Visually inspect the output to ensure the wins are\n","# different for different countries\n","combined_data"]},{"cell_type":"markdown","metadata":{},"source":["### Analysis of Wins\n","\n","While we could try to understand all 32 of those numbers just by scanning through 
them, let's use some descriptive statistics and data visualizations instead.\n","\n","#### Statistical Summary of Wins\n","\n","The code below calculates the mean, median, and standard deviation of the number of wins. If it doesn't work, that is an indication that something went wrong with the creation of the `combined_data` variable, and you might want to look at the solution branch and fix your code before proceeding."]},{"cell_type":"code","execution_count":237,"metadata":{},"outputs":[{"name":"stdout","output_type":"stream","text":["Mean number of wins: 2.0\n","Median number of wins: 1.0\n","Standard deviation of number of wins: 1.620185174601965\n"]}],"source":["# Run this cell without changes\n","import numpy as np\n","\n","wins = [val[\"wins\"] for val in combined_data.values()]\n","\n","print(\"Mean number of wins:\", np.mean(wins))\n","print(\"Median number of wins:\", np.median(wins))\n","print(\"Standard deviation of number of wins:\", np.std(wins))"]},{"cell_type":"markdown","metadata":{},"source":["#### Visualizations of Wins\n","\n","In addition to those numbers, let's make a histogram (showing the distribution of the number of wins) and a bar graph (showing the number of wins by country)."]},{"cell_type":"code","execution_count":238,"metadata":{},"outputs":[{"data":{"image/png":"","text/plain":[""]},"metadata":{"needs_background":"light"},"output_type":"display_data"}],"source":["# Run this cell without changes\n","import matplotlib.pyplot as plt\n","\n","# Set up figure and axes\n","fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(12, 7))\n","fig.set_tight_layout(True)\n","\n","# Histogram of Wins and Frequencies\n","ax1.hist(x=wins, bins=range(8), align=\"left\", color=\"green\")\n","ax1.set_xticks(range(7))\n","ax1.set_xlabel(\"Wins in 2018 World Cup\")\n","ax1.set_ylabel(\"Frequency\")\n","ax1.set_title(\"Distribution of Wins\")\n","\n","# Horizontal Bar Graph of Wins by Country\n","ax2.barh(teams[::-1], wins[::-1], 
color=\"green\")\n","ax2.set_xlabel(\"Wins in 2018 World Cup\")\n","ax2.set_title(\"Wins by Country\");"]},{"cell_type":"markdown","metadata":{},"source":["#### Interpretation of Win Analysis\n","\n","Before we move to looking at the relationship between wins and population, it's useful to understand the distribution of wins alone. A few notes of interpretation:\n","\n","* The number of wins is skewed and looks like a [negative binomial distribution](https://en.wikipedia.org/wiki/Negative_binomial_distribution), which makes sense conceptually\n","* The \"typical\" value here is 1 (both the median and the highest point of the histogram), meaning a typical team that qualifies for the World Cup wins once\n","* There are a few teams we might consider outliers: Belgium and France, with 6x the wins of the \"typical\" team and 1.5x the wins of the next \"runner-up\" (Uruguay, with 4 wins)\n","* This is a fairly small dataset, something that becomes more noticeable with such a \"spiky\" (not smooth) histogram\n"]},{"cell_type":"markdown","metadata":{},"source":["## 3. 
Associating Countries with 2018 Population\n","\n","> Add to the existing data structure so that it also connects each country name to its 2018 population, and create visualizations comparable to those from step 2.\n","\n","Now we're ready to add the 2018 population to `combined_data`, finally using the CSV file.\n","\n","Recall that `combined_data` currently looks something like this:\n","```\n","{\n"," 'Argentina': { 'wins': 1 },\n"," ...\n"," 'Uruguay': { 'wins': 4 }\n","}\n","```\n","\n","And the goal is for it to look something like this:\n","```\n","{\n"," 'Argentina': { 'wins': 1, 'population': 44494502 },\n"," ...\n"," 'Uruguay': { 'wins': 4, 'population': 3449299 }\n","}\n","```\n","\n","To do that, we need to extract the 2018 population information from the CSV data.\n","\n","### Exploring the Structure of the Population Data CSV\n","\n","Recall that previously we loaded information from a CSV containing population data into a list of dictionaries called `population_data`."]},{"cell_type":"code","execution_count":239,"metadata":{},"outputs":[{"data":{"text/plain":["12695"]},"execution_count":239,"metadata":{},"output_type":"execute_result"}],"source":["# Run this cell without changes\n","len(population_data)"]},{"cell_type":"markdown","metadata":{},"source":["12,695 is a very large number of rows to print out, so let's look at some samples instead."]},{"cell_type":"code","execution_count":240,"metadata":{},"outputs":[{"data":{"text/plain":["array([{'': '9984', 'Country Name': 'Malta', 'Country Code': 'MLT', 'Year': '1983', 'Value': '330524'},\n"," {'': '3574', 'Country Name': 'Bahrain', 'Country Code': 'BHR', 'Year': '1994', 'Value': '549583'},\n"," {'': '8104', 'Country Name': 'Iran, Islamic Rep.', 'Country Code': 'IRN', 'Year': '1988', 'Value': '53077313'},\n"," {'': '7905', 'Country Name': 'Iceland', 'Country Code': 'ISL', 'Year': '1966', 'Value': '195570'},\n"," {'': '14678', 'Country Name': 'United Arab Emirates', 'Country Code': 'ARE', 'Year': 
'1966', 'Value': '159976'},\n"," {'': '13998', 'Country Name': 'Thailand', 'Country Code': 'THA', 'Year': '1994', 'Value': '58875269'},\n"," {'': '8448', 'Country Name': 'Jamaica', 'Country Code': 'JAM', 'Year': '1978', 'Value': '2105907'},\n"," {'': '8979', 'Country Name': 'Kuwait', 'Country Code': 'KWT', 'Year': '1978', 'Value': '1224067'},\n"," {'': '3180', 'Country Name': 'Argentina', 'Country Code': 'ARG', 'Year': '2013', 'Value': '42202935'},\n"," {'': '7140', 'Country Name': 'Gibraltar', 'Country Code': 'GIB', 'Year': '1968', 'Value': '27685'}],\n"," dtype=object)"]},"execution_count":240,"metadata":{},"output_type":"execute_result"}],"source":["# Run this cell without changes\n","np.random.seed(42)\n","population_record_samples = np.random.choice(population_data, size=10)\n","population_record_samples"]},{"cell_type":"markdown","metadata":{},"source":["There are **2 filtering tasks**, **1 data normalization task**, and **1 type conversion task** to be completed, based on what we can see in this sample. We'll walk through each of them below.\n","\n","(In a more realistic data cleaning environment, you most likely won't happen to get a sample that demonstrates all of the data cleaning steps needed, but this sample was chosen carefully for example purposes.)\n","\n","### Filtering Population Data\n","\n","We already should have suspected that this dataset would require some filtering, since there are 32 records in our current `combined_data` dataset and 12,695 records in `population_data`. Now that we have looked at this sample, we can identify 2 features we'll want to use in order to filter down the `population_data` records to just 32. Try to identify them before looking at the answer below.\n","\n",".\n","\n",".\n","\n",".\n","\n","*Answer: the two features to filter on are* ***`'Country Name'`*** *and* ***`'Year'`***. *We can see from the sample above that there are countries in `population_data` that are not present in `combined_data` (e.g. 
Malta) and there are years present that are not 2018.*\n","\n","In the cell below, create a new variable `population_data_filtered` that only includes relevant records from `population_data`. Relevant records are records where the country name is one of the countries in the `teams` list, and the year is \"2018\".\n","\n","(It's okay to leave 2018 as a string since we are not performing any math operations on it, just make sure you check for `\"2018\"` and not `2018`.)"]},{"cell_type":"code","execution_count":241,"metadata":{},"outputs":[{"data":{"text/plain":["'1960'"]},"execution_count":241,"metadata":{},"output_type":"execute_result"}],"source":["type(population_data)\n","#population_data[0]['Country Name']\n","population_data[0]['Year']"]},{"cell_type":"code","execution_count":242,"metadata":{},"outputs":[{"data":{"text/plain":["27"]},"execution_count":242,"metadata":{},"output_type":"execute_result"}],"source":["# Replace None with appropriate code\n","\n","population_data_filtered = []\n","\n","for record in population_data:\n"," # Add record to population_data_filtered if relevant\n"," if record['Country Name'] in teams and record['Year'] =='2018':\n"," population_data_filtered.append(record)\n","\n","len(population_data_filtered) # 27\n","#record['Year']"]},{"cell_type":"markdown","metadata":{},"source":["Hmm...what went wrong? 
Why do we only have 27 records, and not 32?\n","\n","Did we really get a dataset with 12k records that's missing 5 of the data points we need?\n","\n","Let's take a closer look at the population data samples again, specifically the third one:"]},{"cell_type":"code","execution_count":243,"metadata":{},"outputs":[{"data":{"text/plain":["{'': '8104',\n"," 'Country Name': 'Iran, Islamic Rep.',\n"," 'Country Code': 'IRN',\n"," 'Year': '1988',\n"," 'Value': '53077313'}"]},"execution_count":243,"metadata":{},"output_type":"execute_result"}],"source":["# Run this cell without changes\n","population_record_samples[2]"]},{"cell_type":"markdown","metadata":{},"source":["And compare that with the value for Iran in `teams`:"]},{"cell_type":"code","execution_count":244,"metadata":{},"outputs":[{"data":{"text/plain":["'Iran'"]},"execution_count":244,"metadata":{},"output_type":"execute_result"}],"source":["# Run this cell without changes\n","teams[13]"]},{"cell_type":"markdown","metadata":{},"source":["Ohhhh...we have a data normalization issue. One dataset refers to this country as `'Iran, Islamic Rep.'`, while the other refers to it as `'Iran'`. 
This is a common issue we face when using data about countries and regions, where there is no universally-accepted naming convention.\n","\n","### Normalizing Locations in Population Data\n","\n","Sometimes data normalization can be a very, very time-consuming task where you need to find \"crosswalk\" data that can link the two formats together, or you need to write advanced regex formulas to line everything up.\n","\n","For this task, there are only 5 missing, so we'll just go ahead and give you a function that makes the appropriate substitutions."]},{"cell_type":"code","execution_count":245,"metadata":{},"outputs":[{"name":"stdout","output_type":"stream","text":["Russia\n","Argentina\n"]}],"source":["# Run this cell without changes\n","def normalize_location(country_name):\n"," \"\"\"\n"," Given a country name, return the name that the\n"," country uses when playing in the FIFA World Cup\n"," \"\"\"\n"," name_sub_dict = {\n"," \"Russian Federation\": \"Russia\",\n"," \"Egypt, Arab Rep.\": \"Egypt\",\n"," \"Iran, Islamic Rep.\": \"Iran\",\n"," \"Korea, Rep.\": \"South Korea\",\n"," \"United Kingdom\": \"England\",\n"," }\n"," # The .get method returns the corresponding value from\n"," # the dict if present, otherwise returns country_name\n"," return name_sub_dict.get(country_name, country_name)\n","\n","\n","# Example where normalized location is different\n","print(normalize_location(\"Russian Federation\"))\n","# Example where normalized location is the same\n","print(normalize_location(\"Argentina\"))"]},{"cell_type":"markdown","metadata":{},"source":["Now, write new code to create `population_data_filtered` with normalized country names."]},{"cell_type":"code","execution_count":246,"metadata":{},"outputs":[{"data":{"text/plain":["32"]},"execution_count":246,"metadata":{},"output_type":"execute_result"}],"source":["# Replace None with appropriate code\n","\n","population_data_filtered = []\n","\n","for record in population_data:\n"," # Get normalized country 
name\n"," record['Country Name'] = normalize_location(record['Country Name'])\n"," # Add record to population_data_filtered if relevant\n"," if record['Country Name'] in teams and record['Year'] =='2018':\n"," \n"," # Replace the country name in the record\n"," None\n"," # Append to list\n"," population_data_filtered.append(record) #None\n","\n","len(population_data_filtered) # 32"]},{"cell_type":"markdown","metadata":{},"source":["Great, now we should have 32 records instead of 27.\n","\n","### Type Conversion of Population Data\n","\n","We need to do one more thing before we'll have population data that is usable for analysis. Take a look at this record from `population_data_filtered` to see if you can spot it:"]},{"cell_type":"code","execution_count":247,"metadata":{},"outputs":[{"data":{"text/plain":["{'': '3185',\n"," 'Country Name': 'Argentina',\n"," 'Country Code': 'ARG',\n"," 'Year': '2018',\n"," 'Value': '44494502'}"]},"execution_count":247,"metadata":{},"output_type":"execute_result"}],"source":["# Run this cell without changes\n","population_data_filtered[0]"]},{"cell_type":"code","execution_count":248,"metadata":{},"outputs":[{"data":{"text/plain":["[{'': '3185',\n"," 'Country Name': 'Argentina',\n"," 'Country Code': 'ARG',\n"," 'Year': '2018',\n"," 'Value': '44494502'},\n"," {'': '3362',\n"," 'Country Name': 'Australia',\n"," 'Country Code': 'AUS',\n"," 'Year': '2018',\n"," 'Value': '24982688'},\n"," {'': '3834',\n"," 'Country Name': 'Belgium',\n"," 'Country Code': 'BEL',\n"," 'Year': '2018',\n"," 'Value': '11433256'},\n"," {'': '4306',\n"," 'Country Name': 'Brazil',\n"," 'Country Code': 'BRA',\n"," 'Year': '2018',\n"," 'Value': '209469333'},\n"," {'': '5250',\n"," 'Country Name': 'Colombia',\n"," 'Country Code': 'COL',\n"," 'Year': '2018',\n"," 'Value': '49648685'},\n"," {'': '5486',\n"," 'Country Name': 'Costa Rica',\n"," 'Country Code': 'CRI',\n"," 'Year': '2018',\n"," 'Value': '4999441'},\n"," {'': '5604',\n"," 'Country Name': 'Croatia',\n"," 
'Country Code': 'HRV',\n"," 'Year': '2018',\n"," 'Value': '4087843'},\n"," {'': '5899',\n"," 'Country Name': 'Denmark',\n"," 'Country Code': 'DNK',\n"," 'Year': '2018',\n"," 'Value': '5793636'},\n"," {'': '6194',\n"," 'Country Name': 'Egypt',\n"," 'Country Code': 'EGY',\n"," 'Year': '2018',\n"," 'Value': '98423595'},\n"," {'': '6777',\n"," 'Country Name': 'France',\n"," 'Country Code': 'FRA',\n"," 'Year': '2018',\n"," 'Value': '66977107'},\n"," {'': '7072',\n"," 'Country Name': 'Germany',\n"," 'Country Code': 'DEU',\n"," 'Year': '2018',\n"," 'Value': '82905782'},\n"," {'': '7957',\n"," 'Country Name': 'Iceland',\n"," 'Country Code': 'ISL',\n"," 'Year': '2018',\n"," 'Value': '352721'},\n"," {'': '8134',\n"," 'Country Name': 'Iran',\n"," 'Country Code': 'IRN',\n"," 'Year': '2018',\n"," 'Value': '81800269'},\n"," {'': '8547',\n"," 'Country Name': 'Japan',\n"," 'Country Code': 'JPN',\n"," 'Year': '2018',\n"," 'Value': '126529100'},\n"," {'': '8901',\n"," 'Country Name': 'South Korea',\n"," 'Country Code': 'KOR',\n"," 'Year': '2018',\n"," 'Value': '51606633'},\n"," {'': '10255',\n"," 'Country Name': 'Mexico',\n"," 'Country Code': 'MEX',\n"," 'Year': '2018',\n"," 'Value': '126190788'},\n"," {'': '10609',\n"," 'Country Name': 'Morocco',\n"," 'Country Code': 'MAR',\n"," 'Year': '2018',\n"," 'Value': '36029138'},\n"," {'': '11258',\n"," 'Country Name': 'Nigeria',\n"," 'Country Code': 'NGA',\n"," 'Year': '2018',\n"," 'Value': '195874740'},\n"," {'': '11671',\n"," 'Country Name': 'Panama',\n"," 'Country Code': 'PAN',\n"," 'Year': '2018',\n"," 'Value': '4176873'},\n"," {'': '11848',\n"," 'Country Name': 'Peru',\n"," 'Country Code': 'PER',\n"," 'Year': '2018',\n"," 'Value': '31989256'},\n"," {'': '11966',\n"," 'Country Name': 'Poland',\n"," 'Country Code': 'POL',\n"," 'Year': '2018',\n"," 'Value': '37974750'},\n"," {'': '12025',\n"," 'Country Name': 'Portugal',\n"," 'Country Code': 'PRT',\n"," 'Year': '2018',\n"," 'Value': '10283822'},\n"," {'': '12261',\n"," 'Country Name': 
'Russia',\n"," 'Country Code': 'RUS',\n"," 'Year': '2018',\n"," 'Value': '144478050'},\n"," {'': '12556',\n"," 'Country Name': 'Saudi Arabia',\n"," 'Country Code': 'SAU',\n"," 'Year': '2018',\n"," 'Value': '33699947'},\n"," {'': '12615',\n"," 'Country Name': 'Senegal',\n"," 'Country Code': 'SEN',\n"," 'Year': '2018',\n"," 'Value': '15854360'},\n"," {'': '12644',\n"," 'Country Name': 'Serbia',\n"," 'Country Code': 'SRB',\n"," 'Year': '2018',\n"," 'Value': '6982604'},\n"," {'': '13255',\n"," 'Country Name': 'Spain',\n"," 'Country Code': 'ESP',\n"," 'Year': '2018',\n"," 'Value': '46796540'},\n"," {'': '13727',\n"," 'Country Name': 'Sweden',\n"," 'Country Code': 'SWE',\n"," 'Year': '2018',\n"," 'Value': '10175214'},\n"," {'': '13786',\n"," 'Country Name': 'Switzerland',\n"," 'Country Code': 'CHE',\n"," 'Year': '2018',\n"," 'Value': '8513227'},\n"," {'': '14317',\n"," 'Country Name': 'Tunisia',\n"," 'Country Code': 'TUN',\n"," 'Year': '2018',\n"," 'Value': '11565204'},\n"," {'': '14789',\n"," 'Country Name': 'England',\n"," 'Country Code': 'GBR',\n"," 'Year': '2018',\n"," 'Value': '66460344'},\n"," {'': '14907',\n"," 'Country Name': 'Uruguay',\n"," 'Country Code': 'URY',\n"," 'Year': '2018',\n"," 'Value': '3449299'}]"]},"execution_count":248,"metadata":{},"output_type":"execute_result"}],"source":["population_data_filtered"]},{"cell_type":"markdown","metadata":{},"source":["Every value has the same data type (`str`), including the population value. 
In this example, it's `'44494502'`, when it needs to be `44494502` if we want to be able to compute statistics with it.\n","\n","In the cell below, loop over `population_data_filtered` and convert the data type of the value associated with the `\"Value\"` key from a string to an integer, using the built-in `int()` function."]},{"cell_type":"code","execution_count":249,"metadata":{},"outputs":[{"data":{"text/plain":["{'': '14907',\n"," 'Country Name': 'Uruguay',\n"," 'Country Code': 'URY',\n"," 'Year': '2018',\n"," 'Value': 3449299}"]},"execution_count":249,"metadata":{},"output_type":"execute_result"}],"source":["# Replace None with appropriate code\n","for record in population_data_filtered:\n"," # Convert the population value from str to int\n"," record['Value'] = int(record['Value'])\n","\n","# Look at the last record to make sure the population\n","# value is an int\n","population_data_filtered[-1]"]},{"cell_type":"markdown","metadata":{},"source":["Check that it worked with the assert statement below:"]},{"cell_type":"code","execution_count":250,"metadata":{},"outputs":[],"source":["# Run this cell without changes\n","assert type(population_data_filtered[-1][\"Value\"]) == int"]},{"cell_type":"markdown","metadata":{},"source":["### Adding Population Data\n","\n","Now it's time to add the population data to `combined_data`. 
Recall that the data structure currently looks like this:"]},{"cell_type":"code","execution_count":251,"metadata":{},"outputs":[{"data":{"text/plain":["{'Argentina': {'wins': 1},\n"," 'Australia': {'wins': 1},\n"," 'Belgium': {'wins': 6},\n"," 'Brazil': {'wins': 3},\n"," 'Colombia': {'wins': 2},\n"," 'Costa Rica': {'wins': 1},\n"," 'Croatia': {'wins': 4},\n"," 'Denmark': {'wins': 2},\n"," 'Egypt': {'wins': 0},\n"," 'England': {'wins': 5},\n"," 'France': {'wins': 7},\n"," 'Germany': {'wins': 1},\n"," 'Iceland': {'wins': 1},\n"," 'Iran': {'wins': 1},\n"," 'Japan': {'wins': 1},\n"," 'Mexico': {'wins': 2},\n"," 'Morocco': {'wins': 1},\n"," 'Nigeria': {'wins': 1},\n"," 'Panama': {'wins': 0},\n"," 'Peru': {'wins': 1},\n"," 'Poland': {'wins': 1},\n"," 'Portugal': {'wins': 2},\n"," 'Russia': {'wins': 3},\n"," 'Saudi Arabia': {'wins': 1},\n"," 'Senegal': {'wins': 2},\n"," 'Serbia': {'wins': 1},\n"," 'South Korea': {'wins': 1},\n"," 'Spain': {'wins': 2},\n"," 'Sweden': {'wins': 3},\n"," 'Switzerland': {'wins': 2},\n"," 'Tunisia': {'wins': 1},\n"," 'Uruguay': {'wins': 4}}"]},"execution_count":251,"metadata":{},"output_type":"execute_result"}],"source":["# Run this cell without changes\n","combined_data"]},{"cell_type":"markdown","metadata":{},"source":["The goal is for it to be structured like this:\n","```\n","{\n"," 'Argentina': { 'wins': 1, 'population': 44494502 },\n"," ...\n"," 'Uruguay': { 'wins': 4, 'population': 3449299 }\n","}\n","```"]},{"cell_type":"markdown","metadata":{},"source":["In the cell below, loop over `population_data_filtered` and add information about population to each country in `combined_data`:"]},{"cell_type":"code","execution_count":274,"metadata":{},"outputs":[{"data":{"text/plain":["{'Argentina': {'wins': 1, 'population': 44494502},\n"," 'Australia': {'wins': 1, 'population': 24982688},\n"," 'Belgium': {'wins': 6, 'population': 11433256},\n"," 'Brazil': {'wins': 3, 'population': 209469333},\n"," 'Colombia': {'wins': 2, 'population': 
49648685},\n"," 'Costa Rica': {'wins': 1, 'population': 4999441},\n"," 'Croatia': {'wins': 4, 'population': 4087843},\n"," 'Denmark': {'wins': 2, 'population': 5793636},\n"," 'Egypt': {'wins': 0, 'population': 98423595},\n"," 'England': {'wins': 5, 'population': 66460344},\n"," 'France': {'wins': 7, 'population': 66977107},\n"," 'Germany': {'wins': 1, 'population': 82905782},\n"," 'Iceland': {'wins': 1, 'population': 352721},\n"," 'Iran': {'wins': 1, 'population': 81800269},\n"," 'Japan': {'wins': 1, 'population': 126529100},\n"," 'Mexico': {'wins': 2, 'population': 126190788},\n"," 'Morocco': {'wins': 1, 'population': 36029138},\n"," 'Nigeria': {'wins': 1, 'population': 195874740},\n"," 'Panama': {'wins': 0, 'population': 4176873},\n"," 'Peru': {'wins': 1, 'population': 31989256},\n"," 'Poland': {'wins': 1, 'population': 37974750},\n"," 'Portugal': {'wins': 2, 'population': 10283822},\n"," 'Russia': {'wins': 3, 'population': 144478050},\n"," 'Saudi Arabia': {'wins': 1, 'population': 33699947},\n"," 'Senegal': {'wins': 2, 'population': 15854360},\n"," 'Serbia': {'wins': 1, 'population': 6982604},\n"," 'South Korea': {'wins': 1, 'population': 51606633},\n"," 'Spain': {'wins': 2, 'population': 46796540},\n"," 'Sweden': {'wins': 3, 'population': 10175214},\n"," 'Switzerland': {'wins': 2, 'population': 8513227},\n"," 'Tunisia': {'wins': 1, 'population': 11565204},\n"," 'Uruguay': {'wins': 4, 'population': 3449299},\n"," 'population': 3449299}"]},"execution_count":274,"metadata":{},"output_type":"execute_result"}],"source":["# Replace None with appropriate code\n","for record in population_data_filtered:\n"," # Extract the country name from the record\n"," country = record['Country Name'] #None\n"," # Extract the population value from the record\n"," population = record['Value'] #None\n"," # Add this information to combined_data\n"," combined_data[country].update({'population': population}) #None\n","\n","# Look 
combined_data\n","combined_data"]},{"cell_type":"markdown","metadata":{},"source":["Check that the types are correct with these assert statements:"]},{"cell_type":"code","execution_count":263,"metadata":{},"outputs":[],"source":["# Run this cell without changes\n","assert type(combined_data[\"Uruguay\"]) == dict\n","assert type(combined_data[\"Uruguay\"][\"population\"]) == int"]},{"cell_type":"code","execution_count":275,"metadata":{},"outputs":[{"data":{"text/plain":["dict_keys(['wins', 'population'])"]},"execution_count":275,"metadata":{},"output_type":"execute_result"}],"source":["combined_data['Argentina'].keys()"]},{"cell_type":"code","execution_count":273,"metadata":{},"outputs":[{"data":{"text/plain":["'population'"]},"execution_count":273,"metadata":{},"output_type":"execute_result"}],"source":["for val in combined_data:#.values():\n"," countryd =val\n","countryd"]},{"cell_type":"code","execution_count":278,"metadata":{},"outputs":[{"data":{"text/plain":["('population', 3449299)"]},"execution_count":278,"metadata":{},"output_type":"execute_result"}],"source":["#combined_data.popitem()"]},{"cell_type":"markdown","metadata":{},"source":["### Analysis of Population\n","\n","Let's perform the same analysis for population that we performed for count of wins.\n","\n","#### Statistical Analysis of Population"]},{"cell_type":"code","execution_count":279,"metadata":{},"outputs":[{"name":"stdout","output_type":"stream","text":["Mean population: 51687460.84375\n","Median population: 34864542.5\n","Standard deviation of population: 55195121.60871871\n"]}],"source":["# Run this cell without changes\n","populations = [val[\"population\"] for val in combined_data.values()]\n","\n","print(\"Mean population:\", np.mean(populations))\n","print(\"Median population:\", np.median(populations))\n","print(\"Standard deviation of population:\", np.std(populations))"]},{"cell_type":"markdown","metadata":{},"source":["#### Visualizations of 
Population"]},{"cell_type":"code","execution_count":280,"metadata":{},"outputs":[{"data":{"image/png":"","text/plain":[""]},"metadata":{"needs_background":"light"},"output_type":"display_data"}],"source":["# Run this cell without changes\n","\n","# Set up figure and axes\n","fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(12, 7))\n","fig.set_tight_layout(True)\n","\n","# Histogram of Populations and Frequencies\n","ax1.hist(x=populations, color=\"blue\")\n","ax1.set_xlabel(\"2018 Population\")\n","ax1.set_ylabel(\"Frequency\")\n","ax1.set_title(\"Distribution of Population\")\n","\n","# Horizontal Bar Graph of Population by Country\n","ax2.barh(teams[::-1], populations[::-1], color=\"blue\")\n","ax2.set_xlabel(\"2018 Population\")\n","ax2.set_title(\"Population by Country\");"]},{"cell_type":"markdown","metadata":{},"source":["#### Interpretation of Population Analysis\n","\n","* Similar to the distribution of the number of wins, the distribution of population is skewed.\n","* It's hard to choose a single \"typical\" value here because there is so much variation.\n","* The countries with the largest populations (Brazil, Nigeria, and Russia) do not overlap with the countries with the most wins (Belgium, France, and Uruguay)"]},{"cell_type":"markdown","metadata":{},"source":["## 4. Analysis of Population vs. Performance\n","\n","> Choose an appropriate statistical measure to analyze the relationship between population and performance, and create a visualization representing this relationship.\n","\n","### Statistical Measure\n","So far we have learned about only two statistics for understanding the *relationship* between variables: **covariance** and **correlation**. 
We will use correlation here, because that provides a more standardized, interpretable metric."]},{"cell_type":"code","execution_count":281,"metadata":{},"outputs":[{"data":{"text/plain":["-0.006217143149754775"]},"execution_count":281,"metadata":{},"output_type":"execute_result"}],"source":["# Run this cell without changes\n","np.corrcoef(wins, populations)[0][1]"]},{"cell_type":"markdown","metadata":{},"source":["In the cell below, interpret this number. What direction is this correlation? Is it strong or weak?"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["# Replace None with appropriate text\n","\"\"\"\n","A country's population and performance are negatively correlated, but only very weakly (r is about -0.006), so there is effectively no linear relationship\n","\"\"\""]},{"cell_type":"markdown","metadata":{},"source":["### Data Visualization\n","\n","A **scatter plot** is the most sensible form of data visualization for showing this relationship, because we have two dimensions of data, but there is no \"increasing\" variable (e.g. time) that would indicate we should use a line graph."]},{"cell_type":"code","execution_count":282,"metadata":{},"outputs":[{"data":{"image/png":"","text/plain":[""]},"metadata":{"needs_background":"light"},"output_type":"display_data"}],"source":["# Run this cell without changes\n","\n","# Set up figure\n","fig, ax = plt.subplots(figsize=(8, 5))\n","\n","# Basic scatter plot\n","ax.scatter(x=populations, y=wins, color=\"gray\", alpha=0.5, s=100)\n","ax.set_xlabel(\"2018 Population\")\n","ax.set_ylabel(\"2018 World Cup Wins\")\n","ax.set_title(\"Population vs. 
World Cup Wins\")\n","\n","# Add annotations for specific points of interest\n","highlighted_points = {\n"," \"Belgium\": 2, # Numbers are the index of that\n"," \"Brazil\": 3, # country in populations & wins\n"," \"France\": 10,\n"," \"Nigeria\": 17,\n","}\n","for country, index in highlighted_points.items():\n"," # Get x and y position of data point\n"," x = populations[index]\n"," y = wins[index]\n"," # Place each label slightly down and to the left of its point\n"," # (numbers were chosen by manually tweaking)\n"," xtext = x - (1.25e6 * len(country))\n"," ytext = y - 0.5\n"," # Annotate with relevant arguments\n"," ax.annotate(text=country, xy=(x, y), xytext=(xtext, ytext))"]},{"cell_type":"markdown","metadata":{},"source":["### Data Visualization Interpretation\n","\n","Interpret this plot in the cell below. Does this align with the findings from the statistical measure (correlation), as well as the map shown at the beginning of this lab (showing the best results by country)?"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["# Replace None with appropriate text\n","\"\"\"\n","yes #None\n","\"\"\""]},{"cell_type":"markdown","metadata":{},"source":["### Final Analysis\n","\n","> What is the relationship between the population of a country and their performance in the 2018 FIFA World Cup?\n","\n","Overall, we found essentially no linear relationship between the population of a country and their performance in the 2018 FIFA World Cup (the correlation between populations and wins is very close to zero), as demonstrated by both the correlation coefficient and the scatter plot.\n","\n","In the cell below, write down your thoughts on these questions:\n","\n"," - What are your thoughts on why you may see this result?\n"," - What would you research next?"]},{"cell_type":"code","execution_count":null,"metadata":{},"outputs":[],"source":["# Replace None with appropriate text\n","\"\"\"\n","None\n","\"\"\""]},{"cell_type":"markdown","metadata":{},"source":["## Summary\n","\n","That was a long lab, 
pulling together a lot of material. You read data into Python, extracted the relevant information, cleaned the data, and combined the data into a new format to be used in analysis. While we will continue to introduce new tools and techniques, these essential steps will be present for the rest of your data science projects from here on out."]}],"metadata":{"kernelspec":{"display_name":"learn-env","language":"python","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.8.5"}},"nbformat":4,"nbformat_minor":4}