Add tutorial for country mentions over time

linusha · linusha · commit 0afe66a1520f · 2025-04-30T21:31:31.000+02:00
diff --git a/.gitignore b/.gitignore
@@ -114,3 +114,6 @@ dmypy.json
 # jupyterlite
 *.doit.db
 _output
+
+# Custom
+content/data/*.csv
diff --git a/content/country_trends.ipynb b/content/country_trends.ipynb
@@ -0,0 +1,360 @@
+{
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "BO2vOh8UMOmf"
+      },
+      "source": [
+        "# Over the years, which countries have been the focus of parliamentary discussions?\n",
+        "\n",
+        "In this tutorial, we will look at votes that focus on events in, or relations with, specific countries. We won't be concerned with the outcomes of these votes; instead, we focus on their frequency over time to see how current geopolitical topics are reflected in Parliament.\n",
+        "\n",
+        "To follow along with the tutorial, you should already be familiar with data analysis in Python using `pandas`. You don’t need prior knowledge about the European Parliament.\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "PQRf2gM3Iyh_"
+      },
+      "source": [
+        "## Context: The European Union Vocabulary\n",
+        "\n",
+        "The institutions of the European Union maintain [common vocabularies](https://op.europa.eu/en/web/eu-vocabularies) for many areas. One of these is the [Corporate list of countries and territories](https://op.europa.eu/en/web/eu-vocabularies/countries-and-territories).\n",
+        "This vocabulary provides a list of countries acknowledged by the European Union as well as agreed-upon terminology for other relevant territories.\n",
+        "\n",
+        "Because the European Parliament provides relevant terms from this vocabulary for all applicable votes, we can rely heavily on it for the task at hand!"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "ljA_CkomH1qx"
+      },
+      "source": [
+        "## Retrieving the Votes\n",
+        "\n",
+        "We begin by loading the `votes` table from the HowTheyVote.eu data set. This gives us a list of all roll-call votes in Parliament's plenary."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "trusted": true
+      },
+      "outputs": [],
+      "source": [
+        "import pandas as pd\n",
+        "import matplotlib.pyplot as plt"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "da7W2nn-MLX3",
+        "trusted": true
+      },
+      "outputs": [],
+      "source": [
+        "votes_df = pd.read_csv('data/votes.csv')\n",
+        "votes_df.head()"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "UjDAJvApMrLx"
+      },
+      "source": [
+        "For the purposes of this tutorial, we are only concerned with the overall topic of texts (also called reports in Parliament). Therefore, we can safely filter out all votes on amendments from the `votes` table. For more information on amendments, take a look at [one of our other tutorials](https://howtheyvote.github.io/tutorials/lab/index.html?path=close_amendment_votes.ipynb)."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "lpF8_Ig0MYOd",
+        "trusted": true
+      },
+      "outputs": [],
+      "source": [
+        "reports = votes_df[votes_df[\"is_main\"] == True].copy()\n",
+        "reports.head()"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "_gEWmkO4OzZ6"
+      },
+      "source": [
+        "This essentially leaves us with a list of all final texts that Parliament held a roll-call vote on."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "## Finding Countries Related to Votes\n",
+        "\n",
+        "One way to identify which countries each text is concerned with would be to search the `procedure_title` column for country names. However, extracting information from unstructured text is error-prone, and this approach comes with additional hurdles: for example, we would need to bring our own list of countries and regions.\n",
+        "\n",
+        "Luckily, by relying on the *Countries and territories vocabulary* mentioned above, the HowTheyVote.eu data set provides us with a significantly better approach through the `geo_area_votes` table. It lists the `geo_area_code`s for each vote that relates to specific geographic areas:"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "trusted": true
+      },
+      "outputs": [],
+      "source": [
+        "geo_areas_df = pd.read_csv('data/geo_area_votes.csv')\n",
+        "geo_areas_df.head()"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "Since multiple texts can be concerned with the same country, the entries in the `geo_area_code` column are not unique.\n",
+        "We can also quickly confirm that the entries in the `vote_id` column are not unique either:"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "trusted": true
+      },
+      "outputs": [],
+      "source": [
+        "vote_counts = geo_areas_df['vote_id'].value_counts().reset_index()\n",
+        "vote_counts.head(5)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "In other words, many reports can be concerned with the same country, and a single report can be concerned with more than one country.\n",
+        "\n",
+        "We match each report with all of its relevant geo codes. This duplicates reports that relate to more than one geo code, but allows us to easily tally the codes afterwards:"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "scrolled": true,
+        "trusted": true
+      },
+      "outputs": [],
+      "source": [
+        "reports_with_areas = reports.merge(geo_areas_df, left_on=\"id\", right_on=\"vote_id\")\n",
+        "reports_with_areas"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "Now we can, for example, easily construct a list of the five most-mentioned countries:"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "trusted": true
+      },
+      "outputs": [],
+      "source": [
+        "top_countries = reports_with_areas['geo_area_code'].value_counts().head(5).reset_index()\n",
+        "top_countries"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "To make this list easier to understand, we can also retrieve the full names of the geo areas. These are stored in `geo_areas.csv`, whose `code` column matches the `geo_area_code` column in our table, so the two can easily be joined:"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "trusted": true
+      },
+      "outputs": [],
+      "source": [
+        "geo_areas = pd.read_csv(\"data/geo_areas.csv\")\n",
+        "geo_areas.head()"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "trusted": true
+      },
+      "outputs": [],
+      "source": [
+        "top_countries.merge(geo_areas[[\"code\", \"label\"]], left_on=\"geo_area_code\", right_on=\"code\")[[\"label\", \"count\"]]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "## Mentions of Countries per Month\n",
+        "\n",
+        "Because this tutorial is about the number of mentions of specific countries and territories over time, we need to decide on a unit of time for the analysis. Since Parliament usually meets monthly in Strasbourg, we count the mentions per month.\n",
+        "This requires us to convert the timestamp to a proper `datetime` object and then extract its month:"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "reports_with_areas['timestamp'] = pd.to_datetime(reports_with_areas['timestamp'], format='ISO8601')\n",
+        "reports_with_areas['month'] = reports_with_areas['timestamp'].dt.to_period('M')"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "As we are interested in the number of mentions per month **per geo area**, we will use `pivot_table` to effectively group by geo area, counting the relevant reports per month and storing these values in a separate column for each geo area:"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "monthly_reports = pd.pivot_table(\n",
+        "    reports_with_areas,\n",
+        "    index='month',\n",
+        "    columns='geo_area_code',\n",
+        "    values='timestamp',\n",
+        "    aggfunc='count',\n",
+        "    fill_value=0\n",
+        ").reset_index()\n",
+        "monthly_reports.head(8)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "Looking at the output above, we can see that this approach leaves gaps in our dataframe for months in which no geo-coded votes took place. For example, the months from March to May 2020 are missing. The following code ensures that we have a row for every month between the earliest and latest month in the data, filling the columns with 0 for the months we add for the sake of completeness:"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "first_month = reports_with_areas['month'].min()\n",
+        "last_month = reports_with_areas['month'].max()\n",
+        "# One row per month between the earliest and latest month in the data\n",
+        "all_months = pd.DataFrame({'month': pd.period_range(start=first_month, end=last_month, freq='M')})\n",
+        "\n",
+        "monthly_reports = pd.merge(all_months, monthly_reports, on='month', how='left').fillna(0)\n",
+        "monthly_reports = monthly_reports.reset_index(drop=True)\n",
+        "monthly_reports.head(9)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "## Analysis and Visualization\n",
+        "\n",
+        "With this dataframe, we are now ready to analyze patterns and trends over time. For this tutorial, we will focus on two countries, Russia and Ukraine, and plot how often Parliament voted on matters relating to them over time:"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "# Keep only the month column and the columns for Ukraine and Russia\n",
+        "ukr_rus = monthly_reports[[\"month\", \"UKR\", \"RUS\"]].copy()\n",
+        "# Convert the monthly periods back to timestamps so matplotlib can plot them on a date axis\n",
+        "ukr_rus[\"month\"] = ukr_rus[\"month\"].dt.to_timestamp()"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "plt.plot(ukr_rus['month'], ukr_rus['UKR'], label=\"Ukraine\", marker='.')\n",
+        "plt.plot(ukr_rus['month'], ukr_rus['RUS'], label=\"Russia\", marker='.')\n",
+        "\n",
+        "plt.title('EP roll-call votes over time')\n",
+        "plt.xlabel('Month')\n",
+        "plt.ylabel('Counts')\n",
+        "plt.xticks(rotation=45)\n",
+        "plt.legend()\n",
+        "\n",
+        "plt.show()"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "This quick, exploratory visualization already shows some interesting trends that intuitively make sense: for both Russia and Ukraine, we can see a sharp peak and an overall increase in related votes following Russia's large-scale invasion of Ukraine in February 2022. In general, the two countries show many similar movements, which could indicate that many votes relate to both countries at once, for example [resolutions condemning the invasion itself](https://howtheyvote.eu/votes/165965). For most months, we also see more votes related to Ukraine than to Russia.\n",
+        "\n",
+        "## Further Analysis\n",
+        "\n",
+        "This is, of course, only a very rudimentary analysis, but it provides us with an interesting starting point. Further investigations could include:\n",
+        "\n",
+        "- relationships between other countries\n",
+        "- analyzing mentions per continent\n",
+        "- analyzing mentions per type of text voted on (i.e., legislative or non-legislative)\n",
+        "\n",
+        "## Wrapping Up\n",
+        "\n",
+        "In this tutorial, we have seen how we can leverage the `geo_area_votes` table of the HowTheyVote.eu data set to identify which countries the texts that Parliament voted on relate to. The tables can easily be combined by joining the `id` column of `votes` with the `vote_id` column of `geo_area_votes`."
+      ]
+    }
+  ],
+  "metadata": {
+    "colab": {
+      "provenance": [],
+      "toc_visible": true
+    },
+    "kernelspec": {
+      "display_name": "base",
+      "language": "python",
+      "name": "python3"
+    },
+    "language_info": {
+      "codemirror_mode": {
+        "name": "ipython",
+        "version": 3
+      },
+      "file_extension": ".py",
+      "mimetype": "text/x-python",
+      "name": "python",
+      "nbconvert_exporter": "python",
+      "pygments_lexer": "ipython3",
+      "version": "3.12.9"
+    }
+  },
+  "nbformat": 4,
+  "nbformat_minor": 4
+}