Commit 0afe66a

Add tutorial for country mentions over time
1 parent 2964601 commit 0afe66a

2 files changed

+363
-0
lines changed

.gitignore

Lines changed: 3 additions & 0 deletions
@@ -114,3 +114,6 @@ dmypy.json
114114
# jupyterlite
115115
*.doit.db
116116
_output
117+
118+
# Custom
119+
content/data/*.csv

content/country_trends.ipynb

Lines changed: 360 additions & 0 deletions
@@ -0,0 +1,360 @@
1+
{
2+
"cells": [
3+
{
4+
"cell_type": "markdown",
5+
"metadata": {
6+
"id": "BO2vOh8UMOmf"
7+
},
8+
"source": [
9+
"# Over the years, which countries have been the focus of parliamentary discussions?\n",
10+
"\n",
11+
"In this tutorial we will look at votes that focus on events in, or relations with, countries. We won't be concerned with the outcomes of these votes, but focus on their frequency over time - seeing how current, geopolitical topics are reflected in Parliament.\n",
12+
"\n",
13+
"To follow along with the tutorial, you should already be familiar with data analysis in Python using `pandas`. You don’t need prior knowledge about the European Parliament.\n"
14+
]
15+
},
16+
{
17+
"cell_type": "markdown",
18+
"metadata": {
19+
"id": "PQRf2gM3Iyh_"
20+
},
21+
"source": [
22+
"## Context: The European Union Vocabulary\n",
23+
"\n",
24+
"The institutions of the European Union maintain [common vocabularies](https://op.europa.eu/en/web/eu-vocabularies) for many areas. One of these is the [Corporate list of countries and territories](https://op.europa.eu/en/web/eu-vocabularies/countries-and-territories).\n",
25+
"This vocabulary provides a list of countries acknowledged by the European Union as well as agreed-upon terminology for other relevant territories.\n",
26+
"\n",
27+
"Because the European Parliament provides relevant terms from this vocabulary for all applicable votes, we can rely heavily on it for the task at hand!"
28+
]
29+
},
30+
{
31+
"cell_type": "markdown",
32+
"metadata": {
33+
"id": "ljA_CkomH1qx"
34+
},
35+
"source": [
36+
"## Retrieving the Votes\n",
37+
"\n",
38+
"We begin by using the `votes` table from the HowTheyVote.eu data set. This will give us a list of all roll-call votes in the Parliament's plenary."
39+
]
40+
},
41+
{
42+
"cell_type": "code",
43+
"execution_count": null,
44+
"metadata": {
45+
"trusted": true
46+
},
47+
"outputs": [],
48+
"source": [
49+
"import pandas as pd\n",
50+
"import matplotlib.pyplot as plt"
51+
]
52+
},
53+
{
54+
"cell_type": "code",
55+
"execution_count": null,
56+
"metadata": {
57+
"id": "da7W2nn-MLX3",
58+
"trusted": true
59+
},
60+
"outputs": [],
61+
"source": [
62+
"votes_df = pd.read_csv('data/votes.csv')\n",
63+
"votes_df.head()"
64+
]
65+
},
66+
{
67+
"cell_type": "markdown",
68+
"metadata": {
69+
"id": "UjDAJvApMrLx"
70+
},
71+
"source": [
72+
"In this tutorial, we are only concerned with the overall topic of texts (also called reports in the Parliament). Therefore, we can safely filter out all votes on amendments from the `votes` table. For more information on amendments, take a look at [one of our other tutorials](https://howtheyvote.github.io/tutorials/lab/index.html?path=close_amendment_votes.ipynb)."
73+
]
74+
},
75+
{
76+
"cell_type": "code",
77+
"execution_count": null,
78+
"metadata": {
79+
"id": "lpF8_Ig0MYOd",
80+
"trusted": true
81+
},
82+
"outputs": [],
83+
"source": [
84+
"reports = votes_df[votes_df[\"is_main\"] == True].copy()\n",
85+
"reports.head()"
86+
]
87+
},
88+
{
89+
"cell_type": "markdown",
90+
"metadata": {
91+
"id": "_gEWmkO4OzZ6"
92+
},
93+
"source": [
94+
"This essentially leaves us with a list of all final texts that Parliament held a roll-call vote on."
95+
]
96+
},
97+
{
98+
"cell_type": "markdown",
99+
"metadata": {},
100+
"source": [
101+
"## Finding Countries Related to Votes\n",
102+
"\n",
103+
"One way to identify which countries each text is concerned with would be to search the `procedure_title` column for country names. However, extracting information from unstructured text is error-prone, and this approach has additional hurdles: for example, we would need to bring our own list of countries and regions.\n",
104+
"\n",
105+
"Luckily, by relying on the *Countries and territories vocabulary* mentioned above, the HowTheyVote.eu data set provides us with a significantly better approach: the `geo_area_votes` table. It lists the `geo_area_code`s for each vote that relates to specific geographic areas:"
106+
]
107+
},
108+
{
109+
"cell_type": "code",
110+
"execution_count": null,
111+
"metadata": {
112+
"trusted": true
113+
},
114+
"outputs": [],
115+
"source": [
116+
"geo_areas_df = pd.read_csv('data/geo_area_votes.csv')\n",
117+
"geo_areas_df.head()"
118+
]
119+
},
120+
{
121+
"cell_type": "markdown",
122+
"metadata": {},
123+
"source": [
124+
"Because multiple texts can be concerned with the same country, the entries in the `geo_area_code` column are not unique.\n",
125+
"We can also quickly confirm that the entries in the `vote_id` column are not unique either:"
126+
]
127+
},
128+
{
129+
"cell_type": "code",
130+
"execution_count": null,
131+
"metadata": {
132+
"trusted": true
133+
},
134+
"outputs": [],
135+
"source": [
136+
"vote_counts = geo_areas_df['vote_id'].value_counts().reset_index()\n",
137+
"vote_counts.head(5)"
138+
]
139+
},
140+
{
141+
"cell_type": "markdown",
142+
"metadata": {},
143+
"source": [
144+
"Therefore, we can have many reports all concerned with the same country and we can also have reports that are concerned with more than one country.\n",
145+
"\n",
146+
"We match each report with all its relevant geo codes. This will lead to reports getting duplicated if they relate to more than one geo code, but allows us to easily tally the codes afterwards:"
147+
]
148+
},
149+
{
150+
"cell_type": "code",
151+
"execution_count": null,
152+
"metadata": {
153+
"scrolled": true,
154+
"trusted": true
155+
},
156+
"outputs": [],
157+
"source": [
158+
"reports_with_areas = reports.merge(geo_areas_df, left_on=\"id\", right_on=\"vote_id\")\n",
159+
"reports_with_areas"
160+
]
161+
},
162+
{
163+
"cell_type": "markdown",
164+
"metadata": {},
165+
"source": [
166+
"Now we can, for example, easily construct a list of the top 5 most mentioned countries:"
167+
]
168+
},
169+
{
170+
"cell_type": "code",
171+
"execution_count": null,
172+
"metadata": {
173+
"trusted": true
174+
},
175+
"outputs": [],
176+
"source": [
177+
"top_countries = reports_with_areas['geo_area_code'].value_counts().head(5).reset_index()\n",
178+
"top_countries"
179+
]
180+
},
181+
{
182+
"cell_type": "markdown",
183+
"metadata": {},
184+
"source": [
185+
"To make this list easier to understand, we can also retrieve the full names of the geo areas. These are stored in `geo_areas.csv` and can easily be joined onto our table, which contains the `code` of each area in its `geo_area_code` column:"
186+
]
187+
},
188+
{
189+
"cell_type": "code",
190+
"execution_count": null,
191+
"metadata": {
192+
"trusted": true
193+
},
194+
"outputs": [],
195+
"source": [
196+
"geo_areas = pd.read_csv(\"data/geo_areas.csv\")\n",
197+
"geo_areas.head()"
198+
]
199+
},
200+
{
201+
"cell_type": "code",
202+
"execution_count": null,
203+
"metadata": {
204+
"trusted": true
205+
},
206+
"outputs": [],
207+
"source": [
208+
"top_countries.merge(geo_areas[[\"code\", \"label\"]], left_on=\"geo_area_code\", right_on=\"code\")[[\"label\", \"count\"]]"
209+
]
210+
},
211+
{
212+
"cell_type": "markdown",
213+
"metadata": {},
214+
"source": [
215+
"## Mentions of Countries per Month\n",
216+
"\n",
217+
"Since this tutorial is about the number of mentions of specific countries and territories over time, we need to decide on a time unit of analysis. Because Parliament usually meets monthly in Strasbourg, we count the mentions per month.\n",
218+
"This requires us to convert the timestamp to a proper `datetime` object and then extract its month:"
219+
]
220+
},
221+
{
222+
"cell_type": "code",
223+
"execution_count": null,
224+
"metadata": {},
225+
"outputs": [],
226+
"source": [
227+
"reports_with_areas['timestamp'] = pd.to_datetime(reports_with_areas['timestamp'], format='ISO8601')\n",
228+
"reports_with_areas['month'] = reports_with_areas['timestamp'].dt.to_period('M')"
229+
]
230+
},
231+
{
232+
"cell_type": "markdown",
233+
"metadata": {},
234+
"source": [
235+
"As we are interested in the number of mentions per month **per geo area**, we will use `pivot_table` to effectively group by geo area, counting the relevant reports per month and storing these values in a separate column for each geo area:"
236+
]
237+
},
238+
{
239+
"cell_type": "code",
240+
"execution_count": null,
241+
"metadata": {},
242+
"outputs": [],
243+
"source": [
244+
"monthly_reports = pd.pivot_table(\n",
245+
" reports_with_areas,\n",
246+
" index='month',\n",
247+
" columns='geo_area_code',\n",
248+
" values='timestamp',\n",
249+
" aggfunc='count',\n",
250+
" fill_value=0\n",
251+
").reset_index()\n",
252+
"monthly_reports.head(8)"
253+
]
254+
},
255+
{
256+
"cell_type": "markdown",
257+
"metadata": {},
258+
"source": [
259+
"Taking a look at the output above, we can spot that this approach leaves gaps in our dataframe for months in which no geo-coded votes took place. For example, March to May 2020 are missing. The following code ensures that we have a row for each month (between the earliest and latest month in the data), filling the columns with 0 for the months that we add for the sake of completeness:"
260+
]
261+
},
262+
{
263+
"cell_type": "code",
264+
"execution_count": null,
265+
"metadata": {},
266+
"outputs": [],
267+
"source": [
268+
"date_range = pd.date_range(start=reports_with_areas['month'].min().to_timestamp(), end=reports_with_areas['month'].max().to_timestamp(), freq='MS')\n",
269+
"all_months = pd.DataFrame({'month': pd.to_datetime(date_range).to_period('M')})\n",
270+
"\n",
271+
"monthly_reports = pd.merge(all_months, monthly_reports, on='month', how='left').fillna(0)\n",
272+
"monthly_reports = monthly_reports.reset_index(drop=True)\n",
273+
"monthly_reports.head(9)"
274+
]
275+
},
276+
{
277+
"cell_type": "markdown",
278+
"metadata": {},
279+
"source": [
280+
"## Analysis and Visualization\n",
281+
"\n",
282+
"With this dataframe, we are now ready to analyze patterns and trends over time. For this tutorial, we will focus on two countries: Russia and Ukraine, plotting how often Parliament voted on matters relating to these countries over time:"
283+
]
284+
},
285+
{
286+
"cell_type": "code",
287+
"execution_count": null,
288+
"metadata": {},
289+
"outputs": [],
290+
"source": [
291+
"# Subset the data frame\n",
292+
"ukr_rus = monthly_reports[[\"month\", \"UKR\", \"RUS\"]].copy()\n",
293+
"# convert month to date time again for nice plotting\n",
294+
"ukr_rus[\"month\"] = ukr_rus[\"month\"].dt.to_timestamp()"
295+
]
296+
},
297+
{
298+
"cell_type": "code",
299+
"execution_count": null,
300+
"metadata": {},
301+
"outputs": [],
302+
"source": [
303+
"plt.plot(ukr_rus['month'], ukr_rus['UKR'], label=\"Ukraine\", marker='.')\n",
304+
"plt.plot(ukr_rus['month'], ukr_rus['RUS'], label=\"Russia\", marker='.')\n",
305+
"\n",
306+
"plt.title('EP roll-call votes over time')\n",
307+
"plt.xlabel('Month')\n",
308+
"plt.ylabel('Counts')\n",
309+
"plt.xticks(rotation=45)\n",
310+
"plt.legend()\n",
311+
"\n",
312+
"plt.show()"
313+
]
314+
},
315+
{
316+
"cell_type": "markdown",
317+
"metadata": {},
318+
"source": [
319+
"This quick, exploratory visualization already shows some interesting trends that intuitively make sense: For both Russia and Ukraine, we can see a stark peak and an overall increase in related votes following Russia's large-scale invasion of Ukraine in February 2022. In general, we can see a lot of similar movements for both countries, which could indicate that many votes relate to both countries at once, for example [resolutions condemning the invasion itself](https://howtheyvote.eu/votes/165965). For most months, we also see more votes related to Ukraine than to Russia.\n",
320+
"\n",
321+
"## Further Analysis\n",
322+
"\n",
323+
"This is of course only a very rudimentary analysis, but provides us with an interesting starting point. Some further investigations could include:\n",
324+
"\n",
325+
"- relationships between other countries\n",
326+
"- analyzing mentions per continent\n",
327+
"- analyzing mentions per type of text voted on (i.e., legislative or non-legislative)\n",
328+
"\n",
329+
"## Wrapping Up\n",
330+
"\n",
331+
"In this tutorial, we have seen how we can leverage the `geo_area_votes` table of the HowTheyVote.eu data set to identify which texts that Parliament voted on relate to which countries. The tables can easily be combined by joining on the `id` of each vote."
332+
]
333+
}
334+
],
335+
"metadata": {
336+
"colab": {
337+
"provenance": [],
338+
"toc_visible": true
339+
},
340+
"kernelspec": {
341+
"display_name": "base",
342+
"language": "python",
343+
"name": "python3"
344+
},
345+
"language_info": {
346+
"codemirror_mode": {
347+
"name": "ipython",
348+
"version": 3
349+
},
350+
"file_extension": ".py",
351+
"mimetype": "text/x-python",
352+
"name": "python",
353+
"nbconvert_exporter": "python",
354+
"pygments_lexer": "ipython3",
355+
"version": "3.12.9"
356+
}
357+
},
358+
"nbformat": 4,
359+
"nbformat_minor": 4
360+
}
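
The counting and gap-filling steps in the notebook can be checked against a small plain-Python sketch. It is not part of the commit and does not use `pandas`; the toy rows and the helper names `month_range` and `monthly_counts` are hypothetical and only illustrate the logic: count mentions per month and geo area code, then add a zero-filled row for every month between the earliest and latest month.

```python
from collections import Counter
from datetime import date

# Hypothetical toy rows: (vote date, geo area code). Not real data.
rows = [
    (date(2020, 1, 15), "UKR"),
    (date(2020, 1, 15), "RUS"),
    (date(2020, 2, 12), "UKR"),
    (date(2020, 6, 17), "RUS"),
]


def month_range(first, last):
    """Yield (year, month) tuples from first to last, inclusive."""
    year, month = first
    while (year, month) <= last:
        yield (year, month)
        month += 1
        if month > 12:
            year, month = year + 1, 1


def monthly_counts(rows):
    """Count mentions per (month, code); months without votes get 0.

    Assumes rows is non-empty.
    """
    counts = Counter(((d.year, d.month), code) for d, code in rows)
    codes = sorted({code for _, code in rows})
    months = [m for m, _ in counts]
    return {
        month: {code: counts[(month, code)] for code in codes}
        for month in month_range(min(months), max(months))
    }


table = monthly_counts(rows)
for month, per_code in table.items():
    print(month, per_code)
```

In this toy input, March to May 2020 have no rows, so they appear in `table` with a count of 0 for every code, which mirrors what the left merge against `all_months` achieves in the notebook.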
