Webscraping Wikipedia Country Demographics

Web scraping Wikipedia pages using Requests, Beautiful Soup, and Pandas

You can better view the Jupyter notebook here on Jovian.

I am very interested in geography and in learning statistics about other countries in the world, and I frequently use Wikipedia to look this information up quickly. In this project, I show how I used Python libraries (Beautiful Soup, Requests, Pandas) to gather demographic information about the countries of the world by scraping Wikipedia web pages.

Wikipedia is a collaborative, publicly editable online encyclopedia. It is the largest and most-read reference work in history, and consistently one of the 15 most popular websites visited. Though in theory anyone can edit a wiki page, some light governance is in place to guide how certain pages are structured. One way this is done is via WikiProjects, where a group of contributors work together as a team to improve a topic, often by standardizing it to some extent. "Demographics" is one such WikiProject. In this WikiProject, each country has a demographics page, and many of these pages have a summary table displayed in the right-hand column of the page. This table contains information such as population breakdown by race, sex, and religion, population density, birth and death rates, language, government type, and so forth. It is this demographics summary table that this project attempts to find and scrape for each country of the world.

Project Steps:

Scrape the main demographics page to build a list of countries and their Wiki-links
- Using the Python Requests and Beautiful Soup libraries, we will scrape the Wikipedia page that lists the links to all country wiki pages
- Parsing the HTML data, we will make a list of dictionaries, one per country, each containing the country name and the link to its individual page
- Using Pandas, we will write this data to a CSV file
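
As a rough illustration of this first step, the sketch below parses a list page into one dictionary per country. It is not the notebook's actual code: the function names, the `User-Agent` string, the inline `SAMPLE_HTML`, and the assumption that the relevant links carry a `title` starting with "Demographics of " are all hypothetical choices made for the example.

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

BASE_URL = "https://en.wikipedia.org"


def fetch_html(url):
    """Download a page and return its HTML text (defined but not called here)."""
    response = requests.get(
        url,
        headers={"User-Agent": "country-demographics-example"},  # hypothetical value
        timeout=30,
    )
    response.raise_for_status()
    return response.text


def parse_country_links(html, prefix="Demographics of "):
    """Return one {"country", "link"} dictionary per matching link.

    Assumption: country links have a title attribute that starts with `prefix`.
    """
    soup = BeautifulSoup(html, "html.parser")
    countries = []
    seen = set()
    for anchor in soup.find_all("a", href=True):
        title = anchor.get("title", "")
        if not title.startswith(prefix):
            continue
        link = urljoin(BASE_URL, anchor["href"])
        if link in seen:  # skip repeated links to the same page
            continue
        seen.add(link)
        countries.append({"country": title[len(prefix):], "link": link})
    return countries


# Small made-up snippet standing in for the real list page.
SAMPLE_HTML = """
<ul>
  <li><a href="/wiki/Demographics_of_France" title="Demographics of France">France</a></li>
  <li><a href="/wiki/Demographics_of_Japan" title="Demographics of Japan">Japan</a></li>
  <li><a href="/wiki/Main_Page" title="Main Page">Main page</a></li>
</ul>
"""

country_list = parse_country_links(SAMPLE_HTML)
print(country_list)
```

With the real page, the HTML would come from `fetch_html(...)` instead of the sample string, and the resulting list could be saved with something like `pandas.DataFrame(country_list).to_csv("country_list.csv", index=False)`.
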
Scrape individual country pages for their demographics
- Read the countries and links into a dataframe using Pandas
- Crawl the links in the CSV file and count how many country pages have the table I am trying to scrape
- Write code specific to the majority of demographics page tables to scrape the data I want
- Handle instances where the relevant tables do not exist in the demographics page
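
The second step could look something like the sketch below, which pulls header/value pairs out of the summary table and returns `None` when a page has no such table. It assumes the table can be found by the CSS class `infobox`, which may not hold for every page; the function name and both sample pages are invented for the example.

```python
from bs4 import BeautifulSoup


def parse_demographics_table(html):
    """Return {row label: row value} for the summary table, or None if absent.

    Assumption: the summary table is a <table> whose classes include "infobox".
    """
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="infobox")
    if table is None:
        return None  # this page has no summary table
    data = {}
    for row in table.find_all("tr"):
        header = row.find("th")
        value = row.find("td")
        if header is None or value is None:
            continue  # skip title rows and rows without a label/value pair
        data[header.get_text(" ", strip=True)] = value.get_text(" ", strip=True)
    return data


# Made-up pages: one with a summary table, one without.
PAGE_WITH_TABLE = """
<table class="infobox vcard">
  <tr><th colspan="2">Demographics of Exampleland</th></tr>
  <tr><th>Population</th><td>1,234,567 (2020 est.)</td></tr>
  <tr><th>Birth rate</th><td>12.3 births/1,000 population</td></tr>
</table>
"""
PAGE_WITHOUT_TABLE = "<p>No summary table on this page.</p>"

pages = {"Exampleland": PAGE_WITH_TABLE, "Samplestan": PAGE_WITHOUT_TABLE}
results = {name: parse_demographics_table(html) for name, html in pages.items()}

# Countries whose page lacks the table need separate handling.
missing = [name for name, data in results.items() if data is None]
print(f"{len(results) - len(missing)} of {len(results)} pages have the table")
```

Returning `None` rather than an empty dictionary keeps "no table found" distinct from "table found but no usable rows", which makes it easy to count the pages that need special handling.
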
Clean the data scraped
- Look at the data in Pandas and see if there is some simple, first-order way to clean the data further
- Write the demographics info to a CSV file
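
A first-order cleaning pass might resemble the sketch below. The column names, sample rows, and helper functions are hypothetical; the cleaning rules (dropping bracketed footnote markers, normalizing whitespace, pulling out a leading number) are only examples of the kind of cleanup scraped table text tends to need.

```python
import re

import pandas as pd


def clean_text(value):
    """Strip footnote markers such as [1] and collapse runs of whitespace."""
    if not isinstance(value, str):
        return value  # leave missing values untouched
    value = re.sub(r"\[[^\]]*\]", "", value)
    value = value.replace("\xa0", " ")  # non-breaking spaces are common in scraped HTML
    return re.sub(r"\s+", " ", value).strip()


def leading_number(value):
    """Return the number at the start of a string as a float, or None."""
    if not isinstance(value, str):
        return None
    match = re.match(r"\s*([\d,]+(?:\.\d+)?)", value)
    if match is None:
        return None
    return float(match.group(1).replace(",", ""))


# Made-up rows standing in for the scraped demographics data.
raw = pd.DataFrame(
    [
        {"country": "Exampleland",
         "Population": "1,234,567[1] (2020 est.)",
         "Birth rate": "12.3 births/1,000  population"},
        {"country": "Samplestan",
         "Population": None,
         "Birth rate": "9.8 births/1,000 population[2]"},
    ]
)

cleaned = raw.copy()
for column in ["Population", "Birth rate"]:
    cleaned[column] = cleaned[column].map(clean_text)

# Keep the original text and add a numeric column alongside it.
cleaned["population_number"] = cleaned["Population"].map(leading_number)
print(cleaned)
```

The cleaned frame could then be written out with `cleaned.to_csv("country_demographics.csv", index=False)`.
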

srobertsphd/webscraping-wiki-country-demographics