
Network Science Project HS2024

This repository is part of the final project of the Network Science course at the University of Zurich, autumn term 2024. We acquired data from Wikipedia, built several graphs from it, and performed a range of analyses on them. A detailed description of what we did, why we did it, and the results can be found in the final report; a very short summary is given in the About section of this readme.

About

We would like to look into the following questions:

  • How are Wikipedia contributors in the articles of a specific subcategory organized in terms of their co-authorships and contributions?
  • Are there contributors who act as central hubs in the community, playing a key role in coordinating or influencing content creation and editing?
  • Does the contributor network exhibit scale-free properties, where a few contributors dominate interactions while most others have fewer connections?

To do this, we built a graph whose nodes represent contributors, with an edge between two contributors whenever they have contributed to the same article at least once. To keep this flexible, we first crawl Wikipedia articles for selected main categories and store all the necessary information in a SQLite database, so that we can later freely construct different graphs (e.g. weighted, unweighted, ...). We crawled four categories, created graphs for each of them and performed various analyses on these graphs. The following sections describe our approach from a technical perspective, while the final report and presentation describe it from a research perspective.

Data

The table below shows the main categories crawled and some key metrics for each. The graphs following the table show the evolution of some of these metrics over time for each category.

Category                  Subcategories   Pages   Contributors   Revisions
Amiga CD32 games                      2     143           8013       29370
Game Boy games                        6     167           8953       30851
Machine learning                     61    1548          75217      273260
Artificial intelligence             109    8695         610800       29368

(Figures: evolution of these metrics over time for each category)

The generated graphs and the database file can be found in this Dropbox folder. The database file needs to be placed in src/acquisition/models/db and the graph files in outputs/graphs.

Requirements and Installation

To improve performance, we use graph-tool instead of networkx: in graph-tool, the core algorithms and data structures are implemented in C++. Note that graph-tool is not distributed on PyPI and cannot be installed with pip; install it either through Conda or as a system package, as described below.

Using Conda

If you are using Conda, you can install graph-tool from the conda-forge channel:

conda install -c conda-forge graph-tool

Make sure that the conda-forge channel is enabled in your environment.

Plain Installation (System Packages)

For a plain installation, install graph-tool through your platform's package manager.

On Debian/Ubuntu:

sudo apt-get update
sudo apt-get install -y python3-graph-tool

On macOS (using Homebrew):

brew install graph-tool

For other systems, consult the official graph-tool documentation for the available packages.

Install the remaining Python dependencies:

pip install -r requirements.txt

Usage

There are several Jupyter notebooks, each of which imports code implemented in separate classes to keep things tidy.

Data Acquisition

The Jupyter notebook for acquiring the data and building the graphs is located at 01_acquisition.ipynb. Additional classes that are used and imported in this notebook are implemented in src/acquisition.

Data Analysis

The analysis of the graphs is conducted in several Jupyter notebooks, separated by analysis type and/or by the graph analyzed:

  • 02_analysis_articles_contributors.ipynb and 02_analysis_scale_free.ipynb contain a distribution analysis for all four networks.
  • 03_analysis_Amiga_CD32_games.ipynb and 03_analysis_Machine_learning.ipynb contain a general graph analysis for the Amiga CD32 games and Machine learning networks.
  • 04_analysis_communities_Amiga_CD32_games.ipynb and 04_analysis_communities_Machine_learning.ipynb contain a community and centrality analysis for the same two networks.

All additional classes for the analysis notebooks are implemented in the directory src/analysis.

Features

We started to implement the code using networkx, but soon realised that certain functions took a very long time to run. We looked for other solutions and found graph-tool, a powerful, high-performance library that implements its core algorithms and data structures in C++. This greatly increased the speed of certain functions and made it possible to perform analyses on large graphs.

Data Acquisition

For the data acquisition part, we implemented three different classes: the Database Manager, which is responsible for storing information in a structured way, the Wikipedia Category Crawler, which fetches data from the Wikipedia API, and the Contributor Graph Builder, which can build unweighted and weighted graphs from the data stored in the database. The three classes are described in more detail below.

Database Manager

In the file database_manager.py we have implemented a class that manages a SQLite database, providing the basic functionality for the database to work, plus additional methods needed to build the graph or perform certain analyses. The file models.py contains the models for the database entities. The following figure shows the relationships.

(Figure: database relations)
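
As a rough illustration of the schema described in this section (table and column names here are assumptions; the authoritative definitions are in models.py):

import sqlite3

# Hypothetical, simplified version of the schema; see models.py for the
# real entity definitions.
conn = sqlite3.connect("wikipedia.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS page (
    id INTEGER PRIMARY KEY,
    title TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS contributor (
    id INTEGER PRIMARY KEY,
    username TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS revision (
    id INTEGER PRIMARY KEY,
    page_id INTEGER REFERENCES page(id),
    contributor_id INTEGER REFERENCES contributor(id),
    timestamp TEXT
);
CREATE TABLE IF NOT EXISTS crawl (
    id INTEGER PRIMARY KEY,
    root_category TEXT,
    depth INTEGER,
    started_at TEXT
);
""")
conn.commit()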

Wikipedia Category Crawler

The Wikipedia Category Crawler class implemented in wikipedia_crawler.py uses the mwclient library to retrieve data from Wikipedia. Categories are hierarchical: a category can contain articles as well as other categories, which in turn can contain articles and categories. The user therefore defines a main (or root) category and a depth, which together define the fetching process. First, we recursively fetch all subcategories of the root category down to the given depth; then we fetch the pages of all these categories and store the relevant information (revision, page, i.e. article, and contributor) in the database according to the schema above. Finally, an entry for the crawl itself is added to the database; this is metadata, kept mainly for informational purposes.
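
A minimal sketch of such a recursive crawl with mwclient (the function, the depth handling and the selected revision fields are illustrative assumptions; the project's actual logic lives in wikipedia_crawler.py):

import mwclient

CATEGORY_NS = 14  # MediaWiki namespace id for categories
ARTICLE_NS = 0    # main/article namespace

def collect_pages(site, category_name, depth):
    # Recursively collect article pages under a category up to `depth`.
    pages = []
    for member in site.categories[category_name]:
        if member.namespace == CATEGORY_NS and depth > 0:
            subcat = member.name.split(":", 1)[1]  # drop "Category:" prefix
            pages.extend(collect_pages(site, subcat, depth - 1))
        elif member.namespace == ARTICLE_NS:
            pages.append(member)
    return pages

site = mwclient.Site("en.wikipedia.org")
for page in collect_pages(site, "Amiga CD32 games", depth=2):
    for rev in page.revisions(prop="ids|timestamp|user"):
        print(page.name, rev.get("user"), rev.get("timestamp"))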

Contributor Graph Builder

The Contributor Graph Builder class implemented in src/acquisition/graph_tool/contributor_graph_builder.py creates a graph-tool graph based on the crawled data. It builds either a weighted graph, where the weights represent the number of times two contributors have contributed together, or an unweighted graph, which simply adds an edge if two contributors have contributed to the same article at least once. A boolean flag in the constructor indicates whether a weighted graph should be created. To build the graph, the build() method must be called; at the end, it stores the built graph in outputs/graphs under the name given in the constructor.
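
A minimal sketch of how such a weighted co-contribution graph can be assembled with graph-tool (the pair counting and property names are assumptions; the project's implementation is in contributor_graph_builder.py):

from collections import Counter
from itertools import combinations
import graph_tool.all as gt

# Toy input: contributors per article, as would be read from the database.
contributors_per_article = {
    "Article 1": {"Alice", "Bob", "Carol"},
    "Article 2": {"Alice", "Bob"},
}

# Count how often each contributor pair co-occurs on an article.
pair_counts = Counter()
for authors in contributors_per_article.values():
    for u, v in combinations(sorted(authors), 2):
        pair_counts[(u, v)] += 1

g = gt.Graph(directed=False)
weight = g.new_edge_property("int")
# hashed=True maps contributor names to vertex indices and returns the
# name property map; the third tuple entry fills the weight property.
name = g.add_edge_list(
    ((u, v, c) for (u, v), c in pair_counts.items()),
    hashed=True,
    eprops=[weight],
)
g.vertex_properties["name"] = name
g.edge_properties["weight"] = weight
g.save("contributors.gt")  # the project stores graphs in outputs/graphs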

Data Analysis

Basic Graph Analyzer

The Basic Graph Analyzer class implemented in src/analysis/basic_graph_analyzer.py provides some tools for basic graph analysis. The following features are currently implemented (a minimal sketch follows the list):

  • Plot Degree vs. Average Degree of neighbors
  • Log-log plot of probability density
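
A minimal sketch of the log-log density plot for a graph-tool graph (binning and styling are assumptions, not the project's exact code):

import matplotlib.pyplot as plt
import numpy as np
import graph_tool.all as gt

def plot_degree_density(g):
    # Empirical probability of each observed (total) degree.
    degrees = g.degree_property_map("total").a
    values, counts = np.unique(degrees, return_counts=True)
    mask = values > 0  # a log scale cannot show k = 0
    plt.loglog(values[mask], counts[mask] / counts.sum(), "o", markersize=3)
    plt.xlabel("degree k")
    plt.ylabel("P(k)")
    plt.show()

plot_degree_density(gt.collection.data["karate"])  # small example graph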

Scale Free Analyzer

The ScaleFreeAnalyzer class provides tools for testing whether the degree distributions are scale-free. The following features are currently implemented:

  • Fit the degree distribution to an exponential and to a power-law distribution
  • Support for both unweighted and weighted graphs
  • Fit the power-law distribution only to values above a variable threshold
  • Calculate the exponent alpha for the power-law fit
  • Plot these distributions

Basic usage of ScaleFreeAnalyzer:

scale_free_analyzer = ScaleFreeAnalyzer(G)
scale_free_analyzer.plot_degree_distribution(main_category)

Only fit to degree values above 10^3:

scale_free_analyzer.set_filter(10**3)
scale_free_analyzer.plot_degree_distribution(main_category)

Use with weighted graphs:

scale_free_analyzer = ScaleFreeAnalyzer(G, is_weighted=True)
scale_free_analyzer.plot_degree_distribution(f'Weighted {main_category}')

Centrality Analyzer

The Centrality Analyzer class implemented in src/analysis/centrality_analyzer.py provides some tools for centrality analysis. The following features are currently implemented:

  • Calculate and plot centralities: degree, betweenness, closeness, eigenvector
  • Plot comparison of different centralities
  • Plot comparison of real to randomized network for all centralities

Plot the four different centralities (degree, betweenness, closeness and eigenvector):

centrality_analyzer = CentralityAnalyzer(G, network_name=main_category)
centrality_analyzer.plot_centralities()

Compare each centrality with every other in a scatter plot:

centrality_analyzer.plot_centralities_comparison()

Compare each centrality to the centrality of a randomized version of the graph:

centrality_analyzer.plot_centralities_comparison_randomized()
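
Under the hood, graph-tool exposes these centralities directly; a minimal sketch on a small example graph (not the project's code):

import graph_tool.all as gt

g = gt.collection.data["karate"]      # small example graph
deg = g.degree_property_map("total")  # degree centrality
btw, _ = gt.betweenness(g)            # vertex (and edge) betweenness
cls = gt.closeness(g)                 # closeness
_, eig = gt.eigenvector(g)            # eigenvector

# Randomized baseline: rewire edges while preserving the degree sequence.
g_rand = g.copy()
gt.random_rewire(g_rand, model="configuration")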

GraphCommunityAnalyzer

The GraphCommunityAnalyzer class implemented in src/analysis/graph_community_analyzer_graph_tool.py provides some tools for basic community analysis. The following features are currently implemented (an illustration follows the list):

  • Community detection with greedy modularity maximization
  • Community detection with label propagation algorithm
  • Comparison to randomized networks
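
For illustration, both detection algorithms are available in networkx (the project's class wraps graph-tool, so this is not its actual code):

import networkx as nx
from networkx.algorithms.community import (
    greedy_modularity_communities,
    label_propagation_communities,
)

G = nx.karate_club_graph()
greedy = greedy_modularity_communities(G)
labelprop = list(label_propagation_communities(G))
print(len(greedy), "communities via greedy modularity maximization")
print(len(labelprop), "communities via label propagation")

# Randomized baseline with the same degree sequence, for comparison.
G_rand = nx.configuration_model([d for _, d in G.degree()])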

Acknowledgement

We would like to thank the Network Science HS24 team of Prof. Dr. Claudio J. Tessone. Special thanks to Dr. Nicolò Vallarano and Dr. Taehoon Kim for their great support during the semester and for answering our questions regarding the final project.
