
kynoptic/wikipedia-reliable-sources


Wikipedia Goggles

This repository hosts two Brave Search Goggles that rerank search results to promote sources the Wikipedia community considers reliable:

  1. wikipedia-reliable-sources-only.goggle – Boosts reliable sources and downranks contentious ones, showing no other results. Search using this Goggle.
  2. wikipedia-reliable-sources.goggle – Similar to the first, but it also allows results from other sources, discarding only those deemed unreliable. Search using this Goggle.

Reliability ratings

This project uses reliability ratings curated by the Wikipedia community to assess the trustworthiness of sources. The ratings are drawn from Wikipedia's list of perennial sources and from reliability tables maintained by several WikiProjects (both described below).

Additionally, sources frequently used in featured articles (FA) and good articles (GA) are included.

Project structure

  • core/ – Python modules with data processing logic
  • scripts/ – Standalone command-line utilities
  • tests/ – Pytest suite
  • docs/ – Documentation and roadmap
  • data/ – Raw and processed datasets
  • outputs/ – Generated analysis results

How reliability affects rankings

The reliability ratings are adjusted using the following parameters:

  • $boost=2 – Applied to sources considered "Generally reliable" or "Reliable"
  • $downrank=2 – Used for sources labeled with "No consensus"
  • $discard – Assigned to sources determined as "Unreliable," "Blacklisted," or "Deprecated"
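
In Goggle syntax, these parameters are attached to site rules. A minimal illustration follows; the domains here are placeholders, not entries from the actual Goggle files:

```
! name: Wikipedia Reliable Sources (illustration only)
$boost=2,site=generally-reliable.example
$downrank=2,site=no-consensus.example
$discard,site=deprecated.example
```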

Extracting perennial sources

Run scripts/fetch_perennial_sources.py to download and parse the perennial sources list from Wikipedia. The script cleans and validates the data, then writes perennial_sources.json and perennial_sources.csv containing structured records.

python scripts/fetch_perennial_sources.py

JSON output

Running the script prints the number of parsed entries (for example, Fetched 485 sources) and writes perennial_sources.json. The file is a JSON array where each object has the following fields:

Field                Description
source_name          Name of the publication or website.
reliability_status   Two-letter code from the WP:RSPSTATUS legend (e.g. gr = generally reliable, gu = generally unreliable, nc = no consensus, d = deprecated, m = marginal).
notes                Summary of discussions about the source.

Example entry:

[
  {
    "source_name": "ABC News (US)",
    "reliability_status": "gr",
    "notes": "There is consensus that ABC News, the news division of the American Broadcasting Company, is generally reliable. It is not to be confused with other publications of the same name."
  }
]

CSV output

The script also writes a perennial_sources.csv file with the same fields for easy spreadsheet analysis.
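
Either output is easy to filter by status code. The snippet below is a quick sketch against an inline sample that mirrors the JSON schema; in practice you would json.load the perennial_sources.json file instead:

```python
import json

# Inline sample mirroring the perennial_sources.json schema.
sources = json.loads("""
[
  {"source_name": "ABC News (US)", "reliability_status": "gr", "notes": "..."},
  {"source_name": "Example Blog", "reliability_status": "gu", "notes": "..."}
]
""")

# Keep only entries marked generally reliable ("gr").
reliable = [s["source_name"] for s in sources if s["reliability_status"] == "gr"]
print(reliable)  # ['ABC News (US)']
```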

Checking for updates

Run scripts/update_checker.py periodically. It compares the current revision IDs of the perennial sources subpages against revision_ids.json. If any page has changed since the last run, the script re-fetches the tables and updates perennial_sources.json and perennial_sources.csv.
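
The comparison itself is straightforward. A hedged sketch of the logic, with the API fetch stubbed out (the real script queries the MediaWiki API for current revision IDs):

```python
# stored would be loaded from revision_ids.json; current would be fetched
# from the MediaWiki API. Values here are illustrative.
stored = {"Wikipedia:Reliable_sources/Perennial_sources": 1200}
current = {"Wikipedia:Reliable_sources/Perennial_sources": 1234}

# Any page whose current revision ID differs from the stored one is stale.
changed = [page for page, rev in current.items() if stored.get(page) != rev]
if changed:
    print(f"Stale pages: {changed}")  # the real script re-fetches these
```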

python scripts/update_checker.py

Extracting WikiProject sources

Run scripts/fetch_wikiproject_sources.py to download the reliability tables maintained by several WikiProjects. The command outputs wikiproject_sources.json and wikiproject_sources.csv at the repository root.

python scripts/fetch_wikiproject_sources.py

Citation normalization pipeline

The project now includes a modular workflow for gathering citation data and normalizing source URLs.

  1. Fetch article lists

    python -m core.fetch_articles

    This writes good_articles.json and featured_articles.json under data/raw/.

  2. Download wikitext for each article

    python -m core.fetch_wikitext

    Wikitext files are stored in data/raw/wikitext/.

  3. Extract citation URLs

    python -m core.extract_refs

    The extracted references are written to data/processed/refs_extracted.json.

  4. Normalize and rank sources

    python -m core.clean_sources

The script applies domain aliases from data/alias_map.json, writes canonical counts to data/processed/sources_canonical.csv, and outputs the top sources to outputs/top_sources.csv.
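
Step 3 of the pipeline amounts to pulling URLs out of citation templates and external links. A rough regex-based sketch (the real core.extract_refs module may be more careful about edge cases):

```python
import re

# Toy wikitext containing a {{cite web}} template and a bare external link.
wikitext = (
    "{{cite web |url=https://example.com/story |title=Story}} "
    "and a bare link [https://news.example.org/a Article]."
)

# Match http(s) URLs, stopping at whitespace or wikitext delimiters.
urls = re.findall(r"https?://[^\s|\]}]+", wikitext)
print(urls)  # ['https://example.com/story', 'https://news.example.org/a']
```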

Updating domain aliases

Add new mappings in data/alias_map.json to normalize additional domains. Each entry maps a short hostname to its canonical form. These aliases are loaded by core/clean_sources, so updates affect how sources are deduplicated.
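
Conceptually, the alias lookup works like this (the mappings below are illustrative, not the contents of the real alias_map.json):

```python
from urllib.parse import urlparse

# Hypothetical alias entries: short hostname -> canonical form.
alias_map = {"nyti.ms": "nytimes.com", "youtu.be": "youtube.com"}

def canonical_domain(url: str) -> str:
    # Lowercase the host, drop a leading "www.", then apply the alias map.
    host = urlparse(url).netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return alias_map.get(host, host)

print(canonical_domain("https://nyti.ms/abc"))        # nytimes.com
print(canonical_domain("https://www.nytimes.com/x"))  # nytimes.com
```

With this normalization in place, shortened and full-length URLs for the same outlet collapse into one row when counts are aggregated.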

Running tests

Install the dependencies listed in requirements.txt and execute the test suite with pytest:

pip install -r requirements.txt
pytest

Contributors should run the tests before committing changes to ensure nothing breaks.

Programmatic API usage

Import functions from the core package to integrate the pipeline into your own scripts. See docs/api_usage.md for complete examples.
