This repository hosts two Brave Search Goggles that rerank search results to promote sources the Wikipedia community considers reliable:
- `wikipedia-reliable-sources-only.goggle` – Boosts reliable sources and downranks contentious ones, showing no other results. Search using this Goggle.
- `wikipedia-reliable-sources.goggle` – Similar to the first, but allows additional sources while discarding those deemed unreliable. Search using this Goggle.
This project leverages reliability ratings from various Wikipedia sources to assess the trustworthiness of content. The ratings are based on the following Wikipedia pages:
- Reliable sources/Perennial sources
- WikiProject Video games/Sources
- WikiProject Film/Resources
- WikiProject Albums/Sources
- WikiProject Christian music/Sources
- WikiProject Professional wrestling/Sources
- WikiProject Korea/Reliable sources
Additionally, sources frequently used in featured articles (FA) and good articles (GA) are included.
- `core/` – Python modules with data-processing logic
- `scripts/` – Standalone command-line utilities
- `tests/` – Pytest suite
- `docs/` – Documentation and roadmap
- `data/` – Raw and processed datasets
- `outputs/` – Generated analysis results
The reliability ratings are adjusted using the following parameters:
- `$boost=2` – Applied to sources considered "Generally reliable" or "Reliable"
- `$downrank=2` – Used for sources labeled "No consensus"
- `$discard` – Assigned to sources determined to be "Unreliable," "Blacklisted," or "Deprecated"
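As an illustration, a Goggle applies these actions with one rule per line. The sketch below uses placeholder hostnames and a minimal header; the real rules and metadata live in the `.goggle` files in this repository:

```
! name: Illustrative ruleset (placeholder hostnames)
! description: Sketch of how the reliability ratings map to Goggle rules.
$boost=2,site=reliable.example
$downrank=2,site=no-consensus.example
$discard,site=deprecated.example
```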
Run `scripts/fetch_perennial_sources.py` to download and parse the perennial
sources list from Wikipedia. The script cleans and validates the data, then
writes `perennial_sources.json` and `perennial_sources.csv` containing
structured records.
```
python scripts/fetch_perennial_sources.py
```

Running the script prints the number of parsed entries (for example,
`Fetched 485 sources`) and writes `perennial_sources.json`. The file is a JSON
array where each object has the following fields:
| Field | Description |
|---|---|
| `source_name` | Name of the publication or website. |
| `reliability_status` | Short code from the WP:RSPSTATUS legend (e.g. `gr` = generally reliable, `gu` = generally unreliable, `nc` = no consensus, `d` = deprecated, `m` = marginal). |
| `notes` | Summary of discussions about the source. |
Example entry:
```json
[
  {
    "source_name": "ABC News (US)",
    "reliability_status": "gr",
    "notes": "There is consensus that ABC News, the news division of the American Broadcasting Company, is generally reliable. It is not to be confused with other publications of the same name."
  }
]
```

The script also writes a `perennial_sources.csv` file with the same fields for
easy spreadsheet analysis.
Run `scripts/update_checker.py` periodically. It compares the current revision
IDs of the perennial sources subpages against `revision_ids.json`. If any page
has changed since the last run, the script re-fetches the tables and updates
`perennial_sources.json` and `perennial_sources.csv`.
```
python scripts/update_checker.py
```

Run `scripts/fetch_wikiproject_sources.py` to download the reliability tables
maintained by several WikiProjects. The command outputs
`wikiproject_sources.json` and `wikiproject_sources.csv` at the repository
root.
```
python scripts/fetch_wikiproject_sources.py
```

The project now includes a modular workflow for gathering citation data and normalizing source URLs.
1. **Fetch article lists**

   ```
   python -m core.fetch_articles
   ```

   This writes `good_articles.json` and `featured_articles.json` under `data/raw/`.

2. **Download wikitext for each article**

   ```
   python -m core.fetch_wikitext
   ```

   Wikitext files are stored in `data/raw/wikitext/`.

3. **Extract citation URLs**

   ```
   python -m core.extract_refs
   ```

   The extracted references are written to `data/processed/refs_extracted.json`.

4. **Normalize and rank sources**

   ```
   python -m core.clean_sources
   ```

   The script applies domain aliases from `data/alias_map.json`, writes canonical counts to `data/processed/sources_canonical.csv`, and outputs the top sources to `outputs/top_sources.csv`.
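The normalization step can be pictured as follows; `canonical_domain` is a sketch written for this README, not the function `core/clean_sources` actually exposes:

```python
from urllib.parse import urlparse

def canonical_domain(url, alias_map):
    """Reduce a citation URL to a canonical hostname.

    Strips a leading "www." and then applies the alias map, so mirrors
    and shortener domains count toward the same source.
    """
    host = urlparse(url).netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return alias_map.get(host, host)
```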
Add new mappings in `data/alias_map.json` to normalize additional domains.
Each entry maps a short hostname to its canonical form. These aliases are loaded
by `core/clean_sources`, so updates affect how sources are deduplicated.
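A mapping entry might look like the following; these hostnames are placeholders, since the actual file contents are project-specific:

```json
{
  "youtu.be": "youtube.com",
  "nyti.ms": "nytimes.com"
}
```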
Install the dependencies listed in `requirements.txt` and execute the test suite with pytest:

```
pip install -r requirements.txt
pytest
```

Contributors should run the tests before committing changes to ensure nothing breaks.
Import functions from the `core` package to integrate the pipeline into your own
scripts. See `docs/api_usage.md` for complete examples.