Abdul-Omira

Title: Add wikipedia-2023-redirects dataset (redirect resolution + pageviews)

Summary

  • New dataset loader: wikipedia_2023_redirects
  • Canonical Wikipedia pages enriched with:
    • redirects (aliases pointing to the page)
    • 2023 pageviews (aggregated)
  • Streaming support; robust parsing; license notes included
  • Tests with tiny dummy data (XML + TSVs); covers streaming

Motivation
RAG/retrieval often benefits from:

  • Query expansion via redirect aliases
  • Popularity prior via pageviews

This loader offers a practical, low-maintenance way to access canonical pages alongside their redirect aliases and 2023 pageview totals.
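
As a rough illustration of both ideas, the sketch below builds an alias index from rows shaped like this dataset's records and breaks ties between candidate pages with a log-scaled pageview prior. The helper names (`build_alias_index`, `popularity_prior`, `resolve`) and the sample rows are hypothetical; they are not part of the loader.

```python
import math
from collections import defaultdict


def build_alias_index(rows):
    """Map each case-folded title or redirect alias to the rows it may refer to."""
    index = defaultdict(list)
    for row in rows:
        for name in [row["title"], *row["redirects"]]:
            index[name.casefold()].append(row)
    return index


def popularity_prior(row):
    """Log-scaled pageview count, so very popular pages do not dominate linearly."""
    return math.log1p(row["pageviews_2023"])


def resolve(query, index):
    """Return the canonical title for a query, preferring the most-viewed match."""
    candidates = index.get(query.casefold(), [])
    if not candidates:
        return None
    return max(candidates, key=popularity_prior)["title"]


# Made-up rows for illustration only; the counts are not real pageview data.
rows = [
    {"title": "United States", "redirects": ["USA", "US"], "pageviews_2023": 1000},
    {"title": "Us", "redirects": [], "pageviews_2023": 10},
]
index = build_alias_index(rows)
print(resolve("usa", index))  # alias lookup
print(resolve("us", index))   # two candidates after case folding; the prior decides
```

Case folding is a design choice made here for the example; a real retrieval pipeline may want exact-match lookups first and fuzzy matching only as a fallback.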

Features

  • id: string
  • title: string
  • url: string
  • text: string
  • redirects: list[string]
  • pageviews_2023: int32
  • timestamp: string
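
For readers who want to sanity-check rows against this schema, here is a minimal sketch of a validator written with the standard library only. `EXPECTED_TYPES` and `validate_row` are hypothetical names, the sample row is made up, and the non-negativity check is an assumption (the schema above only states int32).

```python
# Expected Python type for each column listed above (list[string] -> list of str).
EXPECTED_TYPES = {
    "id": str,
    "title": str,
    "url": str,
    "text": str,
    "redirects": list,
    "pageviews_2023": int,
    "timestamp": str,
}
INT32_MAX = 2**31 - 1


def validate_row(row):
    """Return a list of problems found in one row; an empty list means it conforms."""
    problems = []
    for name, expected in EXPECTED_TYPES.items():
        if name not in row:
            problems.append(f"missing field: {name}")
        elif not isinstance(row[name], expected):
            problems.append(f"wrong type for {name}: {type(row[name]).__name__}")
    redirects = row.get("redirects")
    if isinstance(redirects, list) and not all(isinstance(a, str) for a in redirects):
        problems.append("redirects must contain only strings")
    views = row.get("pageviews_2023")
    # Assumption: totals are non-negative; anything above INT32_MAX would overflow int32.
    if isinstance(views, int) and not 0 <= views <= INT32_MAX:
        problems.append("pageviews_2023 does not fit a non-negative int32")
    return problems


# Made-up row for illustration; the timestamp format is an assumption.
sample = {
    "id": "1",
    "title": "Example",
    "url": "https://en.wikipedia.org/wiki/Example",
    "text": "Example text.",
    "redirects": ["Examples"],
    "pageviews_2023": 12345,
    "timestamp": "2023-01-01T00:00:00Z",
}
print(validate_row(sample))
```

Since pageviews_2023 is an annual aggregate stored as int32, any total above 2,147,483,647 would not fit; whether any page reaches that is not stated here.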

Licensing

  • Wikipedia text: CC BY-SA 3.0 (attribution and share-alike apply)
  • Pageviews: public domain

The PR docs mention both licenses, and the module docstring cites the sources.

Notes

Testing

  • make style && make quality
  • pytest -q tests/test_dataset_wikipedia_2023_redirects.py

Example

from datasets import load_dataset

# Load the train split, then inspect the first page and up to five of its aliases.
ds = load_dataset("wikipedia_2023_redirects", split="train")
row = ds[0]
print(row["title"], row["redirects"][:5], row["pageviews_2023"])
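
The summary mentions streaming support. With the datasets library, streaming is normally requested by passing streaming=True to load_dataset, which yields rows lazily instead of materializing the whole split. The sketch below shows one way to consume such a stream in a single pass; `top_pages` is a hypothetical helper, and a small generator stands in for the real stream so the snippet runs without downloading anything.

```python
import heapq


def top_pages(rows, k=3):
    """Return the k most-viewed (title, pageviews) pairs from any iterable of rows.

    heapq.nlargest keeps only k items in memory, so it also works on a lazy stream.
    """
    pairs = ((row["pageviews_2023"], row["title"]) for row in rows)
    return [(title, views) for views, title in heapq.nlargest(k, pairs)]


def fake_stream():
    """Stand-in for a streamed split; the titles and counts are made up."""
    yield {"title": "A", "pageviews_2023": 5}
    yield {"title": "B", "pageviews_2023": 50}
    yield {"title": "C", "pageviews_2023": 20}
    yield {"title": "D", "pageviews_2023": 1}


# With the real loader, the iterable would presumably come from something like:
#   load_dataset("wikipedia_2023_redirects", split="train", streaming=True)
print(top_pages(fake_stream(), k=2))  # [('B', 50), ('C', 20)]
```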

Acknowledgements

  • Wikipedia/Wikimedia Foundation for the source data
  • Hugging Face Datasets for the dataset infrastructure
