A professional-grade data ingestion pipeline designed to fetch, extract, and persist structured knowledge from Wikipedia at scale.
- Rate Limiting: Intelligent delays between requests to respect robots.txt and prevent IP bans.
- Job Queue: Managed processing of target URLs using an asynchronous-ready architecture.
- Deduplication: MongoDB unique index enforcement to ensure data integrity and zero redundancy.
- Structured Storage: Extraction of titles, summaries, and complex tables into a document-oriented database.
- Resilience: Robust error handling and logging for production stability.
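The rate-limiting behavior described above can be sketched as a small helper. This is a minimal illustration, not the service's actual API; the class name and interval are assumptions, and a production version might also honor robots.txt crawl-delay directives and add jitter:

```python
import time

class RateLimiter:
    """Enforce a minimum delay between consecutive requests (illustrative sketch)."""

    def __init__(self, min_interval: float = 1.0):
        self.min_interval = min_interval
        self._last_request = 0.0  # monotonic timestamp of the last request

    def wait(self) -> None:
        """Sleep just long enough to keep requests min_interval apart."""
        elapsed = time.monotonic() - self._last_request
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last_request = time.monotonic()
```

Calling `limiter.wait()` before each fetch guarantees consecutive requests are spaced at least `min_interval` seconds apart, regardless of how fast the surrounding loop runs.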
The service is built with modularity in mind:
- pipeline.py: Orchestrates the ingestion flow.
- ingestion.py: Contains the logic for BeautifulSoup parsing and cleaning.
- storage.py: Manages the connection and operations with MongoDB.
- config.py: Centralized configuration for easy deployment.
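The flow through these modules might look roughly like the following. All function names here are illustrative stand-ins, not the real module APIs; the point is the shape of the loop, including the log-and-continue error handling mentioned under Resilience:

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")

def run_pipeline(urls, fetch, parse, store, limiter=None):
    """Process each URL in order: rate-limit, fetch, parse, persist.

    fetch/parse/store are stand-ins for the real ingestion.py and
    storage.py functions; limiter is any object with a wait() method.
    """
    succeeded = []
    for url in urls:
        if limiter is not None:
            limiter.wait()
        try:
            html = fetch(url)
            doc = parse(html)
            doc["url"] = url
            store(doc)
            succeeded.append(url)
        except Exception:
            # Resilience: a single bad page must not kill the whole run.
            log.exception("Failed to ingest %s", url)
    return succeeded
```

Keeping the fetch/parse/store steps behind plain function boundaries is what makes the modules independently testable and swappable.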
- Language: Python 3.8+
- Parsing: BeautifulSoup4, Requests
- Database: MongoDB (via PyMongo)
- Logging: Python Standard Logging
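With PyMongo, the deduplication guarantee reduces to two operations: create a unique index on the URL once, then upsert keyed on that URL. A sketch under the assumption that `collection` is a `pymongo.collection.Collection`; the function names are illustrative:

```python
def ensure_unique_index(collection):
    """Create a unique index on 'url' so duplicates are rejected at the DB level."""
    collection.create_index("url", unique=True)

def upsert_article(collection, doc):
    """Insert the document, or replace the fields of the one with the same URL.

    update_one(..., upsert=True) makes re-ingesting a page idempotent,
    so repeated runs never produce redundant documents.
    """
    collection.update_one({"url": doc["url"]}, {"$set": doc}, upsert=True)
```

The unique index is a safety net even if application code misbehaves: a second insert of the same URL fails at the database rather than silently duplicating data.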
- MongoDB: Ensure you have MongoDB installed and running locally (mongodb://localhost:27017).
- Python: Version 3.8 or higher.
- Clone the repository and navigate to the project directory.
- Install the required dependencies:
pip install pymongo requests beautifulsoup4
python pipeline.py

You can provide one or more URLs directly via the command line. These will be processed first, followed by the default list.
python pipeline.py https://en.wikipedia.org/wiki/SpaceX https://en.wikipedia.org/wiki/NASA

The service will fetch the pages, apply rate limiting, and store the structured data in MongoDB while automatically handling duplicates.
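The ordering rule above (command-line URLs first, then the defaults) can be sketched as a small queue builder. `DEFAULT_URLS` is a placeholder here; the real defaults presumably live in config.py:

```python
import sys

# Hypothetical default targets; the real list would come from config.py.
DEFAULT_URLS = [
    "https://en.wikipedia.org/wiki/Python_(programming_language)",
]

def build_job_queue(cli_urls, defaults):
    """Command-line URLs run first, then the defaults, with duplicates dropped."""
    queue, seen = [], set()
    for url in list(cli_urls) + list(defaults):
        if url not in seen:
            seen.add(url)
            queue.append(url)
    return queue

if __name__ == "__main__":
    for url in build_job_queue(sys.argv[1:], DEFAULT_URLS):
        print(url)
```

Deduplicating at queue-build time keeps a URL passed on the command line from being fetched a second time when it also appears in the defaults.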
Check out challenges.md for a deep dive into the complexities of scraping at scale and how this service addresses them.