Scalable Knowledge Ingestion Pipeline for wiki-style content | Python • MongoDB • Adaptive Rate Limiting

brooktewabe/Knowledge-Ingestion-Tool-python

Knowledge Ingestion Service (Wikipedia)

A professional-grade data ingestion pipeline designed to fetch, extract, and persist structured knowledge from Wikipedia at scale.

🚀 Key Features

  • Rate Limiting: Intelligent delays between requests to respect robots.txt and prevent IP bans.
  • Job Queue: Managed processing of target URLs using an asynchronous-ready architecture.
  • Deduplication: MongoDB unique index enforcement to ensure data integrity and zero redundancy.
  • Structured Storage: Extraction of titles, summaries, and complex tables into a document-oriented database.
  • Resilience: Robust error handling and logging for production stability.
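The rate-limiting idea can be sketched in a few lines. This is a minimal illustration only; the `RateLimiter` class name and `min_interval` parameter are assumptions, not the project's actual API:

```python
import time


class RateLimiter:
    """Enforce a minimum delay between consecutive outgoing requests."""

    def __init__(self, min_interval: float = 1.0):
        self.min_interval = min_interval   # seconds to wait between requests
        self._last_request = 0.0

    def wait(self) -> None:
        """Block until at least `min_interval` seconds since the last call."""
        elapsed = time.monotonic() - self._last_request
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last_request = time.monotonic()


limiter = RateLimiter(min_interval=0.1)
for _ in range(3):
    limiter.wait()   # second and third calls each pause ~0.1 s
```

A production limiter would typically also back off on HTTP 429/503 responses; this sketch only spaces requests evenly.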

🏛️ Architecture

The service is built with modularity in mind:

  • pipeline.py: Orchestrates the ingestion flow.
  • ingestion.py: Contains the logic for BeautifulSoup parsing and cleaning.
  • storage.py: Manages the connection and operations with MongoDB.
  • config.py: Centralized configuration for easy deployment.
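For a flavour of what the parsing step in ingestion.py involves, here is a minimal BeautifulSoup sketch that pulls the title and lead paragraph out of Wikipedia-style HTML. The function name, selectors, and sample markup are illustrative assumptions, not the module's real code:

```python
from bs4 import BeautifulSoup

# Tiny stand-in for a fetched Wikipedia page.
HTML = """
<html>
  <head><title>SpaceX - Wikipedia</title></head>
  <body>
    <h1 id="firstHeading">SpaceX</h1>
    <p>SpaceX is an American spacecraft manufacturer.</p>
  </body>
</html>
"""


def extract_article(html: str) -> dict:
    """Extract the title and lead paragraph from a Wikipedia-style page."""
    soup = BeautifulSoup(html, "html.parser")
    heading = soup.find("h1", id="firstHeading")   # Wikipedia's article title element
    first_para = soup.find("p")                    # first paragraph as the summary
    return {
        "title": heading.get_text(strip=True) if heading else None,
        "summary": first_para.get_text(strip=True) if first_para else None,
    }


doc = extract_article(HTML)
print(doc["title"])   # SpaceX
```

Real article markup is messier (infoboxes, citation superscripts, nested tables), which is why the actual module also needs cleaning logic.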

🛠️ Tech Stack

  • Language: Python 3.x
  • Parsing: BeautifulSoup4, Requests
  • Database: MongoDB (via Pymongo)
  • Logging: Python Standard Logging

🏁 Getting Started

Prerequisites

  • MongoDB: Ensure MongoDB is installed and running locally at the default URI (mongodb://localhost:27017).
  • Python: Version 3.8 or higher.

Installation

  1. Clone the repository and navigate to the project directory.
  2. Install the required dependencies:
    pip install pymongo requests beautifulsoup4

🚀 Usage

Run with defaults (defined in config.py)

python pipeline.py

Run with specific URLs

You can provide one or more URLs directly via the command line. These will be processed first, followed by the default list.

python pipeline.py https://en.wikipedia.org/wiki/SpaceX https://en.wikipedia.org/wiki/NASA

The service will fetch the pages, apply rate limiting, and store the structured data in MongoDB while automatically handling duplicates.
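The duplicate handling above relies on MongoDB rejecting a second insert with the same indexed key: in PyMongo that means `collection.create_index("url", unique=True)` and catching `pymongo.errors.DuplicateKeyError` (the `url` field name is an assumption about this project's schema). The effect can be illustrated without a running database:

```python
class DedupStore:
    """In-memory stand-in for a collection with a unique index on 'url'."""

    def __init__(self):
        self._docs = {}

    def insert(self, doc: dict) -> bool:
        """Insert a document, rejecting duplicates by URL.

        Returning False mirrors the DuplicateKeyError a unique index raises.
        """
        key = doc["url"]
        if key in self._docs:
            return False   # duplicate URL: unique-index violation
        self._docs[key] = doc
        return True


store = DedupStore()
store.insert({"url": "https://en.wikipedia.org/wiki/NASA", "title": "NASA"})   # accepted
store.insert({"url": "https://en.wikipedia.org/wiki/NASA", "title": "NASA"})   # rejected
```

Pushing uniqueness into the database, rather than checking in application code, keeps deduplication correct even when several workers ingest concurrently.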

📚 Why is this better than a simple scraper?

Check out challenges.md for a deep dive into the complexities of scraping at scale and how this service addresses them.
