A professional-grade data ingestion pipeline designed to fetch, extract, and persist structured knowledge from Wikipedia at scale.
- Rate Limiting: Intelligent delays between requests to respect robots.txt and prevent IP bans.
- Job Queue: Managed processing of target URLs using an asynchronous-ready architecture.
- Deduplication: MongoDB unique index enforcement to ensure data integrity and zero redundancy.
- Structured Storage: Extraction of titles, summaries, and complex tables into a document-oriented database.
- Resilience: Robust error handling and logging for production stability.
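The rate-limiting behavior described above can be sketched as a small helper. This is a minimal illustration, not the service's actual API; the class name and interval are assumptions, and a production version might also honor robots.txt crawl-delay directives and add jitter:

```python
import time

class RateLimiter:
    """Enforce a minimum delay between consecutive requests (illustrative sketch)."""

    def __init__(self, min_interval: float = 1.0):
        self.min_interval = min_interval
        self._last_request = 0.0  # monotonic timestamp of the last request

    def wait(self) -> None:
        """Sleep just long enough to keep requests min_interval apart."""
        elapsed = time.monotonic() - self._last_request
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last_request = time.monotonic()
```

Calling `limiter.wait()` before each fetch guarantees consecutive requests are spaced at least `min_interval` seconds apart, regardless of how fast the surrounding loop runs.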
The service is built with modularity in mind:
- pipeline.py: Orchestrates the ingestion flow.
- ingestion.py: Contains the logic for BeautifulSoup parsing and cleaning.
- storage.py: Manages the connection and operations with MongoDB.
- config.py: Centralized configuration for easy deployment.
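The flow through these modules might look roughly like the following. All function names here are illustrative stand-ins, not the real module APIs; the point is the shape of the loop, including the log-and-continue error handling mentioned under Resilience:

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")

def run_pipeline(urls, fetch, parse, store, limiter=None):
    """Process each URL in order: rate-limit, fetch, parse, persist.

    fetch/parse/store are stand-ins for the real ingestion.py and
    storage.py functions; limiter is any object with a wait() method.
    """
    succeeded = []
    for url in urls:
        if limiter is not None:
            limiter.wait()
        try:
            html = fetch(url)
            doc = parse(html)
            doc["url"] = url
            store(doc)
            succeeded.append(url)
        except Exception:
            # Resilience: a single bad page must not kill the whole run.
            log.exception("Failed to ingest %s", url)
    return succeeded
```

Keeping the fetch/parse/store steps behind plain function boundaries is what makes the modules independently testable and swappable.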
- Language: Python 3.8+
- Parsing: BeautifulSoup4, Requests
- Database: MongoDB (via PyMongo)
- Logging: Python Standard Logging
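With PyMongo, the deduplication guarantee reduces to two operations: create a unique index on the URL once, then upsert keyed on that URL. A sketch under the assumption that `collection` is a `pymongo.collection.Collection`; the function names are illustrative:

```python
def ensure_unique_index(collection):
    """Create a unique index on 'url' so duplicates are rejected at the DB level."""
    collection.create_index("url", unique=True)

def upsert_article(collection, doc):
    """Insert the document, or replace the fields of the one with the same URL.

    update_one(..., upsert=True) makes re-ingesting a page idempotent,
    so repeated runs never produce redundant documents.
    """
    collection.update_one({"url": doc["url"]}, {"$set": doc}, upsert=True)
```

The unique index is a safety net even if application code misbehaves: a second insert of the same URL fails at the database rather than silently duplicating data.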
- MongoDB: Ensure you have MongoDB installed and running locally (mongodb://localhost:27017).
- Python: Version 3.8 or higher.
- Clone the repository and navigate to the project directory.
- Install the required dependencies:
pip install pymongo requests beautifulsoup4
python pipeline.py

You can provide one or more URLs directly via the command line. These will be processed first, followed by the default list.
python pipeline.py https://en.wikipedia.org/wiki/SpaceX https://en.wikipedia.org/wiki/NASA

The service will fetch the pages, apply rate limiting, and store the structured data in MongoDB while automatically handling duplicates.
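The ordering rule above (command-line URLs first, then the defaults) can be sketched as a small queue builder. `DEFAULT_URLS` is a placeholder here; the real defaults presumably live in config.py:

```python
import sys

# Hypothetical default targets; the real list would come from config.py.
DEFAULT_URLS = [
    "https://en.wikipedia.org/wiki/Python_(programming_language)",
]

def build_job_queue(cli_urls, defaults):
    """Command-line URLs run first, then the defaults, with duplicates dropped."""
    queue, seen = [], set()
    for url in list(cli_urls) + list(defaults):
        if url not in seen:
            seen.add(url)
            queue.append(url)
    return queue

if __name__ == "__main__":
    for url in build_job_queue(sys.argv[1:], DEFAULT_URLS):
        print(url)
```

Deduplicating at queue-build time keeps a URL passed on the command line from being fetched a second time when it also appears in the defaults.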
Check out challenges.md for a deep dive into the complexities of scraping at scale and how this service addresses them.