Add detailed architecture diagram to README for Wikimedia Dumps pipeline #5

@alizahh-7

Description

The current README explains the Wikimedia Dumps automation workflow in text only; it lacks a visual representation of how the different components interact.

To improve onboarding for new contributors and make the system easier to understand, a detailed architecture diagram should be added to the README.

Proposed update:

Add a Mermaid-based flowchart that illustrates the complete high-level pipeline, including:

  1. Wikimedia dumps HTML pages (dumps.wikimedia.org)
  2. Crawling and HTML parsing using wiki_dumps_crawler.py
  3. Detection of newly published and finished dumps
  4. Storage of discovered dump URLs in crawled_urls.txt
  5. Publishing workflow via wikimedia_publish.py
  6. Publication of RDF metadata to the Databus API
  7. Availability of the metadata in the Databus knowledge graph for SPARQL queries

The diagram should clearly show:

  1. Data flow between components
  2. Decision points (e.g., presence of new dumps)
  3. The end-to-end automation from dump discovery to Databus publication
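A possible starting point for the diagram, sketching the pipeline and decision point described above (component names are taken from this issue; the exact branch labels and decision logic are assumptions):

```mermaid
flowchart TD
    A[Wikimedia dumps HTML pages<br/>dumps.wikimedia.org] --> B[wiki_dumps_crawler.py<br/>crawling and HTML parsing]
    B --> C{Newly published,<br/>finished dumps?}
    C -- no --> Z[Wait for next run]
    C -- yes --> D[crawled_urls.txt<br/>discovered dump URLs]
    D --> E[wikimedia_publish.py<br/>publishing workflow]
    E --> F[Databus API<br/>RDF metadata publication]
    F --> G[Databus knowledge graph<br/>available for SPARQL queries]
```

GitHub renders fenced ```mermaid blocks natively in the README, so no image file needs to be committed or kept in sync.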

This visual overview will make the repository more accessible to first-time contributors and clarify how the crawler and publishing scripts work together within the DBpedia infrastructure.
