The current README provides a textual explanation of the Wikimedia Dumps automation workflow, but it lacks a visual representation of how the different components interact with each other.
To improve onboarding for new contributors and make the system easier to understand, a detailed architecture diagram should be added to the README.
Proposed update:
- Add a Mermaid-based flowchart that illustrates the complete high-level pipeline, including:
  - Wikimedia dumps HTML pages (dumps.wikimedia.org)
  - Crawling and HTML parsing using `wiki_dumps_crawler.py`
  - Detection of newly published and finished dumps
  - Storage of discovered dump URLs in `crawled_urls.txt`
  - Publishing workflow via `wikimedia_publish.py`
  - Publication of RDF metadata to the Databus API
  - Availability of metadata in the Databus knowledge graph for SPARQL queries
The diagram should clearly show:
- Data flow between components
- Decision points (e.g., presence of new dumps)
- The end-to-end automation from dump discovery to Databus publication
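As a starting point, here is a rough sketch of what such a Mermaid flowchart could look like, built only from the components and decision point listed above. The node labels and the exact wording of the decision are illustrative and should be adjusted to match the actual behavior of the scripts:

```mermaid
flowchart TD
    A[dumps.wikimedia.org<br/>dump HTML pages] --> B[wiki_dumps_crawler.py<br/>crawl and parse HTML]
    B --> C{New finished<br/>dumps found?}
    C -- no --> W[Wait for next<br/>scheduled run]
    C -- yes --> D[crawled_urls.txt<br/>store discovered dump URLs]
    D --> E[wikimedia_publish.py<br/>build RDF metadata]
    E --> F[Databus API<br/>publish metadata]
    F --> G[Databus knowledge graph<br/>queryable via SPARQL]
```

GitHub renders `mermaid` code fences natively in READMEs, so the diagram would stay version-controlled alongside the documentation and could be updated in the same pull requests that change the scripts.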
This visual overview will make the repository more accessible to first-time contributors and clarify how the crawler and publishing scripts work together within the DBpedia infrastructure.