… reconciliation and RDF conversion.
… and ingestion process
Pull Request Overview
This PR introduces a complete overhaul of the SIMSSA DB ingestion pipeline, replacing legacy SQL-based scripts with a cleaner, more maintainable Python-based approach using CSV exports and pandas operations.
Key changes:
- New export and merge scripts using pandas for data processing instead of complex SQL queries
- OpenRefine configuration files for entity reconciliation workflows
- RDF conversion configuration to integrate with the shared conversion pipeline
- Comprehensive documentation updates explaining the database structure and ingestion process
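As a rough illustration of what replacing a SQL join with a pandas operation can look like, the sketch below performs the equivalent of a `LEFT JOIN` on two small in-memory tables. The column names and rows are made up for the example and are not taken from `merge.py`.

```python
import pandas as pd

# Made-up rows standing in for two raw CSV exports (hypothetical columns).
works = pd.DataFrame(
    {
        "work_id": [1, 2, 3],
        "work_title": ["Title A", "Title B", "Title C"],
        "composer_id": [10, 11, 12],
    }
)
persons = pd.DataFrame({"person_id": [10, 11], "name": ["Person X", "Person Y"]})

# Equivalent of: SELECT ... FROM works LEFT JOIN persons
#                ON works.composer_id = persons.person_id
merged = works.merge(
    persons,
    left_on="composer_id",
    right_on="person_id",
    how="left",               # keep works even when no person matches
    validate="many_to_one",   # fail loudly if person_id is not unique
).drop(columns="person_id")

print(merged)
```

The `validate` argument is optional; it is included here because a duplicated key on the right side would otherwise silently multiply rows.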
Reviewed Changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| simssa/src/export_all_tables.py | Exports all PostgreSQL tables to categorized CSV files |
| simssa/src/merge.py | Merges and processes raw CSVs into consolidated entity files |
| simssa/src/flattening/SQL_query.py | Removed legacy SQL-based flattening script |
| simssa/src/flattening/restructure.py | Removed legacy pandas restructuring script |
| simssa/openrefine/history/history_work.json | Reconciliation workflow for musical works |
| simssa/openrefine/history/history_person.json | Reconciliation workflow for persons |
| simssa/openrefine/export/export_work.json | Export configuration for reconciled work data |
| simssa/openrefine/export/export_person.json | Export configuration for reconciled person data |
| shared/rdf_config/simssadb.toml | RDF conversion configuration for SIMSSA DB entities |
| simssa/README.md | Comprehensive ingestion documentation with database schema overview |
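For readers unfamiliar with the "dump every table to CSV" pattern that the table attributes to `export_all_tables.py`, here is a minimal sketch. It is not the PR's script: it uses an in-memory SQLite database as a stand-in for PostgreSQL so that it runs without a server, it writes one flat directory and does not categorize files, and the function signature and sample row are assumptions.

```python
import csv
import sqlite3
import tempfile
from pathlib import Path


def export_all_tables(conn: sqlite3.Connection, out_dir: Path) -> list:
    """Write every table in `conn` to `out_dir/<table>.csv`; return the paths."""
    out_dir.mkdir(parents=True, exist_ok=True)
    # SQLite keeps its catalog in sqlite_master; PostgreSQL would use
    # information_schema.tables for the same purpose.
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    written = []
    for table in tables:
        # Table names come from the catalog, not from user input.
        cursor = conn.execute(f'SELECT * FROM "{table}"')
        path = out_dir / f"{table}.csv"
        with path.open("w", newline="", encoding="utf-8") as handle:
            writer = csv.writer(handle)
            writer.writerow([col[0] for col in cursor.description])  # header row
            writer.writerows(cursor)                                 # data rows
        written.append(path)
    return written


# Usage with a throwaway database and a made-up row.
with tempfile.TemporaryDirectory() as tmp:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE person (id INTEGER, name TEXT)")
    conn.execute("INSERT INTO person VALUES (1, 'Example Person')")
    paths = export_all_tables(conn, Path(tmp))
    exported = {p.name: p.read_text(encoding="utf-8") for p in paths}
    conn.close()

print(exported)
```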
    )
    # Clean work_title by removing brackets and quotes
    work_df["work_title"] = work_df["work_title"].str.replace(
        r"[\[\]'\"']", "", regex=True
The character class in this regex appears to contain an unintended character. Alongside the straight single quote and the escaped double quote (`\"`), the pattern `[\[\]'\"']` includes `’`, the curly right single quotation mark (U+2019). If both straight and curly quotes should be removed, document that intent. If only straight quotes are intended, remove the curly quote character.
    - r"[\[\]'\"']", "", regex=True
    + r"[\[\]'\"\"]", "", regex=True
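To make the straight-versus-curly distinction raised in this comment concrete, the sketch below compares two character classes using the standard `re` module. The constant names, the helper `clean_title`, and the sample title are hypothetical. Spelling the curly quotes as `\u2018`, `\u2019`, `\u201c`, and `\u201d` escapes is one possible way to keep the intent visible in the source, not necessarily what the PR should adopt.

```python
import re

# Straight quotes and square brackets only.
STRAIGHT_ONLY = re.compile(r"[\[\]'\"]")

# Same class plus the four common curly quotes, written as explicit escapes
# so they cannot be mistaken for straight quotes when reading the code.
WITH_CURLY = re.compile("[\\[\\]'\"\u2018\u2019\u201c\u201d]")


def clean_title(title: str, pattern: re.Pattern = STRAIGHT_ONLY) -> str:
    """Remove every character matched by `pattern` from `title`."""
    return pattern.sub("", title)


# A made-up title containing a curly apostrophe (U+2019).
sample = "['L\u2019homme arm\u00e9']"

print(clean_title(sample))              # curly apostrophe is kept
print(clean_title(sample, WITH_CURLY))  # curly apostrophe is removed
```

Whether the apostrophe inside a title should be stripped is a data question; the sketch only shows that the two classes behave differently on such input.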
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…alake into simssadb-ingestion
This PR introduces the scripts, configuration files, and documentation necessary to ingest SIMSSA DB data.
Scripts (simssa/src/)
- export_all_tables.py
- merge.py
Deleted scripts
Both deleted scripts are fully replaced by the new scripts, which are more concise and easier to maintain.
OpenRefine Configuration Files (simssa/openrefine/)
Export Settings
- export/export_person.json
- export/export_work.json
Configure proper export formats for person.csv and work.csv (the only files needing reconciliation).
History / Reconciliation Procedures
- history/history_person.json
- history/history_work.json
Allow users to automatically reapply the same reconciliation steps.
RDF Conversion Config
- shared/rdf_config/simssadb.toml
… shared/rdfconv/convert.py).
Documentation
- simssa/README.md