SIMSSA DB Ingestion #455

Open
SCN-MNG wants to merge 20 commits into main from simssadb-ingestion

@SCN-MNG SCN-MNG commented Nov 14, 2025

This PR introduces the scripts, configuration files, and documentation necessary to ingest SIMSSA DB data.

Scripts (simssa/src/)

export_all_tables.py

  • Exports all SIMSSA DB tables from the PostgreSQL dump to CSV.
  • Classifies the CSVs into appropriate subdirectories to facilitate downstream processing.
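
As a rough illustration of the "classify into subdirectories" step, here is a minimal sketch using only the standard library. The prefix-to-directory mapping, the table names, and the helper names `classify_table` and `write_table_csv` are assumptions made up for this example; the actual rules in export_all_tables.py may differ.

```python
import csv
from pathlib import Path

# Hypothetical routing table: table-name prefix -> subdirectory.
# These names are illustrative assumptions, not the real SIMSSA DB layout.
CATEGORY_BY_PREFIX = {
    "auth_": "django_internal",
    "django_": "django_internal",
}
DEFAULT_CATEGORY = "data"


def classify_table(table_name: str) -> str:
    """Return the subdirectory a table's CSV should be written to."""
    for prefix, category in CATEGORY_BY_PREFIX.items():
        if table_name.startswith(prefix):
            return category
    return DEFAULT_CATEGORY


def write_table_csv(out_root, table_name, header, rows) -> Path:
    """Write one table as <out_root>/<category>/<table_name>.csv."""
    target_dir = Path(out_root) / classify_table(table_name)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / f"{table_name}.csv"
    with target.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)
    return target
```

Keeping the routing rule in one small function means downstream scripts can rely on a fixed directory layout without knowing how each table was categorized.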

merge.py

  • Merges and processes exported CSVs.

Deleted scripts

  • flattening/SQL_query.py
  • flattening/restructure.py

Both deleted scripts are fully replaced by the new scripts, which are more concise and easier to maintain.


OpenRefine Configuration Files (simssa/openrefine/)

Export Settings

  • export/export_person.json
  • export/export_work.json

Configure proper export formats for person.csv and work.csv (the only files needing reconciliation).

History / Reconciliation Procedures

  • history/history_person.json
  • history/history_work.json

Allow users to automatically reapply the same reconciliation steps.


RDF Conversion Config

shared/rdf_config/simssadb.toml

  • Enables conversion of SIMSSA DB entities into RDF using the shared conversion pipeline (shared/rdfconv/convert.py).

Documentation

simssa/README.md

  • Add documentation on importing the PostgreSQL dump.
  • Add documentation on the SIMSSA DB data schema and entity types.
  • Remove sections on outdated ingestion procedures.

Copilot AI left a comment

Pull Request Overview

This PR introduces a complete overhaul of the SIMSSA DB ingestion pipeline, replacing legacy SQL-based scripts with a cleaner, more maintainable Python-based approach using CSV exports and pandas operations.

Key changes:

  • New export and merge scripts using pandas for data processing instead of complex SQL queries
  • OpenRefine configuration files for entity reconciliation workflows
  • RDF conversion configuration to integrate with the shared conversion pipeline
  • Comprehensive documentation updates explaining the database structure and ingestion process

Reviewed Changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| simssa/src/export_all_tables.py | Exports all PostgreSQL tables to categorized CSV files |
| simssa/src/merge.py | Merges and processes raw CSVs into consolidated entity files |
| simssa/src/flattening/SQL_query.py | Removed legacy SQL-based flattening script |
| simssa/src/flattening/restructure.py | Removed legacy pandas restructuring script |
| simssa/openrefine/history/history_work.json | Reconciliation workflow for musical works |
| simssa/openrefine/history/history_person.json | Reconciliation workflow for persons |
| simssa/openrefine/export/export_work.json | Export configuration for reconciled work data |
| simssa/openrefine/export/export_person.json | Export configuration for reconciled person data |
| shared/rdf_config/simssadb.toml | RDF conversion configuration for SIMSSA DB entities |
| simssa/README.md | Comprehensive ingestion documentation with database schema overview |

)
# Clean work_title by removing brackets and quotes
work_df["work_title"] = work_df["work_title"].str.replace(
r"[\[\]'\"’]", "", regex=True
Copilot AI Nov 14, 2025

The character class in this regex mixes straight and curly quotes. The pattern [\[\]'\"’] contains a straight single quote ('), an escaped double quote (\"), and a curly right single quotation mark (’, U+2019); the curly quote appears unintentional. If both straight and curly quotes should be removed, document that in the comment above the call. If only straight quotes are intended, remove the curly quote character.

Suggested change
r"[\[\]'\"’]", "", regex=True
r"[\[\]'\"\"]", "", regex=True
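
To make the difference concrete, the following sketch compares a character class that includes the curly quote (U+2019) with one limited to brackets and straight quotes. It uses the standard `re` module rather than pandas so it runs without dependencies; the regex syntax is the same one accepted by `str.replace(..., regex=True)`. The sample title and the helper name `clean_title` are assumptions for illustration, not data or code from this PR.

```python
import re

# Brackets, straight single quote, straight double quote, and the
# curly right single quotation mark (U+2019), as described in the review.
WITH_CURLY = re.compile("[\\[\\]'\"\u2019]")

# Variant limited to brackets and straight quotes only.
STRAIGHT_ONLY = re.compile(r"[\[\]'\"]")


def clean_title(title: str, pattern: re.Pattern) -> str:
    """Remove every character matched by the given character class."""
    return pattern.sub("", title)


# Hypothetical title as it might appear after a list was stringified:
# wrapped in brackets and straight quotes, with a typographic apostrophe inside.
sample = "['L\u2019homme arm\u00e9']"

print(clean_title(sample, WITH_CURLY))     # apostrophe inside the title is removed too
print(clean_title(sample, STRAIGHT_ONLY))  # apostrophe inside the title is kept
```

If titles can contain typographic apostrophes, including U+2019 in the class will also strip those, so the choice between the two variants depends on what the source data actually contains.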

SCN-MNG and others added 5 commits November 14, 2025 09:21
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>