ulsdevteam/eprints-hyku-reconciliation

Eprints to Hyku data migration validation

About

Given a migration from Eprints 3.3 to Hyku 6.2 (Hyrax 5.0), we'll use OAI-PMH to compare the source system with the destination system.

Assumptions

  • The Eprints URI was ingested into Hyku as the identifier.
  • The Eprints oai_dc identifiers that look like URLs are Eprints files.
  • The Hyku oai_hyku file_url values are Hyku files.
  • Only public records are considered.
  • Tombstones in Eprints are ignored: an OAI-PMH record with no metadata is skipped.
  • Collections and non-Eprints works in Hyku are ignored: a record with no identifier is skipped.
  • Collection memberships are not (yet?) considered.
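The assumptions above can be sketched as a pair of indexing functions. This is a minimal illustration, not the script in this repository: the function names, the `(oai_identifier, metadata)` pair shape, and the `identifier` and `file_url` dictionary keys are assumptions based on how Sickle typically exposes metadata as a dict of lists.

```python
from urllib.parse import urlparse


def last_component(uri):
    """Return the final path component of a URI, ignoring a trailing slash."""
    return urlparse(uri.strip()).path.rstrip("/").rsplit("/", 1)[-1]


def looks_like_url(value):
    """True for http(s) values with a host; citations and plain text are rejected."""
    parsed = urlparse(value.strip())
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def index_eprints(records):
    """Map Eprints numeric id -> list of file names.

    `records` is an iterable of (oai_identifier, metadata) pairs, where
    metadata is a dict of lists, or None for a tombstone (hypothetical shape).
    """
    index = {}
    for oai_identifier, metadata in records:
        if not metadata:  # tombstone: no metadata on the OAI-PMH record
            continue
        numeric_id = oai_identifier.rsplit(":", 1)[-1]
        index[numeric_id] = [
            last_component(value)
            for value in metadata.get("identifier", [])
            if value and looks_like_url(value)
        ]
    return index


def index_hyku(records):
    """Map Eprints numeric id -> (Hyku UUID, count of file_url values)."""
    index = {}
    for oai_identifier, metadata in records:
        identifiers = [v for v in (metadata or {}).get("identifier", []) if v]
        if not identifiers:  # collection or non-Eprints work: no identifier
            continue
        uuid = oai_identifier.rsplit(":", 1)[-1]
        numeric_id = last_component(identifiers[0])
        index[numeric_id] = (uuid, len(metadata.get("file_url", [])))
    return index
```

Keying both indexes on the Eprints numeric id means the later comparison is a plain dictionary lookup.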

LLM baseline

Prompt

Create a python script, using Sickle for OAI-PMH access which compares the output from two sources: Eprints and Hyku. The script will take the base URL for each as a parameter. For Eprints, the OAI identifier is in the form of three colon-separated values, where the last value is a numeric id. For Eprints, for each identifier which is a URI in the oai_dc metadata, extract the final URI component after the last slash. These final components represent the Eprints files. Keep track of the numeric id from the Eprints OAI identifier and the count of the Eprints files, and a list of the Eprints files. For Hyku, the OAI identifier is in the form of three colon-separated values, where the last value is a UUID. For Hyku, the identifier element in the oai_hyku metadata is a URI where the last URI component should match the numeric id from an Eprints OAI identifier. If the oai_hyku metadata does not have an identifier element, skip this Hyku record. For Hyku records with the element, the files are represented by a file_url element in the oai_hyku metadata. Keep track of the Hyku UUID, and count of Hyku files. Report when there are Eprints records without a match in Hyku, or Eprints record where the count of files in Eprints does not match the count of files in Hyku. For these failed matches, report the Eprints numeric id, the list of Eprints files, and Hyku UUID and the count of Hyku files (if applicable).

Response

I'll create a Python script that compares OAI-PMH records from Eprints and Hyku sources using the Sickle library.

Key Features:

  • Eprints Processing: Extracts numeric IDs from OAI identifiers and collects file URIs from identifier elements in oai_dc metadata
  • Hyku Processing: Extracts UUIDs and matches records based on the numeric ID found in the identifier element, counts file_url elements
  • Comparison: Reports two types of issues:
    • Records present in Eprints but missing in Hyku
    • Records where file counts don't match
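The two kinds of issue listed above could be detected with a single pass over the Eprints records. The sketch below assumes two already-built dictionaries (Eprints numeric id to file list, and Eprints numeric id to a Hyku UUID and file count); the function name, the dictionary shapes, and the `reason` strings are hypothetical.

```python
def find_discrepancies(eprints, hyku):
    """Return one dict per Eprints record that is missing or mismatched in Hyku.

    eprints: {numeric_id: [file names]}
    hyku:    {numeric_id: (uuid, file_count)}
    """
    problems = []
    for numeric_id, files in eprints.items():
        uuid, hyku_count = hyku.get(numeric_id, (None, None))
        if uuid is None:
            reason = "missing in Hyku"
        elif hyku_count != len(files):
            reason = "file count mismatch"
        else:
            continue  # record matches; nothing to report
        problems.append({
            "eprints_id": numeric_id,
            "eprints_files": files,
            "hyku_uuid": uuid,
            "hyku_file_count": hyku_count,
            "reason": reason,
        })
    return problems
```

Only Eprints records drive the loop, so Hyku records with no Eprints counterpart are never reported, which is consistent with the comparison described here.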

Usage:

pip install sickle

python script.py \
  --eprints-url "https://eprints.example.edu/cgi/oai2" \
  --hyku-url "https://hyku.example.edu/catalog/oai"
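For orientation, a command line like the one above might be wired up as follows. This is a sketch under stated assumptions rather than the generated script: `parse_args` and `harvest` are hypothetical names, and the only Sickle calls used are `Sickle(base_url)`, `ListRecords(metadataPrefix=...)`, `record.header.identifier`, and `record.metadata`.

```python
import argparse


def parse_args(argv=None):
    """Parse the two base URLs shown in the usage example."""
    parser = argparse.ArgumentParser(
        description="Compare Eprints and Hyku records over OAI-PMH."
    )
    parser.add_argument("--eprints-url", required=True,
                        help="Eprints OAI-PMH base URL")
    parser.add_argument("--hyku-url", required=True,
                        help="Hyku OAI-PMH base URL")
    return parser.parse_args(argv)


def harvest(base_url, metadata_prefix):
    """Yield (oai_identifier, metadata) pairs from an OAI-PMH endpoint.

    metadata is None when the record carries no metadata (for example a
    tombstone), so callers can skip it.
    """
    # Imported lazily so this module can be loaded without Sickle installed.
    from sickle import Sickle

    for record in Sickle(base_url).ListRecords(metadataPrefix=metadata_prefix):
        yield record.header.identifier, getattr(record, "metadata", None)
```

With these in place, `harvest(args.eprints_url, "oai_dc")` and `harvest(args.hyku_url, "oai_hyku")` would supply the two record streams to compare.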

The script provides detailed reporting including:

  • List of all discrepancies with Eprints ID, file lists, and counts
  • Hyku UUID and file count for mismatches
  • Summary statistics
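The script's actual output format is not shown here. As a purely illustrative layout, a report covering the items above could be assembled like this; the function name, the shape of each problem dict, and the wording of every line are assumptions.

```python
def format_report(problems, eprints_total):
    """Build a plain-text report from a list of discrepancy dicts.

    Each dict is assumed to have the keys eprints_id, eprints_files,
    hyku_uuid, hyku_file_count and reason (hypothetical shape).
    """
    lines = []
    for p in problems:
        files = ", ".join(p["eprints_files"]) or "(none)"
        lines.append(f"Eprints {p['eprints_id']}: {p['reason']}")
        lines.append(f"  Eprints files ({len(p['eprints_files'])}): {files}")
        if p["hyku_uuid"] is not None:  # only present for count mismatches
            lines.append(f"  Hyku {p['hyku_uuid']}: {p['hyku_file_count']} file(s)")
    # Summary line: how many Eprints records failed out of those examined.
    lines.append(f"{len(problems)} of {eprints_total} Eprints records have discrepancies")
    return "\n".join(lines)
```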

License

Initial commit is public domain under U.S. law. Subsequent edits are Copyright (c) University of Pittsburgh and licensed under Zero-Clause BSD.
