Given a migration from Eprints 3.3 to Hyku 6.2 (Hyrax 5.0), we'll use OAI-PMH to compare the source system with the destination system. The comparison rests on these assumptions:
- The Eprints URI was ingested into Hyku as the identifier.
- The Eprints oai_dc identifiers, when they look like URLs, are Eprints files
- The Hyku oai_hyku file_urls are Hyku files.
- Only public records are considered.
- Tombstones in Eprints are ignored: any Eprints OAI-PMH record without metadata is skipped.
- Collections and non-Eprints works in Hyku are ignored: any Hyku record without an identifier is skipped.
- Collection memberships are not (yet?) considered.
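The rules above can be sketched as two small functions that reduce a harvested record to the values the comparison needs. This is a minimal sketch rather than the script itself: the function names are hypothetical, and it assumes each record exposes `header.identifier` plus a `metadata` dict mapping element names to lists of strings (the shape Sickle gives harvested records). Stripping a trailing slash before taking the last URI component, and using only the first Hyku identifier, are also assumptions.

```python
from types import SimpleNamespace
from urllib.parse import urlparse


def last_component(url):
    """Return the final path component of a URL, ignoring a trailing slash."""
    return urlparse(url).path.rstrip("/").rsplit("/", 1)[-1]


def is_url(value):
    """True when the value parses as an http(s) URL with a host."""
    parsed = urlparse(value)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def reduce_eprints(record):
    """Return (numeric_id, [file names]), or None for a tombstone."""
    metadata = getattr(record, "metadata", None)
    if not metadata:  # no metadata on the record: treat as a tombstone
        return None
    numeric_id = record.header.identifier.rsplit(":", 1)[-1]
    files = [
        last_component(value)
        for value in metadata.get("identifier", [])
        if value and is_url(value)
    ]
    return numeric_id, files


def reduce_hyku(record):
    """Return (eprints_id, uuid, file_count), or None without an identifier."""
    metadata = getattr(record, "metadata", None) or {}
    identifiers = [value for value in metadata.get("identifier", []) if value]
    if not identifiers:  # collection or non-Eprints work
        return None
    uuid = record.header.identifier.rsplit(":", 1)[-1]
    file_count = len([value for value in metadata.get("file_url", []) if value])
    return last_component(identifiers[0]), uuid, file_count


# Stand-in record for illustration; real records would come from a harvest.
example = SimpleNamespace(
    header=SimpleNamespace(identifier="oai:eprints.example.edu:1234"),
    metadata={"identifier": ["https://eprints.example.edu/1234/1/thesis.pdf"]},
)
print(reduce_eprints(example))
```

Keeping these reducers free of network access makes them easy to check against hand-built records before running a full harvest.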
Create a Python script, using Sickle for OAI-PMH access, which compares the output from two sources: Eprints and Hyku. The script will take the base URL for each as a parameter.

For Eprints, the OAI identifier is in the form of three colon-separated values, where the last value is a numeric id. For each identifier in the oai_dc metadata which is a URI, extract the final URI component after the last slash. These final components represent the Eprints files. Keep track of the numeric id from the Eprints OAI identifier, the count of the Eprints files, and a list of the Eprints files.

For Hyku, the OAI identifier is in the form of three colon-separated values, where the last value is a UUID. The identifier element in the oai_hyku metadata is a URI whose last URI component should match the numeric id from an Eprints OAI identifier. If the oai_hyku metadata does not have an identifier element, skip this Hyku record. For Hyku records with the element, the files are represented by file_url elements in the oai_hyku metadata. Keep track of the Hyku UUID and the count of Hyku files.

Report Eprints records without a match in Hyku, and Eprints records where the count of files in Eprints does not match the count of files in Hyku. For these failed matches, report the Eprints numeric id, the list of Eprints files, and the Hyku UUID and count of Hyku files (if applicable).
I'll create a Python script that compares OAI-PMH records from Eprints and Hyku sources using the Sickle library.
Key Features:
- Eprints Processing: Extracts numeric IDs from OAI identifiers and collects file URIs from identifier elements in oai_dc metadata
- Hyku Processing: Extracts UUIDs and matches records based on the numeric ID found in the identifier element, counts file_url elements
- Comparison: Reports two types of issues:
  - Records present in Eprints but missing in Hyku
  - Records where file counts don't match
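The comparison step can be sketched on plain dictionaries, independent of how the records were harvested. This is an illustrative sketch, not the script's actual code: the `compare` function, the dictionary shapes, and the issue labels are assumptions chosen for the example.

```python
def compare(eprints, hyku):
    """Return a list of discrepancies between the two systems.

    eprints: {numeric_id: [file names]}
    hyku:    {numeric_id: (uuid, file_count)}
    Both shapes are assumptions made for this sketch.
    """
    problems = []
    for numeric_id, files in eprints.items():
        match = hyku.get(numeric_id)
        if match is None:
            # Eprints record with no counterpart in Hyku
            problems.append({
                "eprints_id": numeric_id,
                "eprints_files": files,
                "issue": "missing in Hyku",
                "hyku_uuid": None,
                "hyku_file_count": None,
            })
            continue
        uuid, file_count = match
        if file_count != len(files):
            # Counterpart exists, but the file counts disagree
            problems.append({
                "eprints_id": numeric_id,
                "eprints_files": files,
                "issue": "file count mismatch",
                "hyku_uuid": uuid,
                "hyku_file_count": file_count,
            })
    return problems


eprints = {"1": ["a.pdf"], "2": ["b.pdf", "c.pdf"], "3": []}
hyku = {"1": ("uuid-1", 1), "2": ("uuid-2", 1)}
for problem in compare(eprints, hyku):
    print(problem["eprints_id"], problem["issue"])
```

Only Eprints records drive the loop, so Hyku records with no Eprints counterpart are never reported, which matches the one-directional check described above.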
Usage:
pip install sickle
python script.py \
    --eprints-url "https://eprints.example.edu/cgi/oai2" \
    --hyku-url "https://hyku.example.edu/catalog/oai"
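For orientation, here is one way the two flags in the command above might be parsed and handed to Sickle. It is a sketch under assumptions: `parse_args` and `harvest` are hypothetical names, and the help text is invented for the example. The `Sickle(...).ListRecords(metadataPrefix=...)` call is Sickle's standard harvesting interface, and the `oai_dc` and `oai_hyku` prefixes are the ones named earlier.

```python
import argparse


def parse_args(argv=None):
    """Parse the two base URLs; both flags are required."""
    parser = argparse.ArgumentParser(
        description="Compare Eprints and Hyku records over OAI-PMH."
    )
    parser.add_argument("--eprints-url", required=True,
                        help="Eprints OAI-PMH base URL")
    parser.add_argument("--hyku-url", required=True,
                        help="Hyku OAI-PMH base URL")
    return parser.parse_args(argv)


def harvest(base_url, metadata_prefix):
    """Return an iterator over all records for the given metadata prefix."""
    # Imported here so argument parsing can be tried without Sickle installed.
    from sickle import Sickle
    return Sickle(base_url).ListRecords(metadataPrefix=metadata_prefix)


args = parse_args([
    "--eprints-url", "https://eprints.example.edu/cgi/oai2",
    "--hyku-url", "https://hyku.example.edu/catalog/oai",
])
print(args.eprints_url, args.hyku_url)
# A real run would then call, for example:
#   harvest(args.eprints_url, "oai_dc")
#   harvest(args.hyku_url, "oai_hyku")
```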
The script provides detailed reporting including:
- List of all discrepancies with Eprints ID, file lists, and counts
- Hyku UUID and file count for mismatches
- Summary statistics
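A report of this kind could be assembled as below. The layout, the `format_report` name, and the dictionary keys are hypothetical; the script's real output may differ.

```python
def format_report(problems, eprints_total):
    """Render discrepancies and a one-line summary as text.

    Each problem is a dict with the keys eprints_id, issue, eprints_files,
    hyku_uuid and hyku_file_count (an assumed shape for this sketch).
    """
    lines = []
    for problem in problems:
        files = problem["eprints_files"]
        file_list = ", ".join(files) if files else "(none)"
        lines.append(f"Eprints {problem['eprints_id']}: {problem['issue']}")
        lines.append(f"  Eprints files ({len(files)}): {file_list}")
        if problem.get("hyku_uuid"):
            # Hyku details only apply when a matching record was found
            lines.append(
                f"  Hyku {problem['hyku_uuid']}: "
                f"{problem['hyku_file_count']} file(s)"
            )
    lines.append(
        f"Summary: {len(problems)} of {eprints_total} Eprints records need attention"
    )
    return "\n".join(lines)


print(format_report(
    [{
        "eprints_id": "2",
        "issue": "file count mismatch",
        "eprints_files": ["b.pdf", "c.pdf"],
        "hyku_uuid": "uuid-2",
        "hyku_file_count": 1,
    }],
    eprints_total=3,
))
```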
Initial commit is public domain under U.S. law. Subsequent edits are Copyright (c) University of Pittsburgh and licensed under Zero-Clause BSD.