bin/fixity_check_replicated_moabs#2530

Draft
jmartin-sul wants to merge 2 commits into main from bin-fixity-check-replicated-moabs

@jmartin-sul jmartin-sul commented Jan 14, 2026

TODO to get this out of draft:

  • there's a deprecation warning about the current download approach and the need to switch to transfer manager. see here for an example of that switch: Switch to using Aws transfer manager. #2526
  • extract the meat of the functionality into a service class
  • write unit tests for said service class

Why was this change made? 🤔

This is the start of a script for pulling down and fixity checking content we've archived to cloud buckets. It shapes what were essentially my own Ruby console notes for this sort of check into a script that anyone can run. The machinery here could also form a starting point for #1076.

It has been useful for spot checking GCP replication on prod (as we did with its predecessor against stage and QA). See #2426 and #2518.

Usage

Usage: bin/fixity_check_replicated_moabs.rb [options]
        --druid_list DRUID_LIST      comma-separated (no spaces) list of bare druids (no prefixes)
        --druid_list_file DRUID_LIST_FILE
                                     file with a list of provided druids, e.g. from integration tests, manual tests, your own queries, etc
        --fixity_check_base_location FIXITY_CHECK_BASE_LOCATION
                                     target directory for downloading cloud archived Moabs, where they will be inflated and fixity checked.  ensure sufficient free space.
        --single_part_druid_sample_count SINGLE_PART_DRUID_SAMPLE_COUNT
                                     number of < 10 GB Moabs to query for and retrieve (default: 0)
        --multipart_druid_sample_count MULTIPART_DRUID_SAMPLE_COUNT
                                     number of > 10 GB Moabs to query for and retrieve (default: 0)
        --endpoints_to_audit ENDPOINTS_TO_AUDIT
                                     list of cloud endpoints to audit (comma-separated, no spaces, names from config)
        --[no-]force_part_md5_comparison
                                     Even if the zip parts are not downloaded on this run, compare the previously downloaded MD5 results to what is in the DB
        --[no-]dry_run               Simulate download and fixity check for druid list (defaults to false)
        --[no-]quiet                 Do not output progress information (defaults to false)
    -h, --help                       Displays help.
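
As an illustration of the kind of comparison `--force_part_md5_comparison` describes, here is a minimal sketch, not the script's actual code. The helper name `part_md5_mismatches` and the `expected_md5s` hash (standing in for whatever the DB records per zip part) are assumptions made for the example.

```ruby
require 'digest'

# Hypothetical helper: compare the MD5 of each downloaded zip part against
# an expected value. `expected_md5s` maps part filename => hex MD5 string and
# stands in for the values stored in the DB (an assumption for this sketch).
# Returns a hash of only the parts that are missing or do not match.
def part_md5_mismatches(download_dir, expected_md5s)
  expected_md5s.each_with_object({}) do |(filename, expected), mismatches|
    path = File.join(download_dir, filename)
    # A part that was never downloaded is reported with a nil actual MD5.
    actual = File.exist?(path) ? Digest::MD5.file(path).hexdigest : nil
    mismatches[filename] = { expected: expected, actual: actual } unless actual == expected
  end
end
```

An empty result would mean every listed part is present and matches its recorded checksum.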

How was this change tested? 🤨

it was used to fixity check the test objects listed in #2518. an earlier pass at the script failed to properly re-assemble the multi-part zips it had downloaded, and the fixity check detected the resulting missing moab manifest files. the files were indeed missing from that first zip extraction attempt, but they were present in the multi-part zip itself: once the script was updated to re-assemble the zip successfully, they were retrieved and the moabs in question passed fixity check. this was a convenient natural experiment: a corrupted zip extraction was caught by the check and resolved by fixing the extraction approach.

@jmartin-sul jmartin-sul force-pushed the bin-fixity-check-replicated-moabs branch 4 times, most recently from 011371a to 832b1de Compare January 15, 2026 06:12
Base automatically changed from refactor_replication to main January 16, 2026 02:35
@jmartin-sul jmartin-sul force-pushed the bin-fixity-check-replicated-moabs branch 4 times, most recently from e12aeb7 to 98ef6ad Compare January 21, 2026 19:49
@aaron-collier aaron-collier force-pushed the bin-fixity-check-replicated-moabs branch from 98ef6ad to a143291 Compare January 27, 2026 21:46
@aaron-collier aaron-collier force-pushed the bin-fixity-check-replicated-moabs branch from b18bda6 to 5c591f1 Compare January 29, 2026 19:19
@jmartin-sul

note that there is more indirection to go through if you want to retrieve from AWS endpoints, see #1076 (comment):

not all content is directly retrievable: while GCP lets you just retrieve from the bucket using a simple synchronous request, AWS endpoint policy is such that the content is shuffled off to cold storage, a request for its retrieval must be made, and then something must wait for that request to be fulfilled.
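
A rough sketch of that request-then-wait flow, for discussion only. It assumes `client` behaves like `Aws::S3::Client` from the aws-sdk-s3 gem (`restore_object`, and `head_object` returning a `restore` status string). The helper name, the polling defaults, and the choice of the `Bulk` tier are all assumptions, and handling of an already in-progress restore is left out.

```ruby
# Hypothetical helper: ask for an archived object to be restored from cold
# storage, then poll until the restore is reported complete.
# `client` is assumed to respond like Aws::S3::Client; it is injected so the
# flow can be exercised without network access.
# Returns true once the object is retrievable, false if polling runs out.
def retrieve_when_restored(client, bucket:, key:, days: 1, poll_interval: 60, max_polls: 720)
  client.restore_object(
    bucket: bucket,
    key: key,
    restore_request: { days: days, glacier_job_parameters: { tier: 'Bulk' } }
  )

  max_polls.times do
    # The restore status reads like: ongoing-request="false", expiry-date="..."
    restore_status = client.head_object(bucket: bucket, key: key).restore
    return true if restore_status&.include?('ongoing-request="false"')

    sleep(poll_interval)
  end
  false
end
```

In a script like this one, the download step would only run after the helper returns true. A long-running job or a separate re-check pass may suit better than blocking in a loop, since restores can take hours.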

these docs may be out of date in light of #2526, but these are the related tabs i had open from the bit of digging i did on this last year:

@aaron-collier aaron-collier force-pushed the bin-fixity-check-replicated-moabs branch from 5c591f1 to 04940bc Compare February 2, 2026 20:48
logger.info("#{download_path.dirname}/#{combined_filename} exists, skipping combining")
else
# https://unix.stackexchange.com/questions/40480/how-to-unzip-a-multipart-spanned-zip-on-linux
logger.info(Open3.capture2e("zip -s 0 #{download_path.basename} --out #{combined_filename}", chdir: download_path.dirname))

@aaron-collier aaron-collier Feb 2, 2026

@jmartin-sul I'm still working on verifying the files, but at a minimum, https://github.com/sul-dlss/preservation_catalog/wiki/Extracting-segmented-zipfiles#the-hard-way indicates that this zip -s 0 method does NOT work, and that the parts should instead be individually cat-ed in order. I think the fact that the unzipped parts are incomplete confirms this.

So I think this needs to be updated to try to best identify the parts and individually cat them.

And, for what it's worth, the combined file seems too small here as well:

-rw-r--r-- 1 pres pres  28G Jan 28 21:02 zy152mz9183.v0002.combined.zip
-rw-r--r-- 1 pres pres  10G Jan 28 17:51 zy152mz9183.v0002.z01
-rw-r--r-- 1 pres pres  10G Jan 28 17:55 zy152mz9183.v0002.z02
-rw-r--r-- 1 pres pres  10G Jan 28 17:59 zy152mz9183.v0002.z03
-rw-r--r-- 1 pres pres  10G Jan 28 18:03 zy152mz9183.v0002.z04
-rw-r--r-- 1 pres pres  10G Jan 28 18:07 zy152mz9183.v0002.z05
-rw-r--r-- 1 pres pres  10G Jan 28 18:11 zy152mz9183.v0002.z06
-rw-r--r-- 1 pres pres 3.0G Jan 28 18:13 zy152mz9183.v0002.zip

cat-ing the files seems more accurate, maybe:

-rw-r--r-- 1 pres pres  28G Jan 28 21:02 zy152mz9183.v0002.combined.zip
-rw-r--r-- 1 pres pres  63G Feb  2 13:29 zy152mz9183.v0002.fixed.zip
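
A minimal sketch of the cat-in-order approach, with a size check on the result. This is not the code in this PR. The helper name `concatenate_zip_parts` is hypothetical, and it assumes parts are named `<base>.z01`, `<base>.z02`, ... with `<base>.zip` as the final segment, as in the listing above. The size check follows from the listing: the parts sum to roughly 63G (6 × 10G + 3.0G), which matches the cat-ed file and not the 28G one.

```ruby
# Hypothetical helper: concatenate segmented zip parts in order
# (<base>.z01, <base>.z02, ..., then <base>.zip last) into one file,
# and verify the output size equals the sum of the part sizes.
def concatenate_zip_parts(dir, base_name, output_name)
  part_pattern = /\A#{Regexp.escape(base_name)}\.z(\d+)\z/
  # Sort numerically on the part number so that z10 follows z09, not z01.
  numbered_parts = Dir.children(dir).grep(part_pattern).sort_by { |name| name[part_pattern, 1].to_i }
  part_paths = (numbered_parts + ["#{base_name}.zip"]).map { |name| File.join(dir, name) }

  output_path = File.join(dir, output_name)
  File.open(output_path, 'wb') do |out|
    # Stream each part so multi-GB files are never loaded into memory.
    part_paths.each { |part_path| IO.copy_stream(part_path, out) }
  end

  expected_size = part_paths.sum { |part_path| File.size(part_path) }
  actual_size = File.size(output_path)
  raise "combined size #{actual_size} != sum of parts #{expected_size}" unless actual_size == expected_size

  output_path
end
```

For the object above, a call might look like `concatenate_zip_parts(download_path.dirname, 'zy152mz9183.v0002', 'zy152mz9183.v0002.fixed.zip')`.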

@aaron-collier aaron-collier force-pushed the bin-fixity-check-replicated-moabs branch from 04940bc to f5185c9 Compare February 2, 2026 22:09

2 participants