Note that there is more indirection to go through if you want to retrieve from AWS endpoints; see #1076 (comment):
These docs may be out of date in light of #2526, but these are the related tabs I had open from the bit of digging I did on this last year:
  logger.info("#{download_path.dirname}/#{combined_filename} exists, skipping combining")
else
  # https://unix.stackexchange.com/questions/40480/how-to-unzip-a-multipart-spanned-zip-on-linux
  logger.info(Open3.capture2e("zip -s 0 #{download_path.basename} --out #{combined_filename}", chdir: download_path.dirname))
@jmartin-sul I'm still working on verifying the files, but at a minimum, https://github.com/sul-dlss/preservation_catalog/wiki/Extracting-segmented-zipfiles#the-hard-way indicates that the zip -s 0 method does NOT work and that the parts should instead be concatenated with cat, individually and in order. I think the fact that the unzipped parts are incomplete confirms this.
So I think this needs to be updated to identify the parts as reliably as possible and cat them individually, in order.
And, for what it's worth, the combined file seems too small here as well (28G, versus about 63G for the parts in total):
-rw-r--r-- 1 pres pres 28G Jan 28 21:02 zy152mz9183.v0002.combined.zip
-rw-r--r-- 1 pres pres 10G Jan 28 17:51 zy152mz9183.v0002.z01
-rw-r--r-- 1 pres pres 10G Jan 28 17:55 zy152mz9183.v0002.z02
-rw-r--r-- 1 pres pres 10G Jan 28 17:59 zy152mz9183.v0002.z03
-rw-r--r-- 1 pres pres 10G Jan 28 18:03 zy152mz9183.v0002.z04
-rw-r--r-- 1 pres pres 10G Jan 28 18:07 zy152mz9183.v0002.z05
-rw-r--r-- 1 pres pres 10G Jan 28 18:11 zy152mz9183.v0002.z06
-rw-r--r-- 1 pres pres 3.0G Jan 28 18:13 zy152mz9183.v0002.zip
cat-ing the files seems more accurate, maybe:
-rw-r--r-- 1 pres pres 28G Jan 28 21:02 zy152mz9183.v0002.combined.zip
-rw-r--r-- 1 pres pres 63G Feb 2 13:29 zy152mz9183.v0002.fixed.zip
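A minimal sketch of the cat-in-order approach discussed above, in case it helps with updating the script. It assumes the parts follow the naming in the listing (numbered .z01, .z02, ... segments, with the .zip file as the final segment). The helper names ordered_zip_parts and concatenate_parts are hypothetical and are not taken from this PR. It covers only concatenation plus a size sanity check; whether the result needs any further repair before extraction is not addressed here.

```ruby
# Hypothetical helper: list the parts of a segmented zip in concatenation order.
# Assumes numbered segments named "<base>.zNN" and a final segment "<base>.zip".
def ordered_zip_parts(dir, base)
  pattern = /\A#{Regexp.escape(base)}\.z(\d+)\z/
  # Sort numerically so that e.g. .z10 comes after .z09 and .z9.
  numbered = Dir.children(dir).grep(pattern).sort_by { |name| name[pattern, 1].to_i }
  # The .zip file is the last segment, so it goes at the end.
  (numbered + ["#{base}.zip"]).map { |name| File.join(dir, name) }
end

# Hypothetical helper: byte-for-byte concatenation of the parts (what cat does),
# followed by a check that the output size equals the sum of the part sizes.
def concatenate_parts(part_paths, combined_path)
  File.open(combined_path, 'wb') do |out|
    part_paths.each do |part|
      # Stream each part so large segments are not loaded into memory at once.
      File.open(part, 'rb') { |input| IO.copy_stream(input, out) }
    end
  end

  expected = part_paths.sum { |part| File.size(part) }
  actual = File.size(combined_path)
  raise "size mismatch: expected #{expected} bytes, got #{actual}" unless actual == expected

  combined_path
end
```

For the object in the listing above, ordered_zip_parts(dir, 'zy152mz9183.v0002') would return the .z01 through .z06 paths followed by the .zip path, and the size check would have flagged a 28G result against roughly 63G of parts.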
…ve zips from preservation cloud buckets
TODO to get this out of draft:
EventHelpers.confirm_archive_zip_replication_events?
Why was this change made? 🤔
This is the start of a script for pulling down and fixity checking content we've archived to cloud buckets. It turns what were essentially Ruby console notes to myself for this sort of check into something that anyone can run as a script. The machinery here could also form a starting point for #1076.
It has been useful for spot checking GCP replication on prod (as we did with its predecessor against stage and QA). See #2426 and #2518.
Usage
How was this change tested? 🤨
It was used to fixity check the test objects listed in #2518. When an earlier pass at the script failed to properly re-assemble the multi-part zips it had downloaded, the missing moab manifest files were detected. They were indeed missing from that first zip extraction attempt but present in the multi-part zip; once the script was updated to re-assemble the zip successfully, they were retrieved, and the moabs in question passed fixity check. This was a convenient natural experiment: a corrupted zip extraction that was resolved by fixing the extraction approach.