Description
Getting issues from a live batch back into NCA is problematic. The core problem is that many orgs, ours included, don't keep the source files (TIFFs for scanned issues, original PDFs for born-digital) readily at hand, which means we effectively have only a pile of derivatives to work with. None of the files present in the final "live batch" dir are true source files!
To make NCA work without fairly complicated hacks, we need the original files, and right now there's no automated way to retrieve them. Once files are in our dark archive, there isn't a simple way to find them again - they're archived in an unpredictable dir structure that is based on the date an archive was created and, I believe, some kind of "volume code". So batch_oru_november_ver1 might be in <dark archive>/newspapers3/2021-01-02/batch_oru_november or something, while batch_oru_echo_ver01 could be in <dark archive>/newspapers1/2020-12-20/batch_oru_echo_ver01. I suspect other institutions have similar problems, and I know for a fact that at least one doesn't even keep the original files.
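To illustrate why this lookup is awkward, here is a minimal sketch of searching an archive for a batch by name. It assumes the layout really is `<dark archive>/<volume>/<date>/<batch dir>`, as in the guessed paths above, and that a batch dir may or may not carry its `_verNN` suffix. The function name `find_batch_dirs` and the matching rule are hypothetical, not anything NCA provides.

```python
import re
from pathlib import Path


def find_batch_dirs(archive_root, batch_name):
    """Return candidate dirs for batch_name under archive_root.

    Assumed (unconfirmed) layout: <archive_root>/<volume>/<date>/<batch dir>.
    """
    # Strip a trailing version suffix such as "_ver1" or "_ver01", since the
    # archived dir name may have been stored without it.
    base = re.sub(r"_ver\d+$", "", batch_name)

    found = []
    # Two wildcard levels (volume code, archive date), then the batch dir.
    for candidate in Path(archive_root).glob("*/*/*"):
        if not candidate.is_dir():
            continue
        # Accept the bare batch name or the batch name with any version suffix.
        if candidate.name == base or candidate.name.startswith(base + "_ver"):
            found.append(candidate)
    return sorted(found)
```

Even under these assumptions the search can return zero or several candidates, so someone would still have to confirm which directory holds the right originals.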
The obvious plan is to just use the derivatives, which is what I was planning to do. But it's looking less and less viable, at least with the setup we have today. Many parts of the app assume we're working with the original files, because that's what we've always done. Some examples: there's a job to rebuild derivatives if things go wrong, which won't work if we have no original files; the "flag as errored and remove from NCA" process basically requires the original files, with the derivatives stored as something of an afterthought; and when we create batches, we assume born-digital issues have a backup with the original uploads.
Some of these problems can be hacked around without too much trouble, but at a pretty steep cost for future improvements. Other places that expect originals are going to be a bigger problem, because the app is coupled to those originals in ways we can't really undo. Validating an issue, for instance, has to happen a certain way in pretty much all current use-cases, and that validation doesn't fit an issue that's just a copy of the derivatives.
We may not have a lot of options here other than to say "if you want the issues back in NCA, you'll have to manually pull the files from your archive."