diff --git a/adr/007-content-type-allocation.md b/adr/007-content-type-allocation.md
new file mode 100644
index 0000000..75bb321
--- /dev/null
+++ b/adr/007-content-type-allocation.md
@@ -0,0 +1,127 @@
+# Determination of Content Type
+
+> [!IMPORTANT]
+> The following assumes we are running on Amazon S3 for Deposit Storage and not a local file system.
+
+If you upload a file in the UI, we try to determine the content type in your browser, as soon as the file is selected on the local file system. We read the value of the file input element's files property and read the `type` in JavaScript:
+
+```javascript
+    fileContentType.value = fileSelector.files[0].type;
+```
+
+(see https://developer.mozilla.org/en-US/docs/Web/HTML/Reference/Elements/input/file#type)
+
+The value of type is included in the submitted form, which is handled by `OnPostUploadFile` in Deposit.cshtml.cs.
+
+If no value is supplied in the form submission, we use the [MimeTypes](https://github.com/khellang/MimeTypes) library to determine it:
+
+```c#
+    if(string.IsNullOrWhiteSpace(depositFileContentType) && MimeTypes.TryGetMimeType(slug, out var foundMimeType))
+        depositFileContentType = foundMimeType;
+```
+
+We then call `WorkspaceManager.UploadSingleSmallFile(...)`, supplying the determined content type (which might still be null).
+
+This will cause `UploadFileToDepositHandler` to run, which makes an S3 `PutObjectRequest`:
+
+```c#
+var req = new PutObjectRequest
+{
+    BucketName = s3Uri.Bucket,
+    Key = fullKey,
+    ContentType = request.ContentType,
+    ChecksumAlgorithm = ChecksumAlgorithm.SHA256,
+    InputStream = request.Stream
+};
+```
+
+The content type can still be null or empty at this stage. If so, the file on S3 will be given the content type `application/octet-stream` in its object metadata, and we will retrieve this and apply it to the object. (NEW)
+
+> [Here we re-acquire the object metadata](https://github.com/digirati-co-uk/digital-preservation/blob/46486d20d067484632ab5fdecda2ff8feb8b3c2a/src/DigitalPreservation/DigitalPreservation.Workspace/Requests/UploadFileToDeposit.cs#L75).
+
+If the supplied content type was empty, we use the S3 content type.
+And if the S3 content type is empty (unlikely), we use the value `ContentTypes.NotIdentified` (which is "dlip/not-identified"). (NEW)
+
+(we don't want to store this special type in METS though - see later)
+
+S3 only sets the content type metadata if we don't supply one in the form upload - it will return the ContentType we supplied in the original PutObjectRequest unless it was empty.
+
+Either way this gets stored in the S3 object metadata:
+
+![S3 object metadata](images/s3-object-metadata.png)
+
+If we put the object directly in S3 and then regenerate the filesystem, then obviously S3 sets this metadata field, unless we set it ourselves manually.
+
+So for an _uploaded_ file, the object on S3 has metadata that is either from:
+
+ - the uploaded form field (browser-detected)
+ - MimeTypes if none sent by the browser
+ - S3's own determination, if none returned from MimeTypes
+
+... and a directly placed file will return either
+
+ - what we explicitly set on our manual direct upload
+ - S3's own determination, if we didn't specify a Content Type
+
+We then create a WorkingFile and add this to the __metslike.json local file model, viewable on /deposits/{id}/depositfilesystem in the UI application. This will later be `CombinedFile.FileInDeposit` when we start comparing for mismatches.
+
+By this stage it is _extremely unlikely_ that the WorkingFile has `ContentTypes.NotIdentified` because S3 always *_seems_* to apply something - it won't let the object be served without a content type.
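+
+As an illustration only, the order in which a content type ends up on the WorkingFile can be sketched as below. `ResolveContentType` and its parameters are hypothetical names, not the project's code; the `lookupByExtension` delegate stands in for `MimeTypes.TryGetMimeType`, and the S3 value is assumed to have already been read back from the object metadata.
+
+```c#
+using System;
+
+// Hypothetical helper: applies the precedence described above
+// (form value, then extension lookup, then S3 metadata, then NotIdentified).
+static string ResolveContentType(
+    string? formContentType,                 // sent by the browser, may be empty
+    string fileName,
+    Func<string, string?> lookupByExtension, // stand-in for MimeTypes.TryGetMimeType
+    string? s3ContentType)                   // read back from S3 object metadata
+{
+    const string NotIdentified = "dlip/not-identified"; // value of ContentTypes.NotIdentified
+
+    if (!string.IsNullOrWhiteSpace(formContentType)) return formContentType;
+
+    var fromExtension = lookupByExtension(fileName);
+    if (!string.IsNullOrWhiteSpace(fromExtension)) return fromExtension;
+
+    if (!string.IsNullOrWhiteSpace(s3ContentType)) return s3ContentType;
+
+    return NotIdentified;
+}
+
+// Neither the browser nor the extension lookup knows ".bar", so the S3 value is used.
+Func<string, string?> noMatch = _ => null;
+Console.WriteLine(ResolveContentType(null, "foo.bar", noMatch, "application/octet-stream"));
+```
+
+In the real flow these steps are spread across `OnPostUploadFile` and `UploadFileToDepositHandler`; the sketch only collapses them to show the precedence.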
+
+We then ADD the WorkingFile to METS, and the ContentType appears as the MIMETYPE attribute of the mets:file element:
+
+![alt text](images/mets-mimetype.png)
+
+ - The first file here has no MIMETYPE attribute because the content type was either empty or was `ContentTypes.NotIdentified` (now assumed to be extremely rare).
+ - The second file was determined on the browser as video/mp4 and this makes its way all the way through to the METS.
+ - The third file demonstrates that we can supply an alternative ContentType value on the PutObjectRequest to S3 and it will persist into METS.
+ - The fourth file is something which neither the browser nor MimeTypes could determine a content type for, and S3 applied application/octet-stream, which persists into METS.
+
+# Further information from tools
+
+If we now [run Siegfried](images/siegfried.csv) over these files, we acquire some more content type information.
+But Siegfried doesn't give the same results for some file types:
+
+| file                               | S3/Browser/MimeTypes       | Siegfried CT    | Siegfried PRONOM                            |
+|------------------------------------|----------------------------|-----------------|---------------------------------------------|
+| boo.bar                            | application/octet-stream   | null            | fmt/583 - Vector Markup Language            |
+| brian_news-alert-test-cpr-3171.msg | application/vnd.ms-outlook | null            | x-fmt/430 - Microsoft Outlook Email Message |
+| chrome_bookmarks_11_11_2020.html   | text/html                  | null            | fmt/1132 - Netscape Bookmark File Format    |
+| digirati-logo-white-green.svg      | image/svg+xml              | null            | (none)                                      |
+| foo.bar                            | application/octet-stream   | null            | fmt/583 - Vector Markup Language            |
+| pm1.mp4 (custom S3 write)          | application/mp4            | application/mp4 | fmt/199 - MPEG-4 Media File                 |
+| pm2.mp4 (standard upload)          | video/mp4                  | application/mp4 | fmt/199 - MPEG-4 Media File                 |
+| Silver-Dagger.m4a                  | audio/x-m4a                | application/mp4 | fmt/199 - MPEG-4 Media File                 |
+| Silver-Dagger.mp4                  | video/mp4                  | application/mp4 | fmt/199 - MPEG-4 Media File                 |
+| ZoomOutlookPluginSetup.msi         | application/octet-stream   | null            | fmt/111 - OLE2 Compound Document Format     |
+
+(The two MP4s and the two Silver-Dagger.* are the same binary content)
+
+We have tended to assume that Siegfried is a better source, because it actually examines the file. The S3/Browser/MimeTypes mechanisms are just looking up the file extension in a big list. However, there are two scenarios we need to address:
+
+ - S3/Browser/MimeTypes gives one answer, Siegfried gives another (pm2.mp4)
+ - Siegfried doesn't report a content type at all
+
+In the first case, which one do we store in the METS?
+
+In the latter case Siegfried still _identifies_ the file correctly (most of the time) but doesn't assign a content type. Should we just use the content type we got from S3/Browser/MimeTypes? If so, should we record the fact that Siegfried didn't identify the file? That information is present in the siegfried.csv file, but
+
+What do we consider a mismatch between the file system (S3 metadata + metadata parsed from siegfried.csv) and the METS file? The METS file's MIMETYPE might agree with one of those.
+
+The MIMETYPE attribute in METS is used by iiif-builder when registering _assets_ with IIIF Cloud Services. In the case of pm1.mp4 and pm2.mp4, the content type obtained by S3/Browser/MimeTypes is *better* than the Siegfried one. While MP4 can be a container format for audio or video, these MP4s really are videos! And this is important when registering assets because the registered content type - image/\*, audio/\*, video/\* - is significant. It determines how the asset will be processed.
+
+Silver-Dagger.m4a and Silver-Dagger.mp4 are pure audio files - they are the same file, just with different file extensions. The extension-only approach produces a better result for .m4a - but the extension-only approach for Silver-Dagger.mp4 is actively harmful: this is not a video.
+
+(See https://en.wikipedia.org/wiki/MP4_file_format#Filename_extensions for more information - while MP4 audio is usually .m4a, it doesn't _have_ to be.)
+
+## Strategy for dealing with content type differences
+
+- Get a "Best Content Type" from the Deposit WorkingFile - taking into account 0..n content types from `FileFormatMetadata` values as well as the ContentType initially determined on the file (and stored in S3 metadata and/or __metslike.json filesystem)
+- `ContentTypes.GetBestContentType(WorkingFile? fileInDeposit)` is an extension point for adding in more complex decisions later, perhaps by looking at the outputs of further tools such as ExifTool or even FFprobe. I have added in some starter logic here to favour video/mp4 over application/mp4 and similar patterns - but this needs more work. It's now where we start removing generic content types if we have more than one.
+- When adding to or updating METS via patchPremis / premisFile logic, use this "best Content Type" if present
+- Mismatch between METS and Deposit is a mismatch between this BEST value and what's currently in the METS file.
+- Actually list all the content types when there is a mismatch when trying to create an import job
+
+
+Additional changes
+
+- In MetsParser, remove fallback determination of content type via MimeTypes. If it's not in the METS, MetsParser shouldn't invent it.
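+
+A minimal sketch of the "remove generic types, then favour the more specific one" idea from the strategy above. It is not the actual `ContentTypes.GetBestContentType`: `PickBestContentType`, its flat list of candidate strings, and the set of values treated as generic are all assumptions made for illustration.
+
+```c#
+using System;
+using System.Collections.Generic;
+using System.Linq;
+
+// Hypothetical sketch; the real method takes a WorkingFile, not a list of strings.
+static string? PickBestContentType(IEnumerable<string?> candidates)
+{
+    // Assumption: these values carry no useful information about the file.
+    var generic = new HashSet<string>(StringComparer.OrdinalIgnoreCase)
+    {
+        "application/octet-stream", "dlip/not-identified"
+    };
+
+    var distinct = candidates
+        .Where(c => !string.IsNullOrWhiteSpace(c))
+        .Select(c => c!.Trim())
+        .Distinct(StringComparer.OrdinalIgnoreCase)
+        .ToList();
+    if (distinct.Count == 0) return null;
+
+    // Drop generic types only when something more specific remains.
+    var specific = distinct.Where(c => !generic.Contains(c)).ToList();
+    if (specific.Count == 0) return distinct[0];
+
+    // Favour e.g. video/mp4 over application/mp4 when both are present.
+    foreach (var c in specific)
+    {
+        var parts = c.Split('/');
+        if (parts.Length == 2
+            && !parts[0].Equals("application", StringComparison.OrdinalIgnoreCase)
+            && specific.Any(o => o.Equals("application/" + parts[1], StringComparison.OrdinalIgnoreCase)))
+            return c;
+    }
+    return specific[0];
+}
+
+Console.WriteLine(PickBestContentType(new[] { "application/mp4", "video/mp4" })); // video/mp4
+```
+
+Under this rule alone, Silver-Dagger.mp4 would still come out as video/mp4, which is the case identified above as harmful, so pattern matching of this kind is only a starting point.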
diff --git a/adr/images/mets-mimetype.png b/adr/images/mets-mimetype.png
new file mode 100644
index 0000000..e18a907
Binary files /dev/null and b/adr/images/mets-mimetype.png differ
diff --git a/adr/images/s3-object-metadata.png b/adr/images/s3-object-metadata.png
new file mode 100644
index 0000000..34c0e03
Binary files /dev/null and b/adr/images/s3-object-metadata.png differ
diff --git a/adr/images/siegfried.csv b/adr/images/siegfried.csv
new file mode 100644
index 0000000..422127a
--- /dev/null
+++ b/adr/images/siegfried.csv
@@ -0,0 +1,11 @@
+filename,filesize,modified,errors,sha256,namespace,id,format,version,mime,class,basis,warning
+/usr/deposit/deposits/xy97za6qef27/objects/boo.bar,50789,2026-01-20T14:30:23Z,,5008b95670ae2ed3fc27c45ffe8a7e79c19f13b53306831f7b18ad8203fe4077,pronom,fmt/583,Vector Markup Language,,,"Image (Vector), Text (Mark-up)","byte match at 6, 39",extension mismatch
+/usr/deposit/deposits/xy97za6qef27/objects/brian_news-alert-test-cpr-3171.msg,187904,2026-01-20T14:30:00Z,,bab163717d42c1f85ad434007c1bb3e7385212026002ba67a4ec09841f5d0c70,pronom,x-fmt/430,Microsoft Outlook Email Message,97-2003,,Email,extension match msg; container name __nameid_version1.0 with name only; name __properties_version1.0 with name only,
+/usr/deposit/deposits/xy97za6qef27/objects/chrome_bookmarks_11_11_2020.html,621891,2026-01-20T14:29:40Z,,b2785c76ffec3fcd42be30a58be94f30e70d6ba9815bb8c6ef488432137f1d27,pronom,fmt/1132,Netscape Bookmark File Format,,,Text (Mark-up),"extension match html; byte match at 0, 35",
+/usr/deposit/deposits/xy97za6qef27/objects/digirati-logo-white-green.svg,6037,2026-01-20T14:29:23Z,,faa3467b13d80944e22742976f4f4b2c2f74ca08001e39d5a8a48adc565b6c31,pronom,UNKNOWN,,,,,,"no match; possibilities based on extension are fmt/91, fmt/92, fmt/413"
+/usr/deposit/deposits/xy97za6qef27/objects/foo.bar,50789,2026-01-20T14:30:34Z,,5008b95670ae2ed3fc27c45ffe8a7e79c19f13b53306831f7b18ad8203fe4077,pronom,fmt/583,Vector Markup Language,,,"Image (Vector), Text (Mark-up)","byte match at 6, 39",extension mismatch
+/usr/deposit/deposits/xy97za6qef27/objects/pm1.mp4,22748707,2026-01-20T14:31:22Z,,0d52980fe7e30361eb5c3cd4f52fa9372a80657b24f6a5d7102185f090c55056,pronom,fmt/199,MPEG-4 Media File,,application/mp4,"Audio, Video",extension match mp4; byte match at [[4 8] [22088895 4]],
+/usr/deposit/deposits/xy97za6qef27/objects/pm2.mp4,22748707,2026-01-20T14:32:01Z,,0d52980fe7e30361eb5c3cd4f52fa9372a80657b24f6a5d7102185f090c55056,pronom,fmt/199,MPEG-4 Media File,,application/mp4,"Audio, Video",extension match mp4; byte match at [[4 8] [22088895 4]],
+/usr/deposit/deposits/xy97za6qef27/objects/silver-dagger.m4a,3703505,2026-01-20T17:25:58Z,,2eb7c08f5ea3baf39ac7376c9c71f911e5be894aad1f0048209c166a029ec76f,pronom,fmt/199,MPEG-4 Media File,,application/mp4,"Audio, Video",extension match m4a; byte match at [[4 20] [36 4]],
+/usr/deposit/deposits/xy97za6qef27/objects/silver-dagger.mp4,3703505,2026-01-20T17:26:16Z,,2eb7c08f5ea3baf39ac7376c9c71f911e5be894aad1f0048209c166a029ec76f,pronom,fmt/199,MPEG-4 Media File,,application/mp4,"Audio, Video",extension match mp4; byte match at [[4 20] [36 4]],
+/usr/deposit/deposits/xy97za6qef27/objects/zoomoutlookpluginsetup.msi,7598592,2026-01-20T14:29:09Z,,b1706937b73f35c27b51c4919f387a65492a6ffb58cdd9b071868d2a727df57f,pronom,fmt/111,OLE2 Compound Document Format,,,Text (Structured),"byte match at 0, 30",