Improve XMP metadata handling#1010
Conversation
When trying to extract the encoding part out of a string like "ENC=B,UTF-8" (returned from the XMPHandler class) the encoding starts at offset 6, not 5. We want "UTF-8", not ",UTF-8". Without this fix JHOVE would raise a TIFF-HUL-14 error whenever there is XMP metadata in a TIFF file that is enclosed in a packet wrapper (a pair of XML processing instructions) and the packet wrapper has an encoding attribute, regardless of which encoding is actually specified there (because Java doesn't recognize the encoding string because of the leading comma). With this fix, JHOVE *should* raise a TIFF-HUL-14 error whenever there is XMP metadata in a TIFF file that is enclosed by a packet wrapper, the packet wrapper has an encoding attribute, and the encoding specified that is not supported by Java. If it wasn't for another bug that was uncovered by this bugfix, but we're getting there ...
JHOVE now correctly raises a TIFF-HUL-14 "Invalid or ill-formed XMP metadata" error in the following cases: - The XMP is not enclosed in a packet wrapper and it contains invalid XML. This is what one would usually expect from the error message but has not been the case before. - The XMP is enclosed in a packet wrapper and the packet wrapper declares an encoding not supported by Java. Doesn't matter if the XML is actually valid or not. This is what JHOVE has been doing for years; with the exception that it raised an error regardless of whether the declared encoding was actually supported or not (this was a bug). Note however that whether or not an encoding is supported by Java should actually not impact the XMP validation anyway because according to the Adobe XMP Specification Part 3 (2020), page 19, the only allowed encoding for XMP in TIFF is UTF-8. JHOVE does not, however, raise a TIFF-HUL-14 error in this case: - The XMP is enclosed in a packet wrapper, the packet wrapper declares an encoding that *is* supported by Java, and the XMP contains invalid XML. This should of course be detected but is hard to implement with the current control flow. So this commit should be seen as an improvement, but not a full solution.
Is this class actually used anywhere? But well, I just updated the XMP processing code like everywhere else.
... to align with the rating in PDF. Files with invalid XMP metadata now get an overall rating of "well-formed, but not valid". I think that's OK for such a minor deficiency.
Codacy tells me "These nested if statements could be combined". Yes of course they can, but does that make the code and documentation more readable? Not at all. But it that's what it takes to satisfy the soulless rule checker thingy, so be it ...
|
Ah, I see. CI does more than just |
|
OK, sorry it took so long - priorities ... Now I finally found some time to investigate the bbt-jhove CI failures. Not to sound presumptuous but I think none of the errors indicate a problem with the changes in this PR. First, several JPEG2000 files lead to an unhandled EOFException ...
... or an unhandled NullPointerException
However, since I haven't messed with JPEG2000 in this PR and since these exceptions are thrown by the much older JHOVE v1.28 as well I deny any responsibility for them. ;-) Second, two PDF files do indeed fail the tests because validating them doesn't lead to the expected results.
This might be because JHOVE now detects erroneous XMP that it previously wasn't able to find, obviously changing the validation results. And in fact, JHOVE now throws PDF-HUL-101 errors for these files. However, in these specific cases this is not actually caused by invalid XMP data but by the fact that the XMP data is stored in encrypted content streams that JHOVE cannot handle. JHOVE doesn't decrypt the content streams but happily feeds the encrypted data to its XML parser which of course throws an exception that ultimately leads to the PDF-HUL-101 errors ... From the XMP validation perspective this behaviour appears to be correct - the encrypted data isn't valid XMP but a seemingly random bitstream. This is of course far from perfect; I would rather know that JHOVE cannot read the content stream instead of being told that the XMP is invalid. However, dealing with encrypted content streams in general is very much out of scope of this PR, so I suggest you accept it as it is. PS: We could of course stop validating XMP altogether if the content stream is encrypted. But as far as I can tell from the source code this cannot be reliably determined at the moment because the |
|
Hi @marhop I'm starting to look at these test errors. Once I have some confidence in the results I'll patch the tests and look to merge this. |
This PR changes how XMP metadata is validated in TIFF, GIF, JPEG, and PDF.
Prior to this PR, an error with respect to XMP (e.g., TIFF-HUL-14) would be raised if and only if the XMP metadata was enclosed in a so-called packet wrapper that contained an encoding attribute to declare the encoding of the XMP data. This was problematic for a couple of reasons:
With this PR XMP metadata is now checked as follows:
Note that I added two new error IDs to account for invalid XMP metadata, GIF-HUL-11 and JPEG-HUL-15. I will add them to the Wiki if/when this PR is accepted.
Cheers,
Martin