Improve XMP metadata handling #1010

Open

marhop wants to merge 20 commits into openpreserve:integration from marhop:xmp

Conversation

marhop commented Feb 26, 2025

This PR changes how XMP metadata is validated in TIFF, GIF, JPEG, and PDF.

Prior to this PR, an error with respect to XMP (e.g., TIFF-HUL-14) would be raised if and only if the XMP metadata was enclosed in a so-called packet wrapper that contained an encoding attribute to declare the encoding of the XMP data. This was problematic for a couple of reasons:

  • A packet wrapper is a pair of XML processing instructions that is intended to facilitate scanning a byte stream of unknown format for XMP metadata by enclosing the actual XML data with very specific marker strings, similar to magic numbers used for file format identification (a minimal sketch follows this list). However, a packet wrapper is not recommended (albeit not illegal) if the location of XMP metadata in a file is well-defined. This is the case in all of TIFF, GIF, JPEG, and PDF, so JHOVE could just as well ignore the packet wrapper. (Adobe XMP Specification Part 1 (2012), pages 10-11)
  • The encoding attribute has been deprecated since at least 2004. (Adobe XMP Specification (2004), page 30)
  • Moreover, XMP metadata in all of TIFF, GIF, JPEG, and PDF is explicitly required to use UTF-8 since at least 2010, so JHOVE does not need the encoding attribute anyway. (Adobe XMP Specification Part 3 (2010), this hasn't changed in the current version)
  • There was a bug in the code that handled the encoding string, and since that code had been copy/pasted everywhere XMP is processed, XMP validation seems never to have worked as intended. (see for example 7d54020)
  • As a result, the only way to get an XMP-related error was a (not recommended) packet wrapper with a (deprecated) encoding attribute. The actual XML that is the XMP metadata was never checked at all.
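
For illustration, here is a minimal Java sketch of what such a packet wrapper looks like and how it could simply be stripped before parsing. The id value is the fixed one from the XMP specification and the encoding attribute is the deprecated one discussed above; the rest is made up, and this is not the code in this PR:

    class PacketWrapperSketch {
        public static void main(String[] args) {
            // The packet wrapper: two XML processing instructions around the actual XMP.
            String xmpPacket =
                  "<?xpacket begin=\"\uFEFF\" id=\"W5M0MpCehiHzreSzNTczkc9d\" encoding=\"UTF-8\"?>\n"
                + "<x:xmpmeta xmlns:x=\"adobe:ns:meta/\">\n"
                + "  <rdf:RDF xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\"/>\n"
                + "</x:xmpmeta>\n"
                + "<?xpacket end=\"w\"?>";

            // The wrapper carries nothing that the XML itself does not, so it can be
            // dropped before the data is handed to an XML parser.
            String xml = xmpPacket.replaceAll("<\\?xpacket[^>]*\\?>", "").trim();
            System.out.println(xml);
        }
    }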

With this PR XMP metadata is now checked as follows:

  • A packet wrapper and in particular its encoding attribute are ignored because they are irrelevant, but not strictly illegal.
  • XMP metadata is expected to be encoded in UTF-8 because this is prescribed for the file formats JHOVE is dealing with. Other encodings will raise an error.
  • XMP metadata is checked to be well-formed XML, so fundamentally broken XML will raise an error (a minimal sketch follows this list). However, no XML validation against a schema is performed because, due to the extensibility of XMP, this would lead to a lot of validation failures when custom schema files are not available.
  • Files containing broken XMP metadata are rated as "well-formed, but not valid". I'm not sure whether this conforms to JHOVE's wider policy but it seemed sensible to me because I cannot imagine broken XMP leading to serious issues. IMHO, even a mere warning/info would be enough.
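
To make the well-formedness check concrete, here is a minimal Java sketch of that kind of check, i.e. decode as UTF-8 and require well-formed XML, but do no schema validation. This is only illustrative, not the code added by this PR:

    import java.io.ByteArrayInputStream;
    import java.nio.charset.StandardCharsets;
    import javax.xml.parsers.SAXParserFactory;
    import org.xml.sax.InputSource;
    import org.xml.sax.helpers.DefaultHandler;

    final class XmpWellFormednessSketch {

        // Returns true if the bytes parse as well-formed XML when decoded as UTF-8.
        static boolean isWellFormedUtf8Xmp(byte[] xmp) {
            try {
                SAXParserFactory factory = SAXParserFactory.newInstance();
                factory.setNamespaceAware(true);
                factory.setValidating(false); // well-formedness only, no schema validation
                InputSource source = new InputSource(new ByteArrayInputStream(xmp));
                source.setEncoding(StandardCharsets.UTF_8.name());
                factory.newSAXParser().parse(source, new DefaultHandler());
                return true;
            } catch (Exception e) {
                // Any decoding or parse failure means the XMP is not acceptable.
                return false;
            }
        }
    }

A module would presumably map a failure of such a check to its format-specific error ID (TIFF-HUL-14, GIF-HUL-11, or JPEG-HUL-15).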

Note that I added two new error IDs to account for invalid XMP metadata, GIF-HUL-11 and JPEG-HUL-15. I will add them to the Wiki if/when this PR is accepted.

Cheers,
Martin

When trying to extract the encoding part out of a string like
"ENC=B,UTF-8" (returned from the XMPHandler class) the encoding starts
at offset 6, not 5. We want "UTF-8", not ",UTF-8".

Without this fix JHOVE would raise a TIFF-HUL-14 error whenever there is
XMP metadata in a TIFF file that is enclosed in a packet wrapper (a pair
of XML processing instructions) and the packet wrapper has an encoding
attribute, regardless of which encoding is actually specified there
(because, due to the leading comma, Java does not recognize the encoding
string).
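
The off-by-one and its consequence are easy to reproduce in isolation; the "ENC=B,UTF-8" string is quoted from the commit message above, everything else is just an illustration:

    import java.nio.charset.Charset;

    class EncodingOffsetSketch {
        public static void main(String[] args) {
            String enc = "ENC=B,UTF-8";           // as returned by the XMPHandler class
            System.out.println(enc.substring(5)); // ",UTF-8" (offset 5, the old behaviour)
            System.out.println(enc.substring(6)); // "UTF-8" (the intended value)

            System.out.println(Charset.isSupported("UTF-8")); // true
            // ",UTF-8" is not even a syntactically legal charset name, so the
            // following call throws an IllegalCharsetNameException:
            Charset.isSupported(",UTF-8");
        }
    }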

With this fix, JHOVE *should* raise a TIFF-HUL-14 error whenever there
is XMP metadata in a TIFF file that is enclosed by a packet wrapper, the
packet wrapper has an encoding attribute, and the encoding specified
there is not supported by Java. Or it would, if it weren't for another
bug that was uncovered by this bugfix, but we're getting there ...

JHOVE now correctly raises a TIFF-HUL-14 "Invalid or ill-formed XMP
metadata" error in the following cases:

- The XMP is not enclosed in a packet wrapper and it contains invalid
  XML. This is what one would usually expect from the error message but
  has not been the case before.
- The XMP is enclosed in a packet wrapper and the packet wrapper
  declares an encoding not supported by Java. It does not matter whether
  the XML is actually valid or not. This is what JHOVE has been doing
  for years, with the exception that it raised an error regardless of
  whether the declared encoding was actually supported (this was a bug).
  Note, however, that whether an encoding is supported by Java should
  not impact the XMP validation anyway because, according to the Adobe
  XMP Specification Part 3 (2020), page 19, the only allowed encoding
  for XMP in TIFF is UTF-8.

JHOVE does not, however, raise a TIFF-HUL-14 error in this case:

- The XMP is enclosed in a packet wrapper, the packet wrapper declares
  an encoding that *is* supported by Java, and the XMP contains invalid
  XML. This should of course be detected but is hard to implement with
  the current control flow.

So this commit should be seen as an improvement, but not a full solution.

Is this class actually used anywhere? But anyway, I just updated the XMP
processing code like everywhere else.

... to align with the rating in PDF. Files with invalid XMP metadata now
get an overall rating of "well-formed, but not valid". I think that's OK
for such a minor deficiency.

Codacy tells me "These nested if statements could be combined". Yes of
course they can, but does that make the code and documentation more
readable? Not at all. But if that's what it takes to satisfy the
soulless rule checker thingy, so be it ...

marhop commented Feb 26, 2025

Ah, I see. CI does more than just mvn test. Let me look into this and get back to you ...


marhop commented Jul 3, 2025

OK, sorry it took so long - priorities ... Now I finally found some time to investigate the bbt-jhove CI failures. Not to sound presumptuous but I think none of the errors indicate a problem with the changes in this PR.

First, several JPEG2000 files lead to an unhandled EOFException ...

  • test-root/corpora/errors/modules/JPEG2000-hul/bitwiser-icc-corrupted-tagcount-1911.jp2
  • test-root/corpora/errors/modules/JPEG2000-hul/bitwiser-icc-corrupted-tagcount-1951.jp2
  • test-root/corpora/errors/modules/JPEG2000-hul/bitwiser-icc-corrupted-tagcount-2021.jp2
  • test-root/corpora/errors/modules/JPEG2000-hul/bitwiser-icc-corrupted-tagcount-1971.jp2
  • test-root/corpora/errors/modules/JPEG2000-hul/bitwiser-icc-corrupted-tagcount-2011.jp2
  • test-root/corpora/errors/modules/JPEG2000-hul/bitwiser-icc-corrupted-tagcount-1984.jp2
  • test-root/corpora/errors/modules/JPEG2000-hul/bitwiser-icc-corrupted-tagcount-1961.jp2
  • test-root/corpora/errors/modules/JPEG2000-hul/bitwiser-icc-corrupted-tagcount-1920.jp2
  • test-root/corpora/errors/modules/JPEG2000-hul/bitwiser-icc-corrupted-tagcount-1937.jp2
  • test-root/corpora/errors/modules/JPEG2000-hul/meth_is_2_no_icc.jp2
  • test-root/corpora/errors/modules/JPEG2000-hul/bitwiser-icc-corrupted-tagcount-1999.jp2

... or an unhandled NullPointerException

  • test-root/corpora/errors/modules/JPEG2000-hul/openJPEG15.jp2

However, since I haven't messed with JPEG2000 in this PR and since these exceptions are thrown by the much older JHOVE v1.28 as well, I deny any responsibility for them. ;-)

Second, two PDF files do indeed fail the tests because validating them doesn't lead to the expected results.

  • test-root/targets/1.34/errors/modules/PDF-hul/pdf-hul-43-govdocs-486355.pdf.jhove.xml
    • Expected child nodelist length '23' but was '17' - comparing <repInfo...> at /jhove[1]/repInfo[1] to <repInfo...> at /jhove[1]/repInfo[1]
  • test-root/targets/1.34/errors/modules/PDF-hul/pdf-hul-22-govdocs-000187.pdf.jhove.xml
    • Expected child nodelist length '23' but was '17' - comparing <repInfo...> at /jhove[1]/repInfo[1] to <repInfo...> at /jhove[1]/repInfo[1]

This might be because JHOVE now detects erroneous XMP that it previously wasn't able to find, obviously changing the validation results. And in fact, JHOVE now throws PDF-HUL-101 errors for these files. However, in these specific cases this is not actually caused by invalid XMP data but by the fact that the XMP data is stored in encrypted content streams that JHOVE cannot handle. JHOVE doesn't decrypt the content streams but happily feeds the encrypted data to its XML parser which of course throws an exception that ultimately leads to the PDF-HUL-101 errors ...

From the XMP validation perspective this behaviour appears to be correct - the encrypted data isn't valid XMP but a seemingly random bitstream. This is of course far from perfect; I would rather know that JHOVE cannot read the content stream instead of being told that the XMP is invalid. However, dealing with encrypted content streams in general is very much out of scope of this PR, so I suggest you accept it as it is.

PS: We could of course stop validating XMP altogether if the content stream is encrypted. But as far as I can tell from the source code this cannot be reliably determined at the moment because the PdfModule._streamsEncrypted attribute is only set to true if the encryption dictionary contains the (optional) StmF key. This is probably another story that should not make this PR even larger ...
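
Just to spell that idea out, a purely hypothetical sketch; the class, method, and parameter names are invented for illustration and are not JHOVE's actual API:

    import java.util.Map;

    class EncryptedStreamGateSketch {

        // Mirrors the logic described above: streams are considered encrypted only
        // when the optional StmF entry is present in the encryption dictionary.
        // Because StmF is optional, a false result does not prove that the content
        // streams are unencrypted, which is why this is not a reliable gate for
        // skipping XMP validation.
        static boolean streamsEncrypted(Map<String, Object> encryptionDictionary) {
            return encryptionDictionary != null && encryptionDictionary.containsKey("StmF");
        }
    }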

@carlwilson

Hi @marhop, I'm starting to look at these test errors. Once I have some confidence in the results I'll patch the tests and look to merge this.
