Fix octal escape sequence handling in literal strings #1026
prettybits wants to merge 13 commits into openpreserve:integration
… UTF-16BE-encoded string literals
petervwyatt
left a comment
Line 614 is incorrect which implies further issues:
_pdfACompliant = false; // octal sequences must be 3 chars in PDF/A
You are confusing octal escape sequences in literal strings with hex strings, which are required to have an even number of digits in PDF/A (so there is no reliance on software knowing to add an extra zero digit). There are no limitations on literal string escape sequences in PDF/A.
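To illustrate the distinction drawn above, here is a minimal sketch of the hex string check only. The class and method names are hypothetical and not taken from JHOVE; it assumes the token is passed as a `String` including its `<` and `>` delimiters, and it deliberately applies no length rule to octal escapes.

```java
// Hypothetical helper for illustration; not JHOVE's actual code.
final class HexStringDigits {

    /** White-space characters as defined for PDF syntax (NUL, HT, LF, FF, CR, SP). */
    static boolean isPdfWhiteSpace(char c) {
        return c == 0x00 || c == 0x09 || c == 0x0A || c == 0x0C || c == 0x0D || c == 0x20;
    }

    static boolean isHexDigit(char c) {
        return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F');
    }

    /** Counts the hex digits between '<' and '>', skipping white space. */
    static int countHexDigits(String token) {
        int last = token.length() - 1;
        if (last < 1 || token.charAt(0) != '<' || token.charAt(last) != '>') {
            throw new IllegalArgumentException("not a hex string: " + token);
        }
        int digits = 0;
        for (int i = 1; i < last; i++) {
            char c = token.charAt(i);
            if (isHexDigit(c)) {
                digits++;
            } else if (!isPdfWhiteSpace(c)) {
                throw new IllegalArgumentException("invalid character in hex string");
            }
        }
        return digits;
    }

    /** An odd digit count relies on the reader appending a final 0, which PDF/A disallows. */
    static boolean hasEvenDigitCount(String token) {
        return countHexDigits(token) % 2 == 0;
    }
}
```

With this sketch, `hasEvenDigitCount("<901FA3>")` is true and `hasEvenDigitCount("<901FA>")` is false; only the second would be a candidate for a PDF/A non-compliance flag.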
petervwyatt
left a comment
Line 625 is also incorrect: literal string objects can span multiple lines by preceding any valid EOL sequence (1 or 2 bytes) with a REVERSE SOLIDUS. Quoting the text below Table 3 in ISO 32000:
A PDF writer may split a literal string across multiple lines. The REVERSE SOLIDUS (5Ch) (backslash character) at the end of a line shall be used to indicate that the string continues on the following line. A PDF processor shall disregard the REVERSE SOLIDUS and the end-of-line marker following it when reading the string; the resulting string value shall be identical to that which would be read if the string were not split.
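The quoted rule can be sketched roughly as follows. This is an illustration under assumptions, not the code in this PR: the class and method names are made up, and the input is assumed to be the raw bytes between the outer parentheses of a literal string. Other escape sequences are copied through untouched so that an escaped backslash followed by an EOL is not mistaken for a continuation.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical helper for illustration; not JHOVE's actual code.
final class LineContinuationSketch {

    /** Drops every backslash that is directly followed by CR, LF, or CR LF, together with that EOL. */
    static byte[] removeLineContinuations(byte[] raw) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(raw.length);
        int i = 0;
        while (i < raw.length) {
            byte b = raw[i];
            if (b == '\\' && i + 1 < raw.length) {
                byte next = raw[i + 1];
                if (next == '\r') {
                    // Skip backslash and CR, plus the LF of a two-byte CR LF marker.
                    boolean crlf = i + 2 < raw.length && raw[i + 2] == '\n';
                    i += crlf ? 3 : 2;
                    continue;
                }
                if (next == '\n') {
                    i += 2; // skip backslash and LF
                    continue;
                }
                // Any other escape: keep both bytes for a later unescaping step.
                out.write(b);
                out.write(next);
                i += 2;
                continue;
            }
            out.write(b);
            i++;
        }
        return out.toByteArray();
    }
}
```

An EOL that is not preceded by a backslash is left in place here; normalising it is a separate step that this sketch does not cover.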
- don't mark literal strings with octal escape sequences of less than 3 characters as non-compliant
- mark hexadecimal strings with an odd number of digits as non-compliant
…in literal strings
…ters independent of encoding
… encoding of literal strings
…renthesis of a literal string early
@petervwyatt I didn't touch these lines initially because I wanted to keep the PR more focused. Your comments then led me down a path of more extensive fixes and changes surrounding String object handling. I tried to make the individual commits meaningful enough that they can be followed more easily. I didn't write the lines you pointed out, but thanks for the remarks nevertheless.
I changed the code so that it no longer marks the PDF as not PDF/A-compliant in the presence of shorter octal escape codes, and added such marking for hexadecimal strings with an odd number of digits. Line continuations with backslashes now also take all EOL markers into account.
After finding more unhandled edge cases in the handling of parentheses and escapes, combined with the way encodings were handled, I decided to rework the logic there so that literal strings are processed more cleanly before the String is decoded with the detected encoding. I'm using Java's existing … This also includes support for UTF-8 strings now, although currently without validating whether the PDF version is >=2.0. I'm not sure if there are PDFs using UTF-8 despite announcing themselves as being of an older version?
Finally, I found that some error handling still assumed that …
@prettybits I don't know the JHOVE code well enough to understand these new, larger changes so I cannot comment further. FYI the only 2 PDF tokens that can span lines are literal strings with the …
Unfortunately, there are definitely PDFs like this (although they are unlikely to be PDF/A, since validators check such things). There are 1.x PDFs that have strings with Unicode BOMs for UTF-8, UTF-16LE, UTF-16BE, and UTF-32, even though UTF-16BE was the only one spec'd. Several interactive viewers also silently support these malformations.
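For reference, a tolerant reader that wanted to recognise the BOMs listed above could sniff the leading bytes of the already unescaped string roughly like this. It is a sketch only: the class, enum, and method names are hypothetical, and it says nothing about which of these encodings JHOVE accepts or reports.

```java
// Hypothetical helper for illustration; not JHOVE's actual code.
final class BomSniffer {

    enum Bom { UTF8, UTF16BE, UTF16LE, UTF32BE, UTF32LE, NONE }

    /** True if data begins with the given byte values. */
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    /**
     * Guesses the encoding from a leading byte order mark.
     * The UTF-32LE mark (FF FE 00 00) begins with the UTF-16LE mark (FF FE),
     * so the longer pattern is tested first. FF FE 00 00 could also be
     * UTF-16LE followed by U+0000; this sketch does not resolve that ambiguity.
     */
    static Bom detect(byte[] data) {
        if (startsWith(data, 0x00, 0x00, 0xFE, 0xFF)) return Bom.UTF32BE;
        if (startsWith(data, 0xFF, 0xFE, 0x00, 0x00)) return Bom.UTF32LE;
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) return Bom.UTF8;
        if (startsWith(data, 0xFE, 0xFF)) return Bom.UTF16BE;
        if (startsWith(data, 0xFF, 0xFE)) return Bom.UTF16LE;
        return Bom.NONE;
    }
}
```

A validator could use the result to decide between decoding leniently and reporting the string as malformed for the declared PDF version; that policy choice is outside this sketch.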
carlwilson
left a comment
So @prettybits, I'm happy that the code is better following this contribution. I'm now going to do some local testing against a few PDF corpora. I'll check the changed results and confirm that the changes are for the better. That will take a few days.
Alternative Title: Properly expose the phantoms in a PDF (referencing #927)
These changes remove the assumption that the reverse solidus and immediately following characters (i.e. bytes) in escape sequences (incl. octal) can be encoded with multiple bytes, so escape sequences are treated the same for UTF-16BE encoded literal strings as for the other encodings.
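A minimal sketch of that idea, assuming the raw bytes between the outer parentheses are available: escape sequences are resolved byte by byte first, and only the resulting bytes are decoded. The names are hypothetical and this is not the code from this PR; ISO-8859-1 stands in for PDFDocEncoding, and UTF-8 detection is left out for brevity.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not JHOVE's actual code.
final class LiteralStringSketch {

    /** Resolves escape sequences on single bytes, independent of the string's encoding. */
    static byte[] unescape(byte[] raw) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(raw.length);
        int i = 0;
        while (i < raw.length) {
            int b = raw[i++] & 0xFF;
            if (b != '\\') {
                out.write(b);
                continue;
            }
            if (i >= raw.length) {
                break; // dangling backslash at the end: nothing left to escape
            }
            int e = raw[i++] & 0xFF;
            if (e >= '0' && e <= '7') {
                // Octal escape with one to three digits; \0, \00 and \000 all yield a NUL byte.
                int value = e - '0';
                int digits = 1;
                while (digits < 3 && i < raw.length && raw[i] >= '0' && raw[i] <= '7') {
                    value = value * 8 + (raw[i++] - '0');
                    digits++;
                }
                out.write(value & 0xFF); // high-order overflow is ignored
                continue;
            }
            switch (e) {
                case 'n': out.write('\n'); break;
                case 'r': out.write('\r'); break;
                case 't': out.write('\t'); break;
                case 'b': out.write('\b'); break;
                case 'f': out.write('\f'); break;
                case '\r': // line continuation, CR or CR LF
                    if (i < raw.length && raw[i] == '\n') i++;
                    break;
                case '\n': // line continuation, LF
                    break;
                default: // '(', ')', '\\' and unknown escapes: the backslash is dropped
                    out.write(e);
            }
        }
        return out.toByteArray();
    }

    /** Decodes after unescaping; a leading FE FF selects UTF-16BE. */
    static String decode(byte[] raw) {
        byte[] bytes = unescape(raw);
        if (bytes.length >= 2 && (bytes[0] & 0xFF) == 0xFE && (bytes[1] & 0xFF) == 0xFF) {
            return new String(bytes, 2, bytes.length - 2, StandardCharsets.UTF_16BE);
        }
        return new String(bytes, StandardCharsets.ISO_8859_1); // stand-in for PDFDocEncoding
    }
}
```

For example, the literal content `\376\377\000A\000\102` unescapes to the bytes FE FF 00 41 00 42, which this sketch decodes to `AB`.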
Octal-encoded NUL bytes (i.e. \000) were also wrongly treated as invalid escape sequences; these are now treated as valid and thus properly taken into account as a leading zero byte in UTF-16BE encoded literal strings.
The JHOVE text output for the Phantom PDF test file from https://digitalpreservation.fi/en/2024-phantom-pdf-file is now the expected:
This PR doesn't introduce any changes to the handling of embedded (BCP-47) language escape sequences; from the looks of it, the logic there would also need some work.
@jmlehton @bitsgalore @petervwyatt