Fix octal escape sequence handling in literal strings #1026

Open
prettybits wants to merge 13 commits into openpreserve:integration from prettybits:fix-string-escape-handling

@prettybits (Contributor) commented Apr 17, 2025

Alternative Title: Properly expose the phantoms in a PDF (referencing #927)

These changes remove the assumption that the reverse solidus and the characters (i.e. bytes) immediately following it in an escape sequence (including octal ones) can each be encoded with multiple bytes. Escape sequences are now treated the same way in UTF-16BE encoded literal strings as in strings with any other encoding.

Octal-encoded NUL bytes (i.e. \000) were also wrongly treated as invalid escape sequences. They are now treated as valid and are therefore properly taken into account as a leading zero byte in UTF-16BE encoded literal strings.
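To illustrate the byte-level handling described above, here is a minimal sketch of resolving octal escapes before any character decoding takes place. It is not the code from this PR: the class name `OctalEscapeSketch` and the method `unescapeOctal` are hypothetical, and all non-octal escape sequences are deliberately copied through unchanged to keep the example short.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical illustration, not JHOVE code.
final class OctalEscapeSketch {

    /** Resolves octal escapes (\d, \dd, \ddd) in the raw bytes found between the parentheses. */
    static byte[] unescapeOctal(byte[] raw) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < raw.length) {
            int b = raw[i++] & 0xFF;
            if (b != '\\' || i >= raw.length) {
                out.write(b);                 // ordinary byte (or a trailing backslash)
            } else if (isOctalDigit(raw[i])) {
                int value = 0;
                int digits = 0;
                // One to three octal digits are allowed; \000 is a valid NUL byte.
                while (digits < 3 && i < raw.length && isOctalDigit(raw[i])) {
                    value = (value << 3) | (raw[i++] - '0');
                    digits++;
                }
                out.write(value & 0xFF);      // keep the low-order eight bits only
            } else {
                // Assumption: other escapes are left unresolved in this sketch.
                out.write(b);
                out.write(raw[i++] & 0xFF);
            }
        }
        return out.toByteArray();
    }

    private static boolean isOctalDigit(byte b) {
        return b >= '0' && b <= '7';
    }
}
```

Feeding it the bytes of `\376\377\000B\000o\000o` yields FE FF 00 42 00 6F 00 6F, which a UTF-16BE decoder reads as "Boo" once the byte order mark is skipped. Because the escapes are resolved on single bytes first, the `\000` sequences end up as the leading zero byte of each UTF-16BE code unit.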

The JHOVE text output for the Phantom PDF test file from https://digitalpreservation.fi/en/2024-phantom-pdf-file is now as expected:

Jhove (Rel. 1.34.0-RC1, 2025-04-17)
 Date: 2025-04-17 19:13:21 CEST
 RepresentationInformation: /home/prettybits/Downloads/phantom_of_a_pdf_file_blog_post_2024.pdf
  ReportingModule: PDF-hul, Rel. 1.12.8 (2025-03-12)
  LastModified: 2025-04-17 19:12:54 CEST
  Size: 5906
  Format: PDF
  Version: 1.4
  Status: Well-Formed and valid
  SignatureMatches:
   PDF-hul
  MIMEtype: application/pdf
  PDFMetadata: 
   Objects: 11
   FreeObjects: 1
   IncrementalUpdates: 0
   DocumentCatalog: 
    PageLayout: SinglePage
    PageMode: UseNone
   Info: 
    Title: Boo
    Producer: PDF Phantom
    CreationDate: Tue Oct 29 14:43:30 CET 2024
   ID: 0x71a810587639eb130aefddee35e3c49d, 0x71a810587639eb130aefddee35e3c49d
   Filters: 
    FilterPipeline: FlateDecode
   Images: 
    Image: 
     NisoImageMetadata: 
      FormatName: image/png
      ImageWidth: 62
      ImageHeight: 32
      BitsPerSample: 8
      BitsPerSampleUnit: integer
     Filter: FlateDecode
     Intent: Perceptual
     Interpolate: true
   XMP: <x:xmpmeta xmlns:x='adobe:ns:meta/' x:xmptk='Image::ExifTool 12.71'>
<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'>
 <rdf:Description rdf:about=''
  xmlns:pdf='http://ns.adobe.com/pdf/1.3/'>
  <pdf:Producer>PDF Phantom</pdf:Producer>
 </rdf:Description>
</rdf:RDF>
</x:xmpmeta>
   Pages: 
    Page: 
     Sequence: 1

This PR doesn't introduce any changes to the handling of embedded (BCP-47) language escape sequences. From the looks of it, the logic there would also need some work.

@jmlehton @bitsgalore @petervwyatt

@petervwyatt petervwyatt left a comment

Line 614 is incorrect which implies further issues:

_pdfACompliant = false; // octal sequences must be 3 chars in PDF/A

You are confusing octal escape sequences in literal strings with hex-strings, which are required to have an even number of digits in PDF/A (so there is no reliance on software knowing to add an extra zero digit). There are no limitations on literal string escape sequences in PDF/A.
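As a rough illustration of the hex-string rule mentioned in this comment, the sketch below reads the text between `<` and `>`, pads a missing final digit with zero, and records whether the digit count was odd so that a caller could flag the file as not PDF/A compliant. The names `HexStringSketch`, `Result`, and `parse` are hypothetical and do not reflect JHOVE's actual classes.

```java
// Hypothetical illustration, not JHOVE code.
final class HexStringSketch {

    /** Decoded bytes plus a flag a caller could use for PDF/A checks. */
    static final class Result {
        final byte[] bytes;
        final boolean oddDigitCount;

        Result(byte[] bytes, boolean oddDigitCount) {
            this.bytes = bytes;
            this.oddDigitCount = oddDigitCount;
        }
    }

    /** Parses the text between '<' and '>' of a hexadecimal string. */
    static Result parse(String body) {
        StringBuilder digits = new StringBuilder();
        for (char c : body.toCharArray()) {
            if (isHexDigit(c)) {
                digits.append(c);
            } else if (!isPdfWhitespace(c)) {
                throw new IllegalArgumentException("Not a hex digit: " + c);
            }
            // PDF white-space characters between digits are skipped.
        }
        boolean odd = digits.length() % 2 != 0;
        if (odd) {
            digits.append('0'); // a missing final digit is assumed to be 0
        }
        byte[] bytes = new byte[digits.length() / 2];
        for (int i = 0; i < bytes.length; i++) {
            bytes[i] = (byte) Integer.parseInt(digits.substring(2 * i, 2 * i + 2), 16);
        }
        return new Result(bytes, odd);
    }

    private static boolean isHexDigit(char c) {
        return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F');
    }

    private static boolean isPdfWhitespace(char c) {
        return c == 0x00 || c == 0x09 || c == 0x0A || c == 0x0C || c == 0x0D || c == 0x20;
    }
}
```

With this sketch, `parse("901FA")` produces the bytes 90 1F A0 and sets `oddDigitCount`, while octal escapes in literal strings never influence such a flag.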

@petervwyatt petervwyatt left a comment

Line 625 is also incorrect: literal string objects can span multiple lines by preceding any valid EOL sequence (1 or 2 bytes) with a REVERSE SOLIDUS. Quoting the text below Table 3 in ISO 32000:

A PDF writer may split a literal string across multiple lines. The REVERSE SOLIDUS (5Ch) (backslash character) at the end of a line shall be used to indicate that the string continues on the following line. A PDF processor shall disregard the REVERSE SOLIDUS and the end-of-line marker following it when reading the string; the resulting string value shall be identical to that which would be read if the string were not split.
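The continuation rule quoted above could be sketched as follows. This is only an illustration under stated assumptions: `LineContinuationSketch` and `joinContinuedLines` are hypothetical names, the input is the raw byte content between the parentheses, and every other escape sequence is copied through untouched.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical illustration, not JHOVE code.
final class LineContinuationSketch {

    /** Drops each REVERSE SOLIDUS that is directly followed by CR, LF, or CR LF, together with that EOL. */
    static byte[] joinContinuedLines(byte[] raw) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < raw.length) {
            byte b = raw[i];
            if (b == '\\' && i + 1 < raw.length) {
                byte next = raw[i + 1];
                if (next == '\r') {
                    // CR alone is one EOL byte, CR LF is two.
                    boolean crlf = i + 2 < raw.length && raw[i + 2] == '\n';
                    i += crlf ? 3 : 2;
                } else if (next == '\n') {
                    i += 2;
                } else {
                    // Any other escape (including an escaped backslash) is kept as-is.
                    out.write(b);
                    out.write(next);
                    i += 2;
                }
            } else {
                out.write(b);
                i++;
            }
        }
        return out.toByteArray();
    }
}
```

Copying the byte after a non-continuation backslash in the same step matters: it keeps an escaped backslash followed by a real line break from being mistaken for a continuation.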

@prettybits

@petervwyatt I didn't touch these lines initially because I wanted to keep the PR more focused. Your comments now led me down a path of more extensive fixes/changes surrounding String object handling. I tried to make the individual commits meaningful enough so they can be followed more easily.

I didn't write the lines you pointed out, but thanks for the remarks nevertheless. I changed the code not to mark the PDF as not PDF/A-compliant in the presence of shorter octal escape codes and added such marking for the odd digits case in hexadecimal strings. Line continuations with backslashes now also take into account all EOL markers.

After finding more unhandled edge cases in the handling of parentheses and escapes, combined with the way encodings were handled, I decided to rework the logic there so that literal strings are fully processed at the byte level before the String is decoded with the detected encoding. I'm now using Java's existing Charset definitions from the java.nio.charset.StandardCharsets class to do the final decoding, and I introduced a custom PDFDocEncodingCharset for the default encoding, for both literal and hexadecimal strings.
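A rough sketch of the "unescape first, decode last" idea: once the string bytes are final, the encoding is picked from a leading byte order mark and decoding is delegated to a standard `Charset`. The class `TextStringDecodingSketch` and its `decode` method are hypothetical. The PR's custom PDFDocEncodingCharset is not reproduced here, so the caller passes a fallback charset and the usage below substitutes ISO-8859-1 purely as a stand-in.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration, not JHOVE code.
final class TextStringDecodingSketch {

    /**
     * Decodes already-unescaped string bytes. The fallback stands in for the
     * default encoding (PDFDocEncoding in a real PDF parser).
     */
    static String decode(byte[] bytes, Charset fallback) {
        if (startsWith(bytes, 0xFE, 0xFF)) {
            // UTF-16BE byte order mark
            return new String(bytes, 2, bytes.length - 2, StandardCharsets.UTF_16BE);
        }
        if (startsWith(bytes, 0xEF, 0xBB, 0xBF)) {
            // UTF-8 byte order mark
            return new String(bytes, 3, bytes.length - 3, StandardCharsets.UTF_8);
        }
        return new String(bytes, fallback);
    }

    private static boolean startsWith(byte[] bytes, int... prefix) {
        if (bytes.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((bytes[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

For example, `decode(new byte[] {(byte) 0xFE, (byte) 0xFF, 0x00, 0x42}, StandardCharsets.ISO_8859_1)` returns "B". Because the method only sees bytes, it works the same whether they came from a literal or a hexadecimal string.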

This also includes support for UTF-8 strings now, although currently without validating whether the PDF version is >= 2.0. I'm not sure if there are PDFs using UTF-8 despite announcing themselves as being of an older version?

Finally, I found that some error handling still assumed that readChar() might return -1, even though its current implementation throws an EOFException instead. Fixing the affected code locations means that PDF-HUL-10 ("Unterminated literal in PDF file") is now being reported as originally intended again.
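The pattern behind that last fix can be sketched like this. The `readChar` helper below is a hypothetical stand-in that mimics the described behaviour (throwing `EOFException` rather than returning -1), and `isTerminated` is an illustrative name. Neither is JHOVE's actual code.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical illustration, not JHOVE code.
final class UnterminatedLiteralSketch {

    /** Stand-in reader: signals end of input with an exception, never with -1. */
    static int readChar(InputStream in) throws IOException {
        int c = in.read();
        if (c < 0) {
            throw new EOFException();
        }
        return c;
    }

    /** Returns false when the input ends before the literal string's closing parenthesis. */
    static boolean isTerminated(byte[] afterOpeningParen) {
        InputStream in = new ByteArrayInputStream(afterOpeningParen);
        int depth = 1; // the opening '(' has already been consumed
        try {
            while (depth > 0) {
                int c = readChar(in);
                if (c == '\\') {
                    readChar(in);   // skip the escaped byte
                } else if (c == '(') {
                    depth++;        // balanced parentheses may nest
                } else if (c == ')') {
                    depth--;
                }
            }
            return true;
        } catch (EOFException e) {
            // A check such as `c == -1` would never fire with this reader;
            // the catch block is where an "unterminated literal" error belongs.
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```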

@petervwyatt

@prettybits I don't know the JHOVE code well enough to understand these new, larger changes, so I cannot comment further. FYI the only 2 PDF tokens that can span lines are literal strings (with the \ + EOL sequence) and hex-strings (since whitespace is skipped and all EOL sequences count as whitespace). Generally speaking, literal and hex strings are fully interchangeable unless the spec states otherwise (which it does in very few places), so don't assume a date string (for example) has to be a literal string! It doesn't! Unicode vs ASCII vs byte string is the interpretation of the data in the string, regardless of whether it is literal or hex. Hope that helps.

> I'm not sure if there are PDFs using UTF-8 despite announcing themselves as being of an older version?

Unfortunately, there are definitely PDFs like this (although they are unlikely to be PDF/A, due to validators checking such things). There are 1.x PDFs that have strings with Unicode BOMs for UTF-8, UTF-16LE, UTF-16BE, and UTF-32, even though UTF-16BE was the only one spec'd. Several interactive viewers also silently support these malformations.

@carlwilson carlwilson added this to the JHOVE 1.36 milestone Jul 4, 2025

@carlwilson carlwilson left a comment

So @prettybits, I'm happy that the code is better following this contribution. I'm now going to do some local testing against a few PDF corpora. I'll check the changed results and confirm that the changes are for the better. That will take a few days.
