Fix octal escape sequence handling in literal strings by prettybits · Pull Request #1026 · openpreserve/jhove

prettybits · 2025-04-17T17:40:27Z

Alternative Title: Properly expose the phantoms in a PDF (referencing #927)

These changes remove the assumption that the reverse solidus and the characters (i.e. bytes) immediately following it in escape sequences (including octal ones) can be encoded with multiple bytes. Escape sequences are therefore treated the same for UTF-16BE-encoded literal strings as for the other encodings.

Octal-encoded NUL bytes (i.e. \000) were also wrongly treated as invalid escape sequences. They are now treated as valid and thus properly taken into account as a leading zero byte in UTF-16BE-encoded literal strings.
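As a rough illustration of the behaviour described above (not the actual JHOVE code), the following sketch resolves octal escapes on raw bytes before any charset decoding, so that \000 contributes a real zero byte to a UTF-16BE string. The class and method names are hypothetical, the input string is made up for the example, and other escapes such as \n or \( are deliberately left out.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: octal escapes are resolved byte by byte, independent of encoding.
final class OctalEscapeSketch {

    static boolean isOctal(byte b) {
        return b >= '0' && b <= '7';
    }

    /** Replaces \d, \dd and \ddd escapes in the raw literal body with the byte they denote. */
    static byte[] resolveOctalEscapes(byte[] raw) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < raw.length) {
            if (raw[i] == '\\' && i + 1 < raw.length && isOctal(raw[i + 1])) {
                i++; // skip the reverse solidus
                int value = 0;
                int digits = 0;
                while (digits < 3 && i < raw.length && isOctal(raw[i])) {
                    value = value * 8 + (raw[i] - '0');
                    i++;
                    digits++;
                }
                out.write(value & 0xFF); // \000 yields a genuine NUL byte
            } else {
                out.write(raw[i] & 0xFF);
                i++;
            }
        }
        return out.toByteArray();
    }

    /** Decodes a literal body that starts with the UTF-16BE byte order mark FE FF. */
    static String decodeUtf16Be(byte[] raw) {
        byte[] bytes = resolveOctalEscapes(raw);
        if (bytes.length < 2 || (bytes[0] & 0xFF) != 0xFE || (bytes[1] & 0xFF) != 0xFF) {
            throw new IllegalArgumentException("no UTF-16BE byte order mark");
        }
        return new String(bytes, 2, bytes.length - 2, StandardCharsets.UTF_16BE);
    }

    public static void main(String[] args) {
        // Made-up literal body: BOM, then "Boo" with octal-escaped high bytes.
        byte[] raw = "\\376\\377\\000B\\000o\\000o".getBytes(StandardCharsets.US_ASCII);
        System.out.println(decodeUtf16Be(raw)); // prints "Boo"
    }
}
```

If \000 were rejected as invalid, the zero high bytes would be missing and the remaining bytes would pair up into the wrong UTF-16 code units.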

The JHOVE text output for the Phantom PDF test file from https://digitalpreservation.fi/en/2024-phantom-pdf-file is now the expected:

Jhove (Rel. 1.34.0-RC1, 2025-04-17)
 Date: 2025-04-17 19:13:21 CEST
 RepresentationInformation: /home/prettybits/Downloads/phantom_of_a_pdf_file_blog_post_2024.pdf
  ReportingModule: PDF-hul, Rel. 1.12.8 (2025-03-12)
  LastModified: 2025-04-17 19:12:54 CEST
  Size: 5906
  Format: PDF
  Version: 1.4
  Status: Well-Formed and valid
  SignatureMatches:
   PDF-hul
  MIMEtype: application/pdf
  PDFMetadata: 
   Objects: 11
   FreeObjects: 1
   IncrementalUpdates: 0
   DocumentCatalog: 
    PageLayout: SinglePage
    PageMode: UseNone
   Info: 
    Title: Boo
    Producer: PDF Phantom
    CreationDate: Tue Oct 29 14:43:30 CET 2024
   ID: 0x71a810587639eb130aefddee35e3c49d, 0x71a810587639eb130aefddee35e3c49d
   Filters: 
    FilterPipeline: FlateDecode
   Images: 
    Image: 
     NisoImageMetadata: 
      FormatName: image/png
      ImageWidth: 62
      ImageHeight: 32
      BitsPerSample: 8
      BitsPerSampleUnit: integer
     Filter: FlateDecode
     Intent: Perceptual
     Interpolate: true
   XMP: <x:xmpmeta xmlns:x='adobe:ns:meta/' x:xmptk='Image::ExifTool 12.71'>
<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'>
 <rdf:Description rdf:about=''
  xmlns:pdf='http://ns.adobe.com/pdf/1.3/'>
  <pdf:Producer>PDF Phantom</pdf:Producer>
 </rdf:Description>
</rdf:RDF>
</x:xmpmeta>
   Pages: 
    Page: 
     Sequence: 1

This PR doesn't introduce any changes to the handling of embedded (BCP-47) language escape sequences. From the looks of it, the logic there would also need some work.

@jmlehton @bitsgalore @petervwyatt

petervwyatt

Line 614 is incorrect, which implies further issues:

_pdfACompliant = false; // octal sequences must be 3 chars in PDF/A

You are confusing octal escape sequences in literal strings with hex strings, which are required to have an even number of digits in PDF/A (so there is no reliance on software knowing to add an extra zero digit). There are no limitations on literal string escape sequences in PDF/A.
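The hex-string rule mentioned here can be sketched as follows. This is only an illustration with hypothetical names, not JHOVE's parser: it decodes the content between `<` and `>`, pads a missing final digit with zero, and records whether the digit count was even so a caller could flag the PDF/A issue.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical sketch of hex-string decoding with an even-digit check.
final class HexStringSketch {
    final byte[] bytes;
    final boolean evenDigitCount; // false would be the PDF/A-relevant case

    private HexStringSketch(byte[] bytes, boolean evenDigitCount) {
        this.bytes = bytes;
        this.evenDigitCount = evenDigitCount;
    }

    private static boolean isPdfWhitespace(char c) {
        return c == 0 || c == '\t' || c == '\n' || c == '\f' || c == '\r' || c == ' ';
    }

    private static int hexValue(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        throw new IllegalArgumentException("not a hex digit: " + c);
    }

    /** Parses the text between '<' and '>'; whitespace between digits is ignored. */
    static HexStringSketch parse(String body) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int digits = 0;
        int high = 0;
        for (int i = 0; i < body.length(); i++) {
            char c = body.charAt(i);
            if (isPdfWhitespace(c)) {
                continue;
            }
            int v = hexValue(c);
            if (digits % 2 == 0) {
                high = v;
            } else {
                out.write((high << 4) | v);
            }
            digits++;
        }
        if (digits % 2 == 1) {
            out.write(high << 4); // missing final digit is read as 0
        }
        return new HexStringSketch(out.toByteArray(), digits % 2 == 0);
    }
}
```

For example, `HexStringSketch.parse("901FA")` yields the bytes 90 1F A0 with `evenDigitCount` set to false, while a literal string containing `\0` or `\53` would not be flagged at all.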

petervwyatt

Line 625 is also incorrect: literal string objects can span multiple lines by preceding any valid EOL sequence (1 or 2 bytes) with a reverse solidus. Quoting the text below Table 3 in ISO 32000:

A PDF writer may split a literal string across multiple lines. The REVERSE SOLIDUS (5Ch) (backslash character) at the end of a line shall be used to indicate that the string continues on the following line. A PDF processor shall disregard the REVERSE SOLIDUS and the end-of-line marker following it when reading the string; the resulting string value shall be identical to that which would be read if the string were not split.
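A minimal sketch of the quoted rule, assuming the literal body is available as raw bytes: a reverse solidus followed by CR, LF or CRLF is dropped together with the EOL marker, while every other escape is copied through untouched. The class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical sketch of line continuation handling in literal strings.
final class LineContinuationSketch {

    /** Removes "\" + EOL pairs so that a split string reads as if it had not been split. */
    static byte[] joinContinuedLines(byte[] raw) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < raw.length) {
            byte b = raw[i];
            if (b == '\\' && i + 1 < raw.length) {
                byte next = raw[i + 1];
                if (next == '\r' || next == '\n') {
                    // Skip the reverse solidus and a 1-byte (CR, LF) or 2-byte (CRLF) EOL marker.
                    boolean crlf = next == '\r' && i + 2 < raw.length && raw[i + 2] == '\n';
                    i += crlf ? 3 : 2;
                } else {
                    // Copy any other escape as a pair, so "\\" before an EOL is not misread.
                    out.write(b);
                    out.write(next);
                    i += 2;
                }
            } else {
                out.write(b);
                i++;
            }
        }
        return out.toByteArray();
    }
}
```

Copying other escapes as a pair matters for input such as `\\` followed by a newline: the second reverse solidus belongs to the escaped backslash, so the newline is kept.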

prettybits · 2025-04-22T14:53:09Z

@petervwyatt I didn't touch these lines initially because I wanted to keep the PR more focused. Your comments now led me down a path of more extensive fixes/changes surrounding String object handling. I tried to make the individual commits meaningful enough so they can be followed more easily.

I didn't write the lines you pointed out, but thanks for the remarks nevertheless. I changed the code not to mark the PDF as not PDF/A-compliant in the presence of shorter octal escape codes and added such marking for the odd digits case in hexadecimal strings. Line continuations with backslashes now also take into account all EOL markers.

After finding more unhandled edge cases in the handling of parentheses and escapes, combined with the way encodings were handled, I decided to rework the logic there so that literal strings are processed more cleanly before the String is decoded with the detected encoding. I'm now using Java's existing Charset definitions from the java.nio.charset.StandardCharsets class to do the final decoding, and I introduced a custom PDFDocEncodingCharset for the default encoding, for both literal and hexadecimal strings.

This also includes support for UTF-8 strings now, although currently without validating that the PDF version is >= 2.0. I'm not sure whether there are PDFs using UTF-8 despite announcing themselves as being of an older version?
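To illustrate the idea of decoding only after all byte-level processing is done, here is a sketch that picks a charset from the leading byte order mark. It is not the PR's code: the class name is hypothetical, and ISO_8859_1 is used purely as a stand-in for the custom PDFDocEncodingCharset, which differs from Latin-1 in several code points.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of BOM-based charset selection for already-unescaped string bytes.
final class TextStringDecodingSketch {

    // Assumption: stand-in only; PDFDocEncoding is NOT identical to ISO-8859-1.
    static final Charset FALLBACK = StandardCharsets.ISO_8859_1;

    private static boolean startsWith(byte[] bytes, int... prefix) {
        if (bytes.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((bytes[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    /** Decodes string bytes after escapes (literal) or hex digits (hex string) were resolved. */
    static String decode(byte[] bytes) {
        if (startsWith(bytes, 0xFE, 0xFF)) {
            return new String(bytes, 2, bytes.length - 2, StandardCharsets.UTF_16BE);
        }
        if (startsWith(bytes, 0xEF, 0xBB, 0xBF)) {
            return new String(bytes, 3, bytes.length - 3, StandardCharsets.UTF_8);
        }
        return new String(bytes, FALLBACK);
    }
}
```

Because the method only sees bytes, it works the same way for literal and hexadecimal strings. A check of the declared PDF version for the UTF-8 branch is omitted, matching the open question above.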

Finally, I found that some error handling still assumed that readChar() might return -1, even though its current implementation throws an EOFException instead. Fixing the affected code locations means that PDF-HUL-10 ("Unterminated literal in PDF file") is now reported again as originally intended.
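The difference between the two end-of-input conventions can be sketched like this. `DataInputStream.readUnsignedByte()` stands in for JHOVE's readChar() here because it also throws EOFException rather than returning -1; the class name, the exception type and the message format are assumptions made for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical sketch: end of input surfaces as an exception, not as a -1 return value.
final class UnterminatedLiteralSketch {

    /** Reads the body of a literal string; input starts right after the opening '('. */
    static String readLiteralBody(byte[] afterOpenParen) {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(afterOpenParen));
        StringBuilder body = new StringBuilder();
        int depth = 1; // balanced, unescaped parentheses may nest
        try {
            while (true) {
                int b = in.readUnsignedByte(); // throws EOFException at end of input
                if (b == '\\') {
                    // Keep the escape pair as is; an escaped parenthesis does not affect depth.
                    body.append((char) b).append((char) in.readUnsignedByte());
                    continue;
                }
                if (b == '(') {
                    depth++;
                } else if (b == ')' && --depth == 0) {
                    return body.toString();
                }
                body.append((char) b);
            }
        } catch (EOFException e) {
            // A check such as "if (b == -1)" would never fire with this reader.
            throw new IllegalStateException("PDF-HUL-10: Unterminated literal in PDF file", e);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling `readLiteralBody` on the bytes of `Boo)` returns `Boo`, while the bytes of `Boo` without a closing parenthesis end in the PDF-HUL-10 error instead of silently returning.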

petervwyatt · 2025-04-23T01:21:45Z

@prettybits I don't know the JHOVE code well enough to understand these new, larger changes, so I cannot comment further. FYI, the only two PDF tokens that can span lines are literal strings (with the \ + EOL sequence) and hex strings (since whitespace is skipped and all EOL sequences count as whitespace). Generally speaking, literal and hex strings are fully interchangeable unless the spec states otherwise (which it does in very few places), so don't assume a date string (for example) has to be a literal string! It doesn't! Unicode vs. ASCII vs. byte string is a matter of how the data in the string is interpreted, regardless of whether it is literal or hex. Hope that helps.

"I'm not sure if there are PDFs using UTF-8 despite announcing themselves as being of an older version?"

Unfortunately, there are definitely PDFs like this (although they are unlikely to be PDF/A, since validators check such things). There are 1.x PDFs that have strings with Unicode BOMs for UTF-8, UTF-16LE, UTF-16BE, and UTF-32, even though UTF-16BE was the only one specified. Several interactive viewers also silently support these malformations.

carlwilson

So @prettybits, I'm happy that the code is better following this contribution. I'm now going to do some local testing against a few PDF corpora. I'll check the changed results and confirm that the changes are for the better. That will take a few days.

prettybits added 2 commits April 17, 2025 18:24

allow octal escape sequences for NUL characters in literal strings

96c02d0

don't treat backslash character as multi-byte for escape sequences in…

a861685

… UTF-16BE-encoded string literals

petervwyatt reviewed Apr 18, 2025

prettybits added 10 commits April 18, 2025 10:50

fix PDF/A compliance marking of string objects

58633ca

- don't mark literal strings with octal escape sequences of less than 3 characters as non-compliant - mark hexadecimal strings with an odd number of digits as non-compliant

handle all EOL marker variants after backslash for line continuation …

e519f8f

…in literal strings

process parentheses and escape sequences in literal strings as charac…

fe5c045

…ters independent of encoding

treat all non-escaped EOL marker variants in literal strings as line …

f9fec26

…feeds

only process bytes and use proper charset-based decoding for detected…

c297073

… encoding of literal strings

mark undefined code points in the PDFDocEncoding character set as una…

4815215

…ssigned

support UTF-8 encoded literal and hexadecimal strings

9c03d94

fix EOF handling to properly report PDF-HUL-10 for unclosed literal s…

a011153

…trings

add CRLF as valid EOL marker in comments

6ce6d35

add partial UTF-8 marker bytes to buffer when encountering closing pa…

c1a7b9f

…renthesis of a literal string early

carlwilson added this to the JHOVE 1.36 milestone Jul 4, 2025

Merge branch 'integration' into fix-string-escape-handling

e37f6cc

carlwilson reviewed Jul 28, 2025

prettybits wants to merge 13 commits into openpreserve:integration from prettybits:fix-string-escape-handling
