@richard-smith-preservica

See the thread I started in #18637 (no replies so far).

I am opening this not because I expect it to be immediately accepted, but to start that conversation.

What is in this PR to start with:

Any PDF which

  • says it has N pages,
  • has a top-level /Pages dictionary pointing to exactly N children, and
  • has N >= 20 (tuneable)

will, when you ask it for page X, return the Xth item from that top-level /Pages dictionary without traversing the whole structure. This assumes that a /Pages dictionary with 500 entries, in a document that says it has 500 pages, almost certainly contains 500 /Page references and nothing else. As explained in the discussion post, when rendering an interleaved PDF (where the /Page dictionary for each page comes before its content), the performance hit of traversing the whole /Pages structure because we don't trust that assumption is unacceptably high.
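To make the shortcut concrete, here is a minimal sketch of the idea on a simplified in-memory page tree. It is not the code in this PR and does not use pdf.js's real data structures: `PageNode`, `PagesNode`, `getPage`, `getPageByTraversal` and `MIN_PAGES_FOR_FAST_PATH` are all hypothetical names, and the `resolved` counter only stands in for the cost of fetching each entry. The check that the selected entry really is a /Page is an addition of this sketch; it does not catch every pathological tree.

```typescript
// Simplified model of a PDF page tree (hypothetical types, not pdf.js internals).
type PageNode = { type: "Page"; label: string };
type PagesNode = { type: "Pages"; count: number; kids: TreeNode[] };
type TreeNode = PageNode | PagesNode;

// Hypothetical constant standing in for the tuneable "N >= 20".
const MIN_PAGES_FOR_FAST_PATH = 20;

// Baseline: depth-first walk in document order. Every entry that has to be
// looked at is counted in stats.resolved.
function getPageByTraversal(
  root: PagesNode,
  index: number,
  stats: { resolved: number }
): PageNode {
  let remaining = index;
  // Reverse so that pop() yields kids in document order.
  const stack: TreeNode[] = [...root.kids].reverse();
  while (stack.length > 0) {
    const node = stack.pop() as TreeNode;
    stats.resolved++;
    if (node.type === "Page") {
      if (remaining === 0) {
        return node;
      }
      remaining--;
    } else {
      stack.push(...[...node.kids].reverse());
    }
  }
  throw new RangeError(`Page index ${index} not found`);
}

// Shortcut: if the declared page count equals the number of top-level kids
// and is large enough, assume the tree is flat and index directly.
function getPage(
  root: PagesNode,
  index: number,
  stats: { resolved: number }
): PageNode {
  const looksFlat =
    root.count >= MIN_PAGES_FOR_FAST_PATH && root.kids.length === root.count;
  if (looksFlat) {
    const kid = root.kids[index];
    stats.resolved++;
    if (kid !== undefined && kid.type === "Page") {
      return kid;
    }
    // The selected entry is not a page: fall back to the full walk.
  }
  return getPageByTraversal(root, index, stats);
}

// Helper for building a flat tree of n pages labelled p0, p1, ...
function flatTree(n: number): PagesNode {
  const kids: TreeNode[] = [];
  for (let i = 0; i < n; i++) {
    kids.push({ type: "Page", label: `p${i}` });
  }
  return { type: "Pages", count: n, kids };
}
```

In this model, asking a flat 500-page tree for its last page looks at one entry on the shortcut path, where the full walk looks at all 500.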

We have been running a custom build of pdf.js in our renderer with this change for a year with no reported issues.

I understand that you could create a pathological PDF where this is intentionally untrue (in which case asking for a page might theoretically return the wrong one), so the question is how much protection we need against that.

  • The higher N is, the less likely it is that a PDF will, just by coincidence, have a /Pages dictionary containing N entries that aren't all pages.
  • It could be an option set in the calling context, so the embedder could choose to turn it on. For example, web portals (like ours) pulling content over the Internet might want it, but local uses of PDF.js might want to keep the current tree traversal, as the cost will be lower.
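As a rough illustration of the two knobs above, the sketch below gates the shortcut behind an opt-in flag and a tuneable threshold. `FlatPagesOptions`, `assumeFlatPageTree`, `minPages` and `shouldUseFlatLookup` are hypothetical names made up for this example, not existing PDF.js options, and defaulting the flag to off is an assumption.

```typescript
// Hypothetical embedder-facing options; none of these exist in PDF.js today.
interface FlatPagesOptions {
  // Opt in to treating the top-level /Pages entries as the pages themselves.
  assumeFlatPageTree: boolean;
  // Minimum declared page count before the shortcut is considered (the "N").
  minPages: number;
}

// Assumed defaults: off unless the embedder asks for it, threshold of 20.
const DEFAULT_FLAT_PAGES_OPTIONS: FlatPagesOptions = {
  assumeFlatPageTree: false,
  minPages: 20,
};

// Decide whether the direct-index shortcut should be used for a document.
function shouldUseFlatLookup(
  declaredPageCount: number,
  topLevelKidCount: number,
  options: Partial<FlatPagesOptions> = {}
): boolean {
  const enabled =
    options.assumeFlatPageTree ?? DEFAULT_FLAT_PAGES_OPTIONS.assumeFlatPageTree;
  const minPages = options.minPages ?? DEFAULT_FLAT_PAGES_OPTIONS.minPages;
  return (
    enabled &&
    declaredPageCount >= minPages &&
    topLevelKidCount === declaredPageCount
  );
}
```

A web portal could pass `{ assumeFlatPageTree: true }`, perhaps with a higher `minPages`, while local viewers leave the defaults and keep the current traversal.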

Richard Smith (smir) added 4 commits August 20, 2024 10:49
…it is the right length, and there are enough pages to be worth the optimisation
…/assume-all-pages-in-top-level-when-likely-master
…nto rcs/assume-all-pages-in-top-level-when-likely-master
@richard-smith-preservica richard-smith-preservica changed the title Rcs/assume all pages in top level when likely master Assume all entries in /Pages dictionary are /Pages when the count of items matches the reported number of pages Oct 13, 2025
