Skip to content

Fixed nested pdf data merging#487

Open
ExtReMLapin wants to merge 4 commits intochrismattmann:masterfrom
ExtReMLapin:fix_nested_pdf
Open

Fixed nested pdf data merging#487
ExtReMLapin wants to merge 4 commits intochrismattmann:masterfrom
ExtReMLapin:fix_nested_pdf

Conversation

@ExtReMLapin
Copy link
Copy Markdown

Fixes #486

Why am I the first one to meet this issue ???

@chrismattmann
Copy link
Copy Markdown
Owner

Can you add a unit test that demonstrates that this works, @ExtReMLapin ?

@chrismattmann
Copy link
Copy Markdown
Owner

@afuetterer can you take a look?

@coveralls
Copy link
Copy Markdown

Coverage Status

coverage: 67.544% (-0.009%) from 67.553% — ExtReMLapin:fix_nested_pdf into chrismattmann:master

@afuetterer
Copy link
Copy Markdown
Contributor

Sorry, no.
I don't enjoy the tone of issue and PR description and will not be a part of this interaction.

@ExtReMLapin
Copy link
Copy Markdown
Author

IMO It's better to have E2E tests, but this one should be more than enough.
And i'm not sure about hosting this PDF or making a dummy one generated at runtime.

Anyway the test it there, it's probably better to squash commit commits on merge.

@ExtReMLapin
Copy link
Copy Markdown
Author

ExtReMLapin commented Apr 27, 2026

@afuetterer my point there is It's surprising to see a bug caused by 11 years old code, I would have expected it to be met sooner by anyone else.

Being on the bleeding edge forces us to get bugs fixed everywhere, most of the time on "young" libraries or very actives ones.

@chrismattmann
Copy link
Copy Markdown
Owner

This comment from your test was interesting to me.

 """When the same metadata key appears in multiple embedded objects (nested
    PDFs, archives, ...), values must be merged into a single flat list rather
    than appended as nested lists."""

Why does having nested lists associated with a single key introduce a problem? In Java Tika, the return from Metadata[key] is a Vector of String types. However in Python it can be nested lists. Does this break something downstream from your use case?

@ExtReMLapin
Copy link
Copy Markdown
Author

ExtReMLapin commented Apr 27, 2026

My point was it (the list of the second/third/xxx) was appened at the end of the list of the first document, instead of either extending it or shoving the first document list into it's own list and adding another one next to it.

We had something like :

  1. first document data finished being parsed
    • ret['pdf:charsPerPage'] = [45,45, 45]
  2. second document data finished being parsed>
    • ret['pdf:charsPerPage'] = [45,45, 45, [150, 145, 642]]

In my opinion having it like ret['pdf:charsPerPage'] = [45,45, 45, 150, 145, 642] is just a design choice, ret['pdf:charsPerPage'] = [[45,45, 45], [150, 145, 642]] is fine aswell (if not better) but would break more libs since they expect a 1D array.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

pdf:charsPerPage is broken for nested PDFs

4 participants