Roughly true, yes.

But please be aware that in PDF we cannot rely on text being nicely ordered in a way that makes sense semantically. The "blocks" will always follow their physical enumeration sequence in the respective PDF source.
Even when sorted (sort=True parameter), single text pieces may remain out of order because they may have been added or replaced by some editing mechanism.
Many, if not most, of these pesky things are resolved in to_markdown() because the text flow is re-synthesized from a lower detail level upwards, which is part of the reason why this extraction type is significantly slower.
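
To make the ordering issue concrete, here is a minimal, library-free sketch. It is not the library's actual implementation: the `Piece` structure, the coordinates, and the `sort_by_position` helper are all hypothetical, and the sort key (bottom edge, then left edge) is only an assumption about what a position-based sort roughly does. It shows that file order can differ from reading order, and how a piece inserted by an editor with a slightly shifted box can still land in the wrong place after sorting.

```python
from typing import NamedTuple


class Piece(NamedTuple):
    """A text piece with its bounding box (hypothetical, simplified structure)."""
    x0: float
    y0: float
    x1: float
    y1: float
    text: str


def sort_by_position(pieces):
    """Order pieces top-to-bottom by bottom edge, then left-to-right (assumed key)."""
    return sorted(pieces, key=lambda p: (p.y1, p.x0))


# Pieces in the order they are stored in the (imaginary) file.
stored = [
    Piece(72, 100, 90, 112, "The"),
    Piece(130, 100, 150, 112, "fox"),
    # Heading appended by a later edit, so it is stored after the body text.
    Piece(72, 60, 200, 80, "Heading"),
    # Replacement word inserted by an editor; its box ends 0.5 pt lower.
    Piece(95, 100.5, 125, 112.5, "quick"),
]

file_order = [p.text for p in stored]
sorted_order = [p.text for p in sort_by_position(stored)]

print(file_order)    # ['The', 'fox', 'Heading', 'quick']
print(sorted_order)  # ['Heading', 'The', 'fox', 'quick']
```

Sorting moves the heading to the top, but "quick" still ends up after "fox" because its box sits a fraction of a point lower than the rest of the line. Fixing that kind of case needs grouping pieces into lines with some tolerance before ordering them, which is more work than a plain coordinate sort.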

Answer selected by JorjMcKie

This discussion was converted from issue #199 on December 02, 2024 12:52.