
Have a look at the documentation of class TextPage. Every text extraction and text search in PyMuPDF is based on a TextPage.

A TextPage is created by the underlying C library, MuPDF, and works for all document types, not just PDF. MuPDF also contains all the heuristic algorithms that separate a page's text into blocks, lines and text spans.
As MuPDF is open source, you are free to study the respective C code.

The text extraction variants in PyMuPDF each target specific needs in terms of ease of use, level of detail and performance, but, as said, all of them are built on a TextPage.

So "blocks" iterates over the blocks inside a TextPage and delivers a list with concatenated lines o…
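To illustrate the block, line and span hierarchy, here is a rough pure-Python approximation of a block-level view built from a dictionary shaped like the output of `page.get_text("dict")`. The `sample` data and the helper name `blocks_from_dict` are hypothetical, and the real "blocks" output carries additional fields and may also list image blocks, so treat this as a sketch rather than PyMuPDF's actual implementation.

```python
# Hypothetical sample shaped like the result of page.get_text("dict"):
# blocks contain lines, lines contain spans, and spans carry the text.
sample = {
    "blocks": [
        {
            "type": 0,  # 0 = text block, 1 = image block
            "bbox": (72.0, 60.0, 200.0, 90.0),
            "lines": [
                {"spans": [{"text": "Hello "}, {"text": "world"}]},
                {"spans": [{"text": "second line"}]},
            ],
        },
        {"type": 1, "bbox": (72.0, 100.0, 200.0, 180.0)},
    ]
}


def blocks_from_dict(page_dict):
    """Return one (x0, y0, x1, y1, text, block_no) tuple per text block."""
    result = []
    for number, block in enumerate(page_dict["blocks"]):
        if block["type"] != 0:
            continue  # this sketch skips image blocks
        # Join the spans of each line, then join the lines of the block.
        lines = ["".join(span["text"] for span in line["spans"])
                 for line in block["lines"]]
        result.append((*block["bbox"], "\n".join(lines), number))
    return result


for entry in blocks_from_dict(sample):
    print(entry)
```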

Answer selected by JorjMcKie

This discussion was converted from issue #2212 on February 06, 2023 10:04.