As I mentioned before, you will always find examples that won't work, but there is always hope. Here is a version of the script that resolves this case. Changes:

  1. A global setting that makes text extraction use minimum bbox heights.
  2. Compare not only the word's bottom but also its top: if either is sufficiently close, we accept the word as a sibling on the same line.

```python
import fitz

fitz.TOOLS.set_small_glyph_heights(True)
doc = fitz.open("test.pdf")
page = doc[5]
words = page.get_text("words", sort=True)
line = []  # temp store for 1 line
lines = []  # list of all lines
for w in words:
    if not line:
        line.append(w)
        continue
    y1old = line[-1][3]  # get y1 of last word in line
    y0old = line[-1][1]  # get y0 of l…
```

Answer selected by glangford