Handling indents #2197

vdavez · 2023-01-27T17:27:32Z


Hello—First of all, this library is amazing. Thank you!

I am working with a PDF where indentation matters to a human but the PDF itself doesn't include any whitespace characters in the document. When I convert to xhtml (my desired output), I lose the whitespace.

Is there any smart way to determine whether a line of text starts with whitespace and then insert whitespace characters so that when I output to xhtml, I can preserve that aspect of the layout?

Here's a screenshot of what I'm working with...

Thank you!!!

JorjMcKie · 2023-01-27T20:50:53Z

Maintainer

No. The only way is to note the x-coordinate at which the first character of each line starts and relate it to the page width and to the corresponding values of the other lines.
For this purpose, the "xhtml" output format is an unfortunate choice, because it contains no position information ...

Why don't you use the "dict" format:

import fitz  # PyMuPDF

# page: a fitz.Page taken from an opened document
for block in page.get_text("dict", flags=fitz.TEXTFLAGS_TEXT)["blocks"]:
    for line in block["lines"]:
        bbox = line["bbox"]  # (x0, y0, x1, y1) of the line
        text = "".join(span["text"] for span in line["spans"])
        print(f"line '{text}' starts at {bbox[0]}")

To be a little picky with the wording:
You don't actually "lose" spaces: they never existed!
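
To illustrate relating each line's starting x-coordinate to the other lines, here is a minimal sketch, not part of PyMuPDF itself. The helper names `indent_lines` and `page_lines`, the `char_width` default of 6 points, and the choice of the smallest x0 as the left margin are all assumptions made for illustration; a real document would likely need its own tuning.

```python
def indent_lines(lines, char_width=6.0, pad=" "):
    """Prefix each line with padding proportional to its x-offset.

    lines: list of (x0, text) pairs, x0 being the left edge of the line bbox.
    char_width: ASSUMED average glyph width in points; tune per document.
    pad: the character used for one unit of indentation.
    """
    if not lines:
        return []
    # Assumption: the leftmost line on the page marks the left margin.
    left_margin = min(x0 for x0, _ in lines)
    result = []
    for x0, text in lines:
        # Convert the horizontal offset into a whole number of pad characters.
        n = round((x0 - left_margin) / char_width)
        result.append(pad * n + text)
    return result


def page_lines(page):
    """Collect (x0, text) pairs from a page via the "dict" output.

    Hypothetical helper: `page` is expected to be a PyMuPDF page object.
    """
    pairs = []
    for block in page.get_text("dict")["blocks"]:
        # Image blocks carry no "lines" key, so skip them safely.
        for line in block.get("lines", []):
            text = "".join(span["text"] for span in line["spans"])
            pairs.append((line["bbox"][0], text))
    return pairs
```

Something like `indent_lines(page_lines(page))` would then give one padded string per line. If the result ends up in XHTML, passing `pad="\u00a0"` (a non-breaking space) may be preferable, since ordinary leading spaces are normally collapsed when HTML is rendered.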
