Some missing spaces in get_text output #2440
-
Describe the bug (mandatory)
Output from
To Reproduce (mandatory)
import fitz
doc = fitz.open('file.pdf')
for page in doc:
    for block in page.get_text("dict", flags=31)["blocks"]:
        print(block)
Expected behavior (optional)
Text contains all the spaces that the PDF does.
Screenshots (optional)
N/A
Your configuration (mandatory)
Additional context (optional)
I have reviewed the bug reports from #456 and #364 and tested using
Replies: 2 comments 3 replies
-
This is a "Discussions" item, so I am transferring it.
-
Please provide an example document page and the Python code snippet.
Why are you using the flags value 31? In binary it is 0b11111, so all five low flag bits are set. Among other things, this suppresses the corrective MuPDF action that inserts spaces where deemed beneficial ...
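To make the bit decomposition concrete, here is a minimal sketch that decodes a flags value into flag names and clears the space-inhibiting bit. The numeric values in `TEXT_FLAGS` are my assumption, taken from the PyMuPDF documentation, and are hard-coded so the snippet runs without the library; compare them with the `fitz.TEXT_*` constants in your installed version. The helper names `decode_flags` and `without_inhibit_spaces` are hypothetical and not part of PyMuPDF.

```python
# Assumed flag values (as listed in the PyMuPDF docs); verify against
# fitz.TEXT_* in your own installation before relying on them.
TEXT_FLAGS = {
    "TEXT_PRESERVE_LIGATURES": 1,
    "TEXT_PRESERVE_WHITESPACE": 2,
    "TEXT_PRESERVE_IMAGES": 4,
    "TEXT_INHIBIT_SPACES": 8,
    "TEXT_DEHYPHENATE": 16,
}


def decode_flags(flags):
    """Return the names of all known flag bits that are set in `flags`."""
    return [name for name, bit in TEXT_FLAGS.items() if flags & bit]


def without_inhibit_spaces(flags):
    """Clear the TEXT_INHIBIT_SPACES bit and leave all other bits untouched."""
    return flags & ~TEXT_FLAGS["TEXT_INHIBIT_SPACES"]


if __name__ == "__main__":
    print(bin(31))                     # binary form of the value used in the report
    print(decode_flags(31))            # every flag name contained in 31
    print(without_inhibit_spaces(31))  # same flags minus TEXT_INHIBIT_SPACES
```

Under these assumed values, clearing the bit turns 31 into 23. Passing that value as `flags=` to `page.get_text("dict", ...)` should keep the other four options while allowing MuPDF to insert spaces again.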
In other words, you are setting fitz.TEXT_INHIBIT_SPACES. Here is what I get as a result: