page.get_text("words") issue #2757
hifiveszu asked this question in Looking for help (answered by JorjMcKie)
Hello, The text I obtained using page.get_text("words") is missing some line breaks and spaces compared to the text obtained using page.get_text(). I feel that these symbols are quite important. Is there a way to preserve them?
Answered by JorjMcKie on Oct 25, 2023:
The "words" text extraction variant is not intended to reproduce the original layout - on the contrary:
3 replies
This is a frequent and normal thing to happen. Text extraction extracts the text in the same sequence as stored in the file. Many creators do not store content in reading sequence.
You must establish the reading sequence yourself. There is the sort parameter that often helps - please read the documentation. In other cases you must use your own code to do that by extracting text including coordinates, like get_text("dict"). But using get_text("words") is a good start, if you sort them and concatenate again with a space.
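
The "sort and concatenate" idea can be sketched roughly as follows. This is only an illustration, not the library's own method: the helper name words_to_text and the y_tolerance parameter are hypothetical, and the code assumes each word is a tuple in the documented "words" layout (x0, y0, x1, y1, word, block_no, line_no, word_no). Words whose bottom edges lie within the tolerance are treated as one line, joined with spaces, and lines are joined with line breaks.

```python
from typing import List, Sequence, Tuple

# Assumed layout of one item from page.get_text("words"):
# (x0, y0, x1, y1, word, block_no, line_no, word_no)
Word = Tuple[float, float, float, float, str, int, int, int]


def words_to_text(words: Sequence[Word], y_tolerance: float = 3.0) -> str:
    """Rebuild text from word tuples (hypothetical helper).

    Words are ordered top to bottom by their bottom edge (y1), grouped
    into lines when the bottom edges differ by at most y_tolerance,
    then ordered left to right (x0) inside each line.
    """
    ordered = sorted(words, key=lambda w: (w[3], w[0]))

    lines: List[List[Word]] = []
    current: List[Word] = []
    line_bottom = 0.0
    for w in ordered:
        # Start a new line when this word sits clearly below the current one.
        if current and abs(w[3] - line_bottom) > y_tolerance:
            lines.append(current)
            current = []
        if not current:
            line_bottom = w[3]  # the first word of a line defines its baseline
        current.append(w)
    if current:
        lines.append(current)

    # One space between words, one line break between lines.
    return "\n".join(
        " ".join(w[4] for w in sorted(line, key=lambda w: w[0]))
        for line in lines
    )


# Hand-made sample data standing in for page.get_text("words") output.
sample = [
    (60.0, 10.0, 90.0, 20.0, "world", 0, 0, 1),
    (10.0, 10.5, 50.0, 20.5, "Hello", 0, 0, 0),
    (10.0, 30.0, 40.0, 40.0, "Next", 0, 1, 0),
]
print(words_to_text(sample))
```

With PyMuPDF, the list returned by page.get_text("words") could be passed to this helper directly. The sketch does not handle multi-column pages or rotated text, and the tolerance of 3 points is a guess that would need tuning per document.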