Closed
Description
I have the following code and want to extract the text from each page.
My question is: how can I concatenate the values of span.get("text", "") so that the result matches the output of page.get_text()? Which delimiter should I use between spans, lines, and blocks?
import json

# `doc` is assumed to be an already opened PyMuPDF document.
for page_no, page in enumerate(doc):
    info = json.loads(page.get_text("json"))
    blocks = info.get("blocks", [])
    raw_text = ""
    cur_pos = 0
    for block_id, block in enumerate(blocks):
        lines = block.get("lines", [])
        block_text = ""
        for line_no, line in enumerate(lines):
            spans = line.get("spans", [])
            line_text = ""
            if len(spans) == 0:
                continue
            for span_id, span in enumerate(spans):
                span_text = span.get("text", "")
                cur_pos += len(span_text)
                raw_text += span_text
                block_text += span_text
                line_text += span_text
            # Terminate each line with a newline.
            raw_text += "\n"
            cur_pos += 1
doc.close()
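For comparison, here is a minimal sketch of the joining logic as a standalone function. It assumes that spans within a line are joined with no delimiter and that every line ends with "\n". Whether page.get_text() also inserts an extra separator between blocks is something I am not certain of and may depend on the PyMuPDF version, so it is left as a parameter. The names page_dict_to_text, line_sep and block_sep are hypothetical, and the sample dictionary is hand-written rather than taken from a real PDF.

```python
def page_dict_to_text(page_dict, line_sep="\n", block_sep=""):
    """Join span texts from a "dict"/"json" style page structure into plain text.

    line_sep and block_sep are hypothetical parameters: adjust them until the
    result matches page.get_text() for your PyMuPDF version.
    """
    chunks = []
    for block in page_dict.get("blocks", []):
        block_lines = []
        # Blocks without a "lines" key (e.g. image blocks) contribute no text.
        for line in block.get("lines", []):
            # Spans of one line are concatenated with no delimiter.
            line_text = "".join(
                span.get("text", "") for span in line.get("spans", [])
            )
            if line_text:
                block_lines.append(line_text + line_sep)
        if block_lines:
            chunks.append("".join(block_lines) + block_sep)
    return "".join(chunks)


# Hand-written sample that mimics the structure used in the question.
sample = {
    "blocks": [
        {"type": 0, "lines": [
            {"spans": [{"text": "Hello, "}, {"text": "world"}]},
            {"spans": [{"text": "second line"}]},
        ]},
        {"type": 1},  # image block: no "lines" key
        {"type": 0, "lines": [{"spans": [{"text": "next block"}]}]},
    ]
}

text = page_dict_to_text(sample)
print(repr(text))
```

With a real document, the same function could be applied to json.loads(page.get_text("json")) or to page.get_text("dict"), and the result compared against page.get_text() to see which separators reproduce it.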