Need help with how to extract the content of pages from a PDF

I have the following code and want to extract the text from each page.
My question: how can I use `span.get("text", "")` to obtain the same result as `page.get_text()`? I am unsure how to concatenate `span_text` so that it matches the output of `page.get_text()`. What delimiter should I use when concatenating?
```
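import json

import pymupdf  # older PyMuPDF releases are imported as "fitz"

doc = pymupdf.open("input.pdf")  # placeholder path

for page_no, page in enumerate(doc):
    info = json.loads(page.get_text("json"))
    blocks = info.get("blocks", [])
    raw_text = ""  # page text rebuilt from the individual spans
    cur_pos = 0  # running character offset into raw_text
    for block_id, block in enumerate(blocks):
        lines = block.get("lines", [])  # image blocks have no "lines"
        block_text = ""

        for line_no, line in enumerate(lines):
            spans = line.get("spans", [])
            line_text = ""
            if not spans:
                continue

            for span_id, span in enumerate(spans):
                span_text = span.get("text", "")
                cur_pos += len(span_text)
                raw_text += span_text
                block_text += span_text
                line_text += span_text

        # A newline is currently appended once per block; is this the right delimiter?
        raw_text += "\n"
        cur_pos += 1

doc.close()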
```
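
For reference, here is a minimal sketch of the joining rule I would expect to match `page.get_text()`. It assumes (and I have not confirmed this for every PyMuPDF version or flag combination) that plain-text extraction joins the spans of a line with no delimiter and ends each line with a single `"\n"`, with no extra separator between blocks. The function name `text_from_page_dict` and the `sample_page` data are made up for illustration; the function works on the plain dictionary layout shown above, so it runs without a PDF.

```python
def text_from_page_dict(page_dict):
    """Rebuild plain text from a page dictionary with blocks, lines and spans.

    Assumption: spans within a line are joined without a delimiter,
    and every non-empty line is terminated by one newline.
    """
    parts = []
    for block in page_dict.get("blocks", []):
        # Image blocks carry no "lines" key, so they contribute nothing.
        for line in block.get("lines", []):
            line_text = "".join(
                span.get("text", "") for span in line.get("spans", [])
            )
            if not line_text:
                continue
            parts.append(line_text)
            # Avoid doubling the newline if the line already ends with one.
            if not line_text.endswith("\n"):
                parts.append("\n")
    return "".join(parts)


# Hand-built stand-in for a page dictionary (hypothetical data).
sample_page = {
    "blocks": [
        {
            "type": 0,
            "lines": [
                {"spans": [{"text": "Hello "}, {"text": "world"}]},
                {"spans": [{"text": "second line"}]},
            ],
        },
        {"type": 1},  # image block without "lines"
        {"type": 0, "lines": [{"spans": [{"text": "next block"}]}]},
    ]
}

rebuilt = text_from_page_dict(sample_page)
print(repr(rebuilt))
```

With a real document, the same function could be called as `text_from_page_dict(page.get_text("dict"))` and the result compared to `page.get_text()`; any remaining differences would most likely come from whitespace handling or extraction flags.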

Need help with how to extract the content of pages from a PDF #3737