Need help with how to extract the content of pages from a PDF #3737

@helloburke

I have the following code and want to extract the text from each page.
My question is, how can I use span.get("text", "") to obtain the same result as page.get_text()? I am unsure how to concatenate span_text to match the output of page.get_text(). What kind of delimiter should I use for concatenation?

import json

# `doc` is an already opened PyMuPDF document.
for page_no, page in enumerate(doc):
    info = json.loads(page.get_text("json"))
    blocks = info.get("blocks", [])
    raw_text = ""
    cur_pos = 0
    for block_id, block in enumerate(blocks):
        lines = block.get("lines", [])
        block_text = ""

        for line_no, line in enumerate(lines):
            spans = line.get("spans", [])
            line_text = ""
            if len(spans) == 0:
                continue

            for span_id, span in enumerate(spans):
                span_text = span.get("text", "")
                cur_pos += len(span_text)
                raw_text += span_text
                block_text += span_text
                line_text += span_text

        # One newline is appended per block, after all of its lines.
        raw_text += "\n"
        cur_pos += 1
doc.close()
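
One possible way to approach the delimiter question is sketched below. It rests on an assumption that should be checked against your own files and PyMuPDF version: that the plain-text output joins the spans of a line with no separator and ends every line with a single newline, without an extra separator between blocks. The helper name `text_from_page_dict` and the sample dictionary are hypothetical and exist only for illustration.

```python
def text_from_page_dict(page_dict):
    """Rebuild plain text from a page dictionary.

    Assumption (not confirmed by the original post): spans in a line are
    joined with no delimiter and each line is terminated by "\n".
    """
    parts = []
    for block in page_dict.get("blocks", []):
        # Image blocks have no "lines" key, so they contribute no text.
        for line in block.get("lines", []):
            line_text = "".join(
                span.get("text", "") for span in line.get("spans", [])
            )
            if line_text:
                parts.append(line_text + "\n")
    return "".join(parts)


# Hand-built sample with the same nesting as the "json" output:
# blocks -> lines -> spans -> text.
sample = {
    "blocks": [
        {
            "type": 0,
            "lines": [
                {"spans": [{"text": "Hello "}, {"text": "world"}]},
                {"spans": [{"text": "Second line"}]},
            ],
        },
        {"type": 1},  # image block, no text
        {"type": 0, "lines": [{"spans": [{"text": "Next block"}]}]},
    ]
}

rebuilt = text_from_page_dict(sample)
print(repr(rebuilt))
```

With a real document, the same helper could be called as `text_from_page_dict(page.get_text("dict"))`, which avoids the JSON round trip, and the result compared to `page.get_text()` to see whether the assumed delimiters hold for that file.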
