How to extract text with respect to the heading? #2192

Laxmi530 · 2023-01-25T14:39:44Z


Hi,
Thank you for providing a beautiful library.
I am trying to extract the portion of text that belongs to a given heading. For example, if we select the heading 'At glance 1' in the sample PDF file, the output should be the text from 'The team' to ' penalty shootout.'. I tried a technique with bs4, but it is not working. I also tried to extract the headings so that the text can be processed based on its position. Below is the sample code I tried.
Can someone please help me?

Thank you in advance.

import fitz  # PyMuPDF
from bs4 import BeautifulSoup

doc = fitz.open(file_name)
for page in doc:
    html = page.get_text("html")
    soup = BeautifulSoup(html, 'html.parser')
    heading_name = "At glance 1"
    heading = soup.find("h1", text=heading_name)
    if heading:
        next_element = heading.find_next()
        text = ""
        while next_element and next_element.name != "h1":
            text += next_element.text
            next_element = next_element.find_next()
        print(text)
    else:
        print(f"{heading_name} not found in the page")
doc.close()

Sample PDF file.pdf

Answered by JorjMcKie


JorjMcKie (Maintainer) · 2023-01-25T15:02:11Z

The problem is that nothing in the PDF assigns text to categories the way HTML does. Your HTML extraction simply contains everything on that page as text (admittedly with various different properties).

<div id="page0" style="width:612.0pt;height:792.0pt">
<p style="top:74.1pt;left:72.1pt;line-height:14.4pt"><b><span style="font-family:Calibri,sans-serif;font-size:14.4pt">At Glance </span></b></p>
<p style="top:99.9pt;left:72.1pt;line-height:11.2pt"><span style="font-family:Calibri,sans-serif;font-size:11.2pt">Football, also called association football or soccer, is a game involving two teams of 11 players who try </span></p>
<p style="top:114.4pt;left:72.1pt;line-height:11.2pt"><span style="font-family:Calibri,sans-serif;font-size:11.2pt">to maneuver the ball into the other team&apos;s goal without using their hands or arms. The team that scores </span></p>
...

So the HTML element "h1" simply is not there!
You must invent your own filter to identify text as header or body text, e.g. based on font size, boldness, color, ...
For this task, I would work with the dict output directly, where you get all this information, instead of using other packages, which don't know more and thus only introduce additional complexity.
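
As a rough illustration of that suggestion, here is a minimal sketch that walks the output of `page.get_text("dict")` and collects the lines that follow a chosen heading until the next heading. The function names (`is_heading`, `text_under_heading`, `extract_from_pdf`) and the 12 pt size threshold are assumptions made for this example, not part of PyMuPDF. The heuristic (bold flag plus a larger font size) is guessed from the HTML excerpt above, where the heading is bold at 14.4pt and the body text is 11.2pt; other PDFs will likely need a different filter.

```python
BOLD_FLAG = 16  # bit 4 of a span's "flags" value marks bold text in PyMuPDF


def is_heading(span, min_size=12.0):
    """Heuristic (assumption): a heading span is bold and larger than body text."""
    return bool(span["flags"] & BOLD_FLAG) and span["size"] >= min_size


def text_under_heading(page_dicts, heading, min_size=12.0):
    """Return the body text between `heading` and the next heading.

    `page_dicts` is an iterable of page.get_text("dict") results.
    Returns an empty string if the heading is not found.
    """
    collected = []
    capturing = False
    for page_dict in page_dicts:
        for block in page_dict["blocks"]:
            if block["type"] != 0:  # type 0 is a text block; skip images
                continue
            for line in block["lines"]:
                spans = [s for s in line["spans"] if s["text"].strip()]
                if not spans:
                    continue  # line holds only whitespace
                line_text = "".join(s["text"] for s in spans).strip()
                if all(is_heading(s, min_size) for s in spans):
                    if capturing:
                        # Reached the next heading: the section is complete.
                        return "\n".join(collected)
                    capturing = line_text == heading
                elif capturing:
                    collected.append(line_text)
    return "\n".join(collected)


def extract_from_pdf(path, heading):
    """Convenience wrapper (hypothetical) that reads the pages from a PDF file."""
    import fitz  # PyMuPDF; imported here so the helpers above work without it

    with fitz.open(path) as doc:
        pages = (page.get_text("dict") for page in doc)
        return text_under_heading(pages, heading)
```

With the sample file, a call such as `extract_from_pdf("Sample PDF file.pdf", "At Glance")` would be the starting point. The heading string has to match the extracted line exactly (after stripping whitespace), so it may be worth printing the detected heading lines first to see what the PDF actually contains.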
