TOC extraction for Half Page Layout PDFs #2794

ytiam · 2023-11-09T11:50:34Z

ytiam
Nov 9, 2023

I am trying to extract the table of contents from the attached PDF using the get_toc API of PyMuPDF, but it returns an empty result, i.e. no TOC is extracted from this PDF. Am I missing something when calling the get_toc function? Is there an extra argument I have to provide? Please help me with this.
An Empirical Correlation to Estimate Solder Joint Reliability Acceleration Factors.pdf

Answered by JorjMcKie


JorjMcKie · 2023-11-09T12:02:06Z

JorjMcKie
Nov 9, 2023
Maintainer

I have no problem:

import fitz  # PyMuPDF
from pprint import pprint

doc = fitz.open("An.Empirical.Correlation.to.Estimate.Solder.Joint.Reliability.Acceleration.Factors.pdf")
toc = doc.get_toc()  # list of [level, title, page] entries
pprint(toc)
[[1, 'ABSTRACT', 1],
 [1, 'INTRODUCTION', 1],
 [1, 'HISTORICAL REVIEW OF SOLDER JOINT ACCELERATION FACTOR DEVELOPMENT', 1],
 [1, 'EMPIRICAL FIT FOR ACCELERATION FACTOR', 3],
 [1, 'REGRESSION ANALYSIS TO IDENTIFY COEFFICIENTS FOR AF EQUATION', 4],
 [1, 'Graphical Solution', 6],
 [1, 'SUMMARY / CONCLUSIONS', 6],
 [1, 'AcknowledgementS', 6],
 [1, 'REFERENCES', 6]]
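
Each entry in the list above has the form [level, title, page], where level is the hierarchy depth (starting at 1) and page is the 1-based page number. As a minimal sketch, such a list could be printed as an indented outline like this; the helper name format_toc is made up for illustration, and the sample entries are copied from the output above:

```python
def format_toc(toc):
    """Render get_toc()-style entries [level, title, page] as indented lines."""
    lines = []
    for level, title, page in toc:
        indent = "  " * (level - 1)  # two spaces per nesting level
        lines.append(f"{indent}{title} (page {page})")
    return lines


# Sample entries copied from the output above.
sample_toc = [
    [1, "ABSTRACT", 1],
    [1, "INTRODUCTION", 1],
    [1, "EMPIRICAL FIT FOR ACCELERATION FACTOR", 3],
]

for line in format_toc(sample_toc):
    print(line)
```

With a real document, format_toc(doc.get_toc()) would be the equivalent call.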

5 replies

ytiam Nov 9, 2023
Author

@JorjMcKie Thank you for responding.
Sorry, I think I attached the wrong PDF. Please check the newly attached PDF here:
CRACK GROWTH RATE MEASUREMENT AND ANALYSIS FOR WLCSP Sn-Ag-Cu SOLDER JOINTS.pdf

JorjMcKie Nov 9, 2023
Maintainer

You should trust the documentation!
If the method returns [] as in this case, then there is no TOC.
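
Since an empty list is the signal for "no TOC", calling code can branch on it. A minimal sketch, assuming a hypothetical fallback callable (for example a font-based heading heuristic) that is supplied by the caller and is not part of PyMuPDF:

```python
def outline_or_fallback(doc, fallback):
    """Return the embedded TOC if present, else whatever fallback(doc) finds.

    doc is expected to be an open PyMuPDF document (any object with a
    get_toc() method works); fallback is a hypothetical user-supplied function.
    """
    toc = doc.get_toc()
    if toc:  # non-empty list: the PDF carries its own outline
        return toc
    return fallback(doc)  # [] means the file has no outline
```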

ytiam Nov 10, 2023
Author

Ok, I see. But as you can see, there are several section headers that I need to identify and extract. Is there any other way to get the headers extracted from the PDFs? Please point me to any library or approach that gets this done. Thanks in advance.

JorjMcKie Nov 10, 2023
Maintainer

In PDF, things like headers, footers, section headlines, or tables are terra incognita. While specifications exist for expressing such higher-order information, they are often not used.

So you will mostly just have text, and it is left to your wits to find those structures. There is no library that can do this for you.
You have to look at things like text position, font properties (bold, italic), font size, text color, etc.
In your case it looks like you could successfully check for bold and all-caps text. Finding such properties is no problem with PyMuPDF.
A simple iteration like this already brings you close:

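import fitz  # PyMuPDF

# doc is an already opened document, i.e. the result of fitz.open(...)
for page in doc:
    for block in page.get_text("dict", flags=fitz.TEXTFLAGS_TEXT)["blocks"]:
        for line in block["lines"]:
            # join all spans of the line into one string
            text = "".join(s["text"] for s in line["spans"])
            if not text.strip():
                continue  # skip empty lines
            # skip lines starting with a digit, degree sign, space, hyphen or U+FFFD
            if text[0].isdecimal() or text[0] in ("°", " ", "-", chr(0xFFFD)):
                continue
            if text == text.upper():  # all-caps line: heading candidate
                print(f"page {page.number}: '{text}'")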

   
page 0: 'CRACK GROWTH RATE MEASUREMENT AND ANALYSIS FOR WLCSP '
page 0: 'ABSTRACT '
page 0: 'INTRODUCTION '
page 0: 'TEST SAMPLE PREPARATION '
page 0: 'TEST SAMPLE CHARACTERIZATION '
page 1: 'TEST PROCEDURE '
page 1: 'MODELING '
page 2: 'RESULTS  '
page 2: 'MEASUREMENT OF CRACK AREA '
page 3: 'VARIATION IN MEASUREMENTS '
page 4: 'CRACK GROWTH RATE ANALYSIS '
page 4: 'MICROSTRUCTURAL EVALUATION '
page 4: 'C1(10-3 '
page 4: 'C2(10-5 '
page 5: 'CONCLUSIONS '
page 6: 'ACKNOWLEDGEMENTS '
page 6: 'REFERENCES '
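
The iteration above tests only for all-caps text, so fragments without lowercase letters such as 'C1(10-3 ' also pass. One possible refinement is to require bold text as well. The sketch below assumes that bit 4 (value 16) of a span's "flags" entry marks bold, as described in the PyMuPDF documentation of span dictionaries; because some PDFs indicate bold only through the font name, that is checked too. The helper names and the sample line are made up for illustration, and whether this removes the two stray lines for this particular PDF is untested.

```python
BOLD_FLAG = 1 << 4  # bit 4 of span["flags"]: bold (assumption taken from the PyMuPDF docs)


def is_bold(span):
    """True if the span is flagged bold or its font name suggests bold."""
    return bool(span["flags"] & BOLD_FLAG) or "bold" in span["font"].lower()


def is_heading(line):
    """Heuristic: the line has letters, is all caps, and every non-blank span is bold."""
    spans = [s for s in line["spans"] if s["text"].strip()]
    text = "".join(s["text"] for s in spans)
    if not spans or not any(c.isalpha() for c in text):
        return False
    return text == text.upper() and all(is_bold(s) for s in spans)


def find_headings(doc):
    """Return TOC-style entries [level, title, page] with 1-based page numbers."""
    import fitz  # PyMuPDF; imported here so the helpers above work without it

    entries = []
    for page in doc:
        for block in page.get_text("dict", flags=fitz.TEXTFLAGS_TEXT)["blocks"]:
            for line in block["lines"]:
                if is_heading(line):
                    title = "".join(s["text"] for s in line["spans"]).strip()
                    entries.append([1, title, page.number + 1])  # all at level 1
    return entries


# Hand-made line in the shape produced by page.get_text("dict")
sample = {"spans": [{"text": "TEST PROCEDURE ", "flags": 16, "font": "Times-Bold"}]}
print(is_heading(sample))
```

With an open document, find_headings(doc) yields a list in the same [level, title, page] shape that get_toc() returns, with every heading placed at level 1 because this heuristic cannot tell sections from subsections.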

Answer selected by ytiam

ytiam Nov 12, 2023
Author

Thank you @JorjMcKie. In parallel, I was trying to convert the PDF pages into images and then extract those headers using a LayoutParser model. That also gives me fair results. But from a computational standpoint, I think your solution will serve me better than my technique. Thanks a lot for taking the time to help me out. I appreciate it.
