TOC extraction for Half Page Layout PDFs #2794
ytiam asked this question in Looking for help
I am trying to extract the table of contents from the attached PDF using PyMuPDF's get_toc API, but it returns an empty result, i.e. no TOC entries at all. Am I missing something when calling get_toc? Is there an extra argument I have to provide? Please help me with this.
Answered by JorjMcKie on Nov 10, 2023
I have no problem:

import fitz
from pprint import pprint

doc = fitz.open("An.Empirical.Correlation.to.Estimate.Solder.Joint.Reliability.Acceleration.Factors.pdf")
toc = doc.get_toc()
pprint(toc)
[[1, 'ABSTRACT', 1],
[1, 'INTRODUCTION', 1],
[1, 'HISTORICAL REVIEW OF SOLDER JOINT ACCELERATION FACTOR DEVELOPMENT', 1],
[1, 'EMPIRICAL FIT FOR ACCELERATION FACTOR', 3],
[1, 'REGRESSION ANALYSIS TO IDENTIFY COEFFICIENTS FOR AF EQUATION', 4],
[1, 'Graphical Solution', 6],
[1, 'SUMMARY / CONCLUSIONS', 6],
[1, 'AcknowledgementS', 6],
[1, 'REFERENCES', 6]]
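
For readers who get an empty list back, it may help to make the "no outline" case explicit. The sketch below is only an illustration: `format_toc` and `print_toc` are hypothetical helper names, not part of PyMuPDF. It assumes the default `get_toc()` output of `[level, title, page]` entries as shown above, and treats an empty list as "this file carries no bookmarks".

```python
def format_toc(toc):
    """Render get_toc()-style entries as an indented outline string.

    Each entry is expected to start with [level, title, page]; any
    additional items (as returned with simple=False) are ignored.
    """
    if not toc:
        return "(no outline/bookmarks found in this PDF)"
    lines = []
    for entry in toc:
        level, title, page = entry[:3]
        indent = "  " * (level - 1)  # level 1 is not indented
        lines.append(f"{indent}{title} (p. {page})")
    return "\n".join(lines)


def print_toc(pdf_path):
    """Open a PDF with PyMuPDF and print its outline (hypothetical helper)."""
    import fitz  # PyMuPDF, imported lazily so format_toc works without it

    with fitz.open(pdf_path) as doc:
        print(format_toc(doc.get_toc()))
```

Calling `print_toc("some.pdf")` on a file without bookmarks would print the placeholder line instead of nothing, which makes it easier to tell a missing outline apart from a failed call.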
5 replies
In PDF, things like headers, footers, section headlines, or tables are terra incognita. While specifications exist for expressing such higher-order information, they are often not used.
So you will mostly just have text, and it is left to your wits to find those structures. There is no library that can do this for you.
You have to look at things like text position, font properties (bold, italic), font size, text color, etc.
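
To make those properties concrete, here is a minimal sketch of how they might be read from the span dictionaries that `page.get_text("dict")` produces. The helper names `describe_span` and `iter_spans` are made up for this illustration. The flag values (2 for italic, 16 for bold) follow the PyMuPDF documentation for span `flags`; the all-caps test is a simple assumption of mine, not something the library provides.

```python
# Bit values inside span["flags"], per the PyMuPDF documentation.
FLAG_ITALIC = 2
FLAG_BOLD = 16


def describe_span(span):
    """Summarize the layout-relevant properties of one text span dict."""
    text = span["text"].strip()
    flags = span["flags"]
    letters = [c for c in text if c.isalpha()]
    return {
        "text": text,
        "bbox": span["bbox"],    # position on the page
        "font": span["font"],    # font name
        "size": span["size"],    # font size in points
        "color": span["color"],  # sRGB integer
        "bold": bool(flags & FLAG_BOLD),
        "italic": bool(flags & FLAG_ITALIC),
        # Assumption: "all caps" means at least one letter, all uppercase.
        "all_caps": bool(letters) and all(c.isupper() for c in letters),
    }


def iter_spans(pdf_path):
    """Yield (0-based page number, span summary) for every text span."""
    import fitz  # PyMuPDF, imported lazily so describe_span works without it

    with fitz.open(pdf_path) as doc:
        for page in doc:
            for block in page.get_text("dict")["blocks"]:
                # Image blocks carry no "lines" key, so skip them.
                for line in block.get("lines", []):
                    for span in line["spans"]:
                        yield page.number, describe_span(span)
```

Filtering the results of `iter_spans` on fields such as `bold`, `all_caps`, or `size` is then an ordinary list comprehension, and the thresholds can be tuned per document.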
In your case, it looks like you could successfully check for bold, all-caps text. Finding such properties is no problem with PyMuPDF.
A simple iteration like this already gets you close: