A dictionary mapping (integer) font sizes to Markdown header strings like ``{14: '# ', 12: '## '}``. The dictionary is created by the `IdentifyHeaders` constructor. The keys are the font sizes of the text spans in the document. The values are the respective header strings.
141
+
A dictionary mapping (integer) font sizes to Markdown header strings like ``{14: '# ', 12: '## '}``. The dictionary is created by the :class:`IdentifyHeaders` constructor. The keys are the font sizes of the text spans in the document. The values are the respective header strings.
142
142
143
-
.. attribute:: body_limit
143
+
.. attribute:: body_limit
144
144
145
145
An integer value indicating the font size limit for body text. This is computed as ``min(header_id.keys()) - 1``. In the above example, body_limit would be 11.
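The relation between ``header_id`` and ``body_limit`` can be reproduced with plain Python. This is a minimal sketch using the example mapping from the text above; the helper ``header_prefix`` is hypothetical and only illustrates how such a mapping could be consulted, it is not the library's own lookup code.

```python
# Example mapping from the text above: font size -> Markdown header prefix.
header_id = {14: "# ", 12: "## "}

# body_limit as described: one less than the smallest header font size.
body_limit = min(header_id.keys()) - 1


def header_prefix(font_size):
    """Hypothetical helper: return the header prefix for a span font size.

    Sizes at or below body_limit count as body text (assumption: sizes
    are rounded to integers before the lookup).
    """
    size = round(font_size)
    if size <= body_limit:
        return ""  # body text
    return header_id.get(size, "")  # unknown sizes fall back to body text
```

With this mapping, ``body_limit`` evaluates to 11, so a span of size 10.8 is treated as body text while a span of size 12.2 maps to ``"## "``.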
146
146
147
147
148
-
**How to limit header levels (example)**
149
-
150
-
Limit the generated header levels to 3::
148
+
----
151
149
152
-
import pymupdf, pymupdf4llm
153
150
154
-
filename = "input.pdf"
155
-
doc = pymupdf.open(filename) # use a Document for subsequent processing
156
-
my_headers = pymupdf4llm.IdentifyHeaders(doc, max_levels=3) # generate header info
**How to provide your own header logic (example 2)**
163
+
**How to provide your own header logic (example 1)**
164
+
165
+
Provide your own function which uses pre-determined, fixed font sizes::

    import pymupdf, pymupdf4llm

    filename = "input.pdf"
    doc = pymupdf.open(filename) # use a Document for subsequent processing

    def my_headers(span, page=None):
        """
        Provide some custom header logic.
        This is a callable which accepts a text span and the page.
        Could be extended to check for other properties of the span, for
        instance the font name, text color and other attributes.
        """
        # header level is h1 if font size is larger than 14
        # header level is h2 if font size is larger than 10
        # otherwise it is body text
        if span["size"] > 14:
            return "# "
        elif span["size"] > 10:
            return "## "
        else:
            return ""
190
188
191
-
This user function uses the document's Table of Contents -- under the assumption that the bookmark text is also present as a header line on the page (which certainly need not be the case!)::
192
-
193
-
import pymupdf, pymupdf4llm
194
-
195
-
filename = "input.pdf"
196
-
doc = pymupdf.open(filename) # use a Document for subsequent processing
197
-
TOC = doc.get_toc() # use the table of contents for determining headers
198
-
199
-
def my_headers(span, page=None):
200
-
"""
201
-
Provide some custom header logic (experimental!).
202
-
This callable checks whether the span text matches any of the
203
-
TOC titles on this page.
204
-
If so, use TOC hierarchy level as header level.
205
-
"""
206
-
# TOC items on this page:
207
-
toc = [t for t in TOC if t[-1] == page.number + 1]
208
-
209
-
if not toc: # no TOC items on this page
210
-
return ""
211
-
212
-
# look for a match in the TOC items
213
-
for lvl, title, _ in toc:
214
-
if span["text"].startswith(title):
215
-
return "#" * lvl + " "
216
-
if title.startswith(span["text"]):
217
-
return "#" * lvl + " "
218
-
189
+
# this will *NOT* scan the document for font sizes!
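The callable from example 1 only reads ``span["size"]``, so its behavior can be checked with hand-made span dictionaries before it is passed to the converter. The sketch below repeats the function so that it runs on its own; the sample spans are made up, and real spans carry more keys (font name, color and so on).

```python
def my_headers(span, page=None):
    """Same thresholds as in example 1: h1 above 14, h2 above 10."""
    if span["size"] > 14:
        return "# "
    elif span["size"] > 10:
        return "## "
    else:
        return ""


# Made-up spans for illustration only.
samples = [
    {"size": 18.0, "text": "Chapter 1"},
    {"size": 14.0, "text": "Section 1.1"},
    {"size": 10.0, "text": "Body text"},
]

# Collect the header prefix chosen for each sample span.
prefixes = [my_headers(s) for s in samples]
```

Because both comparisons are strict, a span of exactly size 14 becomes a level 2 header and a span of exactly size 10 is treated as body text.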
**How to provide your own header logic (example 2)**
193
+
194
+
This user function uses the document's Table of Contents -- under the assumption that the bookmark text is also present as a header line on the page (which certainly need not be the case!)::

    import pymupdf, pymupdf4llm

    filename = "input.pdf"
    doc = pymupdf.open(filename) # use a Document for subsequent processing
    TOC = doc.get_toc() # use the table of contents for determining headers

    def my_headers(span, page=None):
        """
        Provide some custom header logic (experimental!).
        This callable checks whether the span text matches any of the
        TOC titles on this page.
        If so, use TOC hierarchy level as header level.
        """
        # TOC items on this page:
        toc = [t for t in TOC if t[-1] == page.number + 1]

        if not toc: # no TOC items on this page
            return ""

        # look for a match in the TOC items
        for lvl, title, _ in toc:
            if span["text"].startswith(title):
                return "#" * lvl + " "
            if title.startswith(span["text"]):
                return "#" * lvl + " "
220
221
221
-
# this will *NOT* scan the document for font sizes!
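The TOC-based matching of example 2 can also be tried without a PDF. In this sketch the TOC entries and the stand-in page objects are made up; entries follow the ``[level, title, page number]`` layout, with 1-based page numbers. The guard against empty span text and the final ``return ""`` are additions for this sketch, not part of the example above.

```python
from types import SimpleNamespace

# Made-up TOC entries: [hierarchy level, title, 1-based page number].
TOC = [
    [1, "Introduction", 1],
    [2, "Scope", 1],
    [1, "Syntax", 3],
]


def my_headers(span, page=None):
    """Return a header prefix if the span text matches a TOC title on this page."""
    # TOC items on this page (page.number is 0-based):
    toc = [t for t in TOC if t[-1] == page.number + 1]
    if not toc:
        return ""

    text = span["text"].strip()
    if not text:  # added guard: every title starts with the empty string
        return ""

    for lvl, title, _ in toc:
        if text.startswith(title) or title.startswith(text):
            return "#" * lvl + " "
    return ""  # no TOC title matched this span


# Stand-ins for page objects; only the "number" attribute is used.
page0 = SimpleNamespace(number=0)
page1 = SimpleNamespace(number=1)
```

The two-way ``startswith`` test means that a span holding only part of a title (for example a title split across several spans) still matches, at the cost of occasional false positives for very short spans.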
Create an object which uses the document's Table of Contents (TOC) to determine header levels. Upon object creation, the table of contents is read via the ``Document.get_toc()`` method. The TOC data is then used to determine header levels in the ``to_markdown()`` method.
235
+
236
+
This is an alternative to :class:`IdentifyHeaders`. Instead of running through the full document to identify font sizes, it uses the document's Table of
237
+
Contents (TOC) to identify headers on pages. Like :class:`IdentifyHeaders`, this approach does not guarantee that headers are found, but for a well-built table of contents there is a good chance of identifying header lines on document pages more accurately than with the font-size-based approach.
238
+
239
+
It also has the advantage of being much faster than the font-size-based approach, as it does not execute a full document scan or even access any of the document pages.
240
+
241
+
Adobe's PDF documentation files are examples where this approach works very well.
242
+
243
+
Please note that this feature **does not read document pages** where the table of contents may exist as ordinary text. It only accesses the data provided by the ``Document.get_toc()`` method. It will not identify any headers in documents whose table of contents is not available as a collection of bookmarks.
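Since the TOC-based approach finds nothing when a document has no bookmarks, a caller may want to fall back to the font-size-based approach in that case. The following is a sketch under stated assumptions: it assumes the class described here is exposed as ``pymupdf4llm.TocHeaders`` and that ``to_markdown()`` accepts the resulting object via ``hdr_info``; the function names ``choose_strategy`` and ``to_markdown_with_headers`` are hypothetical.

```python
def choose_strategy(toc):
    """Return "toc" when bookmarks exist, otherwise "font-size"."""
    return "toc" if toc else "font-size"


def to_markdown_with_headers(path):
    """Convert a PDF to Markdown, preferring TOC-based header detection."""
    # Imported lazily so the helper above can be used without the packages.
    import pymupdf
    import pymupdf4llm

    doc = pymupdf.open(path)
    if choose_strategy(doc.get_toc()) == "toc":
        # Assumption: the TOC-based class is named TocHeaders.
        hdr_info = pymupdf4llm.TocHeaders(doc)
    else:
        # No bookmarks: scan the document for font sizes instead.
        hdr_info = pymupdf4llm.IdentifyHeaders(doc)
    return pymupdf4llm.to_markdown(doc, hdr_info=hdr_info)
```

The decision is kept in a separate small function so that it can be checked without opening a document.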