Commit aba0656

Update api.rst
1 parent a96cc67 commit aba0656

1 file changed

+117
-69
lines changed

docs/pymupdf4llm/api.rst

Lines changed: 117 additions & 69 deletions
@@ -138,92 +138,140 @@ The |PyMuPDF4LLM| API
138138

139139
.. attribute:: header_id
140140

141-
A dictionary mapping (integer) font sizes to Markdown header strings like ``{14: '# ', 12: '## '}``. The dictionary is created by the `IdentifyHeaders` constructor. The keys are the font sizes of the text spans in the document. The values are the respective header strings.
141+
A dictionary mapping (integer) font sizes to Markdown header strings like ``{14: '# ', 12: '## '}``. The dictionary is created by the :class:`IdentifyHeaders` constructor. The keys are the font sizes of the text spans in the document. The values are the respective header strings.
142142

143-
.. attribute:: body_limit
143+
.. attribute:: body_limit
144144

145145
An integer value indicating the font size limit for body text. This is computed as ``min(header_id.keys()) - 1``. In the above example, body_limit would be 11.
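
The relationship between the two attributes can be sketched in plain Python. The dictionary is the example value from above; the helper ``prefix_for`` is hypothetical and only illustrates how such a mapping could be applied to a rounded font size. It is not the library's own lookup code.

```python
# Example value taken from the attribute description above.
header_id = {14: "# ", 12: "## "}

# body_limit as documented: one less than the smallest header font size.
body_limit = min(header_id.keys()) - 1


def prefix_for(font_size: float) -> str:
    """Hypothetical helper: map a font size to a Markdown header prefix."""
    size = round(font_size)  # assumption: sizes are compared as integers
    if size <= body_limit:
        return ""  # body text
    return header_id.get(size, "")  # unknown sizes fall back to body text


print(body_limit)
print(repr(prefix_for(14.0)))
print(repr(prefix_for(10.5)))
```

Sizes above ``body_limit`` that are not keys of ``header_id`` (13 in this example) are treated as body text in this sketch.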
146146

147147

148-
**How to limit header levels (example)**
149-
150-
Limit the generated header levels to 3::
148+
----
151149

152-
import pymupdf, pymupdf4llm
153150

154-
filename = "input.pdf"
155-
doc = pymupdf.open(filename) # use a Document for subsequent processing
156-
my_headers = pymupdf4llm.IdentifyHeaders(doc, max_levels=3) # generate header info
157-
md_text = pymupdf4llm.to_markdown(doc, hdr_info=my_headers)
151+
**How to limit header levels (example)**
158152

153+
Limit the generated header levels to 3::
154+
155+
import pymupdf, pymupdf4llm
156+
157+
filename = "input.pdf"
158+
doc = pymupdf.open(filename) # use a Document for subsequent processing
159+
my_headers = pymupdf4llm.IdentifyHeaders(doc, max_levels=3) # generate header info
160+
md_text = pymupdf4llm.to_markdown(doc, hdr_info=my_headers)
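
To illustrate what limiting the header levels means, here is a rough sketch of one way a font-size-to-header mapping could be derived. It assumes that the most frequently used font size is body text and that larger sizes become headers, largest first. The function ``build_header_map`` and its input format are hypothetical and are not the actual :class:`IdentifyHeaders` implementation.

```python
from collections import Counter


def build_header_map(span_sizes, max_levels=6):
    """Hypothetical sketch: derive a font-size -> header-prefix mapping.

    span_sizes: iterable of (font_size, character_count) pairs.
    """
    counts = Counter()
    for size, nchars in span_sizes:
        counts[round(size)] += nchars
    if not counts:
        return {}

    # Assumption: the size covering the most characters is body text.
    body_size = counts.most_common(1)[0][0]

    # Larger sizes become headers, largest first, cut off at max_levels.
    header_sizes = sorted((s for s in counts if s > body_size), reverse=True)
    return {s: "#" * (i + 1) + " " for i, s in enumerate(header_sizes[:max_levels])}


sizes = [(10, 5000), (12, 300), (14, 120), (18, 40), (24, 10)]
print(build_header_map(sizes, max_levels=3))
```

With ``max_levels=3`` only the three largest sizes receive a header prefix. In this sketch, text of size 12 is then handled like body text.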
159161

160-
**How to provide your own header logic (example 1)**
161-
162-
Provide your own function which uses pre-determined, fixed font sizes::
163-
164-
import pymupdf, pymupdf4llm
165-
166-
filename = "input.pdf"
167-
doc = pymupdf.open(filename) # use a Document for subsequent processing
168-
169-
def my_headers(span, page=None):
170-
"""
171-
Provide some custom header logic.
172-
This is a callable which accepts a text span and the page.
173-
Could be extended to check for other properties of the span, for
174-
instance the font name, text color and other attributes.
175-
"""
176-
# header level is h1 if font size is larger than 14
177-
# header level is h2 if font size is larger than 10
178-
# otherwise it is body text
179-
if span["size"] > 14:
180-
return "# "
181-
elif span["size"] > 10:
182-
return "## "
183-
else:
184-
return ""
185-
186-
# this will *NOT* scan the document for font sizes!
187-
md_text = pymupdf4llm.to_markdown(doc, hdr_info=my_headers)
188162

189-
**How to provide your own header logic (example 2)**
163+
**How to provide your own header logic (example 1)**
164+
165+
Provide your own function which uses pre-determined, fixed font sizes::
166+
167+
import pymupdf, pymupdf4llm
168+
169+
filename = "input.pdf"
170+
doc = pymupdf.open(filename) # use a Document for subsequent processing
171+
172+
def my_headers(span, page=None):
173+
    """
174+
    Provide some custom header logic.
175+
    This is a callable which accepts a text span and the page.
176+
    Could be extended to check for other properties of the span, for
177+
    instance the font name, text color and other attributes.
178+
    """
179+
    # header level is h1 if font size is larger than 14
180+
    # header level is h2 if font size is larger than 10
181+
    # otherwise it is body text
182+
    if span["size"] > 14:
183+
        return "# "
184+
    elif span["size"] > 10:
185+
        return "## "
186+
    else:
187+
        return ""
190188
191-
This user function uses the document's Table of Contents -- under the assumption that the bookmark text is also present as a header line on the page (which certainly need not be the case!)::
192-
193-
import pymupdf, pymupdf4llm
194-
195-
filename = "input.pdf"
196-
doc = pymupdf.open(filename) # use a Document for subsequent processing
197-
TOC = doc.get_toc() # use the table of contents for determining headers
198-
199-
def my_headers(span, page=None):
200-
"""
201-
Provide some custom header logic (experimental!).
202-
This callable checks whether the span text matches any of the
203-
TOC titles on this page.
204-
If so, use TOC hierarchy level as header level.
205-
"""
206-
# TOC items on this page:
207-
toc = [t for t in TOC if t[-1] == page.number + 1]
208-
209-
if not toc: # no TOC items on this page
210-
return ""
211-
212-
# look for a match in the TOC items
213-
for lvl, title, _ in toc:
214-
if span["text"].startswith(title):
215-
return "#" * lvl + " "
216-
if title.startswith(span["text"]):
217-
return "#" * lvl + " "
218-
189+
# this will *NOT* scan the document for font sizes!
190+
md_text = pymupdf4llm.to_markdown(doc, hdr_info=my_headers)
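
As the docstring of example 1 suggests, the callable can look at more than the font size. The following sketch additionally requires bold text for second-level headers. It relies on bit 4 (value 16) of the span's ``flags`` entry, which marks bold fonts in PyMuPDF span dictionaries. The thresholds and the mock spans are illustrative assumptions, and the function is exercised here without opening a document.

```python
BOLD = 16  # bit 4 of a span's "flags" value marks bold text


def my_headers(span, page=None):
    """Illustrative variant: combine font size with the bold flag."""
    size = span["size"]
    is_bold = bool(span.get("flags", 0) & BOLD)
    if size > 14:
        return "# "
    if size > 10 and is_bold:
        return "## "
    return ""


# Mock span dictionaries (only the keys used above are filled in).
spans = [
    {"size": 18.0, "flags": 0, "text": "Title"},
    {"size": 12.0, "flags": 16, "text": "Section"},
    {"size": 12.0, "flags": 0, "text": "Large body text"},
]
for s in spans:
    print(repr(my_headers(s)), s["text"])
```

A callable like this can be passed as ``hdr_info`` in the same way as in example 1.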
191+
192+
**How to provide your own header logic (example 2)**
193+
194+
This user function uses the document's Table of Contents -- under the assumption that the bookmark text is also present as a header line on the page (which certainly need not be the case!)::
195+
196+
import pymupdf, pymupdf4llm
197+
198+
filename = "input.pdf"
199+
doc = pymupdf.open(filename) # use a Document for subsequent processing
200+
TOC = doc.get_toc() # use the table of contents for determining headers
201+
202+
def my_headers(span, page=None):
203+
    """
204+
    Provide some custom header logic (experimental!).
205+
    This callable checks whether the span text matches any of the
206+
    TOC titles on this page.
207+
    If so, use TOC hierarchy level as header level.
208+
    """
209+
    # TOC items on this page:
210+
    toc = [t for t in TOC if t[-1] == page.number + 1]
211+
212+
    if not toc:  # no TOC items on this page
219213
        return ""
214+
215+
    # look for a match in the TOC items
216+
    for lvl, title, _ in toc:
217+
        if span["text"].startswith(title):
218+
            return "#" * lvl + " "
219+
        if title.startswith(span["text"]):
220+
            return "#" * lvl + " "
220221
221-
# this will *NOT* scan the document for font sizes!
222-
md_text = pymupdf4llm.to_markdown(doc, hdr_info=my_headers)
222+
    return ""
223+
224+
# this will *NOT* scan the document for font sizes!
225+
md_text = pymupdf4llm.to_markdown(doc, hdr_info=my_headers)
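
The matching logic of example 2 can be tried out without a PDF. The sketch below uses a hand-written TOC in the ``[level, title, page]`` shape returned by ``Document.get_toc()`` and a stand-in object for the page. Both are assumptions made for illustration. It also skips empty span text, because every title "starts with" the empty string and would otherwise match.

```python
from types import SimpleNamespace

# Hand-written entries: [level, title, 1-based page number]
TOC = [[1, "Introduction", 1], [2, "Scope", 1], [1, "Syntax", 3]]


def my_headers(span, page=None):
    """Return a header prefix if the span text matches a TOC title on this page."""
    text = span["text"].strip()
    if not text:  # an empty string would match every title
        return ""
    for lvl, title, _ in (t for t in TOC if t[-1] == page.number + 1):
        if text.startswith(title) or title.startswith(text):
            return "#" * lvl + " "
    return ""


page0 = SimpleNamespace(number=0)  # stand-in for the first page (0-based number)
print(repr(my_headers({"text": "Introduction"}, page0)))
print(repr(my_headers({"text": "Scope of this document"}, page0)))
print(repr(my_headers({"text": "Syntax"}, page0)))  # title belongs to page 3
```

Very short spans can still produce false matches, since any text that is a prefix of a title is accepted.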
223226

224227
----
225228

226229

230+
.. class:: TocHeaders
231+
232+
.. method:: __init__(self, doc: pymupdf.Document | str)
233+
234+
Create an object which uses the document's Table of Contents (TOC) to determine header levels. Upon object creation, the table of contents is read via the `Document.get_toc()` method. The TOC data is then used to determine header levels in the `to_markdown()` method.
235+
236+
This is an alternative to :class:`IdentifyHeaders`. Instead of running through the full document to identify font sizes, it uses the document's Table of
237+
Contents (TOC) to identify headers on pages. Like :class:`IdentifyHeaders`, it cannot guarantee that headers are found, but for a well-built table of contents it has a good chance of identifying header lines on document pages more accurately than the font-size-based approach.
238+
239+
It also has the advantage of being much faster than the font-size-based approach, as it does not execute a full document scan or even access any of the document pages.
240+
241+
Examples where this approach works very well are Adobe's PDF documentation files.
242+
243+
Please note that this feature **does not read document pages** where a table of contents may exist as normal text. It only accesses the data provided by the `Document.get_toc()` method. It will not identify any headers in documents whose table of contents is not available as a collection of bookmarks.
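
Because a document without bookmarks yields no headers with this class, it can be useful to check for a TOC before choosing an approach. The following is a minimal sketch: the helper ``pick_header_strategy`` and the ``FakeDoc`` stand-in are hypothetical, and only the ``get_toc()`` call mirrors the real ``Document`` method.

```python
def pick_header_strategy(doc) -> str:
    """Hypothetical helper: "toc" if the document has bookmarks, else "fontsize"."""
    return "toc" if doc.get_toc() else "fontsize"


class FakeDoc:
    """Stand-in for pymupdf.Document, for illustration only."""

    def __init__(self, toc):
        self._toc = toc

    def get_toc(self):
        return self._toc


print(pick_header_strategy(FakeDoc([[1, "Introduction", 1]])))
print(pick_header_strategy(FakeDoc([])))
```

With a real document, the result could be used to decide between passing a :class:`TocHeaders` or an :class:`IdentifyHeaders` object as ``hdr_info``.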
244+
245+
.. method:: get_header_id(self, span: dict, page=None) -> str
246+
247+
Return appropriate markdown header prefix. This is either an empty string or a string of "#" characters followed by a space.
248+
249+
Given a text span from a "dict" extraction variant, determine the markdown header prefix string of 0 to n concatenated "#" characters.
250+
251+
:arg dict span: a dictionary containing the text span information. This is one of the span dictionaries contained in the output of `page.get_text("dict")`.
252+
253+
:arg Page page: the owning page object. This can be used when additional information needs to be extracted.
254+
255+
:returns: an empty string for body text, otherwise a string of "#" characters followed by a space.
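
The span dictionaries sit three levels deep in a "dict" extraction: blocks contain lines, and lines contain spans. The sketch below walks a small hand-written mock of that structure and prepends the prefix returned for each span. The ``get_header_id`` function here is a size-based stand-in with an assumed threshold, not the method of this class.

```python
# Hand-written mock of the nested structure of page.get_text("dict").
page_dict = {
    "blocks": [
        {"type": 0, "lines": [{"spans": [{"size": 16.0, "text": "Overview"}]}]},
        {"type": 1},  # image block: has no "lines"
        {"type": 0, "lines": [{"spans": [{"size": 10.0, "text": "Body text."}]}]},
    ]
}


def iter_spans(page_dict):
    """Yield every text span of a "dict" extraction."""
    for block in page_dict["blocks"]:
        if block.get("type") != 0:  # skip non-text blocks
            continue
        for line in block["lines"]:
            yield from line["spans"]


def get_header_id(span, page=None) -> str:
    """Stand-in: empty string for body text, else "#" characters plus a space."""
    return "# " if span["size"] > 14 else ""


md_lines = [get_header_id(s) + s["text"] for s in iter_spans(page_dict)]
print(md_lines)
```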
256+
257+
258+
259+
**How to use class TocHeaders**
260+
261+
This is a version of the previous **example 2** that uses :class:`TocHeaders` for header identification::
262+
263+
import pymupdf, pymupdf4llm
264+
265+
filename = "input.pdf"
266+
267+
doc = pymupdf.open(filename) # use a Document for subsequent processing
268+
my_headers = pymupdf4llm.TocHeaders(doc) # use the table of contents for determining headers
269+
270+
# this will *NOT* scan the document for font sizes!
271+
md_text = pymupdf4llm.to_markdown(doc, hdr_info=my_headers)
272+
273+
-----
274+
227275
.. class:: pdf_markdown_reader.PDFMarkdownReader
228276

229277
.. method:: load_data(file_path: Union[Path, str], extra_info: Optional[Dict] = None, **load_kwargs: Any) -> List[LlamaIndexDocument]
