Commit e869c87

Address 4439
Address 4439 (typo in init of Xml class). Also replaced unnecessary use of `math.sqrt()` by Python's built-in exponentiation operator. Add more details to pymupdf4llm
1 parent cc114d7 commit e869c87

File tree

4 files changed

+147
-22
lines changed

docs/pymupdf4llm/api.rst

Lines changed: 133 additions & 4 deletions
@@ -16,7 +16,7 @@ The |PyMuPDF4LLM| API
1616

1717
Prints the version of the library.
1818

19-
.. method:: to_markdown(doc: pymupdf.Document | str, *, pages: list | range | None = None, hdr_info: Any = None, write_images: bool = False, embed_images: bool = False, dpi: int = 150, filename=None, image_path="", image_format="png", image_size_limit=0.05, force_text=True, margins=0, page_chunks: bool = False, page_width: float = 612, page_height: float = None, table_strategy="lines_strict", graphics_limit: int = None, ignore_code: bool = False, extract_words: bool = False, show_progress: bool = False, use_glyphs=False) -> str | list[dict]
19+
.. method:: to_markdown(doc: pymupdf.Document | str, *, pages: list | range | None = None, hdr_info: Any = None, write_images: bool = False, embed_images: bool = False, ignore_images: bool = False, ignore_graphics: bool = False, dpi: int = 150, filename=None, image_path="", image_format="png", image_size_limit=0.05, force_text=True, margins=0, page_chunks: bool = False, page_width: float = 612, page_height: float = None, table_strategy="lines_strict", graphics_limit: int = None, ignore_code: bool = False, extract_words: bool = False, show_progress: bool = False, use_glyphs=False) -> str | list[dict]
2020

2121
Reads the pages of the file and outputs their text in |Markdown| format. How this happens in detail can be influenced by a number of parameters. Please note that there exists **support for building page chunks** from the |Markdown| text.
2222

@@ -30,6 +30,10 @@ The |PyMuPDF4LLM| API
3030

3131
:arg bool embed_images: like `write_images`, but images will be included in the markdown text as base64-encoded strings. Ignores `write_images` and `image_path` if used. This may drastically increase the size of your markdown text.
3232

33+
:arg bool ignore_images: (New in v.0.0.20) Disregard images on the page. This may help detecting text correctly when pages are very crowded (often the case for documents representing presentation slides). Also speeds up processing time.
34+
35+
:arg bool ignore_graphics: (New in v.0.0.20) Disregard vector graphics on the page. This may help detecting text correctly when pages are very crowded (often the case for documents representing presentation slides). Also speeds up processing time. Vector graphics are still used for table detection.
36+
3337
:arg float image_size_limit: this must be a positive value less than 1. Images are ignored if `width / page.rect.width <= image_size_limit` or `height / page.rect.height <= image_size_limit`. For instance, the default value 0.05 means that to be considered for inclusion, an image's width and height must be larger than 5% of the page's width and height, respectively.
3438

3539
:arg int dpi: specify the desired image resolution in dots per inch. Relevant only if `write_images=True`. Default value is 150.
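The `image_size_limit` rule described above can be restated as a small predicate. This is a minimal sketch, not the library's code: the function name `image_is_ignored` and its arguments are hypothetical, and only the comparison itself is taken from the parameter description.

```python
def image_is_ignored(img_width, img_height, page_width, page_height,
                     image_size_limit=0.05):
    """Return True if the image is too small to be considered (hypothetical helper)."""
    if not 0 < image_size_limit < 1:
        raise ValueError("image_size_limit must be positive and less than 1")
    # An image is dropped if EITHER dimension is at or below the limit.
    return (img_width / page_width <= image_size_limit
            or img_height / page_height <= image_size_limit)


# Letter-sized page: 612 x 792 points
narrow = image_is_ignored(30, 400, 612, 792)    # width is below 5% of the page width
regular = image_is_ignored(100, 100, 612, 792)  # both dimensions exceed 5%
```

Because the two conditions are joined by `or`, a tall but very narrow image is ignored as well.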
@@ -62,13 +66,13 @@ The |PyMuPDF4LLM| API
6266

6367
- **"words"** - if `extract_words=True` was used. This is a list of tuples `(x0, y0, x1, y1, "wordstring", bno, lno, wno)` as delivered by `page.get_text("words")`. The **sequence** of these tuples however is the same as produced in the markdown text string and thus honors multi-column text. This is also true for text in tables: words are extracted in the sequence of table row cells.
6468

65-
:arg str filename: (New in v.0.0.19) Overwrites or sets the desired image file name of written images. Useful when the document is provided as a memory object (which has no inherent name).
69+
:arg str filename: (New in v.0.0.19) Overwrites or sets the desired image file name of written images. Useful when the document is provided as a memory object (which has no inherent file name).
6670

67-
:arg float page_width: specify a desired page width. This is ignored for documents with a fixed page width like PDF, XPS etc. **Reflowable** documents however, like e-books, office or text files have no fixed page dimensions and by default are assumed to have Letter format width (612) and an **"infinite"** page height. This means that the full document is treated as one large page.
71+
:arg float page_width: specify a desired page width. This is ignored for documents with a fixed page width like PDF, XPS etc. **Reflowable** documents however, like e-books, office [#f2]_ or text files have no fixed page dimensions and by default are assumed to have Letter format width (612) and an **"infinite"** page height. This means that the **full document is treated as one large page.**
6872

6973
:arg float page_height: specify a desired page height. For relevance see the `page_width` parameter. If using the default `None`, the document will appear as one large page with a width of `page_width`. Consequently in this case, no markdown page separators will occur (except the final one), respectively only one page chunk will be returned.
7074

71-
:arg str table_strategy: table detection strategy. Default is `"lines_strict"` which ignores background colors. In some occasions, other strategies may be more successful, for example `"lines"` which uses all vector graphics objects for detection.
75+
:arg str table_strategy: `table detection strategy <https://pymupdf.readthedocs.io/en/latest/page.html#Page.find_tables>`_. Default is `"lines_strict"` which ignores background colors. In some occasions, other strategies may be more successful, for example `"lines"` which uses all vector graphics objects for detection. **Changed in v0.0.19:** A value of `None` will not perform any table detection at all. This may be useful when you know that your document contains no tables. Execution time savings can be significant.
7276

7377
:arg int graphics_limit: use this to limit dealing with excess amounts of vector graphics elements. Scientific documents, or pages simulating text via graphics commands may contain tens of thousands of these objects. As vector graphics are analyzed for multiple purposes, runtime may quickly become intolerable. With this parameter, all vector graphics will be ignored if their count exceeds the threshold. **Changed in v0.0.19:** The page will still be processed, and text, tables and images should be extracted.
7478

@@ -97,6 +101,129 @@ The |PyMuPDF4LLM| API
97101
----
98102

99103

104+
.. class:: IdentifyHeaders
105+
106+
.. method:: __init__(self, doc: pymupdf.Document | str, *, pages: list | range | None = None, body_limit: float = 11, max_levels: int = 6)
107+
108+
Create an object which maps text font sizes to the respective number of '#' characters which are used by Markdown syntax to indicate header levels. The object is created by scanning the document for font size "popularity". The most popular font size and all smaller sizes are used for body text. Larger font sizes are mapped to the respective header levels - which correspond to the HTML tags `<h1>` to `<h6>`.
109+
110+
All font sizes are rounded to integer values.
111+
112+
If more than 6 header levels would be required, then the largest number smaller than the `<h6>` font size is used for body text.
113+
114+
Please note that creating the object will read and inspect the text of the entire document - independently of reading the document again in the `to_markdown()` method subsequently. Method `to_markdown()` by default **will create this object** if you do not override its `hdr_info=None` parameter.
115+
116+
117+
:arg Document,str doc: the file, to be specified either as a file path string, or as a |PyMuPDF| Document (created via `pymupdf.open`). In order to use `pathlib.Path` specifications, Python file-like objects, documents in memory etc. you **must** use a |PyMuPDF| Document.
118+
119+
:arg list pages: optional, the pages to consider. If omitted all pages are processed.
120+
121+
:arg float body_limit: the default font size limit for body text. Only used when the document scan does not deliver valid information.
122+
123+
:arg int max_levels: the maximum number of header levels to be used. Valid values are in `range(1, 7)`. The default is 6, which corresponds to the HTML tags `<h1>` to `<h6>`. A smaller value will limit the number of generated header levels. For instance, a value of 3 will only generate header tags "#", "##" and "###". Body text will be assumed for all font sizes smaller than the one corresponding to "###".
124+
125+
126+
.. method:: get_header_id(self, span: dict, page=None) -> str
127+
128+
Return appropriate markdown header prefix. This is either "" or a string of "#" characters followed by a space.
129+
130+
Given a text span from a "dict" extraction, determine the
131+
markdown header prefix string of 0 to n concatenated '#' characters.
132+
133+
:arg dict span: a dictionary containing the text span information. This is the same dictionary as returned by `page.get_text("dict")`.
134+
135+
:arg Page page: the owning page object. This can be used when additional information needs to be extracted.
136+
137+
:returns: a string of "#" characters followed by a space.
138+
139+
.. attribute:: header_id
140+
141+
A dictionary mapping (integer) font sizes to Markdown header strings like ``{14: '# ', 12: '## '}``. The dictionary is created by the `IdentifyHeaders` constructor. The keys are the font sizes of the text spans in the document. The values are the respective header strings.
142+
143+
.. attribute:: body_limit
144+
145+
An integer value indicating the font size limit for body text. This is computed as ``min(header_id.keys()) - 1``. In the above example, body_limit would be 11.
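The font size "popularity" idea described for `IdentifyHeaders` can be illustrated without opening a document. The sketch below is an assumption about the general approach, not the library's implementation: the helper names `build_header_map` and `header_prefix` are hypothetical, spans are plain dictionaries with only `"size"` and `"text"` keys, and popularity is measured by text length.

```python
from collections import Counter


def build_header_map(spans, body_limit=11, max_levels=6):
    """Map rounded font sizes to Markdown header prefixes (illustrative only)."""
    counts = Counter()
    for span in spans:
        # Assumption: weight each rounded size by the amount of text using it.
        counts[round(span["size"])] += len(span["text"].strip())
    # The most popular size counts as body text; fall back to body_limit.
    body_size = counts.most_common(1)[0][0] if counts else body_limit
    # Larger sizes become headers, the largest one being level 1.
    header_sizes = sorted((s for s in counts if s > body_size), reverse=True)
    return {s: "#" * (lvl + 1) + " "
            for lvl, s in enumerate(header_sizes[:max_levels])}


def header_prefix(header_map, span):
    """Return "" for body text, else a string of '#' followed by a space."""
    return header_map.get(round(span["size"]), "")


spans = [
    {"size": 10.0, "text": "body text " * 20},
    {"size": 14.2, "text": "Title"},
    {"size": 12.0, "text": "Section"},
]
header_map = build_header_map(spans)
section_prefix = header_prefix(header_map, {"size": 12.3, "text": "Another section"})
body_prefix = header_prefix(header_map, {"size": 10.0, "text": "plain text"})
```

With these spans the map matches the example given for `header_id`, and `min(header_map) - 1` yields the body limit of 11 mentioned for `body_limit`.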
146+
147+
148+
**How to limit header levels (example)**
149+
150+
Limit the generated header levels to 3::
151+
152+
import pymupdf, pymupdf4llm
153+
154+
filename = "input.pdf"
155+
doc = pymupdf.open(filename) # use a Document for subsequent processing
156+
my_headers = pymupdf4llm.IdentifyHeaders(doc, max_levels=3) # generate header info
157+
md_text = pymupdf4llm.to_markdown(doc, hdr_info=my_headers)
158+
159+
160+
**How to provide your own header logic (example 1)**
161+
162+
Provide your own function which uses pre-determined, fixed font sizes::
163+
164+
import pymupdf, pymupdf4llm
165+
166+
filename = "input.pdf"
167+
doc = pymupdf.open(filename) # use a Document for subsequent processing
168+
169+
def my_headers(span, page=None):
170+
    """
171+
    Provide some custom header logic.
172+
    This is a callable which accepts a text span and the page.
173+
    Could be extended to check for other properties of the span, for
174+
    instance the font name, text color and other attributes.
175+
    """
176+
    # header level is h1 if font size is larger than 14
177+
    # header level is h2 if font size is larger than 10
178+
    # otherwise it is body text
179+
    if span["size"] > 14:
180+
        return "# "
181+
    elif span["size"] > 10:
182+
        return "## "
183+
    else:
184+
        return ""
185+
186+
# this will *NOT* scan the document for font sizes!
187+
md_text = pymupdf4llm.to_markdown(doc, hdr_info=my_headers)
188+
189+
**How to provide your own header logic (example 2)**
190+
191+
This user function uses the document's Table of Contents -- under the assumption that the bookmark text is also present as a header line on the page (which certainly need not be the case!)::
192+
193+
import pymupdf, pymupdf4llm
194+
195+
filename = "input.pdf"
196+
doc = pymupdf.open(filename) # use a Document for subsequent processing
197+
TOC = doc.get_toc() # get the table of contents for determining headers
198+
199+
def my_headers(span, page=None):
200+
    """
201+
    Provide some custom header logic (experimental!).
202+
    This callable checks whether the span text matches any of the
203+
    TOC titles on this page.
204+
    If so, use TOC hierarchy level as header level.
205+
    """
206+
    # TOC items on this page:
207+
    toc = [t for t in TOC if t[-1] == page.number + 1]
208+
209+
    if not toc: # no TOC items on this page
210+
        return ""
211+
212+
    # look for a match in the TOC items
213+
    for lvl, title, _ in toc:
214+
        if span["text"].startswith(title):
215+
            return "#" * lvl + " "
216+
        if title.startswith(span["text"]):
217+
            return "#" * lvl + " "
218+
219+
    return ""
220+
221+
# this will *NOT* scan the document for font sizes!
222+
md_text = pymupdf4llm.to_markdown(doc, hdr_info=my_headers)
223+
224+
----
225+
226+
100227
.. class:: pdf_markdown_reader.PDFMarkdownReader
101228

102229
.. method:: load_data(file_path: Union[Path, str], extra_info: Optional[Dict] = None, **load_kwargs: Any) -> List[LlamaIndexDocument]
@@ -115,6 +242,8 @@ For a list of changes, please see file `CHANGES.md <https://github.com/pymupdf/R
115242

116243
.. [#f1] `LlamaIndex documentation <https://docs.llamaindex.ai/en/stable/>`_
117244
245+
.. [#f2] When using PyMuPDF-Pro, supported office documents are converted internally into a PDF-like format. Therefore, they **will have fixed page dimensions** and be no longer "reflowable". Consequently, the page width and page height specifications will be ignored as well in these cases.
246+
118247
119248
120249

docs/pymupdf4llm/index.rst

Lines changed: 0 additions & 6 deletions
@@ -142,10 +142,4 @@ Blogs
142142
- `Building a RAG Chatbot GUI with the ChatGPT API and PyMuPDF <https://artifex.com/blog/building-a-rag-chatbot-gui-with-the-chatgpt-api-and-pymupdf>`_
143143
- `RAG/LLM and PDF: Conversion to Markdown Text with PyMuPDF <https://artifex.com/blog/rag-llm-and-pdf-conversion-to-markdown-text-with-pymupdf>`_
144144

145-
146-
147-
148-
149-
150-
151145
.. include:: ../footer.rst

docs/tutorial.rst

Lines changed: 2 additions & 0 deletions
@@ -13,6 +13,8 @@ This tutorial will show you the use of |PyMuPDF|, :title:`MuPDF` in :title:`Pyth
1313

1414
Because :title:`MuPDF` supports not only PDF, but also XPS, OpenXPS, CBZ, CBR, FB2 and EPUB formats, so does PyMuPDF [#f1]_. Nevertheless, for the sake of brevity we will only talk about PDF files. At places where indeed only PDF files are supported, this will be mentioned explicitly.
1515

16+
In addition to this introduction, please do visit PyMuPDF's `Youtube Channel <https://www.youtube.com/@PyMuPDF>`_ which covers most of the following in the form of Youtube "Shorts" and longer videos.
17+
1618
Importing the Bindings
1719
==========================
1820
The Python bindings to MuPDF are made available by this import statement. We also show here how your version can be checked::

src/__init__.py

Lines changed: 12 additions & 12 deletions
@@ -2066,12 +2066,12 @@ def __enter__(self):
20662066
    def __exit__(self, *args):
20672067
        pass
20682068

2069-
    def __init__( self, rhs):
2070-
        if isinstance( rhs, mupdf.FzXml):
2069+
    def __init__(self, rhs):
2070+
        if isinstance(rhs, mupdf.FzXml):
20712071
            self.this = rhs
2072-
        elif isinstance( str):
2073-
            buff = mupdf.fz_new_buffer_from_copied_data( rhs)
2074-
            self.this = mupdf.fz_parse_xml_from_html5( buff)
2072+
        elif isinstance(rhs, str):
2073+
            buff = mupdf.fz_new_buffer_from_copied_data(rhs)
2074+
            self.this = mupdf.fz_parse_xml_from_html5(buff)
20752075
        else:
20762076
            assert 0, f'Unsupported type for rhs: {type(rhs)}'
20772077
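The defect fixed in the hunk above is easy to reproduce in isolation: `isinstance()` requires two arguments, so the old `elif` branch raised `TypeError` whenever it was reached, that is, for any `rhs` that was not an `FzXml` object. The sketch below uses a stand-in class (`FakeXml`, hypothetical) and two illustrative functions instead of the real `mupdf` bindings.

```python
class FakeXml:
    """Hypothetical stand-in for mupdf.FzXml; not part of PyMuPDF."""


def classify_old(rhs):
    # Mirrors the branch structure before the fix.
    if isinstance(rhs, FakeXml):
        return "xml"
    elif isinstance(str):  # bug: the object to be checked is missing
        return "html"
    raise AssertionError(f"Unsupported type for rhs: {type(rhs)}")


def classify_new(rhs):
    # Mirrors the branch structure after the fix.
    if isinstance(rhs, FakeXml):
        return "xml"
    elif isinstance(rhs, str):
        return "html"
    raise AssertionError(f"Unsupported type for rhs: {type(rhs)}")


try:
    classify_old("<p>hello</p>")
    old_result = "no error"
except TypeError:
    old_result = "TypeError"

new_result = classify_new("<p>hello</p>")
```

Note that the old code still worked for `FzXml` input, because the faulty branch was never evaluated in that case.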

@@ -6642,7 +6642,7 @@ def uri(self):
66426642
class Matrix:
66436643

66446644
def __abs__(self):
6645-
return math.sqrt(sum([c*c for c in self]))
6645+
return (sum([c*c for c in self])) ** 0.5
66466646

66476647
def __add__(self, m):
66486648
if hasattr(m, "__float__"):
@@ -10638,7 +10638,7 @@ def __del__(self):
1063810638
class Point:
1063910639

1064010640
def __abs__(self):
10641-
return math.sqrt(self.x * self.x + self.y * self.y)
10641+
return (self.x * self.x + self.y * self.y) ** 0.5
1064210642

1064310643
def __add__(self, p):
1064410644
if hasattr(p, "__float__"):
@@ -10749,7 +10749,7 @@ def abs_unit(self):
1074910749
s = self.x * self.x + self.y * self.y
1075010750
if s < EPSILON:
1075110751
return Point(0,0)
10752-
s = math.sqrt(s)
10752+
s = s ** 0.5
1075310753
return Point(abs(self.x) / s, abs(self.y) / s)
1075410754

1075510755
def distance_to(self, *args):
@@ -10815,7 +10815,7 @@ def unit(self):
1081510815
s = self.x * self.x + self.y * self.y
1081610816
if s < EPSILON:
1081710817
return Point(0,0)
10818-
s = math.sqrt(s)
10818+
s = s ** 0.5
1081910819
return Point(self.x / s, self.y / s)
1082010820

1082110821
__div__ = __truediv__
@@ -11276,7 +11276,7 @@ def morph(self, p, m):
1127611276
return self.quad.morph(p, m)
1127711277

1127811278
def norm(self):
11279-
return math.sqrt(sum([c*c for c in self]))
11279+
return (sum([c*c for c in self])) ** 0.5
1128011280

1128111281
def normalize(self):
1128211282
"""Replace rectangle with its finite version."""
@@ -13222,7 +13222,7 @@ def morph(self, p, m):
1322213222
return self.quad.morph(p, m)
1322313223

1322413224
def norm(self):
13225-
return math.sqrt(sum([c*c for c in self]))
13225+
return (sum([c*c for c in self])) ** 0.5
1322613226

1322713227
def normalize(self):
1322813228
"""Replace rectangle with its valid version."""
@@ -18664,7 +18664,7 @@ def jm_trace_text_span(dev, span, type_, ctm, colorspace, color, alpha, seqno):
1866418664

1866518665
mat = mupdf.fz_concat(span.trm(), ctm) # text transformation matrix
1866618666
dir = mupdf.fz_transform_vector(mupdf.fz_make_point(1, 0), mat) # writing direction
18667-
fsize = math.sqrt(dir.x * dir.x + dir.y * dir.y) # font size
18667+
fsize = (dir.x * dir.x + dir.y * dir.y) ** 0.5 # font size
1866818668

1866918669
dir = mupdf.fz_normalize_vector(dir)
1867018670
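The replacement of `math.sqrt(x)` by `x ** 0.5` in the hunks above can be checked with a short, self-contained sketch. The function name `norm` is borrowed from the diff for illustration only; the code does not use PyMuPDF, and the comparison relies on `math.isclose` rather than assuming bit-identical results.

```python
import math


def norm(values):
    """Euclidean norm using the exponentiation operator."""
    return sum(c * c for c in values) ** 0.5


length = norm((3.0, 4.0))
same = math.isclose(length, math.sqrt(3.0 * 3.0 + 4.0 * 4.0))

# The two forms differ for negative input:
try:
    math.sqrt(-1.0)
    sqrt_negative = "no error"
except ValueError:
    sqrt_negative = "ValueError"

power_negative = (-1.0) ** 0.5  # a complex number in Python 3
```

Since every changed call site takes the root of a sum of squares, which cannot be negative, this difference should not affect the methods touched by the commit.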
