Address 4439 (type in init of Xml class).
Also replaced unnecessary use of `math.sqrt()` with Python's built-in exponentiation operator (`**`).
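For illustration only, a minimal sketch of what such a replacement looks like. The operands below are made up, since the commit's actual code is not shown here; the point is merely that `** 0.5` yields the square root of a non-negative number without needing the `math` module.

```python
import math

# Hypothetical values; the operands actually touched by the commit are not shown.
dx, dy = 3.0, 4.0

dist_sqrt = math.sqrt(dx * dx + dy * dy)  # previous style, needs "import math"
dist_pow = (dx * dx + dy * dy) ** 0.5     # built-in exponentiation operator

print(dist_sqrt, dist_pow)
```

Note that the two forms differ for negative input: `math.sqrt()` raises `ValueError`, whereas `** 0.5` returns a complex number.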
Add more details to pymupdf4llm
  Reads the pages of the file and outputs their text in |Markdown| format. How this should happen in detail can be influenced by a number of parameters. Please note that there exists **support for building page chunks** from the |Markdown| text.
@@ -30,6 +30,10 @@ The |PyMuPDF4LLM| API

  :arg bool embed_images: like `write_images`, but images will be included in the markdown text as base64-encoded strings. Ignores `write_images` and `image_path` if used. This may drastically increase the size of your markdown text.

+ :arg bool ignore_images: (New in v.0.0.20) Disregard images on the page. This may help to detect text correctly when pages are very crowded (often the case for documents representing presentation slides). Also reduces processing time.
+
+ :arg bool ignore_graphics: (New in v.0.0.20) Disregard vector graphics on the page. This may help to detect text correctly when pages are very crowded (often the case for documents representing presentation slides). Also reduces processing time. Vector graphics are still used for table detection.
+
  :arg float image_size_limit: this must be a positive value less than 1. Images are ignored if `width / page.rect.width <= image_size_limit` or `height / page.rect.height <= image_size_limit`. For instance, the default value 0.05 means that to be considered for inclusion, an image's width and height must be larger than 5% of the page's width and height, respectively.

  :arg int dpi: specify the desired image resolution in dots per inch. Relevant only if `write_images=True`. Default value is 150.
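The `image_size_limit` rule quoted above can be sketched in plain Python. This is only an illustration of the documented condition, not the library's code; the helper name `image_is_ignored` and the sample dimensions are assumptions.

```python
def image_is_ignored(img_width, img_height, page_width, page_height,
                     image_size_limit=0.05):
    """Return True if the image is too small, relative to the page, to be kept."""
    if not 0 < image_size_limit < 1:
        raise ValueError("image_size_limit must be a positive value less than 1")
    # Ignored as soon as EITHER dimension is at or below the limit.
    return (
        img_width / page_width <= image_size_limit
        or img_height / page_height <= image_size_limit
    )


# Letter-sized page: 612 x 792 points
narrow = image_is_ignored(30, 400, 612, 792)    # width 30 is below 5% of 612
regular = image_is_ignored(100, 100, 612, 792)  # both dimensions exceed 5%
print(narrow, regular)
```

Because the two conditions are joined by "or", a tall but very narrow image (such as a decorative rule) is dropped even though its height is large.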
@@ -62,13 +66,13 @@ The |PyMuPDF4LLM| API

  - **"words"** - if `extract_words=True` was used. This is a list of tuples `(x0, y0, x1, y1, "wordstring", bno, lno, wno)` as delivered by `page.get_text("words")`. The **sequence** of these tuples however is the same as produced in the markdown text string and thus honors multi-column text. This is also true for text in tables: words are extracted in the sequence of table row cells.

- :arg str filename: (New in v.0.0.19) Overwrites or sets the desired image file name of written images. Useful when the document is provided as a memory object (which has no inherent name).
+ :arg str filename: (New in v.0.0.19) Overwrites or sets the desired image file name of written images. Useful when the document is provided as a memory object (which has no inherent file name).

- :arg float page_width: specify a desired page width. This is ignored for documents with a fixed page width like PDF, XPS etc. **Reflowable** documents however, like e-books, office or text files have no fixed page dimensions and by default are assumed to have Letter format width (612) and an **"infinite"** page height. This means that the full document is treated as one large page.
+ :arg float page_width: specify a desired page width. This is ignored for documents with a fixed page width like PDF, XPS etc. **Reflowable** documents however, like e-books, office [#f2]_ or text files have no fixed page dimensions and by default are assumed to have Letter format width (612) and an **"infinite"** page height. This means that the **full document is treated as one large page.**

  :arg float page_height: specify a desired page height. For relevance see the `page_width` parameter. If using the default `None`, the document will appear as one large page with a width of `page_width`. Consequently in this case, no markdown page separators will occur (except the final one), respectively only one page chunk will be returned.

- :arg str table_strategy: table detection strategy. Default is `"lines_strict"` which ignores background colors. In some occasions, other strategies may be more successful, for example `"lines"` which uses all vector graphics objects for detection.
+ :arg str table_strategy: `table detection strategy <https://pymupdf.readthedocs.io/en/latest/page.html#Page.find_tables>`_. Default is `"lines_strict"` which ignores background colors. On some occasions, other strategies may be more successful, for example `"lines"` which uses all vector graphics objects for detection. **Changed in v0.0.19:** With a value of `None`, no table detection is performed at all. This may be useful when you know that your document contains no tables. Execution time savings can be significant.

  :arg int graphics_limit: use this to limit dealing with excess amounts of vector graphics elements. Scientific documents, or pages simulating text via graphics commands may contain tens of thousands of these objects. As vector graphics are analyzed for multiple purposes, runtime may quickly become intolerable. With this parameter, all vector graphics will be ignored if their count exceeds the threshold. **Changed in v0.0.19:** The page will still be processed, and text, tables and images should be extracted.
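As a small, hedged illustration of the `"words"` tuple layout described above: the sample list below is invented data in the documented shape `(x0, y0, x1, y1, "wordstring", bno, lno, wno)`, not output from a real document. Because the documented sequence already follows reading order, text lines can be rebuilt by grouping consecutive words that share block and line numbers.

```python
from itertools import groupby

# Invented sample data in the documented tuple layout.
words = [
    (72.0, 72.0, 100.0, 84.0, "Hello", 0, 0, 0),
    (104.0, 72.0, 140.0, 84.0, "world", 0, 0, 1),
    (72.0, 90.0, 95.0, 102.0, "Next", 0, 1, 0),
    (99.0, 90.0, 120.0, 102.0, "line", 0, 1, 1),
]

# Group consecutive words by (block number, line number), keeping list order.
lines = [
    " ".join(w[4] for w in group)
    for _, group in groupby(words, key=lambda w: (w[5], w[6]))
]
print(lines)
```

No sorting is applied here on purpose: re-sorting by coordinates would discard the multi-column and table-cell ordering that the list is documented to preserve.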
@@ -97,6 +101,129 @@ The |PyMuPDF4LLM| API
  ----


+ .. class:: IdentifyHeaders
+
+ .. method:: __init__(self, doc: pymupdf.Document | str, *, pages: list | range | None = None, body_limit: float = 11, max_levels: int = 6)
+
+ Create an object which maps text font sizes to the respective number of '#' characters which are used by Markdown syntax to indicate header levels. The object is created by scanning the document for font size "popularity". The most popular font size and all smaller sizes are used for body text. Larger font sizes are mapped to the respective header levels - which correspond to the HTML tags `<h1>` to `<h6>`.
+
+ All font sizes are rounded to integer values.
+
+ If more than 6 header levels would be required, then the largest font size smaller than the `<h6>` font size is used for body text.
+
+ Please note that creating the object will read and inspect the text of the entire document - independently of the subsequent read of the document performed by the `to_markdown()` method. By default, `to_markdown()` **will create this object** itself if you do not override its `hdr_info=None` parameter.
+
+
+ :arg Document,str doc: the file, to be specified either as a file path string, or as a |PyMuPDF| Document (created via `pymupdf.open`). In order to use `pathlib.Path` specifications, Python file-like objects, documents in memory etc. you **must** use a |PyMuPDF| Document.
+
+ :arg list pages: optional, the pages to consider. If omitted all pages are processed.
+
+ :arg float body_limit: the default font size limit for body text. Only used when the document scan does not deliver valid information.
+
+ :arg int max_levels: the maximum number of header levels to be used. Valid values are in `range(1, 7)`. The default is 6, which corresponds to the HTML tags `<h1>` to `<h6>`. A smaller value will limit the number of generated header levels. For instance, a value of 3 will only generate header tags "#", "##" and "###". Body text will be assumed for all font sizes smaller than the one corresponding to "###".
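The mapping described above can be approximated in a few lines of plain Python. This is a sketch under stated assumptions, not the implementation of `IdentifyHeaders`: the function name `map_sizes_to_headers`, its input (a flat list of span font sizes) and the tie-breaking behavior are all hypothetical.

```python
from collections import Counter


def map_sizes_to_headers(font_sizes, body_limit=11, max_levels=6):
    """Map rounded font sizes to Markdown header prefixes (illustrative only)."""
    if max_levels not in range(1, 7):
        raise ValueError("max_levels must be in range(1, 7)")
    counts = Counter(round(size) for size in font_sizes)
    # The most popular size is treated as body text; without any data,
    # fall back to the given body_limit.
    body = counts.most_common(1)[0][0] if counts else body_limit
    # Only the largest sizes above body text become headers, at most max_levels.
    larger = sorted((s for s in counts if s > body), reverse=True)[:max_levels]
    return {size: "#" * (level + 1) + " " for level, size in enumerate(larger)}


# Invented sample: mostly 10pt body text plus three larger sizes.
sizes = [10.2] * 50 + [11.8] * 5 + [14.0] * 2 + [18.3]
all_levels = map_sizes_to_headers(sizes)
two_levels = map_sizes_to_headers(sizes, max_levels=2)
print(all_levels, two_levels)
```

With `max_levels=2`, the size 12 no longer receives a header prefix and is therefore treated as body text, mirroring the behavior described for the `max_levels` parameter.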
+ Return the appropriate markdown header prefix. This is either "" or a string of "#" characters followed by a space.
+
+ Given a text span from a "dict" extraction, determine the
+ markdown header prefix string of 0 to n concatenated '#' characters.
+
+ :arg dict span: a dictionary containing the text span information. This is a span dictionary as contained in the output of `page.get_text("dict")`.
+
+ :arg Page page: the owning page object. This can be used when additional information needs to be extracted.
+
+ :returns: either "" or a string of "#" characters followed by a space.
+
+ .. attribute:: header_id
+
+ A dictionary mapping (integer) font sizes to Markdown header strings like ``{14: '# ', 12: '## '}``. The dictionary is created by the `IdentifyHeaders` constructor. The keys are the font sizes of the text spans in the document. The values are the respective header strings.
+
+ .. attribute:: body_limit
+
+ An integer value indicating the font size limit for body text. This is computed as ``min(header_id.keys()) - 1``. In the above example, body_limit would be 11.
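To make the relationship between `header_id` and `body_limit` concrete, here is a minimal sketch using the example mapping from the text. The lookup function `header_prefix` is a hypothetical stand-in; how the library treats sizes that lie between two dictionary keys is not stated above, so returning "" for them is an assumption.

```python
header_id = {14: "# ", 12: "## "}       # example mapping from the text above
body_limit = min(header_id.keys()) - 1  # computed as documented

def header_prefix(span):
    """Return the header prefix for a span dictionary, or "" for body text."""
    size = round(span["size"])  # font sizes are rounded to integers
    if size <= body_limit:
        return ""
    # Assumption: sizes without an entry in header_id get no header prefix.
    return header_id.get(size, "")


title_prefix = header_prefix({"size": 13.9, "text": "Chapter 1"})
body_prefix = header_prefix({"size": 10.5, "text": "Some paragraph text."})
print(repr(title_prefix), repr(body_prefix))
```

The span dictionaries here carry only the keys needed for the demonstration; real spans from a "dict" extraction contain further entries such as font name and bounding box.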
+
+
+ **How to limit header levels (example)**
+
+ Limit the generated header levels to 3::
+
+     import pymupdf, pymupdf4llm
+
+     filename = "input.pdf"
+     doc = pymupdf.open(filename)  # use a Document for subsequent processing
+     my_headers = pymupdf4llm.IdentifyHeaders(doc, max_levels=3)  # generate header info
+ **How to provide your own header logic (example 2)**
+
+ This user function uses the document's Table of Contents -- under the assumption that the bookmark text is also present as a header line on the page (which certainly need not be the case!)::
+
+     import pymupdf, pymupdf4llm
+
+     filename = "input.pdf"
+     doc = pymupdf.open(filename)  # use a Document for subsequent processing
+     TOC = doc.get_toc()  # get the table of contents for determining headers
+
+     def my_headers(span, page=None):
+         """
+         Provide some custom header logic (experimental!).
+         This callable checks whether the span text matches any of the
+         TOC titles on this page.
+         If so, use TOC hierarchy level as header level.
+         """
+         # TOC items on this page:
+         toc = [t for t in TOC if t[-1] == page.number + 1]
+
+         if not toc:  # no TOC items on this page
+             return ""
+
+         # look for a match in the TOC items
+         for lvl, title, _ in toc:
+             if span["text"].startswith(title):
+                 return "#" * lvl + " "
+             if title.startswith(span["text"]):
+                 return "#" * lvl + " "
+
+         return ""
+
+     # this will *NOT* scan the document for font sizes!
+ .. [#f2] When using PyMuPDF-Pro, supported office documents are converted internally into a PDF-like format. Therefore, they **will have fixed page dimensions** and will no longer be "reflowable". Consequently, the page width and page height specifications will be ignored in these cases as well.
docs/pymupdf4llm/index.rst: 0 additions & 6 deletions
@@ -142,10 +142,4 @@ Blogs
  - `Building a RAG Chatbot GUI with the ChatGPT API and PyMuPDF <https://artifex.com/blog/building-a-rag-chatbot-gui-with-the-chatgpt-api-and-pymupdf>`_
  - `RAG/LLM and PDF: Conversion to Markdown Text with PyMuPDF <https://artifex.com/blog/rag-llm-and-pdf-conversion-to-markdown-text-with-pymupdf>`_
docs/tutorial.rst: 2 additions & 0 deletions
@@ -13,6 +13,8 @@ This tutorial will show you the use of |PyMuPDF|, :title:`MuPDF` in :title:`Pyth

  Because :title:`MuPDF` supports not only PDF, but also XPS, OpenXPS, CBZ, CBR, FB2 and EPUB formats, so does PyMuPDF [#f1]_. Nevertheless, for the sake of brevity we will only talk about PDF files. At places where indeed only PDF files are supported, this will be mentioned explicitly.

+ In addition to this introduction, please do visit PyMuPDF's `YouTube Channel <https://www.youtube.com/@PyMuPDF>`_ which covers most of the following in the form of YouTube "Shorts" and longer videos.
+
  Importing the Bindings
  ==========================
  The Python bindings to MuPDF are made available by this import statement. We also show here how your version can be checked::