This change reflects the behavioral changes caused by the introduction of a file recognizer in MuPDF.
Basically, the file extension and the `filetype` parameter have lost most of their significance because the document type is now always derived from the actual file content.
* Changed in v1.14.13: support `io.BytesIO` for memory documents.
174
-
* Changed in v1.19.6: Clearer, shorter and more consistent exception messages. File type "pdf" is always assumed if not specified. Empty files and memory areas will always lead to exceptions.
175
-
176
-
Creates a *Document* object.
173
+
Create a ``Document`` object.
177
174
178
175
* With default parameters, a **new empty PDF** document will be created.
179
-
* If *stream* is given, then the document is created from memory and, if not a PDF, either *filename* or *filetype* must indicate its type.
180
-
* If *stream* is `None`, then a document is created from the file given by *filename*. Its type is inferred from the extension. This can be overruled by *filetype.*
176
+
* If ``stream`` is given, then the document is created from memory.
177
+
* If ``stream`` is `None`, then a document is created from the file given by ``filename``.
181
178
182
-
:arg str,pathlib filename: A UTF-8 string or *pathlib* object containing a file path. The document type is inferred from the filename extension. If not present or not matching :ref:`a supported type<Supported_File_Types>`, a PDF document is assumed. For memory documents, this argument may be used instead of `filetype`, see below.
179
+
:arg str,pathlib filename: A UTF-8 string or ``pathlib.Path`` object containing a file path. The document type is always determined from the file content. The ``filetype`` parameter can be used either to ensure that the detected type is as expected, or to force treating any file as plain text.
183
180
184
-
:arg bytes,bytearray,BytesIO stream: A memory area containing a supported document. If not a PDF, its type **must** be specified by either `filename` or `filetype`.
181
+
:arg bytes,bytearray,BytesIO stream: A memory area containing file data. The document type is **always** detected from the data content. The ``filetype`` parameter is ignored except for undetected data content. In that case only, using ``filetype="txt"`` will treat the data as containing plain text.
185
182
186
-
:arg str filetype: A string specifying the type of document. This may be anything looking like a filename (e.g. "x.pdf"), in which case MuPDF uses the extension to determine the type, or a mime type like *application/pdf*. Just using strings like "pdf" or ".pdf" will also work. May be omitted for PDF documents, otherwise must match :ref:`a supported document type<Supported_File_Types>`.
183
+
:arg str filetype: A string specifying the type of document. This may be anything looking like a filename (e.g. "x.pdf"), in which case MuPDF uses the extension to determine the type, or a mime type like ``application/pdf``. Just using strings like "pdf" or ".pdf" will also work. Can be omitted for :ref:`a supported document type<Supported_File_Types>`.
184
+
185
+
If opening a file name / path only, it will be used to ensure that the detected type is as expected. An exception is raised for a mismatch. Using `filetype="txt"` will treat any file as containing plain text.
186
+
187
+
When opening from memory, this parameter is ignored except for undetected data content. Only in that case, using ``filetype="txt"`` will treat the data as containing plain text.
187
188
188
189
:arg rect_like rect: a rectangle specifying the desired page size. This parameter is only meaningful for documents with a variable page layout ("reflowable" documents), like e-books or HTML, and ignored otherwise. If specified, it must be a non-empty, finite rectangle with top-left coordinates (0, 0). Together with parameter *fontsize*, each page will be accordingly laid out and hence also determine the number of pages.
189
190
190
-
:arg float width: may used together with *height* as an alternative to *rect* to specify layout information.
191
+
:arg float width: may be used together with ``height`` as an alternative to ``rect`` to specify layout information.
191
192
192
-
:arg float height: may used together with *width* as an alternative to *rect* to specify layout information.
193
+
:arg float height: may be used together with ``width`` as an alternative to ``rect`` to specify layout information.
193
194
194
-
:arg float fontsize: the default :data:`fontsize` for reflowable document types. This parameter is ignored if none of the parameters *rect* or *width* and *height* are specified. Will be used to calculate the page layout.
195
+
:arg float fontsize: the default :data:`fontsize` for reflowable document types. This parameter is ignored if none of the parameters ``rect`` or ``width`` and ``height`` are specified. Will be used to calculate the page layout.
195
196
196
197
:raises TypeError: if the *type* of any parameter does not conform.
197
198
:raises FileNotFoundError: if the file / path cannot be found. Re-implemented as subclass of `RuntimeError`.
@@ -203,31 +204,26 @@ For details on **embedded files** refer to Appendix 3.
203
204
204
205
In case of problems you can see more detail in the internal messages store: `print(pymupdf.TOOLS.mupdf_warnings())` (which will be emptied by this call, but you can also prevent this -- consult :meth:`Tools.mupdf_warnings`).
205
206
206
-
.. note:: Not all document types are checked for valid formats already at open time. Raster images for example will raise exceptions only later, when trying to access the content. Other types (notably with non-binary content) may also be opened (and sometimes **accessed**) successfully -- sometimes even when having invalid content for the format:
207
-
208
-
* HTM, HTML, XHTML: **always** opened, `metadata["format"]` is "HTML5", resp. "XHTML".
209
-
* XML, FB2: **always** opened, `metadata["format"]` is "FictionBook2".
210
-
211
207
Overview of possible forms, note: `open` is a synonym of `Document`::
>>> doc = pymupdf.open(stream=mem_area) # works for any supported type
217
+
>>> doc = pymupdf.open(stream=unknown_type, filetype="txt") # treat as plain text
222
218
>>>
223
219
>>> # new empty PDF
224
220
>>> doc = pymupdf.open()
225
221
>>> doc = pymupdf.open(None)
226
222
>>> doc = pymupdf.open("")
227
223
228
-
.. note:: Raster images with a wrong (but supported) file extension **are no problem**. MuPDF will determine the correct image type when file **content** is actually accessed and will process it without complaint. So `pymupdf.open("file.jpg")` will work even for a PNG image.
224
+
.. note:: Raster images with a wrong (but supported) file extension **are no problem**. MuPDF will determine the correct image type when file **content** is actually accessed and will process it without complaint.
229
225
230
-
The Document class can be also be used as a **context manager**. On exit, the document will automatically be closed.
226
+
The Document class can also be used as a **context manager**. Exiting the context manager will close the document automatically.
231
227
232
228
>>> import pymupdf
233
229
>>> with pymupdf.open(...) as doc:
@@ -921,8 +917,8 @@ For details on **embedded files** refer to Appendix 3.
921
917
922
918
* **xref** (*int*) is the image object number
923
919
* **smask** (*int*) is the object number of its soft-mask image
924
-
* **width** (*int*) is the image width
925
-
* **height** (*int*) is the image height
920
+
* ``width`` (*int*) is the image width
921
+
* ``height`` (*int*) is the image height
926
922
* **bpc** (*int*) denotes the number of bits per component (normally 8)
927
923
* **colorspace** (*str*) a string naming the colorspace (like **DeviceRGB**)
928
924
* **alt_colorspace** (*str*) is any alternate colorspace depending on the value of **colorspace**
@@ -998,8 +994,8 @@ For details on **embedded files** refer to Appendix 3.
998
994
Re-paginate ("reflow") the document based on the given page dimension and fontsize. This only affects some document types like e-books and HTML. Ignored if not supported. Supported documents have *True* in property :attr:`is_reflowable`.
999
995
1000
996
:arg rect_like rect: desired page size. Must be finite, not empty and start at point (0, 0).
1001
-
:arg float width: use it together with *height* as alternative to *rect*.
1002
-
:arg float height: use it together with *width* as alternative to *rect*.
997
+
:arg float width: use it together with ``height`` as alternative to ``rect``.
998
+
:arg float height: use it together with ``width`` as alternative to ``rect``.
1003
999
:arg float fontsize: the desired default fontsize.
1004
1000
1005
1001
.. method:: select(s)
@@ -1744,8 +1740,8 @@ For details on **embedded files** refer to Appendix 3.
1744
1740
1745
1741
* *ext* (*str*) image type (e.g. *'jpeg'*), usable as image file extension
1746
1742
* *smask* (*int*) :data:`xref` number of a stencil (/SMask) image or zero
1747
-
* *width* (*int*) image width
1748
-
* *height* (*int*) image height
1743
+
* ``width`` (*int*) image width
1744
+
* ``height`` (*int*) image height
1749
1745
* *colorspace* (*int*) the image's *colorspace.n* number.
1750
1746
* *cs-name* (*str*) the image's *colorspace.name*.
1751
1747
* *xres* (*int*) resolution in x direction. Please also see :data:`resolution`.
docs/how-to-open-a-file.rst: 12 additions & 22 deletions
@@ -35,28 +35,25 @@ To open a file, do the following:
35
35
.. note:: The above creates a :ref:`Document`. The instruction `doc = pymupdf.Document("a.pdf")` does exactly the same. So, `open` is just a convenient alias and you can find its full API documented in that chapter.
36
36
37
37
38
-
Opening with :index:`a Wrong File Extension <pair: wrong; file extension>`
If you have a document with a wrong file extension for its type, you can still correctly open it.
41
+
If you have a document with a wrong file extension for its type, do not worry: it will still be opened correctly, thanks to the integrated file "content recognizer".
42
42
43
-
Assume that *"some.file"* is actually an **XPS**. Open it like so:
44
-
45
-
.. code-block:: python
46
-
47
-
doc = pymupdf.open("some.file", filetype="xps")
43
+
This component looks at the actual data in the file using a number of heuristics -- independent of the file extension. This of course is also true for file names **without** an extension.
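To make the idea of content-based recognition concrete, here is a small standalone sketch. It is not MuPDF's recognizer and does not reflect its actual heuristics. It only checks a few well-known leading byte signatures, and the names `SIGNATURES` and `sniff_type` are hypothetical.

```python
from typing import Optional

# A few well-known leading byte signatures ("magic numbers").
# Illustrative subset only; a real recognizer uses many more checks.
SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
    (b"PK\x03\x04", "zip"),  # ZIP container, used by e.g. EPUB, DOCX and XPS
]


def sniff_type(data: bytes) -> Optional[str]:
    """Guess a type from the first bytes of the data, ignoring any file name."""
    for magic, name in SIGNATURES:
        if data.startswith(magic):
            return name
    return None  # plain text, for example, has no signature to match


print(sniff_type(b"%PDF-1.7\n..."))
print(sniff_type(b"just some text"))
```

Because only the data is inspected, the result is the same whether the bytes came from a file named `some.file`, from a file with a misleading extension, or from a memory area.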
48
44
45
+
Here is a list of details about how the file content recognizer works:
49
46
47
+
* When opening from a file name, use the ``filetype`` parameter if you need to make sure that the created :ref:`Document` is of the expected type. An exception is raised for any mismatch.
50
48
51
-
.. note::
49
+
* Text files are an exception: they do not contain recognizable internal structures at all. Here, the file extension ".txt" and the ``filetype`` parameter continue to play a role and are used to create a "Text" document. Correspondingly, text files with other or no extensions can be opened successfully using `filetype="txt"`.
52
50
53
-
|PyMuPDF| itself does not try to determine the file type from the file contents. **You** are responsible for supplying the file type information in some way -- either implicitly, via the file extension, or explicitly as shown with the `filetype` parameter. There are pure :title:`Python` packages like `filetype <https://pypi.org/project/filetype/>`_ that help you doing this. Also consult the :ref:`Document` chapter for a full description.
54
-
55
-
If |PyMuPDF| encounters a file with an unknown / missing extension, it will try to open it as a |PDF|. So in these cases there is no need for additional precautions. Similarly, for memory documents, you can just specify `doc=pymupdf.open(stream=mem_area)` to open it as a |PDF| document.
56
-
57
-
If you attempt to open an unsupported file then |PyMuPDF| will throw a file data error.
51
+
* Using `filetype="txt"` will treat **any** file as containing plain text when opened from a file name / path -- even when its content is a supported document type.
58
52
53
+
* When opening from a stream, the file content recognizer will ignore the ``filetype`` parameter entirely for known file types -- even in case of a mismatch or when `filetype="txt"` was specified.
59
54
55
+
* Streams with a known file type cannot be opened as plain text.
56
+
* Specifying ``filetype`` currently only has an effect when no match was found. Then using ``filetype="txt"`` will treat the file as containing plain text.
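The rules in the list above can be summarized as a small decision function. This is a pure-Python model of the documented behavior, not code from PyMuPDF. The name `resolve_type` is hypothetical, `ValueError` is only a stand-in for whatever exception the library actually raises, and `filetype` and `extension` are assumed to be already normalized to short strings such as `"pdf"` or `"txt"`.

```python
from typing import Optional


def resolve_type(
    detected: Optional[str],
    filetype: Optional[str],
    from_stream: bool,
    extension: Optional[str] = None,
) -> str:
    """Model which type a document is opened as.

    detected:  type found by the content recognizer, or None if undetected
    filetype:  the caller's filetype argument, or None
    extension: file extension without the dot (file names only), or None
    """
    if from_stream:
        if detected is not None:
            return detected  # filetype is ignored for known content
        if filetype == "txt":
            return "txt"  # only undetected data can be treated as plain text
        raise ValueError("unrecognized data")  # assumption: stand-in exception

    # Opening from a file name / path.
    if filetype == "txt":
        return "txt"  # forces plain text, even for supported document types
    if detected is None:
        if extension == "txt":
            return "txt"  # the ".txt" extension still plays a role
        raise ValueError("unrecognized file")  # assumption: stand-in exception
    if filetype is not None and filetype != detected:
        raise ValueError("detected type does not match filetype")
    return detected


print(resolve_type("pdf", "txt", from_stream=True))
print(resolve_type("pdf", "txt", from_stream=False))
```

The two calls at the end show the asymmetry between the two ways of opening: the same content and the same `filetype="txt"` yield a PDF when opened from a stream but a plain text document when opened from a file name.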
60
57
61
58
62
59
----------
@@ -164,14 +161,7 @@ Opening a `JSON` file
164
161
165
162
And so on!
166
163
167
-
As you can imagine many text based file formats can be *very simply opened* and *interpreted* by |PyMuPDF|. This can make data analysis and extraction for a wide range of previously unavailable files suddenly possible.
168
-
169
-
170
-
171
-
172
-
173
-
174
-
164
+
As you can imagine, many text-based file formats can be *very simply opened* and *interpreted* by |PyMuPDF|. This can make data analysis and extraction for a wide range of previously unavailable files possible.