Commit 0b29e6a

Adjust to File Recognizer
This change reflects the behavioral changes caused by the introduction of a file recognizer in MuPDF. Basically, the file extension and the `filetype` parameter have lost most of their significance because the document type is now always derived from the actual file content.
1 parent 94d3fa3 commit 0b29e6a

2 files changed

+38
-52
lines changed

docs/document.rst

Lines changed: 26 additions & 30 deletions
@@ -170,28 +170,29 @@ For details on **embedded files** refer to Appendix 3.
170170

171171
.. method:: __init__(self, filename=None, stream=None, *, filetype=None, rect=None, width=0, height=0, fontsize=11)
172172

173-
* Changed in v1.14.13: support `io.BytesIO` for memory documents.
174-
* Changed in v1.19.6: Clearer, shorter and more consistent exception messages. File type "pdf" is always assumed if not specified. Empty files and memory areas will always lead to exceptions.
175-
176-
Creates a *Document* object.
173+
Create a ``Document`` object.
177174

178175
* With default parameters, a **new empty PDF** document will be created.
179-
* If *stream* is given, then the document is created from memory and, if not a PDF, either *filename* or *filetype* must indicate its type.
180-
* If *stream* is `None`, then a document is created from the file given by *filename*. Its type is inferred from the extension. This can be overruled by *filetype.*
176+
* If ``stream`` is given, then the document is created from memory.
177+
* If ``stream`` is `None`, then a document is created from the file given by ``filename``.
181178

182-
:arg str,pathlib filename: A UTF-8 string or *pathlib* object containing a file path. The document type is inferred from the filename extension. If not present or not matching :ref:`a supported type<Supported_File_Types>`, a PDF document is assumed. For memory documents, this argument may be used instead of `filetype`, see below.
179+
:arg str,pathlib filename: A UTF-8 string or ``pathlib.Path`` object containing a file path. The document type is always determined from the file content. The ``filetype`` parameter can be used to ensure that the detected type is as expected or, respectively, to force treating any file as plain text.
183180

184-
:arg bytes,bytearray,BytesIO stream: A memory area containing a supported document. If not a PDF, its type **must** be specified by either `filename` or `filetype`.
181+
:arg bytes,bytearray,BytesIO stream: A memory area containing file data. The document type is **always** detected from the data content. The ``filetype`` parameter is ignored except for undetected data content. In that case only, using ``filetype="txt"`` will treat the data as containing plain text.
185182

186-
:arg str filetype: A string specifying the type of document. This may be anything looking like a filename (e.g. "x.pdf"), in which case MuPDF uses the extension to determine the type, or a mime type like *application/pdf*. Just using strings like "pdf" or ".pdf" will also work. May be omitted for PDF documents, otherwise must match :ref:`a supported document type<Supported_File_Types>`.
183+
:arg str filetype: A string specifying the type of document. This may be anything looking like a filename (e.g. "x.pdf"), in which case MuPDF uses the extension to determine the type, or a mime type like ``application/pdf``. Just using strings like "pdf" or ".pdf" will also work. Can be omitted for :ref:`a supported document type<Supported_File_Types>`.
184+
185+
If opening a file name / path only, it will be used to ensure that the detected type is as expected. An exception is raised for a mismatch. Using `filetype="txt"` will treat any file as containing plain text.
186+
187+
When opening from memory, this parameter is ignored except for undetected data content. Only in that case, using ``filetype="txt"`` will treat the data as containing plain text.
187188

188189
:arg rect_like rect: a rectangle specifying the desired page size. This parameter is only meaningful for documents with a variable page layout ("reflowable" documents), like e-books or HTML, and ignored otherwise. If specified, it must be a non-empty, finite rectangle with top-left coordinates (0, 0). Together with parameter *fontsize*, each page will be accordingly laid out and hence also determine the number of pages.
189190

190-
:arg float width: may used together with *height* as an alternative to *rect* to specify layout information.
191+
:arg float width: may be used together with ``height`` as an alternative to ``rect`` to specify layout information.
191192

192-
:arg float height: may used together with *width* as an alternative to *rect* to specify layout information.
193+
:arg float height: may be used together with ``width`` as an alternative to ``rect`` to specify layout information.
193194

194-
:arg float fontsize: the default :data:`fontsize` for reflowable document types. This parameter is ignored if none of the parameters *rect* or *width* and *height* are specified. Will be used to calculate the page layout.
195+
:arg float fontsize: the default :data:`fontsize` for reflowable document types. This parameter is ignored if none of the parameters ``rect`` or ``width`` and ``height`` are specified. Will be used to calculate the page layout.
195196

196197
:raises TypeError: if the *type* of any parameter does not conform.
197198
:raises FileNotFoundError: if the file / path cannot be found. Re-implemented as subclass of `RuntimeError`.
@@ -203,31 +204,26 @@ For details on **embedded files** refer to Appendix 3.
203204

204205
In case of problems you can see more detail in the internal messages store: `print(pymupdf.TOOLS.mupdf_warnings())` (which will be emptied by this call, but you can also prevent this -- consult :meth:`Tools.mupdf_warnings`).
205206

206-
.. note:: Not all document types are checked for valid formats already at open time. Raster images for example will raise exceptions only later, when trying to access the content. Other types (notably with non-binary content) may also be opened (and sometimes **accessed**) successfully -- sometimes even when having invalid content for the format:
207-
208-
* HTM, HTML, XHTML: **always** opened, `metadata["format"]` is "HTML5", resp. "XHTML".
209-
* XML, FB2: **always** opened, `metadata["format"]` is "FictionBook2".
210-
211207
Overview of possible forms, note: `open` is a synonym of `Document`::
212208

213209
>>> # from a file
214210
>>> doc = pymupdf.open("some.xps")
215211
>>> # handle wrong extension
216-
>>> doc = pymupdf.open("some.file", filetype="xps")
212+
>>> doc = pymupdf.open("some.file", filetype="xps") # assert expected type
213+
>>> doc = pymupdf.open("some.file", filetype="txt") # treat as plain text
217214
>>>
218-
>>> # from memory, filetype is required if not a PDF
219-
>>> doc = pymupdf.open("xps", mem_area)
220-
>>> doc = pymupdf.open(None, mem_area, "xps")
221-
>>> doc = pymupdf.open(stream=mem_area, filetype="xps")
215+
>>> # from memory
216+
>>> doc = pymupdf.open(stream=mem_area) # works for any supported type
217+
>>> doc = pymupdf.open(stream=unknown_type, filetype="txt") # treat as plain text
222218
>>>
223219
>>> # new empty PDF
224220
>>> doc = pymupdf.open()
225221
>>> doc = pymupdf.open(None)
226222
>>> doc = pymupdf.open("")
227223

228-
.. note:: Raster images with a wrong (but supported) file extension **are no problem**. MuPDF will determine the correct image type when file **content** is actually accessed and will process it without complaint. So `pymupdf.open("file.jpg")` will work even for a PNG image.
224+
.. note:: Raster images with a wrong (but supported) file extension **are no problem**. MuPDF will determine the correct image type when file **content** is actually accessed and will process it without complaint.
229225

230-
The Document class can be also be used as a **context manager**. On exit, the document will automatically be closed.
226+
The Document class can also be used as a **context manager**. Exiting the context manager will close the document automatically.
231227

232228
>>> import pymupdf
233229
>>> with pymupdf.open(...) as doc:
@@ -921,8 +917,8 @@ For details on **embedded files** refer to Appendix 3.
921917

922918
* **xref** (*int*) is the image object number
923919
* **smask** (*int*) is the object number of its soft-mask image
924-
* **width** (*int*) is the image width
925-
* **height** (*int*) is the image height
920+
* ``width`` (*int*) is the image width
921+
* ``height`` (*int*) is the image height
926922
* **bpc** (*int*) denotes the number of bits per component (normally 8)
927923
* **colorspace** (*str*) a string naming the colorspace (like **DeviceRGB**)
928924
* **alt_colorspace** (*str*) is any alternate colorspace depending on the value of **colorspace**
@@ -998,8 +994,8 @@ For details on **embedded files** refer to Appendix 3.
998994
Re-paginate ("reflow") the document based on the given page dimension and fontsize. This only affects some document types like e-books and HTML. Ignored if not supported. Supported documents have *True* in property :attr:`is_reflowable`.
999995

1000996
:arg rect_like rect: desired page size. Must be finite, not empty and start at point (0, 0).
1001-
:arg float width: use it together with *height* as alternative to *rect*.
1002-
:arg float height: use it together with *width* as alternative to *rect*.
997+
:arg float width: use it together with ``height`` as alternative to ``rect``.
998+
:arg float height: use it together with ``width`` as alternative to ``rect``.
1003999
:arg float fontsize: the desired default fontsize.
10041000

10051001
.. method:: select(s)
@@ -1744,8 +1740,8 @@ For details on **embedded files** refer to Appendix 3.
17441740

17451741
* *ext* (*str*) image type (e.g. *'jpeg'*), usable as image file extension
17461742
* *smask* (*int*) :data:`xref` number of a stencil (/SMask) image or zero
1747-
* *width* (*int*) image width
1748-
* *height* (*int*) image height
1743+
* ``width`` (*int*) image width
1744+
* ``height`` (*int*) image height
17491745
* *colorspace* (*int*) the image's *colorspace.n* number.
17501746
* *cs-name* (*str*) the image's *colorspace.name*.
17511747
* *xres* (*int*) resolution in x direction. Please also see :data:`resolution`.

docs/how-to-open-a-file.rst

Lines changed: 12 additions & 22 deletions
@@ -35,28 +35,25 @@ To open a file, do the following:
3535
.. note:: The above creates a :ref:`Document`. The instruction `doc = pymupdf.Document("a.pdf")` does exactly the same. So, `open` is just a convenient alias and you can find its full API documented in that chapter.
3636

3737

38-
Opening with :index:`a Wrong File Extension <pair: wrong; file extension>`
39-
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
38+
File Recognizer: Opening with :index:`a Wrong File Extension <pair: wrong; file extension>`
39+
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
4040

41-
If you have a document with a wrong file extension for its type, you can still correctly open it.
41+
If you have a document with a wrong file extension for its type, do not worry: it will still be opened correctly, thanks to the integrated file "content recognizer".
4242

43-
Assume that *"some.file"* is actually an **XPS**. Open it like so:
44-
45-
.. code-block:: python
46-
47-
doc = pymupdf.open("some.file", filetype="xps")
43+
This component looks at the actual data in the file using a number of heuristics -- independent of the file extension. This of course is also true for file names **without** an extension.
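To make the idea of content-based detection concrete, here is a minimal sketch that guesses a type from the leading bytes of the data. It is a hypothetical illustration only: the function name ``sniff_type`` and the signature table are assumptions, and MuPDF's real recognizer uses its own, more extensive heuristics.

```python
# Hypothetical illustration of content-based type detection.
# This is NOT MuPDF's recognizer; it only checks a few well-known
# leading byte sequences ("magic numbers").
MAGIC_SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"PK\x03\x04", "zip"),  # ZIP container, used by formats such as XPS and EPUB
]


def sniff_type(data: bytes):
    """Return a type label guessed from the leading bytes, or None if unknown."""
    for signature, label in MAGIC_SIGNATURES:
        if data.startswith(signature):
            return label
    return None  # nothing recognizable, e.g. plain text


if __name__ == "__main__":
    print(sniff_type(b"%PDF-1.7\n"))             # a PDF header
    print(sniff_type(b"just some plain text"))   # no known signature
```

Note that the file name never enters the decision, which is why a wrong or missing extension does not matter for recognizable content.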
4844

45+
Here is a list of details about how the file content recognizer works:
4946

47+
* When opening from a file name, use the ``filetype`` parameter if you need to make sure that the created :ref:`Document` is of the expected type. An exception is raised for any mismatch.
5048

51-
.. note::
49+
* Text files are an exception: they do not contain recognizable internal structures at all. Here, the file extension ".txt" and the ``filetype`` parameter continue to play a role and are used to create a "Text" document. Correspondingly, text files with another extension or none can be opened successfully using `filetype="txt"`.
5250

53-
|PyMuPDF| itself does not try to determine the file type from the file contents. **You** are responsible for supplying the file type information in some way -- either implicitly, via the file extension, or explicitly as shown with the `filetype` parameter. There are pure :title:`Python` packages like `filetype <https://pypi.org/project/filetype/>`_ that help you doing this. Also consult the :ref:`Document` chapter for a full description.
54-
55-
If |PyMuPDF| encounters a file with an unknown / missing extension, it will try to open it as a |PDF|. So in these cases there is no need for additional precautions. Similarly, for memory documents, you can just specify `doc=pymupdf.open(stream=mem_area)` to open it as a |PDF| document.
56-
57-
If you attempt to open an unsupported file then |PyMuPDF| will throw a file data error.
51+
* Using `filetype="txt"` will treat **any** file as containing plain text when opened from a file name / path -- even when its content is a supported document type.
5852

53+
* When opening from a stream, the file content recognizer will ignore the ``filetype`` parameter entirely for known file types -- even in case of a mismatch or when `filetype="txt"` was specified.
5954

55+
* Streams with a known file type cannot be opened as plain text.
56+
* Specifying ``filetype`` currently only has an effect when no match was found. Then using ``filetype="txt"`` will treat the file as containing plain text.
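The rules in the list above can be summarized as a small decision function. This is a sketch that models the documented behavior in plain Python and does not call PyMuPDF. The name ``resolve_document_type``, its parameters, and the use of ``ValueError`` are assumptions made for illustration; the text above only says that "an exception" is raised, and it does not specify what happens for unrecognized content without ``filetype="txt"``.

```python
def _normalize(filetype):
    """Reduce 'pdf', '.pdf' or 'x.pdf' to 'pdf' (MIME types are not handled here)."""
    if not filetype:
        return None
    return filetype.lower().rsplit(".", 1)[-1]


def resolve_document_type(detected, filetype=None, *, from_stream, extension=None):
    """Model of the documented rules (hypothetical helper, not a PyMuPDF API).

    detected:    type found by the content recognizer, or None if unrecognized
    filetype:    the value of the ``filetype`` parameter, if any
    from_stream: True when opening from memory, False when opening from a path
    extension:   file extension without the dot (only relevant for paths)
    """
    wanted = _normalize(filetype)

    if from_stream:
        if detected is not None:
            return detected  # filetype is ignored, even "txt" or a mismatch
        if wanted == "txt":
            return "txt"  # undetected content can be treated as plain text
        raise ValueError("unrecognized stream content")  # assumed error type

    # Opening from a file name / path.
    if wanted == "txt":
        return "txt"  # any file can be forced to plain text
    if detected is None:
        if extension == "txt":
            return "txt"  # the ".txt" extension still plays a role
        raise ValueError("unrecognized file content")  # assumed error type
    if wanted is not None and wanted != detected:
        raise ValueError(f"expected {wanted!r}, detected {detected!r}")
    return detected
```

For example, ``resolve_document_type("pdf", "txt", from_stream=True)`` yields ``"pdf"``, while the same arguments with ``from_stream=False`` yield ``"txt"``, which mirrors the asymmetry between streams and file paths described in the list.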
6057

6158

6259
----------
@@ -164,14 +161,7 @@ Opening a `JSON` file
164161
165162
And so on!
166163

167-
As you can imagine many text based file formats can be *very simply opened* and *interpreted* by |PyMuPDF|. This can make data analysis and extraction for a wide range of previously unavailable files suddenly possible.
168-
169-
170-
171-
172-
173-
174-
164+
As you can imagine many text based file formats can be *very simply opened* and *interpreted* by |PyMuPDF|. This can make data analysis and extraction for a wide range of previously unavailable files possible.
175165

176166

177167
.. include:: footer.rst
