After the introduction of a complete file content recognizer, the only value of PyMuPDF's open parameter "filetype" that remains relevant is "txt".
With this change, specifying filetype="txt" opens files or memory data as plain text Documents.
Other values are silently ignored and no longer lead to confusing behavior.
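For callers that still pass legacy "filetype" values, the rule above can be pictured with a small caller-side helper. This is only a sketch and not PyMuPDF code: the name normalize_open_kwargs is hypothetical, and it assumes that only the exact string "txt" has any effect.

```python
def normalize_open_kwargs(**kwargs):
    """Drop a 'filetype' argument unless it is exactly "txt".

    Hypothetical helper (not part of PyMuPDF). It mirrors the behavior
    described in the commit message: every other value is ignored.
    """
    cleaned = dict(kwargs)
    if cleaned.get("filetype") != "txt":
        # Assumption: only the exact string "txt" is still meaningful.
        cleaned.pop("filetype", None)
    return cleaned


# A legacy value such as "pdf" is dropped, "txt" is kept.
print(normalize_open_kwargs(filename="some.file", filetype="pdf"))
print(normalize_open_kwargs(filename="some.file", filetype="txt"))
```

The cleaned keyword arguments could then be forwarded to pymupdf.open; since the library is described as ignoring other values anyway, such a helper would mainly serve to make the intent visible in calling code.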
docs/document.rst (5 additions, 11 deletions)
@@ -176,16 +176,12 @@ For details on **embedded files** refer to Appendix 3.
 * If ``stream`` is given, then the document is created from memory.
 * If ``stream`` is `None`, then a document is created from the file given by ``filename``.

-:arg str,pathlib filename: A UTF-8 string or ``pathlib.Path`` object containing a file path. The document type is always determined from the file content. The ``filetype`` parameter can be used to ensure that the detected type is as expected or, respectively, to force treating any file as plain text.
+:arg str,pathlib filename: A UTF-8 string or ``pathlib.Path`` object containing a file path. The document type is always determined from the file content. The ``filetype`` parameter can be used to override this and open the file as a plain text document.

 :arg bytes,bytearray,BytesIO stream: A memory area containing file data. The document type is **always** detected from the data content. The ``filetype`` parameter is ignored except for undetected data content. In that case only, using ``filetype="txt"`` will treat the data as containing plain text.

-:arg str filetype: A string specifying the type of document. This may be anything looking like a filename (e.g. "x.pdf"), in which case MuPDF uses the extension to determine the type, or a mime type like ``application/pdf``. Just using strings like "pdf" or ".pdf" will also work. Can be omitted for :ref:`a supported document type<Supported_File_Types>`.
+:arg str filetype: Currently only used to force opening the file as a plain text document. Use the value `"txt"` to achieve this. Before the implementation of MuPDF's file content recognizer, this parameter was essential to help determine the file type. As this is no longer necessary, other values are ignored.

-If opening a file name / path only, it will be used to ensure that the detected type is as expected. An exception is raised for a mismatch. Using `filetype="txt"` will treat any file as containing plain text.
-
-When opening from memory, this parameter is ignored except for undetected data content. Only in that case, using ``filetype="txt"`` will treat the data as containing plain text.
-
 :arg rect_like rect: a rectangle specifying the desired page size. This parameter is only meaningful for documents with a variable page layout ("reflowable" documents), like e-books or HTML, and ignored otherwise. If specified, it must be a non-empty, finite rectangle with top-left coordinates (0, 0). Together with parameter *fontsize*, each page will be accordingly laid out and hence also determine the number of pages.

 :arg float width: may be used together with ``height`` as an alternative to ``rect`` to specify layout information.
@@ -207,14 +203,12 @@ For details on **embedded files** refer to Appendix 3.
 Overview of possible forms, note: `open` is a synonym of `Document`::

 >>> # from a file
->>> doc = pymupdf.open("some.xps")
->>> # handle wrong extension
->>> doc = pymupdf.open("some.file", filetype="xps") # assert expected type
+>>> doc = pymupdf.open("some.file") # file type determined from content
 >>> doc = pymupdf.open("some.file", filetype="txt") # treat as plain text
 >>>
 >>> # from memory
->>> doc = pymupdf.open(stream=mem_area) # works for any supported type
->>> doc = pymupdf.open(stream=unknown-type, filetype="txt") # treat as plain text
+>>> doc = pymupdf.open(stream=mem_area) # file type determined from content
+>>> doc = pymupdf.open(stream=mem_area, filetype="txt") # treat as plain text
-If you have a document with a wrong file extension for its type, do not worry: it will still be opened correctly, thanks to the integrated file "content recognizer".
+If you have a document with a wrong file extension for its type, do not worry: it will still be opened correctly, thanks to the integrated file "content recognizer" of the base library.

 This component looks at the actual data in the file using a number of heuristics -- independent of the file extension. This of course is also true for file names **without** an extension.

 Here is a list of details about how the file content recognizer works:

-* When opening from a file name, use the ``filetype`` parameter if you need to make sure that the created :ref:`Document` is of the expected type. An exception is raised for any mismatch.
+* When opening from a file name or a memory area, all supported :ref:`Document` types are automatically recognized by their content.

-* Text files are an exception: they do not contain recognizable internal structures at all. Here, the file extension ".txt" and the ``filetype`` parameter continue to play a role and are used to create a "Tex" document. Correspondingly, text files with other / no extensions, can successfully be opened using `filetype="txt"`.
+* Text files are an exception: they do not contain recognizable internal structures at all. If opening from a file name with a known plain text extension (like "txt" or "text"), everything will still work.

-* Using `filetype="txt"` will treat **any** file as containing plain text when opened from a file name / path -- even when its content is a supported document type.
+* If opening from memory or from a file whose extension is not known to be plain text, then ``filetype="txt"`` must be specified.

-* When opening from a stream, the file content recognizer will ignore the ``filetype`` parameter entirely for known file types -- even in case of a mismatch or when `filetype="txt"` was specified.
-
-* Streams with a known file type cannot be opened as plain text.
-* Specifying ``filetype`` currently only has an effect when no match was found. Then using ``filetype="txt"`` will treat the file as containing plain text.
+* Using `filetype="txt"` will treat **any** file as if containing plain text -- even when its content is a supported document type.
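The rules in the updated list can be summarized as a small decision function. This is only a sketch of the documented behavior, not the recognizer's actual implementation: the function name ``resulting_type``, the set ``PLAIN_TEXT_EXTENSIONS`` and the case-insensitive extension check are assumptions made for illustration, and ``detected`` stands in for whatever the content recognizer would report.

```python
import os

# Assumption: the text names "txt" and "text" only as examples of
# known plain text extensions; the real list may be longer.
PLAIN_TEXT_EXTENSIONS = {"txt", "text"}


def resulting_type(detected, filename=None, filetype=None):
    """Model which document type results from the rules listed above.

    detected: type reported by the content recognizer, or None.
    filename: file path when opening from a file, None for memory data.
    filetype: the user-supplied ``filetype`` argument.
    Returns a type name, or None if the data cannot be opened.
    """
    if filetype == "txt":
        return "txt"        # forced plain text, even for supported types
    if detected is not None:
        return detected     # content wins; other filetype values are ignored
    if filename is not None:
        ext = os.path.splitext(filename)[1].lstrip(".").lower()
        if ext in PLAIN_TEXT_EXTENSIONS:
            return "txt"    # unrecognized content, plain text extension
    return None             # memory data or unknown extension without "txt"


print(resulting_type("pdf", filename="report.docx"))   # wrong extension
print(resulting_type(None, filename="notes.log"))      # needs filetype="txt"
print(resulting_type(None, filetype="txt"))            # memory data as text
```

Checking ``filetype == "txt"`` first reflects the last bullet: the override applies even when the content would otherwise be recognized as a supported document type.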