Commit 0101907

Filetype "txt" becomes the only supported value
After the introduction of a complete file content recognizer, "txt" is the only value of PyMuPDF's open parameter "filetype" that remains relevant. This change makes specifying filetype="txt" open files or memory data as plain text Documents. Other values are silently ignored and no longer lead to confusing behavior.
1 parent 09ea755 commit 0101907

3 files changed

+34
-29
lines changed

docs/document.rst

Lines changed: 5 additions & 11 deletions
@@ -176,16 +176,12 @@ For details on **embedded files** refer to Appendix 3.
 * If ``stream`` is given, then the document is created from memory.
 * If ``stream`` is `None`, then a document is created from the file given by ``filename``.

-:arg str,pathlib filename: A UTF-8 string or ``pathlib.Path`` object containing a file path. The document type is always determined from the file content. The ``filetype`` parameter can be used to ensure that the detected type is as expected or, respectively, to force treating any file as plain text.
+:arg str,pathlib filename: A UTF-8 string or ``pathlib.Path`` object containing a file path. The document type is always determined from the file content. The ``filetype`` parameter can be used to override this and open the file as a plain text document.

 :arg bytes,bytearray,BytesIO stream: A memory area containing file data. The document type is **always** detected from the data content. The ``filetype`` parameter is ignored except for undetected data content. In that case only, using ``filetype="txt"`` will treat the data as containing plain text.

-:arg str filetype: A string specifying the type of document. This may be anything looking like a filename (e.g. "x.pdf"), in which case MuPDF uses the extension to determine the type, or a mime type like ``application/pdf``. Just using strings like "pdf" or ".pdf" will also work. Can be omitted for :ref:`a supported document type<Supported_File_Types>`.
+:arg str filetype: Currently only used to force opening the file as a plain text document. Use the value `"txt"` to achieve this. Before the implementation of MuPDF's file content recognizer, this parameter was essential to help determining the file type. As this is no longer necessary, other values are ignored.

-If opening a file name / path only, it will be used to ensure that the detected type is as expected. An exception is raised for a mismatch. Using `filetype="txt"` will treat any file as containing plain text.
-
-When opening from memory, this parameter is ignored except for undetected data content. Only in that case, using ``filetype="txt"`` will treat the data as containing plain text.
-
 :arg rect_like rect: a rectangle specifying the desired page size. This parameter is only meaningful for documents with a variable page layout ("reflowable" documents), like e-books or HTML, and ignored otherwise. If specified, it must be a non-empty, finite rectangle with top-left coordinates (0, 0). Together with parameter *fontsize*, each page will be accordingly laid out and hence also determine the number of pages.

 :arg float width: may used together with ``height`` as an alternative to ``rect`` to specify layout information.
@@ -207,14 +203,12 @@ For details on **embedded files** refer to Appendix 3.
 Overview of possible forms, note: `open` is a synonym of `Document`::

     >>> # from a file
-    >>> doc = pymupdf.open("some.xps")
-    >>> # handle wrong extension
-    >>> doc = pymupdf.open("some.file", filetype="xps") # assert expected type
+    >>> doc = pymupdf.open("some.file") # file type determined from content
     >>> doc = pymupdf.open("some.file", filetype="txt") # treat as plain text
     >>>
     >>> # from memory
-    >>> doc = pymupdf.open(stream=mem_area) # works for any supported type
-    >>> doc = pymupdf.open(stream=unknown-type, filetype="txt") # treat as plain text
+    >>> doc = pymupdf.open(stream=mem_area) # file type determined from content
+    >>> doc = pymupdf.open(stream=mem_area, filetype="txt") # treat as plain text
     >>>
     >>> # new empty PDF
     >>> doc = pymupdf.open()

docs/how-to-open-a-file.rst

Lines changed: 5 additions & 8 deletions
@@ -38,22 +38,19 @@ To open a file, do the following:
 File Recognizer: Opening with :index:`a Wrong File Extension <pair: wrong; file extension>`
 """"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""

-If you have a document with a wrong file extension for its type, do not worry: it will still be opened correctly, thanks to the integrated file "content recognizer".
+If you have a document with a wrong file extension for its type, do not worry: it will still be opened correctly, thanks to the integrated file "content recognizer" of the base library.

 This component looks at the actual data in the file using a number of heuristics -- independent of the file extension. This of course is also true for file names **without** an extension.

 Here is a list of details about how the file content recognizer works:

-* When opening from a file name, use the ``filetype`` parameter if you need to make sure that the created :ref:`Document` is of the expected type. An exception is raised for any mismatch.
+* When opening from a file name or a memory area, all supported :ref:`Document` types are automatically recognized by their content.

-* Text files are an exception: they do not contain recognizable internal structures at all. Here, the file extension ".txt" and the ``filetype`` parameter continue to play a role and are used to create a "Tex" document. Correspondingly, text files with other / no extensions, can successfully be opened using `filetype="txt"`.
+* Text files are an exception: they do not contain recognizable internal structures at all. If opening from a file name with a known plain text extension (like "txt" or "text") everything will still work.

-* Using `filetype="txt"` will treat **any** file as containing plain text when opened from a file name / path -- even when its content is a supported document type.
+* If opening from memory or from a file extension that is not known to be plain text, then ``filetype="txt"`` must be specified.

-* When opening from a stream, the file content recognizer will ignore the ``filetype`` parameter entirely for known file types -- even in case of a mismatch or when `filetype="txt"` was specified.
-
-* Streams with a known file type cannot be opened as plain text.
-* Specifying ``filetype`` currently only has an effect when no match was found. Then using ``filetype="txt"`` will treat the file as containing plain text.
+* Using `filetype="txt"` will treat **any** file as if containing plain text -- even when its content is a supported document type.


 ----------
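
The rules added to docs/how-to-open-a-file.rst can be illustrated with a toy sketch. It is not MuPDF's recognizer: the function names `sniff_type` and `effective_type` are hypothetical, and the three signatures checked are only well-known file headers chosen for illustration. The sketch shows the documented precedence, namely that ``filetype="txt"`` wins over content, and that any other ``filetype`` value is ignored.

```python
from typing import Optional


def sniff_type(data: bytes) -> str:
    """Toy content sniffer based on leading bytes (illustration only)."""
    if data.startswith(b"%PDF-"):
        return "pdf"
    if data.startswith(b"PK\x03\x04"):
        # ZIP container; formats such as XPS or EPUB are packaged this way.
        return "zip-based"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    # Plain text has no recognizable internal structure.
    return "unknown"


def effective_type(data: bytes, filetype: Optional[str] = None) -> str:
    """Apply the documented precedence: only filetype="txt" overrides content."""
    if filetype == "txt":
        return "txt"
    # Every other filetype value is ignored; content decides.
    return sniff_type(data)


if __name__ == "__main__":
    print(effective_type(b"%PDF-1.7\n"))                  # content decides
    print(effective_type(b"%PDF-1.7\n", filetype="xps"))  # "xps" is ignored
    print(effective_type(b"%PDF-1.7\n", filetype="txt"))  # forced plain text
    print(effective_type(b"just some notes\n"))           # not recognizable
```

The last call shows why the docs require ``filetype="txt"`` for text held in memory: without it, nothing in the data identifies it as a document.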

src/__init__.py

Lines changed: 24 additions & 10 deletions
@@ -2916,8 +2916,6 @@ def __init__(self, filename=None, stream=None, filetype=None, rect=None, width=0
     else:
         raise TypeError(f"bad stream: {type(stream)=}.")
     stream = self.stream
-    if not (filename or filetype):
-        filename = 'pdf'
 else:
     self.stream = None

@@ -2962,21 +2960,37 @@ def __init__(self, filename=None, stream=None, filetype=None, rect=None, width=0
     # setting self.stream above ensures that the bytes will not be
     # garbage collected?
     data = mupdf.fz_open_memory(mupdf.python_buffer_data(c), len(c))
-    magic = filename
-    if not magic:
+    if filename is not None:
+        magic = filename
+    elif filetype is not None:
         magic = filetype
-    # fixme: pymupdf does:
-    # handler = fz_recognize_document(gctx, filetype);
-    # if (!handler) raise ValueError( MSG_BAD_FILETYPE)
-    # but prefer to leave fz_open_document_with_stream() to raise.
+    else:
+        magic = ""
+    if magic.endswith(("txt", "text", "log")):
+        magic = "txt"
+    else:
+        magic = ""
     try:
-        doc = mupdf.fz_open_document_with_stream(magic, data)
+        if magic == "txt":
+            handler = mupdf.ll_fz_recognize_document(magic)
+            accel = mupdf.FzStream()
+            archive = mupdf.FzArchive(None)
+            doc = mupdf.ll_fz_document_handler_open(
+                handler,
+                data.m_internal,
+                accel.m_internal,
+                archive.m_internal,
+                None, # recognize_state
+            )
+            doc = mupdf.FzDocument(doc)
+        else:
+            doc = mupdf.fz_open_document_with_stream(magic, data)
     except Exception as e:
         if g_exceptions_verbose > 1: exception_info()
         raise FileDataError('Failed to open stream') from e
 else:
     if filename:
-        if not filetype:
+        if filetype != "txt":
             try:
                 doc = mupdf.fz_open_document(filename)
             except Exception as e:
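
The new selection of `magic` in the stream branch can be read in isolation. The following is a minimal sketch that mirrors the added lines; the function name `choose_magic` is hypothetical, and it assumes `filename` is already a plain string at this point. It returns "txt" when the plain text handler would be forced and "" when recognition is left to the content.

```python
from typing import Optional


def choose_magic(filename: Optional[str] = None,
                 filetype: Optional[str] = None) -> str:
    """Mirror of the magic selection added in this commit (sketch only)."""
    # A given filename takes precedence over filetype.
    if filename is not None:
        magic = filename
    elif filetype is not None:
        magic = filetype
    else:
        magic = ""
    # Only a plain text suffix survives; anything else becomes "".
    if magic.endswith(("txt", "text", "log")):
        return "txt"
    return ""


if __name__ == "__main__":
    print(repr(choose_magic(filetype="txt")))         # forces plain text
    print(repr(choose_magic(filename="notes.log")))   # text-like suffix
    print(repr(choose_magic(filetype="pdf")))         # ignored, content decides
    print(repr(choose_magic(filename="data.bin", filetype="txt")))
```

Two consequences of the code as written, visible in the sketch: when a stream is opened with both `filename` and `filetype`, only `filename` is inspected, so the last call yields ""; and because the check is a bare suffix match without a dot, any string that merely ends in "txt", "text" or "log" also selects the plain text handler.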
