Commit 1fb4d0d

Supporting MuPDF file recognizer
We previously used slightly different logic for opening documents from files versus from memory. This fix uses MuPDF's file type recognition wherever possible, making document opening largely independent of file extensions. This works for almost all document types. The exceptions are "txt", "fb2" and "mobi": for these, either a valid file extension must be present or the respective filetype must be provided.
1 parent 104051b commit 1fb4d0d

File tree

8 files changed

+192
-88
lines changed

docs/document.rst

Lines changed: 9 additions & 12 deletions
@@ -176,15 +176,11 @@ For details on **embedded files** refer to Appendix 3.
176176
* If ``stream`` is given, then the document is created from memory.
177177
* If ``stream`` is `None`, then a document is created from the file given by ``filename``.
178178

179-
:arg str,pathlib filename: A UTF-8 string or ``pathlib.Path`` object containing a file path. The document type is always determined from the file content. The ``filetype`` parameter can be used to ensure that the detected type is as expected or, respectively, to force treating any file as plain text.
179+
:arg str,pathlib filename: A UTF-8 string or ``pathlib.Path`` object containing a file path. The document type is (almost [#f8]_) always determined from the file content. The ``filetype`` parameter can be used to force treating any file as plain text. For plain text files, there is no unambiguous way to recognize the content. Therefore the file extension or the ``filetype`` parameter must be given.
180180

181-
:arg bytes,bytearray,BytesIO stream: A memory area containing file data. The document type is **always** detected from the data content. The ``filetype`` parameter is ignored except for undetected data content. In that case only, using ``filetype="txt"`` will treat the data as containing plain text.
181+
:arg bytes,bytearray,BytesIO stream: A memory area containing file data. With few exceptions [#f8]_, the document type is detected from the data content.
182182

183-
:arg str filetype: A string specifying the type of document. This may be anything looking like a filename (e.g. "x.pdf"), in which case MuPDF uses the extension to determine the type, or a mime type like ``application/pdf``. Just using strings like "pdf" or ".pdf" will also work. Can be omitted for :ref:`a supported document type<Supported_File_Types>`.
184-
185-
If opening a file name / path only, it will be used to ensure that the detected type is as expected. An exception is raised for a mismatch. Using `filetype="txt"` will treat any file as containing plain text.
186-
187-
When opening from memory, this parameter is ignored except for undetected data content. Only in that case, using ``filetype="txt"`` will treat the data as containing plain text.
183+
:arg str filetype: A string specifying the type of document. Will be ignored in most [#f8]_ cases for :ref:`a supported document type<Supported_File_Types>`. Text-based files usually have no unambiguous way to recognize the content. Therefore the file extension or the ``filetype`` parameter (especially when opening from memory) must usually be given.
188184

189185
:arg rect_like rect: a rectangle specifying the desired page size. This parameter is only meaningful for documents with a variable page layout ("reflowable" documents), like e-books or HTML, and ignored otherwise. If specified, it must be a non-empty, finite rectangle with top-left coordinates (0, 0). Together with parameter *fontsize*, each page will be accordingly laid out and hence also determine the number of pages.
190186

@@ -208,13 +204,12 @@ For details on **embedded files** refer to Appendix 3.
208204

209205
>>> # from a file
210206
>>> doc = pymupdf.open("some.xps")
211-
>>> # handle wrong extension
212-
>>> doc = pymupdf.open("some.file", filetype="xps") # assert expected type
213-
>>> doc = pymupdf.open("some.file", filetype="txt") # treat as plain text
207+
>>> # handle wrong / missing extension when required
208+
>>> doc = pymupdf.open("some.file", filetype="mobi") # treat as MOBI e-book
214209
>>>
215210
>>> # from memory
216-
>>> doc = pymupdf.open(stream=mem_area) # works for any supported type
217-
>>> doc = pymupdf.open(stream=unknown-type, filetype="txt") # treat as plain text
211+
>>> doc = pymupdf.open(stream=mem_area) # works for most supported types
212+
>>> doc = pymupdf.open(stream=ambiguous, filetype="mobi") # treat as MOBI e-book
218213
>>>
219214
>>> # new empty PDF
220215
>>> doc = pymupdf.open()
@@ -2211,4 +2206,6 @@ Other Examples
22112206
22122207
.. [#f7] This only works under certain conditions. For example, if there is normal text covered by some image on top of it, then this is undetectable and the respective text is **not** removed. Similar is true for white text on white background, and so on.
22132208
2209+
.. [#f8] Almost all supported document types -- including all images -- are detected by MuPDF's built-in content recognizer. Exceptions are many text-based formats such as plain text and program source code, whose content cannot be identified unambiguously. The e-book formats MOBI (extension ``.mobi``) and FictionBook (extension ``.fb2``) are two further exceptions, which will probably be covered by the recognition feature soon. In these cases, the respective file extension **must** be present, or (especially when opening from memory) ``filetype`` must specify the document type.
2210+
22142211
.. include:: footer.rst

docs/how-to-open-a-file.rst

Lines changed: 5 additions & 8 deletions
@@ -38,22 +38,19 @@ To open a file, do the following:
3838
File Recognizer: Opening with :index:`a Wrong File Extension <pair: wrong; file extension>`
3939
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
4040

41-
If you have a document with a wrong file extension for its type, do not worry: it will still be opened correctly, thanks to the integrated file "content recognizer".
41+
If you have a document with a wrong file extension for its type, do not worry: it will still be opened correctly, thanks to the integrated file "content recognizer" in the base library.
4242

4343
This component looks at the actual data in the file using a number of heuristics -- independent of the file extension. This of course is also true for file names **without** an extension.
4444

4545
Here is a list of details about how the file content recognizer works:
4646

47-
* When opening from a file name, use the ``filetype`` parameter if you need to make sure that the created :ref:`Document` is of the expected type. An exception is raised for any mismatch.
47+
* Whether opening from a file name or from memory, the recognizer will determine the correct document type in most cases. It neither needs nor looks at the file extension, which is unavailable anyway when opening from memory.
4848

49-
* Text files are an exception: they do not contain recognizable internal structures at all. Here, the file extension ".txt" and the ``filetype`` parameter continue to play a role and are used to create a "Tex" document. Correspondingly, text files with other / no extensions, can successfully be opened using `filetype="txt"`.
49+
* Text files are an exception: they do not contain recognizable internal structures at all. Here, the file extension ".txt" and the ``filetype`` parameter continue to play a role and are used to create a "Text" document. Correspondingly, text files with a different or no extension can be opened successfully using ``filetype="txt"``.
5050

51-
* Using `filetype="txt"` will treat **any** file as containing plain text when opened from a file name / path -- even when its content is a supported document type.
51+
* Currently, two e-book formats, FictionBook and MOBI, are not automatically recognized. They require the extensions ".fb2" and ".mobi" respectively. Use the ``filetype`` parameter accordingly to open them from memory.
5252

53-
* When opening from a stream, the file content recognizer will ignore the ``filetype`` parameter entirely for known file types -- even in case of a mismatch or when `filetype="txt"` was specified.
54-
55-
* Streams with a known file type cannot be opened as plain text.
56-
* Specifying ``filetype`` currently only has an effect when no match was found. Then using ``filetype="txt"`` will treat the file as containing plain text.
53+
* Using `filetype="txt"` will treat **any** file as containing plain text -- even when its content is a supported document type.
5754

5855

5956
----------
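
The rules listed above suggest passing `filetype` only for the formats that cannot be recognized from content. The following is a minimal sketch, assuming the caller still knows the original file name of the in-memory data. `NEEDS_HINT`, `filetype_hint` and `open_from_memory` are hypothetical names introduced for illustration; `pymupdf.open(stream=..., filetype=...)` is the call documented in this commit.

```python
import os

# Formats which, per this commit, are not recognized from content alone.
NEEDS_HINT = {"txt", "fb2", "mobi"}


def filetype_hint(name):
    """Return a filetype string for formats that need one, else None."""
    ext = os.path.splitext(name)[1].lower().lstrip(".")
    return ext if ext in NEEDS_HINT else None


def open_from_memory(data, name):
    """Open bytes with PyMuPDF, hinting the type only when required."""
    import pymupdf  # imported lazily so filetype_hint() works without it

    return pymupdf.open(stream=data, filetype=filetype_hint(name))
```

For all other supported types the hint is `None`, so the content recognizer alone decides the document type.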

src/__init__.py

Lines changed: 50 additions & 50 deletions
@@ -2922,8 +2922,6 @@ def __init__(self, filename=None, stream=None, filetype=None, rect=None, width=0
29222922
else:
29232923
raise TypeError(f"bad stream: {type(stream)=}.")
29242924
stream = self.stream
2925-
if not (filename or filetype):
2926-
filename = 'pdf'
29272925
else:
29282926
self.stream = None
29292927

@@ -2951,6 +2949,17 @@ def __init__(self, filename=None, stream=None, filetype=None, rect=None, width=0
29512949
w = r.x1 - r.x0
29522950
h = r.y1 - r.y0
29532951

2952+
if from_file:
2953+
_, magic2 = os.path.splitext(filename)
2954+
if magic2.startswith("."):
2955+
magic2 = magic2[1:]
2956+
else:
2957+
magic2 = ""
2958+
if isinstance(filetype, str):
2959+
magic = filetype
2960+
else:
2961+
magic = ""
2962+
29542963
if stream is not None:
29552964
assert isinstance(stream, (bytes, memoryview))
29562965
if len(stream) == 0:
@@ -2962,65 +2971,56 @@ def __init__(self, filename=None, stream=None, filetype=None, rect=None, width=0
29622971
buffer_ = mupdf.fz_new_buffer_from_copied_data(c)
29632972
data = mupdf.fz_open_buffer(buffer_)
29642973
else:
2965-
# Pass raw bytes data to mupdf.fz_open_memory(). This assumes
2966-
# that the bytes string will not be modified; i think the
2967-
# original PyMuPDF code makes the same assumption. Presumably
2968-
# setting self.stream above ensures that the bytes will not be
2969-
# garbage collected?
29702974
data = mupdf.fz_open_memory(mupdf.python_buffer_data(c), len(c))
2971-
magic = filename
2972-
if not magic:
2973-
magic = filetype
2974-
# fixme: pymupdf does:
2975-
# handler = fz_recognize_document(gctx, filetype);
2976-
# if (!handler) raise ValueError( MSG_BAD_FILETYPE)
2977-
# but prefer to leave fz_open_document_with_stream() to raise.
2975+
29782976
try:
2979-
doc = mupdf.fz_open_document_with_stream(magic, data)
2977+
if magic:
2978+
handler = mupdf.ll_fz_recognize_document(magic)
2979+
if not handler:
2980+
raise FileDataError(f"Failed to open stream as {magic}")
2981+
accel = mupdf.FzStream()
2982+
archive = mupdf.FzArchive(None)
2983+
doc = mupdf.ll_fz_document_handler_open(
2984+
handler,
2985+
data.m_internal,
2986+
accel.m_internal,
2987+
archive.m_internal,
2988+
None, # recognize_state
2989+
)
2990+
doc = mupdf.FzDocument(doc)
2991+
else:
2992+
doc = mupdf.fz_open_document_with_stream(magic, data)
29802993
except Exception as e:
29812994
if g_exceptions_verbose > 1: exception_info()
29822995
raise FileDataError('Failed to open stream') from e
29832996
else:
29842997
if filename:
2985-
if not filetype:
2998+
if magic == "txt":
2999+
handler = mupdf.ll_fz_recognize_document(magic)
3000+
else:
3001+
stream = mupdf.FzStream(filename)
3002+
handler = mupdf.ll_fz_recognize_document_stream_content(stream.m_internal, magic)
3003+
if not handler and magic2:
3004+
handler = mupdf.ll_fz_recognize_document_stream_content(stream.m_internal, magic2)
3005+
if handler:
3006+
#log( f'{handler.open=}')
3007+
#log( f'{dir(handler.open)=}')
29863008
try:
2987-
doc = mupdf.fz_open_document(filename)
3009+
accel = mupdf.FzStream()
3010+
archive = mupdf.FzArchive(None)
3011+
doc = mupdf.ll_fz_document_handler_open(
3012+
handler,
3013+
stream.m_internal,
3014+
accel.m_internal,
3015+
archive.m_internal,
3016+
None, # recognize_state
3017+
)
29883018
except Exception as e:
29893019
if g_exceptions_verbose > 1: exception_info()
2990-
raise FileDataError(f'Failed to open file {filename!r}.') from e
3020+
raise FileDataError(f'Failed to open file {filename!r}') from e
3021+
doc = mupdf.FzDocument(doc)
29913022
else:
2992-
handler = mupdf.ll_fz_recognize_document(filetype)
2993-
if handler:
2994-
if handler.open:
2995-
#log( f'{handler.open=}')
2996-
#log( f'{dir(handler.open)=}')
2997-
try:
2998-
stream = mupdf.FzStream(filename)
2999-
accel = mupdf.FzStream()
3000-
archive = mupdf.FzArchive(None)
3001-
if mupdf_version_tuple >= (1, 24, 8):
3002-
doc = mupdf.ll_fz_document_handler_open(
3003-
handler,
3004-
stream.m_internal,
3005-
accel.m_internal,
3006-
archive.m_internal,
3007-
None, # recognize_state
3008-
)
3009-
else:
3010-
doc = mupdf.ll_fz_document_open_fn_call(
3011-
handler.open,
3012-
stream.m_internal,
3013-
accel.m_internal,
3014-
archive.m_internal,
3015-
)
3016-
except Exception as e:
3017-
if g_exceptions_verbose > 1: exception_info()
3018-
raise FileDataError(f'Failed to open file {filename!r} as type {filetype!r}.') from e
3019-
doc = mupdf.FzDocument( doc)
3020-
else:
3021-
assert 0
3022-
else:
3023-
raise ValueError( MSG_BAD_FILETYPE)
3023+
raise ValueError(MSG_BAD_FILETYPE)
30243024
else:
30253025
pdf = mupdf.PdfDocument()
30263026
doc = mupdf.FzDocument(pdf)
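
The hint handling added to `Document.__init__` in this diff can be sketched in isolation. This is an illustration only, not the library's code: the function name `split_hints` is hypothetical, it uses only the standard library, and it adds a guard for a missing file name that the diff does not show. It mirrors how the commit derives `magic` from `filetype` and `magic2` from the file extension.

```python
import os


def split_hints(filename=None, filetype=None, from_file=True):
    """Return (magic, magic2) in the spirit of the diff above.

    Hypothetical helper for illustration only.
    magic  : the caller-supplied filetype, or "" if none was given
    magic2 : the file extension without its leading dot, or ""
    """
    magic2 = ""
    # Assumption: skip extension handling when no file name is available.
    if from_file and filename:
        _, ext = os.path.splitext(os.fspath(filename))
        if ext.startswith("."):
            magic2 = ext[1:]
    magic = filetype if isinstance(filetype, str) else ""
    return magic, magic2
```

In the diff, when opening a file path the content recognizer is tried with `magic` first and with `magic2` only if no handler was found, so an explicit `filetype` takes precedence over the extension.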

tests/resources/fb2-file.fb2

Lines changed: 64 additions & 0 deletions
@@ -0,0 +1,64 @@
1+
<?xml version="1.0" encoding="UTF-8"?>
2+
<FictionBook xmlns="http://www.gribuser.ru/xml/fictionbook/2.0" xmlns:xlink="http://www.w3.org/1999/xlink">
3+
<description>
4+
<title-info>
5+
<genre>computers</genre>
6+
<author>
7+
<first-name>Chris</first-name>
8+
<last-name>Clark</last-name>
9+
</author>
10+
<book-title>Sample FB2 book</book-title>
11+
<annotation>
12+
<p>Short sample of a FictionBook2 book with simple metadata. Based on test_book.md from https://github.com/clach04/sample_reading_media</p>
13+
</annotation>
14+
<keywords>ebook,sample,markdown,fb2,FictionBook2</keywords>
15+
</title-info>
16+
<document-info>
17+
<author>
18+
<nickname>clach04</nickname>
19+
<home-page>https://github.com/clach04/sample_reading_media</home-page>
20+
</author>
21+
22+
<program-used>vim and scite</program-used>
23+
<src-url>https://github.com/clach04/sample_reading_media</src-url>
24+
<version>1.0</version>
25+
<history>
26+
<p>Initial version, written by hand.</p>
27+
</history>
28+
</document-info>
29+
</description>
30+
<body>
31+
<title>
32+
<p>This is a title</p>
33+
</title>
34+
35+
<section id="test-header-h1">
36+
<title>
37+
<p>Test Header h1</p>
38+
</title>
39+
40+
<p>A test paragraph.</p>
41+
<p>Another test paragraph.</p>
42+
</section>
43+
44+
<section id="another-test-header-h1">
45+
<title>
46+
<p>Another Test Header h1</p>
47+
</title>
48+
49+
<section id="a-test-header-h2">
50+
<title>
51+
<p>A Test Header h2</p>
52+
</title>
53+
54+
<section id="a-test-header-h3">
55+
<title>
56+
<p>A Test Header h3</p>
57+
</title>
58+
59+
<p>Yet more copy</p>
60+
</section>
61+
</section>
62+
</section>
63+
</body>
64+
</FictionBook>

tests/resources/mobi-file.mobi

47.6 KB
Binary file not shown.

tests/resources/svg-file.svg

Lines changed: 18 additions & 0 deletions

tests/resources/xps-file.xps

446 KB
Binary file not shown.
