Commit 68e4558

docs/ src/: simplified specification of Tesseract data.

docs/functions.rst
docs/installation.rst
    Updated Tesseract information.

src/__init__.py:
    Removed global TESSDATA_PREFIX as not required any more.
    Pixmap.pdfocr_save()
    pdfocr_tobytes()
        Use get_tessdata() to infer tessdata if unspecified.
    get_tessdata():
        Added optional `tessdata` arg; is returned directly if set.
        Raise exceptions if we cannot find tesseract data (used to return False.)

src/utils.py:
    Removed global TESSDATA_PREFIX as not required any more.
    get_textpage_ocr() (and Page.get_textpage_ocr()):
        Use get_tessdata() to infer tessdata if unspecified.
1 parent a21e935 commit 68e4558

4 files changed

+74
-64
lines changed

docs/functions.rst

Lines changed: 13 additions & 17 deletions
@@ -68,7 +68,6 @@ Yet others are handy, general-purpose utilities.
 :meth:`get_tessdata` locates the language support of the Tesseract-OCR installation
 :attr:`fitz_fontdescriptors` dictionary of available supplement fonts
 :attr:`PYMUPDF_MESSAGE` destination of |PyMuPDF| messages.
-:attr:`TESSDATA_PREFIX` a copy of `os.environ["TESSDATA_PREFIX"]`
 :attr:`pdfcolor` dictionary of almost 500 RGB colors in PDF format.
 ==================================== ==============================================================
 
@@ -379,18 +378,6 @@ Yet others are handy, general-purpose utilities.
 Also see `set_messages()`.
 
 
------
-
-.. attribute:: TESSDATA_PREFIX
-
-* New in v1.19.4
-
-Copy of `os.environ["TESSDATA_PREFIX"]` for convenient checking whether there is integrated Tesseract OCR support.
-
-If this attribute is `None`, Tesseract-OCR is either not installed, or the environment variable is not set to point to Tesseract's language support folder.
-
-.. note:: This variable is now checked before OCR functions are tried. This prevents verbose messages from MuPDF.
-
 -----
 
 .. attribute:: pdfcolor
@@ -850,13 +837,22 @@ Yet others are handy, general-purpose utilities.
 
 -----
 
-.. method:: get_tessdata()
+.. method:: get_tessdata(tessdata=None)
+
+Detect Tesseract language support folder.
 
-Return the name of Tesseract's language support folder. Use this function if the environment variable `TESSDATA_PREFIX` has not been set.
+This function is used to enable OCR via Tesseract even if the language
+support folder is not specified directly or in environment variable
+TESSDATA_PREFIX.
 
-:returns: `os.getenv("TESSDATA_PREFIX")` if not `None`. Otherwise, if Tesseract-OCR is installed, locate the name of `tessdata`. If no installation is found, return `False`.
+* If <tessdata> is set we return it directly.
+
+* Otherwise we return `os.environ['TESSDATA_PREFIX']` if set.
+
+* Otherwise we search for a Tesseract installation and return its language
+support folder.
 
-The folder name can be used as parameter `tessdata` in methods :meth:`Page.get_textpage_ocr`, :meth:`Pixmap.pdfocr_save` and :meth:`Pixmap.pdfocr_tobytes`.
+* Otherwise we raise an exception.
 
 -----
 
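The lookup order documented for `get_tessdata()` in this hunk can be sketched as a small standalone function. This is only an illustration, not PyMuPDF code: `resolve_tessdata`, its `environ` mapping and its `search` callable are hypothetical names, and `search` merely stands in for the installation search that the real function performs with `where` / `whereis`.

```python
import os


def resolve_tessdata(tessdata=None, environ=None, search=None):
    """Illustrative stand-in for the documented lookup order.

    tessdata: explicit folder name, returned unchanged if set.
    environ:  mapping consulted for TESSDATA_PREFIX (defaults to os.environ).
    search:   hypothetical callable returning a folder name or None; it
              replaces the real search for a Tesseract installation.
    """
    # 1. An explicit argument wins.
    if tessdata:
        return tessdata
    # 2. Otherwise fall back to the environment variable.
    env = os.environ if environ is None else environ
    from_env = env.get("TESSDATA_PREFIX")
    if from_env:
        return from_env
    # 3. Otherwise search for an installation.
    found = search() if search is not None else None
    if found:
        return found
    # 4. Otherwise fail loudly instead of returning False.
    raise RuntimeError("No tessdata specified and Tesseract is not installed")


print(resolve_tessdata("/opt/tessdata", environ={}))
print(resolve_tessdata(environ={"TESSDATA_PREFIX": "/env/tessdata"}))
```

Passing `environ` and `search` explicitly keeps the sketch testable without touching the real environment or requiring Tesseract to be installed.
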
docs/installation.rst

Lines changed: 23 additions & 9 deletions
@@ -159,7 +159,12 @@ Notes
 * `Pillow <https://pypi.org/project/Pillow/>`_ is required for :meth:`Pixmap.pil_save` and :meth:`Pixmap.pil_tobytes`.
 * `fontTools <https://pypi.org/project/fonttools/>`_ is required for :meth:`Document.subset_fonts`.
 * `pymupdf-fonts <https://pypi.org/project/pymupdf-fonts/>`_ is a collection of nice fonts to be used for text output methods.
-* `Tesseract-OCR <https://github.com/tesseract-ocr/tesseract>`_ for optical character recognition in images and document pages. Tesseract is separate software, not a Python package. To enable OCR functions in PyMuPDF, the software must be installed and the system environment variable `"TESSDATA_PREFIX"` must be defined and contain the `tessdata` folder name of the Tesseract installation location. See below.
+*
+`Tesseract-OCR <https://github.com/tesseract-ocr/tesseract>`_ for optical
+character recognition in images and document pages. Tesseract is separate
+software, not a Python package. To enable OCR functions in PyMuPDF,
+Tesseract must be installed and the `tessdata` folder name specified; see
+below.
 
 .. note:: You can install these additional components at any time -- before or after installing PyMuPDF. PyMuPDF will detect their presence during import or when the respective functions are being used.
 
@@ -271,18 +276,27 @@ If you do not intend to use this feature, skip this step. Otherwise, it is requi
 
 PyMuPDF will already contain all the logic to support OCR functions. But it additionally does need `Tesseract’s language support data <https://github.com/tesseract-ocr/tessdata>`_.
 
-The language support folder location must be communicated either via storing it in the environment variable `"TESSDATA_PREFIX"`, or as a parameter in the applicable functions.
+If not specified explicitly, PyMuPDF will attempt to find the installed
+Tesseract's tessdata, but this should probably not be relied upon.
+
+Otherwise PyMuPDF requires that Tesseract's language support folder is
+specified explicitly either in PyMuPDF OCR functions' `tessdata` arguments or
+`os.environ["TESSDATA_PREFIX"]`.
 
 So for a working OCR functionality, make sure to complete this checklist:
 
 1. Locate Tesseract's language support folder. Typically you will find it here:
-- Windows: `C:/Program Files/Tesseract-OCR/tessdata`
-- Unix systems: `/usr/share/tesseract-ocr/4.00/tessdata`
-
-2. Set the environment variable `TESSDATA_PREFIX`
-- Windows: `setx TESSDATA_PREFIX "C:/Program Files/Tesseract-OCR/tessdata"`
-- Unix systems: `declare -x TESSDATA_PREFIX=/usr/share/tesseract-ocr/4.00/tessdata`
 
-.. note:: On Windows systems, this must happen outside Python -- before starting your script. Just manipulating `os.environ` will not work!
+* Windows: `C:/Program Files/Tesseract-OCR/tessdata`
+* Unix systems: `/usr/share/tesseract-ocr/4.00/tessdata`
+
+2. Specify the language support folder when calling PyMuPDF OCR functions:
+
+* Set the `tessdata` argument.
+* Or set `os.environ["TESSDATA_PREFIX"]` from within Python.
+* Or set environment variable `TESSDATA_PREFIX` before running Python, for example:
+
+* Windows: `setx TESSDATA_PREFIX "C:/Program Files/Tesseract-OCR/tessdata"`
+* Unix systems: `declare -x TESSDATA_PREFIX=/usr/share/tesseract-ocr/4.00/tessdata`
 
 .. include:: footer.rst
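
The updated checklist lists setting `os.environ["TESSDATA_PREFIX"]` from within Python as an option. A plausible reason, judging from the source changes in this commit, is that the variable is now read when an OCR function is called rather than copied into a module global at import. The sketch below contrasts the two lookup styles; the helper names are hypothetical and nothing here calls PyMuPDF.

```python
import os

# Snapshot taken once, the way a module-level global is filled at import time.
IMPORT_TIME_VALUE = os.environ.get("TESSDATA_PREFIX")


def lookup_at_import_time():
    """Return the value captured when this module was loaded."""
    return IMPORT_TIME_VALUE


def lookup_at_call_time():
    """Return the current value of the environment variable."""
    return os.environ.get("TESSDATA_PREFIX")


# Typical Unix location from step 1; substitute your own folder.
os.environ["TESSDATA_PREFIX"] = "/usr/share/tesseract-ocr/4.00/tessdata"

# The call-time lookup sees the assignment; the snapshot does not change.
print(lookup_at_call_time())
```

An assignment made after import is visible only to the call-time lookup, so a check against the snapshot would still report the variable as unset.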

src/__init__.py

Lines changed: 37 additions & 35 deletions
@@ -430,7 +430,6 @@ def _format_g(value, *, fmt='%g'):
 Page = 'Page_forward_decl'
 Point = 'Point_forward_decl'
 
-TESSDATA_PREFIX = os.environ.get("TESSDATA_PREFIX")
 matrix_like = 'matrix_like'
 point_like = 'point_like'
 quad_like = 'quad_like'
@@ -10305,8 +10304,7 @@ def pdfocr_save(self, filename, compress=1, language=None, tessdata=None):
         '''
         Save pixmap as an OCR-ed PDF page.
         '''
-        if not TESSDATA_PREFIX and not tessdata:
-            raise RuntimeError('No OCR support: TESSDATA_PREFIX not set')
+        tessdata = get_tessdata(tessdata)
         opts = mupdf.FzPdfocrOptions()
         opts.compress = compress
         if language:
@@ -10328,15 +10326,15 @@ def pdfocr_tobytes(self, compress=True, language="eng", tessdata=None):
             compress: (bool) compress, default 1 (True).
             language: (str) language(s) occurring on page, default "eng" (English),
                     multiples like "eng+ger" for English and German.
-            tessdata: (str) folder name of Tesseract's language support. Must be
-                    given if environment variable TESSDATA_PREFIX is not set.
+            tessdata: (str) folder name of Tesseract's language support. If None
+                    we use environment variable TESSDATA_PREFIX or search for
+                    Tesseract installation.
         Notes:
-            On failure, make sure Tesseract is installed and you have set the
-            environment variable "TESSDATA_PREFIX" to the folder containing your
-            Tesseract's language support data.
+            On failure, make sure Tesseract is installed and you have set
+            <tessdata> or environment variable "TESSDATA_PREFIX" to the folder
+            containing your Tesseract's language support data.
         """
-        if not TESSDATA_PREFIX and not tessdata:
-            raise RuntimeError('No OCR support: TESSDATA_PREFIX not set')
+        tessdata = get_tessdata(tessdata)
         from io import BytesIO
         bio = BytesIO()
         self.pdfocr_save(bio, compress=compress, language=language, tessdata=tessdata)
@@ -18309,55 +18307,59 @@ def make_utf16be(s):
     return "(" + r + ")"
 
 
-def get_tessdata():
-    """Detect Tesseract-OCR and return its language support folder.
+def get_tessdata(tessdata=None):
+    """Detect Tesseract language support folder.
 
-    This function can be used to enable OCR via Tesseract even if the
-    environment variable TESSDATA_PREFIX has not been set.
-    If the value of TESSDATA_PREFIX is None, the function tries to locate
-    Tesseract-OCR and fills the required variable.
+    This function is used to enable OCR via Tesseract even if the language
+    support folder is not specified directly or in environment variable
+    TESSDATA_PREFIX.
 
-    Returns:
-        Folder name of tessdata if Tesseract-OCR is available, otherwise False.
-    """
-    TESSDATA_PREFIX = os.getenv("TESSDATA_PREFIX")
-    if TESSDATA_PREFIX: # use environment variable if set
-        return TESSDATA_PREFIX
+    * If <tessdata> is set we return it directly.
+
+    * Otherwise we return `os.environ['TESSDATA_PREFIX']` if set.
+
+    * Otherwise we search for a Tesseract installation and return its language
+      support folder.
 
+    * Otherwise we raise an exception.
     """
-    Try to locate the tesseract-ocr installation.
-    """
+    if tessdata:
+        return tessdata
+    tessdata = os.getenv("TESSDATA_PREFIX")
+    if tessdata: # use environment variable if set
+        return tessdata
+
+    # Try to locate the tesseract-ocr installation.
+
     import subprocess
     # Windows systems:
     if sys.platform == "win32":
         cp = subprocess.run("where tesseract", shell=1, capture_output=1, check=0, text=True)
         response = cp.stdout.strip()
         if cp.returncode or not response:
-            message("Tesseract-OCR is not installed")
-            return False
+            raise RuntimeError("No tessdata specified and Tesseract is not installed")
         dirname = os.path.dirname(response) # path of tesseract.exe
         tessdata = os.path.join(dirname, "tessdata") # language support
         if os.path.exists(tessdata): # all ok?
             return tessdata
         else: # should not happen!
-            message("unexpected: Tesseract-OCR has no 'tessdata' folder")
-            return False
+            raise RuntimeError(f"No tessdata specified and Tesseract installation has no {tessdata} folder")
 
     # Unix-like systems:
     cp = subprocess.run("whereis tesseract-ocr", shell=1, capture_output=1, check=0, text=True)
     response = cp.stdout.strip().split()
     if cp.returncode or len(response) != 2: # if not 2 tokens: no tesseract-ocr
-        message("tesseract-ocr is not installed")
-        return False
+        raise RuntimeError("No tessdata specified and Tesseract is not installed")
 
     # search tessdata in folder structure
     dirname = response[1] # contains tesseract-ocr installation folder
-    tessdatas = glob.glob(f"{dirname}/*/tessdata")
+    pattern = f"{dirname}/*/tessdata"
+    tessdatas = glob.glob(pattern)
     tessdatas.sort()
-    if len(tessdatas) == 0:
-        message("unexpected: tesseract-ocr has no 'tessdata' folder")
-        return False
-    return tessdatas[-1]
+    if tessdatas:
+        return tessdatas[-1]
+    else:
+        raise RuntimeError(f"No tessdata specified and Tesseract installation has no {pattern} folder.")
 
 
 def css_for_pymupdf_font(
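
Because `get_tessdata()` now raises `RuntimeError` where it used to return `False`, callers that tested the return value need a small adjustment. The following is a minimal sketch of one way to keep a "value or nothing" style. The helper name `tessdata_or_none` is hypothetical, and the lookup function is passed in as an argument so the sketch runs without PyMuPDF installed.

```python
def tessdata_or_none(get_tessdata, tessdata=None):
    """Return the language support folder, or None if it cannot be found.

    get_tessdata: a callable with the new signature get_tessdata(tessdata=None),
                  for example pymupdf.get_tessdata.
    """
    try:
        return get_tessdata(tessdata)
    except RuntimeError:
        # The new behaviour: lookup failure is reported via an exception.
        return None


# Usage with PyMuPDF (commented out so the sketch has no dependencies):
# import pymupdf
# folder = tessdata_or_none(pymupdf.get_tessdata)
# if folder is None:
#     print("OCR is not available")
```

Catching only `RuntimeError` matches the exception type raised in the hunk above and lets unrelated errors propagate.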

src/utils.py

Lines changed: 1 addition & 3 deletions
@@ -25,7 +25,6 @@
 
 g_exceptions_verbose = pymupdf.g_exceptions_verbose
 
-TESSDATA_PREFIX = os.environ.get("TESSDATA_PREFIX")
 point_like = "point_like"
 rect_like = "rect_like"
 matrix_like = "matrix_like"
@@ -748,8 +747,7 @@ def get_textpage_ocr(
         full: (bool) whether to OCR the full page image, or only its images (default)
     """
     pymupdf.CheckParent(page)
-    if not TESSDATA_PREFIX and not tessdata:
-        raise RuntimeError("No OCR support: TESSDATA_PREFIX not set")
+    tessdata = pymupdf.get_tessdata(tessdata)
 
     def full_ocr(page, dpi, language, flags):
         zoom = dpi / 72
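
From the caller's side, the change means `tessdata` can be passed explicitly or simply left as `None`. A minimal sketch of a wrapper follows. `ocr_text` is a hypothetical helper, and `page` is assumed to behave like a PyMuPDF `Page`, offering `get_textpage_ocr()` and `get_text()`.

```python
def ocr_text(page, tessdata=None, language="eng"):
    """Return the OCR text of a page-like object.

    tessdata: folder name of Tesseract's language support, or None to let
              the library fall back to TESSDATA_PREFIX or its own search.
    """
    # OCR the full page image and reuse the resulting text page for extraction.
    textpage = page.get_textpage_ocr(language=language, full=True, tessdata=tessdata)
    return page.get_text("text", textpage=textpage)


# Usage with PyMuPDF (commented out; "scan.pdf" is a hypothetical file name):
# import pymupdf
# doc = pymupdf.open("scan.pdf")
# print(ocr_text(doc[0], tessdata="/usr/share/tesseract-ocr/4.00/tessdata"))
```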
