docs/ src/: simplified specification of Tesseract data.

julian-smith-artifex-com · julian-smith-artifex-com · commit 68e4558fe142 · 2024-11-27T18:25:15.000Z
docs/functions.rst
docs/installation.rst
    Updated Tesseract information.

src/__init__.py:
    Removed global TESSDATA_PREFIX as not required any more.
    Pixmap.pdfocr_save(), Pixmap.pdfocr_tobytes():
        Use get_tessdata() to infer tessdata if unspecified.
    get_tessdata():
        Added optional `tessdata` arg; it is returned directly if set.
        Raise an exception if Tesseract data cannot be found (previously
        returned False).

src/utils.py:
    Removed global TESSDATA_PREFIX as not required any more.
    get_textpage_ocr() (and Page.get_textpage_ocr()):
        Use get_tessdata() to infer tessdata if unspecified.
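
Illustration (not part of the change): a minimal sketch of the lookup order
that get_tessdata() follows after this commit. The name resolve_tessdata and
the `environ` and `search` parameters are hypothetical; they exist only so the
order can be exercised without a Tesseract installation. The real function
searches with `where` / `whereis` instead of a callback.

```python
import os


def resolve_tessdata(tessdata=None, environ=None, search=None):
    """Illustrative stand-in for get_tessdata(); not PyMuPDF's code.

    `environ` and `search` are hypothetical hooks, added so the lookup
    order can be tried without Tesseract being installed.
    """
    environ = os.environ if environ is None else environ
    if tessdata:  # 1. an explicit argument wins
        return tessdata
    from_env = environ.get("TESSDATA_PREFIX")
    if from_env:  # 2. then the environment variable
        return from_env
    found = search() if search is not None else None
    if found:  # 3. then whatever the installation search finds
        return found
    # 4. nothing worked: raise instead of returning False
    raise RuntimeError("No tessdata specified and Tesseract is not installed")
```

Callers that previously tested the return value of get_tessdata() against
False would presumably need to catch RuntimeError instead.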
diff --git a/docs/functions.rst b/docs/functions.rst
@@ -68,7 +68,6 @@ Yet others are handy, general-purpose utilities.
 :meth:`get_tessdata`                 locates the language support of the Tesseract-OCR installation
 :attr:`fitz_fontdescriptors`         dictionary of available supplement fonts
 :attr:`PYMUPDF_MESSAGE`              destination of |PyMuPDF| messages.
-:attr:`TESSDATA_PREFIX`              a copy of `os.environ["TESSDATA_PREFIX"]`
 :attr:`pdfcolor`                     dictionary of almost 500 RGB colors in PDF format.
 ==================================== ==============================================================
 
@@ -379,18 +378,6 @@ Yet others are handy, general-purpose utilities.
       Also see `set_messages()`.
 
 
------
-
-   .. attribute:: TESSDATA_PREFIX
-
-      * New in v1.19.4
-
-      Copy of `os.environ["TESSDATA_PREFIX"]` for convenient checking whether there is integrated Tesseract OCR support.
-
-      If this attribute is `None`, Tesseract-OCR is either not installed, or the environment variable is not set to point to Tesseract's language support folder.
-
-      .. note:: This variable is now checked before OCR functions are tried. This prevents verbose messages from MuPDF.
-
 -----
 
    .. attribute:: pdfcolor
@@ -850,13 +837,22 @@ Yet others are handy, general-purpose utilities.
 
 -----
 
-   .. method:: get_tessdata()
+   .. method:: get_tessdata(tessdata=None)
+    
+    Detect Tesseract language support folder.
 
-      Return the name of Tesseract's language support folder. Use this function if the environment variable `TESSDATA_PREFIX` has not been set.
+    This function is used to enable OCR via Tesseract even if the language
+    support folder is not specified directly or in environment variable
+    TESSDATA_PREFIX.
 
-      :returns: `os.getenv("TESSDATA_PREFIX")` if not `None`. Otherwise, if Tesseract-OCR is installed, locate the name of `tessdata`. If no installation is found, return `False`.
+    * If <tessdata> is set we return it directly.
+    
+    * Otherwise we return `os.environ['TESSDATA_PREFIX']` if set.
+    
+    * Otherwise we search for a Tesseract installation and return its language
+      support folder.
 
-         The folder name can be used as parameter `tessdata` in methods :meth:`Page.get_textpage_ocr`, :meth:`Pixmap.pdfocr_save` and :meth:`Pixmap.pdfocr_tobytes`.
+    * Otherwise we raise an exception.
 
 -----
 
diff --git a/docs/installation.rst b/docs/installation.rst
@@ -159,7 +159,12 @@ Notes
   * `Pillow <https://pypi.org/project/Pillow/>`_ is required for :meth:`Pixmap.pil_save` and :meth:`Pixmap.pil_tobytes`.
   * `fontTools <https://pypi.org/project/fonttools/>`_ is required for :meth:`Document.subset_fonts`.
   * `pymupdf-fonts <https://pypi.org/project/pymupdf-fonts/>`_ is a collection of nice fonts to be used for text output methods.
-  * `Tesseract-OCR <https://github.com/tesseract-ocr/tesseract>`_ for optical character recognition in images and document pages. Tesseract is separate software, not a Python package. To enable OCR functions in PyMuPDF, the software must be installed and the system environment variable `"TESSDATA_PREFIX"` must be defined and contain the `tessdata` folder name of the Tesseract installation location. See below.
+  * 
+    `Tesseract-OCR <https://github.com/tesseract-ocr/tesseract>`_ for optical
+    character recognition in images and document pages. Tesseract is separate
+    software, not a Python package. To enable OCR functions in PyMuPDF,
+    Tesseract must be installed and the `tessdata` folder name specified; see
+    below.
 
   .. note:: You can install these additional components at any time -- before or after installing PyMuPDF. PyMuPDF will detect their presence during import or when the respective functions are being used.
 
@@ -271,18 +276,27 @@ If you do not intend to use this feature, skip this step. Otherwise, it is requi
 
 PyMuPDF will already contain all the logic to support OCR functions. But it additionally does need `Tesseract’s language support data <https://github.com/tesseract-ocr/tessdata>`_.
 
-The language support folder location must be communicated either via storing it in the environment variable `"TESSDATA_PREFIX"`, or as a parameter in the applicable functions.
+If not specified explicitly, PyMuPDF will attempt to find the installed
+Tesseract's tessdata, but this should probably not be relied upon.
+
+Otherwise PyMuPDF requires that Tesseract's language support folder is
+specified explicitly either in PyMuPDF OCR functions' `tessdata` arguments or
+`os.environ["TESSDATA_PREFIX"]`.
 
 So for a working OCR functionality, make sure to complete this checklist:
 
 1. Locate Tesseract's language support folder. Typically you will find it here:
-    - Windows: `C:/Program Files/Tesseract-OCR/tessdata`
-    - Unix systems: `/usr/share/tesseract-ocr/4.00/tessdata`
-
-2. Set the environment variable `TESSDATA_PREFIX`
-    - Windows: `setx TESSDATA_PREFIX "C:/Program Files/Tesseract-OCR/tessdata"`
-    - Unix systems: `declare -x TESSDATA_PREFIX=/usr/share/tesseract-ocr/4.00/tessdata`
 
-.. note:: On Windows systems, this must happen outside Python -- before starting your script. Just manipulating `os.environ` will not work!
+   * Windows: `C:/Program Files/Tesseract-OCR/tessdata`
+   * Unix systems: `/usr/share/tesseract-ocr/4.00/tessdata`
+
+2. Specify the language support folder when calling PyMuPDF OCR functions:
+   
+   * Set the `tessdata` argument.
+   * Or set `os.environ["TESSDATA_PREFIX"]` from within Python.
+   * Or set environment variable `TESSDATA_PREFIX` before running Python, for example:
+   
+     * Windows: `setx TESSDATA_PREFIX "C:/Program Files/Tesseract-OCR/tessdata"`
+     * Unix systems: `declare -x TESSDATA_PREFIX=/usr/share/tesseract-ocr/4.00/tessdata`
 
 .. include:: footer.rst
diff --git a/src/__init__.py b/src/__init__.py
@@ -430,7 +430,6 @@ def _format_g(value, *, fmt='%g'):
 Page = 'Page_forward_decl'
 Point = 'Point_forward_decl'
 
-TESSDATA_PREFIX = os.environ.get("TESSDATA_PREFIX")
 matrix_like = 'matrix_like'
 point_like = 'point_like'
 quad_like = 'quad_like'
@@ -10305,8 +10304,7 @@ def pdfocr_save(self, filename, compress=1, language=None, tessdata=None):
         '''
         Save pixmap as an OCR-ed PDF page.
         '''
-        if not TESSDATA_PREFIX and not tessdata:
-            raise RuntimeError('No OCR support: TESSDATA_PREFIX not set')
+        tessdata = get_tessdata(tessdata)
         opts = mupdf.FzPdfocrOptions()
         opts.compress = compress
         if language:
@@ -10328,15 +10326,15 @@ def pdfocr_tobytes(self, compress=True, language="eng", tessdata=None):
             compress: (bool) compress, default 1 (True).
             language: (str) language(s) occurring on page, default "eng" (English),
                     multiples like "eng+ger" for English and German.
-            tessdata: (str) folder name of Tesseract's language support. Must be
-                    given if environment variable TESSDATA_PREFIX is not set.
+            tessdata: (str) folder name of Tesseract's language support. If None
+                    we use environment variable TESSDATA_PREFIX or search for
+                    Tesseract installation.
         Notes:
-            On failure, make sure Tesseract is installed and you have set the
-            environment variable "TESSDATA_PREFIX" to the folder containing your
-            Tesseract's language support data.
+            On failure, make sure Tesseract is installed and you have set
+            <tessdata> or environment variable "TESSDATA_PREFIX" to the folder
+            containing your Tesseract's language support data.
         """
-        if not TESSDATA_PREFIX and not tessdata:
-            raise RuntimeError('No OCR support: TESSDATA_PREFIX not set')
+        tessdata = get_tessdata(tessdata)
         from io import BytesIO
         bio = BytesIO()
         self.pdfocr_save(bio, compress=compress, language=language, tessdata=tessdata)
@@ -18309,55 +18307,59 @@ def make_utf16be(s):
     return "(" + r + ")"
 
 
-def get_tessdata():
-    """Detect Tesseract-OCR and return its language support folder.
+def get_tessdata(tessdata=None):
+    """Detect Tesseract language support folder.
 
-    This function can be used to enable OCR via Tesseract even if the
-    environment variable TESSDATA_PREFIX has not been set.
-    If the value of TESSDATA_PREFIX is None, the function tries to locate
-    Tesseract-OCR and fills the required variable.
+    This function is used to enable OCR via Tesseract even if the language
+    support folder is not specified directly or in environment variable
+    TESSDATA_PREFIX.
 
-    Returns:
-        Folder name of tessdata if Tesseract-OCR is available, otherwise False.
-    """
-    TESSDATA_PREFIX = os.getenv("TESSDATA_PREFIX")
-    if TESSDATA_PREFIX:  # use environment variable if set
-        return TESSDATA_PREFIX
+    * If <tessdata> is set we return it directly.
+    
+    * Otherwise we return `os.environ['TESSDATA_PREFIX']` if set.
+    
+    * Otherwise we search for a Tesseract installation and return its language
+      support folder.
 
+    * Otherwise we raise an exception.
     """
-    Try to locate the tesseract-ocr installation.
-    """
+    if tessdata:
+        return tessdata
+    tessdata = os.getenv("TESSDATA_PREFIX")
+    if tessdata:  # use environment variable if set
+        return tessdata
+
+    # Try to locate the tesseract-ocr installation.
+    
     import subprocess
     # Windows systems:
     if sys.platform == "win32":
         cp = subprocess.run("where tesseract", shell=1, capture_output=1, check=0, text=True)
         response = cp.stdout.strip()
         if cp.returncode or not response:
-            message("Tesseract-OCR is not installed")
-            return False
+            raise RuntimeError("No tessdata specified and Tesseract is not installed")
         dirname = os.path.dirname(response)  # path of tesseract.exe
         tessdata = os.path.join(dirname, "tessdata")  # language support
         if os.path.exists(tessdata):  # all ok?
             return tessdata
         else:  # should not happen!
-            message("unexpected: Tesseract-OCR has no 'tessdata' folder")
-            return False
+            raise RuntimeError(f"No tessdata specified and Tesseract installation has no {tessdata} folder")
 
     # Unix-like systems:
     cp = subprocess.run("whereis tesseract-ocr", shell=1, capture_output=1, check=0, text=True)
     response = cp.stdout.strip().split()
     if cp.returncode or len(response) != 2:  # if not 2 tokens: no tesseract-ocr
-        message("tesseract-ocr is not installed")
-        return False
+        raise RuntimeError("No tessdata specified and Tesseract is not installed")
 
     # search tessdata in folder structure
     dirname = response[1]  # contains tesseract-ocr installation folder
-    tessdatas = glob.glob(f"{dirname}/*/tessdata")
+    pattern = f"{dirname}/*/tessdata"
+    tessdatas = glob.glob(pattern)
     tessdatas.sort()
-    if len(tessdatas) == 0:
-        message("unexpected: tesseract-ocr has no 'tessdata' folder")
-        return False
-    return tessdatas[-1]
+    if tessdatas:
+        return tessdatas[-1]
+    else:
+        raise RuntimeError(f"No tessdata specified and Tesseract installation has no {pattern} folder.")
 
 
 def css_for_pymupdf_font(
diff --git a/src/utils.py b/src/utils.py
@@ -25,7 +25,6 @@
 
 g_exceptions_verbose = pymupdf.g_exceptions_verbose
 
-TESSDATA_PREFIX = os.environ.get("TESSDATA_PREFIX")
 point_like = "point_like"
 rect_like = "rect_like"
 matrix_like = "matrix_like"
@@ -748,8 +747,7 @@ def get_textpage_ocr(
         full: (bool) whether to OCR the full page image, or only its images (default)
     """
     pymupdf.CheckParent(page)
-    if not TESSDATA_PREFIX and not tessdata:
-        raise RuntimeError("No OCR support: TESSDATA_PREFIX not set")
+    tessdata = pymupdf.get_tessdata(tessdata)
 
     def full_ocr(page, dpi, language, flags):
         zoom = dpi / 72