fix: Set OMP_THREAD_LIMIT for better tesseract performance (#185)

awalker4 · web-flow · commit 080ccfaa3d47 · 2023-08-25T18:10:17.000-04:00
I spent some time experimenting with this variable and put together [this gist](https://gist.github.com/awalker4/8581d76d373c1bc51e0f2676a6ad816c). I ran it on a 4-core EC2 instance: processing 3 pages takes 153s without the limit and 5s with it 😍. When the number of pages is higher than the number of cores, processing just hangs unless this variable is set.
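
For reference, a minimal sketch of the same guard written as a reusable helper. The name `ensure_thread_limit` and its parameters are hypothetical and not part of this change; the sketch only assumes that the variable has to be in the environment before the tesseract process is started, since pytesseract launches the tesseract binary as a subprocess that inherits the parent's environment.

```python
import os
from typing import MutableMapping, Optional


def ensure_thread_limit(
    env: Optional[MutableMapping[str, str]] = None,
    default: str = "1",
) -> str:
    """Set OMP_THREAD_LIMIT unless the caller has already chosen a value.

    Hypothetical helper for illustration; `env` defaults to os.environ.
    Returns the value that is in effect afterwards.
    """
    if env is None:
        env = os.environ
    # setdefault leaves an existing value alone, so a user-supplied
    # limit always takes precedence over the single-threaded default.
    env.setdefault("OMP_THREAD_LIMIT", default)
    return env["OMP_THREAD_LIMIT"]


if __name__ == "__main__":
    # Call this before any OCR work so child processes inherit the limit.
    print("OMP_THREAD_LIMIT =", ensure_thread_limit())
```

As in the diff below, a value already exported in the shell is respected, so deployments that want a different limit can still override the default.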
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,3 +1,7 @@
+## 0.5.17
+
+* Use `OMP_THREAD_LIMIT` to improve tesseract performance
+
 ## 0.5.16
 
 * Fix to no longer create a directory for storing processed images
diff --git a/unstructured_inference/__version__.py b/unstructured_inference/__version__.py
@@ -1 +1 @@
-__version__ = "0.5.16"  # pragma: no cover
+__version__ = "0.5.17"  # pragma: no cover
diff --git a/unstructured_inference/models/tesseract.py b/unstructured_inference/models/tesseract.py
@@ -1,3 +1,4 @@
+import os
 from typing import Dict
 
 import pytesseract
@@ -9,6 +10,11 @@
 
 TesseractError = pytesseract.pytesseract.TesseractError
 
+# Force tesseract to be single threaded,
+# otherwise we see major performance problems
+if "OMP_THREAD_LIMIT" not in os.environ:
+    os.environ["OMP_THREAD_LIMIT"] = "1"
+
 
 def load_agent(languages: str = "eng"):
     """Loads the Tesseract OCR agent as a global variable to ensure that we only load it once.
