Commit 080ccfa

fix: Set OMP_THREAD_LIMIT for better tesseract performance (#185)
I've spent some time playing with this var and came up with [this gist](https://gist.github.com/awalker4/8581d76d373c1bc51e0f2676a6ad816c). I ran it on a 4-core EC2 instance. Processing 3 pages takes 153s without the limit and 5s with it 😍. When the number of pages is higher than the number of cores, it just hangs without this var.
1 parent 9b6aa8e commit 080ccfa

3 files changed: +11 -1 lines changed

CHANGELOG.md

Lines changed: 4 additions & 0 deletions
@@ -1,3 +1,7 @@
+## 0.5.17
+
+* Use `OMP_THREAD_LIMIT` to improve tesseract performance
+
 ## 0.5.16
 
 * Fix to no longer create a directory for storing processed images
Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-__version__ = "0.5.16" # pragma: no cover
+__version__ = "0.5.17" # pragma: no cover

unstructured_inference/models/tesseract.py

Lines changed: 6 additions & 0 deletions
@@ -1,3 +1,4 @@
+import os
 from typing import Dict
 
 import pytesseract
@@ -9,6 +10,11 @@
 
 TesseractError = pytesseract.pytesseract.TesseractError
 
+# Force tesseract to be single threaded,
+# otherwise we see major performance problems
+if "OMP_THREAD_LIMIT" not in os.environ:
+    os.environ["OMP_THREAD_LIMIT"] = "1"
+
 
 def load_agent(languages: str = "eng"):
     """Loads the Tesseract OCR agent as a global variable to ensure that we only load it once.
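
The guard added above only sets `OMP_THREAD_LIMIT` when the caller has not already chosen a value, and it does so at import time so that any process started later sees it. A minimal sketch of that behaviour follows, assuming the tesseract binary runs as a child process that inherits the Python process's environment (which is how pytesseract appears to invoke it). The helper names `ensure_thread_limit` and `thread_limit_seen_by_child` are hypothetical and are not part of this commit; a plain Python child process stands in for tesseract.

```python
import os
import subprocess
import sys


def ensure_thread_limit(default: str = "1") -> str:
    """Set OMP_THREAD_LIMIT only if it is not already set; return the effective value.

    Hypothetical helper: same effect as the `if ... not in os.environ` guard in the diff.
    """
    return os.environ.setdefault("OMP_THREAD_LIMIT", default)


def thread_limit_seen_by_child() -> str:
    """Spawn a child Python process and return the OMP_THREAD_LIMIT it inherits.

    The child stands in for the tesseract binary, which is not invoked here.
    """
    result = subprocess.run(
        [sys.executable, "-c", "import os; print(os.environ.get('OMP_THREAD_LIMIT', ''))"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```

Calling `ensure_thread_limit()` before any OCR work should leave a user-supplied value such as `OMP_THREAD_LIMIT=4` untouched, while defaulting to `"1"` otherwise.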
