Chore: add env ENTIRE_PAGE_OCR to specify paddle/tesseract for entire page ocr (#209)
### Summary
We need a way to use paddle for entire-page OCR, since its results can
be better than tesseract's, as seen on some image files with tables.
This PR adds an environment variable `ENTIRE_PAGE_OCR` that can be set
to `paddle` or `tesseract`. Tesseract remains the default, since paddle
performs poorly on entire-page English PDF files.
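
As a rough illustration of the behavior described above, the sketch below reads `ENTIRE_PAGE_OCR` and defaults to tesseract. The function name `resolve_entire_page_ocr` and the constant `SUPPORTED_OCR_BACKENDS` are hypothetical and not taken from this PR; normalizing case and raising on unknown values are also assumptions.

```python
import os

# Hypothetical constant: the two values this PR allows for ENTIRE_PAGE_OCR.
SUPPORTED_OCR_BACKENDS = ("tesseract", "paddle")


def resolve_entire_page_ocr(environ=os.environ):
    """Return the entire-page OCR backend named by ENTIRE_PAGE_OCR.

    Falls back to "tesseract" when the variable is unset. Lowercasing the
    value and rejecting unknown names are assumptions of this sketch.
    """
    backend = environ.get("ENTIRE_PAGE_OCR", "tesseract").strip().lower()
    if backend not in SUPPORTED_OCR_BACKENDS:
        raise ValueError(
            f"ENTIRE_PAGE_OCR must be one of {SUPPORTED_OCR_BACKENDS}, got {backend!r}"
        )
    return backend
```

Passing the environment as a parameter keeps the lookup easy to test without mutating `os.environ`.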
### Test
If you are on x86, run this snippet to install paddle
(paddle still doesn't work locally on M1/M2 chips):
```
pip install paddlepaddle  # or: pip install unstructured.paddlepaddle if on aarch64
pip install unstructured_paddleocr
export LD_LIBRARY_PATH=/usr/local/lib:/usr/local/lib64
```
Run the following script to compare the entire-page results from paddle
and tesseract:
```
import os

from unstructured_inference.inference.layout import DocumentLayout


def get_layout_from_image(ocr_languages):
    layout = DocumentLayout.from_image_file(
        "sample-docs/table-multi-row-column-cells.png",
        ocr_languages=ocr_languages,
    )
    # Collect only the "text" and "type" fields of each layout element
    elements_dict_list = []
    for page in layout.pages:
        for element in page.elements:
            element_dict = {
                "text": element.text,
                "type": element.type,
            }
            elements_dict_list.append(element_dict)
    return elements_dict_list


# default is tesseract
os.environ["ENTIRE_PAGE_OCR"] = "tesseract"
tesseract_elements = get_layout_from_image(ocr_languages="eng")

# set env to use paddle and call the function again
os.environ["ENTIRE_PAGE_OCR"] = "paddle"
paddle_elements = get_layout_from_image(ocr_languages="en")

# the two engines should produce different results
assert tesseract_elements != paddle_elements

# compare results
print(tesseract_elements)
print(paddle_elements)
```
### Note
Tesseract and paddle use different language codes for the same
language, e.g. `eng` in tesseract and `en` in paddle for English.
This can be addressed once we introduce mappings from standard language
codes to tesseract and paddle codes respectively. However, unlike
tesseract, paddle does not support passing in multiple languages, so
we will fall back to tesseract in that case (future PR).
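
One possible shape for the language mapping and fallback proposed in this note is sketched below. It is not part of this PR: `LANGUAGE_CODES` and `select_ocr_backend` are hypothetical names, the table entries are illustrative, and the sketch assumes paddle is given a single language per call while tesseract joins several codes with `+`.

```python
# Hypothetical table: standard (ISO 639-1) code -> engine-specific code.
# Entries are illustrative only, not an exhaustive or authoritative list.
LANGUAGE_CODES = {
    "en": {"tesseract": "eng", "paddle": "en"},
    "zh": {"tesseract": "chi_sim", "paddle": "ch"},
}


def select_ocr_backend(requested_backend, languages):
    """Return (backend, language_argument) for a list of standard codes.

    Assumption of this sketch: a multi-language request for paddle
    falls back to tesseract, as the note above proposes.
    """
    unknown = [code for code in languages if code not in LANGUAGE_CODES]
    if unknown:
        raise ValueError(f"no mapping for language codes: {unknown}")

    backend = requested_backend
    if backend == "paddle" and len(languages) > 1:
        backend = "tesseract"  # fall back for multi-language requests

    # Tesseract accepts several languages joined by "+", e.g. "eng+chi_sim".
    engine_codes = [LANGUAGE_CODES[code][backend] for code in languages]
    return backend, "+".join(engine_codes)
```

With this sketch, `select_ocr_backend("paddle", ["en"])` keeps paddle with `en`, while a request for both `en` and `zh` is routed to tesseract.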