chore: stop passing language code from tesseract mapping to paddle (#226)

yuming-long · shreyanid · web-flow · commit cf15726a99db · 2023-09-27T16:46:19.000-07:00
### Summary

A user reported an assertion error for the paddle language code:

```
AssertionError: param lang must in dict_keys(['ch', 'en', 'korean', 'japan', 'chinese_cht', 'ta', 'te', 'ka', 'latin', 'arabic', 'cyrillic', 'devanagari']), but got eng
```

They then tried setting the `ocr_languages` param to `en` (the correct language code for English in paddle), but that didn't work either. The reason is that `ocr_languages` goes through the tesseract language code mapping, which converts `en` to `eng` because that's the correct code for English in tesseract.

The quick workaround here is to stop passing the language code to paddle and let it use its default, `en`. This will be addressed properly once we have a language code mapping for paddle.

### Test

It looks like the user tried this branch and got the lang parameter working, per the linked comments (Unstructured-IO/unstructured-api#247 (comment)) :)

On the api repo:

```
pip install paddlepaddle
pip install "unstructured.PaddleOCR"
export ENTIRE_PAGE_OCR=paddle
make run-web-app
```

* Check the error before this change:

```
curl -X 'POST' 'http://localhost:8000/general/v0/general' -H 'accept: application/json' -F 'files=@sample-docs/english-and-korean.png' -F 'ocr_languages=en' | jq -C . | less -R
```

You will see the error:

```
{ "detail": "param lang must in dict_keys(['ch', 'en', 'korean', 'japan', 'chinese_cht', 'ta', 'te', 'ka', 'latin', 'arabic', 'cyrillic', 'devanagari']), but got eng" }
```

The logger will also show `INFO Loading paddle with CPU on language=eng...`, since the tesseract mapping converts `en` to `eng`.

* Check after this change:

Check out this branch and install the inference repo into your env (the same env that's running the api) with `pip install -e .`, then rerun `make run-web-app` and run the curl command again. On an M1 chip you won't get a result, since paddle doesn't work there, but the logger shows `2023-09-27 12:48:48,120 unstructured_inference INFO Loading paddle with CPU on language=en...`, which means the lang parameter is using the default `en` (the logger info comes from [this line](https://github.com/Unstructured-IO/unstructured-inference/blob/main/unstructured_inference/models/paddle_ocr.py#L22)).

---------

Co-authored-by: shreyanid <42684285+shreyanid@users.noreply.github.com>
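The failure described in the summary can be reproduced in isolation. The sketch below is only an illustration: `TESSERACT_CODES`, `to_tesseract_code`, and `is_valid_paddle_lang` are hypothetical names, not functions from either repo. The only facts taken from the description are the `en` to `eng` conversion and the list of paddle codes in the error message.

```python
# Paddle language codes, copied from the AssertionError message in the summary.
PADDLE_LANGS = {
    "ch", "en", "korean", "japan", "chinese_cht", "ta",
    "te", "ka", "latin", "arabic", "cyrillic", "devanagari",
}

# Hypothetical stand-in for the tesseract mapping; only the en -> eng
# entry comes from the PR description.
TESSERACT_CODES = {"en": "eng"}


def to_tesseract_code(lang: str) -> str:
    """Convert a language code to its tesseract form, if one is known."""
    return TESSERACT_CODES.get(lang, lang)


def is_valid_paddle_lang(lang: str) -> bool:
    """Mimic the membership check behind paddle's assertion."""
    return lang in PADDLE_LANGS


requested = "en"                        # what the user passed as ocr_languages
forwarded = to_tesseract_code(requested)  # what used to reach paddle
print(requested, is_valid_paddle_lang(requested))
print(forwarded, is_valid_paddle_lang(forwarded))
```

The user's `en` is valid for paddle on its own, but the converted `eng` is what used to be forwarded, which is why setting `ocr_languages=en` did not help.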
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,3 +1,7 @@
+## 0.6.6
+
+* Stop passing ocr_languages parameter into paddle to avoid invalid paddle language code error, this will be fixed until
+we have the mapping from standard language code to paddle language code.
 ## 0.6.5
 
 * Add functionality to keep extracted image elements while merging inferred layout with extracted layout
diff --git a/unstructured_inference/__version__.py b/unstructured_inference/__version__.py
@@ -1 +1 @@
-__version__ = "0.6.5"  # pragma: no cover
+__version__ = "0.6.6"  # pragma: no cover
diff --git a/unstructured_inference/inference/layout.py b/unstructured_inference/inference/layout.py
@@ -275,12 +275,11 @@ def get_elements_with_detection_model(
                 )
 
             if entrie_page_ocr == "paddle":
-                logger.info("Processing entrie page OCR with paddle...")
+                logger.info("Processing entire page OCR with paddle...")
                 from unstructured_inference.models import paddle_ocr
 
-                # TODO(yuming): paddle only support one language at once,
-                # change ocr to tesseract if passed in multilanguages.
-                ocr_data = paddle_ocr.load_agent(language=self.ocr_languages).ocr(
+                # TODO(yuming): pass ocr language to paddle when we have language mapping for paddle
+                ocr_data = paddle_ocr.load_agent().ocr(
                     np.array(self.image),
                     cls=True,
                 )
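
One possible shape for the language mapping mentioned in the new TODO is sketched below. This is an assumption about a future change, not code from this commit: `TESSERACT_TO_PADDLE`, `DEFAULT_PADDLE_LANG`, and `tesseract_to_paddle_lang` are hypothetical names, and the table covers only a handful of languages. Because paddle supports one language at a time (per the removed TODO), the sketch picks the first mappable code from a `+`-joined tesseract string and otherwise falls back to `en`.

```python
from typing import Optional

# Hypothetical, partial mapping from tesseract codes to paddle codes.
TESSERACT_TO_PADDLE = {
    "eng": "en",
    "kor": "korean",
    "jpn": "japan",
    "chi_sim": "ch",
    "chi_tra": "chinese_cht",
}

# Matches the default that paddle falls back to in this commit.
DEFAULT_PADDLE_LANG = "en"


def tesseract_to_paddle_lang(ocr_languages: Optional[str]) -> str:
    """Pick a single paddle code for a tesseract string such as "eng+kor"."""
    if not ocr_languages:
        return DEFAULT_PADDLE_LANG
    for code in ocr_languages.split("+"):
        paddle_code = TESSERACT_TO_PADDLE.get(code.strip())
        if paddle_code is not None:
            # Paddle handles one language at a time, so stop at the first match.
            return paddle_code
    return DEFAULT_PADDLE_LANG
```

With such a helper, the call could go back to something like `paddle_ocr.load_agent(language=tesseract_to_paddle_lang(self.ocr_languages))`, using the `language` keyword that the removed line already passed.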
