chore: stop passing language code from tesseract mapping to paddle (#226)
### Summary
A user reported an assertion error for the paddle language code:
```
AssertionError: param lang must in dict_keys(['ch', 'en', 'korean', 'japan', 'chinese_cht', 'ta', 'te', 'ka', 'latin', 'arabic', 'cyrillic', 'devanagari']), but got eng
```
They also tried setting the `ocr_languages` param to `en` (the correct lang
code for English in paddle), but that didn't work either.
The reason is that `ocr_languages` goes through the tesseract lang code
mapping, which converts `en` to `eng` since that's the correct lang code
for English in tesseract.
The quick workaround here is to stop passing the lang code to paddle and
let it use its default, `en`. This will be addressed properly once we have
a lang code mapping for paddle.
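The failure mode and the workaround can be sketched in a few lines of Python. This is an illustration only, not the code in the repo: `TESSERACT_LANG_MAP`, `to_tesseract_code`, `check_paddle_lang`, and the two `paddle_lang_*` functions are hypothetical names. The set of paddle codes is copied from the error message above, and the only mapping entry taken from this description is `en` to `eng`.

```python
# Paddle language codes, copied from the AssertionError message above.
PADDLE_LANGS = {
    "ch", "en", "korean", "japan", "chinese_cht", "ta", "te", "ka",
    "latin", "arabic", "cyrillic", "devanagari",
}

# Hypothetical stand-in for the tesseract mapping. Only the en -> eng
# entry is taken from this PR description.
TESSERACT_LANG_MAP = {"en": "eng"}


def to_tesseract_code(lang: str) -> str:
    """Convert a user-supplied code to a tesseract code (assumed behavior)."""
    return TESSERACT_LANG_MAP.get(lang, lang)


def check_paddle_lang(lang: str) -> str:
    """Mimic paddle's validation: reject codes outside its supported set."""
    if lang not in PADDLE_LANGS:
        raise AssertionError(
            f"param lang must in {sorted(PADDLE_LANGS)}, but got {lang}"
        )
    return lang


def paddle_lang_before_fix(ocr_languages: str) -> str:
    # The tesseract-mapped code is forwarded to paddle, so "en" becomes "eng".
    return check_paddle_lang(to_tesseract_code(ocr_languages))


def paddle_lang_after_fix(ocr_languages: str) -> str:
    # Workaround: ignore the mapped code and let paddle use its default "en".
    return check_paddle_lang("en")
```

In this sketch, `paddle_lang_before_fix("en")` raises the assertion error because paddle receives `eng`, while `paddle_lang_after_fix("en")` returns `en` regardless of the input.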
### Test
It looks like the user tried this branch and got the lang parameter working,
per the [linked
comments](Unstructured-IO/unstructured-api#247 (comment))
:)
On the api repo:
```
pip install paddlepaddle
pip install "unstructured.PaddleOCR"
export ENTIRE_PAGE_OCR=paddle
make run-web-app
```
* Check the error before this change:
```
curl -X 'POST' 'http://localhost:8000/general/v0/general' -H 'accept: application/json' -F 'files=@sample-docs/english-and-korean.png' -F 'ocr_languages=en' | jq -C . | less -R
```
You will see the error:
```
{
"detail": "param lang must in dict_keys(['ch', 'en', 'korean', 'japan', 'chinese_cht', 'ta', 'te', 'ka', 'latin', 'arabic', 'cyrillic', 'devanagari']), but got eng"
}
```
The logger also shows `INFO Loading paddle with CPU on language=eng...`,
since the tesseract mapping converts `en` to `eng`.
* Check after this change:
Check out this branch and install the inference repo into your env (the
same env that's running the api) with `pip install -e .`
Rerun `make run-web-app`
Run the curl command again. You won't get a result on an M1 chip since
paddle doesn't work on it, but the logger shows
`2023-09-27 12:48:48,120 unstructured_inference INFO Loading paddle with CPU on language=en...`,
which means the lang parameter is using the default `en` (the log message
comes from [this
line](https://github.com/Unstructured-IO/unstructured-inference/blob/main/unstructured_inference/models/paddle_ocr.py#L22)).
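If it helps to check the log output programmatically, a minimal sketch follows. It assumes the log line has the shape quoted above; `paddle_language_from_log` and `LOG_PATTERN` are hypothetical helper names, not part of either repo.

```python
import re
from typing import Optional

# Pattern based on the log line quoted in this description (assumed format).
LOG_PATTERN = re.compile(
    r"Loading paddle with (?P<device>\w+) on language=(?P<lang>\w+)"
)


def paddle_language_from_log(line: str) -> Optional[str]:
    """Return the language code from a paddle loading log line, if present."""
    match = LOG_PATTERN.search(line)
    return match.group("lang") if match else None
```

Before the change this helper would report `eng` for the log line shown earlier; after the change it should report `en`.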
---------
Co-authored-by: shreyanid <[email protected]>