CHANGELOG.md
Lines changed: 4 additions & 1 deletion
@@ -1,9 +1,12 @@
+## 0.0.48-dev0
+
+* **Adds `languages` kwarg** `ocr_languages` will eventually be deprecated and replaced by `languages` to specify what languages to use for OCR
+
## 0.0.47
* **Adds `chunking_strategy` kwarg and associated params** These params allow users to "chunk" elements into larger or smaller `CompositeElement`s
* **Remove `parent_id` from the element metadata**. New metadata fields are causing errors with existing installs. We'll re-add this once a fix is widely available.
* **Fix some PDFs incorrectly returning a "file is encrypted" error**. The `pypdf.is_encrypted` check caused us to return this error even if the file is readable.
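The `languages` entry in 0.0.48-dev0 describes a gradual migration away from `ocr_languages`. A minimal sketch of how such a transition could be handled follows. It is not the project's actual implementation: the helper name `resolve_languages`, the `+`-separated legacy format, and the `eng` default are all assumptions made for illustration.

```python
import warnings
from typing import List, Optional


def resolve_languages(
    languages: Optional[List[str]] = None,
    ocr_languages: Optional[str] = None,
) -> List[str]:
    """Hypothetical helper: prefer `languages`, fall back to legacy `ocr_languages`."""
    if languages:
        # The new kwarg wins whenever it is supplied.
        return list(languages)
    if ocr_languages:
        warnings.warn(
            "`ocr_languages` will eventually be deprecated; use `languages` instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        # Assumption: the legacy value is a Tesseract-style string such as "eng+kor".
        return [code for code in ocr_languages.split("+") if code]
    # Assumption: fall back to English when nothing is specified.
    return ["eng"]
```

Giving the new kwarg precedence lets callers migrate at their own pace, while the warning flags the remaining legacy call sites.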
README.md
Lines changed: 17 additions & 0 deletions
@@ -86,6 +86,7 @@ We also support models to be used locally, for example, `yolox`. Please refer to
#### OCR languages
+Note: This kwarg will eventually be deprecated. Please use `languages`.
You can also specify what languages to use for OCR with the `ocr_languages` kwarg. See the [Tesseract documentation](https://github.com/tesseract-ocr/tessdata) for a full list of languages and install instructions. OCR is only applied if the text is not already available in the PDF document.
```
@@ -100,6 +101,22 @@ curl -X 'POST' \
| jq -C . | less -R
```
+#### Languages
+
+You can also specify what languages to use for OCR with the `languages` kwarg. See the [Tesseract documentation](https://github.com/tesseract-ocr/tessdata) for a full list of languages and install instructions. OCR is only applied if the text is not already available in the PDF document.
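As a rough illustration of the `languages` kwarg, the sketch below builds the form fields a client might send with a request. The helper `build_language_fields` is hypothetical, and the assumption that each language is sent as its own repeated form field is not confirmed by this diff. The codes `eng` and `kor` are Tesseract language codes.

```python
from typing import List, Tuple


def build_language_fields(
    languages: List[str], legacy: bool = False
) -> List[Tuple[str, str]]:
    """Hypothetical helper: one (field, value) pair per requested OCR language."""
    # Use the legacy field name only when explicitly requested.
    field = "ocr_languages" if legacy else "languages"
    return [(field, code) for code in languages]


fields = build_language_fields(["eng", "kor"])
print(fields)  # [('languages', 'eng'), ('languages', 'kor')]
```

Passing `legacy=True` produces the same pairs under the `ocr_languages` name, which may help when talking to a server that predates the new kwarg.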
When elements are extracted from PDFs or images, it may be useful to get their bounding boxes as well. Set the `coordinates` parameter to `true` to add this field to the elements in the response.