Commit 051be5a

Remove unstructured.pytesseract fork (#3454)
A second attempt at #3360: this PR removes unstructured's dependency on its own fork of `pytesseract`. (The original reason for the fork, the addition of `run_and_get_multiple_output`, no longer applies as of [this upstream release](https://github.com/madmaze/pytesseract/releases/tag/v0.3.12).)

---------

Co-authored-by: Christine Straub <[email protected]>
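As a rough illustration of the point above, a caller could check at runtime whether the installed `pytesseract` exposes `run_and_get_multiple_output` before relying on it. The helper name `has_multiple_output` is hypothetical and not part of this commit; this is only a sketch of one way to do the check.

```python
import importlib


def has_multiple_output(module_name: str = "pytesseract") -> bool:
    """Return True if the named module exposes run_and_get_multiple_output.

    Hypothetical helper for illustration; not part of unstructured or pytesseract.
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # The package is not installed (or cannot be imported) in this environment.
        return False
    # The attribute must exist and be callable to count as available.
    return callable(getattr(module, "run_and_get_multiple_output", None))
```

Calling `has_multiple_output()` with no arguments inspects whatever `pytesseract` is importable in the current environment, so the result depends on the installed version.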
1 parent 2373eaa commit 051be5a

File tree

27 files changed

+45
-67
lines changed

CHANGELOG.md

Lines changed: 2 additions & 1 deletion
@@ -1,4 +1,4 @@
-## 0.15.2-dev4
+## 0.15.2-dev5
 
 ### Enhancements
 
@@ -12,6 +12,7 @@
 * **Accommodate single-column CSV files.** Resolves a limitation of `partition_csv()` where delimiter detection would fail on a single-column CSV file (which naturally has no delimeters).
 * **Accommodate `image/jpg` in PPTX as alias for `image/jpeg`.** Resolves problem partitioning PPTX files having an invalid `image/jpg` (should be `image/jpeg`) MIME-type in the `[Content_Types].xml` member of the PPTX Zip archive.
 * **Fixes an issue in Object Detection metrics** The issue was in preprocessing/validating the ground truth and predicted data for object detection metrics.
+* **Removes dependency on unstructured.pytesseract** Unstructured forked pytesseract while waiting for code to be upstreamed. Now that the new version has been released, this fork can be removed.
 
 ## 0.15.1

Dockerfile

Lines changed: 1 addition & 1 deletion
@@ -10,7 +10,7 @@ COPY test_unstructured test_unstructured
 COPY example-docs example-docs
 
 RUN chown -R notebook-user:notebook-user /app && \
-apk add font-ubuntu && \
+apk add font-ubuntu git && \
 fc-cache -fv && \
 ln -s /usr/bin/python3.11 /usr/bin/python3

Makefile

Lines changed: 1 addition & 1 deletion
@@ -46,7 +46,7 @@ install-test:
 python3 -m pip install -r requirements/test.txt
 # NOTE(yao) - CI seem to always install tesseract to test so it would make sense to also require
 # pytesseract installation into the virtual env for testing
-python3 -m pip install unstructured.pytesseract -c requirements/deps/constraints.txt
+python3 -m pip install pytesseract -c requirements/deps/constraints.txt
 # python3 -m pip install argilla==1.28.0 -c requirements/deps/constraints.txt
 # NOTE(robinson) - Installing weaviate-client separately here because the requests
 # version conflicts with label_studio_sdk

requirements/deps/constraints.txt

Lines changed: 2 additions & 2 deletions
@@ -22,8 +22,8 @@ Office365-REST-Python-Client<2.4.3
 # unstructured-inference to be upgraded when unstructured library is upgraded
 # https://github.com/Unstructured-IO/unstructured/issues/1458
 # unstructured-inference
-# use the known compatible version of weaviate and unstructured.pytesseract
-unstructured.pytesseract>=0.3.12
+# use the known compatible version of weaviate and pytesseract
+pytesseract @ git+https://github.com/madmaze/pytesseract[email protected]
 weaviate-client>3.25.0
 # Note(yuming) - pining to avoid conflict with paddle install
 matplotlib==3.7.2
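The constraints change above swaps a version specifier (`name>=version`) for a direct reference (`name @ url`). A minimal sketch of telling the two forms apart is shown below. The `Requirement` tuple and `parse_requirement` function are hypothetical names introduced for illustration, and the parser deliberately ignores extras, environment markers, and pip options.

```python
from typing import NamedTuple, Optional


class Requirement(NamedTuple):
    name: str
    url: Optional[str]   # set for "name @ url" direct references
    spec: Optional[str]  # set for version specifiers such as ">=0.3.12"


def parse_requirement(line: str) -> Requirement:
    """Split one simple requirements line into a name plus a URL or a specifier.

    Simplified sketch: no extras, environment markers, or pip options.
    """
    text = line.split(" #", 1)[0].strip()  # drop a trailing " # comment"
    if " @ " in text:
        # Direct reference: everything after the first " @ " is the URL.
        name, url = text.split(" @ ", 1)
        return Requirement(name.strip(), url.strip(), None)
    for index, char in enumerate(text):
        if char in "<>=!~":
            # The first comparison character starts the version specifier.
            return Requirement(text[:index].strip(), None, text[index:].strip())
    return Requirement(text, None, None)


# The URL below is a made-up placeholder, not the one used in this commit.
print(parse_requirement("example-pkg @ git+https://example.invalid/org/example-pkg.git@v1.0.0"))
print(parse_requirement("weaviate-client>3.25.0"))
```

For real tooling, the `packaging` library's requirement parser is likely a more robust choice than hand-rolled splitting.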

requirements/extra-pdf-image.in

Lines changed: 1 addition & 4 deletions
@@ -7,12 +7,9 @@ pdfminer.six
 pikepdf
 pillow_heif
 pypdf
-pytesseract
 google-cloud-vision
 effdet
 # Do not move to constraints.in, otherwise unstructured-inference will not be upgraded
 # when unstructured library is.
 unstructured-inference==0.7.36
-# unstructured fork of pytesseract that provides an interface to allow for multiple output formats
-# from one tesseract call
-unstructured.pytesseract>=0.3.12
+pytesseract>=0.3.12
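The new `pytesseract>=0.3.12` floor can be checked against an installed environment. The sketch below uses only the standard library; `release_tuple`, `meets_minimum`, and `installed_version` are hypothetical helper names, and the comparison handles only plain dotted release numbers (pre-release and local version segments are ignored).

```python
import re
from importlib import metadata
from typing import Optional, Tuple


def release_tuple(version: str) -> Tuple[int, ...]:
    """Convert the leading dotted-number part of a version into a tuple of ints."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        return ()
    return tuple(int(part) for part in match.group().split("."))


def meets_minimum(installed: str, minimum: str = "0.3.12") -> bool:
    """Return True if `installed` is at least `minimum` (simplified comparison)."""
    return release_tuple(installed) >= release_tuple(minimum)


def installed_version(dist_name: str = "pytesseract") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


found = installed_version()
if found is None:
    print("pytesseract is not installed")
else:
    print(found, "meets the floor:", meets_minimum(found))
```

Tuple comparison is a simplification; `packaging.version.Version` would be the safer choice where pre-releases or unusual version strings matter.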

requirements/extra-pdf-image.txt

Lines changed: 4 additions & 8 deletions
@@ -138,7 +138,6 @@ packaging==23.2
 # pikepdf
 # pytesseract
 # transformers
-# unstructured-pytesseract
 pandas==2.2.2
 # via layoutparser
 pdf2image==1.17.0
@@ -163,7 +162,6 @@ pillow==10.4.0
 # pillow-heif
 # pytesseract
 # torchvision
-# unstructured-pytesseract
 pillow-heif==0.18.0
 # via -r ./extra-pdf-image.in
 portalocker==2.10.1
@@ -204,8 +202,10 @@ pypdf==4.3.1
 # -r ./extra-pdf-image.in
 pypdfium2==4.30.0
 # via pdfplumber
-pytesseract==0.3.10
-# via -r ./extra-pdf-image.in
+pytesseract @ git+https://github.com/madmaze/[email protected]
+# via
+# -c ././deps/constraints.txt
+# -r ./extra-pdf-image.in
 python-dateutil==2.9.0.post0
 # via
 # -c ./base.txt
@@ -290,10 +290,6 @@ tzdata==2024.1
 # via pandas
 unstructured-inference==0.7.36
 # via -r ./extra-pdf-image.in
-unstructured-pytesseract==0.3.12
-# via
-# -c ././deps/constraints.txt
-# -r ./extra-pdf-image.in
 urllib3==1.26.19
 # via
 # -c ././deps/constraints.txt

requirements/extra-pptx.txt

Lines changed: 1 addition & 1 deletion
@@ -8,7 +8,7 @@ lxml==5.2.2
 # via python-pptx
 pillow==10.4.0
 # via python-pptx
-python-pptx==1.0.1
+python-pptx==1.0.2
 # via -r ./extra-pptx.in
 typing-extensions==4.12.2
 # via python-pptx

requirements/ingest/embed-aws-bedrock.txt

Lines changed: 1 addition & 1 deletion
@@ -63,7 +63,7 @@ langchain-community==0.2.11
 # via
 # -c ./ingest/../deps/constraints.txt
 # -r ./ingest/embed-aws-bedrock.in
-langchain-core==0.2.28
+langchain-core==0.2.29
 # via
 # langchain
 # langchain-community

requirements/ingest/embed-huggingface.txt

Lines changed: 1 addition & 1 deletion
@@ -45,7 +45,7 @@ jsonpatch==1.33
 # via langchain-core
 jsonpointer==3.0.0
 # via jsonpatch
-langchain-core==0.2.28
+langchain-core==0.2.29
 # via langchain-huggingface
 langchain-huggingface==0.0.3
 # via -r ./ingest/embed-huggingface.in

requirements/ingest/embed-openai.txt

Lines changed: 1 addition & 1 deletion
@@ -53,7 +53,7 @@ jsonpatch==1.33
 # via langchain-core
 jsonpointer==3.0.0
 # via jsonpatch
-langchain-core==0.2.28
+langchain-core==0.2.29
 # via langchain-openai
 langchain-openai==0.1.20
 # via -r ./ingest/embed-openai.in
