Commit d99b399

build(deps): Remove unstructured.paddlepaddle fork (#3506)
This PR removes the `unstructured.paddlepaddle` fork. Previously, we used the `unstructured.paddlepaddle` fork to support `unstructured.paddleocr` on the `arm64` architecture. Currently, however, `unstructured.paddleocr` with `unstructured.paddlepaddle` fails to work on `arm64`, while `unstructured.paddleocr` with the latest version of the original `paddlepaddle` works on both `amd64` and `arm64`.

### Testing

```
os.environ["OCR_AGENT"] = "unstructured.partition.utils.ocr_models.paddle_ocr.OCRAgentPaddle"
elements = partition_pdf(
    filename=<file_path>,
    strategy="hi_res",
    infer_table_structure=True,
)
```
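A self-contained variant of the test above could be wrapped in two small functions, as sketched below. The helper names `select_paddle_ocr_agent` and `smoke_test` are hypothetical and not part of this commit; the import path `unstructured.partition.pdf` is assumed from the library's public layout, and the PDF path is left to the caller.

```python
import os
import platform

# Value taken from the commit message; selects the Paddle-backed OCR agent.
PADDLE_AGENT = "unstructured.partition.utils.ocr_models.paddle_ocr.OCRAgentPaddle"


def select_paddle_ocr_agent() -> str:
    """Set OCR_AGENT to the Paddle agent and return the value now in effect."""
    os.environ["OCR_AGENT"] = PADDLE_AGENT
    return os.environ["OCR_AGENT"]


def smoke_test(pdf_path: str):
    """Partition one PDF with the hi_res strategy and table inference enabled."""
    select_paddle_ocr_agent()
    # Imported lazily so this module still loads where unstructured is absent.
    from unstructured.partition.pdf import partition_pdf

    # Report the architecture, since the commit concerns amd64 vs. arm64.
    print(f"architecture: {platform.machine()}")
    return partition_pdf(
        filename=pdf_path,
        strategy="hi_res",
        infer_table_structure=True,
    )
```

Calling `smoke_test("path/to/some.pdf")` once on an `amd64` host and once on an `arm64` host would mirror the check described in the commit message, assuming the paddleocr extra requirements are installed.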
1 parent a2ae2ed commit d99b399

6 files changed, with 53 additions and 16 deletions

.github/workflows/ci.yml

Lines changed: 0 additions & 1 deletion
@@ -72,7 +72,6 @@ jobs:
 - name: Install all doc and test dependencies
   run: |
     make install-ci
-    make install-paddleocr
     make install-all-ingest
     make check-licenses

Dockerfile

Lines changed: 0 additions & 1 deletion
@@ -17,7 +17,6 @@ RUN chown -R notebook-user:notebook-user /app && \
 USER notebook-user

 RUN find requirements/ -type f -name "*.txt" -exec pip3.11 install --no-cache-dir --user -r '{}' ';' && \
-  pip3.11 install unstructured.paddlepaddle && \
   python3.11 -c "from unstructured.nlp.tokenize import download_nltk_packages; download_nltk_packages()" && \
   python3.11 -c "from unstructured.partition.model_init import initialize; initialize()" && \
   python3.11 -c "from unstructured_inference.models.tables import UnstructuredTableTransformerModel; model = UnstructuredTableTransformerModel(); model.initialize('microsoft/table-transformer-structure-recognition')"

Makefile

Lines changed: 0 additions & 4 deletions
@@ -277,10 +277,6 @@ install-local-inference: install install-all-docs
 install-pandoc:
 	ARCH=${ARCH} ./scripts/install-pandoc.sh

-.PHONY: install-paddleocr
-install-paddleocr:
-	ARCH=${ARCH} ./scripts/install-paddleocr.sh
-
 ## pip-compile: compiles all base/dev/test requirements
 .PHONY: pip-compile
 pip-compile:

requirements/extra-paddleocr.in

Lines changed: 1 addition & 0 deletions
@@ -1,4 +1,5 @@
 -c ./deps/constraints.txt
 -c base.txt

+paddlepaddle==3.0.0b1 -i https://www.paddlepaddle.org.cn/packages/stable/cpu/
 unstructured.paddleocr==2.8.0.1
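Since the requirements file now pins `paddlepaddle==3.0.0b1` directly, one way to confirm that an environment picked up the upstream package at that version is a small check with the standard library's `importlib.metadata`. This is only a sketch: the helper names `installed_version` and `pin_is_satisfied` are hypothetical and not part of this commit.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional

# Pin taken from requirements/extra-paddleocr.in in this commit.
EXPECTED_PADDLEPADDLE = "3.0.0b1"


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def pin_is_satisfied(package: str, expected: str) -> bool:
    """True only when `package` is installed at exactly the `expected` version."""
    return installed_version(package) == expected
```

For example, `pin_is_satisfied("paddlepaddle", EXPECTED_PADDLEPADDLE)` would be expected to return `True` after installing `requirements/extra-paddleocr.txt`, and `False` in an environment where the package is missing or at another version.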

requirements/extra-paddleocr.txt

Lines changed: 52 additions & 1 deletion
@@ -4,6 +4,13 @@
 #
 #    pip-compile ./extra-paddleocr.in
 #
+anyio==3.7.1
+    # via
+    #   -c ././deps/constraints.txt
+    #   -c ./base.txt
+    #   httpx
+astor==0.8.1
+    # via paddlepaddle
 attrdict==2.0.1
     # via unstructured-paddleocr
 cachetools==5.4.0
@@ -12,6 +19,8 @@ certifi==2024.7.4
     # via
     #   -c ././deps/constraints.txt
     #   -c ./base.txt
+    #   httpcore
+    #   httpx
     #   requests
 charset-normalizer==3.3.2
     # via
@@ -27,13 +36,33 @@ cycler==0.12.1
     # via matplotlib
 cython==3.0.11
     # via unstructured-paddleocr
+decorator==5.1.1
+    # via paddlepaddle
 et-xmlfile==1.1.0
     # via openpyxl
+exceptiongroup==1.2.2
+    # via
+    #   -c ./base.txt
+    #   anyio
 fonttools==4.53.1
     # via matplotlib
+h11==0.14.0
+    # via
+    #   -c ./base.txt
+    #   httpcore
+httpcore==1.0.5
+    # via
+    #   -c ./base.txt
+    #   httpx
+httpx==0.27.0
+    # via
+    #   -c ./base.txt
+    #   paddlepaddle
 idna==3.7
     # via
     #   -c ./base.txt
+    #   anyio
+    #   httpx
     #   requests
 imageio==2.34.2
     # via
@@ -59,7 +88,9 @@ matplotlib==3.9.1.post1
 more-itertools==10.4.0
     # via cssutils
 networkx==3.2.1
-    # via scikit-image
+    # via
+    #   paddlepaddle
+    #   scikit-image
 numpy==1.26.4
     # via
     #   -c ./base.txt
@@ -69,6 +100,8 @@ numpy==1.26.4
     #   matplotlib
     #   opencv-contrib-python
     #   opencv-python
+    #   opt-einsum
+    #   paddlepaddle
     #   scikit-image
     #   scipy
     #   shapely
@@ -85,25 +118,34 @@ opencv-python==4.8.0.76
     #   unstructured-paddleocr
 openpyxl==3.1.5
     # via unstructured-paddleocr
+opt-einsum==3.3.0
+    # via paddlepaddle
 packaging==23.2
     # via
     #   -c ././deps/constraints.txt
     #   -c ./base.txt
     #   lazy-loader
     #   matplotlib
     #   scikit-image
+paddlepaddle==3.0.0b1
+    # via -r ./extra-paddleocr.in
 pdf2image==1.17.0
     # via unstructured-paddleocr
 pillow==10.4.0
     # via
     #   imageio
     #   imgaug
     #   matplotlib
+    #   paddlepaddle
     #   pdf2image
     #   scikit-image
     #   unstructured-paddleocr
 premailer==3.10.0
     # via unstructured-paddleocr
+protobuf==4.23.4
+    # via
+    #   -c ././deps/constraints.txt
+    #   paddlepaddle
 pyclipper==1.3.0.post5
     # via unstructured-paddleocr
 pyparsing==3.0.9
@@ -144,12 +186,21 @@ six==1.16.0
     #   attrdict
     #   imgaug
     #   python-dateutil
+sniffio==1.3.1
+    # via
+    #   -c ./base.txt
+    #   anyio
+    #   httpx
 tifffile==2024.7.24
     # via scikit-image
 tqdm==4.66.5
     # via
     #   -c ./base.txt
     #   unstructured-paddleocr
+typing-extensions==4.12.2
+    # via
+    #   -c ./base.txt
+    #   paddlepaddle
 unstructured-paddleocr==2.8.0.1
     # via -r ./extra-paddleocr.in
 urllib3==1.26.19

scripts/install-paddleocr.sh

Lines changed: 0 additions & 9 deletions
This file was deleted.
