Commit c32aeaa

fix: wait to run soffice until there is no other soffice process running (#3287)
## Summary

This PR addresses an issue where the code could attempt to run `soffice` in multiple processes at once. Closes #3284. The fix adds a wait mechanism for the case where another `soffice` process is already running.

## Diagnosis of issue

- Only one `soffice` process can run at a time when the `soffice` command is used as is.
- On the main branch, the function `partition.common.convert_office_doc` simply spawns a subprocess that runs the `soffice` command to convert a `doc` or `ppt` file into `docx` or `pptx` format.
- If multiple partition calls process `doc` or `ppt` files and each tries to spawn a `soffice` subprocess, only one succeeds; the others simply fail, and the subprocess returns 1.
- Downstream, this leads to errors like `PackageNotFoundError: Package not found at '/tmp/tmpac6lcu4w/document.docx'`.

## Solution

While there are [ways](https://www.reddit.com/r/libreoffice/comments/agk3os/how_to_open_more_than_one_calc_instance_under/) to circumvent this limit of `soffice` by setting a tmp file as the user installation env, these kinds of solutions rely on the internals of `soffice` and add the maintenance cost of tracking its changes.

This PR solves the problem by adding a wait mechanism:

- We first spawn a subprocess to run `soffice`.
- If `stdout` is empty and there is still wait-time budget left, the function checks whether another `soffice` process is running:
  * If yes, the function waits 0.01s before checking again.
  * If no, the function spawns a subprocess to run `soffice` and returns to the beginning of this step.
  * We need to return to the beginning and check whether `stdout` is empty because another collision could occur right after `soffice` becomes available.

## Test

This PR adds two unit tests. Additionally, this can be tested by running partition on `.doc` files locally with multiprocessing.
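The wait mechanism described in the solution can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this commit: the names `convert_with_wait`, `run_soffice`, `soffice_is_running`, `wait_budget`, and `poll_interval` are hypothetical, and the process check assumes `psutil` is installed (it is imported lazily so the loop itself depends only on the standard library).

```python
import subprocess
import time
from typing import Callable, Sequence, Tuple


def soffice_is_running() -> bool:
    """Return True if any process name contains 'soffice'.

    Assumption: psutil is available; the real commit may detect this differently.
    """
    import psutil  # lazy import so the wait loop can be used without psutil

    for proc in psutil.process_iter(["name"]):
        name = proc.info.get("name") or ""
        if "soffice" in name.lower():
            return True
    return False


def run_soffice(command: Sequence[str]) -> Tuple[str, str]:
    """Run the conversion command once and return (stdout, stderr)."""
    result = subprocess.run(command, capture_output=True, text=True)
    return result.stdout, result.stderr


def convert_with_wait(
    command: Sequence[str],
    wait_budget: float = 10.0,    # hypothetical default, in seconds
    poll_interval: float = 0.01,  # the 0.01s wait from the PR description
    run: Callable[[Sequence[str]], Tuple[str, str]] = run_soffice,
    is_running: Callable[[], bool] = soffice_is_running,
) -> Tuple[str, str]:
    """Retry the command until it produces stdout or the budget runs out."""
    # First attempt: this may collide with another soffice process.
    stdout, stderr = run(command)
    deadline = time.monotonic() + wait_budget

    # Empty stdout is treated as "the conversion did not happen".
    while not stdout and time.monotonic() < deadline:
        if is_running():
            # Another soffice holds the slot: wait briefly, then check again.
            time.sleep(poll_interval)
        else:
            # The slot looks free: try again, then re-check stdout at the top
            # of the loop in case another collision occurred right away.
            stdout, stderr = run(command)
    return stdout, stderr
```

Passing `run` and `is_running` as parameters is a design choice of this sketch only; it lets the loop be exercised with fakes instead of a real `soffice` installation.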
1 parent a7a53f6 commit c32aeaa

17 files changed: +113 -38 lines changed

CHANGELOG.md

Lines changed: 2 additions & 0 deletions
@@ -6,6 +6,8 @@
 
 ### Fixes
 
+* **Fix a bug where multiple `soffice` processes could be attempted** Add a wait mechanism in `convert_office_doc` so that the function first checks if another `soffice` is running already: if yes wait till the other process finishes or till the wait timeout before spawning a subprocess to run `soffice`
+
 ## 0.14.8
 
 ### Enhancements

requirements/dev.txt

Lines changed: 3 additions & 1 deletion
@@ -267,7 +267,9 @@ prompt-toolkit==3.0.47
 # ipython
 # jupyter-console
 psutil==6.0.0
-# via ipykernel
+# via
+# -c ./test.txt
+# ipykernel
 ptyprocess==0.7.0
 # via
 # pexpect

requirements/extra-paddleocr.txt

Lines changed: 1 addition & 1 deletion
@@ -8,7 +8,7 @@ attrdict==2.0.1
 # via unstructured-paddleocr
 babel==2.15.0
 # via flask-babel
-bce-python-sdk==0.9.14
+bce-python-sdk==0.9.17
 # via visualdl
 blinker==1.8.2
 # via flask

requirements/extra-pdf-image.txt

Lines changed: 2 additions & 2 deletions
@@ -46,15 +46,15 @@ fsspec==2024.5.0
 # -c ././deps/constraints.txt
 # huggingface-hub
 # torch
-google-api-core[grpc]==2.19.0
+google-api-core[grpc]==2.19.1
 # via google-cloud-vision
 google-auth==2.30.0
 # via
 # google-api-core
 # google-cloud-vision
 google-cloud-vision==3.7.2
 # via -r ./extra-pdf-image.in
-googleapis-common-protos==1.63.1
+googleapis-common-protos==1.63.2
 # via
 # google-api-core
 # grpcio-status

requirements/ingest/chroma.txt

Lines changed: 2 additions & 2 deletions
@@ -65,7 +65,7 @@ fsspec==2024.5.0
 # huggingface-hub
 google-auth==2.30.0
 # via kubernetes
-googleapis-common-protos==1.63.1
+googleapis-common-protos==1.63.2
 # via opentelemetry-exporter-otlp-proto-grpc
 grpcio==1.64.1
 # via
@@ -232,7 +232,7 @@ starlette==0.37.2
 # via fastapi
 sympy==1.12.1
 # via onnxruntime
-tenacity==8.4.1
+tenacity==8.4.2
 # via chromadb
 tokenizers==0.19.1
 # via chromadb

requirements/ingest/clarifai.txt

Lines changed: 1 addition & 1 deletion
@@ -19,7 +19,7 @@ clarifai-grpc==10.5.3
 # via clarifai
 contextlib2==21.6.0
 # via schema
-googleapis-common-protos==1.63.1
+googleapis-common-protos==1.63.2
 # via clarifai-grpc
 grpcio==1.64.1
 # via clarifai-grpc

requirements/ingest/embed-aws-bedrock.txt

Lines changed: 2 additions & 2 deletions
@@ -120,7 +120,7 @@ requests==2.32.3
 # langchain
 # langchain-community
 # langsmith
-s3transfer==0.10.1
+s3transfer==0.10.2
 # via boto3
 six==1.16.0
 # via
@@ -130,7 +130,7 @@ sqlalchemy==2.0.31
 # via
 # langchain
 # langchain-community
-tenacity==8.4.1
+tenacity==8.4.2
 # via
 # langchain
 # langchain-community

requirements/ingest/embed-huggingface.txt

Lines changed: 1 addition & 1 deletion
@@ -167,7 +167,7 @@ sqlalchemy==2.0.31
 # langchain-community
 sympy==1.12.1
 # via torch
-tenacity==8.4.1
+tenacity==8.4.2
 # via
 # langchain
 # langchain-community

requirements/ingest/embed-openai.txt

Lines changed: 1 addition & 1 deletion
@@ -141,7 +141,7 @@ sqlalchemy==2.0.31
 # via
 # langchain
 # langchain-community
-tenacity==8.4.1
+tenacity==8.4.2
 # via
 # langchain
 # langchain-community

requirements/ingest/embed-vertexai.txt

Lines changed: 4 additions & 4 deletions
@@ -39,7 +39,7 @@ frozenlist==1.4.1
 # via
 # aiohttp
 # aiosignal
-google-api-core[grpc]==2.19.0
+google-api-core[grpc]==2.19.1
 # via
 # google-cloud-aiplatform
 # google-cloud-bigquery
@@ -76,12 +76,12 @@ google-resumable-media==2.7.1
 # via
 # google-cloud-bigquery
 # google-cloud-storage
-googleapis-common-protos[grpc]==1.63.1
+googleapis-common-protos[grpc]==1.63.2
 # via
 # google-api-core
 # grpc-google-iam-v1
 # grpcio-status
-grpc-google-iam-v1==0.13.0
+grpc-google-iam-v1==0.13.1
 # via google-cloud-resource-manager
 grpcio==1.64.1
 # via
@@ -210,7 +210,7 @@ sqlalchemy==2.0.31
 # via
 # langchain
 # langchain-community
-tenacity==8.4.1
+tenacity==8.4.2
 # via
 # langchain
 # langchain-community
