Commit 37d2f02

Feat/bump inference (#4013)
Bump `unstructured-inference` to `1.0.5`, which includes a fix to ensure model init is thread safe.

---------

Co-authored-by: ryannikolaidis <[email protected]>
Co-authored-by: badGarnet <[email protected]>
1 parent a7e90f7 commit 37d2f02

11 files changed: 46 additions and 58 deletions
CHANGELOG.md

Lines changed: 4 additions & 3 deletions
@@ -1,15 +1,16 @@
-## 0.17.7
+## 0.17.8
 
 ### Enhancements
-- **Updated Docker file with ENV HF_HUB_OFFLINE=1 to prevent the contianer from trying to access the internet
+- **Bump `unstructured-inference` to `1.0.5`** It includes critical fix to ensure inference model initialization is thread safe
 
 ### Features
 
 ### Fixes
 
-## 0.17.7-dev0
+## 0.17.7
 
 ### Enhancements
+- **Updated Docker file with ENV HF_HUB_OFFLINE=1 to prevent the contianer from trying to access the internet
 
 ### Features
 

requirements/extra-csv.txt

Lines changed: 1 addition & 1 deletion
@@ -8,7 +8,7 @@ numpy==2.0.2
 # via
 # -c ./base.txt
 # pandas
-pandas==2.2.3
+pandas==2.3.0
 # via -r ./extra-csv.in
 python-dateutil==2.9.0.post0
 # via

requirements/extra-pdf-image.in

Lines changed: 1 addition & 1 deletion
@@ -12,5 +12,5 @@ google-cloud-vision
 effdet
 # Do not move to constraints.in, otherwise unstructured-inference will not be upgraded
 # when unstructured library is.
-unstructured-inference>=0.8.10
+unstructured-inference>=1.0.5
 unstructured.pytesseract>=0.3.12

requirements/extra-pdf-image.txt

Lines changed: 8 additions & 8 deletions
@@ -52,7 +52,7 @@ fsspec==2025.5.1
 # torch
 google-api-core[grpc]==2.25.0
 # via google-cloud-vision
-google-auth==2.40.2
+google-auth==2.40.3
 # via
 # google-api-core
 # google-cloud-vision
@@ -69,9 +69,9 @@ grpcio==1.72.1
 # grpcio-status
 grpcio-status==1.72.1
 # via google-api-core
-hf-xet==1.1.2
+hf-xet==1.1.3
 # via huggingface-hub
-huggingface-hub==0.32.3
+huggingface-hub==0.32.4
 # via
 # accelerate
 # timm
@@ -139,7 +139,7 @@ packaging==25.0
 # pikepdf
 # transformers
 # unstructured-pytesseract
-pandas==2.2.3
+pandas==2.3.0
 # via unstructured-inference
 pdf2image==1.17.0
 # via -r ./extra-pdf-image.in
@@ -184,7 +184,7 @@ pyasn1==0.6.1
 # rsa
 pyasn1-modules==0.4.2
 # via google-auth
-pycocotools==2.0.9
+pycocotools==2.0.10
 # via effdet
 pycparser==2.22
 # via
@@ -253,14 +253,14 @@ tokenizers==0.21.1
 # via
 # -c ././deps/constraints.txt
 # transformers
-torch==2.7.0
+torch==2.7.1
 # via
 # accelerate
 # effdet
 # timm
 # torchvision
 # unstructured-inference
-torchvision==0.22.0
+torchvision==0.22.1
 # via
 # effdet
 # timm
@@ -280,7 +280,7 @@ typing-extensions==4.14.0
 # torch
 tzdata==2025.2
 # via pandas
-unstructured-inference==1.0.2
+unstructured-inference==1.0.5
 # via -r ./extra-pdf-image.in
 unstructured-pytesseract==0.3.15
 # via -r ./extra-pdf-image.in

requirements/extra-xlsx.txt

Lines changed: 1 addition & 1 deletion
@@ -14,7 +14,7 @@ numpy==2.0.2
 # pandas
 openpyxl==3.1.5
 # via -r ./extra-xlsx.in
-pandas==2.2.3
+pandas==2.3.0
 # via -r ./extra-xlsx.in
 python-dateutil==2.9.0.post0
 # via

requirements/huggingface.txt

Lines changed: 3 additions & 3 deletions
@@ -25,9 +25,9 @@ fsspec==2025.5.1
 # via
 # huggingface-hub
 # torch
-hf-xet==1.1.2
+hf-xet==1.1.3
 # via huggingface-hub
-huggingface-hub==0.32.3
+huggingface-hub==0.32.4
 # via
 # tokenizers
 # transformers
@@ -90,7 +90,7 @@ tokenizers==0.21.1
 # via
 # -c ././deps/constraints.txt
 # transformers
-torch==2.7.0
+torch==2.7.1
 # via -r ./huggingface.in
 tqdm==4.67.1
 # via

test_unstructured_ingest/expected-structured-output-html/local-single-file-with-pdf-infer-table-structure/layout-parser-paper-with-table.jpg.html

Lines changed: 1 addition & 1 deletion
@@ -26,7 +26,7 @@ <h1 class="Title" id="5d45a28d875e403c7294a15f22a0162f">
 Large Model
 </th>
 <th style="border: 1px solid black;">
-Notes
+| Notes
 </th>
 </tr>
 </thead>

test_unstructured_ingest/expected-structured-output-html/local-single-file-with-pdf-infer-table-structure/layout-parser-paper.pdf.html

Lines changed: 23 additions & 36 deletions
@@ -168,33 +168,21 @@ <h1 class="Title" id="5a1838a8f40b4523094652cf14ab974c">
 Dataset
 </th>
 <th style="border: 1px solid black;">
-|
+| Base Model'|
 </th>
 <th style="border: 1px solid black;">
-Base Model'|
-</th>
-<th style="border: 1px solid black;">
-Large Model |
-</th>
-<th style="border: 1px solid black;">
-Notes
+| Notes
 </th>
 </tr>
 </thead>
 <tbody>
 <tr style="border: 1px solid black;">
 <td style="border: 1px solid black;">
-PubLayNet
-</td>
-<td style="border: 1px solid black;">
-B8]|
+PubLayNet B8]|
 </td>
 <td style="border: 1px solid black;">
 F/M
 </td>
-<td style="border: 1px solid black;">
-M
-</td>
 <td style="border: 1px solid black;">
 Layouts of modern scientific documents
 </td>
@@ -203,14 +191,9 @@ <h1 class="Title" id="5a1838a8f40b4523094652cf14ab974c">
 <td style="border: 1px solid black;">
 PRImA
 </td>
-<td style="border: 1px solid black;">
-</td>
 <td style="border: 1px solid black;">
 M
 </td>
-<td style="border: 1px solid black;">
--
-</td>
 <td style="border: 1px solid black;">
 Layouts of scanned modern magazines and scientific report
 </td>
@@ -219,14 +202,9 @@ <h1 class="Title" id="5a1838a8f40b4523094652cf14ab974c">
 <td style="border: 1px solid black;">
 Newspaper
 </td>
-<td style="border: 1px solid black;">
-</td>
 <td style="border: 1px solid black;">
 F
 </td>
-<td style="border: 1px solid black;">
--
-</td>
 <td style="border: 1px solid black;">
 Layouts of scanned US newspapers from the 20th century
 </td>
@@ -235,11 +213,6 @@ <h1 class="Title" id="5a1838a8f40b4523094652cf14ab974c">
 <td style="border: 1px solid black;">
 TableBank
 </td>
-<td style="border: 1px solid black;">
-</td>
-<td style="border: 1px solid black;">
-F
-</td>
 <td style="border: 1px solid black;">
 F
 </td>
@@ -251,14 +224,9 @@ <h1 class="Title" id="5a1838a8f40b4523094652cf14ab974c">
 <td style="border: 1px solid black;">
 HJDataset
 </td>
-<td style="border: 1px solid black;">
-</td>
 <td style="border: 1px solid black;">
 F/M
 </td>
-<td style="border: 1px solid black;">
--
-</td>
 <td style="border: 1px solid black;">
 Layouts of history Japanese documents
 </td>
@@ -348,7 +316,10 @@ <h1 class="Title" id="2b81bd7a3f21b84379bfcd4bb175c5d1">
 <thead>
 <tr style="border: 1px solid black;">
 <th style="border: 1px solid black;">
-block.pad(top, bottom, right,
+block.pad(top, bottom,
+</th>
+<th style="border: 1px solid black;">
+right,
 </th>
 <th style="border: 1px solid black;">
 left)
@@ -365,6 +336,8 @@ <h1 class="Title" id="2b81bd7a3f21b84379bfcd4bb175c5d1">
 </td>
 <td style="border: 1px solid black;">
 </td>
+<td style="border: 1px solid black;">
+</td>
 <td style="border: 1px solid black;">
 Scale the current block given the ratio in x and y direction
 </td>
@@ -375,6 +348,8 @@ <h1 class="Title" id="2b81bd7a3f21b84379bfcd4bb175c5d1">
 </td>
 <td style="border: 1px solid black;">
 </td>
+<td style="border: 1px solid black;">
+</td>
 <td style="border: 1px solid black;">
 Move the current block with the shift distances in x and y direction
 </td>
@@ -385,6 +360,8 @@ <h1 class="Title" id="2b81bd7a3f21b84379bfcd4bb175c5d1">
 </td>
 <td style="border: 1px solid black;">
 </td>
+<td style="border: 1px solid black;">
+</td>
 <td style="border: 1px solid black;">
 Whether block] is inside of block2
 </td>
@@ -395,6 +372,8 @@ <h1 class="Title" id="2b81bd7a3f21b84379bfcd4bb175c5d1">
 </td>
 <td style="border: 1px solid black;">
 </td>
+<td style="border: 1px solid black;">
+</td>
 <td style="border: 1px solid black;">
 Return the intersection region of blockl and block2. Coordinate type to be determined based on the inputs
 </td>
@@ -405,6 +384,8 @@ <h1 class="Title" id="2b81bd7a3f21b84379bfcd4bb175c5d1">
 </td>
 <td style="border: 1px solid black;">
 </td>
+<td style="border: 1px solid black;">
+</td>
 <td style="border: 1px solid black;">
 Return the union region of blockl and block2. Coordinate type to be determined based on the inputs
 </td>
@@ -415,6 +396,8 @@ <h1 class="Title" id="2b81bd7a3f21b84379bfcd4bb175c5d1">
 </td>
 <td style="border: 1px solid black;">
 </td>
+<td style="border: 1px solid black;">
+</td>
 <td style="border: 1px solid black;">
 Convert the absolute coordinates of block to relative coordinates to block2
 </td>
@@ -425,6 +408,8 @@ <h1 class="Title" id="2b81bd7a3f21b84379bfcd4bb175c5d1">
 </td>
 <td style="border: 1px solid black;">
 </td>
+<td style="border: 1px solid black;">
+</td>
 <td style="border: 1px solid black;">
 Calculate the absolute coordinates of blockl given the canvas block2’s absolute coordinates
 </td>
@@ -435,6 +420,8 @@ <h1 class="Title" id="2b81bd7a3f21b84379bfcd4bb175c5d1">
 </td>
 <td style="border: 1px solid black;">
 </td>
+<td style="border: 1px solid black;">
+</td>
 <td style="border: 1px solid black;">
 Obtain the image segments in the block region
 </td>

test_unstructured_ingest/expected-structured-output/local-single-file-with-pdf-infer-table-structure/layout-parser-paper-with-table.jpg.json

Lines changed: 1 addition & 1 deletion
@@ -48,7 +48,7 @@
 "element_id": "dddac446da6c93dc1449ecb5d997c423",
 "text": "Dataset | Base Model\" Large Model | Notes PubLayNet [38] P/M M Layouts of modern scientific documents PRImA [3) M - Layouts of scanned modern magazines and scientific reports Newspaper [17] P - Layouts of scanned US newspapers from the 20th century ‘TableBank (18) P P Table region on modern scientific and business document HJDataset (31) | F/M - Layouts of history Japanese documents",
 "metadata": {
-"text_as_html": "<table><thead><tr><th>Dataset</th><th>| Base Model!|</th><th>Large Model</th><th>Notes</th></tr></thead><tbody><tr><td>PubLayNet [33]</td><td>P/M</td><td>M</td><td>Layouts of modern scientific documents</td></tr><tr><td>PRImA [3]</td><td>M</td><td></td><td>Layouts of scanned modern magazines and scientific reports</td></tr><tr><td>Newspaper [17]</td><td>P</td><td></td><td>Layouts of scanned US newspapers from the 20th century</td></tr><tr><td>TableBank [18]</td><td>P</td><td></td><td>Table region on modern scientific and business document</td></tr><tr><td>HIDataset [31]</td><td>P/M</td><td></td><td>Layouts of history Japanese documents</td></tr></tbody></table>",
+"text_as_html": "<table><thead><tr><th>Dataset</th><th>| Base Model!|</th><th>Large Model</th><th>| Notes</th></tr></thead><tbody><tr><td>PubLayNet [33]</td><td>P/M</td><td>M</td><td>Layouts of modern scientific documents</td></tr><tr><td>PRImA [3]</td><td>M</td><td></td><td>Layouts of scanned modern magazines and scientific reports</td></tr><tr><td>Newspaper [17]</td><td>P</td><td></td><td>Layouts of scanned US newspapers from the 20th century</td></tr><tr><td>TableBank [18]</td><td>P</td><td></td><td>Table region on modern scientific and business document</td></tr><tr><td>HIDataset [31]</td><td>P/M</td><td></td><td>Layouts of history Japanese documents</td></tr></tbody></table>",
 "filetype": "image/jpeg",
 "languages": [
 "eng"

test_unstructured_ingest/expected-structured-output/local-single-file-with-pdf-infer-table-structure/layout-parser-paper.pdf.json

Lines changed: 2 additions & 2 deletions
@@ -1459,7 +1459,7 @@
 "start_index": 65
 }
 ],
-"text_as_html": "<table><thead><tr><th>Dataset</th><th>|</th><th>Base Model'|</th><th>Large Model |</th><th>Notes</th></tr></thead><tbody><tr><td>PubLayNet</td><td>B8]|</td><td>F/M</td><td>M</td><td>Layouts of modern scientific documents</td></tr><tr><td>PRImA</td><td></td><td>M</td><td>-</td><td>Layouts of scanned modern magazines and scientific report</td></tr><tr><td>Newspaper</td><td></td><td>F</td><td>-</td><td>Layouts of scanned US newspapers from the 20th century</td></tr><tr><td>TableBank</td><td></td><td>F</td><td>F</td><td>Table region on modern scientific and business document</td></tr><tr><td>HJDataset</td><td></td><td>F/M</td><td>-</td><td>Layouts of history Japanese documents</td></tr></tbody></table>",
+"text_as_html": "<table><thead><tr><th>Dataset</th><th>| Base Model'|</th><th>| Notes</th></tr></thead><tbody><tr><td>PubLayNet B8]|</td><td>F/M</td><td>Layouts of modern scientific documents</td></tr><tr><td>PRImA</td><td>M</td><td>Layouts of scanned modern magazines and scientific report</td></tr><tr><td>Newspaper</td><td>F</td><td>Layouts of scanned US newspapers from the 20th century</td></tr><tr><td>TableBank</td><td>F</td><td>Table region on modern scientific and business document</td></tr><tr><td>HJDataset</td><td>F/M</td><td>Layouts of history Japanese documents</td></tr></tbody></table>",
 "filetype": "application/pdf",
 "languages": [
 "eng"
@@ -2153,7 +2153,7 @@
 "element_id": "64bc79d1132a89c71837f420d6e4e2dc",
 "text": "Operation Name Description block.pad(top, bottom, right, left) Enlarge the current block according to the input block.scale(fx, fy) Scale the current block given the ratio in x and y direction block.shift(dx, dy) Move the current block with the shift distances in x and y direction block1.is in(block2) Whether block1 is inside of block2 block1.intersect(block2) Return the intersection region of block1 and block2. Coordinate type to be determined based on the inputs. block1.union(block2) Return the union region of block1 and block2. Coordinate type to be determined based on the inputs. block1.relative to(block2) Convert the absolute coordinates of block1 to relative coordinates to block2 block1.condition on(block2) Calculate the absolute coordinates of block1 given the canvas block2’s absolute coordinates block.crop image(image) Obtain the image segments in the block region",
 "metadata": {
-"text_as_html": "<table><thead><tr><th>block.pad(top, bottom, right,</th><th>left)</th><th>Enlarge the current block according to the input</th></tr></thead><tbody><tr><td>block.scale(fx, fy)</td><td></td><td>Scale the current block given the ratio in x and y direction</td></tr><tr><td>block.shift(dx, dy)</td><td></td><td>Move the current block with the shift distances in x and y direction</td></tr><tr><td>block1.is_in(block2)</td><td></td><td>Whether block] is inside of block2</td></tr><tr><td>block1. intersect (block2)</td><td></td><td>Return the intersection region of blockl and block2. Coordinate type to be determined based on the inputs</td></tr><tr><td>block1.union(block2)</td><td></td><td>Return the union region of blockl and block2. Coordinate type to be determined based on the inputs</td></tr><tr><td>block1.relative_to(block2)</td><td></td><td>Convert the absolute coordinates of block to relative coordinates to block2</td></tr><tr><td>block1.condition_on(block2)</td><td></td><td>Calculate the absolute coordinates of blockl given the canvas block2’s absolute coordinates</td></tr><tr><td>block. crop_image (image)</td><td></td><td>Obtain the image segments in the block region</td></tr></tbody></table>",
+"text_as_html": "<table><thead><tr><th>block.pad(top, bottom,</th><th>right,</th><th>left)</th><th>Enlarge the current block according to the input</th></tr></thead><tbody><tr><td>block.scale(fx, fy)</td><td></td><td></td><td>Scale the current block given the ratio in x and y direction</td></tr><tr><td>block.shift(dx, dy)</td><td></td><td></td><td>Move the current block with the shift distances in x and y direction</td></tr><tr><td>block1.is_in(block2)</td><td></td><td></td><td>Whether block] is inside of block2</td></tr><tr><td>block1. intersect (block2)</td><td></td><td></td><td>Return the intersection region of blockl and block2. Coordinate type to be determined based on the inputs</td></tr><tr><td>block1.union(block2)</td><td></td><td></td><td>Return the union region of blockl and block2. Coordinate type to be determined based on the inputs</td></tr><tr><td>block1.relative_to(block2)</td><td></td><td></td><td>Convert the absolute coordinates of block to relative coordinates to block2</td></tr><tr><td>block1.condition_on(block2)</td><td></td><td></td><td>Calculate the absolute coordinates of blockl given the canvas block2’s absolute coordinates</td></tr><tr><td>block. crop_image (image)</td><td></td><td></td><td>Obtain the image segments in the block region</td></tr></tbody></table>",
 "filetype": "application/pdf",
 "languages": [
 "eng"
