Commit df1f7bc

Save table prediction in cells format (#2892)
This pull request allows the table transformer to return predictions in a raw cell representation. This will later be used to save predictions in a cells format for simpler metrics calculation. This PR must be merged after Unstructured-IO/unstructured-inference#335.
1 parent 3843af6 commit df1f7bc

14 files changed

+764
-14
lines changed

CHANGELOG.md

Lines changed: 2 additions & 1 deletion
@@ -1,4 +1,4 @@
-## 0.13.4-dev2
+## 0.13.4-dev3

 ### Enhancements
 * **Unique and deterministic hash IDs for elements** Element IDs produced by any partitioning
@@ -8,6 +8,7 @@
 * **Enable remote chunking via unstructured-ingest** Chunking using unstructured-ingest was
 previously limited to local chunking using the strategies `basic` and `by_title`. Remote chunking
 options via the API are now accessible.
+* **Save table in cells format**. `UnstructuredTableTransformerModel` is able to return predicted table in cells format

 ### Features

requirements/dev.txt

Lines changed: 0 additions & 4 deletions
@@ -9,10 +9,6 @@ anyio==3.7.1
     #   -c ././deps/constraints.txt
     #   httpx
     #   jupyter-server
-appnope==0.1.4
-    # via
-    #   ipykernel
-    #   ipython
 argon2-cffi==23.1.0
     # via jupyter-server
 argon2-cffi-bindings==21.2.0

requirements/extra-pdf-image.in

Lines changed: 1 addition & 1 deletion
@@ -9,7 +9,7 @@ pillow_heif
 pypdf
 # Do not move to constraints.in, otherwise unstructured-inference will not be upgraded
 # when unstructured library is.
-unstructured-inference==0.7.27
+unstructured-inference==0.7.28
 # unstructured fork of pytesseract that provides an interface to allow for multiple output formats
 # from one tesseract call
 unstructured.pytesseract>=0.3.12

requirements/extra-pdf-image.txt

Lines changed: 35 additions & 1 deletion
@@ -37,6 +37,7 @@ filelock==3.13.4
     #   huggingface-hub
     #   torch
     #   transformers
+    #   triton
 flatbuffers==24.3.25
     # via onnxruntime
 fonttools==4.51.0
@@ -114,6 +115,37 @@ numpy==1.26.4
     #   scipy
     #   torchvision
     #   transformers
+nvidia-cublas-cu12==12.1.3.1
+    # via
+    #   nvidia-cudnn-cu12
+    #   nvidia-cusolver-cu12
+    #   torch
+nvidia-cuda-cupti-cu12==12.1.105
+    # via torch
+nvidia-cuda-nvrtc-cu12==12.1.105
+    # via torch
+nvidia-cuda-runtime-cu12==12.1.105
+    # via torch
+nvidia-cudnn-cu12==8.9.2.26
+    # via torch
+nvidia-cufft-cu12==11.0.2.54
+    # via torch
+nvidia-curand-cu12==10.3.2.106
+    # via torch
+nvidia-cusolver-cu12==11.4.5.107
+    # via torch
+nvidia-cusparse-cu12==12.1.0.106
+    # via
+    #   nvidia-cusolver-cu12
+    #   torch
+nvidia-nccl-cu12==2.19.3
+    # via torch
+nvidia-nvjitlink-cu12==12.4.127
+    # via
+    #   nvidia-cusolver-cu12
+    #   nvidia-cusparse-cu12
+nvidia-nvtx-cu12==12.1.105
+    # via torch
 omegaconf==2.3.0
     # via effdet
 onnx==1.16.0
@@ -275,6 +307,8 @@ tqdm==4.66.2
     #   transformers
 transformers==4.40.0
     # via unstructured-inference
+triton==2.2.0
+    # via torch
 typing-extensions==4.11.0
     # via
     #   -c ./base.txt
@@ -284,7 +318,7 @@ typing-extensions==4.11.0
     #   torch
 tzdata==2024.1
     # via pandas
-unstructured-inference==0.7.27
+unstructured-inference==0.7.28
     # via -r ./extra-pdf-image.in
 unstructured-pytesseract==0.3.12
     # via

requirements/huggingface.txt

Lines changed: 34 additions & 0 deletions
@@ -22,6 +22,7 @@ filelock==3.13.4
     #   huggingface-hub
     #   torch
     #   transformers
+    #   triton
 fsspec==2024.3.1
     # via
     #   huggingface-hub
@@ -54,6 +55,37 @@ numpy==1.26.4
     # via
     #   -c ./base.txt
     #   transformers
+nvidia-cublas-cu12==12.1.3.1
+    # via
+    #   nvidia-cudnn-cu12
+    #   nvidia-cusolver-cu12
+    #   torch
+nvidia-cuda-cupti-cu12==12.1.105
+    # via torch
+nvidia-cuda-nvrtc-cu12==12.1.105
+    # via torch
+nvidia-cuda-runtime-cu12==12.1.105
+    # via torch
+nvidia-cudnn-cu12==8.9.2.26
+    # via torch
+nvidia-cufft-cu12==11.0.2.54
+    # via torch
+nvidia-curand-cu12==10.3.2.106
+    # via torch
+nvidia-cusolver-cu12==11.4.5.107
+    # via torch
+nvidia-cusparse-cu12==12.1.0.106
+    # via
+    #   nvidia-cusolver-cu12
+    #   torch
+nvidia-nccl-cu12==2.19.3
+    # via torch
+nvidia-nvjitlink-cu12==12.4.127
+    # via
+    #   nvidia-cusolver-cu12
+    #   nvidia-cusparse-cu12
+nvidia-nvtx-cu12==12.1.105
+    # via torch
 packaging==23.2
     # via
     #   -c ././deps/constraints.txt
@@ -100,6 +132,8 @@ tqdm==4.66.2
     #   transformers
 transformers==4.40.0
     # via -r ./huggingface.in
+triton==2.2.0
+    # via torch
 typing-extensions==4.11.0
     # via
     #   -c ./base.txt

requirements/ingest/embed-huggingface.txt

Lines changed: 34 additions & 0 deletions
@@ -32,6 +32,7 @@ filelock==3.13.4
     #   huggingface-hub
     #   torch
     #   transformers
+    #   triton
 frozenlist==1.4.1
     # via
     #   aiohttp
@@ -98,6 +99,37 @@ numpy==1.26.4
     #   scipy
     #   sentence-transformers
     #   transformers
+nvidia-cublas-cu12==12.1.3.1
+    # via
+    #   nvidia-cudnn-cu12
+    #   nvidia-cusolver-cu12
+    #   torch
+nvidia-cuda-cupti-cu12==12.1.105
+    # via torch
+nvidia-cuda-nvrtc-cu12==12.1.105
+    # via torch
+nvidia-cuda-runtime-cu12==12.1.105
+    # via torch
+nvidia-cudnn-cu12==8.9.2.26
+    # via torch
+nvidia-cufft-cu12==11.0.2.54
+    # via torch
+nvidia-curand-cu12==10.3.2.106
+    # via torch
+nvidia-cusolver-cu12==11.4.5.107
+    # via torch
+nvidia-cusparse-cu12==12.1.0.106
+    # via
+    #   nvidia-cusolver-cu12
+    #   torch
+nvidia-nccl-cu12==2.19.3
+    # via torch
+nvidia-nvjitlink-cu12==12.4.127
+    # via
+    #   nvidia-cusolver-cu12
+    #   nvidia-cusparse-cu12
+nvidia-nvtx-cu12==12.1.105
+    # via torch
 orjson==3.10.1
     # via langsmith
 packaging==23.2
@@ -168,6 +200,8 @@ tqdm==4.66.2
     #   transformers
 transformers==4.40.0
     # via sentence-transformers
+triton==2.2.0
+    # via torch
 typing-extensions==4.11.0
     # via
     #   -c ./ingest/../base.txt
Lines changed: 33 additions & 0 deletions
@@ -0,0 +1,33 @@
+import pytest
+
+from unstructured.metrics.table.table_formats import SimpleTableCell
+
+
+@pytest.mark.parametrize(
+    ("row_nums", "column_nums", "x", "y", "w", "h"),
+    [
+        ([3, 2, 1], [6, 7], 6, 1, 2, 3),
+        ([2], [6, 7], 6, 2, 2, 1),
+        ([1, 2, 3], [20], 20, 1, 1, 3),
+        ([5], [5], 5, 5, 1, 1),
+    ],
+)
+def test_simple_table_cell_parsing_from_table_transformer_when_expected_input(
+    row_nums, column_nums, x, y, w, h
+):
+    table_transformer_cell = {"row_nums": row_nums, "column_nums": column_nums, "cell text": "text"}
+    transformed_cell = SimpleTableCell.from_table_transformer_cell(table_transformer_cell)
+    expected_cell = SimpleTableCell(x=x, y=y, w=w, h=h, content="text")
+    assert expected_cell == transformed_cell
+
+
+def test_simple_table_cell_parsing_from_table_transformer_when_missing_row_nums():
+    cell = {"row_nums": [], "column_nums": [1], "cell text": "text"}
+    with pytest.raises(ValueError, match='has missing values under "row_nums" key'):
+        SimpleTableCell.from_table_transformer_cell(cell)
+
+
+def test_simple_table_cell_parsing_from_table_transformer_when_missing_column_nums():
+    cell = {"row_nums": [1], "column_nums": [], "cell text": "text"}
+    with pytest.raises(ValueError, match='has missing values under "column_nums" key'):
+        SimpleTableCell.from_table_transformer_cell(cell)
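
The tests above pin down how a table transformer cell should map to a `SimpleTableCell`, but the class itself is not part of this excerpt. The following is only a minimal sketch that would satisfy these tests, not the code from `unstructured.metrics.table.table_formats`. It assumes a dataclass with fields `x`, `y`, `w`, `h` and `content`, takes the smallest column and row index as the cell origin, and uses the number of spanned columns and rows as width and height. The exact wording of the error message beyond the fragment matched in the tests is also an assumption.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class SimpleTableCell:
    """Hypothetical stand-in for the class under test; field names follow the tests."""

    x: int  # index of the first column the cell occupies
    y: int  # index of the first row the cell occupies
    w: int  # number of columns spanned
    h: int  # number of rows spanned
    content: str = ""

    @classmethod
    def from_table_transformer_cell(cls, cell: Dict[str, Any]) -> "SimpleTableCell":
        # Sort so that unordered input such as [3, 2, 1] still yields the
        # smallest index as the origin.
        rows = sorted(cell.get("row_nums", []))
        columns = sorted(cell.get("column_nums", []))

        # An empty list means the cell cannot be placed on the grid.
        if not rows:
            raise ValueError(f'Cell {cell} has missing values under "row_nums" key')
        if not columns:
            raise ValueError(f'Cell {cell} has missing values under "column_nums" key')

        return cls(
            x=columns[0],
            y=rows[0],
            w=len(columns),
            h=len(rows),
            content=cell.get("cell text", ""),
        )
```

Using the span length for `w` and `h` matches every parametrized case in the tests; whether the real class handles non-contiguous row or column indices differently cannot be determined from this diff.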
