Commit 5c2acc4

feat: add evaluation metric for table extraction (#216)
## Summary

This PR resolves [CORE-1930](https://unstructured-ai.atlassian.net/browse/CORE-1930). It adds helpers to compute token ratios between two tables (actual vs. prediction). The score ranges from 0 (no match at all) to 100 (fully matched).

## Metric sketch

1. Represent tables as rows-and-columns data. A cell spanning multiple columns or rows occupies the upper-left position of its span and leaves the remaining positions empty.
2. Join the table into one long string, using `tab_token` (often `\t`) between cells in a row and `row_break_token` (often `\n`) between rows.
3. Compare the two string representations with `rapidfuzz.fuzz.partial_token_ratio`, using a tokenizer that tokenizes words and treats each number as its own token.
4. Repeat steps 2-3, this time joining text along the columns instead of the rows (in practice, just take the transpose of both the predicted and actual tables).

For example, this table:

![table-multi-row-column-cells](https://github.com/Unstructured-IO/unstructured-inference/assets/647930/4e5bc0c9-9cc9-4cc3-a64b-6a229dd33931)

would be represented as the table

```python
Disability Category  Participants  Ballots Completed  Ballots Incomplete/Terminated  Results                 None
None                 None          None               None                           Accuracy                Time to complete
Blind                5             1                  4                              34.5%, n=1              1199 sec, n=1
Low Vision           5             2                  3                              98.3% n=2 (97.7%, n=3)  1716 sec, n=3 (1934 sec, n=2)
Dexterity            5             4                  1                              98.3%, n=4              1672.1 sec, n=4
Mobility             3             3                  0                              95.4%, n=3              1416 sec, n=3
```

then as a string by row

```python
('Disability Category\tParticipants\tBallots Completed\tBallots '
 'Incomplete/Terminated\tResults\t\n'
 '\t\t\t\tAccuracy\tTime to complete\n'
 'Blind\t5\t1\t4\t34.5%, n=1\t1199 sec, n=1\n'
 'Low Vision\t5\t2\t3\t98.3% n=2 (97.7%, n=3)\t1716 sec, n=3 (1934 sec, n=2)\n'
 'Dexterity\t5\t4\t1\t98.3%, n=4\t1672.1 sec, n=4\n'
 'Mobility\t3\t3\t0\t95.4%, n=3\t1416 sec, n=3')
```

and as a string by column

```python
('Disability Category\t\tBlind\tLow Vision\tDexterity\tMobility\n'
 'Participants\t\t5\t5\t5\t3\n'
 'Ballots Completed\t\t1\t2\t4\t3\n'
 'Ballots Incomplete/Terminated\t\t4\t3\t1\t0\n'
 'Results\tAccuracy\t34.5%, n=1\t98.3% n=2 (97.7%, n=3)\t98.3%, n=4\t95.4%, '
 'n=3\n'
 '\tTime to complete\t1199 sec, n=1\t1716 sec, n=3 (1934 sec, n=2)\t1672.1 '
 'sec, n=4\t1416 sec, n=3')
```

## Rationale

1. This metric chooses to represent tables as long strings. This is intentional because we intend to run follow-up tasks that treat tables as text (RAG with an LLM).
2. We use token ratio instead of ratio so that numbers are compared as a whole instead of digit by digit. This also ensures that if a word is split by either OCR or table structure detection, the split counts against the score.

Using the example above, we examine a few predictions and their scores:

- predict 1 [pre surge]: lots of structural errors and mistakes in OCR: **row: 58, column: 61**
  ![Screenshot 2023-09-19 at 8 46 28 AM](https://github.com/Unstructured-IO/unstructured-inference/assets/647930/381cf7a7-625d-40ea-9e25-ec6911b633a5)
- predict 2 [post surge]: perfect structural extraction but mistakes in OCR: **row: 92, column: 85**
  ![Screenshot 2023-09-20 at 9 01 56 AM](https://github.com/Unstructured-IO/unstructured-inference/assets/647930/20498ca8-c3b6-457e-bdea-f131efc867fc)
- predict 3 [chipper]: incorrect structure on multi-row cells and multi-line cells, but perfect OCR: **row: 84, column: 86**
  ![Untitled](https://github.com/Unstructured-IO/unstructured-inference/assets/647930/78042969-01a9-4e9a-840c-1741f7ff71a2)

[CORE-1930]: https://unstructured-ai.atlassian.net/browse/CORE-1930?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
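The serialization in steps 2 and 4 of the metric sketch can be illustrated with a minimal, standard-library-only sketch. The names `ACTUAL`, `join_table`, and `transpose` are hypothetical and are not the helpers added by this PR; the table is assumed to be a list of rows in which positions covered by a spanning cell hold `None`.

```python
from typing import List, Optional

Table = List[List[Optional[str]]]

# Example table from the description (hypothetical variable name).
# Positions covered by a spanning cell are None; only the upper-left
# position of the span holds the text.
ACTUAL: Table = [
    ["Disability Category", "Participants", "Ballots Completed",
     "Ballots Incomplete/Terminated", "Results", None],
    [None, None, None, None, "Accuracy", "Time to complete"],
    ["Blind", "5", "1", "4", "34.5%, n=1", "1199 sec, n=1"],
    ["Low Vision", "5", "2", "3", "98.3% n=2 (97.7%, n=3)",
     "1716 sec, n=3 (1934 sec, n=2)"],
    ["Dexterity", "5", "4", "1", "98.3%, n=4", "1672.1 sec, n=4"],
    ["Mobility", "3", "3", "0", "95.4%, n=3", "1416 sec, n=3"],
]


def join_table(table: Table, tab_token: str = "\t",
               row_break_token: str = "\n") -> str:
    """Join cells with tab_token within a row and row_break_token between rows.

    Empty (None) positions become empty strings, so they still contribute
    a tab_token and keep the column alignment visible in the text.
    """
    return row_break_token.join(
        tab_token.join("" if cell is None else cell for cell in row)
        for row in table
    )


def transpose(table: Table) -> Table:
    """Swap rows and columns so the same join runs along the columns."""
    return [list(column) for column in zip(*table)]


# Step 2: string by row. Step 4: string by column, via the transpose.
by_row = join_table(ACTUAL)
by_col = join_table(transpose(ACTUAL))
```

Under these assumptions, `by_row` and `by_col` reproduce the two example strings above. Step 3 would then pass the actual and predicted strings to `rapidfuzz.fuzz.partial_token_ratio`, once for the row-wise pair and once for the column-wise pair.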
1 parent bfb90e3 commit 5c2acc4

11 files changed: +374 -72 lines changed
CHANGELOG.md

Lines changed: 4 additions & 0 deletions
@@ -1,3 +1,7 @@
+## 0.5.30
+
+* add an evaluation metric for table comparison based on token similarity
+
 ## 0.5.29-dev0
 
 * fix paddle unit tests where `make test` fails since paddle doesn't work on M1/M2 chip locally

requirements/base.in

Lines changed: 1 addition & 0 deletions
@@ -6,3 +6,4 @@ onnx==1.14.1
 onnxruntime
 # NOTE(alan): Pinned because this is when the most recent module we import appeared
 transformers>=4.25.1
+rapidfuzz

requirements/base.txt

Lines changed: 20 additions & 19 deletions
@@ -16,15 +16,15 @@ charset-normalizer==3.2.0
     #   requests
 coloredlogs==15.0.1
     # via onnxruntime
-contourpy==1.1.0
+contourpy==1.1.1
     # via matplotlib
-cryptography==41.0.3
+cryptography==41.0.4
     # via pdfminer-six
 cycler==0.11.0
     # via matplotlib
 effdet==0.4.1
     # via layoutparser
-filelock==3.12.3
+filelock==3.12.4
     # via
     #   huggingface-hub
     #   torch
@@ -33,9 +33,9 @@ flatbuffers==23.5.26
     # via onnxruntime
 fonttools==4.42.1
     # via matplotlib
-fsspec==2023.6.0
+fsspec==2023.9.1
     # via huggingface-hub
-huggingface-hub==0.16.4
+huggingface-hub==0.17.2
     # via
     #   -r requirements/base.in
     #   timm
@@ -56,7 +56,7 @@ layoutparser[layoutmodels,tesseract]==0.3.4
     # via -r requirements/base.in
 markupsafe==2.1.3
     # via jinja2
-matplotlib==3.7.2
+matplotlib==3.7.3
     # via pycocotools
 mpmath==1.3.0
     # via sympy
@@ -79,7 +79,7 @@ omegaconf==2.3.0
     # via effdet
 onnx==1.14.1
     # via -r requirements/base.in
-onnxruntime==1.15.1
+onnxruntime==1.16.0
     # via -r requirements/base.in
 opencv-python==4.8.0.76
     # via
@@ -100,27 +100,27 @@ pdfminer-six==20221105
     # via pdfplumber
 pdfplumber==0.10.2
     # via layoutparser
-pillow==10.0.0
+pillow==10.0.1
     # via
     #   layoutparser
     #   matplotlib
     #   pdf2image
     #   pdfplumber
     #   pytesseract
     #   torchvision
-portalocker==2.7.0
+portalocker==2.8.2
     # via iopath
-protobuf==4.24.2
+protobuf==4.24.3
     # via
     #   onnx
     #   onnxruntime
 pycocotools==2.0.7
     # via effdet
 pycparser==2.21
     # via cffi
-pyparsing==3.0.9
+pyparsing==3.1.1
     # via matplotlib
-pypdfium2==4.19.0
+pypdfium2==4.20.0
     # via pdfplumber
 pytesseract==0.3.10
     # via layoutparser
@@ -130,7 +130,7 @@ python-dateutil==2.8.2
     #   pandas
 python-multipart==0.0.6
     # via -r requirements/base.in
-pytz==2023.3
+pytz==2023.3.post1
     # via pandas
 pyyaml==6.0.1
     # via
@@ -139,6 +139,8 @@ pyyaml==6.0.1
     #   omegaconf
     #   timm
     #   transformers
+rapidfuzz==3.3.0
+    # via -r requirements/base.in
 regex==2023.8.8
     # via transformers
 requests==2.31.0
@@ -158,7 +160,7 @@ sympy==1.12
     # via
     #   onnxruntime
     #   torch
-timm==0.9.6
+timm==0.9.7
     # via effdet
 tokenizers==0.13.3
     # via transformers
@@ -178,18 +180,17 @@ tqdm==4.66.1
     #   huggingface-hub
     #   iopath
     #   transformers
-transformers==4.32.1
+transformers==4.33.2
     # via -r requirements/base.in
-typing-extensions==4.7.1
+typing-extensions==4.8.0
     # via
-    #   filelock
     #   huggingface-hub
     #   iopath
     #   onnx
     #   torch
 tzdata==2023.3
     # via pandas
-urllib3==2.0.4
+urllib3==2.0.5
     # via requests
-zipp==3.16.2
+zipp==3.17.0
     # via importlib-resources

requirements/dev.txt

Lines changed: 25 additions & 24 deletions
@@ -18,7 +18,7 @@ argon2-cffi-bindings==21.2.0
     # via argon2-cffi
 arrow==1.2.3
     # via isoduration
-asttokens==2.2.1
+asttokens==2.4.0
     # via stack-data
 async-lru==2.0.4
     # via jupyterlab
@@ -34,7 +34,7 @@ beautifulsoup4==4.12.2
     # via nbconvert
 bleach==6.0.0
     # via nbconvert
-build==0.10.0
+build==1.0.3
     # via pip-tools
 certifi==2023.7.22
     # via
@@ -58,15 +58,15 @@ comm==0.1.4
     # via
     #   ipykernel
     #   ipywidgets
-contourpy==1.1.0
+contourpy==1.1.1
     # via
     #   -c requirements/base.txt
     #   matplotlib
 cycler==0.11.0
     # via
     #   -c requirements/base.txt
     #   matplotlib
-debugpy==1.6.7.post1
+debugpy==1.8.0
     # via ipykernel
 decorator==5.1.1
     # via ipython
@@ -95,6 +95,7 @@ idna==3.4
     #   requests
 importlib-metadata==6.8.0
     # via
+    #   build
     #   jupyter-client
     #   jupyter-lsp
     #   jupyterlab
@@ -107,7 +108,7 @@ importlib-resources==6.0.1
     #   jsonschema-specifications
     #   jupyterlab
     #   matplotlib
-ipykernel==6.25.1
+ipykernel==6.25.2
     # via
     #   jupyter
     #   jupyter-console
@@ -121,7 +122,7 @@ ipython==8.12.2
     #   jupyter-console
 ipython-genutils==0.2.0
     # via qtconsole
-ipywidgets==8.1.0
+ipywidgets==8.1.1
     # via jupyter
 isoduration==20.11.0
     # via jsonschema
@@ -180,15 +181,15 @@ jupyter-server==2.7.3
     #   notebook-shim
 jupyter-server-terminals==0.4.4
     # via jupyter-server
-jupyterlab==4.0.5
+jupyterlab==4.0.6
     # via notebook
 jupyterlab-pygments==0.2.2
     # via nbconvert
-jupyterlab-server==2.24.0
+jupyterlab-server==2.25.0
     # via
     #   jupyterlab
     #   notebook
-jupyterlab-widgets==3.0.8
+jupyterlab-widgets==3.0.9
     # via ipywidgets
 kiwisolver==1.4.5
     # via
@@ -199,7 +200,7 @@ markupsafe==2.1.3
     #   -c requirements/base.txt
     #   jinja2
     #   nbconvert
-matplotlib==3.7.2
+matplotlib==3.7.3
     # via
     #   -c requirements/base.txt
     #   -r requirements/dev.in
@@ -220,9 +221,9 @@ nbformat==5.9.2
     #   jupyter-server
     #   nbclient
     #   nbconvert
-nest-asyncio==1.5.7
+nest-asyncio==1.5.8
     # via ipykernel
-notebook==7.0.3
+notebook==7.0.4
     # via jupyter
 notebook-shim==0.2.3
     # via
@@ -256,7 +257,7 @@ pexpect==4.8.0
     # via ipython
 pickleshare==0.7.5
     # via ipython
-pillow==10.0.0
+pillow==10.0.1
     # via
     #   -c requirements/base.txt
     #   -c requirements/test.txt
@@ -293,7 +294,7 @@ pygments==2.16.1
     #   jupyter-console
     #   nbconvert
     #   qtconsole
-pyparsing==3.0.9
+pyparsing==3.1.1
     # via
     #   -c requirements/base.txt
     #   matplotlib
@@ -307,7 +308,7 @@ python-dateutil==2.8.2
     #   matplotlib
 python-json-logger==2.0.7
     # via jupyter-events
-pytz==2023.3
+pytz==2023.3.post1
     # via
     #   -c requirements/base.txt
     #   babel
@@ -323,7 +324,7 @@ pyzmq==25.1.1
     #   jupyter-console
     #   jupyter-server
     #   qtconsole
-qtconsole==5.4.3
+qtconsole==5.4.4
     # via jupyter
 qtpy==2.4.0
     # via qtconsole
@@ -345,7 +346,7 @@ rfc3986-validator==0.1.1
     # via
     #   jsonschema
     #   jupyter-events
-rpds-py==0.10.0
+rpds-py==0.10.3
     # via
     #   jsonschema
     #   referencing
@@ -362,7 +363,7 @@ sniffio==1.3.0
     # via
     #   -c requirements/test.txt
     #   anyio
-soupsieve==2.4.1
+soupsieve==2.5
     # via beautifulsoup4
 stack-data==0.6.2
     # via ipython
@@ -387,7 +388,7 @@ tornado==6.3.3
     #   jupyterlab
     #   notebook
     #   terminado
-traitlets==5.9.0
+traitlets==5.10.0
     # via
     #   comm
     #   ipykernel
@@ -404,15 +405,15 @@ traitlets==5.9.0
     #   nbconvert
     #   nbformat
     #   qtconsole
-typing-extensions==4.7.1
+typing-extensions==4.8.0
     # via
     #   -c requirements/base.txt
     #   -c requirements/test.txt
     #   async-lru
     #   ipython
 uri-template==1.3.0
     # via jsonschema
-urllib3==2.0.4
+urllib3==2.0.5
     # via
     #   -c requirements/base.txt
     #   -c requirements/test.txt
@@ -425,13 +426,13 @@ webencodings==0.5.1
     # via
     #   bleach
     #   tinycss2
-websocket-client==1.6.2
+websocket-client==1.6.3
     # via jupyter-server
 wheel==0.41.2
     # via pip-tools
-widgetsnbextension==4.0.8
+widgetsnbextension==4.0.9
     # via ipywidgets
-zipp==3.16.2
+zipp==3.17.0
     # via
     #   -c requirements/base.txt
     #   importlib-metadata
