Commit 08fafc5

christinestraub, ajjimeno, and ryannikolaidis authored
Fix: embedded text not getting merged with inferred elements (#2679)
This PR is the second part of fixing "embedded text not getting merged with inferred elements"; the first part was done in Unstructured-IO/unstructured-inference#331.

### Summary
- replace `Rectangle.is_in()` with `Rectangle.is_almost_subregion_of()` when removing pdfminer (embedded) elements that were merged with inferred elements
- use env_config `EMBEDDED_TEXT_AGGREGATION_SUBREGION_THRESHOLD` introduced in the [first part](Unstructured-IO/unstructured-inference#331) when removing pdfminer (embedded) elements that were merged with inferred elements
- bump `unstructured-inference` to 0.7.25

### Testing
PDF: [pwc-financial-statements-p114.pdf](https://github.com/Unstructured-IO/unstructured/files/14707146/pwc-financial-statements-p114.pdf)

```
$ pip uninstall unstructured-inference -y
$ git clone -b fix/embedded-text-not-getting-merged-with-inferred-elements [email protected]:Unstructured-IO/unstructured-inference.git && cd unstructured-inference
$ pip install -e .
```

```
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename="pwc-financial-statements-p114.pdf",
    strategy="hi_res",
    infer_table_structure=True,
    extract_image_block_types=["Image"],
)
table_elements = [el for el in elements if el.category == "Table"]
print(table_elements[0].text)
```

---------

Co-authored-by: Antonio Jose Jimeno Yepes <[email protected]>
Co-authored-by: ryannikolaidis <[email protected]>
Co-authored-by: christinestraub <[email protected]>
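The difference between a strict containment test and an "almost subregion" test can be sketched as follows. This is only an illustration of the idea, not the code in `unstructured-inference`: the `Box` class, both function names, and the `0.99` threshold are hypothetical stand-ins, and the real `Rectangle.is_almost_subregion_of()` may be defined differently. In the commit, the threshold is described as coming from `EMBEDDED_TEXT_AGGREGATION_SUBREGION_THRESHOLD`, whose value is not shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical axis-aligned box standing in for the real Rectangle."""

    x1: float
    y1: float
    x2: float
    y2: float

    @property
    def area(self) -> float:
        # Clamp to zero so ill-formed boxes never report a negative area.
        return max(0.0, self.x2 - self.x1) * max(0.0, self.y2 - self.y1)


def is_strictly_inside(inner: Box, outer: Box) -> bool:
    """Strict containment: every edge of `inner` lies within `outer`."""
    return (
        inner.x1 >= outer.x1
        and inner.y1 >= outer.y1
        and inner.x2 <= outer.x2
        and inner.y2 <= outer.y2
    )


def is_almost_inside(inner: Box, outer: Box, threshold: float = 0.99) -> bool:
    """Tolerant containment: enough of `inner`'s area is covered by `outer`.

    The default threshold is an assumption made for this sketch.
    """
    if inner.area == 0:
        return False
    overlap_w = min(inner.x2, outer.x2) - max(inner.x1, outer.x1)
    overlap_h = min(inner.y2, outer.y2) - max(inner.y1, outer.y1)
    overlap = max(0.0, overlap_w) * max(0.0, overlap_h)
    return overlap / inner.area >= threshold


inferred_table = Box(0, 0, 100, 100)
embedded_text = Box(60, 10, 100.2, 20)  # pokes 0.2 units past the right edge

print(is_strictly_inside(embedded_text, inferred_table))
print(is_almost_inside(embedded_text, inferred_table))
```

With these made-up coordinates the strict test rejects the embedded box while the tolerant test accepts it, which is the kind of near-miss the summary describes.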
1 parent 56fbaae commit 08fafc5

11 files changed: +328 -12 lines changed

CHANGELOG.md

Lines changed: 2 additions & 1 deletion
@@ -1,4 +1,4 @@
-## 0.13.0-dev10
+## 0.13.0-dev11
 
 ### Enhancements
 
@@ -13,6 +13,7 @@
 
 ### Fixes
 
+* **Fix `clean_pdfminer_inner_elements()` to remove only pdfminer (embedded) elements merged with inferred elements** Previously, some embedded elements were removed even if they were not merged with inferred elements. Now, only embedded elements that are already merged with inferred elements are removed.
 * **Clarify IAM Role Requirement for GCS Platform Connectors**. The GCS Source Connector requires Storage Object Viewer and GCS Destination Connector requires Storage Object Creator IAM roles.
 * **Change table extraction defaults** Change table extraction defaults in favor of using `skip_infer_table_types` parameter and reflect these changes in documentation.
 * **Fix OneDrive dates with inconsistent formatting** Adds logic to conditionally support dates returned by office365 that may vary in date formatting or may be a datetime rather than a string. See previous fix for SharePoint

requirements/extra-pdf-image.in

Lines changed: 1 addition & 1 deletion
@@ -9,7 +9,7 @@ pillow_heif
 pypdf
 # Do not move to constraints.in, otherwise unstructured-inference will not be upgraded
 # when unstructured library is.
-unstructured-inference==0.7.23
+unstructured-inference==0.7.25
 # unstructured fork of pytesseract that provides an interface to allow for multiple output formats
 # from one tesseract call
 unstructured.pytesseract>=0.3.12

requirements/extra-pdf-image.txt

Lines changed: 1 addition & 2 deletions
@@ -248,7 +248,7 @@ typing-extensions==4.9.0
     #   torch
 tzdata==2024.1
     # via pandas
-unstructured-inference==0.7.23
+unstructured-inference==0.7.25
     # via -r extra-pdf-image.in
 unstructured-pytesseract==0.3.12
     # via
@@ -257,7 +257,6 @@ unstructured-pytesseract==0.3.12
 urllib3==1.26.18
     # via
     #   -c base.txt
-    #   -c constraints.in
     #   requests
 wrapt==1.16.0
     # via

test_unstructured/partition/utils/test_processing_elements.py

Lines changed: 2 additions & 2 deletions
@@ -58,8 +58,8 @@
         type="Table",
         source=InferenceSource.YOLOX,
     ),
-    LayoutElement(bbox=Rectangle(0, 510, 50, 300), text="Inside table2", source=Source.PDFMINER),
-    LayoutElement(bbox=Rectangle(0, 550, 70, 400), text="Inside table2", source=Source.PDFMINER),
+    LayoutElement(bbox=Rectangle(0, 510, 50, 600), text="Inside table2", source=Source.PDFMINER),
+    LayoutElement(bbox=Rectangle(0, 550, 70, 650), text="Inside table2", source=Source.PDFMINER),
 ]
 
 
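One plausible reading of this fixture change, offered as an assumption rather than the authors' stated reason: if `Rectangle` takes its arguments in `(x1, y1, x2, y2)` order, the removed fixtures had a bottom edge above their top edge (`y2 < y1`), so an area-based overlap ratio would not be meaningful for them. The helper below is hypothetical and only checks that ordering.

```python
from typing import Tuple

# Assumed argument order for this sketch: (x1, y1, x2, y2).
Coords = Tuple[float, float, float, float]


def is_well_formed(coords: Coords) -> bool:
    """Return True when the box has positive width and positive height."""
    x1, y1, x2, y2 = coords
    return x2 > x1 and y2 > y1


old_fixture = (0, 510, 50, 300)  # coordinates from the removed test line
new_fixture = (0, 510, 50, 600)  # coordinates from the added test line

print(is_well_formed(old_fixture))  # False: 300 is not greater than 510
print(is_well_formed(new_fixture))  # True
```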

test_unstructured_ingest/expected-structured-output/biomed-api/75/29/main.PMC6312793.pdf.json

Lines changed: 13 additions & 0 deletions
@@ -623,6 +623,19 @@
     "text": "Instance size (m, n) Average number of Locations Times Vehicles (8, 1500) (8, 2000) (8, 2500) (8, 3000) (12, 1500) (12, 2000) (12, 2500) (12, 3000) (16, 1500) (16, 2000) (16, 2500) (16, 3000) 568.40 672.80 923.40 977.00 566.00 732.60 875.00 1119.60 581.80 778.00 879.00 1087.20 975.20 1048.00 1078.00 1113.20 994.00 1040.60 1081.00 1107.40 985.40 1040.60 1083.20 1101.60 652.20 857.20 1082.40 1272.80 642.00 861.20 1096.00 1286.20 667.80 872.40 1076.40 1284.60 668,279.40 1,195,844.80 1,866,175.20 2,705,617.00 674,191.00 1,199,659.80 1,878,745.20 2,711,180.40 673,585.80 1,200,560.80 1,879,387.00 2,684,983.60",
     "type": "Table"
   },
+  {
+    "element_id": "68ec9a56bde1cd8ea67340bf9cb829cb",
+    "metadata": {
+      "data_source": {},
+      "filetype": "application/pdf",
+      "languages": [
+        "eng"
+      ],
+      "page_number": 3
+    },
+    "text": "Possible empty travels",
+    "type": "Title"
+  },
   {
     "element_id": "f64bebb0be23116b44b4ad54968178a0",
     "metadata": {

test_unstructured_ingest/expected-structured-output/s3/2023-Jan-economic-outlook.pdf.json

Lines changed: 273 additions & 0 deletions
@@ -1133,6 +1133,258 @@
     "text": "Year over Year Difference from October 2022 Q4 over Q4 2/ 2021 Estimate 2022 Projections 2023 2024 WEO Projections 1/ 2023 2024 Estimate 2022 Projections 2023 2024 6.2 3.4 2.9 3.1 0.2 –0.1 1.9 3.2 3.0 Advanced Economies United States Euro Area Germany France Italy Spain Japan United Kingdom Canada Other Advanced Economies 3/ 5.4 5.9 5.3 2.6 6.8 6.7 5.5 2.1 7.6 5.0 5.3 2.7 2.0 3.5 1.9 2.6 3.9 5.2 1.4 4.1 3.5 2.8 1.2 1.4 0.7 0.1 0.7 0.6 1.1 1.8 –0.6 1.5 2.0 1.4 1.0 1.6 1.4 1.6 0.9 2.4 0.9 0.9 1.5 2.4 0.1 0.4 0.2 0.4 0.0 0.8 –0.1 0.2 –0.9 0.0 –0.3 –0.2 –0.2 –0.2 –0.1 0.0 –0.4 –0.2 –0.4 0.3 –0.1 –0.2 1.3 0.7 1.9 1.4 0.5 2.1 2.1 1.7 0.4 2.3 1.4 1.1 1.0 0.5 0.0 0.9 0.1 1.3 1.0 –0.5 1.2 2.1 1.6 1.3 2.1 2.3 1.8 1.0 2.8 1.0 1.8 1.9 2.2 Emerging Market and Developing Economies Emerging and Developing Asia China India 4/ Emerging and Developing Europe Russia Latin America and the Caribbean Brazil Mexico Middle East and Central Asia Saudi Arabia Sub-Saharan Africa Nigeria South Africa 6.7 7.4 8.4 8.7 6.9 4.7 7.0 5.0 4.7 4.5 3.2 4.7 3.6 4.9 3.9 4.3 3.0 6.8 0.7 –2.2 3.9 3.1 3.1 5.3 8.7 3.8 3.0 2.6 4.0 5.3 5.2 6.1 1.5 0.3 1.8 1.2 1.7 3.2 2.6 3.8 3.2 1.2 4.2 5.2 4.5 6.8 2.6 2.1 2.1 1.5 1.6 3.7 3.4 4.1 2.9 1.3 0.3 0.4 0.8 0.0 0.9 2.6 0.1 0.2 0.5 –0.4 –1.1 0.1 0.2 0.1 –0.1 0.0 0.0 0.0 0.1 0.6 –0.3 –0.4 –0.2 0.2 0.5 0.0 0.0 0.0 2.5 3.4 2.9 4.3 –2.0 –4.1 2.6 2.8 3.7 . . . 4.6 . . . 2.6 3.0 5.0 6.2 5.9 7.0 3.5 1.0 1.9 0.8 1.1 . . . 2.7 . . . 3.1 0.5 4.1 4.9 4.1 7.1 2.8 2.0 1.9 2.2 1.9 . . . 3.5 . . . 2.9 1.8 Memorandum World Growth Based on Market Exchange Rates European Union ASEAN-5 5/ Middle East and North Africa Emerging Market and Middle-Income Economies Low-Income Developing Countries 6.0 5.5 3.8 4.1 7.0 4.1 3.1 3.7 5.2 5.4 3.8 4.9 2.4 0.7 4.3 3.2 4.0 4.9 2.5 1.8 4.7 3.5 4.1 5.6 0.3 0.0 –0.2 –0.4 0.4 0.0 –0.1 –0.3 –0.2 0.2 0.0 0.1 1.7 1.8 3.7 . . . 2.5 . . . 2.5 1.2 5.7 . . . 5.0 . . . 2.5 2.0 4.0 . . . 4.1 . . . 10.4 9.4 12.1 5.4 6.6 3.4 2.4 2.3 2.6 3.4 2.7 4.6 –0.1 0.0 –0.3 –0.3 –0.4 0.0 . . . . . . . . . . . . . . . . . . . . . . . . . . . 65.8 26.4 39.8 7.0 –16.2 –6.3 –7.1 –0.4 –3.3 –0.1 –0.9 0.3 11.2 –2.0 –9.8 1.4 –5.9 –0.2",
     "type": "Table"
   },
+  {
+    "element_id": "fcadc00fe663ee0e7818b0ffc5c46948",
+    "metadata": {
+      "data_source": {
+        "date_modified": "2023-02-14T07:31:28+00:00",
+        "record_locator": {
+          "protocol": "s3",
+          "remote_file_path": "utic-dev-tech-fixtures/small-pdf-set/2023-Jan-economic-outlook.pdf"
+        },
+        "url": "s3://utic-dev-tech-fixtures/small-pdf-set/2023-Jan-economic-outlook.pdf",
+        "version": "265756457651539296174748931590365722430"
+      },
+      "filetype": "application/pdf",
+      "languages": [
+        "eng"
+      ],
+      "page_number": 7
+    },
+    "text": "World Output",
+    "type": "Title"
+  },
+  {
+    "element_id": "0c76bc4e35219e2a31b09428cd47d009",
+    "metadata": {
+      "data_source": {
+        "date_modified": "2023-02-14T07:31:28+00:00",
+        "record_locator": {
+          "protocol": "s3",
+          "remote_file_path": "utic-dev-tech-fixtures/small-pdf-set/2023-Jan-economic-outlook.pdf"
+        },
+        "url": "s3://utic-dev-tech-fixtures/small-pdf-set/2023-Jan-economic-outlook.pdf",
+        "version": "265756457651539296174748931590365722430"
+      },
+      "filetype": "application/pdf",
+      "languages": [
+        "eng"
+      ],
+      "page_number": 7
+    },
+    "text": "World Trade Volume (goods and services) 6/ Advanced Economies Emerging Market and Developing Economies",
+    "type": "UncategorizedText"
+  },
+  {
+    "element_id": "3c0578f4d944258ffa4ffac7615f1ff9",
+    "metadata": {
+      "data_source": {
+        "date_modified": "2023-02-14T07:31:28+00:00",
+        "record_locator": {
+          "protocol": "s3",
+          "remote_file_path": "utic-dev-tech-fixtures/small-pdf-set/2023-Jan-economic-outlook.pdf"
+        },
+        "url": "s3://utic-dev-tech-fixtures/small-pdf-set/2023-Jan-economic-outlook.pdf",
+        "version": "265756457651539296174748931590365722430"
+      },
+      "filetype": "application/pdf",
+      "languages": [
+        "eng"
+      ],
+      "page_number": 7
+    },
+    "text": "Commodity Prices Oil 7/ Nonfuel (average based on world commodity import weights)",
+    "type": "NarrativeText"
+  },
+  {
+    "element_id": "6bb1e757e09d7fa3aba323a375abd047",
+    "metadata": {
+      "data_source": {
+        "date_modified": "2023-02-14T07:31:28+00:00",
+        "record_locator": {
+          "protocol": "s3",
+          "remote_file_path": "utic-dev-tech-fixtures/small-pdf-set/2023-Jan-economic-outlook.pdf"
+        },
+        "url": "s3://utic-dev-tech-fixtures/small-pdf-set/2023-Jan-economic-outlook.pdf",
+        "version": "265756457651539296174748931590365722430"
+      },
+      "filetype": "application/pdf",
+      "languages": [
+        "eng"
+      ],
+      "page_number": 7
+    },
+    "text": "World Consumer Prices 8/ Advanced Economies 9/ Emerging Market and Developing Economies 8/",
+    "type": "UncategorizedText"
+  },
+  {
+    "element_id": "9db439c530ed3425c0a68724de199942",
+    "metadata": {
+      "data_source": {
+        "date_modified": "2023-02-14T07:31:28+00:00",
+        "record_locator": {
+          "protocol": "s3",
+          "remote_file_path": "utic-dev-tech-fixtures/small-pdf-set/2023-Jan-economic-outlook.pdf"
+        },
+        "url": "s3://utic-dev-tech-fixtures/small-pdf-set/2023-Jan-economic-outlook.pdf",
+        "version": "265756457651539296174748931590365722430"
+      },
+      "filetype": "application/pdf",
+      "languages": [
+        "eng"
+      ],
+      "page_number": 7
+    },
+    "text": "4.7 3.1 5.9",
+    "type": "UncategorizedText"
+  },
+  {
+    "element_id": "b7948d6976e997e76e343161b4b5d864",
+    "metadata": {
+      "data_source": {
+        "date_modified": "2023-02-14T07:31:28+00:00",
+        "record_locator": {
+          "protocol": "s3",
+          "remote_file_path": "utic-dev-tech-fixtures/small-pdf-set/2023-Jan-economic-outlook.pdf"
+        },
+        "url": "s3://utic-dev-tech-fixtures/small-pdf-set/2023-Jan-economic-outlook.pdf",
+        "version": "265756457651539296174748931590365722430"
+      },
+      "filetype": "application/pdf",
+      "languages": [
+        "eng"
+      ],
+      "page_number": 7
+    },
+    "text": "8.8 7.3 9.9",
+    "type": "UncategorizedText"
+  },
+  {
+    "element_id": "e7ac421147471fe341ae242e7544a44c",
+    "metadata": {
+      "data_source": {
+        "date_modified": "2023-02-14T07:31:28+00:00",
+        "record_locator": {
+          "protocol": "s3",
+          "remote_file_path": "utic-dev-tech-fixtures/small-pdf-set/2023-Jan-economic-outlook.pdf"
+        },
+        "url": "s3://utic-dev-tech-fixtures/small-pdf-set/2023-Jan-economic-outlook.pdf",
+        "version": "265756457651539296174748931590365722430"
+      },
+      "filetype": "application/pdf",
+      "languages": [
+        "eng"
+      ],
+      "page_number": 7
+    },
+    "text": "6.6 4.6 8.1",
+    "type": "UncategorizedText"
+  },
+  {
+    "element_id": "4b48b0469ba9682a3e385ee7fbb6bbed",
+    "metadata": {
+      "data_source": {
+        "date_modified": "2023-02-14T07:31:28+00:00",
+        "record_locator": {
+          "protocol": "s3",
+          "remote_file_path": "utic-dev-tech-fixtures/small-pdf-set/2023-Jan-economic-outlook.pdf"
+        },
+        "url": "s3://utic-dev-tech-fixtures/small-pdf-set/2023-Jan-economic-outlook.pdf",
+        "version": "265756457651539296174748931590365722430"
+      },
+      "filetype": "application/pdf",
+      "languages": [
+        "eng"
+      ],
+      "page_number": 7
+    },
+    "text": "4.3 2.6 5.5",
+    "type": "UncategorizedText"
+  },
+  {
+    "element_id": "5277334fd8abe869f6a8de2e43942c9d",
+    "metadata": {
+      "data_source": {
+        "date_modified": "2023-02-14T07:31:28+00:00",
+        "record_locator": {
+          "protocol": "s3",
+          "remote_file_path": "utic-dev-tech-fixtures/small-pdf-set/2023-Jan-economic-outlook.pdf"
+        },
+        "url": "s3://utic-dev-tech-fixtures/small-pdf-set/2023-Jan-economic-outlook.pdf",
+        "version": "265756457651539296174748931590365722430"
+      },
+      "filetype": "application/pdf",
+      "languages": [
+        "eng"
+      ],
+      "page_number": 7
+    },
+    "text": "0.1 0.2 0.0",
+    "type": "UncategorizedText"
+  },
+  {
+    "element_id": "44f0ab7953bb0b3696b9fa3cf0682f35",
+    "metadata": {
+      "data_source": {
+        "date_modified": "2023-02-14T07:31:28+00:00",
+        "record_locator": {
+          "protocol": "s3",
+          "remote_file_path": "utic-dev-tech-fixtures/small-pdf-set/2023-Jan-economic-outlook.pdf"
+        },
+        "url": "s3://utic-dev-tech-fixtures/small-pdf-set/2023-Jan-economic-outlook.pdf",
+        "version": "265756457651539296174748931590365722430"
+      },
+      "filetype": "application/pdf",
+      "languages": [
+        "eng"
+      ],
+      "page_number": 7
+    },
+    "text": "0.2 0.2 0.2",
+    "type": "UncategorizedText"
+  },
+  {
+    "element_id": "08e781dd2b6499b1ac8105a47f3520cc",
+    "metadata": {
+      "data_source": {
+        "date_modified": "2023-02-14T07:31:28+00:00",
+        "record_locator": {
+          "protocol": "s3",
+          "remote_file_path": "utic-dev-tech-fixtures/small-pdf-set/2023-Jan-economic-outlook.pdf"
+        },
+        "url": "s3://utic-dev-tech-fixtures/small-pdf-set/2023-Jan-economic-outlook.pdf",
+        "version": "265756457651539296174748931590365722430"
+      },
+      "filetype": "application/pdf",
+      "languages": [
+        "eng"
+      ],
+      "page_number": 7
+    },
+    "text": "9.2 7.8 10.4",
+    "type": "UncategorizedText"
+  },
+  {
+    "element_id": "e586cf66e92b356a4611ee2ffdf85a16",
+    "metadata": {
+      "data_source": {
+        "date_modified": "2023-02-14T07:31:28+00:00",
+        "record_locator": {
+          "protocol": "s3",
+          "remote_file_path": "utic-dev-tech-fixtures/small-pdf-set/2023-Jan-economic-outlook.pdf"
+        },
+        "url": "s3://utic-dev-tech-fixtures/small-pdf-set/2023-Jan-economic-outlook.pdf",
+        "version": "265756457651539296174748931590365722430"
+      },
+      "filetype": "application/pdf",
+      "languages": [
+        "eng"
+      ],
+      "page_number": 7
+    },
+    "text": "5.0 3.1 6.6",
+    "type": "UncategorizedText"
+  },
   {
     "element_id": "46c8e0c55b163d73d3d2766be8d1bf8d",
     "metadata": {
@@ -1217,6 +1469,27 @@
     "text": "6 International Monetary Fund | January 2023",
     "type": "ListItem"
   },
+  {
+    "element_id": "41d85a7cc007a9c34136a786d6e61c15",
+    "metadata": {
+      "data_source": {
+        "date_modified": "2023-02-14T07:31:28+00:00",
+        "record_locator": {
+          "protocol": "s3",
+          "remote_file_path": "utic-dev-tech-fixtures/small-pdf-set/2023-Jan-economic-outlook.pdf"
+        },
+        "url": "s3://utic-dev-tech-fixtures/small-pdf-set/2023-Jan-economic-outlook.pdf",
+        "version": "265756457651539296174748931590365722430"
+      },
+      "filetype": "application/pdf",
+      "languages": [
+        "eng"
+      ],
+      "page_number": 7
+    },
+    "text": "3.5 2.3 4.5",
+    "type": "UncategorizedText"
+  },
   {
     "element_id": "95af4f3feb2d03b2310ce31abc0c435d",
     "metadata": {

test_unstructured_ingest/expected-structured-output/s3/recalibrating-risk-report.pdf.json

Lines changed: 21 additions & 0 deletions
@@ -377,6 +377,27 @@
     "text": "Nuclear energy and the risk of radiation is one of the most extreme cases in which perceived and actual risks have diverged. The fear of radiation, whilst pre- dating the Second World War, was firmly established by the debate on the potential impacts of low-dose radiation from the fallout from nuclear weapons testing in the early years of the Cold War. Radiation in many ways became linked with the mental imagery of nuclear war, playing an important role in increasing public concern about radiation and its health effects. There is a well-established discrepancy between fact-based risk assessments and public perception of different risks. This is very much the case with nuclear power, and this is clearly highlighted in Figure 1, with laypersons ranking nuclear power as the highest risk out of 30 activities and technologies, with experts ranking nuclear as 20th. In many ways, popular culture’s depiction of radiation has played a role in ensuring that this discrepancy has remained, be it Godzilla, The Incredible Hulk, or The Simpsons, which regularly plays on the notion of radiation from nuclear power plants causing three-eyed fish, something that has been firmly rejected as unscientific.",
     "type": "NarrativeText"
   },
+  {
+    "element_id": "d977fff4c69c437aa4a44a5c5f4bf02e",
+    "metadata": {
+      "data_source": {
+        "date_modified": "2023-02-12T10:09:32+00:00",
+        "record_locator": {
+          "protocol": "s3",
+          "remote_file_path": "utic-dev-tech-fixtures/small-pdf-set/recalibrating-risk-report.pdf"
+        },
+        "url": "s3://utic-dev-tech-fixtures/small-pdf-set/recalibrating-risk-report.pdf",
+        "version": "306475068461766865312866697521104206816"
+      },
+      "filetype": "application/pdf",
+      "languages": [
+        "eng"
+      ],
+      "page_number": 4
+    },
+    "text": "Rank Order Laypersons",
+    "type": "Title"
+  },
   {
     "element_id": "92a15f52537ead259f4d9c2da1b22454",
     "metadata": {
Lines changed: 1 addition & 1 deletion
@@ -1,3 +1,3 @@
 metric average sample_sd population_sd count
 cct-accuracy 0.809 0.24 0.233 17
-cct-%missing 0.026 0.033 0.032 17
+cct-%missing 0.025 0.032 0.031 17
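
The `sample_sd` and `population_sd` columns in the aggregate table above differ only in their divisor (`n - 1` versus `n`). A minimal sketch of how such a row could be computed with Python's standard library follows; the per-document values are hypothetical, and this is not claimed to be the script that produced the table.

```python
import math
import statistics

# Hypothetical per-document %missing scores; the real metrics cover 17 documents.
scores = [0.0, 0.01, 0.02, 0.05, 0.08]

n = len(scores)
average = statistics.mean(scores)
sample_sd = statistics.stdev(scores)        # divides by n - 1
population_sd = statistics.pstdev(scores)   # divides by n

# The two deviations are related by a factor of sqrt((n - 1) / n).
ratio = math.sqrt((n - 1) / n)

print(f"average={average:.3f} sample_sd={sample_sd:.3f} "
      f"population_sd={population_sd:.3f} count={n}")
```

For `count = 17` that factor is `sqrt(16/17)`, roughly 0.970, which is consistent with the rounded 0.032 and 0.031 in the updated `cct-%missing` row.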

test_unstructured_ingest/metrics/text-extraction/all-docs-cct.tsv

Lines changed: 2 additions & 2 deletions
@@ -13,6 +13,6 @@ handbook-1p.docx docx local-single-file-basic-chunking 0.858 0.029
 fake-html-cp1252.html html local-single-file-with-encoding 0.659 0.0
 layout-parser-paper-with-table.jpg jpg local-single-file-with-pdf-infer-table-structure 0.716 0.032
 layout-parser-paper.pdf pdf local-single-file-with-pdf-infer-table-structure 0.95 0.029
-2023-Jan-economic-outlook.pdf pdf s3 0.834 0.054
+2023-Jan-economic-outlook.pdf pdf s3 0.84 0.044
 page-with-formula.pdf pdf s3 0.971 0.021
-recalibrating-risk-report.pdf pdf s3 0.966 0.009
+recalibrating-risk-report.pdf pdf s3 0.968 0.008
