Commit 0fa4fee

improve orig_elements handling in astra and neo4j (#389)
* improve orig_elements handling in astra and neo4j
* .
* fix json
* fix
* more fixes
* add test
1 parent 3a2ac7e commit 0fa4fee

16 files changed, +1145 -1300 lines changed

CHANGELOG.md

Lines changed: 6 additions & 0 deletions
@@ -1,3 +1,9 @@
+## 0.5.5
+
+### Enhancements
+
+* **Improve orig_elements handling in astra and neo4j connectors**
+
 ## 0.5.4
 
 ### Enhancements

test/integration/connectors/expected_results/astradb/stager/DA-1p-with-duplicate-pages.pdf.json

Lines changed: 681 additions & 22 deletions
Large diffs are not rendered by default.

test/integration/connectors/expected_results/astradb/stager/DA-1p-with-duplicate-pages.pdf.ndjson

Lines changed: 22 additions & 22 deletions
Large diffs are not rendered by default.
Lines changed: 26 additions & 0 deletions
@@ -0,0 +1,26 @@
+import base64
+import json
+import zlib
+
+from unstructured_ingest.v2.processes.connectors.utils import format_and_truncate_orig_elements
+
+
+def test_format_and_truncate_orig_elements():
+    original_elements = [
+        {
+            "text": "Hello, world!",
+            "metadata": {
+                "image_base64": "iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAABwUlEQVR42mNk",
+                "text_as_html": "<p>Hello, world!</p>",
+                "page": 1,
+            },
+        }
+    ]
+    json_bytes = json.dumps(original_elements, sort_keys=True).encode("utf-8")
+    deflated_bytes = zlib.compress(json_bytes)
+    b64_deflated_bytes = base64.b64encode(deflated_bytes)
+    b64_deflated_bytes.decode("utf-8")
+
+    assert format_and_truncate_orig_elements(
+        {"text": "Hello, world!", "metadata": {"orig_elements": b64_deflated_bytes.decode("utf-8")}}
+    ) == [{"metadata": {"page": 1}}]
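The test above implies that `orig_elements` is stored as base64-encoded, zlib-compressed JSON, and that the helper returns the decoded elements with bulky fields removed. The body of `format_and_truncate_orig_elements` is not shown in this diff, so the following is only a rough sketch of that round trip. The function names `encode_orig_elements` and `decode_and_trim_orig_elements`, and the set of dropped keys, are assumptions inferred from the test's expected output, not the library's actual implementation.

```python
from __future__ import annotations

import base64
import json
import zlib

# Assumption: these are the metadata keys dropped, inferred from the test's
# expected result ([{"metadata": {"page": 1}}]); the real list may differ.
DROPPED_METADATA_KEYS = {"image_base64", "text_as_html"}


def encode_orig_elements(elements: list[dict]) -> str:
    """JSON-encode, zlib-compress, then base64-encode (mirrors the test setup)."""
    raw = json.dumps(elements, sort_keys=True).encode("utf-8")
    return base64.b64encode(zlib.compress(raw)).decode("utf-8")


def decode_and_trim_orig_elements(element: dict) -> list[dict]:
    """Hypothetical stand-in: decode metadata.orig_elements and keep only
    the small metadata fields of each original element."""
    encoded = element.get("metadata", {}).get("orig_elements")
    if not encoded:
        return []
    decoded = json.loads(zlib.decompress(base64.b64decode(encoded)))
    trimmed = []
    for item in decoded:
        metadata = {
            key: value
            for key, value in item.get("metadata", {}).items()
            if key not in DROPPED_METADATA_KEYS
        }
        # The top-level "text" is dropped as well, matching the test expectation.
        trimmed.append({"metadata": metadata})
    return trimmed
```

Under these assumptions, encoding the test's `original_elements` and passing the result through `decode_and_trim_orig_elements` yields the same value the test asserts on.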

test_e2e/expected-structured-output/azure/Core-Skills-for-Biomedical-Data-Scientists-2-pages.pdf.json

Lines changed: 11 additions & 35 deletions
@@ -359,33 +359,9 @@
       }
     }
   },
-  {
-    "type": "UncategorizedText",
-    "element_id": "4f2dbe3656a9ebc60c7e3426ad3cb3e3",
-    "text": "_____________________________________________________________________________________________",
-    "metadata": {
-      "filetype": "application/pdf",
-      "languages": [
-        "eng"
-      ],
-      "page_number": 2,
-      "data_source": {
-        "url": "abfs://container1/Core-Skills-for-Biomedical-Data-Scientists-2-pages.pdf",
-        "version": "0x8DB214A673DD8D8",
-        "record_locator": {
-          "protocol": "abfs",
-          "remote_file_path": "abfs://container1/"
-        },
-        "date_created": "1678440764.0",
-        "date_modified": "1678440764.0",
-        "permissions_data": null,
-        "filesize_bytes": 41164
-      }
-    }
-  },
   {
     "type": "NarrativeText",
-    "element_id": "cd359ae8c49885ead47318021438eead",
+    "element_id": "c8fdefac1ae82fa42caeceff04853415",
     "text": "this commitment, a recent report to the NLM Director recommended working across NIH to identify and develop core skills required of a biomedical data scientist to consistency across the cohort of NIH-trained data scientists. This report provides a set of recommended core skills based on analysis of current BD2K-funded training programs, biomedical data science job ads, and practicing members of the current data science workforce.",
     "metadata": {
       "filetype": "application/pdf",
@@ -409,7 +385,7 @@
   },
   {
     "type": "Title",
-    "element_id": "bf8321a34edb7103ec4209f3e4a8a8da",
+    "element_id": "b5b7392d0a946f5016bfa8ad0c248a9b",
     "text": "Methodology",
     "metadata": {
       "filetype": "application/pdf",
@@ -433,7 +409,7 @@
   },
   {
     "type": "NarrativeText",
-    "element_id": "1e1d3d1a5c1397fc588393568d829bc8",
+    "element_id": "d9d8e38d221ae621c0ddbcabaa4a28b4",
     "text": "The Workforce Excellence team took a three-pronged approach to identifying core skills required of a biomedical data scientist (BDS), drawing from:",
     "metadata": {
       "filetype": "application/pdf",
@@ -457,7 +433,7 @@
   },
   {
     "type": "NarrativeText",
-    "element_id": "45d7ff56632d66a2ab2d4dd2716d4d2e",
+    "element_id": "ba70aa3bc3ad0dec6a62939c94c5a20c",
     "text": "a) Responses to a 2017 Kaggle1 survey2 of over 16,000 self-identified data scientists working across many industries. Analysis of the Kaggle survey responses from the current data science workforce provided insights into the current generation of data scientists, including how they were trained and what programming and analysis skills they use.",
     "metadata": {
       "filetype": "application/pdf",
@@ -481,7 +457,7 @@
   },
   {
     "type": "NarrativeText",
-    "element_id": "bf452aac5123fcedda30dd6ed179f41c",
+    "element_id": "24724b1f0d20a6575f2782fd525c562f",
     "text": "b) Data science skills taught in BD2K-funded training programs. A qualitative content analysis was applied to the descriptions of required courses offered under the 12 BD2K-funded training programs. Each course was coded using qualitative data analysis software, with each skill that was present in the description counted once. The coding schema of data science-related skills was inductively developed and was organized into four major categories: (1) statistics and math skills; (2) computer science; (3) subject knowledge; (4) general skills, like communication and teamwork. The coding schema is detailed in Appendix A.",
     "metadata": {
       "filetype": "application/pdf",
@@ -505,7 +481,7 @@
   },
   {
     "type": "NarrativeText",
-    "element_id": "ca176cbef532792b1f11830ff7520587",
+    "element_id": "5e6c73154a1e5f74780c69afbc9bc084",
     "text": "c) Desired skills identified from data science-related job ads. 59 job ads from government (8.5%), academia (42.4%), industry (33.9%), and the nonprofit sector (15.3%) were sampled from websites like Glassdoor, Linkedin, and Ziprecruiter. The content analysis methodology and coding schema utilized in analyzing the training programs were applied to the job descriptions. Because many job ads mentioned the same skill more than once, each occurrence of the skill was coded, therefore weighting important skills that were mentioned multiple times in a single ad.",
     "metadata": {
       "filetype": "application/pdf",
@@ -529,7 +505,7 @@
   },
   {
     "type": "NarrativeText",
-    "element_id": "11b170fedd889c3b895bbd28acd811ca",
+    "element_id": "249f6c76b2c99dadbefb8b8811b0d4cd",
     "text": "Analysis of the above data provided insights into the current state of biomedical data science training, as well as a view into data science-related skills likely to be needed to prepare the BDS workforce to succeed in the future. Together, these analyses informed recommendations for core skills necessary for a competitive biomedical data scientist.",
     "metadata": {
       "filetype": "application/pdf",
@@ -553,7 +529,7 @@
   },
   {
     "type": "NarrativeText",
-    "element_id": "2665aadf75bca259f1f5b4c91a53a301",
+    "element_id": "f4b34fe2b03c12e48a89276dca673bfb",
     "text": "1 Kaggle is an online community for data scientists, serving as a platform for collaboration, competition, and learning: http://kaggle.com",
     "metadata": {
       "filetype": "application/pdf",
@@ -577,7 +553,7 @@
   },
   {
     "type": "NarrativeText",
-    "element_id": "8bbfe1c3e6bca9a33226d20d69b2297a",
+    "element_id": "75e0008cfdfecc18fb8c43490c53d6d4",
     "text": "2 In August 2017, Kaggle conducted an industry-wide survey to gain a clearer picture of the state of data science and machine learning. A standard set of questions were asked of all respondents, with more specific questions related to work for employed data scientists and questions related to learning for data scientists in training. Methodology and results: https://www.kaggle.com/kaggle/kaggle-survey-2017",
     "metadata": {
       "filetype": "application/pdf",
@@ -600,8 +576,8 @@
     }
   },
   {
-    "type": "UncategorizedText",
-    "element_id": "dd4a661e1a3c898a5cf6328ba56b924d",
+    "type": "PageNumber",
+    "element_id": "e5d48e29d989341ba281611d4eb9311a",
     "text": "2",
     "metadata": {
       "filetype": "application/pdf",

test_e2e/expected-structured-output/azure/IRS-form-1987.pdf.json

Lines changed: 12 additions & 12 deletions
@@ -1033,8 +1033,8 @@
   },
   {
     "type": "NarrativeText",
-    "element_id": "0753dc21ca9fa94cb607cdb49bef3eed",
-    "text": "Item 6, page 2.—The term “gross receipts” includes total sales (net of returns and allowances) and all amounts received for services. In addition, gross receipts include any income from investments and from incidental or outside sources (e.g., interest, dividends, rents, royalties, and annuities). However, if you are a resaler of personal property, exclude from gross receipts any amounts not derived in the ordinary course of a trade or business. Gross receipts do not include amounts received for sales taxes if, under the applicable state or local law, the tax is legally imposed on the purchaser of the good or service, and the taxpayer merely collects and remits the tax to the taxing authority.",
+    "element_id": "5dbefdf9c5729f9d9a74a4b9c40bfb03",
+    "text": "Item 6, page 2.—The term “gross receipts” includes total sales (net of returns and allowances) and all amounts received for services. In addition, gross receipts include any income from investments and from incidental or outside sources (e.g., interest, dividends, rents, royalties, and annuities). However, if you are a resaler of personal property, exclude from gross recepts any amounts not derived in the ordinary course of a trae or business. Gross receipts do not include amounts received for sales taxes if, under the applicable state or local law, the tax is legally imposed on the purchaser of the good or service, and the taxpayer merely collects and remits the tax to the taxing authority.",
     "metadata": {
       "filetype": "application/pdf",
       "languages": [
@@ -1081,8 +1081,8 @@
   },
   {
     "type": "NarrativeText",
-    "element_id": "a34b5c633b40ae532a293aa5ece41ff6",
-    "text": "(manufacturing, retailer, wholesaler, etc.), employer identification number, overall method of accounting, and whether, in the last 6 years, that business has changed its accounting method, or is also changing its accounting method as part of this request or as a separate request.",
+    "element_id": "1fe93da70cb3544175c812a8fb231a93",
+    "text": "(manufacturing, retailer, wholesaler, etc.), employer identification number, overall method of accounting, and whether, in the last 6 years, that business has changed its accounting method, or s also changing its accounting method as part of this request or as a separate request.",
     "metadata": {
       "filetype": "application/pdf",
       "languages": [
@@ -1273,8 +1273,8 @@
   },
   {
     "type": "NarrativeText",
-    "element_id": "375f471287a32d216212e71c83efac13",
-    "text": "Item 1b, page 2.—Include any amounts reported as income In a prior year although the income had not been accrued (earned) or received In the prior year; for example, discount on instaliment loans reported as income for the year in which the loans were made instead of for the year or years in which the income was received or earned. Advance payments under Rev. Proc. 71-21 or Regulations section 1.451-5 must be fully explained and all pertinent information must be submitted with this application.",
+    "element_id": "4402466124ef06237b1c582818097ef5",
+    "text": "Item 1b, page 2.—Include any amounts reported as income In a prior year although the income had not been accrued (earned) or received In the prior year; for example, discount on instalment loans reported as income for the year in which the loans were made instead of for the year or years in which the income was received or earned. Advance payments under Rev. Proc. 71-21 or Regulations section 1.451-5 must be fully explained and all pertinent information must be submitted with this application.",
     "metadata": {
       "filetype": "application/pdf",
       "languages": [
@@ -1753,8 +1753,8 @@
   },
   {
     "type": "NarrativeText",
-    "element_id": "3c85924610889b0aea2755f36f32d151",
-    "text": "Section 460(f) provides that the term “long-term contract” means any contract for the manufacturing, building, installation, or construction of property that is not completed within the tax year in which it is entered into. However, a manufacturing contract will not qualify as a long-term contract unless the contract involves the manufacture of: (1) a unique item not normally included in your finished goods inventory, or (2) any item that normally requires more than 12 calendar months to complete.",
+    "element_id": "6641f860f560189f2a4f70a11bcb18a0",
+    "text": "Section 460(f) provides that the term “long-term contract” means any contract for the manufacturing, building, installation, or construction of property that is not completed within the tax year in which it is entered into. However, a manufacturing contract will not qualify as a long-term contract unless the contract involves the manufacture of: (1) a unique item not normally included n your finished goods inventory, or (2) any item that normally requires more than 12 calendar months to complete.",
     "metadata": {
       "filetype": "application/pdf",
       "languages": [
@@ -1849,8 +1849,8 @@
   },
   {
     "type": "NarrativeText",
-    "element_id": "211a99a9043cb86883fccf0990dca05a",
-    "text": "This section is to be used only to request a change in a method of accounting for depreciation under section 167.",
+    "element_id": "7a531ff803f3d44ea7844acad139239f",
+    "text": "This section s to be used only to request a change n a method of accounting for depreciation under section 167.",
     "metadata": {
       "filetype": "application/pdf",
       "languages": [
@@ -1945,8 +1945,8 @@
   },
   {
     "type": "NarrativeText",
-    "element_id": "7fa50d465048201b8120b4c992461622",
-    "text": "Generally, this section should be used for requesting changes in a method of accounting for which provision has not been made elsewhere on this form. Attach additional pages if more space 1s needed for a full explanation of the present method used and the proposed change requested.",
+    "element_id": "f6c9da2dd8c289bbfd9803b594cd42d4",
+    "text": "Generally, this section should be used for requesting changes n a method of accounting for which provision has not been made elsewhere on this form. Attach additional pages if more space 1s needed for a full explanation of the present method used and the proposed change requested.",
     "metadata": {
       "filetype": "application/pdf",
       "languages": [

test_e2e/expected-structured-output/azure/IRS-form-1987.png.json

Lines changed: 10 additions & 10 deletions
@@ -337,8 +337,8 @@
   },
   {
     "type": "NarrativeText",
-    "element_id": "7906b3444dddda0127174fa5c39eeaaf",
-    "text": "Generally, applicants must complete Section A. In addition, complete the appropriate sections (B-1 through H) for which a change 1s desired.",
+    "element_id": "fddbf541191e6288740825d1ba2ad699",
+    "text": "Generally, applicants must complete Section A. In addition, complete the appropriate sections (B-1 through H) for which a change s desired.",
     "metadata": {
       "filetype": "image/png",
       "languages": [
@@ -433,8 +433,8 @@
   },
   {
     "type": "NarrativeText",
-    "element_id": "10e96c0c96cbb0372d9aa0c53e4e22fd",
-    "text": "Uniform capitalization rules and limitation on cash method.—If you are required to change your method of accounting under sectiort 263A (relating to the capitalization and inclusion in inventory costs of certain expenses) or 448 (limiting the use of the cash method of accounting by certain taxpayers) as added by the Tax Reform Act of 1986 (“Act”), the change 1s treated as initiated by the taxpayer, approved by the Commissioner, and the period for taking the adjustments under section 481(a) into account will not exceed 4 years. (Hospitals required to change from the cash method under section 448 have 10 years to take the adjustments into account.) Complete Section A and the appropriate sections (B-1 or C and D) for which the change is required.",
+    "element_id": "db1340e51c6f16b0d09662ef99b6a95c",
+    "text": "Uniform capitalization rules and limitation on cash method.—If you are required to change your method of accounting under sectior 263A (relating to the capitalization and inclusion in inventory costs of certain expenses) or 448 (limiting the use of the cash method of accounting by certain taxpayers) as added by the Tax Reform Act of 1986 (“Act”), the change 1s treated as initiated by the taxpayer, approved by the Commissioner, and the period for taking the adjustments under section 481(a) nto account will not exceed 4 years. (Hospitals required to change from the cash method under section 448 have 10 years to take the adjustments into account.) Complete Section A and the appropriate sections (B-1 or C and D) for which the change is required.",
     "metadata": {
       "filetype": "image/png",
       "languages": [
@@ -457,8 +457,8 @@
   },
   {
     "type": "NarrativeText",
-    "element_id": "b708681dc3b4972238aa4adb68ce9d58",
-    "text": "Disregard the instructions under Time and Place for Filing and Late Applications. Instead, attach Form 3115 to your income tax return for the year of change; do not file it separately. Also include on a separate statement accompanying the Form 3115 the period over which the section 481(a) adjustment will be taken into account and the basis for that conclusion. Identify the automatic change being made at the top of page 1of Form 3115 (e “Automatic Change to Accrual Method— e tion 448\"). See Temporary Regulations sections 1.263A-1T and 1.448-1T for additional information.",
+    "element_id": "9acc565afdfca753f8670453ee0bcbc9",
+    "text": "Disregard the instructions under Time and Place for Filing and Late Applications. Instead, attach Form 3115 to your income tax return for the year of change; do not file it separately. Also include on a separate statement accompanying the Form 3115 the period over which the section 481(a) adjustment will be taken into account and the basis for that conclusion. Identify the automatic change being made at the top of page 1of Form 3115 (e “Automatic Change to Accrual Method— tion 448\"). See Temporary Regulations sections 1.263A-1T and 1.448-1T for additional information.",
     "metadata": {
       "filetype": "image/png",
       "languages": [
@@ -505,8 +505,8 @@
   },
   {
     "type": "NarrativeText",
-    "element_id": "032a665732e8c3d746c4b9c1a008d806",
-    "text": "Generally, applicants must file this form within the first 180 days of the tax year in which it is desired to make the change.",
+    "element_id": "9f6e038f1b56fc0e2b240959e8e26380",
+    "text": "Generally, applicants must file this form within the first 180 days of the tax year in which it is desire to make the change.",
     "metadata": {
       "filetype": "image/png",
       "languages": [
@@ -529,8 +529,8 @@
   },
   {
     "type": "NarrativeText",
-    "element_id": "e801588e099fc459b0de871907003b6b",
-    "text": "Taxpayers, other than exempt organizations, should file Form 3115 with the Commissioner of Internal Revenue, Attention: CC:C:4, 1111 Constitution Avenue, NW, Washington, DC 20224, Exempt organizations should file with the Assistant Commissioner (Employee Plans and Exempt Organizations), 1111 Constitution Avenue, NW, Washington, DC 20224.",
+    "element_id": "52a2bedfff51dcae79b58b1bf03791f3",
+    "text": "Taxpayers, other than exempt organizations, should file Form 3115 with the Commssioner of Internal Revenue, Attention: CC:C:4, 1111 Constitution Avenue, NW, Washington, DC 20224, Exempt organizations should file with the Assistant Commissioner (Employee Plans and Exempt Organizations), 1111 Constitution Avenue, NW, Washington, DC 20224.",
     "metadata": {
       "filetype": "image/png",
       "languages": [
