Commit 55ad5fd
fix chunking text None type has no attribute strip (#4018)
### Summary

Fixes the error `Error in chunk: 512: {"detail":"'NoneType' object has no attribute 'strip'"}`.

I found the logs under the same org (we can assume this is the same job). Screenshot:

![Screenshot 2025-06-11 at 10 15 57 AM](https://github.com/user-attachments/assets/c50ada55-eef1-43f7-9e27-9b9ae339a6fb)

Stack trace from the `utic-api` ES log doc:

![Screenshot 2025-06-11 at 2 01 01 PM](https://github.com/user-attachments/assets/7e84fa24-4eb6-45e8-b195-a11d3d124bfa)

### Notes

Longer term, we should make the partitioner (vlm + utic-api) not return text with Null.

---------

Co-authored-by: ryannikolaidis <[email protected]>
Co-authored-by: yuming-long <[email protected]>
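For readers unfamiliar with this failure mode, here is a minimal sketch of the kind of guard that avoids calling `.strip()` on `None`. It is not the patch from this commit (the chunking source change is not shown on this page); `iter_stripped_texts` is a hypothetical helper used only for illustration, and it mirrors the input shape of the regression test in `test_unstructured/chunking/test_base.py` (`None`, real text, `None`).

```python
from typing import Iterable, Iterator, Optional


def iter_stripped_texts(texts: Iterable[Optional[str]]) -> Iterator[str]:
    """Yield stripped, non-empty texts, skipping None (hypothetical helper)."""
    for text in texts:
        # `text or ""` maps None to an empty string, so `.strip()` is always safe.
        stripped = (text or "").strip()
        if stripped:
            yield stripped


# Same shape as the regression test input: None, real text, None.
print(list(iter_stripped_texts([None, "  hello world  ", None])))
# prints ['hello world']
```

Treating `None` as empty text keeps the guard in one place; an input made up only of `None` values yields nothing, which matches the "no chunks" expectation in the table regression test.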
1 parent ec209c6 commit 55ad5fd

13 files changed

+132
-108
lines changed

.github/workflows/ci.yml

Lines changed: 1 addition & 1 deletion
@@ -133,7 +133,7 @@ jobs:
       - name: Test
         env:
           UNS_API_KEY: ${{ secrets.UNS_API_KEY }}
-          TESSERACT_VERSION : "5.4.1"
+          TESSERACT_VERSION : "5.5.1"
         run: |
           source .venv/bin/activate
           sudo apt-get update

CHANGELOG.md

Lines changed: 2 additions & 1 deletion
@@ -1,10 +1,11 @@
-## 0.17.11-dev0
+## 0.17.11-dev1

 ### Enhancements

 ### Features

 ### Fixes
+- Fix chunking for elements with None text that has AttributeError 'NoneType' object has no attribute 'strip'.
 - Invalid elements IDs are not visible in VLM output. Parent-child hierarchy is now retrieved based on unstructured element ID, instead of id injected into HTML code of element.

 ## 0.17.10

test_unstructured/chunking/test_base.py

Lines changed: 23 additions & 0 deletions
@@ -416,6 +416,20 @@ def it_can_handle_element_with_none_as_text(self):
         )
         assert pre_chunk._text == "hello"

+    def it_can_chunk_elements_with_none_text_without_error(self):
+        """Regression test for AttributeError when Image elements have None text."""
+        pre_chunk = PreChunk(
+            [Image(None), Text("hello world"), Image(None)],
+            overlap_prefix="",
+            opts=ChunkingOptions(),
+        )
+
+        # Should not raise AttributeError when generating chunks
+        chunks = list(pre_chunk.iter_chunks())
+
+        assert len(chunks) == 1
+        assert chunks[0].text == "hello world"
+
     @pytest.mark.parametrize(
         ("max_characters", "combine_text_under_n_chars", "expected_value"),
         [
@@ -1026,6 +1040,15 @@ def it_computes_the_original_elements_list_to_help(self):
         # -- computation is only on first call, all chunks get exactly the same orig-elements --
         assert table_chunker._orig_elements is orig_elements

+    def it_handles_table_with_none_text_without_error(self):
+        """Regression test for AttributeError when Table elements have None text."""
+        table = Table(None)  # Table with None text
+
+        # Should not raise AttributeError and should produce no chunks
+        chunks = list(_TableChunker.iter_chunks(table, "", ChunkingOptions()))
+
+        assert len(chunks) == 0
+

 # ================================================================================================
 # HTML SPLITTERS

test_unstructured_ingest/expected-structured-output-html/azure/IRS-form-1987.pdf.html

Lines changed: 18 additions & 18 deletions
@@ -7,8 +7,8 @@
 </title>
 </head>
 <body>
-<h1 class="Title" id="33d8fd813310ae3e74efd7e17fef99df">
-a Department of the Treasury Internal Revenue Service Instructions for Form 3115 (Rev. November 1987) Application for Change in Accounting Method
+<h1 class="Title" id="9c3a63df0fa9649fd2065ebcc4922e18">
+gai) Department of the Treasury Internal Revenue Service Instructions for Form 3115 (Rev. November 1987) Application for Change in Accounting Method
 </h1>
 <p class="NarrativeText" id="5801c515b515aadfb7717e4c36a4cea4">
 (Section references are to the Internal Revenue Code unless otherwise noted.)
@@ -28,29 +28,29 @@ <h1 class="Title" id="85af235e687b4a6537e5542a42456d25">
 <p class="NarrativeText" id="8753b1907d0b40b882489a68baf3fe2c">
 File this form to request a change in your accounting method, including the accounting treatment of any item. If you are requesting a change in accounting period, use Form 1128, Application for Change in Accounting Period. For more information, see Publication 538, Accounting Periods and Methods.
 </p>
-<p class="NarrativeText" id="0cf9161971e9ea8feec111ff7d24f403">
-When filing Form 3115, taxpayers are reminded to determine if IRS has published a ruling or procedure dealing with the specific type of change since November 1987 (the current revision date of Form 3115),
+<p class="NarrativeText" id="7b5365f4534832bac87e1df792cf5b16">
+When filing Form 3115, taxpayers are reminded to determine if IRS has published a ruling or procedure dealing with the specific type of change since November 1987 (the current revision date of Form 3115).
 </p>
 <p class="NarrativeText" id="0fb8eb24db1b27f6f8b69213e3dd9b41">
 Long-term contracts. —If you are required to change your method of accounting for long-term contracts under section 460, see Notice 87-61 (9/21/87), 1987-38 IRB 40, for the notification procedures that must be followed.
 </p>
 <p class="NarrativeText" id="7282f497b067ed1e34176cc85d46ea8e">
 Other methods.—Unless the Service has published a regulation or procedure to the contrary, all other changes !n accounting methods required by the Act are automatically considered to be approved by the Commissioner. Examples of method changes automatically approved by the Commissioner are those changes required to effect: (1) the repeal of the reserve method for bad debts of taxpayers other than financial institutions (Act section 805); (2) the repeal of the installment method for sales under a revolving credit plan (Act section 812); (3) the Inclusion of income attributable to the sale or furnishing of utility services no later than the year In which the services were provided to customers (Act section 821); and (4) the repeal of the deduction for qualified discount coupons (Act section 823). Do not file Form 3115 for these changes.
 </p>
-<p class="NarrativeText" id="61f76478266283c91988a108081fc02e">
-Generally, applicants must complete Section A. In addition, complete the appropriate sections (B-1 through H) for which a change Is desired.
+<p class="NarrativeText" id="9218e8a34790d23be418f5c4ffaaf54c">
+Generally, applicants must complete Section A. \n addition, complete the appropriate sections (B-1 through H) for which a change Is desired.
 </p>
 <p class="NarrativeText" id="b8f9f1fdeffadd34472959092459fba9">
 You must give all relevant facts, including a detailed description of your present and proposed methods. You must also state the reason(s) you believe approval to make the requested change should be granted. Attach additional pages if more space is needed for explanations. Each page should show your name, address, and identifying number.
 </p>
-<p class="NarrativeText" id="6055008a5485b687b614551c78a89c6e">
-State whether you desire a conference in the National Office if the Service proposes to disapprove your application.
+<p class="NarrativeText" id="b7ac9f40a0b010ca0f9a6dedba12a95c">
+State whether you desire a conference In the National Office if the Service proposes to disapprove your application.
 </p>
 <h1 class="Title" id="45da2e5561453f7cdfcf31c1ace13cf0">
 Changes to Accounting Methods Required Under the Tax Reform Act of 1986
 </h1>
-<p class="NarrativeText" id="9256e7591256b6799035172da259b839">
-Uniform capitalization rules and limitation on cash method.—If you are required to change your method of accounting under section,263A (relating to the capitalization and inclusion in inventory costs of certain expenses) or 448 (limiting the use of the cash method of accounting by certain taxpayers) as added by the Tax Reform Act of 1986 (“Act”), the change 1s treated as initiated by the taxpayer, approved by the Commissioner, and the period for taking the adjustments under section 481(a) into account will not exceed 4 years. (Hospitals required to change from the cash method under section 448 have 10 years to take the adjustrnents into account.) Complete Section A and the appropriate sections (B-1 or C and D) for which the change is required.
+<p class="NarrativeText" id="0476fb3d546e315ae90c733259812973">
+Uniform capitalization rules and limitation on cash method.—If you are required to change your method of accounting under section,263A (relating to the capitalization and inclusion in inventory costs of certain expenses) or 448 (limiting the use of the cash method of accounting by certain taxpayers) as added by the Tax Reform Act of 1986 (“Act”), the change is treated as initiated by the taxpayer, approved by the Commissioner, and the period for taking the adjustments under section 481(a) into account will not exceed 4 years. (Hospitals required to change from the cash method under section 448 have 10 years to take the adjustrnents into account.) Complete Section A and the appropriate sections (B-1 or C and D) for which the change is required.
 </p>
 <p class="NarrativeText" id="9951e8eac8f909df08655f3bc100a586">
 Disregard the instructions under Time and Place for Filing and Late Applications. Instead, attach Form 3115 to your income tax return for the year of change; do not file it separately. Also include on a separate statement accompanying the Form 3115 the period over which the section 481(a) adjustment will be taken into account and the basis for that conclusion. Identify the automatic change being made at the top of page 1 of Form 3115 (e.g., “Automatic Change to Accrual Method—Section 448"). See Temporary Regulations sections 1.263A-1T and 1.448-1T for additional information.
@@ -76,8 +76,8 @@ <h1 class="Title" id="daacd181c8b4c9cdeaa9762e5efd3586">
 <h1 class="Title" id="9bac1c8a91f637da3c6114d95239ceee">
 Late Applications
 </h1>
-<p class="NarrativeText" id="c92c7f4def0263141b370bf307d6bcc0">
-If your application is filed after the 180-day period, it is late. The application will be considered for processing only upon a showing of “good cause” and if it can be shown to the satisfaction of the Commissioner that granting you an extension will not jeopardize the Government's interests. For further information, see Rev, Proc. 79-63.
+<p class="NarrativeText" id="adad72fa6ed1f3d66351440221c1ad23">
+If your application is filed after the 180-day period, it 1s late. The application will be considered for processing only upon a showing of “good cause” and if it can be shown to the satisfaction of the Commissioner that granting you an extension will not jeopardize the Government's interests. For further information, see Rev, Proc. 79-63.
 </p>
 <h1 class="Title" id="569b780f1a01b3fe19031adfd2ff6567">
 Identifying Number
@@ -118,8 +118,8 @@ <h1 class="Title" id="441fb1ede36ac4766833502b0400a14a">
 <h1 class="Title" id="5a646ca8e56ece623a47079b32e62fc6">
 Specific Instructions
 </h1>
-<h1 class="Title" id="e0e692b1f478333e3950f8cb2483a484">
-Section A
+<h1 class="Title" id="1505240fbe441adc4acdbc867689af29">
+SectionA
 </h1>
 <p class="NarrativeText" id="43c45bb43eaf69131bf2392df1239ef2">
 Item 5a, page 1.—“Taxable income or (loss) from operations” is to be entered before application of any net operating loss deduction under section 172(a).
@@ -166,8 +166,8 @@ <h1 class="Title" id="1f5704b56b007d890b634121c86d81ac">
 <p class="NarrativeText" id="454de5bfbdcba4385a21dd6261c57d53">
 The limitation on the use of the cash method (except for tax shelters) does not apply to—
 </p>
-<p class="NarrativeText" id="fc1f0d4d56acd27a18ba80ab0acfb9e9">
-(1) Farming businesses.—F or this purpose, the term “farming business” 1s defined in section 263A(e)(4), but it also includes the raising, harvesting, or growing of trees to which section 263A(c)(5) applies. Notwithstanding this exception, section 447 requires certain C corporations and partnerships with a C corporation as a partner to use the accrual method.
+<p class="NarrativeText" id="d268b0c2840319e1b229673523368cae">
+(1) Farming businesses.—For this purpose, the term “farming business” 1s defined in section 263A(e)(4), but it also includes the raising, harvesting, or growing of trees to which section 263A(c)(5) applies. Notwithstanding this exception, section 447 requires certain C corporations and partnerships with a C corporation as a partner to use the accrual method.
 </p>
 <p class="NarrativeText" id="51dcb59cd362d0003f609fdb43fbdfdc">
 (2) Qualified personal service corporations. — A “qualified personal service corporation” is any corporation: (a) substantially all of the activities of which involve the performance of services in the fields of health, law, engineering, architecture, accounting, actuarial science, performing arts, or consulting, and (b)
@@ -178,8 +178,8 @@ <h1 class="Title" id="80474543fe96478feeda72a22f019cd1">
 <p class="NarrativeText" id="e4776aaec9edf7383c95941623c47ff6">
 substantially all of the stock of which is owned by employees performing the services, retired employees who had performed the services, any estate of any individual who had performed the services listed above, or any person who acquired stock of the corporation as a result of the death of an employee or retiree described above if the acquisition occurred within 2 years of death.
 </p>
-<p class="NarrativeText" id="5f5c402f9ebefef3ba8eabf1b5f628b2">
-(3) Entities with gross receipts of $5,000,000 or less. —To qualify for this exception, the C corporation's or partnership’s annual average gross receipts for the three years ending with the prior tax year may not exceed $5,000,000. If the corporation or partnership was not in existence for the entire 3-year period, the period of existence is used to determine whether the corporation or partnership qualifies. If any tax year in the 3-year period is a short tax year, the corporation or partnership must annualize the gross receipts by multiplying the gross receipts by 12 and dividing the result by the number of months in the short period.
+<p class="NarrativeText" id="02eb85f4c80a008b9e03744e68528aff">
+(3) Entities with gross receipts of $5,000,000 or less. —To qualify for this exception, the C corporation's or partnership’s annual average gross receipts for the three years ending with the prior tax year may not exceed $5,000,000. If the corporation or partnership was not in existence for the entire 3-year period, the period of existence is used to determine whether the corporation or partnership qualifies. If any tax year in the 3-year period is a short tax year, the corporation or partnership must annualize the gross receipts by multiplying the gross receipts by 12 and dividing the result by the number of months tn the short period.
 </p>
 <p class="NarrativeText" id="427e5fe33c8c181ccb93c7de11946c13">
 For more information, see section 448 and Temporary Regulations section 1.448-1T.
