Commit aa33210

fix: fix header and footer not parsed as Header/Footer types (#4041)
## Summary

This PR fixes an issue where header/footer content in HTML is not partitioned as the `unstructured` `Header` or `Footer` element types. Instead, it is either partitioned as `UncategorizedText` or takes on the type of the structure nested inside the header/footer. For example, `<header class="Header"><h1 class="Title">Header Title</h1></header>` would be partitioned as a `Title` instead of a `Header`.

## Bug description

This behavior occurs because the ontology definition treats header and footer as layout elements, i.e., containers. As a result, during parsing we [unwrap](https://github.com/Unstructured-IO/unstructured/blob/ec209c6b5f9f24b4aabfa3bc8145ab896e7afd66/unstructured/partition/html/transformations.py#L361-L378) the container and parse its contents as if they came from the main text, even though they are still part of the header/footer. The fix is to treat header/footer as text instead of layout in the ontology, so that all content inside them is gathered under the `Header`/`Footer` element types.
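
The layout-versus-text distinction described above can be illustrated with a toy partitioner. This is only a sketch under stated assumptions, not the library's implementation: `toy_partition`, `TYPE_BY_TAG`, and the `layout_tags` parameter are hypothetical names, and the snippet uses the standard-library XML parser, so it only handles well-formed markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified tag -> element type table (not the library's ontology).
TYPE_BY_TAG = {"header": "Header", "footer": "Footer", "h1": "Title", "p": "NarrativeText"}


def toy_partition(html, layout_tags):
    """Return (element_type, text) pairs for a well-formed HTML snippet.

    Tags in `layout_tags` are unwrapped: their children are emitted on their own,
    and any direct text of the layout tag is ignored in this sketch.
    Any other tag is treated as text and absorbs the text of its whole subtree.
    """

    def walk(node):
        if node.tag in layout_tags:
            for child in node:
                yield from walk(child)
        else:
            # Collect all nested text and normalize whitespace.
            text = " ".join(" ".join(node.itertext()).split())
            yield TYPE_BY_TAG.get(node.tag, "UncategorizedText"), text

    return list(walk(ET.fromstring(html)))


snippet = '<header class="Header"><h1 class="Title">Header Title</h1></header>'

# Header as a layout container: only the nested <h1> surfaces.
before = toy_partition(snippet, layout_tags={"header", "footer", "section", "div"})
# Header as a text element: it keeps its own type and gathers its content.
after = toy_partition(snippet, layout_tags={"section", "div"})

print(before)  # [('Title', 'Header Title')]
print(after)   # [('Header', 'Header Title')]
```

Moving `header` and `footer` out of the layout set is the only difference between the two calls, which mirrors the shape of the fix described in this commit message.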
1 parent 45c3b63 commit aa33210

9 files changed: +503 -604 lines changed

CHANGELOG.md

Lines changed: 2 additions & 1 deletion
@@ -1,4 +1,4 @@
-## 0.18.2-dev3
+## 0.18.2-dev4
 
 ### Enhancements
 
@@ -9,6 +9,7 @@
 - **Failproof docx malformed or merged tables** This fix prevents docx file with complex or vertical merges or malformed tables from failing at `tc_at_grid_offset` and raised `ValueError: no tc element at grid_offset=X`.
 - **partition_md can read special characters on non- utf-8 files** `partition_md` reads the file as utf-8 previously. Now it uses `read_txt_file` that reads file with detected encoding.
 - xml code not getting escaped in a code block in a markdown file when in partition
+- **Fixes parsing HTML header and footer** Previously header and footer texts are partitioned as `UncategorizedText` or as the nested structure like `Title`. Now they are properly partitioned as `Header` and `Footer` element types.
 
 ## 0.18.1
 
test_unstructured/documents/unstructured_json_output/example.json

Lines changed: 43 additions & 59 deletions
@@ -1,159 +1,143 @@
 [
   {
-    "element_id": "3a6b156a81764e17be128264241f8136",
+    "element_id": "eda37931eb954fcc8dec8804c7e8fa4c",
     "metadata": {
       "category_depth": 0,
-      "filename": "example.pdf",
+      "file_directory": "test_unstructured/documents/html_files",
+      "filename": "example.html",
       "filetype": "text/html",
       "languages": [
         "eng"
       ],
+      "last_modified": "2025-06-12T11:12:20",
       "page_number": 1,
-      "parent_id": "897a8a47377c4ad6aab839a929879537",
+      "parent_id": "037b418b76eb4ac1bd40326ff67e67b0",
       "text_as_html": "<div class=\"Page\" data-page-number=\"1\" />"
     },
     "text": "",
     "type": "UncategorizedText"
   },
   {
-    "element_id": "45b3d0053468484ba1c7b53998115412",
+    "element_id": "97eb491421584ad892074d039779fbfa",
     "metadata": {
       "category_depth": 1,
-      "filename": "example.pdf",
+      "file_directory": "test_unstructured/documents/html_files",
+      "filename": "example.html",
       "filetype": "text/html",
       "languages": [
         "eng"
       ],
+      "last_modified": "2025-06-12T11:12:20",
       "page_number": 1,
-      "parent_id": "3a6b156a81764e17be128264241f8136",
-      "text_as_html": "<header class=\"Header\" />"
+      "parent_id": "eda37931eb954fcc8dec8804c7e8fa4c",
+      "text_as_html": "<header class=\"Header\"><h1 class=\"Title\">Header</h1><time class=\"CalendarDate\">Date: October 30, 2023</time></header>"
     },
-    "text": "",
-    "type": "UncategorizedText"
-  },
-  {
-    "element_id": "c95473e8a3704fc2b418697f9fddb27b",
-    "metadata": {
-      "category_depth": 2,
-      "filename": "example.pdf",
-      "filetype": "text/html",
-      "languages": [
-        "eng"
-      ],
-      "page_number": 1,
-      "parent_id": "45b3d0053468484ba1c7b53998115412",
-      "text_as_html": "<h1 class=\"Title\">Header</h1>"
-    },
-    "text": "Header",
-    "type": "Title"
-  },
-  {
-    "element_id": "379cbfdc16d44bd6a59e6cfabe6438d5",
-    "metadata": {
-      "category_depth": 2,
-      "filename": "example.pdf",
-      "filetype": "text/html",
-      "languages": [
-        "eng"
-      ],
-      "page_number": 1,
-      "parent_id": "45b3d0053468484ba1c7b53998115412",
-      "text_as_html": "<time class=\"CalendarDate\">Date: October 30, 2023</time>"
-    },
-    "text": "Date: October 30, 2023",
-    "type": "UncategorizedText"
+    "text": "Header Date: October 30, 2023",
+    "type": "Header"
   },
   {
-    "element_id": "637c2f6935fb4353a5f73025ce04619d",
+    "element_id": "4afb6e4a90e14835b958dadb77cd8331",
     "metadata": {
       "category_depth": 1,
-      "filename": "example.pdf",
+      "file_directory": "test_unstructured/documents/html_files",
+      "filename": "example.html",
       "filetype": "text/html",
       "languages": [
         "eng"
       ],
+      "last_modified": "2025-06-12T11:12:20",
       "page_number": 1,
-      "parent_id": "3a6b156a81764e17be128264241f8136",
+      "parent_id": "eda37931eb954fcc8dec8804c7e8fa4c",
       "text_as_html": "<form class=\"Form\"><label class=\"FormField\" for=\"company-name\">From field name</label><input class=\"FormFieldValue\" value=\"Example value\" /></form>"
     },
     "text": "From field name Example value",
     "type": "UncategorizedText"
   },
   {
-    "element_id": "592422373ed741b68a077e2003f8ed81",
+    "element_id": "d8f996f2bc9a49f4979aac58a2a9ee93",
     "metadata": {
       "category_depth": 1,
-      "filename": "example.pdf",
+      "file_directory": "test_unstructured/documents/html_files",
+      "filename": "example.html",
       "filetype": "text/html",
       "languages": [
         "eng"
       ],
+      "last_modified": "2025-06-12T11:12:20",
       "page_number": 1,
-      "parent_id": "3a6b156a81764e17be128264241f8136",
+      "parent_id": "eda37931eb954fcc8dec8804c7e8fa4c",
       "text_as_html": "<section class=\"Section\" />"
     },
     "text": "",
     "type": "UncategorizedText"
   },
   {
-    "element_id": "dc3792d4422e444f90876b56d0cfb20d",
+    "element_id": "d2c12f995ab248808900f66aec479e9d",
     "metadata": {
       "category_depth": 2,
-      "filename": "example.pdf",
+      "file_directory": "test_unstructured/documents/html_files",
+      "filename": "example.html",
       "filetype": "text/html",
       "languages": [
         "eng"
       ],
+      "last_modified": "2025-06-12T11:12:20",
       "page_number": 1,
-      "parent_id": "592422373ed741b68a077e2003f8ed81",
+      "parent_id": "d8f996f2bc9a49f4979aac58a2a9ee93",
       "text_as_html": "<table class=\"Table\"><thead><tr><th>Description</th><th>Row header</th></tr></thead><tbody><tr><td>Value description</td><td><span>50 $</span><span>(1.32 %)</span></td></tr></tbody></table>"
     },
     "text": "Description Row header Value description 50 $ (1.32 %)",
     "type": "Table"
   },
   {
-    "element_id": "1032242af75c4b37984ea7fea9aac74c",
+    "element_id": "8e3f0d85329343008593f43afcad3327",
     "metadata": {
       "category_depth": 1,
-      "filename": "example.pdf",
+      "file_directory": "test_unstructured/documents/html_files",
+      "filename": "example.html",
       "filetype": "text/html",
       "languages": [
         "eng"
       ],
+      "last_modified": "2025-06-12T11:12:20",
       "page_number": 1,
-      "parent_id": "3a6b156a81764e17be128264241f8136",
+      "parent_id": "eda37931eb954fcc8dec8804c7e8fa4c",
       "text_as_html": "<section class=\"Section\" />"
     },
     "text": "",
     "type": "UncategorizedText"
   },
   {
-    "element_id": "2a4e2c4a689f4f9a8c180b6b521e45c3",
+    "element_id": "5deaad75854741ccb69767881ef399db",
     "metadata": {
       "category_depth": 2,
-      "filename": "example.pdf",
+      "file_directory": "test_unstructured/documents/html_files",
+      "filename": "example.html",
       "filetype": "text/html",
       "languages": [
         "eng"
       ],
+      "last_modified": "2025-06-12T11:12:20",
       "page_number": 1,
-      "parent_id": "1032242af75c4b37984ea7fea9aac74c",
+      "parent_id": "8e3f0d85329343008593f43afcad3327",
       "text_as_html": "<h2 class=\"Subtitle\">2. Subtitle</h2>"
     },
     "text": "2. Subtitle",
     "type": "Title"
   },
   {
-    "element_id": "5591f7a4df01447e82515ce45f686fbe",
+    "element_id": "9e61f29755bc4b6dbb41ea575d41edb6",
     "metadata": {
       "category_depth": 2,
-      "filename": "example.pdf",
+      "file_directory": "test_unstructured/documents/html_files",
+      "filename": "example.html",
       "filetype": "text/html",
       "languages": [
         "eng"
       ],
+      "last_modified": "2025-06-12T11:12:20",
       "page_number": 1,
-      "parent_id": "1032242af75c4b37984ea7fea9aac74c",
+      "parent_id": "8e3f0d85329343008593f43afcad3327",
       "text_as_html": "<p class=\"NarrativeText\">Paragraph text</p>"
     },
     "text": "Paragraph text",

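The change to the header entry in this fixture can be checked mechanically. The following is a minimal sketch, assuming element dicts shaped like the JSON above; `mistyped` is a hypothetical helper written for illustration and is not part of `unstructured`. The two sample entries are abridged from the removed and added lines of the diff.

```python
import re


def mistyped(elements, tag, expected_type):
    """Return elements whose text_as_html starts with <tag> but whose type differs."""
    opens_with_tag = re.compile(rf"<{re.escape(tag)}[\s/>]")
    return [
        el
        for el in elements
        if opens_with_tag.match(el["metadata"].get("text_as_html", ""))
        and el["type"] != expected_type
    ]


# Header element as it appeared before the change (metadata trimmed).
old_header = {
    "type": "UncategorizedText",
    "text": "",
    "metadata": {"text_as_html": '<header class="Header" />'},
}
# Header element as it appears after the change (metadata trimmed).
new_header = {
    "type": "Header",
    "text": "Header Date: October 30, 2023",
    "metadata": {
        "text_as_html": '<header class="Header"><h1 class="Title">Header</h1>'
        '<time class="CalendarDate">Date: October 30, 2023</time></header>'
    },
}

print(len(mistyped([old_header], "header", "Header")))  # 1
print(len(mistyped([new_header], "header", "Header")))  # 0
```

To run the same check over a full fixture, load the file with `json.load` and pass the resulting list in place of the sample entries.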