Commit c0457c1
authored
feat: include images when partitioning html (#3945)
Currently we [filter img
tags](https://github.com/Unstructured-IO/unstructured/blob/2addb19473ba9e27af995291f57d35fb50bec4b0/unstructured/partition/html/partition.py#L226-L229)
before tags are converted to Elements by the html partitioner. More
importantly we also don’t currently have a defined “block” / mapping to
support these. This adds these mappings and logic to process.
It also respects `extract_image_block_types` and
`extract_image_block_to_payload` (as we do with pdfs) to determine
whether base64 is included in the metadata.
The partitioned Image Elements sets the text to the img tag’s alt text
if available.
The partitioned Image Elements include the [url in the
metadata](https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/documents/elements.py#L209)
(rather than image_base64) if the img tag src is a url.
## Testing
unit tests have been added for explicit coverage.
existing integration tests and other unit test fixtures have been
updated to account for `Image` elements now present
---------
Co-authored-by: ryannikolaidis <[email protected]>1 parent 74b0647 commit c0457c1
File tree
23 files changed
+1014
-264
lines changed- test_unstructured_ingest
- expected-structured-output-html
- confluence-diff
- MFS
- testteamsp
- notion
- salesforce/EmailMessage
- expected-structured-output
- confluence-diff
- MFS
- testteamsp
- notion
- salesforce/EmailMessage
- test_unstructured/partition
- html
- unstructured
- partition/html
23 files changed
+1014
-264
lines changed| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
| 1 | + | |
| 2 | + | |
| 3 | + | |
| 4 | + | |
| 5 | + | |
| 6 | + | |
| 7 | + | |
| 8 | + | |
| 9 | + | |
| 10 | + | |
1 | 11 | | |
2 | 12 | | |
3 | 13 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
6 | 6 | | |
7 | 7 | | |
8 | 8 | | |
9 | | - | |
| 9 | + | |
10 | 10 | | |
11 | 11 | | |
12 | 12 | | |
| |||
24 | 24 | | |
25 | 25 | | |
26 | 26 | | |
| 27 | + | |
27 | 28 | | |
28 | 29 | | |
29 | 30 | | |
| |||
296 | 297 | | |
297 | 298 | | |
298 | 299 | | |
| 300 | + | |
| 301 | + | |
| 302 | + | |
| 303 | + | |
| 304 | + | |
| 305 | + | |
| 306 | + | |
| 307 | + | |
| 308 | + | |
| 309 | + | |
| 310 | + | |
| 311 | + | |
| 312 | + | |
| 313 | + | |
| 314 | + | |
| 315 | + | |
| 316 | + | |
| 317 | + | |
| 318 | + | |
| 319 | + | |
| 320 | + | |
| 321 | + | |
| 322 | + | |
| 323 | + | |
| 324 | + | |
| 325 | + | |
| 326 | + | |
| 327 | + | |
| 328 | + | |
| 329 | + | |
| 330 | + | |
| 331 | + | |
| 332 | + | |
| 333 | + | |
| 334 | + | |
| 335 | + | |
| 336 | + | |
| 337 | + | |
| 338 | + | |
| 339 | + | |
| 340 | + | |
| 341 | + | |
| 342 | + | |
| 343 | + | |
| 344 | + | |
| 345 | + | |
| 346 | + | |
| 347 | + | |
| 348 | + | |
| 349 | + | |
| 350 | + | |
| 351 | + | |
| 352 | + | |
| 353 | + | |
| 354 | + | |
| 355 | + | |
| 356 | + | |
| 357 | + | |
| 358 | + | |
| 359 | + | |
| 360 | + | |
| 361 | + | |
299 | 362 | | |
300 | 363 | | |
301 | 364 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
210 | 210 | | |
211 | 211 | | |
212 | 212 | | |
213 | | - | |
| 213 | + | |
214 | 214 | | |
215 | 215 | | |
216 | 216 | | |
217 | 217 | | |
218 | 218 | | |
219 | 219 | | |
220 | 220 | | |
221 | | - | |
| 221 | + | |
222 | 222 | | |
223 | 223 | | |
224 | 224 | | |
| |||
430 | 430 | | |
431 | 431 | | |
432 | 432 | | |
433 | | - | |
| 433 | + | |
434 | 434 | | |
435 | 435 | | |
436 | 436 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
14 | 14 | | |
15 | 15 | | |
16 | 16 | | |
17 | | - | |
| 17 | + | |
18 | 18 | | |
19 | 19 | | |
20 | 20 | | |
21 | 21 | | |
22 | 22 | | |
23 | 23 | | |
24 | | - | |
| 24 | + | |
25 | 25 | | |
26 | 26 | | |
27 | 27 | | |
| |||
32 | 32 | | |
33 | 33 | | |
34 | 34 | | |
35 | | - | |
| 35 | + | |
36 | 36 | | |
37 | 37 | | |
38 | 38 | | |
| |||
Lines changed: 37 additions & 28 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
13 | 13 | | |
14 | 14 | | |
15 | 15 | | |
16 | | - | |
| 16 | + | |
| 17 | + | |
17 | 18 | | |
18 | 19 | | |
19 | | - | |
| 20 | + | |
20 | 21 | | |
21 | 22 | | |
22 | | - | |
| 23 | + | |
23 | 24 | | |
24 | 25 | | |
25 | | - | |
| 26 | + | |
26 | 27 | | |
27 | 28 | | |
28 | | - | |
| 29 | + | |
29 | 30 | | |
30 | 31 | | |
31 | | - | |
| 32 | + | |
32 | 33 | | |
33 | 34 | | |
34 | | - | |
| 35 | + | |
| 36 | + | |
35 | 37 | | |
36 | 38 | | |
37 | | - | |
| 39 | + | |
38 | 40 | | |
39 | 41 | | |
40 | | - | |
| 42 | + | |
41 | 43 | | |
42 | 44 | | |
43 | | - | |
| 45 | + | |
| 46 | + | |
44 | 47 | | |
45 | 48 | | |
46 | | - | |
| 49 | + | |
47 | 50 | | |
48 | 51 | | |
49 | | - | |
| 52 | + | |
50 | 53 | | |
51 | 54 | | |
52 | | - | |
| 55 | + | |
| 56 | + | |
53 | 57 | | |
54 | 58 | | |
55 | | - | |
| 59 | + | |
56 | 60 | | |
57 | 61 | | |
58 | | - | |
| 62 | + | |
59 | 63 | | |
60 | 64 | | |
61 | | - | |
| 65 | + | |
| 66 | + | |
62 | 67 | | |
63 | 68 | | |
64 | | - | |
| 69 | + | |
65 | 70 | | |
66 | 71 | | |
67 | | - | |
| 72 | + | |
68 | 73 | | |
69 | 74 | | |
70 | | - | |
| 75 | + | |
71 | 76 | | |
72 | 77 | | |
73 | | - | |
| 78 | + | |
74 | 79 | | |
75 | 80 | | |
76 | | - | |
| 81 | + | |
77 | 82 | | |
78 | 83 | | |
79 | | - | |
| 84 | + | |
| 85 | + | |
80 | 86 | | |
81 | 87 | | |
82 | | - | |
| 88 | + | |
83 | 89 | | |
84 | 90 | | |
85 | | - | |
| 91 | + | |
86 | 92 | | |
87 | 93 | | |
88 | | - | |
| 94 | + | |
| 95 | + | |
| 96 | + | |
89 | 97 | | |
90 | 98 | | |
91 | | - | |
| 99 | + | |
92 | 100 | | |
93 | 101 | | |
94 | | - | |
| 102 | + | |
95 | 103 | | |
96 | 104 | | |
97 | | - | |
| 105 | + | |
98 | 106 | | |
99 | 107 | | |
| 108 | + | |
100 | 109 | | |
101 | 110 | | |
Lines changed: 33 additions & 27 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
46 | 46 | | |
47 | 47 | | |
48 | 48 | | |
49 | | - | |
| 49 | + | |
| 50 | + | |
50 | 51 | | |
51 | 52 | | |
52 | | - | |
| 53 | + | |
53 | 54 | | |
54 | 55 | | |
55 | | - | |
| 56 | + | |
56 | 57 | | |
57 | 58 | | |
58 | | - | |
| 59 | + | |
| 60 | + | |
59 | 61 | | |
60 | 62 | | |
61 | | - | |
| 63 | + | |
62 | 64 | | |
63 | 65 | | |
64 | | - | |
| 66 | + | |
65 | 67 | | |
66 | 68 | | |
67 | | - | |
| 69 | + | |
| 70 | + | |
68 | 71 | | |
69 | 72 | | |
70 | | - | |
| 73 | + | |
71 | 74 | | |
72 | 75 | | |
73 | | - | |
| 76 | + | |
74 | 77 | | |
75 | 78 | | |
76 | | - | |
| 79 | + | |
77 | 80 | | |
78 | 81 | | |
79 | | - | |
| 82 | + | |
80 | 83 | | |
81 | 84 | | |
82 | | - | |
| 85 | + | |
83 | 86 | | |
84 | 87 | | |
85 | | - | |
| 88 | + | |
86 | 89 | | |
87 | 90 | | |
88 | | - | |
| 91 | + | |
89 | 92 | | |
90 | 93 | | |
91 | | - | |
| 94 | + | |
92 | 95 | | |
93 | 96 | | |
94 | | - | |
| 97 | + | |
95 | 98 | | |
96 | 99 | | |
97 | | - | |
| 100 | + | |
98 | 101 | | |
99 | 102 | | |
100 | | - | |
| 103 | + | |
| 104 | + | |
101 | 105 | | |
102 | 106 | | |
103 | | - | |
| 107 | + | |
104 | 108 | | |
105 | 109 | | |
106 | | - | |
| 110 | + | |
| 111 | + | |
107 | 112 | | |
108 | 113 | | |
109 | | - | |
| 114 | + | |
110 | 115 | | |
111 | 116 | | |
112 | | - | |
| 117 | + | |
| 118 | + | |
113 | 119 | | |
114 | 120 | | |
115 | | - | |
| 121 | + | |
116 | 122 | | |
117 | 123 | | |
118 | | - | |
| 124 | + | |
119 | 125 | | |
120 | 126 | | |
121 | | - | |
| 127 | + | |
122 | 128 | | |
123 | 129 | | |
124 | | - | |
| 130 | + | |
125 | 131 | | |
126 | 132 | | |
127 | | - | |
| 133 | + | |
128 | 134 | | |
129 | 135 | | |
130 | 136 | | |
| |||
0 commit comments