Commit 4de42d6

Document elements and metadata: added some missing image and link metadata fields (#667)
1 parent 6fe3b57 commit 4de42d6

2 files changed: +27 -18 lines changed

snippets/concepts/document-elements.mdx

Lines changed: 21 additions & 16 deletions
@@ -144,22 +144,27 @@ print(element.metadata.coordinates.system.height)

 ### Additional metadata fields by document type

-| Field name | Applicable file types | Description |
-|------------------------|-----------------------|-------------------------------------------------------------------------------------------------------------------------------------|
-| `attached_to_filename` | MSG | The name of the file that the attached file is attached to. |
-| `bcc_recipient` | EML | The related [email](#email) BCC recipient. |
-| `cc_recipient` | EML | The related [email](#email) CC recipient. |
-| `email_message_id` | EML | The related [email](#email) message ID. |
-| `header_footer_type` | Word Doc | The pages that a header or footer applies to in a [Word document](#microsoft-word-files): `primary`, `even_only`, and `first_page`. |
-| `link_urls` | HTML | The URL that is associated with a link in a document. |
-| `link_texts` | HTML | The text that is associated with a link in a document. |
-| `page_name` | XLSX | The related sheet's name in an [Excel file](#microsoft-excel-files). |
-| `page_number` | DOCX, PDF, PPT, XLSX | The related file's page number. |
-| `section` | EPUB | The book section title corresponding to a table of contents. |
-| `sent_from` | EML | The related [email](#email) sender. |
-| `sent_to` | EML | The related [email](#email) recipient. |
-| `signature` | EML | The related [email](#email) signature. |
-| `subject` | EML | The related [email](#email) subject. |
+| Field name | Applicable file types | Description |
+|------------------------|-----------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| `attached_to_filename` | MSG | The name of the file that the attached file is attached to. |
+| `bcc_recipient` | EML | The related [email](#email) BCC recipient. |
+| `cc_recipient` | EML | The related [email](#email) CC recipient. |
+| `email_message_id` | EML | The related [email](#email) message ID. |
+| `header_footer_type` | Word Doc | The pages that a header or footer applies to in a [Word document](#microsoft-word-files): `primary`, `even_only`, and `first_page`. |
+| `image_path` | PDF | The path to the image. This is useful when you want to extract the image and save it in a specified path instead of serializing the image within the processed data. |
+| `image_mime_type` | PDF | The MIME type of the image. |
+| `image_url` | HTML | The URL to the image. |
+| `link_start_indexes` | HTML, PDF | A list of the index locations within the extracted content where the `links` can be found. |
+| `link_texts` | HTML | A list of text strings that are associated with the `link_urls`. |
+| `link_urls` | HTML | A list of URLs within the extracted content. |
+| `links` | PDF | A list of links within the extracted content. |
+| `page_name` | XLSX | The related sheet's name in an [Excel file](#microsoft-excel-files). |
+| `page_number` | DOCX, PDF, PPT, XLSX | The related file's page number. |
+| `section` | EPUB | The book section title corresponding to a table of contents. |
+| `sent_from` | EML | The related [email](#email) sender. |
+| `sent_to` | EML | The related [email](#email) recipient. |
+| `signature` | EML | The related [email](#email) signature. |
+| `subject` | EML | The related [email](#email) subject. |

 Notes on additional metadata by document type:

ui/document-elements.mdx

Lines changed: 6 additions & 2 deletions
@@ -135,8 +135,12 @@ The `coordinates` metadata field contains:
 | `cc_recipient` | EML | The related [email](#email) CC recipient. |
 | `email_message_id` | EML | The related [email](#email) message ID. |
 | `header_footer_type` | Word Doc | The pages that a header or footer applies to in a [Word document](#microsoft-word-files): `primary`, `even_only`, and `first_page`. |
-| `link_urls` | HTML | The URL that is associated with a link in a document. |
-| `link_texts` | HTML | The text that is associated with a link in a document. |
+| `image_mime_type` | HTML, image, PDF | The MIME type of the image. |
+| `image_url` | HTML | The URL to the image. |
+| `link_start_indexes` | HTML, PDF | A list of the index locations within the extracted content where the `links` can be found. |
+| `link_texts` | HTML | A list of text strings that are associated with the `link_urls`. |
+| `link_urls` | HTML | A list of URLs within the extracted content. |
+| `links` | PDF | A list of links within the extracted content. |
 | `page_name` | XLSX | The related sheet's name in an [Excel file](#microsoft-excel-files). |
 | `page_number` | DOCX, PDF, PPT, XLSX | The related file's page number. |
 | `section` | EPUB | The book section title corresponding to a table of contents. |
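
Reviewer note: the reworded descriptions suggest that `link_texts`, `link_urls`, and `link_start_indexes` are lists meant to be read together. The sketch below shows one way a consumer might pair them up. It assumes the three lists are parallel (same order, one entry per link), which the diff does not state explicitly. The sample text, the metadata dict, and the `pair_links` helper are hypothetical and stand in for real element output.

```python
from typing import Any

# Hypothetical extracted text and metadata, shaped after the fields in the table.
# The values are made up for illustration; real output may differ.
sample_text = "See the docs and the API reference."
sample_metadata: dict[str, Any] = {
    "link_texts": ["docs", "API reference"],
    "link_urls": ["https://example.com/docs", "https://example.com/api"],
    "link_start_indexes": [8, 21],
}


def pair_links(metadata: dict[str, Any]) -> list[dict[str, Any]]:
    """Combine the link fields into one record per link.

    Assumes the lists are parallel. A missing start index becomes None
    rather than raising, since not every file type reports every field.
    """
    texts = metadata.get("link_texts") or []
    urls = metadata.get("link_urls") or []
    starts = metadata.get("link_start_indexes") or []
    paired = []
    for i, (text, url) in enumerate(zip(texts, urls)):
        start = starts[i] if i < len(starts) else None
        paired.append({"text": text, "url": url, "start": start})
    return paired


links = pair_links(sample_metadata)
for link in links:
    print(link)
```

If the assumption holds, each `start` value should point at the position in the element text where the matching link text begins, which gives a quick sanity check on real data.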
