Document elements and metadata: added some missing image and link metadata fields (#667)

Paul-Cornell · web-flow · commit 4de42d668299 · 2025-06-24T13:44:48.000-07:00
diff --git a/snippets/concepts/document-elements.mdx b/snippets/concepts/document-elements.mdx
@@ -144,22 +144,27 @@ print(element.metadata.coordinates.system.height)
 
 ### Additional metadata fields by document type
 
-| Field name             | Applicable file types | Description                                                                                                                         |
-|------------------------|-----------------------|-------------------------------------------------------------------------------------------------------------------------------------|
-| `attached_to_filename` | MSG                   | The name of the file that the attached file is attached to.                                                                         |
-| `bcc_recipient`        | EML                   | The related [email](#email) BCC recipient.                                                                                          |
-| `cc_recipient`         | EML                   | The related [email](#email) CC recipient.                                                                                           |
-| `email_message_id`     | EML                   | The related [email](#email) message ID.                                                                                             |
-| `header_footer_type`   | Word Doc              | The pages that a header or footer applies to in a [Word document](#microsoft-word-files): `primary`, `even_only`, and `first_page`. |
-| `link_urls`            | HTML                  | The URL that is associated with a link in a document.                                                                               |
-| `link_texts`           | HTML                  | The text that is associated with a link in a document.                                                                              |
-| `page_name`            | XLSX                  | The related sheet's name in an [Excel file](#microsoft-excel-files).                                                                |
-| `page_number`          | DOCX, PDF, PPT, XLSX  | The related file's page number.                                                                                                     |
-| `section`              | EPUB                  | The book section title corresponding to a table of contents.                                                                        |
-| `sent_from`            | EML                   | The related [email](#email) sender.                                                                                                 |
-| `sent_to`              | EML                   | The related [email](#email) recipient.                                                                                              |
-| `signature`            | EML                   | The related [email](#email) signature.                                                                                              |
-| `subject`              | EML                   | The related [email](#email) subject.                                                                                                |
+| Field name             | Applicable file types | Description                                                                                                                                                          |
+|------------------------|-----------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| `attached_to_filename` | MSG                   | The name of the file that the attached file is attached to.                                                                                                          |
+| `bcc_recipient`        | EML                   | The related [email](#email) BCC recipient.                                                                                                                           |
+| `cc_recipient`         | EML                   | The related [email](#email) CC recipient.                                                                                                                            |
+| `email_message_id`     | EML                   | The related [email](#email) message ID.                                                                                                                              |
+| `header_footer_type`   | Word Doc              | The pages that a header or footer applies to in a [Word document](#microsoft-word-files): `primary`, `even_only`, and `first_page`.                                  |
+| `image_path`           | PDF                   | The path to the image. This is useful when you want to extract the image and save it in a specified path instead of serializing the image within the processed data. |
+| `image_mime_type`      | PDF                   | The MIME type of the image.                                                                                                                                          |
+| `image_url`            | HTML                  | The URL to the image.                                                                                                                                                |
+| `link_start_indexes`   | HTML, PDF             | A list of the index locations within the extracted content where the `links` can be found.                                                                           |
+| `link_texts`           | HTML                  | A list of text strings that are associated with the `link_urls`.                                                                                                     |
+| `link_urls`            | HTML                  | A list of URLs within the extracted content.                                                                                                                         |
+| `links`                | PDF                   | A list of links within the extracted content.                                                                                                                        |
+| `page_name`            | XLSX                  | The related sheet's name in an [Excel file](#microsoft-excel-files).                                                                                                 |
+| `page_number`          | DOCX, PDF, PPT, XLSX  | The related file's page number.                                                                                                                                      |
+| `section`              | EPUB                  | The book section title corresponding to a table of contents.                                                                                                         |
+| `sent_from`            | EML                   | The related [email](#email) sender.                                                                                                                                  |
+| `sent_to`              | EML                   | The related [email](#email) recipient.                                                                                                                               |
+| `signature`            | EML                   | The related [email](#email) signature.                                                                                                                               |
+| `subject`              | EML                   | The related [email](#email) subject.                                                                                                                                 |
 
 Notes on additional metadata by document type:
 
diff --git a/ui/document-elements.mdx b/ui/document-elements.mdx
@@ -135,8 +135,12 @@ The `coordinates` metadata field contains:
 | `cc_recipient`         | EML                   | The related [email](#email) CC recipient.                                                                                           |
 | `email_message_id`     | EML                   | The related [email](#email) message ID.                                                                                             |
 | `header_footer_type`   | Word Doc              | The pages that a header or footer applies to in a [Word document](#microsoft-word-files): `primary`, `even_only`, and `first_page`. |
-| `link_urls`            | HTML                  | The URL that is associated with a link in a document.                                                                               |
-| `link_texts`           | HTML                  | The text that is associated with a link in a document.                                                                              |
+| `image_mime_type`      | HTML, image, PDF      | The MIME type of the image.                                                                                                         |
+| `image_url`            | HTML                  | The URL to the image.                                                                                                               |
+| `link_start_indexes`   | HTML, PDF             | A list of the index locations within the extracted content where the `links` can be found.                                          |
+| `link_texts`           | HTML                  | A list of text strings that are associated with the `link_urls`.                                                                    |
+| `link_urls`            | HTML                  | A list of URLs within the extracted content.                                                                                        |
+| `links`                | PDF                   | A list of links within the extracted content.                                                                                       |
 | `page_name`            | XLSX                  | The related sheet's name in an [Excel file](#microsoft-excel-files).                                                                |
 | `page_number`          | DOCX, PDF, PPT, XLSX  | The related file's page number.                                                                                                     |
 | `section`              | EPUB                  | The book section title corresponding to a table of contents.                                                                        |
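
The new `link_urls`, `link_texts`, and `link_start_indexes` rows all describe lists, so a reader may want to see how they could be combined. The following is a minimal sketch, not taken from the library: it assumes that element metadata is available as a plain dictionary (as in JSON output) and that the three lists are parallel, meaning that position `i` in each list refers to the same link. The helper name `pair_links` and the sample values are hypothetical.

```python
from __future__ import annotations

from typing import Any


def pair_links(metadata: dict[str, Any]) -> list[dict[str, Any]]:
    """Combine the link lists into one record per URL.

    Assumption: the lists are parallel. If a list is missing or shorter
    than `link_urls`, the corresponding value is set to None.
    """
    urls = metadata.get("link_urls") or []
    texts = metadata.get("link_texts") or []
    starts = metadata.get("link_start_indexes") or []

    records = []
    for i, url in enumerate(urls):
        records.append(
            {
                "url": url,
                "text": texts[i] if i < len(texts) else None,
                "start_index": starts[i] if i < len(starts) else None,
            }
        )
    return records


# Hypothetical metadata for an HTML element; the values are made up.
sample_metadata = {
    "link_urls": ["https://example.com/a", "https://example.com/b"],
    "link_texts": ["first link", "second link"],
    "link_start_indexes": [4, 27],
}

paired = pair_links(sample_metadata)
for record in paired:
    print(record)
```

Reading the fields with `.get(...) or []` keeps the sketch usable for file types where these fields are absent, since the table lists them only for HTML (and `link_start_indexes` also for PDF).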