Commit 6bc8c38

Document elements and metadata overview: add missing email metadata fields (#664)
1 parent 237d8fa commit 6bc8c38

2 files changed: +47 -22 lines changed

snippets/concepts/document-elements.mdx

Lines changed: 27 additions & 15 deletions
@@ -142,28 +142,40 @@ print(element.metadata.coordinates.system.height)
 
 ```
 
-
 ### Additional metadata fields by document type
 
-| Field Name | Applicable Doc Types | Description |
-|------------------------|----------------------|-------------------------------------------------------------------------------|
-| `page_number` | DOCX, PDF, PPT, XLSX | Page number |
-| `page_name` | XLSX | Sheet name in an Excel document |
-| `sent_from` | EML | Email sender |
-| `sent_to` | EML | Email recipient |
-| `subject` | EML | Email subject |
-| `attached_to_filename` | MSG | filename that attachment file is attached to |
-| `header_footer_type` | Word Doc | Pages a header or footer applies to: "primary", "even_only", and "first_page" |
-| `link_urls` | HTML | The url associated with a link in a document. |
-| `link_texts` | HTML | The text associated with a link in a document. |
-| `section` | EPUB | Book section title corresponding to table of contents |
+| Field name | Applicable file types | Description |
+|------------------------|-----------------------|-------------------------------------------------------------------------------------------------------------------------------------|
+| `attached_to_filename` | MSG | The name of the file that the attached file is attached to. |
+| `bcc_recipient` | EML | The related [email](#email) BCC recipient. |
+| `cc_recipient` | EML | The related [email](#email) CC recipient. |
+| `email_message_id` | EML | The related [email](#email) message ID. |
+| `header_footer_type` | Word Doc | The pages that a header or footer applies to in a [Word document](#microsoft-word-files): `primary`, `even_only`, and `first_page`. |
+| `link_urls` | HTML | The URL that is associated with a link in a document. |
+| `link_texts` | HTML | The text that is associated with a link in a document. |
+| `page_name` | XLSX | The related sheet's name in an [Excel file](#microsoft-excel-files). |
+| `page_number` | DOCX, PDF, PPT, XLSX | The related file's page number. |
+| `section` | EPUB | The book section title corresponding to a table of contents. |
+| `sent_from` | EML | The related [email](#email) sender. |
+| `sent_to` | EML | The related [email](#email) recipient. |
+| `signature` | EML | The related [email](#email) signature. |
+| `subject` | EML | The related [email](#email) subject. |
 
 Notes on additional metadata by document type:
 
 #### Email
 
-Emails will include `sent_from`, `sent_to`, and `subject` metadata. `sent_from` is a list of strings because
-the [RFC 822](https://www.rfc-editor.org/rfc/rfc822) spec for emails allows for multiple sent from email addresses.
+For emails, metadata will contain the following fields, where available:
+
+- `bcc_recipient`
+- `cc_recipient`
+- `email_message_id`
+- `sent_from`
+- `sent_to`
+- `signature`
+- `subject`
+
+`sent_from` is a list of strings because the [RFC 822](https://www.rfc-editor.org/rfc/rfc822) spec for emails allows for multiple sent from email addresses.
 
 #### Microsoft Excel documents
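
Reviewer note on the Email section added in this hunk: because the fields are only present "where available", consuming code probably should not assume any of them exist. The following is a minimal sketch, not the library's API. It assumes each element is available as a plain dictionary with a `metadata` key (for example, parsed from JSON output); the helper `email_metadata` and the `sample` element are hypothetical names and data made up for illustration.

```python
# Email-specific metadata field names, copied from the list added in this commit.
EMAIL_FIELDS = (
    "bcc_recipient",
    "cc_recipient",
    "email_message_id",
    "sent_from",
    "sent_to",
    "signature",
    "subject",
)


def email_metadata(element: dict) -> dict:
    """Return only the email metadata fields that are present on an element.

    Hypothetical helper: the dictionary shape is an assumption, not a
    documented contract.
    """
    metadata = element.get("metadata", {})
    return {name: metadata[name] for name in EMAIL_FIELDS if name in metadata}


# Made-up element for illustration; real elements may carry more fields.
sample = {
    "type": "NarrativeText",
    "text": "See you at the meeting.",
    "metadata": {
        "sent_from": ["alice@example.com"],
        "sent_to": ["bob@example.com"],
        "subject": "Meeting",
    },
}

fields = email_metadata(sample)

# `sent_from` is documented as a list of strings, so iterate over it
# rather than treating it as a single address.
for sender in fields.get("sent_from", []):
    print(sender)
```

Using `in` checks and `.get` keeps the code from raising `KeyError` for emails that have no signature, CC, or BCC recipients.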

ui/document-elements.mdx

Lines changed: 20 additions & 7 deletions
@@ -130,23 +130,36 @@ The `coordinates` metadata field contains:
 
 | Field name | Applicable file types | Description |
 |------------------------|-----------------------|-------------------------------------------------------------------------------------------------------------------------------------|
-| `page_number` | DOCX, PDF, PPT, XLSX | The related file's page number. |
-| `page_name` | XLSX | The related sheet's name in an [Excel file](#microsoft-excel-files). |
-| `sent_from` | EML | The related [email](#email) sender. |
-| `sent_to` | EML | The related [email](#email) recipient. |
-| `subject` | EML | The related [email](#email) subject. |
 | `attached_to_filename` | MSG | The name of the file that the attached file is attached to. |
+| `bcc_recipient` | EML | The related [email](#email) BCC recipient. |
+| `cc_recipient` | EML | The related [email](#email) CC recipient. |
+| `email_message_id` | EML | The related [email](#email) message ID. |
 | `header_footer_type` | Word Doc | The pages that a header or footer applies to in a [Word document](#microsoft-word-files): `primary`, `even_only`, and `first_page`. |
 | `link_urls` | HTML | The URL that is associated with a link in a document. |
 | `link_texts` | HTML | The text that is associated with a link in a document. |
+| `page_name` | XLSX | The related sheet's name in an [Excel file](#microsoft-excel-files). |
+| `page_number` | DOCX, PDF, PPT, XLSX | The related file's page number. |
 | `section` | EPUB | The book section title corresponding to a table of contents. |
+| `sent_from` | EML | The related [email](#email) sender. |
+| `sent_to` | EML | The related [email](#email) recipient. |
+| `signature` | EML | The related [email](#email) signature. |
+| `subject` | EML | The related [email](#email) subject. |
 
 Here are some notes on additional metadata fields by file type:
 
 #### Email
 
-Emails will include `sent_from`, `sent_to`, and `subject` metadata. `sent_from` is a list of strings because
-the [RFC 822](https://www.rfc-editor.org/rfc/rfc822) spec for emails allows for multiple sent from email addresses.
+For emails, metadata will contain the following fields, where available:
+
+- `bcc_recipient`
+- `cc_recipient`
+- `email_message_id`
+- `sent_from`
+- `sent_to`
+- `signature`
+- `subject`
+
+`sent_from` is a list of strings because the [RFC 822](https://www.rfc-editor.org/rfc/rfc822) spec for emails allows for multiple sent from email addresses.
 
 #### Microsoft Excel files
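
Reviewer note on the reordered table in this hunk: the table is keyed by field name, but readers often want the reverse question, namely which additional fields to expect for a given file type. The sketch below inverts the table for that lookup. The mapping is transcribed from the table rows as they appear after this commit (including the table's own `Word Doc` label); the names `FIELD_FILE_TYPES` and `fields_for` are hypothetical and exist only for this illustration.

```python
# Field name -> applicable file types, transcribed from the table in this diff.
FIELD_FILE_TYPES = {
    "attached_to_filename": ["MSG"],
    "bcc_recipient": ["EML"],
    "cc_recipient": ["EML"],
    "email_message_id": ["EML"],
    "header_footer_type": ["Word Doc"],
    "link_urls": ["HTML"],
    "link_texts": ["HTML"],
    "page_name": ["XLSX"],
    "page_number": ["DOCX", "PDF", "PPT", "XLSX"],
    "section": ["EPUB"],
    "sent_from": ["EML"],
    "sent_to": ["EML"],
    "signature": ["EML"],
    "subject": ["EML"],
}


def fields_for(file_type: str) -> list:
    """Return the additional metadata fields the table lists for a file type.

    Hypothetical helper; the result is sorted alphabetically, matching the
    ordering this commit introduces in the table.
    """
    return sorted(
        name for name, types in FIELD_FILE_TYPES.items() if file_type in types
    )


print(fields_for("XLSX"))
print(fields_for("EML"))
```

A file type that the table does not mention simply yields an empty list, which mirrors the "where available" wording used for the email fields.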