
Commit ded99e2

Enrichment TOC refactoring (#470)
1 parent b80977f commit ded99e2

File tree

7 files changed: +365 -6 lines changed


mint.json

Lines changed: 9 additions & 1 deletion
@@ -593,7 +593,15 @@
       "platform/document-elements",
       "platform/partitioning",
       "platform/chunking",
-      "platform/summarizing",
+      {
+        "group": "Enriching",
+        "pages": [
+          "platform/enriching/overview",
+          "platform/enriching/image-descriptions",
+          "platform/enriching/table-descriptions",
+          "platform/enriching/table-to-html"
+        ]
+      },
       "platform/embedding"
     ]
   },
platform/enriching/image-descriptions.mdx

Lines changed: 58 additions & 0 deletions
@@ -0,0 +1,58 @@
---
title: Image descriptions
---

After partitioning and chunking, you can have Unstructured generate text-based summaries of detected images.

This summarization is done by using models offered through these providers:

- [GPT-4o](https://openai.com/index/hello-gpt-4o/), provided through OpenAI.
- [Claude 3.5 Sonnet](https://www.anthropic.com/news/claude-3-5-sonnet), provided through Anthropic.
- [Claude 3.5 Sonnet](https://aws.amazon.com/bedrock/claude/), provided through Amazon Bedrock.

Here is an example of the output for a detected image, generated by using GPT-4o. Note specifically the `text` field that is added. Line breaks have been inserted here for readability; the output will not contain these line breaks.

```json
{
    "type": "Image",
    "element_id": "3303aa13098f5a26b9845bd18ee8c881",
    "text": "{\n  \"type\": \"graph\",\n  \"description\": \"The graph shows
    the relationship between Potential (V) and Current Density (A/cm2).
    The x-axis is labeled 'Current Density (A/cm2)' and ranges from
    0.0000001 to 0.1. The y-axis is labeled 'Potential (V)' and ranges
    from -2.5 to 1.5. There are six different data series represented
    by different colors: blue (10g), red (4g), green (6g), purple (2g),
    orange (Control), and light blue (8g). The data points for each series
    show how the potential changes with varying current density.\"\n}",
    "metadata": {
        "filetype": "application/pdf",
        "languages": [
            "eng"
        ],
        "page_number": 1,
        "image_base64": "/9j...<full results omitted for brevity>...Q==",
        "image_mime_type": "image/jpeg",
        "filename": "7f239e1d4ef3556cc867a4bd321bbc41.pdf",
        "data_source": {}
    }
}
```

Any embeddings that are produced after these summaries are generated will be based on the `text` field's contents.
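If you process this output programmatically, the generated descriptions are simply the `text` fields of the `Image` elements. Here is a minimal sketch of collecting them in Python, assuming (for illustration only) that the workflow's output elements have been saved locally to a hypothetical file named `elements.json`:

```python
import json

# Load a local file of Unstructured output elements (hypothetical path).
with open("elements.json", "r", encoding="utf-8") as f:
    elements = json.load(f)

# After enrichment, each Image element's "text" field holds the generated
# description, which is also what any downstream embeddings are based on.
image_descriptions = [
    element["text"]
    for element in elements
    if element.get("type") == "Image" and element.get("text")
]

for description in image_descriptions:
    print(description)
```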
## Generate image descriptions

To generate image descriptions, in the **Task** drop-down list of an **Enrichment** node in a workflow, specify the following:

<Note>
You can change a workflow's image description settings only through [Custom](/platform/workflows#create-a-custom-workflow) workflow settings.

Image summaries are generated only when the **Partitioner** node in a workflow is also set to use the **High Res** partitioning strategy. [Learn more](/platform/partitioning).
</Note>

Select **Image Description**, and then choose one of the following provider (and model) combinations to use:

- **OpenAI (GPT-4o)**. [Learn more](https://openai.com/index/hello-gpt-4o/).
- **Anthropic (Claude 3.5 Sonnet)**. [Learn more](https://www.anthropic.com/news/claude-3-5-sonnet).
- **Amazon Bedrock (Claude 3.5 Sonnet)**. [Learn more](https://aws.amazon.com/bedrock/claude/).

platform/enriching/ner.mdx

Lines changed: 127 additions & 0 deletions
@@ -0,0 +1,127 @@
---
title: Named entity recognition (NER)
---

After partitioning and chunking, you can have Unstructured generate a list of recognized entities and their types (such as the names of organizations, products, and people) in the content, through a process known as _named entity recognition_ (NER).

This NER is done by using models offered through these providers:

- [GPT-4o](https://openai.com/index/hello-gpt-4o/), provided through OpenAI.
- [Claude 3.5 Sonnet](https://www.anthropic.com/news/claude-3-5-sonnet), provided through Anthropic.

Here is an example of a list of recognized entities and their types, generated by using GPT-4o. Note specifically the `entities` field that is added.

```json
{
    "type": "CompositeElement",
    "element_id": "bc8333ea0d374670ff0bd03c6126e70d",
    "text": "SECTION. 3\n\nThe Senate of the United States shall be composed of two Senators from each State,
    [chosen by the Legislature there- of,]* for six Years; and each Senator shall have one Vote.\n\n
    Immediately after they shall be assembled in Consequence of the first Election, they shall be divided
    as equally as may be into three Classes. The Seats of the Senators of the first Class shall be vacated
    at the Expiration of the second Year, of the second Class at the Expiration of the fourth Year, and of
    the third Class at the Expiration of the sixth Year, so that one third may be chosen every second Year;
    [and if Vacan- cies happen by Resignation, or otherwise, during the Recess of the Legislature of any
    State, the Executive thereof may make temporary Appointments until the next Meeting of the Legislature,
    which shall then fill such Vacancies.]*\n\nC O N S T I T U T I O N O F T H E U N I T E D S T A T E S",
    "metadata": {
        "filename": "constitution.pdf",
        "filetype": "application/pdf",
        "languages": [
            "eng"
        ],
        "page_number": 2,
        "entities": [
            {
                "entity": "Senate",
                "type": "ORGANIZATION"
            },
            {
                "entity": "United States",
                "type": "LOCATION"
            },
            {
                "entity": "Senators",
                "type": "PERSON"
            },
            {
                "entity": "State",
                "type": "LOCATION"
            },
            {
                "entity": "Legislature",
                "type": "ORGANIZATION"
            },
            {
                "entity": "six Years",
                "type": "DATE"
            },
            {
                "entity": "first Election",
                "type": "EVENT"
            },
            {
                "entity": "second Year",
                "type": "DATE"
            },
            {
                "entity": "fourth Year",
                "type": "DATE"
            },
            {
                "entity": "sixth Year",
                "type": "DATE"
            },
            {
                "entity": "Executive",
                "type": "PERSON"
            },
            {
                "entity": "C O N S T I T U T I O N O F T H E U N I T E D S T A T E S",
                "type": "ARTIFACT"
            }
        ]
    }
}
```
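As a downstream illustration only, here is a minimal sketch of grouping the recognized entities by type in Python, assuming the workflow's output elements have been saved locally to a hypothetical file named `elements.json`:

```python
import json
from collections import defaultdict

# Load a local file of Unstructured output elements (hypothetical path).
with open("elements.json", "r", encoding="utf-8") as f:
    elements = json.load(f)

# The NER enrichment stores recognized entities under metadata["entities"],
# as shown in the example above. Group them by type across all elements.
entities_by_type = defaultdict(set)
for element in elements:
    for item in element.get("metadata", {}).get("entities", []):
        entities_by_type[item["type"]].add(item["entity"])

for entity_type, names in sorted(entities_by_type.items()):
    print(f"{entity_type}: {', '.join(sorted(names))}")
```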
## Generate a list of entities and their types

To generate a list of recognized entities and their types, in the **Task** drop-down list of an **Enrichment** node in a workflow, specify the following:

<Note>
You can change a workflow's NER settings only through [Custom](/platform/workflows#create-a-custom-workflow) workflow settings.

Entities are recognized only when the **Partitioner** node in a workflow is also set to use the **High Res** partitioning strategy. [Learn more](/platform/partitioning).
</Note>

1. Select **Named Entity Recognition (NER)**. By default, OpenAI's GPT-4o will follow a default set of instructions (called a _prompt_) to perform NER using a set of predefined entity types.
2. To use Anthropic's Claude 3.5 Sonnet to perform NER instead, or to customize the prompt, click **Edit**.
3. To switch to using Anthropic's Claude 3.5 Sonnet, click **Anthropic (Claude 3.5 Sonnet)**.
4. To experiment with running the default prompt against some sample data, click **Run Prompt**. The selected **Model** uses the **Prompt** to run NER on the **Input sample** and shows the results in the **Output**. Look specifically at the `response_json` field for the entities that were recognized and their types.
5. To customize the prompt, change the contents of **Prompt**.

<Note>
For best results, Unstructured strongly recommends that you limit your changes to these portions of the default prompt:

- Adding, renaming, or deleting items in the list of predefined types (such as `PERSON`, `ORGANIZATION`, `LOCATION`, and so on).
- As needed, adding any clarifying instructions only between these two lines:

  ```text
  ...
  Provide the entities and their corresponding types as a structured JSON response.

  (Add any clarifying instructions here only.)

  [START OF TEXT]
  ...
  ```

Changing any other portions of the default prompt could produce unexpected results.
</Note>

6. To experiment with different data, change the contents of **Input sample**. For best results, Unstructured strongly recommends that the JSON structure in **Input sample** be preserved.
7. When you are satisfied with the **Model** and **Prompt** that you want to use, click **Save**.

platform/enriching/overview.mdx

Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
---
title: Overview
---

_Enriching_ adds enhancements to the processed data that Unstructured produces. These enrichments include:

- Providing a summarized description of the contents of a detected image. [Learn more](/platform/enriching/image-descriptions).
- Providing a summarized description of the contents of a detected table. [Learn more](/platform/enriching/table-descriptions).
- Providing a representation of a detected table in HTML markup format. [Learn more](/platform/enriching/table-to-html).

To add an enrichment, in the **Task** drop-down list of an **Enrichment** node in a workflow, select one of the following enrichment types:

<Note>
You can change a workflow's enrichment settings only through [Custom](/platform/workflows#create-a-custom-workflow) workflow settings.

Enrichments work only when the **Partitioner** node in a workflow is also set to use the **High Res** partitioning strategy. [Learn more](/platform/partitioning).
</Note>

- **Image Description** to provide a summarized description of the contents of each detected image. [Learn more](/platform/enriching/image-descriptions).
- **Table Description** to provide a summarized description of the contents of each detected table. [Learn more](/platform/enriching/table-descriptions).
- **Table to HTML** to provide a representation of each detected table in HTML markup format. [Learn more](/platform/enriching/table-to-html).

To add multiple enrichments, create an additional **Enrichment** node for each enrichment type that you want to add.
platform/enriching/table-descriptions.mdx

Lines changed: 65 additions & 0 deletions
@@ -0,0 +1,65 @@
---
title: Table descriptions
---

After partitioning and chunking, you can have Unstructured generate text-based summaries of detected tables.

This summarization is done by using models offered through these providers:

- [GPT-4o](https://openai.com/index/hello-gpt-4o/), provided through OpenAI.
- [Claude 3.5 Sonnet](https://www.anthropic.com/news/claude-3-5-sonnet), provided through Anthropic.
- [Claude 3.5 Sonnet](https://aws.amazon.com/bedrock/claude/), provided through Amazon Bedrock.

Here is an example of the output for a detected table, generated by using GPT-4o. Note specifically the `text` field that is added. Line breaks have been inserted here for readability; the output will not contain these line breaks.

```json
{
    "type": "Table",
    "element_id": "5713c0e90194ac7f0f2c60dd614bd24d",
    "text": "The table consists of 6 rows and 7 columns. The columns represent
    inhibitor concentration (g), bc (V/dec), ba (V/dec), Ecorr (V), icorr
    (A/cm\u00b2), polarization resistance (\u03a9), and corrosion rate
    (mm/year). As the inhibitor concentration increases, the corrosion
    rate generally decreases, indicating the effectiveness of the
    inhibitor. Notably, the polarization resistance increases with higher
    inhibitor concentrations, peaking at 6 grams before slightly
    decreasing. This suggests that the inhibitor is most effective at
    6 grams, significantly reducing the corrosion rate and increasing
    polarization resistance. The data provides valuable insights into the
    optimal concentration of the inhibitor for corrosion prevention.",
    "metadata": {
        "text_as_html": "<table>...<full results omitted for brevity>...</table>",
        "filetype": "application/pdf",
        "languages": [
            "eng"
        ],
        "page_number": 1,
        "image_base64": "/9j...<full results omitted for brevity>...//Z",
        "image_mime_type": "image/jpeg",
        "filename": "7f239e1d4ef3556cc867a4bd321bbc41.pdf",
        "data_source": {}
    }
}
```

The generated table summary will overwrite any previous contents in the `text` field. The table's original content is available in the `image_base64` field.

Any embeddings that are produced after these summaries are generated will be based on the new `text` field's contents.
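Because the summary replaces the table's original `text`, you may want to keep the original rendering alongside it. As an illustration only, here is a minimal sketch that decodes each table's `image_base64` value back into an image file, assuming the workflow's output elements have been saved locally to a hypothetical file named `elements.json`:

```python
import base64
import json

# Load a local file of Unstructured output elements (hypothetical path).
with open("elements.json", "r", encoding="utf-8") as f:
    elements = json.load(f)

for element in elements:
    metadata = element.get("metadata", {})
    if element.get("type") == "Table" and metadata.get("image_base64"):
        # The original table rendering is preserved as a base64-encoded image;
        # image_mime_type indicates the format (JPEG in the example above).
        extension = metadata.get("image_mime_type", "image/jpeg").split("/")[-1]
        with open(f"table-{element['element_id']}.{extension}", "wb") as out:
            out.write(base64.b64decode(metadata["image_base64"]))
        print(f"Summary for table-{element['element_id']}: {element['text']}")
```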
## Generate table descriptions

To generate table descriptions, in the **Task** drop-down list of an **Enrichment** node in a workflow, specify the following:

<Note>
You can change a workflow's table description settings only through [Custom](/platform/workflows#create-a-custom-workflow) workflow settings.

Table summaries are generated only when the **Partitioner** node in a workflow is also set to use the **High Res** partitioning strategy. [Learn more](/platform/partitioning).
</Note>

Select **Table Description**, and then choose one of the following provider (and model) combinations to use:

- **OpenAI (GPT-4o)**. [Learn more](https://openai.com/index/hello-gpt-4o/).
- **Anthropic (Claude 3.5 Sonnet)**. [Learn more](https://www.anthropic.com/news/claude-3-5-sonnet).
- **Amazon Bedrock (Claude 3.5 Sonnet)**. [Learn more](https://aws.amazon.com/bedrock/claude/).
platform/enriching/table-to-html.mdx

Lines changed: 71 additions & 0 deletions
@@ -0,0 +1,71 @@
---
title: Tables to HTML
---

After partitioning and chunking, you can have Unstructured generate representations of each detected table in HTML markup format.

This table-to-HTML output is done by using [GPT-4o](https://openai.com/index/hello-gpt-4o/), provided through OpenAI.

Here is an example of the HTML markup output for a detected table, generated by using GPT-4o. Note specifically the `text_as_html` field that is added. Line breaks have been inserted here for readability; the output will not contain these line breaks.

```json
{
    "type": "Table",
    "element_id": "31aa654088742f1388d46ea9c8878272",
    "text": "Inhibitor Polarization Corrosion be (V/dec) ba (V/dec) Ecorr (V) icorr
    (AJcm?) concentration (g) resistance (Q) rate (mmj/year) 0.0335 0.0409
    \u20140.9393 0.0003 24.0910 2.8163 1.9460 0.0596 .8276 0.0002 121.440
    1.5054 0.0163 0.2369 .8825 0.0001 42121 0.9476 s NO 03233 0.0540
    \u20140.8027 5.39E-05 373.180 0.4318 0.1240 0.0556 .5896 5.46E-05
    305.650 0.3772 = 5 0.0382 0.0086 .5356 1.24E-05 246.080 0.0919",
    "metadata": {
        "text_as_html": "```html\n
        <table>\n
        <tr>\n<th>Inhibitor concentration (g)</th>\n
        <th>bc (V/dec)</th>\n<th>ba (V/dec)</th>\n<th>Ecorr (V)</th>\n
        <th>icorr (A/cm\u00b2)</th>\n<th>Polarization resistance (\u03a9)</th>\n
        <th>Corrosion rate (mm/year)</th>\n
        </tr>\n
        <tr>\n
        <td>0</td>\n<td>0.0335</td>\n<td>0.0409</td>\n<td>\u22120.9393</td>\n
        <td>0.0003</td>\n<td>24.0910</td>\n<td>2.8163</td>\n
        </tr>\n
        <tr>\n
        <td>2</td>\n<td>1.9460</td>\n<td>0.0596</td>\n<td>\u22120.8276</td>\n<td>0.0002</td>\n<td>121.440</td>\n<td>1.5054</td>\n
        </tr>\n
        <tr>\n
        <td>4</td>\n<td>0.0163</td>\n<td>0.2369</td>\n<td>\u22120.8825</td>\n<td>0.0001</td>\n<td>42.121</td>\n<td>0.9476</td>\n
        </tr>\n
        <tr>\n
        <td>6</td>\n<td>0.3233</td>\n<td>0.0540</td>\n<td>\u22120.8027</td>\n<td>5.39E-05</td>\n<td>373.180</td>\n<td>0.4318</td>\n
        </tr>\n
        <tr>\n
        <td>8</td>\n<td>0.1240</td>\n<td>0.0556</td>\n<td>\u22120.5896</td>\n<td>5.46E-05</td>\n<td>305.650</td>\n<td>0.3772</td>\n
        </tr>\n
        <tr>\n
        <td>10</td>\n<td>0.0382</td>\n<td>0.0086</td>\n<td>\u22120.5356</td>\n<td>1.24E-05</td>\n<td>246.080</td>\n<td>0.0919</td>\n
        </tr>\n
        </table>\n```",
        "filetype": "application/pdf",
        "languages": [
            "eng"
        ],
        "page_number": 1,
        "image_base64": "/9j...<full results omitted for brevity>...//Z",
        "image_mime_type": "image/jpeg",
        "filename": "embedded-images-tables.pdf",
        "data_source": {}
    }
}
```
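Note that in this example, the model wrapped the generated markup in a Markdown code fence inside the `text_as_html` string. As a downstream illustration only, here is a minimal sketch that strips such a wrapper, if present, and writes each table's HTML to its own file, assuming the workflow's output elements have been saved locally to a hypothetical file named `elements.json`:

```python
import json

# Load a local file of Unstructured output elements (hypothetical path).
with open("elements.json", "r", encoding="utf-8") as f:
    elements = json.load(f)

for element in elements:
    html = element.get("metadata", {}).get("text_as_html", "").strip()
    if element.get("type") == "Table" and html:
        # Strip a surrounding Markdown code fence, if the model added one.
        if html.startswith("```") and "\n" in html:
            html = html.split("\n", 1)[1].rsplit("```", 1)[0].strip()
        with open(f"table-{element['element_id']}.html", "w", encoding="utf-8") as out:
            out.write(html)
```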
## Generate table-to-HTML output

To generate table-to-HTML output, in the **Task** drop-down list of an **Enrichment** node in a workflow, select **Table to HTML**.

<Note>
You can change a workflow's table-to-HTML settings only through [Custom](/platform/workflows#create-a-custom-workflow) workflow settings.

Table-to-HTML output is generated only when the **Partitioner** node in a workflow is set to use the **High Res** partitioning strategy. [Learn more](/platform/partitioning).
</Note>
