
Commit ded99e2

Enrichment TOC refactoring (#470)
1 parent b80977f commit ded99e2

File tree

7 files changed: +365 -6 lines changed


mint.json

Lines changed: 9 additions & 1 deletion
@@ -593,7 +593,15 @@
       "platform/document-elements",
       "platform/partitioning",
       "platform/chunking",
-      "platform/summarizing",
+      {
+        "group": "Enriching",
+        "pages": [
+          "platform/enriching/overview",
+          "platform/enriching/image-descriptions",
+          "platform/enriching/table-descriptions",
+          "platform/enriching/table-to-html"
+        ]
+      },
       "platform/embedding"
     ]
   },
platform/enriching/image-descriptions.mdx

Lines changed: 58 additions & 0 deletions
@@ -0,0 +1,58 @@
---
title: Image descriptions
---

After partitioning and chunking, you can have Unstructured generate text-based summaries of detected images.

This summarization is done by using models offered through these providers:

- [GPT-4o](https://openai.com/index/hello-gpt-4o/), provided through OpenAI.
- [Claude 3.5 Sonnet](https://www.anthropic.com/news/claude-3-5-sonnet), provided through Anthropic.
- [Claude 3.5 Sonnet](https://aws.amazon.com/bedrock/claude/), provided through Amazon Bedrock.

Here is an example of the output for a detected image, generated by using GPT-4o. Note specifically the `text` field that is added. Line breaks have been inserted here for readability; the output will not contain these line breaks.

```json
{
    "type": "Image",
    "element_id": "3303aa13098f5a26b9845bd18ee8c881",
    "text": "{\n  \"type\": \"graph\",\n  \"description\": \"The graph shows
    the relationship between Potential (V) and Current Density (A/cm2).
    The x-axis is labeled 'Current Density (A/cm2)' and ranges from
    0.0000001 to 0.1. The y-axis is labeled 'Potential (V)' and ranges
    from -2.5 to 1.5. There are six different data series represented
    by different colors: blue (10g), red (4g), green (6g), purple (2g),
    orange (Control), and light blue (8g). The data points for each series
    show how the potential changes with varying current density.\"\n}",
    "metadata": {
        "filetype": "application/pdf",
        "languages": [
            "eng"
        ],
        "page_number": 1,
        "image_base64": "/9j...<full results omitted for brevity>...Q==",
        "image_mime_type": "image/jpeg",
        "filename": "7f239e1d4ef3556cc867a4bd321bbc41.pdf",
        "data_source": {}
    }
}
```

Any embeddings that are produced after these summaries are generated will be based on the `text` field's contents.
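If you process this output programmatically, the generated descriptions are simply the `text` fields of the `Image` elements. Here is a minimal sketch of collecting them in Python, assuming (for illustration only) that the workflow's output elements have been saved locally to a hypothetical file named `elements.json`:

```python
import json

# Load a local file of Unstructured output elements (hypothetical path).
with open("elements.json", "r", encoding="utf-8") as f:
    elements = json.load(f)

# After enrichment, each Image element's "text" field holds the generated
# description, which is also what any downstream embeddings are based on.
image_descriptions = [
    element["text"]
    for element in elements
    if element.get("type") == "Image" and element.get("text")
]

for description in image_descriptions:
    print(description)
```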
## Generate image descriptions

To generate image descriptions, in the **Task** drop-down list of an **Enrichment** node in a workflow, specify the following:

<Note>
You can change a workflow's image description settings only through [Custom](/platform/workflows#create-a-custom-workflow) workflow settings.

Image summaries are generated only when the **Partitioner** node in a workflow is also set to use the **High Res** partitioning strategy. [Learn more](/platform/partitioning).
</Note>

Select **Image Description**, and then choose one of the following provider (and model) combinations to use:

- **OpenAI (GPT-4o)**. [Learn more](https://openai.com/index/hello-gpt-4o/).
- **Anthropic (Claude 3.5 Sonnet)**. [Learn more](https://www.anthropic.com/news/claude-3-5-sonnet).
- **Amazon Bedrock (Claude 3.5 Sonnet)**. [Learn more](https://aws.amazon.com/bedrock/claude/).

platform/enriching/ner.mdx

Lines changed: 127 additions & 0 deletions
@@ -0,0 +1,127 @@
---
title: Named entity recognition (NER)
---

After partitioning and chunking, you can have Unstructured generate a list of recognized entities and their types (such as the names of organizations, products, and people) in the content, through a process known as _named entity recognition_ (NER).

This NER is done by using models offered through these providers:

- [GPT-4o](https://openai.com/index/hello-gpt-4o/), provided through OpenAI.
- [Claude 3.5 Sonnet](https://www.anthropic.com/news/claude-3-5-sonnet), provided through Anthropic.

Here is an example of a list of recognized entities and their types, generated by using GPT-4o. Note specifically the `entities` field that is added.

```json
{
    "type": "CompositeElement",
    "element_id": "bc8333ea0d374670ff0bd03c6126e70d",
    "text": "SECTION. 3\n\nThe Senate of the United States shall be composed of two Senators from each State,
    [chosen by the Legislature there- of,]* for six Years; and each Senator shall have one Vote.\n\n
    Immediately after they shall be assembled in Consequence of the first Election, they shall be divided
    as equally as may be into three Classes. The Seats of the Senators of the first Class shall be vacated
    at the Expiration of the second Year, of the second Class at the Expiration of the fourth Year, and of
    the third Class at the Expiration of the sixth Year, so that one third may be chosen every second Year;
    [and if Vacan- cies happen by Resignation, or otherwise, during the Recess of the Legislature of any
    State, the Executive thereof may make temporary Appointments until the next Meeting of the Legislature,
    which shall then fill such Vacancies.]*\n\nC O N S T I T U T I O N O F T H E U N I T E D S T A T E S",
    "metadata": {
        "filename": "constitution.pdf",
        "filetype": "application/pdf",
        "languages": [
            "eng"
        ],
        "page_number": 2,
        "entities": [
            {
                "entity": "Senate",
                "type": "ORGANIZATION"
            },
            {
                "entity": "United States",
                "type": "LOCATION"
            },
            {
                "entity": "Senators",
                "type": "PERSON"
            },
            {
                "entity": "State",
                "type": "LOCATION"
            },
            {
                "entity": "Legislature",
                "type": "ORGANIZATION"
            },
            {
                "entity": "six Years",
                "type": "DATE"
            },
            {
                "entity": "first Election",
                "type": "EVENT"
            },
            {
                "entity": "second Year",
                "type": "DATE"
            },
            {
                "entity": "fourth Year",
                "type": "DATE"
            },
            {
                "entity": "sixth Year",
                "type": "DATE"
            },
            {
                "entity": "Executive",
                "type": "PERSON"
            },
            {
                "entity": "C O N S T I T U T I O N O F T H E U N I T E D S T A T E S",
                "type": "ARTIFACT"
            }
        ]
    }
}
```
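As a downstream illustration only, here is a minimal sketch of grouping the recognized entities by type in Python, assuming the workflow's output elements have been saved locally to a hypothetical file named `elements.json`:

```python
import json
from collections import defaultdict

# Load a local file of Unstructured output elements (hypothetical path).
with open("elements.json", "r", encoding="utf-8") as f:
    elements = json.load(f)

# The NER enrichment stores recognized entities under metadata["entities"],
# as shown in the example above. Group them by type across all elements.
entities_by_type = defaultdict(set)
for element in elements:
    for item in element.get("metadata", {}).get("entities", []):
        entities_by_type[item["type"]].add(item["entity"])

for entity_type, names in sorted(entities_by_type.items()):
    print(f"{entity_type}: {', '.join(sorted(names))}")
```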
## Generate a list of entities and their types

To generate a list of recognized entities and their types, in the **Task** drop-down list of an **Enrichment** node in a workflow, specify the following:

<Note>
You can change a workflow's NER settings only through [Custom](/platform/workflows#create-a-custom-workflow) workflow settings.

Entities are recognized only when the **Partitioner** node in a workflow is also set to use the **High Res** partitioning strategy. [Learn more](/platform/partitioning).
</Note>

1. Select **Named Entity Recognition (NER)**. By default, OpenAI's GPT-4o will follow a default set of instructions (called a _prompt_) to perform NER using a set of predefined entity types.
2. To use Anthropic's Claude 3.5 Sonnet to perform NER instead, or to customize the prompt, click **Edit**.
3. To switch to using Anthropic's Claude 3.5 Sonnet, click **Anthropic (Claude 3.5 Sonnet)**.
4. To experiment with running the default prompt against some sample data, click **Run Prompt**. The selected **Model** uses the **Prompt** to run NER on the **Input sample** and shows the results in the **Output**. Look specifically at the `response_json` field for the entities that were recognized and their types.
5. To customize the prompt, change the contents of **Prompt**.

<Note>
For best results, Unstructured strongly recommends that you limit your changes to these portions of the default prompt:

- Adding, renaming, or deleting items in the list of predefined types (such as `PERSON`, `ORGANIZATION`, `LOCATION`, and so on).
- As needed, adding any clarifying instructions only between these two lines:

  ```text
  ...
  Provide the entities and their corresponding types as a structured JSON response.

  (Add any clarifying instructions here only.)

  [START OF TEXT]
  ...
  ```

Changing any other portions of the default prompt could produce unexpected results.
</Note>

6. To experiment with different data, change the contents of **Input sample**. For best results, Unstructured strongly recommends that the JSON structure in **Input sample** be preserved.
7. When you are satisfied with the **Model** and **Prompt** that you want to use, click **Save**.

platform/enriching/overview.mdx

Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
---
title: Overview
---

_Enriching_ adds enhancements to the processed data that Unstructured produces. These enrichments include:

- Providing a summarized description of the contents of a detected image. [Learn more](/platform/enriching/image-descriptions).
- Providing a summarized description of the contents of a detected table. [Learn more](/platform/enriching/table-descriptions).
- Providing a representation of a detected table in HTML markup format. [Learn more](/platform/enriching/table-to-html).

To add an enrichment, in the **Task** drop-down list of an **Enrichment** node in a workflow, select one of the following enrichment types:

<Note>
You can change a workflow's enrichment settings only through [Custom](/platform/workflows#create-a-custom-workflow) workflow settings.

Enrichments work only when the **Partitioner** node in a workflow is also set to use the **High Res** partitioning strategy. [Learn more](/platform/partitioning).
</Note>

- **Image Description** to provide a summarized description of the contents of each detected image. [Learn more](/platform/enriching/image-descriptions).
- **Table Description** to provide a summarized description of the contents of each detected table. [Learn more](/platform/enriching/table-descriptions).
- **Table to HTML** to provide a representation of each detected table in HTML markup format. [Learn more](/platform/enriching/table-to-html).

To add multiple enrichments, create an additional **Enrichment** node for each enrichment type that you want to add.
platform/enriching/table-descriptions.mdx

Lines changed: 65 additions & 0 deletions
@@ -0,0 +1,65 @@
---
title: Table descriptions
---

After partitioning and chunking, you can have Unstructured generate text-based summaries of detected tables.

This summarization is done by using models offered through these providers:

- [GPT-4o](https://openai.com/index/hello-gpt-4o/), provided through OpenAI.
- [Claude 3.5 Sonnet](https://www.anthropic.com/news/claude-3-5-sonnet), provided through Anthropic.
- [Claude 3.5 Sonnet](https://aws.amazon.com/bedrock/claude/), provided through Amazon Bedrock.

Here is an example of the output for a detected table, generated by using GPT-4o. Note specifically the `text` field that is added. Line breaks have been inserted here for readability; the output will not contain these line breaks.

```json
{
    "type": "Table",
    "element_id": "5713c0e90194ac7f0f2c60dd614bd24d",
    "text": "The table consists of 6 rows and 7 columns. The columns represent
    inhibitor concentration (g), bc (V/dec), ba (V/dec), Ecorr (V), icorr
    (A/cm\u00b2), polarization resistance (\u03a9), and corrosion rate
    (mm/year). As the inhibitor concentration increases, the corrosion
    rate generally decreases, indicating the effectiveness of the
    inhibitor. Notably, the polarization resistance increases with higher
    inhibitor concentrations, peaking at 6 grams before slightly
    decreasing. This suggests that the inhibitor is most effective at
    6 grams, significantly reducing the corrosion rate and increasing
    polarization resistance. The data provides valuable insights into the
    optimal concentration of the inhibitor for corrosion prevention.",
    "metadata": {
        "text_as_html": "<table>...<full results omitted for brevity>...</table>",
        "filetype": "application/pdf",
        "languages": [
            "eng"
        ],
        "page_number": 1,
        "image_base64": "/9j...<full results omitted for brevity>...//Z",
        "image_mime_type": "image/jpeg",
        "filename": "7f239e1d4ef3556cc867a4bd321bbc41.pdf",
        "data_source": {}
    }
}
```

The generated table summary will overwrite any previous contents in the `text` field. The table's original content is available in the `image_base64` field.

Any embeddings that are produced after these summaries are generated will be based on the new `text` field's contents.
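Because the summary replaces the table's original `text`, you may want to keep the original rendering alongside it. As an illustration only, here is a minimal sketch that decodes each table's `image_base64` value back into an image file, assuming the workflow's output elements have been saved locally to a hypothetical file named `elements.json`:

```python
import base64
import json

# Load a local file of Unstructured output elements (hypothetical path).
with open("elements.json", "r", encoding="utf-8") as f:
    elements = json.load(f)

for element in elements:
    metadata = element.get("metadata", {})
    if element.get("type") == "Table" and metadata.get("image_base64"):
        # The original table rendering is preserved as a base64-encoded image;
        # image_mime_type indicates the format (JPEG in the example above).
        extension = metadata.get("image_mime_type", "image/jpeg").split("/")[-1]
        with open(f"table-{element['element_id']}.{extension}", "wb") as out:
            out.write(base64.b64decode(metadata["image_base64"]))
        print(f"Summary for table-{element['element_id']}: {element['text']}")
```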
## Generate table descriptions

To generate table descriptions, in the **Task** drop-down list of an **Enrichment** node in a workflow, specify the following:

<Note>
You can change a workflow's table description settings only through [Custom](/platform/workflows#create-a-custom-workflow) workflow settings.

Table summaries are generated only when the **Partitioner** node in a workflow is also set to use the **High Res** partitioning strategy. [Learn more](/platform/partitioning).
</Note>

Select **Table Description**, and then choose one of the following provider (and model) combinations to use:

- **OpenAI (GPT-4o)**. [Learn more](https://openai.com/index/hello-gpt-4o/).
- **Anthropic (Claude 3.5 Sonnet)**. [Learn more](https://www.anthropic.com/news/claude-3-5-sonnet).
- **Amazon Bedrock (Claude 3.5 Sonnet)**. [Learn more](https://aws.amazon.com/bedrock/claude/).
platform/enriching/table-to-html.mdx

Lines changed: 71 additions & 0 deletions
@@ -0,0 +1,71 @@
---
title: Tables to HTML
---

After partitioning and chunking, you can have Unstructured generate representations of each detected table in HTML markup format.

This table-to-HTML output is done by using [GPT-4o](https://openai.com/index/hello-gpt-4o/), provided through OpenAI.

Here is an example of the HTML markup output for a detected table, generated by using GPT-4o. Note specifically the `text_as_html` field that is added. Line breaks have been inserted here for readability; the output will not contain these line breaks.

```json
{
    "type": "Table",
    "element_id": "31aa654088742f1388d46ea9c8878272",
    "text": "Inhibitor Polarization Corrosion be (V/dec) ba (V/dec) Ecorr (V) icorr
    (AJcm?) concentration (g) resistance (Q) rate (mmj/year) 0.0335 0.0409
    \u20140.9393 0.0003 24.0910 2.8163 1.9460 0.0596 .8276 0.0002 121.440
    1.5054 0.0163 0.2369 .8825 0.0001 42121 0.9476 s NO 03233 0.0540
    \u20140.8027 5.39E-05 373.180 0.4318 0.1240 0.0556 .5896 5.46E-05
    305.650 0.3772 = 5 0.0382 0.0086 .5356 1.24E-05 246.080 0.0919",
    "metadata": {
        "text_as_html": "```html\n
        <table>\n
        <tr>\n<th>Inhibitor concentration (g)</th>\n
        <th>bc (V/dec)</th>\n<th>ba (V/dec)</th>\n<th>Ecorr (V)</th>\n
        <th>icorr (A/cm\u00b2)</th>\n<th>Polarization resistance (\u03a9)</th>\n
        <th>Corrosion rate (mm/year)</th>\n
        </tr>\n
        <tr>\n
        <td>0</td>\n<td>0.0335</td>\n<td>0.0409</td>\n<td>\u22120.9393</td>\n
        <td>0.0003</td>\n<td>24.0910</td>\n<td>2.8163</td>\n
        </tr>\n
        <tr>\n
        <td>2</td>\n<td>1.9460</td>\n<td>0.0596</td>\n<td>\u22120.8276</td>\n<td>0.0002</td>\n<td>121.440</td>\n<td>1.5054</td>\n
        </tr>\n
        <tr>\n
        <td>4</td>\n<td>0.0163</td>\n<td>0.2369</td>\n<td>\u22120.8825</td>\n<td>0.0001</td>\n<td>42.121</td>\n<td>0.9476</td>\n
        </tr>\n
        <tr>\n
        <td>6</td>\n<td>0.3233</td>\n<td>0.0540</td>\n<td>\u22120.8027</td>\n<td>5.39E-05</td>\n<td>373.180</td>\n<td>0.4318</td>\n
        </tr>\n
        <tr>\n
        <td>8</td>\n<td>0.1240</td>\n<td>0.0556</td>\n<td>\u22120.5896</td>\n<td>5.46E-05</td>\n<td>305.650</td>\n<td>0.3772</td>\n
        </tr>\n
        <tr>\n
        <td>10</td>\n<td>0.0382</td>\n<td>0.0086</td>\n<td>\u22120.5356</td>\n<td>1.24E-05</td>\n<td>246.080</td>\n<td>0.0919</td>\n
        </tr>\n
        </table>\n```",
        "filetype": "application/pdf",
        "languages": [
            "eng"
        ],
        "page_number": 1,
        "image_base64": "/9j...<full results omitted for brevity>...//Z",
        "image_mime_type": "image/jpeg",
        "filename": "embedded-images-tables.pdf",
        "data_source": {}
    }
}
```
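Note that in this example, the model wrapped the generated markup in a Markdown code fence inside the `text_as_html` string. As a downstream illustration only, here is a minimal sketch that strips such a wrapper, if present, and writes each table's HTML to its own file, assuming the workflow's output elements have been saved locally to a hypothetical file named `elements.json`:

```python
import json

# Load a local file of Unstructured output elements (hypothetical path).
with open("elements.json", "r", encoding="utf-8") as f:
    elements = json.load(f)

for element in elements:
    html = element.get("metadata", {}).get("text_as_html", "").strip()
    if element.get("type") == "Table" and html:
        # Strip a surrounding Markdown code fence, if the model added one.
        if html.startswith("```") and "\n" in html:
            html = html.split("\n", 1)[1].rsplit("```", 1)[0].strip()
        with open(f"table-{element['element_id']}.html", "w", encoding="utf-8") as out:
            out.write(html)
```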
## Generate table-to-HTML output

To generate table-to-HTML output, in the **Task** drop-down list of an **Enrichment** node in a workflow, select **Table to HTML**.

<Note>
You can change a workflow's table-to-HTML settings only through [Custom](/platform/workflows#create-a-custom-workflow) workflow settings.

Table-to-HTML output is generated only when the **Partitioner** node in a workflow is set to use the **High Res** partitioning strategy. [Learn more](/platform/partitioning).
</Note>
