Commit 9662956

Workflow DAG node order and behaviors: enrichments before chunking, chunking affects enrichments (#698)
1 parent 8040968 commit 9662956

9 files changed: +282 -245 lines changed


api-reference/workflow/workflows.mdx

Lines changed: 232 additions & 227 deletions
Large diffs are not rendered by default.

snippets/quickstarts/single-file-ui.mdx

Lines changed: 7 additions & 7 deletions
@@ -132,17 +132,17 @@ import EnrichmentImagesTablesHiResOnly from '/snippets/general-shared-text/enric
132132
allowfullscreen
133133
></iframe>
134134

135-
- Add a **Chunker** node after the **Partitioner** node, to chunk the partitioned data into smaller pieces for your retrieval-augmented generation (RAG) applications.
136-
To do this, click the add (**+**) button to the right of the **Partitioner** node, and then click **Enrich > Chunker**. Click the new **Chunker** node and
137-
specify its settings. For help, click the **FAQ** button in the **Chunker** node's pane. [Learn more about chunking and chunker settings](/ui/chunking).
138-
- Add an **Enrichment** node after the **Chunker** node, to apply enrichments to the chunked data such as image summaries, table summaries, table-to-HTML transforms, and
139-
named entity recognition (NER). To do this, click the add (**+**) button to the right of the **Chunker** node, and then click **Enrich > Enrichment**.
135+
- Add an **Enrichment** node after the **Partitioner** node, to apply enrichments to the partitioned data such as image summaries, table summaries, table-to-HTML transforms, and
136+
named entity recognition (NER). To do this, click the add (**+**) button to the right of the **Partitioner** node, and then click **Enrich > Enrichment**.
140137
Click the new **Enrichment** node and specify its settings. For help, click the **FAQ** button in the **Enrichment** node's pane. [Learn more about enrichments and enrichment settings](/ui/enriching/overview).
141138

142139
<EnrichmentImagesTablesHiResOnly />
143140

144-
- Add an **Embedder** node after the **Enrichment** node, to generate vector embeddings for performing vector-based searches. To do this, click the add (**+**) button to the
145-
right of the **Enrichment** node, and then click **Transform > Embedder**. Click the new **Embedder** node and specify its settings. For help, click the **FAQ** button
141+
- Add a **Chunker** node after the **Enrichment** node, to chunk the enriched data into smaller pieces for your retrieval-augmented generation (RAG) applications.
142+
To do this, click the add (**+**) button to the right of the **Enrichment** node, and then click **Enrich > Chunker**. Click the new **Chunker** node and
143+
specify its settings. For help, click the **FAQ** button in the **Chunker** node's pane. [Learn more about chunking and chunker settings](/ui/chunking).
144+
- Add an **Embedder** node after the **Chunker** node, to generate vector embeddings for performing vector-based searches. To do this, click the add (**+**) button to the
145+
right of the **Chunker** node, and then click **Transform > Embedder**. Click the new **Embedder** node and specify its settings. For help, click the **FAQ** button
146146
in the **Embedder** node's pane. [Learn more about embedding and embedding settings](/ui/embedding).
147147

148148
2. Each time you add a node or change its settings, you can click **Test** above the **Source** node again to test the current workflow end to end and see the results of the changes, if any.

ui/chunking.mdx

Lines changed: 4 additions & 0 deletions
@@ -22,6 +22,10 @@ text element that was too big to fit in one chunk and required splitting.
2222
- `Table`: A table element is not combined with other elements, and if it fits within the max characters setting it will remain as is.
2323
- `TableChunk`: Large tables that exceed the max characters setting are split into special `TableChunk` elements.
2424

25+
<Note>
26+
During chunking, Unstructured removes all detected `Image` elements from the output.
27+
</Note>
28+
2529
Here are a few examples:
2630

2731
```json

ui/enriching/image-descriptions.mdx

Lines changed: 10 additions & 2 deletions
@@ -2,7 +2,7 @@
22
title: Image descriptions
33
---
44

5-
After partitioning and chunking, you can have Unstructured generate text-based summaries of detected images.
5+
After partitioning, you can have Unstructured generate text-based summaries of detected images.
66

77
This summarization is done by using models offered through these providers:
88

@@ -39,6 +39,12 @@ Line breaks have been inserted here for readability. The output will not contain
3939
}
4040
```
4141

42+
For workflows that use [chunking](/ui/chunking), note the following changes:
43+
44+
- Each `Image` element is replaced by a `CompositeElement` element.
45+
- This `CompositeElement` element will contain the image's summary description as part of the element's `text` field.
46+
- This `CompositeElement` element will not contain an `image_base64` field.
47+
4248
Here are three examples of the descriptions for detected images. These descriptions are generated with GPT-4o by OpenAI:
4349

4450
![Description of an image showing a scatter plot graph](/img/enriching/Image-Description-1.png)
@@ -57,7 +63,9 @@ To generate image descriptions, in an **Enrichment** node in a workflow, specify
5763

5864
<Note>
5965
You can change a workflow's image description settings only through [Custom](/ui/workflows#create-a-custom-workflow) workflow settings.
60-
66+
67+
For workflows that use [chunking](/ui/chunking), the **Chunker** node should be placed after all **Enrichment** nodes. Placing the
68+
**Chunker** node before an image descriptions **Enrichment** node could cause incomplete or no image descriptions to be generated.
6169
</Note>
6270

6371
<EnrichmentImageSummaryHiResOnly />
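
The chunking-related changes described in this file's diff (no `Image` elements after chunking, and no `image_base64` field on the resulting `CompositeElement` elements) could be spot-checked against a workflow's JSON output with a small script. The following is only a sketch: the function name `find_chunking_surprises` is hypothetical, and it assumes each element is a dictionary with a `type` key and that `image_base64` appears either at the top level or under a `metadata` key. Adjust it to match the actual output you see.

```python
from typing import Any, Dict, List


def find_chunking_surprises(elements: List[Dict[str, Any]]) -> List[str]:
    """Report elements that contradict the documented post-chunking shape.

    Hypothetical helper, not part of any Unstructured library.
    Assumes `image_base64` lives at the top level or under `metadata`.
    """
    problems: List[str] = []
    for index, element in enumerate(elements):
        element_type = element.get("type")
        metadata = element.get("metadata") or {}
        has_image_data = "image_base64" in element or "image_base64" in metadata

        if element_type == "Image":
            # The docs state that chunking removes Image elements.
            problems.append(f"element {index}: Image element present after chunking")
        elif element_type in ("CompositeElement", "TableChunk") and has_image_data:
            # The docs state that these elements carry no image_base64 field.
            problems.append(f"element {index}: {element_type} has image_base64")
    return problems


# Made-up sample data for illustration only.
sample_output = [
    {"type": "CompositeElement", "text": "A scatter plot of ...", "metadata": {}},
    {"type": "Image", "text": "", "metadata": {"image_base64": "abc"}},
]

for problem in find_chunking_surprises(sample_output):
    print(problem)
```

An empty result means the output matches the behavior described above. A non-empty result may indicate that the output was produced without a **Chunker** node, or that the assumed field locations differ from the real output.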

ui/enriching/ner.mdx

Lines changed: 1 addition & 3 deletions
@@ -2,7 +2,7 @@
22
title: Named entity recognition (NER)
33
---
44

5-
After partitioning and chunking, you can have Unstructured generate a list of recognized entities and their types (such as the names of organizations, products, and people) in the content, through a process known as _named entity recognition_ (NER).
5+
After partitioning, you can have Unstructured generate a list of recognized entities and their types (such as the names of organizations, products, and people) in the content, through a process known as _named entity recognition_ (NER).
66
You can also have Unstructured generate a list of relationships between the entities that are recognized.
77

88
This NER is done by using models offered through these providers:
@@ -144,8 +144,6 @@ To generate a list of recognized entities and their relationships, in an **Enric
144144

145145
<Note>
146146
You can change a workflow's NER settings only through [Custom](/ui/workflows#create-a-custom-workflow) workflow settings.
147-
148-
Entities are only recognized when the **Partitioner** node in a workflow is also set to use the **High Res** partitioning strategy. [Learn more](/ui/partitioning).
149147
</Note>
150148

151149
1. Select **Text**.

ui/enriching/table-descriptions.mdx

Lines changed: 11 additions & 3 deletions
@@ -2,7 +2,7 @@
22
title: Table descriptions
33
---
44

5-
After partitioning and chunking, you can have Unstructured generate text-based summaries of detected tables.
5+
After partitioning, you can have Unstructured generate text-based summaries of detected tables.
66

77
This summarization is done by using models offered through these providers:
88

@@ -49,8 +49,14 @@ Here are two examples of the descriptions for detected tables. These description
4949

5050
![Description of a table with information about potentiodynamic polarization of stainless steel](/img/enriching/Table-Description-2.png)
5151

52-
The generated table's summary will overwrite any previous contents in the `text` field. The table's original content is available
53-
in the `image_base64` field.
52+
The generated table's summary will overwrite any text that Unstructured had previously extracted from that table into the `text` field.
53+
The table's original content is available in the `image_base64` field.
54+
55+
For workflows that use [chunking](/ui/chunking), note the following changes:
56+
57+
- If a `Table` element must be chunked, the `Table` element is replaced by a set of related `TableChunk` elements.
58+
- Each of these `TableChunk` elements will contain a summary description only for its own element, as part of the element's `text` field.
59+
- These `TableChunk` elements will not contain an `image_base64` field.
5460

5561
Any embeddings that are produced after these summaries are generated will be based on the new `text` field's contents.
5662

@@ -63,6 +69,8 @@ To generate table descriptions, in an **Enrichment** node in a workflow, specify
6369
<Note>
6470
You can change a workflow's table description settings only through [Custom](/ui/workflows#create-a-custom-workflow) workflow settings.
6571

72+
For workflows that use [chunking](/ui/chunking), the **Chunker** node should be placed after all **Enrichment** nodes. Placing the
73+
**Chunker** node before a table descriptions **Enrichment** node could cause incomplete or no table descriptions to be generated.
6674
</Note>
6775

6876
<EnrichmentTableSummaryHiResOnly />

ui/enriching/table-to-html.mdx

Lines changed: 11 additions & 1 deletion
@@ -2,7 +2,7 @@
22
title: Tables to HTML
33
---
44

5-
After partitioning and chunking, you can have Unstructured generate representations of each detected table in HTML markup format.
5+
After partitioning, you can have Unstructured generate representations of each detected table in HTML markup format.
66

77
This table-to-HTML output is done by using [GPT-4o](https://openai.com/index/hello-gpt-4o/), provided through OpenAI.
88

@@ -60,6 +60,14 @@ Line breaks have been inserted here for readability. The output will not contain
6060
}
6161
```
6262

63+
For workflows that use [chunking](/ui/chunking), note the following changes:
64+
65+
- If a `Table` element must be chunked, the `Table` element is replaced by a set of related `TableChunk` elements.
66+
- Each of these `TableChunk` elements will contain HTML table output for only its own element.
67+
- None of these `TableChunk` elements will contain an `image_base64` field.
68+
69+
70+
6371
## Generate table-to-HTML output
6472

6573
import EnrichmentTableToHTMLHiResOnly from '/snippets/general-shared-text/enrichment-table-to-html-hi-res-only.mdx';
@@ -71,6 +79,8 @@ Make sure after you choose this provider and model, that **Table to HTML** is al
7179
<Note>
7280
You can change a workflow's table description settings only through [Custom](/ui/workflows#create-a-custom-workflow) workflow settings.
7381

82+
For workflows that use [chunking](/ui/chunking), the **Chunker** node should be placed after all **Enrichment** nodes. Placing the
83+
**Chunker** node before a table-to-HTML output **Enrichment** node could cause incomplete or no table-to-HTML output to be generated.
7484
</Note>
7585

7686
<EnrichmentTableToHTMLHiResOnly />

ui/summarizing.mdx

Lines changed: 1 addition & 1 deletion
@@ -2,7 +2,7 @@
22
title: Summarizing
33
---
44

5-
After partitioning and chunking, _summarizing_ generates text-based summaries of images and tables.
5+
After partitioning, _summarizing_ generates text-based summaries of images and tables.
66
This summarization is done by using models offered through these providers:
77

88
- [GPT-4o](https://openai.com/index/hello-gpt-4o/), provided through OpenAI.

ui/workflows.mdx

Lines changed: 5 additions & 1 deletion
@@ -157,7 +157,6 @@ If you did not previously set the workflow to run on a schedule, you can [run th
157157

158158
8. The workflow begins with the following layout:
159159

160-
161160
```mermaid
162161
flowchart LR
163162
Source-->Partitioner-->Destination
@@ -182,6 +181,11 @@ If you did not previously set the workflow to run on a schedule, you can [run th
182181
Source-->Partitioner-->Enrichment-->Chunker-->Embedder-->Destination
183182
```
184183

184+
<Note>
185+
For workflows that use **Chunker** and **Enrichment** nodes together, the **Chunker** node should be placed after all **Enrichment** nodes. Placing the
186+
**Chunker** node before any **Enrichment** nodes could cause incomplete or no enrichment results to be generated.
187+
</Note>
188+
185189
9. In the pipeline designer, click the **Source** node. In the **Source** pane, select the source location. Then click **Save**.
186190

187191
![Workflow designer](/img/ui/Workflow-Designer.png)
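
The ordering rule that this commit documents (the **Chunker** node placed after all **Enrichment** nodes) can be expressed as a simple check over a linear list of node names. This is a minimal sketch for illustration, not part of the Unstructured UI or API: the function name `chunker_follows_enrichments` and the plain-string node representation are assumptions made here.

```python
from typing import List


def chunker_follows_enrichments(nodes: List[str]) -> bool:
    """Return True if no Enrichment node appears after the first Chunker node.

    Hypothetical helper: `nodes` is a workflow's node names in left-to-right order.
    """
    if "Chunker" not in nodes:
        # Nothing to order against when the workflow does no chunking.
        return True
    first_chunker = nodes.index("Chunker")
    return "Enrichment" not in nodes[first_chunker + 1:]


# Layout recommended by this commit.
recommended = ["Source", "Partitioner", "Enrichment", "Chunker", "Embedder", "Destination"]
# Layout that the quickstart described before this commit.
previous = ["Source", "Partitioner", "Chunker", "Enrichment", "Embedder", "Destination"]

print(chunker_follows_enrichments(recommended))
print(chunker_follows_enrichments(previous))
```

The check returns `True` for the recommended layout and `False` for the previous one, in which an **Enrichment** node follows the **Chunker** node and could produce incomplete or no enrichment results.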
