articles/logic-apps/parse-document-chunk-text.md
Lines changed: 92 additions & 27 deletions
@@ -1,6 +1,6 @@
1
1
---
2
2
title: Parse document or chunk text
3
-
description: Parse a document or chunk text to use with Azure AI operations for Standard workflows in Azure Logic Apps.
3
+
description: Parse a document or chunk text for Standard workflows in Azure Logic Apps.
4
4
services: logic-apps
5
5
ms.suite: integration
6
6
ms.reviewer: estfan, azla
@@ -9,27 +9,29 @@ ms.date: 07/26/2024
9
9
# Customer intent: As a developer using Azure Logic Apps, I want to parse a document or chunk text that I want to use with Azure AI operations for my Standard workflow in Azure Logic Apps.
10
10
---
11
11
12
-
# Parse or chunk content to use with Azure AI operations for Standard workflows in Azure Logic Apps (Preview)
12
+
# Parse or chunk content for Standard workflows in Azure Logic Apps (Preview)
> This capability is in preview and is subject to the
18
18
> [Supplemental Terms of Use for Microsoft Azure Previews](https://azure.microsoft.com/support/legal/preview-supplemental-terms/).
19
19
20
-
To convert content, such as a PDF document, CSV file, or even an Excel file, into a format that you can more easily use with an Azure AI operation, such as **Azure AI Search** or **Azure OpenAI**, you can include the **Data Operations** actions named **Parse a document** and **Chunk text** in your Standard logic app workflow.
20
+
Sometimes you have to convert content into token form or break down a large document into smaller pieces before you can use this content with some actions. For example, actions such as **Azure AI Search** or **Azure OpenAI** expect tokenized input and can handle only a limited number of tokens, which are words or chunks of characters.
21
21
22
-
The following table describes these data operations:
22
+
For these scenarios, use the **Data Operations** actions named **Parse a document** and **Chunk text** in your Standard logic app workflow. The **Parse a document** action converts content, such as a PDF document, CSV file, Excel file, and so on, into a tokenized string. The **Chunk text** action then splits the string into pieces, based on the number of tokens or characters. You can then reference and use these outputs with subsequent actions in your workflow.
23
23
24
-
| Data operation | Description |
25
-
|----------------|-------------|
26
-
|**Parse a document**| Convert the specified content into a string with tokens that represent outputs, which you can reference and use with subsequent actions in your workflow. |
27
-
|**Chunk text**| Split the specified content into pieces, based on the selected strategy: <br><br>- **FixedLength** - number of characters: Provide the maximum number of characters per chunk and the language to use. <br><br>- **TokenSize** - number of tokens: Provide the maximum number of tokens per chunk and the encoding model to use. |
28
-
29
-
> [!NOTE]
24
+
> [!TIP]
30
25
>
31
-
> Preceding actions that use chunking don't affect the **Chunk text** action,
32
-
> nor does the **Chunk text** action affect subsequent actions that use chunking.
26
+
> To learn more, you can ask Azure Copilot these questions:
27
+
>
28
+
> - *What is a token in AI?*
29
+
> - *What is tokenized input?*
30
+
> - *What is parsing in AI?*
31
+
> - *What is a tokenized string?*
32
+
> - *What is chunking in AI?*
33
+
>
34
+
> To find Azure Copilot, on the [Azure portal](https://portal.azure.com) toolbar, select **Copilot**.
33
35
34
36
This how-to guide shows how to add and set up these operations in your workflow.
35
37
@@ -41,15 +43,17 @@ This how-to guide shows how to add and set up these operations in your workflow.
41
43
42
44
## Parse a document
43
45
44
-
For this example, suppose your workflow starts with the **Request** trigger named **When a HTTP request is received**. This trigger waits to receive an HTTP request sent from another component, such as an Azure function or another logic app workflow. The HTTP request indicates that content is available for the workflow to retrieve and parse. An **HTTP** action immediately follows the trigger and gets the content from its storage location.
46
+
The **Parse a document** action converts content, such as a PDF document, CSV file, Excel file, and so on, into a tokenized string. For this example, suppose your workflow starts with the **Request** trigger named **When a HTTP request is received**. This trigger waits to receive an HTTP request sent from another component, such as an Azure function, another logic app workflow, and so on. The HTTP request includes the URL for a newly uploaded document that is available for the workflow to retrieve and parse. An **HTTP** action immediately follows the trigger, sends an HTTP request to the document's URL, and returns the document content from its storage location.
45
47
46
-
If you use other content sources, such as Azure Blob Storage, Office 365 Outlook, or other services, you can check whether they include appropriate triggers. You can also check for other actions that can retrieve content, such as Azure Blob Storage, File System, FTP, and so on.
48
+
If you use other content sources, such as Azure Blob Storage, SharePoint, OneDrive, File System, FTP, and so on, you can check whether triggers are available for these sources. You can also check whether actions are available to retrieve the content for these sources. For more information, see [Built-in operations](/azure/logic-apps/connectors/built-in/reference/) and [Managed connectors](/connectors/connector-reference/connector-reference-logicapps-connectors).
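As a rough illustration of the scenario above, the following Python sketch builds the kind of HTTP request that another component might send to the **Request** trigger to announce a new document. The callback URL, the document URL, and the `documentUrl` property name are all hypothetical placeholders. Use the callback URL that your own trigger generates and whatever JSON schema you define for the trigger.

```python
import json
import urllib.request


def build_trigger_request(callback_url: str, document_url: str) -> urllib.request.Request:
    """Build, but don't send, a POST request that announces a new document."""
    # "documentUrl" is a hypothetical property name. Match it to the
    # JSON schema that you define for your Request trigger.
    body = json.dumps({"documentUrl": document_url}).encode("utf-8")
    return urllib.request.Request(
        callback_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder URLs for illustration only.
request = build_trigger_request(
    "https://example.invalid/workflow-callback-url",
    "https://example.invalid/documents/sample.pdf",
)

# To actually send the request, you could call:
# urllib.request.urlopen(request)
```

The sketch only constructs the request so that you can inspect it before sending anything. In the workflow, the **HTTP** action would then use the URL from the trigger output to retrieve the document.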
47
49
48
50
1. In the [Azure portal](https://portal.azure.com), open your Standard logic app resource and workflow in the designer.
49
51
50
-
1. Under the existing trigger and actions, [follow these general steps to add the **Data Operations** action named **Parse a document**](create-workflow-with-trigger-or-action.md#add-action).
52
+
1. Under the existing trigger and actions, [follow these general steps to add the **Data Operations** action named **Parse a document**](create-workflow-with-trigger-or-action.md#add-action) to your workflow.
53
+
54
+
1. On the designer, select the **Parse a document** action.
51
55
52
-
1. On the designer, select the **Parse a document** action. After the action information pane opens, on the **Parameters** tab, in the **Document Content** property, specify the content to parse by following these steps:
56
+
1. After the action information pane opens, on the **Parameters** tab, in the **Document Content** property, specify the content to parse by following these steps:
53
57
54
58
1. Select inside the **Document Content** box.
55
59
@@ -71,22 +75,43 @@ If you use other content sources, such as Azure Blob Storage, Office 365 Outlook
71
75
72
76
:::image type="content" source="media/parse-document-chunk-text/parse-document.png" alt-text="Screenshot shows sample workflow with Body output in the action named Parse a document." lightbox="media/parse-document-chunk-text/parse-document.png":::
73
77
74
-
1. Under the **Parse a document** action, add the actions that you want to work with the tokenized output string, for example, **Chunk text**.
78
+
1. Under the **Parse a document** action, add the actions that you want to work with the tokenized output string, for example, **Chunk text**, which this guide describes later.
79
+
80
+
## Parse a document - Reference
81
+
82
+
#### Parameters
83
+
84
+
| Name | Value | Data type | Description | Limit |
85
+
|------|-------|-----------|-------------|-------|
86
+
|**Document Content**| <*content-to-parse*> | Any | The content to parse. ||
87
+
88
+
#### Outputs
89
+
90
+
| Name | Data type | Description |
91
+
|------|-----------|-------------|
92
+
|**Parsed result text**| String ||
75
93
76
94
## Chunk text
77
95
78
-
This example builds on the preceding section by using the **Chunk text** operation to split the tokenized output string into pieces that subsequent actions in the workflow can more easily use.
96
+
The **Chunk text** action splits content into smaller pieces that subsequent actions in the current workflow can more easily use. The following steps build on the example from the **Parse a document** section and split the token string output for use with Azure AI operations that expect small, tokenized content chunks.
97
+
98
+
> [!NOTE]
99
+
>
100
+
> Preceding actions that use chunking don't affect the **Chunk text** action,
101
+
> nor does the **Chunk text** action affect subsequent actions that use chunking.
79
102
80
103
1. In the [Azure portal](https://portal.azure.com), open your Standard logic app resource and workflow in the designer.
81
104
82
105
1. Under the **Parse a document** action, [follow these general steps to add the **Data Operations** action named **Chunk text**](create-workflow-with-trigger-or-action.md#add-action).
83
106
84
-
1. On the designer, select the **Chunk text** action. After the action information pane opens, on the **Parameters** tab, for the **Chunking Strategy** property, select the strategy to use for chunking and provide the corresponding property values:
107
+
1. On the designer, select the **Chunk text** action.
108
+
109
+
1. After the action information pane opens, on the **Parameters** tab, for the **Chunking Strategy** property, select either **FixedLength** or **TokenSize** as the chunking method.
85
110
86
111
| Strategy | Description |
87
112
|----------|-------------|
88
-
|**FixedLength**| Split the specified content into pieces based on number of characters. <br><br>**Text**: The content to chunk. <br><br>**MaxPageLength**: The maximum number of characters per content chunk. <br><br>**PageOverlapLength** (optional): The number of characters to overlap in each chunk. The default value is **0**. <br><br>- **Language**: The language to use for the resulting chunks. |
89
-
|**TokenSize**| Split the specified content into pieces based on number of tokens. <br><br>**Text**: The content to chunk. <br><br>- **TokenSize**: The maximum number of tokens per content chunk. <br><br>- **Encoding model**: The encoding model to use. |
113
+
|**FixedLength**| Split the specified content, based on the number of characters. |
114
+
|**TokenSize**| Split the specified content, based on the number of tokens. |
90
115
91
116
1. After you select the strategy, select inside the **Text** box to specify the content for chunking.
92
117
@@ -108,18 +133,57 @@ This example builds on the preceding section by using the **Chunk text** operati
108
133
109
134
:::image type="content" source="media/parse-document-chunk-text/chunk-text.png" alt-text="Screenshot shows sample workflow with selected parsed result text output in the action named Chunk text." lightbox="media/parse-document-chunk-text/chunk-text.png":::
110
135
111
-
1. Complete the setup for the **Chunk text** action, based on your selected strategy.
136
+
1. Complete the setup for the **Chunk text** action, based on your selected strategy and scenario. For more information, see [Chunk text - Reference](#chunk-text---reference).
137
+
138
+
Now, when you add other actions that expect and use tokenized input, such as the Azure AI actions, the input content is formatted for easier consumption.
139
+
140
+
## Chunk text - Reference
141
+
142
+
#### Parameters
143
+
144
+
| Name | Value | Data type | Description | Limit |
145
+
|------|-------|-----------|-------------|-------|
146
+
|**Chunking Strategy**|**FixedLength** or **TokenSize**| String |**FixedLength**: Split the content, based on the number of characters. <br><br>**TokenSize**: Split the content, based on the number of tokens. <br><br>Default: **FixedLength**||
147
+
|**Text**| <*content-to-chunk*> | Any | The content to chunk. ||
148
+
149
+
For **Chunking Strategy** set to **FixedLength**:
150
+
151
+
| Name | Value | Data type | Description | Limit |
152
+
|------|-------|-----------|-------------|-------|
153
+
|**MaxPageLength**| <*max-char-per-chunk*> | Integer | The maximum number of characters per content chunk. <br><br>Default: **5000**||
154
+
|**PageOverlapLength**| <*number-of-overlapping-characters*> | Integer | The number of characters from the end of the previous chunk to include in the next chunk. This setting helps you avoid losing important information when splitting content into chunks and preserves continuity and context across chunks. <br><br>Default: **0** - No overlapping characters exist. ||
155
+
|**Language**| <*language*> | String | The [language](/azure/ai-services/language-service/language-detection/language-support) to use for the resulting chunks. <br><br>Default: **en-us**. | Not applicable |
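
To make the relationship between **MaxPageLength** and **PageOverlapLength** concrete, the following Python sketch shows one plausible reading of fixed-length chunking with overlap. It's an illustration only, not the action's actual implementation: the function name is hypothetical, the sketch ignores the **Language** parameter, and the real action might choose chunk boundaries differently.

```python
def chunk_fixed_length(
    text: str,
    max_page_length: int = 5000,
    page_overlap_length: int = 0,
) -> list[str]:
    """Split text into chunks of at most max_page_length characters.

    Each chunk after the first starts with the last page_overlap_length
    characters of the previous chunk.
    """
    if max_page_length <= 0:
        raise ValueError("max_page_length must be positive")
    if not 0 <= page_overlap_length < max_page_length:
        raise ValueError("page_overlap_length must be in [0, max_page_length)")

    # Each new chunk starts this many characters after the previous one.
    step = max_page_length - page_overlap_length
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_page_length])
        # Stop after the chunk that reaches the end of the text.
        if start + max_page_length >= len(text):
            break
        start += step
    return chunks


sample_chunks = chunk_fixed_length("abcdefghij", max_page_length=4, page_overlap_length=1)
print(sample_chunks)
```

With a 10-character input, a maximum length of 4, and an overlap of 1, each chunk repeats the final character of the chunk before it, which is how overlap preserves context across chunk boundaries.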
156
+
157
+
For **Chunking Strategy** set to **TokenSize**:
158
+
159
+
| Name | Value | Data type | Description | Limit |
160
+
|------|-------|-----------|-------------|-------|
161
+
|**TokenSize**| <*max-tokens-per-chunk*> | Integer | The maximum number of tokens per content chunk. <br><br>Default: None ||
162
+
|**Encoding model**| <*encoding-method*> | String | The encoding method to use. <br><br>Default: None | Not applicable |
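
The following Python sketch illustrates the general idea behind the **TokenSize** strategy: count tokens rather than characters, and start a new chunk when the limit is reached. As a simplifying assumption, the sketch treats each whitespace-separated word as one token. Real encoding models usually split text into subword tokens, so the chunk boundaries that the action produces will likely differ. The function name is hypothetical.

```python
def chunk_by_token_count(text: str, token_size: int) -> list[str]:
    """Split text into chunks that each contain at most token_size tokens.

    Assumption: a "token" here is a whitespace-separated word, which only
    approximates how an encoding model tokenizes text.
    """
    if token_size <= 0:
        raise ValueError("token_size must be positive")

    tokens = text.split()
    return [
        " ".join(tokens[i:i + token_size])
        for i in range(0, len(tokens), token_size)
    ]


token_chunks = chunk_by_token_count("one two three four five", token_size=2)
print(token_chunks)
```

Because token counts depend on the encoding, the same text can produce a different number of chunks under different encoding models, which is why the action asks for both values.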
163
+
164
+
> [!TIP]
165
+
>
166
+
> To learn more, you can ask Azure Copilot these questions:
167
+
>
168
+
> - *What is PageOverlapLength in chunking?*
169
+
> - *What is encoding in Azure AI?*
170
+
>
171
+
> To find Azure Copilot, on the [Azure portal](https://portal.azure.com) toolbar, select **Copilot**.
172
+
173
+
#### Outputs
174
+
175
+
112
176
113
-
Now, when you add Azure AI operations, the content is formatted for easier consumption.
177
+
## Example workflow
114
178
115
-
The following example includes other actions to create a complete workflow pattern to ingest data from any source:
179
+
The following example includes other actions that create a complete workflow pattern to ingest data from any source:
116
180
117
-
:::image type="content" source="media/parse-document-chunk-text/complete-example.png" alt-text="Screenshot shows sample workflow with selected parsed result text output in the action named Chunk text." lightbox="media/parse-document-chunk-text/complete-example.png":::
| 1 |Check for new data. |**When an HTTP request is received**| A trigger that either polls or waits for new data to arrive, either based on a scheduled recurrence or in response to specific events respectively. Such an event might be a new file that's uploaded to a specific storage system, such as SharePoint, OneDrive, or Azure Blob Storage. <br><br>In this example, the **Request** trigger operation waits for an HTTP or HTTPS request sent from another endpoint. The request includes the URL for a new uploaded document. |
122
-
| 2 | Get the data. |**HTTP**| An **HTTP** action that retrieves the uploaded document using the file URL from the trigger output. |
185
+
| 1 |Wait or check for new content. |**When an HTTP request is received**| A trigger that either polls or waits for new data to arrive, either based on a scheduled recurrence or in response to specific events respectively. Such an event might be a new file that's uploaded to a specific storage system, such as Azure Blob Storage, SharePoint, OneDrive, File System, FTP, and so on. <br><br>In this example, the **Request** trigger operation waits for an HTTP or HTTPS request sent from another endpoint. The request includes the URL for a new uploaded document. |
186
+
| 2 | Get the content. |**HTTP**| An **HTTP** action that retrieves the uploaded document using the file URL from the trigger output. |
123
187
| 3 | Compose document details. |**Compose**| A **Data Operations** action that concatenates various items. <br><br>This example concatenates key-value information about the document. |
124
188
| 4 | Create token string. |**Parse a document**| A **Data Operations** action that produces a tokenized string using the output from the **Compose** action. |
125
189
| 5 | Create content chunks. |**Chunk text**| A **Data Operations** action that splits the token string into pieces, based on either the number of characters or tokens per content chunk. |
@@ -132,3 +196,4 @@ The following example includes other actions to create a complete workflow patte
132
196
## Related content
133
197
134
198
[Integrate Azure AI services with Standard workflows in Azure Logic Apps](connectors/azure-ai.md)
199
+
[Chunking large documents for vector search](/azure/search/vector-search-how-to-chunk-documents)