articles/logic-apps/parse-document-chunk-text.md
> This capability is in preview and is subject to the
> [Supplemental Terms of Use for Microsoft Azure Previews](https://azure.microsoft.com/support/legal/preview-supplemental-terms/).
Sometimes you have to convert content into token form or break down a large document into smaller pieces before you can use this content with some actions. For example, actions such as **Azure AI Search** or **Azure OpenAI** expect tokenized input and can handle only a limited number of tokens, which are words or chunks of characters.
For these scenarios, use the **Data Operations** actions named **Parse a document** and **Chunk text** in your Standard logic app workflow. These actions respectively convert content, such as a PDF document, CSV file, Excel file, and so on, into tokenized string output and then split the string into pieces, based on the number of tokens or characters. You can then reference and use these outputs with subsequent actions in your workflow.
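To see why token limits matter, here is a minimal sketch of counting "tokens" in a document. It uses a naive whitespace tokenizer purely as a hypothetical stand-in; real services tokenize with BPE encodings such as **cl100k_base**, which split text into subword pieces, so actual counts differ.

```python
def count_tokens_naive(text: str) -> int:
    # Hypothetical stand-in tokenizer: treats each whitespace-separated
    # word as one token. Real encodings (for example, cl100k_base)
    # produce subword tokens, so real counts will be different.
    return len(text.split())

doc = "Azure Logic Apps can parse a document and chunk its text."
print(count_tokens_naive(doc))  # 11
```

A downstream action with a token limit would reject or truncate input whose count exceeds that limit, which is what the **Chunk text** action helps you avoid.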
| Name | Value | Data type | Description | Limit |
|------|-------|-----------|-------------|-------|
|**Chunking Strategy**|**FixedLength** or **TokenSize**| String enum |**FixedLength**: Split the content, based on the number of characters. <br><br>**TokenSize**: Split the content, based on the number of tokens. <br><br>Default: **FixedLength**||
|**Text**| <*content-to-chunk*> | Any | The content to chunk. | See [Limits and configuration reference guide](logic-apps-limits-and-config.md#character-limits) |
For **Chunking Strategy** set to **FixedLength**:
| Name | Value | Data type | Description | Limit |
|------|-------|-----------|-------------|-------|
|**MaxPageLength**| <*max-char-per-chunk*> | Integer | The maximum number of characters per content chunk. <br><br>Default: **5000**| Minimum: **1** |
|**PageOverlapLength**| <*number-of-overlapping-characters*> | Integer | The number of characters from the end of the previous chunk to include in the next chunk. This setting helps you avoid losing important information when splitting content into chunks and preserves continuity and context across chunks. <br><br>Default: **0** - No overlapping characters exist. | Minimum: **0** |
|**Language**| <*language*> | String | The [language](/azure/ai-services/language-service/language-detection/language-support) to use for the resulting chunks. <br><br>Default: **en-us**| Not applicable |
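The **FixedLength** parameters above can be illustrated with a short sketch. This is a hypothetical approximation of the strategy, not the service's implementation: each chunk holds at most **MaxPageLength** characters, and each subsequent chunk repeats the last **PageOverlapLength** characters of the previous chunk.

```python
def chunk_fixed_length(text: str, max_page_length: int = 5000,
                       page_overlap_length: int = 0) -> list[str]:
    # Hypothetical sketch of the FixedLength strategy: split by
    # character count, carrying page_overlap_length characters
    # forward to preserve context across chunk boundaries.
    if max_page_length < 1 or page_overlap_length < 0:
        raise ValueError("MaxPageLength >= 1 and PageOverlapLength >= 0")
    step = max_page_length - page_overlap_length
    if step <= 0:
        raise ValueError("Overlap must be smaller than the chunk size")
    return [text[i:i + max_page_length] for i in range(0, len(text), step)]

print(chunk_fixed_length("abcdefghij", max_page_length=4, page_overlap_length=1))
# ['abcd', 'defg', 'ghij', 'j']
```

Notice how each chunk starts with the last character of the previous chunk, which is the continuity that a nonzero **PageOverlapLength** buys you.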
For **Chunking Strategy** set to **TokenSize**:
| Name | Value | Data type | Description | Limit |
|------|-------|-----------|-------------|-------|
|**TokenSize**| <*max-tokens-per-chunk*> | Integer | The maximum number of tokens per content chunk. <br><br>Default: None | - Minimum: **1** <br><br>- Maximum: **8000** |
|**Encoding model**| <*encoding-method*> | String enum | The [encoding method]() to use: **cl100k_base**, **cl200k_base**, **p50k_base**, **p50k_edit**, **r50k_base** <br><br>Default: None | Not applicable |
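The **TokenSize** strategy can be sketched the same way. This hypothetical version uses whitespace "tokens" as a stand-in for a real encoding model such as **cl100k_base**, and enforces the 1-8000 limit from the table above.

```python
def chunk_by_token_size(text: str, token_size: int) -> list[str]:
    # Hypothetical sketch of the TokenSize strategy: whitespace words
    # stand in for real BPE tokens (for example, cl100k_base).
    if not 1 <= token_size <= 8000:  # limits from the TokenSize row above
        raise ValueError("TokenSize must be between 1 and 8000")
    tokens = text.split()
    return [" ".join(tokens[i:i + token_size])
            for i in range(0, len(tokens), token_size)]

print(chunk_by_token_size("one two three four five", 2))
# ['one two', 'three four', 'five']
```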
> [!TIP]
>
#### Outputs
| Name | Data type | Description |
|------|-----------|-------------|
|**Chunked result Text items**| String array | An array of strings. |
|**Chunked result Text items Item**| String | A single string in the array. |
|**Chunked result**| Object | An object that contains the entire chunked text. |
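To make the relationship between these outputs concrete, here is a hypothetical illustration of their shape. The structure and names below follow the table; the exact JSON payload the action emits may differ.

```python
# Hypothetical shape of the Chunk text outputs: "Chunked result" is an
# object whose "Text items" property is the array of chunk strings,
# and each array element is one "Text items Item".
chunk_text_outputs = {
    "Chunked result": {
        "Text items": [
            "First chunk of the parsed document...",
            "Second chunk, continuing where the first left off...",
        ]
    }
}

# A subsequent action (for example, a For each loop) can iterate the array:
for item in chunk_text_outputs["Chunked result"]["Text items"]:
    print(item)
```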