platform/overview.mdx
+1 -1 (1 addition & 1 deletion)
@@ -35,7 +35,7 @@ To get your data RAG-ready, the Unstructured Platform moves it through the follo
Routing determines which strategy Unstructured Platform uses to transform your documents into Unstructured's canonical JSON schema. The Unstructured Platform provides these [partitioning](/platform/partitioning) strategies for document transformation:
- **Fast** is great for when there is extractable text available, like in HTML files or in the Microsoft Office Document format.
-
- **Hi-Res** is best for PDFs and tables and where accurate classification of document elements is critical.
+
- **HiRes** is best for PDFs and tables and where accurate classification of document elements is critical.
- If you're unsure which strategy to use, choose **Auto**, and the Unstructured Platform will handle the decision for you.
@@ -130,7 +130,7 @@ The following workflow settings can be customized:
1. For **Strategy**, choose one of the following:
- **Fast**: This strategy uses traditional NLP extraction techniques to quickly pull in all text elements. This strategy is not good for image-based file types. [Learn more](/platform/partitioning).
-
- **Hi-Res**: This strategy uses document layout to gain additional information about document elements. Unstructured recommends using this strategy if your use case is highly sensitive to correct classifications for document elements. [Learn more](/platform/partitioning).
+
- **HiRes**: This strategy uses document layout to gain additional information about document elements. Unstructured recommends using this strategy if your use case is highly sensitive to correct classifications for document elements. [Learn more](/platform/partitioning).
- **Auto**: This strategy chooses the partitioning strategy based on detected document characteristics. [Learn more](/platform/partitioning).
2. For **Image summarization**, choose one of the following:
@@ -152,7 +152,7 @@ The following workflow settings can be customized:
4. For **Connector Settings**, check one or more of the following boxes:
- **Include Page Breaks**: Include page breaks in the output, if the file type supports it.
-
- **Infer Table Structure**: If you also set **Strategy** to **Hi-Res**, any table elements extracted from a PDF will include an additional metadata field, `text_as_html`, that contains a transformation of the data into an HTML `<table>`.
+
- **Infer Table Structure**: If you also set **Strategy** to **HiRes**, any table elements extracted from a PDF will include an additional metadata field, `text_as_html`, that contains a transformation of the data into an HTML `<table>`.
5. For **Elements to Exclude**, select one or more standard Unstructured element types to not include in the output. [Learn more](/platform/document-elements).
</Accordion>
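The **Infer Table Structure** setting in this hunk adds a `text_as_html` metadata field to table elements. As a rough sketch of how a consumer might read that field, the Python below filters a list of elements for tables that carry it. The sample data and the `table_html` helper are hypothetical and exist only for illustration; the only names taken from the documentation text are the `metadata` and `text_as_html` fields.

```python
# Hypothetical sample of partitioned output; the values are illustrative only.
sample_elements = [
    {"type": "Title", "text": "Quarterly report", "metadata": {"page_number": 1}},
    {
        "type": "Table",
        "text": "Region Sales",
        "metadata": {
            "page_number": 1,
            "text_as_html": "<table><tr><td>Region</td><td>Sales</td></tr></table>",
        },
    },
]


def table_html(elements):
    """Return the text_as_html value of every Table element that has one."""
    return [
        el["metadata"]["text_as_html"]
        for el in elements
        if el.get("type") == "Table" and "text_as_html" in el.get("metadata", {})
    ]


html_tables = table_html(sample_elements)
```

Table elements produced without this setting, or with a different strategy, would simply be skipped by the filter because the field is absent.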
@@ -190,7 +190,7 @@ The following workflow settings can be customized:
- **Include Original Elements**: Check this box to output the elements that were used to form a chunk, to appear in the `metadata` field's `orig_elements` field for that chunk.
- **Max Characters** (_required_): Cut off new sections after reaching a length of this many characters. This is a strict limit.
-
- **Similarity Threshold** (_required_): Specify a threshold between 0 and 1, where 0 indicates completely dissimilar vectors and 1 indicates identical vectors, taking into consideration the trade-offs between precision (a higher threshold) and recall (a lower threshold). [Learn more](https://towardsdatascience.com/introduction-to-embedding-clustering-and-similarity-11dd80b00061).
+
- **Similarity Threshold** (_required_): Specify a threshold between 0 and 1 exclusive (0.01 to 0.99 inclusive), where 0 indicates completely dissimilar vectors and 1 indicates identical vectors, taking into consideration the trade-offs between precision (a higher threshold) and recall (a lower threshold). [Learn more](https://towardsdatascience.com/introduction-to-embedding-clustering-and-similarity-11dd80b00061).
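To make the revised range concrete, here is a minimal sketch of how a threshold in the 0.01 to 0.99 range might be validated and applied. It assumes cosine similarity as the metric and assumes that two sections are combined when their similarity meets or exceeds the threshold; the documentation text states neither, and all function names here are hypothetical.

```python
import math


def validate_threshold(value: float) -> float:
    """Accept only thresholds in the documented range, 0.01 to 0.99 inclusive."""
    if not 0.01 <= value <= 0.99:
        raise ValueError(f"threshold must be between 0.01 and 0.99, got {value}")
    return value


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length, non-zero vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def should_combine(a, b, threshold):
    """Assumed rule: combine two sections when similarity meets the threshold."""
    return cosine_similarity(a, b) >= validate_threshold(threshold)
```

Under these assumptions, a higher threshold combines fewer sections (favoring precision) and a lower threshold combines more (favoring recall), which matches the trade-off described in the setting.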
snippets/quickstarts/platform.mdx
+2 -2 (2 additions & 2 deletions)
@@ -47,8 +47,8 @@ You will need:
6. In the **Workflow Settings** section, choose one of these predefined workflow settings groups:
-
- **Basic** is a good choice if you have documents that have no images or tables in them.
-
- **Advanced** is a good choice if you have documents that have images or tables or both in them.
+
- **Basic** is a good choice if you have text-only documents that have no images or tables in them.
+
- **Advanced** is a good choice if you have complex documents that have images or tables or both in them.
Learn about the predefined settings for [Basic](/platform/workflows#basic-workflow-settings) and [Advanced](/platform/workflows#advanced-workflow-settings).