Platform UI: latest list of available connectors, workflow settings (#255)

Paul-Cornell · web-flow · commit b3c60da998bf · 2024-09-25T16:56:58.000-07:00
diff --git a/mint.json b/mint.json
@@ -432,29 +432,16 @@
               "pages": [
                 "platform/sources/overview",
                 "platform/sources/azure-blob-storage",
-                "platform/sources/elasticsearch",
-                "platform/sources/google-drive",
-                "platform/sources/onedrive-cloud-storage",
-                "platform/sources/opensearch",
-                "platform/sources/s3",
-                "platform/sources/salesforce",
-                "platform/sources/sftp-storage",
-                "platform/sources/sharepoint"
+                "platform/sources/s3"
               ]
           },
           {
             "group": "Destinations",
             "pages": [
               "platform/destinations/overview",
               "platform/destinations/azure-cognitive-search",
-              "platform/destinations/chroma",
-              "platform/destinations/databricks",
-              "platform/destinations/elasticsearch",
-              "platform/destinations/mongodb",
-              "platform/destinations/opensearch",
               "platform/destinations/pinecone",
-              "platform/destinations/s3",
-              "platform/destinations/weaviate"
+              "platform/destinations/s3"
             ]
           },
           "platform/workflows",
diff --git a/platform/connectors.mdx b/platform/connectors.mdx
@@ -12,14 +12,7 @@ The Unstructured Platform supports connecting to the following source and destin
 ## Sources
 
 - [Azure](/platform/sources/azure-blob-storage)
-- [Elasticsearch](/platform/sources/elasticsearch)
-- [Google Drive](/platform/sources/google-drive)
-- [OneDrive](/platform/sources/onedrive-cloud-storage)
-- [OpenSearch](/platform/sources/opensearch)
 - [S3](/platform/sources/s3)
-- [Salesforce](/platform/sources/salesforce)
-- [SFTP](/platform/sources/sftp-storage)
-- [SharePoint](/platform/sources/sharepoint)
 
 If your source is not listed here, you might still be able to connect Unstructured to it through scripts or code by using the 
 [Unstructured Ingest CLI](/ingestion/overview#unstructured-ingest-cli) or the 
@@ -29,14 +22,8 @@ If your source is not listed here, you might still be able to connect Unstructur
 ## Destinations
 
 - [Azure Cognitive Search](/platform/destinations/azure-cognitive-search)
-- [Chroma](/platform/destinations/chroma)
-- [Databricks Volumes](/platform/destinations/databricks)
-- [Elasticsearch](/platform/destinations/elasticsearch)
-- [MongoDB](/platform/destinations/mongodb)
-- [OpenSearch](/platform/destinations/opensearch)
 - [Pinecone](/platform/destinations/pinecone)
 - [S3](/platform/destinations/s3)
-- [Weaviate](/platform/destinations/weaviate)
 
 If your destination is not listed here, you might still be able to connect Unstructured to it through scripts or code by using the 
 [Unstructured Ingest CLI](/ingestion/overview#unstructured-ingest-cli) or the 
diff --git a/platform/destinations/overview.mdx b/platform/destinations/overview.mdx
@@ -15,14 +15,8 @@ To create a destination connector:
 4. Fill in the fields according to your connector type. To learn how, click your connector type in the following list:
 
    - [Azure Cognitive Search](/platform/destinations/azure-cognitive-search)
-   - [Chroma](/platform/destinations/chroma)
-   - [Databricks Volumes](/platform/destinations/databricks)
-   - [Elasticsearch](/platform/destinations/elasticsearch)
-   - [MongoDB](/platform/destinations/mongodb)
-   - [OpenSearch](/platform/destinations/opensearch)
    - [Pinecone](/platform/destinations/pinecone)
    - [S3](/platform/destinations/s3)
-   - [Weaviate](/platform/destinations/weaviate)
 
 5. Click **Save and Test**.
 6. Click **Close**.
diff --git a/platform/overview.mdx b/platform/overview.mdx
@@ -35,7 +35,7 @@ To get your data RAG-ready, the Unstructured Platform moves it through the follo
     Routing determines which strategy the Unstructured Platform uses to transform your documents into Unstructured's canonical JSON schema. The Unstructured Platform provides these [partitioning](/platform/partitioning) strategies for document transformation:
     
     - **Fast** is great for when there is extractable text available, like in HTML files or in the Microsoft Office Document format.
-    - **Hi-Res** is best for PDFs and tables and where accurate classification of document elements is critical.
+    - **Hi Res** is best for PDFs and tables and where accurate classification of document elements is critical.
     - If you're unsure which strategy to use, choose **Auto**, and the Unstructured Platform will handle the decision for you.
 
   </Step>
diff --git a/platform/partitioning.mdx b/platform/partitioning.mdx
@@ -25,7 +25,7 @@ To choose one of these strategies, select one of the **Strategy** options in the
   - You have only PDF files, and you know that none of them have embedded images or tables in them, or 
   - You have no PDF files or image files at all.
 
-- **Hi-Res**: This strategy uses an image-to-text model for inference. It is slower and costlier than **Fast** but can provide 
+- **Hi Res**: This strategy uses an image-to-text model for inference. It is slower and costlier than **Fast** but can provide 
   higher-quality resolution. You should choose this strategy if you know that:
 
   - All of the files are only image files, or
diff --git a/platform/sources/overview.mdx b/platform/sources/overview.mdx
@@ -16,14 +16,7 @@ To create a source connector:
 4. Fill in the fields according to your connector type. To learn how, click your connector type in the following list:
 
    - [Azure](/platform/sources/azure-blob-storage)
-   - [Elasticsearch](/platform/sources/elasticsearch)
-   - [Google Drive](/platform/sources/google-drive)
-   - [OneDrive](/platform/sources/onedrive-cloud-storage)
-   - [OpenSearch](/platform/sources/opensearch)
    - [S3](/platform/sources/s3)
-   - [Salesforce](/platform/sources/salesforce)
-   - [SFTP](/platform/sources/sftp-storage)
-   - [SharePoint](/platform/sources/sharepoint)
 
 5. Click **Save and Test**.
 6. Click **Close**.
diff --git a/platform/workflows.mdx b/platform/workflows.mdx
@@ -14,8 +14,6 @@ Workflows are crucial for establishing a systematic approach to managing data fl
 
 ## Create a workflow
 
-![Create workflow](/img/platform/Create-Workflow.png)
-
 <Warning>
     You must first have an existing source connector and destination connector to add to the workflow.
 
@@ -36,8 +34,8 @@ To create a workflow:
 
 6. In the **Workflow Settings** section, choose one of these predefined workflow settings groups:
 
-   - **Basic** is a good choice if you have documents that have no images or tables in them.
-   - **Advanced** is a good choice if you have documents that have images or tables or both in them.
+   - **Basic** is a good choice if you have text-only documents that have no images or tables in them.
+   - **Advanced** is a good choice if you have complex documents that have images or tables or both in them.
            
    Learn about the predefined settings for [Basic](#basic-workflow-settings) and [Advanced](#advanced-workflow-settings).
 
@@ -63,26 +61,28 @@ To learn more about these settings, see the descriptions for [Custom workflow se
 - **Transform** section:
 
   - [Strategy](/platform/partitioning): **Fast**
-  - [Image Summarization](/platform/summarizing): **None**
-  - [Table Summarization](/platform/summarizing): **None**
+  - [Image summarization](/platform/summarizing): **None**
+  - [Table summarization](/platform/summarizing): **None**
   - **Connector Settings**:
 
     - **Include Page Breaks**: No (unchecked)
-    - **Infer Table Structure**: No (unchecked)
+    - **Infer Table Structure**: Yes (checked)
+
+  - **Elements to Exclude**: None (nothing selected)
 
 - [Chunk](/platform/chunking) section: 
 
   - **Chunker Type**: **Basic**
   - **Include Original Elements**: No (unchecked)
-  - **Max Characters**: **500**
-  - **New After N Characters**: **1000**
-  - **Overlap**: **100**
+  - **Max Characters**: **2048**
+  - **New After N Characters**: **1500**
+  - **Overlap**: **160**
   - **Overlap All**: No (unchecked)
 
 - [Embed](/platform/embedding) section:
 
   - **Vendor**: **OpenAI**
-  - **Embedding Model**: [text-embedding-3-small](https://platform.openai.com/docs/guides/embeddings)
+  - **Embedding Model**: [text-embedding-3-small](https://platform.openai.com/docs/guides/embeddings) (1536 dimensions)
 
 ## Advanced workflow settings
 
@@ -92,31 +92,31 @@ To learn more about these settings, see the descriptions for [Custom workflow se
 
 - **Transform** section:
 
-  - [Strategy](/platform/partitioning): **Hi-Res**
-  - [Image Summarization](/platform/summarizing): **Claude 3.5 Sonnet**
-  - [Table Summarization](/platform/summarizing): **GPT-4o**
+  - [Strategy](/platform/partitioning): **Hi Res**
+  - [Image summarization](/platform/summarizing): **Claude 3.5 Sonnet**
+  - [Table summarization](/platform/summarizing): **GPT-4o**
   - **Connector Settings**:
 
-    - **Include Page Breaks**: Yes (checked)
-    - **Infer Table Structure**: Yes (checked)
+    - **Include Page Breaks**: No (unchecked)
+    - **Infer Table Structure**: No (unchecked)
 
-  - **Elements to Exclude**: None
+  - **Elements to Exclude**: None (nothing selected)
 
 - [Chunk](/platform/chunking) section:
 
-  - **Chunker Type**: **By Title**
-  - **Combine Text Under N Characters**: **1000**
-  - **Include Original Elements**: Yes (checked)
-  - **Max Characters**: **500**
+  - **Chunker Type**: **Chunk By Title**
+  - **Combine Text Under N Characters**: **0**
+  - **Include Original Elements**: No (unchecked)
+  - **Max Characters**: **2048**
   - **Multipage Sections**: Yes (checked)
-  - **New After N Characters**: **1000**
-  - **Overlap**: **100**
-  - **Overlap All**: Yes (checked)
+  - **New After N Characters**: **1500**
+  - **Overlap**: **160**
+  - **Overlap All**: No (unchecked)
 
 - [Embed](/platform/embedding) section:
 
   - **Vendor**: **OpenAI**
-  - **Embedding Model**: [text-embedding-3-large](https://platform.openai.com/docs/guides/embeddings)
+  - **Embedding Model**: [text-embedding-3-large](https://platform.openai.com/docs/guides/embeddings) (3072 dimensions)
 
 ## Custom workflow settings
 
@@ -130,7 +130,7 @@ The following workflow settings can be customized:
         1. For **Strategy**, choose one of the following:
 
            - **Fast**: This strategy uses traditional NLP extraction techniques to quickly pull in all text elements. This strategy is not good for image-based file types. [Learn more](/platform/partitioning).
-           - **Hi-Res**: This strategy uses document layout to gain additional information about document elements. Unstructured recommends using this strategy if your use case is highly sensitive to correct classifications for document elements. [Learn more](/platform/partitioning).
+           - **Hi Res**: This strategy uses document layout to gain additional information about document elements. Unstructured recommends using this strategy if your use case is highly sensitive to correct classifications for document elements. [Learn more](/platform/partitioning).
            - **Auto**: This strategy chooses the partitioning strategy based on detected document characteristics. [Learn more](/platform/partitioning).
 
         2. For **Image summarization**, choose one of the following:
@@ -152,7 +152,7 @@ The following workflow settings can be customized:
         4. For **Connector Settings**, check one or more of the following boxes:
 
            - **Include Page Breaks**: Include page breaks in the output, if the file type supports it.
-           - **Infer Table Structure**: If you also set **Strategy** to **Hi-Res**, any table elements extracted from a PDF will include an additional metadata field, `text_as_html`, that contains a transformation of the data into an HTML `<table>`.
+           - **Infer Table Structure**: If you also set **Strategy** to **Hi Res**, any table elements extracted from a PDF will include an additional metadata field, `text_as_html`, that contains a transformation of the data into an HTML `<table>`.
 
         5. For **Elements to Exclude**, select one or more standard Unstructured element types to not include in the output. [Learn more](/platform/document-elements).
     </Accordion>
@@ -190,7 +190,7 @@ The following workflow settings can be customized:
 
           - **Include Original Elements**: Check this box to output the elements that were used to form a chunk, to appear in the `metadata` field's `orig_elements` field for that chunk.
           - **Max Characters** (_required_): Cut off new sections after reaching a length of this many characters. This is a strict limit.
-          - **Similarity Threshold** (_required_): Specify a threshold between 0 and 1, where 0 indicates completely dissimilar vectors and 1 indicates identical vectors, taking into consider the trade-offs between precision (a higher threshold) and recall (a lower threshold). [Learn more](https://towardsdatascience.com/introduction-to-embedding-clustering-and-similarity-11dd80b00061).
+          - **Similarity Threshold** (_required_): Specify a threshold between 0 and 1 exclusive (0.01 to 0.99 inclusive), where 0 indicates completely dissimilar vectors and 1 indicates identical vectors, taking into consideration the trade-offs between precision (a higher threshold) and recall (a lower threshold). [Learn more](https://towardsdatascience.com/introduction-to-embedding-clustering-and-similarity-11dd80b00061).
 
         [Learn more](https://unstructured.io/blog/chunking-for-rag-best-practices).
     </Accordion>
@@ -200,33 +200,26 @@ The following workflow settings can be customized:
         - **Off**: Do not generate embeddings.
         - **OpenAI**: Use OpenAI to generate embeddings. Also choose the embedding model to use, from one of the following:
 
-          - **text-embedding-3-small**: [Learn more](https://platform.openai.com/docs/guides/embeddings).
-          - **text-embedding-3-large**: [Learn more](https://platform.openai.com/docs/guides/embeddings).
-          - **Ada 002 (Text)**: [Learn more](https://platform.openai.com/docs/guides/embeddings).
+          - **text-embedding-3-small** (1536 dimensions): [Learn more](https://platform.openai.com/docs/guides/embeddings).
+          - **text-embedding-3-large** (3072 dimensions): [Learn more](https://platform.openai.com/docs/guides/embeddings).
+          - **Ada 002 (Text)** (1536 dimensions): [Learn more](https://platform.openai.com/docs/guides/embeddings).
 
-        - **Anthropic**: Use Anthropic to generate embeddings. Also choose the embedding model to use, from one of the following:
-
-          - **voyage-2**: [Learn more](https://docs.anthropic.com/en/docs/build-with-claude/embeddings#available-voyage-models).
-          - **voyage-large-2**: [Learn more](https://docs.anthropic.com/en/docs/build-with-claude/embeddings#available-voyage-models).
-          - **voyage-code-2**: [Learn more](https://docs.anthropic.com/en/docs/build-with-claude/embeddings#available-voyage-models).
-          - **voyage-lite-02-instruct**: [Learn more](https://docs.anthropic.com/en/docs/build-with-claude/embeddings#available-voyage-models).
-        
         - **Hugging Face**: Use Hugging Face to generate embeddings. Also choose the embedding model to use, from one of the following:
 
-          - **nvidia/NV-Embed-v1**: [Learn more](https://huggingface.co/nvidia/NV-Embed-v1).
-          - **voyage-large-2-instruct**: [Learn more](https://huggingface.co/voyageai/voyage-large-2-instruct).
-          - **stella_en_400M_v5**: [Learn more](https://huggingface.co/dunzhang/stella_en_400M_v5).
-          - **stella_en_1.5B_v5**: [Learn more](https://huggingface.co/dunzhang/stella_en_1.5B_v5).
-          - **Alibaba-NLP/gte-Qwen2-7B-instruct**: [Learn more](https://huggingface.co/Alibaba-NLP/gte-Qwen2-7B-instruct).
+          - **nvidia/NV-Embed-v1** (4096 dimensions): [Learn more](https://huggingface.co/nvidia/NV-Embed-v1).
+          - **voyage-large-2-instruct** (1024 dimensions): [Learn more](https://huggingface.co/voyageai/voyage-large-2-instruct).
+          - **stella_en_400M_v5** (1024 dimensions): [Learn more](https://huggingface.co/dunzhang/stella_en_400M_v5).
+          - **stella_en_1.5B_v5** (1024 dimensions): [Learn more](https://huggingface.co/dunzhang/stella_en_1.5B_v5).
+          - **Alibaba-NLP/gte-Qwen2-7B-instruct** (3584 dimensions): [Learn more](https://huggingface.co/Alibaba-NLP/gte-Qwen2-7B-instruct).
 
         - **OctoAI**: Use OctoAI to generate embeddings. Also choose the embedding model to use, from one of the following:
 
-          - **GTE Large**: [Learn more](https://octo.ai/blog/introducing-octoais-embedding-api-to-power-your-rag-needs/).
+          - **GTE Large** (1024 dimensions): [Learn more](https://octo.ai/blog/introducing-octoais-embedding-api-to-power-your-rag-needs/).
 
         - **Vertex AI**: Use Vertex AI to generate embeddings. Also choose the embedding model to use, from one of the following:
 
-          - **textembedding-gecko@003**: [Learn more](https://cloud.google.com/vertex-ai/generative-ai/docs/learn/model-versions#embeddings_stable_model_versions).
-          - **text-embedding-004**: [Learn more](https://cloud.google.com/vertex-ai/generative-ai/docs/learn/model-versions#embeddings_stable_model_versions).
+          - **textembedding-gecko@003** (768 dimensions): [Learn more](https://cloud.google.com/vertex-ai/generative-ai/docs/learn/model-versions#embeddings_stable_model_versions).
+          - **text-embedding-004** (768 dimensions): [Learn more](https://cloud.google.com/vertex-ai/generative-ai/docs/learn/model-versions#embeddings_stable_model_versions).
 
         Learn more:
         
diff --git a/snippets/quickstarts/platform.mdx b/snippets/quickstarts/platform.mdx
@@ -47,8 +47,8 @@ You will need:
 
         6. In the **Workflow Settings** section, choose one of these predefined workflow settings groups:
 
-           - **Basic** is a good choice if you have documents that have no images or tables in them.
-           - **Advanced** is a good choice if you have documents that have images or tables or both in them.
+           - **Basic** is a good choice if you have text-only documents that have no images or tables in them.
+           - **Advanced** is a good choice if you have complex documents that have images or tables or both in them.
            
            Learn about the predefined settings for [Basic](/platform/workflows#basic-workflow-settings) and [Advanced](/platform/workflows#advanced-workflow-settings).
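
Reviewer note: the two numeric constraints this change documents (the **Similarity Threshold** range and the per-model embedding dimensions) can be sketched as a small lookup and check. This is a minimal illustration only, not Platform code. The dimension values are copied from the lines added in `platform/workflows.mdx`; the names `EMBEDDING_DIMENSIONS`, `is_valid_similarity_threshold`, and `index_matches_model` are hypothetical, and the idea that a destination vector index must match the model's dimension count is an assumption about typical vector stores rather than something this diff states.

```python
# Dimensions as listed in this change; the dictionary itself is illustrative.
EMBEDDING_DIMENSIONS = {
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
    "Ada 002 (Text)": 1536,
    "nvidia/NV-Embed-v1": 4096,
    "voyage-large-2-instruct": 1024,
    "stella_en_400M_v5": 1024,
    "stella_en_1.5B_v5": 1024,
    "Alibaba-NLP/gte-Qwen2-7B-instruct": 3584,
    "GTE Large": 1024,
    "textembedding-gecko@003": 768,
    "text-embedding-004": 768,
}


def is_valid_similarity_threshold(value: float) -> bool:
    """Return True if value is in the documented range, 0.01 to 0.99 inclusive."""
    return 0.01 <= value <= 0.99


def index_matches_model(model: str, index_dimension: int) -> bool:
    """Hypothetical check: does a vector index's dimension match the model's?

    Returns False for models that are not in the table above.
    """
    return EMBEDDING_DIMENSIONS.get(model) == index_dimension
```

For example, under these assumptions an index created with 1536 dimensions would pass the check for **text-embedding-3-small** (the Basic default) but not for **text-embedding-3-large** (the Advanced default).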