Commit 8f0c78d

How to use the Ingest Python code generator (#278)
Co-authored-by: Maria Khalusova <[email protected]>
1 parent 86a3a33 commit 8f0c78d

4 files changed: 54 additions, 1 deletion

api-reference/ingest/overview.mdx

Lines changed: 6 additions & 0 deletions
@@ -106,6 +106,12 @@ An Unstructured ingest pipeline contains the following logical steps:
 </Step>
 </Steps>
 
+## Generate Python code examples
+
+import GeneratePythonCodeExamples from '/snippets/ingestion/code-generator.mdx';
+
+<GeneratePythonCodeExamples />
+
 ## Learn more
 
 - [Ingest configuration](/api-reference/ingest/ingest-configuration/overview) settings enable you to control how batches are sent and processed.

ingestion/overview.mdx

Lines changed: 6 additions & 0 deletions
@@ -183,6 +183,12 @@ To begin using the Unstructured Ingest Python library, see the code examples for
 
 <Info>To migrate from older, deprecated versions of the Ingest Python library that used `pip install unstructured`, see the [migration guide](#migration-guide).</Info>
 
+### Generate Python code examples
+
+import GeneratePythonCodeExamples from '/snippets/ingestion/code-generator.mdx';
+
+<GeneratePythonCodeExamples />
+
 ## Migration guide
 
 import MigrationGuideSteps from '/snippets/general-shared-text/ingest-migration.mdx';

open-source/ingest/overview.mdx

Lines changed: 7 additions & 1 deletion
@@ -90,4 +90,10 @@ To install the Unstructured Ingest CLI and the Unstructured Ingest Python librar
 
 ## Configuration
 
-The Unstructured Python Ingest library requires configuration to define data sources, ingestion processes, and destination targets. For the CLI, configuration is done through the various cli parameters supported. When the library is run in python, those parameters that are exposed in the CLI map to python config classes, which are described in more detail in the configs section.
+The Unstructured Python Ingest library requires configuration to define data sources, ingestion processes, and destination targets. For the CLI, configuration is done through the various cli parameters supported. When the library is run in python, those parameters that are exposed in the CLI map to python config classes, which are described in more detail in the configs section.
+
+## Generate Python code examples
+
+import GeneratePythonCodeExamples from '/snippets/ingestion/code-generator.mdx';
+
+<GeneratePythonCodeExamples />

snippets/ingestion/code-generator.mdx

Lines changed: 35 additions & 0 deletions
@@ -0,0 +1,35 @@
+You can connect any available source connector to any available destination connector. However, the source connector code examples in the
+documentation show connecting only to the local destination connector. Similarly, the destination connector code examples in the
+documentation show connecting only to the local source connector.
+
+To quickly generate an Unstructured Ingest Python library code example that connects _any_ available source connector to _any_ available destination connector,
+do the following:
+
+1. Open the [Unstructured Ingest Code Generator](https://huggingface.co/spaces/MariaK/unstructured-pipeline-builder) webpage.
+2. Select your input (source) location type from the **Get unstructured documents from** drop-down list.
+3. Select your output (destination) location type from the **Upload RAG-ready documents to** drop-down list.
+4. Select your chunking strategy from the **Chunking strategy** drop-down list:
+
+- **None** - Do not chunk the data elements' content.
+- **basic** - Combine sequential data elements to maximally fill each chunk. However, do not mix `Table` and non-`Table` elements in the same chunk.
+- **by_title** - Use the `basic` strategy and also preserve section boundaries. Optionally preserve page boundaries as well.
+- **by_page** - Use the `basic` strategy and also preserve page boundaries.
+- **by_similarity** - Use the `sentence-transformers/multi-qa-mpnet-base-dot-v1` embedding model to identify topically similar sequential elements and combine them into chunks. This strategy is available only when calling Unstructured API services.
+
+To learn more, see [Chunking strategies](/api-reference/api-services/chunking) and [Chunking configuration](/api-reference/ingest/ingest-configuration/chunking-configuration).
+
+5. For any chunking strategy other than **None**:
+
+- Enter your chunk size in the **Chunk size (characters)** box, or leave the default of **1000** characters.
+- If the chunks need to overlap, enter the chunk overlap size in the **Chunk overlap (characters)** box, or leave the default of **20** characters.
+
+To learn more, see [Chunking configuration](/api-reference/ingest/ingest-configuration/chunking-configuration).
+
+6. To generate vector embeddings, select the provider in the **Embedding provider** drop-down list.
+
+To learn more, see [Embedding configuration](/api-reference/ingest/ingest-configuration/embedding-configuration).
+
+7. Click **Generate code**.
+8. Copy the example code from the **Generated Code** pane into your code project.
+9. The code example will contain one or more environment variables that you must set for the code to run correctly. To learn what to
+set these variables to, click the documentation links that are below the **Generated Code** pane.
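
The final step above depends on environment variables being set before the generated code runs. As a minimal sketch (not part of the commit or of the Unstructured Ingest library), a small preflight check could report any unset variables up front. The helper names `missing_env_vars` and `require_env_vars` and the variable names in `REQUIRED` are hypothetical, chosen only for illustration; substitute the names that appear in your generated code.

```python
import os
from typing import Iterable, List, Mapping, Optional


def missing_env_vars(
    names: Iterable[str], environ: Optional[Mapping[str, str]] = None
) -> List[str]:
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


def require_env_vars(
    names: Iterable[str], environ: Optional[Mapping[str, str]] = None
) -> None:
    """Raise RuntimeError listing every variable that still needs a value."""
    missing = missing_env_vars(names, environ)
    if missing:
        raise RuntimeError(
            "Set these environment variables first: " + ", ".join(missing)
        )


# Hypothetical names for illustration only; use the ones in your generated code.
REQUIRED = ["UNSTRUCTURED_API_KEY", "UNSTRUCTURED_API_URL"]
```

Calling `require_env_vars(REQUIRED)` at the top of the script, before the generated pipeline code, would fail early with one message that lists every missing variable, rather than failing partway through a run.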
