Commit b6a65c3

Move Ingest-related content from Partition Endpoint docs over into Ingest docs, retrofit remaining POST/Python examples to use VLM (#498)
1 parent 75824af commit b6a65c3

26 files changed: +765 -657 lines changed
File renamed without changes.

ingestion/how-to/examples.mdx

Lines changed: 318 additions & 0 deletions
@@ -0,0 +1,318 @@
---
title: Examples
description: This page provides some examples of accessing Unstructured by using the Unstructured Ingest CLI and the Unstructured Ingest Python library.
---

These examples assume that you have already followed the instructions to set up the
[Unstructured Ingest CLI](/ingestion/ingest-cli) and the [Unstructured Ingest Python library](/ingestion/python-ingest).

### Changing partition strategy for a PDF

Here's how you can change the partition strategy for a PDF file and select an alternative model to use with the Unstructured API.
The `hi_res` strategy supports different models, and the default is `layout_v1.1.0`.

<iframe
  width="560"
  height="315"
  src="https://www.youtube.com/embed/SwJVB_kPqTc"
  title="YouTube video player"
  frameborder="0"
  allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture"
  allowfullscreen
></iframe>

<AccordionGroup>
<Accordion title="Ingest CLI">
```bash CLI
unstructured-ingest \
  local \
    --input-path $LOCAL_FILE_INPUT_DIR \
    --output-dir $LOCAL_FILE_OUTPUT_DIR \
    --strategy hi_res \
    --hi-res-model-name layout_v1.1.0 \
    --partition-by-api \
    --api-key $UNSTRUCTURED_API_KEY \
    --partition-endpoint $UNSTRUCTURED_API_URL \
    --additional-partition-args="{\"split_pdf_page\":\"true\", \"split_pdf_allow_failed\":\"true\", \"split_pdf_concurrency_level\": 15}"
```
</Accordion>
<Accordion title="Ingest Python">
```python Python
import os

from unstructured_ingest.v2.pipeline.pipeline import Pipeline
from unstructured_ingest.v2.interfaces import ProcessorConfig
from unstructured_ingest.v2.processes.connectors.local import (
    LocalIndexerConfig,
    LocalDownloaderConfig,
    LocalConnectionConfig,
    LocalUploaderConfig
)
from unstructured_ingest.v2.processes.partitioner import PartitionerConfig

if __name__ == "__main__":
    Pipeline.from_configs(
        context=ProcessorConfig(),
        indexer_config=LocalIndexerConfig(input_path=os.getenv("LOCAL_FILE_INPUT_DIR")),
        downloader_config=LocalDownloaderConfig(),
        source_connection_config=LocalConnectionConfig(),
        partitioner_config=PartitionerConfig(
            strategy="hi_res",
            hi_res_model_name="layout_v1.1.0",
            partition_by_api=True,
            api_key=os.getenv("UNSTRUCTURED_API_KEY"),
            partition_endpoint=os.getenv("UNSTRUCTURED_API_URL"),
            additional_partition_args={
                "split_pdf_page": True,
                "split_pdf_allow_failed": True,
                "split_pdf_concurrency_level": 15
            }
        ),
        uploader_config=LocalUploaderConfig(output_dir=os.getenv("LOCAL_FILE_OUTPUT_DIR"))
    ).run()
```
</Accordion>
</AccordionGroup>

If you have a local deployment of the Unstructured API, you can use other supported models, such as `yolox`.
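
The CLI example passes `--additional-partition-args` as a hand-escaped JSON string. As a minimal sketch (not part of the Ingest CLI itself), the same option can be generated with Python's standard `json` and `shlex` modules so that the quoting does not have to be written by hand. The helper name `build_partition_args` is hypothetical; the keys and values are the ones used in the CLI example.

```python
import json
import shlex


def build_partition_args(args: dict) -> str:
    """Return a shell-safe --additional-partition-args option (hypothetical helper)."""
    # json.dumps produces the JSON text; shlex.quote wraps it so that a POSIX
    # shell passes it to the CLI as a single argument.
    return "--additional-partition-args=" + shlex.quote(json.dumps(args))


if __name__ == "__main__":
    option = build_partition_args(
        {
            "split_pdf_page": "true",
            "split_pdf_allow_failed": "true",
            "split_pdf_concurrency_level": 15,
        }
    )
    # Paste the printed option at the end of the unstructured-ingest command.
    print(option)
```

The printed option uses single quotes around the JSON instead of backslash-escaped double quotes; a POSIX shell hands the CLI the same argument either way.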

### Specifying the language of a document for better OCR results

For better OCR results, you can specify the languages that your document is written in by using the `languages` parameter.
The following examples do this with the `--ocr-languages` option (CLI) and the `ocr_languages` argument (Python).
[View the list of available languages](https://github.com/tesseract-ocr/tessdata).

<AccordionGroup>
<Accordion title="Ingest CLI">
```bash CLI
unstructured-ingest \
  local \
    --input-path $LOCAL_FILE_INPUT_DIR \
    --output-dir $LOCAL_FILE_OUTPUT_DIR \
    --strategy ocr_only \
    --ocr-languages kor \
    --partition-by-api \
    --api-key $UNSTRUCTURED_API_KEY \
    --partition-endpoint $UNSTRUCTURED_API_URL \
    --additional-partition-args="{\"split_pdf_page\":\"true\", \"split_pdf_allow_failed\":\"true\", \"split_pdf_concurrency_level\": 15}"
```
</Accordion>
<Accordion title="Ingest Python">
```python Python
import os

from unstructured_ingest.v2.pipeline.pipeline import Pipeline
from unstructured_ingest.v2.interfaces import ProcessorConfig
from unstructured_ingest.v2.processes.connectors.local import (
    LocalIndexerConfig,
    LocalDownloaderConfig,
    LocalConnectionConfig,
    LocalUploaderConfig
)
from unstructured_ingest.v2.processes.partitioner import PartitionerConfig

if __name__ == "__main__":
    Pipeline.from_configs(
        context=ProcessorConfig(),
        indexer_config=LocalIndexerConfig(input_path=os.getenv("LOCAL_FILE_INPUT_DIR")),
        downloader_config=LocalDownloaderConfig(),
        source_connection_config=LocalConnectionConfig(),
        partitioner_config=PartitionerConfig(
            strategy="ocr_only",
            ocr_languages=["kor"],
            partition_by_api=True,
            api_key=os.getenv("UNSTRUCTURED_API_KEY"),
            partition_endpoint=os.getenv("UNSTRUCTURED_API_URL"),
            additional_partition_args={
                "split_pdf_page": True,
                "split_pdf_allow_failed": True,
                "split_pdf_concurrency_level": 15
            }
        ),
        uploader_config=LocalUploaderConfig(output_dir=os.getenv("LOCAL_FILE_OUTPUT_DIR"))
    ).run()
```
</Accordion>
</AccordionGroup>

### Saving bounding box coordinates

When elements are extracted from PDFs or images, it may be useful to get their bounding boxes as well.
Set the `coordinates` parameter to `true` to add bounding box coordinates to the elements in the response.

<AccordionGroup>
<Accordion title="Ingest CLI">
```bash CLI
unstructured-ingest \
  local \
    --input-path $LOCAL_FILE_INPUT_DIR \
    --output-dir $LOCAL_FILE_OUTPUT_DIR \
    --partition-by-api \
    --api-key $UNSTRUCTURED_API_KEY \
    --partition-endpoint $UNSTRUCTURED_API_URL \
    --strategy hi_res \
    --additional-partition-args="{\"coordinates\":\"true\", \"split_pdf_page\":\"true\", \"split_pdf_allow_failed\":\"true\", \"split_pdf_concurrency_level\": 15}"
```
</Accordion>
<Accordion title="Ingest Python">
```python Python
import os

from unstructured_ingest.v2.pipeline.pipeline import Pipeline
from unstructured_ingest.v2.interfaces import ProcessorConfig
from unstructured_ingest.v2.processes.connectors.local import (
    LocalIndexerConfig,
    LocalDownloaderConfig,
    LocalConnectionConfig,
    LocalUploaderConfig
)
from unstructured_ingest.v2.processes.partitioner import PartitionerConfig

if __name__ == "__main__":
    Pipeline.from_configs(
        context=ProcessorConfig(),
        indexer_config=LocalIndexerConfig(input_path=os.getenv("LOCAL_FILE_INPUT_DIR")),
        downloader_config=LocalDownloaderConfig(),
        source_connection_config=LocalConnectionConfig(),
        partitioner_config=PartitionerConfig(
            partition_by_api=True,
            api_key=os.getenv("UNSTRUCTURED_API_KEY"),
            partition_endpoint=os.getenv("UNSTRUCTURED_API_URL"),
            strategy="hi_res",
            additional_partition_args={
                "coordinates": True,
                "split_pdf_page": True,
                "split_pdf_allow_failed": True,
                "split_pdf_concurrency_level": 15
            }
        ),
        uploader_config=LocalUploaderConfig(output_dir=os.getenv("LOCAL_FILE_OUTPUT_DIR"))
    ).run()
```
</Accordion>
</AccordionGroup>
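
Once coordinates are present in the output, they can be reduced to a simple box. The following is a minimal sketch, assuming that each element stores its corners as a list of `[x, y]` pairs under `metadata.coordinates.points`; the function name `bounding_box` and the sample element are made up for illustration and should be checked against your own output files.

```python
def bounding_box(element: dict):
    """Return (x_min, y_min, x_max, y_max) for an element, or None if it has no coordinates."""
    # Assumed layout: element["metadata"]["coordinates"]["points"] is a list of [x, y] pairs.
    coordinates = element.get("metadata", {}).get("coordinates")
    if not coordinates or not coordinates.get("points"):
        return None
    xs = [x for x, _ in coordinates["points"]]
    ys = [y for _, y in coordinates["points"]]
    return (min(xs), min(ys), max(xs), max(ys))


if __name__ == "__main__":
    # Made-up element for illustration; real elements come from the JSON files
    # written to the output directory.
    sample = {
        "type": "Title",
        "text": "Example",
        "metadata": {
            "coordinates": {
                "points": [[10.0, 20.0], [10.0, 50.0], [200.0, 50.0], [200.0, 20.0]],
            }
        },
    }
    print(bounding_box(sample))
```

Elements without coordinates, for example those from file types that have no page layout, yield `None` instead of raising an error.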

### Returning unique element IDs

By default, the element ID is a SHA-256 hash of the element text, which makes the ID deterministic.
The downside is that the ID is not guaranteed to be unique:
different elements with the same text will have the same ID, and there could also be hash collisions.
To use UUIDs in the output instead, set `unique_element_ids=true`. Note that the element IDs
will then be random, so every time you partition the same file, you will get different IDs.
Unique IDs can be helpful if you'd like to use the IDs as a primary key in a database, for example.

<AccordionGroup>
<Accordion title="Ingest CLI">
```bash CLI
unstructured-ingest \
  local \
    --input-path $LOCAL_FILE_INPUT_DIR \
    --output-dir $LOCAL_FILE_OUTPUT_DIR \
    --partition-by-api \
    --api-key $UNSTRUCTURED_API_KEY \
    --partition-endpoint $UNSTRUCTURED_API_URL \
    --strategy hi_res \
    --additional-partition-args="{\"unique_element_ids\":\"true\", \"split_pdf_page\":\"true\", \"split_pdf_allow_failed\":\"true\", \"split_pdf_concurrency_level\": 15}"
```
</Accordion>
<Accordion title="Ingest Python">
```python Python
import os

from unstructured_ingest.v2.pipeline.pipeline import Pipeline
from unstructured_ingest.v2.interfaces import ProcessorConfig
from unstructured_ingest.v2.processes.connectors.local import (
    LocalIndexerConfig,
    LocalDownloaderConfig,
    LocalConnectionConfig,
    LocalUploaderConfig
)
from unstructured_ingest.v2.processes.partitioner import PartitionerConfig

if __name__ == "__main__":
    Pipeline.from_configs(
        context=ProcessorConfig(),
        indexer_config=LocalIndexerConfig(input_path=os.getenv("LOCAL_FILE_INPUT_DIR")),
        downloader_config=LocalDownloaderConfig(),
        source_connection_config=LocalConnectionConfig(),
        partitioner_config=PartitionerConfig(
            partition_by_api=True,
            api_key=os.getenv("UNSTRUCTURED_API_KEY"),
            partition_endpoint=os.getenv("UNSTRUCTURED_API_URL"),
            strategy="hi_res",
            additional_partition_args={
                "unique_element_ids": True,
                "split_pdf_page": True,
                "split_pdf_allow_failed": True,
                "split_pdf_concurrency_level": 15
            }
        ),
        uploader_config=LocalUploaderConfig(output_dir=os.getenv("LOCAL_FILE_OUTPUT_DIR"))
    ).run()
```
</Accordion>
</AccordionGroup>
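
The trade-off between deterministic and unique IDs can be illustrated with the standard library alone. This is only a sketch of the idea, not Unstructured's actual ID code: the exact hash input and any truncation that Unstructured applies may differ, and `text_hash_id` is a hypothetical name.

```python
import hashlib
import uuid


def text_hash_id(text: str) -> str:
    """Illustrative deterministic ID: the SHA-256 hex digest of the element text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    # Two of the three made-up elements share the same text.
    texts = ["Page 1", "Introduction", "Page 1"]

    hashed_ids = [text_hash_id(t) for t in texts]
    random_ids = [str(uuid.uuid4()) for _ in texts]

    # Hash-based IDs repeat for repeated text, so they cannot serve as a primary key.
    print("distinct hash-based IDs:", len(set(hashed_ids)))
    # UUIDs differ per element, but also differ on every run.
    print("distinct UUIDs:", len(set(random_ids)))
```

Running the sketch twice shows both properties at once: the hash-based IDs are identical across runs, while the UUIDs change each time.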

### Adding the chunking step after partitioning

You can combine partitioning and subsequent chunking in a single request by setting the `chunking_strategy` parameter.
By default, the `chunking_strategy` is set to `None`, and no chunking is performed.

[//]: # (TODO: add a link to the concepts section about chunking strategies. Need to create the shared Concepts section first)

<AccordionGroup>
<Accordion title="Ingest CLI">
```bash CLI
unstructured-ingest \
  local \
    --input-path $LOCAL_FILE_INPUT_DIR \
    --output-dir $LOCAL_FILE_OUTPUT_DIR \
    --chunking-strategy by_title \
    --chunk-max-characters 1024 \
    --partition-by-api \
    --api-key $UNSTRUCTURED_API_KEY \
    --partition-endpoint $UNSTRUCTURED_API_URL \
    --strategy hi_res \
    --additional-partition-args="{\"split_pdf_page\":\"true\", \"split_pdf_allow_failed\":\"true\", \"split_pdf_concurrency_level\": 15}"
```
</Accordion>
<Accordion title="Ingest Python">
```python Python
import os

from unstructured_ingest.v2.pipeline.pipeline import Pipeline
from unstructured_ingest.v2.interfaces import ProcessorConfig
from unstructured_ingest.v2.processes.connectors.local import (
    LocalIndexerConfig,
    LocalDownloaderConfig,
    LocalConnectionConfig,
    LocalUploaderConfig
)
from unstructured_ingest.v2.processes.partitioner import PartitionerConfig
from unstructured_ingest.v2.processes.chunker import ChunkerConfig

if __name__ == "__main__":
    Pipeline.from_configs(
        context=ProcessorConfig(),
        indexer_config=LocalIndexerConfig(input_path=os.getenv("LOCAL_FILE_INPUT_DIR")),
        downloader_config=LocalDownloaderConfig(),
        source_connection_config=LocalConnectionConfig(),
        partitioner_config=PartitionerConfig(
            partition_by_api=True,
            api_key=os.getenv("UNSTRUCTURED_API_KEY"),
            partition_endpoint=os.getenv("UNSTRUCTURED_API_URL"),
            strategy="hi_res",
            additional_partition_args={
                "split_pdf_page": True,
                "split_pdf_allow_failed": True,
                "split_pdf_concurrency_level": 15
            }
        ),
        chunker_config=ChunkerConfig(
            chunking_strategy="by_title",
            chunk_max_characters=1024
        ),
        uploader_config=LocalUploaderConfig(output_dir=os.getenv("LOCAL_FILE_OUTPUT_DIR"))
    ).run()
```
</Accordion>
</AccordionGroup>
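
After a chunked run, it can be useful to confirm how large the resulting chunks are. The following is a rough sketch, assuming that each output file is a JSON array of element objects with a `text` field and that output files end in `.json`; the helper `chunk_lengths` and the sample data are hypothetical, and a temporary directory stands in for `LOCAL_FILE_OUTPUT_DIR`.

```python
import json
import tempfile
from pathlib import Path


def chunk_lengths(output_dir: str) -> dict:
    """Map each output JSON file (relative path) to the text length of every chunk in it."""
    root = Path(output_dir)
    lengths = {}
    for path in sorted(root.rglob("*.json")):
        with path.open(encoding="utf-8") as f:
            elements = json.load(f)  # assumed: a JSON array of element objects
        lengths[str(path.relative_to(root))] = [len(el.get("text", "")) for el in elements]
    return lengths


if __name__ == "__main__":
    # Stand-in for LOCAL_FILE_OUTPUT_DIR: a temporary directory with one made-up output file.
    with tempfile.TemporaryDirectory() as output_dir:
        sample = [
            {"type": "CompositeElement", "text": "x" * 900},
            {"type": "CompositeElement", "text": "y" * 300},
        ]
        Path(output_dir, "example.pdf.json").write_text(json.dumps(sample), encoding="utf-8")

        for name, sizes in chunk_lengths(output_dir).items():
            print(name, "chunks:", len(sizes), "longest:", max(sizes, default=0))
```

To inspect a real run, call `chunk_lengths` with your own output directory and compare the longest value against the `chunk_max_characters` setting.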
Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@
---
title: Extract images and tables from documents
---

## Task

You want to get, decode, and show elements, such as images and tables, that are embedded in a PDF document.

## Approach

Extract the Base64-encoded representation of specific elements, such as images and tables, in the document.
For each of these extracted elements, decode the Base64-encoded representation of the element into its original visual representation
and then show it.

## To run this example

You will need a document that is one of the document types supported by the `extract_image_block_types` argument.
See the `extract_image_block_types` entry in [API Parameters](/platform-api/partition-api/api-parameters).
This example uses a PDF file with embedded images and tables.

import SharedAPIKeyURL from '/snippets/general-shared-text/api-key-url.mdx';
import ExtractImageBlockTypesIngestPy from '/snippets/how-to-api/extract_image_block_types_ingest.py.mdx';

## Code

For the [Unstructured Ingest Python library](/ingestion/python-ingest), you can use the standard Python
[json.load](https://docs.python.org/3/library/json.html#json.load) function to load into a Python object the contents of a JSON
file that the Ingest Python library outputs after processing is complete.
<ExtractImageBlockTypesIngestPy />
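
Separately from the snippet above, the decoding step on its own can be sketched with the standard library. This sketch assumes that extracted elements carry their payload in `metadata.image_base64` and, optionally, a MIME type in `metadata.image_mime_type`; the function `save_embedded_images` and the sample element are hypothetical, so verify the field names against your own output.

```python
import base64
import mimetypes
import tempfile
from pathlib import Path


def save_embedded_images(elements: list, target_dir: str) -> list:
    """Decode each element's Base64 payload (if any) and write it to target_dir."""
    out = Path(target_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for index, element in enumerate(elements):
        metadata = element.get("metadata", {})
        encoded = metadata.get("image_base64")  # assumed field name
        if not encoded:
            continue
        # Pick a file extension from the MIME type when one is given; fall back to .bin.
        mime = metadata.get("image_mime_type") or ""
        extension = mimetypes.guess_extension(mime) or ".bin"
        path = out / f"{element.get('element_id', index)}{extension}"
        path.write_bytes(base64.b64decode(encoded))
        written.append(path)
    return written


if __name__ == "__main__":
    # Made-up element: the payload is the Base64 form of b"hello", not a real image.
    sample = [
        {
            "type": "Image",
            "element_id": "abc123",
            "metadata": {"image_base64": base64.b64encode(b"hello").decode("ascii")},
        }
    ]
    with tempfile.TemporaryDirectory() as target:
        for path in save_embedded_images(sample, target):
            print(path.name, path.stat().st_size, "bytes")
```

With real output, pass the list returned by `json.load` as `elements`; the written files can then be opened in any image viewer.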
File renamed without changes.
