Commit ffbaa0a

Unstructured API Quickstart (#712)
1 parent 1ce5f28 commit ffbaa0a

1 file changed: 134 additions & 0 deletions
@@ -0,0 +1,134 @@
---
title: Unstructured API Quickstart
---

<Tip>Just need the sample code? [Skip ahead](#sample-code) to it now!</Tip>

The following code shows how to use the [Unstructured Python SDK](/api-reference/partition/sdk-python)
to process one or more local files through
the [Unstructured Partition Endpoint](/api-reference/partition/overview).

To run this code, you will need the following:

- An Unstructured account and an Unstructured API key for your account. [Learn how](/api-reference/partition/overview#get-started).
- Python 3.9 or higher installed on your local machine.
- A Python virtual environment, which is recommended but not required, for isolating and versioning your project's dependencies.
  To create and activate a virtual environment, you can use a tool such as
  [uv](https://docs.astral.sh/uv/) (recommended) or Python's built-in
  [venv](https://docs.python.org/3/library/venv.html) module.
- The Unstructured Python SDK installed on your local machine, for example by running one of the
  following commands:

  - For `uv`, run `uv add unstructured-client`
  - For `venv` (or for no virtual environment), run `pip install unstructured-client`

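Before running the sample, it can help to confirm the two hard prerequisites from the list above. The following is a minimal sketch and not part of the SDK: `check_prerequisites` is a hypothetical helper name, and it assumes the SDK's distribution name is `unstructured-client`, as used in the install commands.

```python
import sys
from importlib import metadata


def check_prerequisites(min_python=(3, 9), package="unstructured-client"):
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []

    # The quickstart requires Python 3.9 or higher.
    if sys.version_info < min_python:
        problems.append(
            f"Python {min_python[0]}.{min_python[1]}+ is required, "
            f"found {sys.version_info.major}.{sys.version_info.minor}"
        )

    # Look up the installed distribution without importing it.
    try:
        metadata.version(package)
    except metadata.PackageNotFoundError:
        problems.append(f"{package} is not installed")

    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print(problem)
```

Run this inside the same virtual environment that you plan to use for the sample code, because the check only sees packages installed in the active environment.
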
After you have these in place, add the [sample code](#sample-code) to a Python file on your local machine, make the following changes, and then run the file to see the results:

- Replace `<unstructured-api-key>` with your Unstructured API key.
- To process all files within a directory, change `None` for `input_dir` to a string that contains the path to the directory on your local machine. This can be a relative or absolute path. The code walks the directory recursively and skips any files ending in `.json`.
- To process specific files within a directory or across multiple directories, change `None` for `input_files` to a string that contains
  a comma-separated list of filepaths on your local machine, for example `"./input/2507.13305v1.pdf,./input2/table-multi-row-column-cells.pdf"`. These filepaths
  can be relative or absolute.

  <Note>
    You must set at least one of `input_dir` or `input_files`. If both are set to something other than `None`, the `input_dir` setting takes precedence, and the `input_files` setting is ignored.
  </Note>

- For the `output_dir` parameter, specify a string that contains the path to the directory on your local machine where you want the code to save its JSON output files. If the specified directory does not exist, the code creates it for you. This path can be relative or absolute.

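The input rules above can be sketched as a single function. This is an illustration only: `resolve_input_files` is a hypothetical name that does not appear in the SDK or in the sample code, and sorting the directory results is an addition made here for predictable ordering.

```python
import os


def resolve_input_files(input_dir=None, input_files=None):
    """Return the list of file paths to process.

    input_dir takes precedence when both settings are provided;
    input_files is a comma-separated string of paths.
    """
    if input_dir:
        # Walk the directory tree and skip .json files, so that earlier
        # JSON output is not picked up as input.
        paths = []
        for root, _, files in os.walk(input_dir):
            for name in files:
                if not name.endswith(".json"):
                    paths.append(os.path.join(root, name))
        return sorted(paths)

    if input_files:
        # Tolerate spaces around commas and ignore empty entries.
        return [p.strip() for p in input_files.split(",") if p.strip()]

    raise ValueError("Set input_dir or input_files.")
```

For example, `resolve_input_files(input_files="./a.pdf, ./b.pdf")` returns `["./a.pdf", "./b.pdf"]`, and passing both arguments returns only the files found under `input_dir`.
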
## Sample code

```python Python SDK
import asyncio
import os
import json
import unstructured_client
from unstructured_client.models import shared, errors

client = unstructured_client.UnstructuredClient(
    api_key_auth="<unstructured-api-key>"
)

async def partition_file_via_api(filename):
    # Read the file up front so that the file handle is closed
    # before the request is sent.
    with open(filename, "rb") as f:
        content = f.read()

    req = {
        "partition_parameters": {
            "files": {
                "content": content,
                "file_name": os.path.basename(filename),
            },
            "strategy": shared.Strategy.AUTO,
            "vlm_model": "gpt-4o",
            "vlm_model_provider": "openai",
            "languages": ['eng'],
            "split_pdf_page": True,
            "split_pdf_allow_failed": True,
            "split_pdf_concurrency_level": 15
        }
    }

    try:
        res = await client.general.partition_async(request=req)
        return res.elements
    except errors.UnstructuredClientError as e:
        print(f"Error partitioning {filename}: {e.message}")
        return []

async def process_file_and_save_result(input_filename, output_dir):
    elements = await partition_file_via_api(input_filename)

    # Write output only when partitioning returned at least one element.
    if elements:
        results_name = f"{os.path.basename(input_filename)}.json"
        output_filename = os.path.join(output_dir, results_name)

        with open(output_filename, "w") as f:
            json.dump(elements, f)

def load_filenames_in_directory(input_dir):
    # Collect every file under input_dir, skipping .json files.
    filenames = []
    for root, _, files in os.walk(input_dir):
        for file in files:
            if not file.endswith('.json'):
                filenames.append(os.path.join(root, file))

    return filenames

async def process_files():
    # Initialize with either a directory name, to process everything in the dir,
    # or a comma-separated list of filepaths.
    input_dir = None   # "path/to/input/directory"
    input_files = None # "path/to/file,path/to/file,path/to/file"

    # Set to the directory for output json files. This dir
    # will be created if needed.
    output_dir = "./output/"

    if input_dir:
        filenames = load_filenames_in_directory(input_dir)
    elif input_files:
        filenames = [name.strip() for name in input_files.split(",") if name.strip()]
    else:
        raise ValueError("Set input_dir or input_files before running this code.")

    os.makedirs(output_dir, exist_ok=True)

    tasks = []
    for filename in filenames:
        tasks.append(
            process_file_and_save_result(filename, output_dir)
        )

    await asyncio.gather(*tasks)

if __name__ == "__main__":
    asyncio.run(process_files())
```
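
After the sample code finishes, a quick way to check the results is to count the element types in each output file. The following is a minimal sketch, not part of the SDK: `summarize_output` is a hypothetical helper name, and it assumes that each output file holds a JSON array of element objects with a `type` field (for example `Title` or `NarrativeText`).

```python
import json
import os
from collections import Counter


def summarize_output(output_dir):
    """Map each JSON file name in output_dir to a Counter of element types."""
    summary = {}
    for name in sorted(os.listdir(output_dir)):
        if not name.endswith(".json"):
            continue
        with open(os.path.join(output_dir, name), encoding="utf-8") as f:
            elements = json.load(f)
        # Assumption: each element is a dict with a "type" key.
        summary[name] = Counter(el.get("type", "unknown") for el in elements)
    return summary


# Print a summary only if the sample code's default output directory exists.
if __name__ == "__main__" and os.path.isdir("./output/"):
    for name, counts in summarize_output("./output/").items():
        print(name, dict(counts))
```

A file that is missing from the summary was either skipped because partitioning returned no elements or failed with an error that the sample code printed to the console.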

## Next steps

This quickstart shows how to use the Unstructured Partition Endpoint, which is intended for rapid prototyping of
some of Unstructured's [partitioning](/api-reference/partition/partitioning) strategies, with limited support for [chunking](/api-reference/partition/chunking).
It works only with local files.

For production-level scenarios, switch over to the [Unstructured Workflow Endpoint](/api-reference/workflow/overview),
which supports file processing in batches, files and data in remote locations, full [chunking](/ui/chunking),
generating [embeddings](/ui/embedding), applying post-transform [enrichments](/ui/enriching/overview),
using the latest and highest-performing models, and much more.
[Get started](/api-reference/workflow/overview).
