This PR adds support for parsing all list-type parameters, including
`extract_image_block_types`, when the unstructured API is called through
the unstructured client SDKs (Python/JS) generated by `speakeasy`.
Currently, `speakeasy` does not generate client code that passes
list-type parameters correctly to the unstructured API, because it is
not expected to provide client code specific to `FastAPI`, on which the
unstructured API relies. To address this, I updated the unstructured API
code to parse all list-type parameters passed as JSON-formatted lists
(e.g. `'["image", "table"]'`).
**NOTE:** You must pass each list-type parameter as a JSON-formatted list
when calling the unstructured API via the unstructured client SDK
(e.g. `extract_image_block_types='["image", "table"]'`,
`skip_infer_table_types='["docx", "xlsx"]'`, ...).
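As a small illustration of the note above, the JSON-formatted string can be built from an ordinary Python list rather than typed by hand. This is only a sketch: `to_json_list_param` is a hypothetical helper name and is not part of the SDK or the API.

```python
import json


def to_json_list_param(values):
    """Hypothetical helper: serialize a Python list as a JSON-formatted string."""
    return json.dumps(list(values))


# The resulting string is what would be passed as the SDK parameter value.
extract_image_block_types = to_json_list_param(["image", "table"])
skip_infer_table_types = to_json_list_param(["docx", "xlsx"])
print(extract_image_block_types)
print(skip_infer_table_types)
```

Using `json.dumps` avoids quoting mistakes such as single-quoted items, which are not valid JSON.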
### Summary
- update `SmartValueParser.value_or_first_element()` to parse a
JSON-formatted string (e.g. `'["image", "table"]'`) that can be converted
to a list
- apply `SmartValueParser.value_or_first_element()` to all list-type
parameters
- remove the existing `extract_image_block_types` parsing logic
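The parsing behavior summarized above could look roughly like the following. This is a minimal sketch under assumptions, not the actual `SmartValueParser.value_or_first_element()` implementation: `parse_list_param` is a hypothetical standalone function, and the fallback rules (return the raw value when it is not a valid JSON list) are an illustrative choice.

```python
import json
from typing import List, Union


def parse_list_param(value: Union[str, List[str], None]) -> List[str]:
    """Hypothetical helper: normalize a list-type form parameter to a list of strings."""
    if value is None:
        return []
    # Clients such as `requests` may already deliver a real list of values.
    items = [value] if isinstance(value, str) else list(value)
    # A single string that looks like a JSON array is decoded into a list.
    if len(items) == 1:
        candidate = items[0].strip()
        if candidate.startswith("[") and candidate.endswith("]"):
            try:
                decoded = json.loads(candidate)
            except json.JSONDecodeError:
                # Not valid JSON: keep the raw value unchanged.
                return items
            if isinstance(decoded, list):
                return [str(item) for item in decoded]
    return items


print(parse_list_param('["image", "table"]'))
print(parse_list_param(["Image", "Table"]))
```

With this shape, both the SDK-style input (one JSON-formatted string) and the multi-value form input end up as the same kind of list.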
### Testing
- via unstructured_client_sdk (Python)
```
from unstructured_client import UnstructuredClient
from unstructured_client.models import shared
from unstructured_client.models.errors import SDKError

s = UnstructuredClient(
    server_url="http://localhost:8000/general/v0/general",
    api_key_auth="YOUR-API-KEY",
)
filename = "sample-docs/embedded-images-tables.pdf"
with open(filename, "rb") as f:
    # Note that this currently only supports a single file
    files = shared.Files(
        content=f.read(),
        file_name=filename,
    )
req = shared.PartitionParameters(
    files=files,
    # Other partition params
    strategy="hi_res",
    # List-type parameter passed as a JSON-formatted string
    extract_image_block_types='["image", "table"]',
    languages=["pdf"],
)
try:
    resp = s.general.partition(req)
    # Print the MIME type of every element that carries an extracted image
    print([el.get("metadata").get("image_mime_type") for el in resp.elements if el.get("metadata").get("image_mime_type")])
except SDKError as e:
    print(e)
```
- via unstructured_client_sdk (JS)
```
import { UnstructuredClient } from "unstructured-client";
import * as fs from "fs";

const key = "YOUR-API-KEY";
const client = new UnstructuredClient({
    serverURL: "http://localhost:8000",
    security: {
        apiKeyAuth: key,
    },
});
const filename = "sample-docs/embedded-images-tables.pdf";
const data = fs.readFileSync(filename);
client.general.partition({
    // Note that this currently only supports a single file
    files: {
        content: data,
        fileName: filename,
    },
    // Other partition params
    strategy: "hi_res",
    extractImageBlockTypes: '["image", "table"]',
}).then((res) => {
    if (res.statusCode == 200) {
        console.log(res.elements);
    }
}).catch((e) => {
    console.log(e.statusCode);
    console.log(e.body);
});
```
- via default `requests` client (Python)
```
import requests

url = "http://localhost:8000/general/v0/general"
headers = {
    "accept": "application/json",
    "unstructured-api-key": "YOUR-API-KEY",
}
data = {
    "strategy": "hi_res",
    # requests sends a list value as repeated form fields
    "extract_image_block_types": ["Image", "Table"],
}
filename = "sample-docs/embedded-images-tables.pdf"
# Open the file in a context manager so it is closed even if the request fails
with open(filename, "rb") as f:
    response = requests.post(url, headers=headers, data=data, files={"files": f})
elements = response.json()
# Print the MIME type of every element that carries an extracted image
print([el.get("metadata").get("image_mime_type") for el in elements if el.get("metadata").get("image_mime_type")])
```
- via `curl` command
```
curl -X 'POST' \
  'http://localhost:8000/general/v0/general' \
  -H 'accept: application/json' \
  -H 'Content-Type: multipart/form-data' \
  -F 'files=@sample-docs/embedded-images-tables.pdf' \
  -F 'strategy=hi_res' \
  -F 'extract_image_block_types=["image", "table"]' \
  | jq -C . | less -R
```