Commit 238f985

feat: add --images support to unstructured-get-json.sh (#3888)
E.g., now can run:

```bash
# extracts base64 encoded image data for `Table` and `Image` elements
$ unstructured-get-json.sh --trace --verbose --images /t/docs/Captur-1317-5_ENG-p5.pdf

# also extracts `Title` elements (see screenshot)
$ IMAGE_BLOCK_TYPES='"title","table","image"' unstructured-get-json.sh --trace --verbose --images /t/docs/Captur-1317-5_ENG-p5.pdf
```

It was discovered during testing that "narrativetext" does not work, probably due to camel casing of NarrativeText 😬

![image](https://github.com/user-attachments/assets/e6414a57-81e1-4560-b1b2-dce3b1c2c804)
1 parent b5b1307 commit 238f985

1 file changed: +10 -1 lines changed


scripts/user/unstructured-get-json.sh

Lines changed: 10 additions & 1 deletion
@@ -17,6 +17,7 @@ Options:
   --fast  fast strategy: No OCR, just extract embedded text
   --ocr-only  ocr_only strategy: Perform OCR (Optical Character Recognition) only. No layout segmentation.
   --tables  Enable table extraction: tables are represented as html in metadata
+  --images  Include base64images in json
   --coordinates  Include coordinates in the output
   --trace  Enable trace logging for debugging, useful to cut and paste the executed curl call
   --verbose  Enable verbose logging including printing first 8 elements to stdout
@@ -39,6 +40,7 @@ if [ "$#" -eq 0 ]; then
   exit 1
 fi
 
+IMAGE_BLOCK_TYPES=${IMAGE_BLOCK_TYPES:-'"image", "table"'}
 API_KEY=${UNST_API_KEY:-""}
 TMP_DOWNLOADS_DIR="$HOME/tmp/unst-downloads"
 TMP_OUTPUTS_DIR="$HOME/tmp/unst-outputs"
@@ -68,6 +70,7 @@ TRACE=false
 COORDINATES=false
 FREEMIUM=false
 TABLES=true
+IMAGES=false
 S3=""
 
 while [[ "$#" -gt 0 ]]; do
@@ -100,6 +103,10 @@ while [[ "$#" -gt 0 ]]; do
     TABLES=true
     shift
     ;;
+  --images)
+    IMAGES=true
+    shift
+    ;;
   --coordinates)
     COORDINATES=true
     shift
@@ -180,12 +187,14 @@ CURL_COORDINATES=()
 [[ "$COORDINATES" == "true" ]] && CURL_COORDINATES=(-F "coordinates=true")
 CURL_TABLES=()
 [[ "$TABLES" == "true" ]] && CURL_TABLES=(-F "skip_infer_table_types='[]'")
+CURL_IMAGES=()
+[[ "$IMAGES" == "true" ]] && CURL_IMAGES=(-F "extract_image_block_types=[$IMAGE_BLOCK_TYPES]")
 
 curl -q -X 'POST' \
   "$API_ENDPOINT" \
   "${CURL_API_KEY[@]}" -H 'accept: application/json' \
   -H 'Content-Type: multipart/form-data' \
-  "${CURL_STRATEGY[@]}" "${CURL_COORDINATES[@]}" "${CURL_TABLES[@]}" -F "files=@${INPUT_FILEPATH}" \
+  "${CURL_STRATEGY[@]}" "${CURL_COORDINATES[@]}" "${CURL_TABLES[@]}" "${CURL_IMAGES[@]}" -F "files=@${INPUT_FILEPATH}" \
   -o "${JSON_OUTPUT_FILEPATH}"
 
 JSON_FILE_SIZE=$(wc -c <"${JSON_OUTPUT_FILEPATH}")
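
To see how the pieces of this change fit together, here is a minimal standalone sketch of the flag handling and form-field construction. It is not the script itself: `parse_and_build` is a hypothetical helper introduced only for illustration, and the sketch ignores every option other than `--images`. The variable names `IMAGE_BLOCK_TYPES` and `CURL_IMAGES` and the `extract_image_block_types` field are taken from the diff above.

```bash
#!/usr/bin/env bash
# Sketch only: mirrors the --images handling added in this commit.
# parse_and_build is a hypothetical helper, not part of unstructured-get-json.sh.
parse_and_build() {
  local images=false
  # Same default as the script: caller may override via the environment.
  local block_types=${IMAGE_BLOCK_TYPES:-'"image", "table"'}

  while [[ "$#" -gt 0 ]]; do
    case "$1" in
      --images)
        images=true
        shift
        ;;
      *)
        shift # every other argument is ignored in this sketch
        ;;
    esac
  done

  # Empty array by default, so expanding it adds no words to the curl call.
  CURL_IMAGES=()
  if [[ "$images" == "true" ]]; then
    CURL_IMAGES=(-F "extract_image_block_types=[$block_types]")
  fi
}

parse_and_build --trace --images some.pdf
# Print each curl argument on its own line.
printf '%s\n' "${CURL_IMAGES[@]}"
```

Because `CURL_IMAGES` stays an empty array unless `--images` is passed, `"${CURL_IMAGES[@]}"` expands to zero words and the existing `curl` invocation is unchanged for callers who do not opt in.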
