Feat: Ability to accept gzip compressed files #86 (#106)

kravetsmic · tabossert · awalker4 · web-flow · commit b019c6f4a61b · 2023-08-02T21:11:25.000Z
* feat(pipeline_api): Add returning text/csv
  Added additional smoketests
* chore(smoketest): don't check json for pdf and jpg files with `text/csv` response type
* refactor(notebook): update notebook and smoketests
* feat(gzip): Add ability to accept gzipped files
  Added tests with gzipped files
  Bump version to 0.0.20
* Chore: Improve image build time (#99)
* Fixes issue where detectron2 would not install on OSX
  Tested on Apple silicon based MacBook Pro. This installs tensorboard which is required on OSX and arm based CPUs for detectron2.
* Modify readme with updated desc
* Add version and comment
* First attempt at faster build times
  Build from custom base image
* add ecr login
* remove ecr-login from build
  We should push at another point and provide credentials then. GH secrets already added for that
* add secrets to allow docker login to ecr
* bad autocomplete syntax :)
* build from gold base image
* remove ecr stuff
* Modify install steps in Readme
  Also changed name of detectron2 makefile command to be more generic
* add apt-get update
* style: regenerated api
* update tests and version
* update api version
* update tests, notebook, and api
* update smoketests and regenerated gzip files
* chore: update smoketests, readme, removed gz files
* style: code refactor
* don't push
* chore: fix notebook, update docker test script
* Delete =2.12.2
* Delete output.json

---------

Co-authored-by: Trevor Bossert <37596773+tabossert@users.noreply.github.com>
Co-authored-by: Austin Walker <awalk89@gmail.com>
Co-authored-by: Austin Walker <austin@unstructured.io>
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,6 +1,8 @@
-## 0.0.34-dev0
+## 0.0.34-dev1
 
 * Add table support for image with parameter `skip_infer_table_types`
+* Add support for gzipped files
+  
 ## 0.0.33
 
 * Image tweak, move application entrypoint to scripts/app-start.sh
diff --git a/README.md b/README.md
@@ -28,6 +28,7 @@ This repo implements a pre-processing pipeline for the following documents. Curr
 | Plaintext | `.txt`, `.eml`, `.msg`, `.xml`, `.html`, `.md`, `.rst`, `.json`, `.rtf` |
 | Images    | `.jpeg`, `.png`               |
 | Documents | `.doc`, `.docx`, `.ppt`, `.pptx`, `.pdf`, `.odt`, `.epub`, `.csv`, `.tsv`, `.xlsx` |
+| Zipped    | `.gz`                         |
 
 
 ## :rocket: Unstructured API
diff --git a/sample-docs/layout-parser-paper.pdf.gz b/sample-docs/layout-parser-paper.pdf.gz
diff --git a/scripts/smoketest.py b/scripts/smoketest.py
@@ -20,7 +20,10 @@ def send_document(
     pdf_infer_table_structure="false",
 ):
     # Note: `content_type` is not passed into request since fast API will overwrite it.
-    files = {"files": (str(filename), open(filename, "rb"))}
+    if str(filename).endswith(".gz"):
+        files = {"files": (str(filename), open(filename, "rb"), "application/gzip")}
+    else:
+        files = {"files": (str(filename), open(filename, "rb"))}
     return requests.post(
         API_URL,
         files=files,
@@ -78,6 +81,7 @@ def send_document(
             "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
         ),
         ("fake-xml.xml", "text/xml"),
+        ("layout-parser-paper.pdf.gz", "application/gzip"),
     ],
 )
 def test_happy_path(example_filename, content_type):
@@ -88,8 +92,6 @@ def test_happy_path(example_filename, content_type):
     test_file = Path("sample-docs") / example_filename
     print(f"sending {content_type}")
     json_response = send_document(test_file, content_type)
-
-    print(json_response.content)
     assert json_response.status_code == 200
     assert len(json_response.json()) > 0
     assert len("".join(elem["text"] for elem in json_response.json())) > 20
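
The smoketest change above tags `.gz` uploads with the `application/gzip` content type before posting them. A minimal sketch of that idea follows, together with one way a receiver might unwrap such an upload. Both `build_files_payload` and `gunzip_upload` are hypothetical helper names used for illustration only; neither is part of this commit, and the server-side handling in this repository may differ.

```python
import gzip
from pathlib import Path


def build_files_payload(path):
    """Build a `files` mapping for requests.post, tagging .gz uploads.

    Hypothetical helper: it mirrors the branch added to send_document,
    but reads the bytes eagerly so no file handle is left open.
    """
    path = Path(path)
    data = path.read_bytes()
    if path.suffix == ".gz":
        # Explicit content type, as in the smoketest change.
        return {"files": (str(path), data, "application/gzip")}
    return {"files": (str(path), data)}


def gunzip_upload(filename, data):
    """Assumed receiver-side step: strip `.gz` and decompress the payload.

    Returns the inner filename and raw bytes; non-gzip uploads pass through.
    """
    if filename.endswith(".gz"):
        return filename[: -len(".gz")], gzip.decompress(data)
    return filename, data
```

The returned mapping can be passed as `requests.post(API_URL, files=build_files_payload(test_file))`. Stripping the `.gz` suffix leaves a name such as `layout-parser-paper.pdf`, from which the inner file type can then be inferred.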