Commit b019c6f

kravetsmic, tabossert, and awalker4 authored
Feat: Ability to accept gzip compressed files #86 (#106)
* feat(pipeline_api): Add returning text/csv. Added additional smoketests
* chore(smoketest): don't check json for pdf and jpg files with `text/csv` response type
* refactor(notebook): update notebook and smoketests
* feat(gzip): Add ability to accept gzipped files. Added tests with gzipped files. Bump version to 0.0.20
* Chore: Improve image build time (#99)
* Fixes issue where detectron2 would not install on OSX. Tested on Apple silicon based MacBook Pro. This installs tensorboard which is required on OSX and arm based cpu’s for detectron2.
* Modify readme with updated desc
* Add version and comment
* First attempt at faster build times. Build from custom base image
* add ecr login
* remove ecr-login from build. We should push at another point and provide credentials then. GH secrets already added for that
* add secrets to allow docker login to ecr
* bad autocomplete syntax :)
* build from gold base image
* remove ecr stuff
* Modify install steps in Readme. Also changed name of detectron2 makefile command to be more generic
* add apt-get update
* style: regenerated api
* update tests and version
* update api version
* update tests, notebook, and api
* update smoketests and regenerated gzip files
* chore: update smoketests, readme, removed gz files
* style: code refactor
* don't push
* chore: fix notebook, update docker test sccript
* Delete =2.12.2
* Delete output.json

---------

Co-authored-by: Trevor Bossert <[email protected]>
Co-authored-by: Austin Walker <[email protected]>
Co-authored-by: Austin Walker <[email protected]>
1 parent 76c068b commit b019c6f

4 files changed: +9 -4 lines changed

CHANGELOG.md

Lines changed: 3 additions & 1 deletion
@@ -1,6 +1,8 @@
-## 0.0.34-dev0
+## 0.0.34-dev1
 
 * Add table support for image with parameter `skip_infer_table_types`
+* Add support for gzipped files
+
 ## 0.0.33
 
 * Image tweak, move application entrypoint to scripts/app-start.sh
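
The server-side change behind "Add support for gzipped files" is not part of this diff. As a rough illustration only, an upload handler could unwrap one gzip layer before handing the bytes to the normal pipeline. The helper name `maybe_decompress` and the magic-byte check are assumptions made for this sketch, not the project's actual implementation:

```python
import gzip
from typing import Tuple

# The first two bytes of every gzip stream (RFC 1952).
GZIP_MAGIC = b"\x1f\x8b"


def maybe_decompress(filename: str, data: bytes) -> Tuple[str, bytes]:
    """Hypothetical helper: unwrap a single gzip layer if the payload has one.

    Returns the (possibly renamed) filename and the raw bytes.
    """
    if data[:2] == GZIP_MAGIC:
        # Drop a trailing ".gz" so later steps see the inner file's extension.
        inner_name = filename[:-3] if filename.endswith(".gz") else filename
        return inner_name, gzip.decompress(data)
    return filename, data


original = b"%PDF-1.5 example payload"
name, payload = maybe_decompress(
    "layout-parser-paper.pdf.gz", gzip.compress(original)
)
print(name, payload == original)
```

Checking the magic bytes rather than trusting the extension alone keeps the sketch from failing on a file that is merely named `.gz`.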

README.md

Lines changed: 1 addition & 0 deletions
@@ -28,6 +28,7 @@ This repo implements a pre-processing pipeline for the following documents. Curr
 | Plaintext | `.txt`, `.eml`, `.msg`, `.xml`, `.html`, `.md`, `.rst`, `.json`, `.rtf` |
 | Images | `.jpeg`, `.png` |
 | Documents | `.doc`, `.docx`, `.ppt`, `.pptx`, `.pdf`, `.odt`, `.epub`, `.csv`, `.tsv`, `.xlsx` |
+| Zipped | `.gz` |
 
 
 ## :rocket: Unstructured API
4.17 MB
Binary file not shown.
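
The binary file's contents are not shown. If a gzipped sample such as the `layout-parser-paper.pdf.gz` used by the smoke test needs to be regenerated, one way to do it with the standard library is sketched below. The function name `gzip_file` is hypothetical, and the demo compresses a throwaway temporary file rather than a real sample document:

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def gzip_file(src: Path) -> Path:
    """Write `<src>.gz` next to `src` and return the new path."""
    dest = src.with_name(src.name + ".gz")
    # Stream the copy so large documents are not loaded into memory at once.
    with open(src, "rb") as f_in, gzip.open(dest, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
    return dest


with tempfile.TemporaryDirectory() as tmp_dir:
    sample = Path(tmp_dir) / "sample.txt"
    sample.write_bytes(b"hello gzip")
    compressed = gzip_file(sample)
    fixture_name = compressed.name
    round_trip = gzip.decompress(compressed.read_bytes())
    print(fixture_name)
```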

scripts/smoketest.py

Lines changed: 5 additions & 3 deletions
@@ -20,7 +20,10 @@ def send_document(
     pdf_infer_table_structure="false",
 ):
     # Note: `content_type` is not passed into request since fast API will overwrite it.
-    files = {"files": (str(filename), open(filename, "rb"))}
+    if str(filename).endswith(".gz"):
+        files = {"files": (str(filename), open(filename, "rb"), "application/gzip")}
+    else:
+        files = {"files": (str(filename), open(filename, "rb"))}
     return requests.post(
         API_URL,
         files=files,
@@ -78,6 +81,7 @@ def send_document(
             "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
         ),
         ("fake-xml.xml", "text/xml"),
+        ("layout-parser-paper.pdf.gz", "application/gzip"),
     ],
 )
 def test_happy_path(example_filename, content_type):
@@ -88,8 +92,6 @@ def test_happy_path(example_filename, content_type):
     test_file = Path("sample-docs") / example_filename
     print(f"sending {content_type}")
     json_response = send_document(test_file, content_type)
-
-    print(json_response.content)
     assert json_response.status_code == 200
     assert len(json_response.json()) > 0
     assert len("".join(elem["text"] for elem in json_response.json())) > 20
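
The branch added to `send_document` sets an explicit `application/gzip` content type only for `.gz` uploads. The same decision can be factored into a small helper that takes an already opened file object, which makes it easy to close the handle with a `with` block. This is a minimal sketch, not code from the commit: `upload_content_type` and `build_files` are hypothetical names, and the tuple shapes follow the `(filename, fileobj)` and `(filename, fileobj, content_type)` forms accepted by the `files=` argument of `requests.post`:

```python
import io
from typing import Optional


def upload_content_type(filename) -> Optional[str]:
    """Return an explicit MIME type for gzipped uploads, otherwise None."""
    return "application/gzip" if str(filename).endswith(".gz") else None


def build_files(filename, file_obj) -> dict:
    """Build the `files=` mapping for `requests.post` (hypothetical helper)."""
    content_type = upload_content_type(filename)
    if content_type is None:
        # Two-element form: no explicit content type is sent for this part.
        return {"files": (str(filename), file_obj)}
    # Three-element form: mark the part as gzip so the server can unwrap it.
    return {"files": (str(filename), file_obj, content_type)}


gz_payload = build_files("layout-parser-paper.pdf.gz", io.BytesIO(b"\x1f\x8b"))
xml_payload = build_files("fake-xml.xml", io.BytesIO(b"<a/>"))
print(len(gz_payload["files"]), len(xml_payload["files"]))

# Usage with a real file (assumes `requests` is installed and API_URL is defined):
# with open(path, "rb") as f:
#     response = requests.post(API_URL, files=build_files(path, f))
```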
