Commit c9909f2

feat: add support for API key validation for self-hosted (#322)
This PR lets self-hosters validate requests with an API key header. When the optional `UNSTRUCTURED_API_KEY` environment variable is set and the `unstructured-api-key` request header does not match it, the server responds with `401` instead of fulfilling the request. This lets people self-host Unstructured knowing that only internal applications with access to the shared key can use the service. Closes #321
1 parent c956a6c commit c9909f2

6 files changed, with 44 additions and 3 deletions

CHANGELOG.md

Lines changed: 4 additions & 0 deletions
@@ -1,3 +1,7 @@
+## 0.0.60
+
+* Enable self-hosted authorization using UNSTRUCTURED_API_KEY env variable
+
 ## 0.0.59
 
 * Bump unstructured to 0.11.0

README.md

Lines changed: 3 additions & 0 deletions
@@ -323,6 +323,9 @@ As mentioned above, processing a pdf using `hi_res` is currently a slow operatio
 * `UNSTRUCTURED_PARALLEL_MODE_SPLIT_SIZE` - the number of pages to be processed in one request, default is `1`.
 * `UNSTRUCTURED_PARALLEL_RETRY_ATTEMPTS` - the number of retry attempts on a retryable error, default is `2`. (i.e. 3 attempts are made in total)
 
+#### Security
+You may also set the optional `UNSTRUCTURED_API_KEY` env variable to enable request validation for your self-hosted instance of Unstructured. If set, only requests including an `unstructured-api-key` header with the same value will be fulfilled. Otherwise, the server will return a 401 indicating that the request is unauthorized.
+
 #### Controlling Server Load
 Some documents will use a lot of memory as they're being processed. To mitigate OOM errors, the server will return a 503 if the host's available memory drops below 2GB. This is configurable with `UNSTRUCTURED_MEMORY_FREE_MINIMUM_MB`.
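
As an illustration of the Security paragraph added to the README above, here is a minimal client-side sketch that attaches the `unstructured-api-key` header to a multipart upload using only the standard library. The host, port, key value, file name, and the helper name `build_partition_request` are assumptions made for this example and are not part of the commit. The request is built but not sent.

```python
import urllib.request
import uuid


def build_partition_request(
    url: str, api_key: str, filename: str, content: bytes
) -> urllib.request.Request:
    """Build (without sending) a multipart POST that carries the API key header."""
    boundary = uuid.uuid4().hex
    body = b"\r\n".join(
        [
            f"--{boundary}".encode(),
            f'Content-Disposition: form-data; name="files"; filename="{filename}"'.encode(),
            b"Content-Type: application/octet-stream",
            b"",
            content,
            f"--{boundary}--".encode(),
            b"",
        ]
    )
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            # Must equal the server's UNSTRUCTURED_API_KEY value.
            "unstructured-api-key": api_key,
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )


# Hypothetical host, port, and key; adjust to your deployment.
req = build_partition_request(
    "http://localhost:8000/general/v0/general", "foobar", "fake-xml.xml", b"<root/>"
)
# urllib.request.urlopen(req) would send it. Per the README text, a wrong or
# missing key should produce a 401 when UNSTRUCTURED_API_KEY is set.
```

HTTP header names are case-insensitive, so the capitalization that `urllib` applies to the header name does not affect the check on the server.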

prepline_general/api/app.py

Lines changed: 1 addition & 1 deletion
@@ -11,7 +11,7 @@
 app = FastAPI(
     title="Unstructured Pipeline API",
     description="""""",
-    version="0.0.59",
+    version="0.0.60",
     docs_url="/general/docs",
     openapi_url="/general/openapi.json",
 )

prepline_general/api/general.py

Lines changed: 8 additions & 1 deletion
@@ -675,7 +675,7 @@ def return_content_type(filename):
 
 
 @router.post("/general/v0/general")
-@router.post("/general/v0.0.59/general")
+@router.post("/general/v0.0.60/general")
 def pipeline_1(
     request: Request,
     gz_uncompressed_content_type: Optional[str] = Form(default=None),
@@ -697,6 +697,13 @@ def pipeline_1(
     new_after_n_chars: List[str] = Form(default=[]),
     max_characters: List[str] = Form(default=[]),
 ):
+    if api_key_env := os.environ.get("UNSTRUCTURED_API_KEY"):
+        api_key = request.headers.get("unstructured-api-key")
+        if api_key != api_key_env:
+            raise HTTPException(
+                detail=f"API key {api_key} is invalid", status_code=status.HTTP_401_UNAUTHORIZED
+            )
+
     if files:
         for file_index in range(len(files)):
             if files[file_index].content_type == "application/gzip":
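
The check in this hunk compares the keys with `!=` and echoes the submitted key back in the error detail. A possible hardening, offered only as a sketch and not as what the commit does, is to compare in constant time with `hmac.compare_digest` and return a generic message. The helper name `api_key_matches` is hypothetical.

```python
import hmac
from typing import Optional


def api_key_matches(expected: str, provided: Optional[str]) -> bool:
    """Return True only if the provided key equals the expected one.

    Uses a constant-time comparison. A missing header never matches.
    """
    if provided is None:
        return False
    # Encode to bytes so that non-ASCII keys are also accepted by compare_digest.
    return hmac.compare_digest(expected.encode("utf-8"), provided.encode("utf-8"))


# Inside the endpoint, the check could then read (an assumption, not the commit's code):
#
#     if expected := os.environ.get("UNSTRUCTURED_API_KEY"):
#         if not api_key_matches(expected, request.headers.get("unstructured-api-key")):
#             raise HTTPException(
#                 status_code=status.HTTP_401_UNAUTHORIZED,
#                 detail="Invalid or missing API key",
#             )
```

Keeping the submitted key out of the response body also keeps it out of any client-side logs that record error details.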

preprocessing-pipeline-family.yaml

Lines changed: 1 addition & 1 deletion
@@ -1,2 +1,2 @@
 name: general
-version: 0.0.59
+version: 0.0.60

test_general/api/test_app.py

Lines changed: 27 additions & 0 deletions
@@ -439,6 +439,33 @@ def test_general_api_returns_503(monkeypatch):
     assert response.status_code == 503
 
 
+def test_general_api_returns_401(monkeypatch):
+    """
+    When UNSTRUCTURED_API_KEY is set, return a 401 if the unstructured-api-key header does not match
+    """
+    monkeypatch.setenv("UNSTRUCTURED_API_KEY", "foobar")
+
+    client = TestClient(app)
+    test_file = Path("sample-docs") / "fake-xml.xml"
+    response = client.post(
+        MAIN_API_ROUTE,
+        files=[("files", (str(test_file), open(test_file, "rb")))],
+        headers={"unstructured-api-key": "foobar"},
+    )
+
+    assert response.status_code == 200
+
+    client = TestClient(app)
+    test_file = Path("sample-docs") / "fake-xml.xml"
+    response = client.post(
+        MAIN_API_ROUTE,
+        files=[("files", (str(test_file), open(test_file, "rb")))],
+        headers={"unstructured-api-key": "helloworld"},
+    )
+
+    assert response.status_code == 401
+
+
 class MockResponse:
     def __init__(self, status_code):
         self.status_code = status_code
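
The test in this hunk covers a matching and a mismatched header. The cases it leaves out (variable unset, variable set to an empty string, header missing) can be enumerated with a small stand-in that mirrors the logic of the check in `general.py`. The function name `request_is_allowed` is hypothetical and exists only for this sketch; it is not part of the repository.

```python
from typing import Optional


def request_is_allowed(env_key: Optional[str], header_key: Optional[str]) -> bool:
    """Mirror of the diff's check: validation applies only when env_key is truthy."""
    if env_key:
        return header_key == env_key
    return True


cases = [
    # (UNSTRUCTURED_API_KEY, unstructured-api-key header, request allowed?)
    (None, None, True),               # variable unset: validation disabled
    ("", "anything", True),           # empty string is falsy: validation disabled
    ("foobar", "foobar", True),       # matching key
    ("foobar", "helloworld", False),  # mismatched key gets a 401
    ("foobar", None, False),          # header missing gets a 401
]

for env_key, header_key, expected in cases:
    assert request_is_allowed(env_key, header_key) is expected
```

The second case follows from the walrus condition in the endpoint: an empty `UNSTRUCTURED_API_KEY` is treated the same as an unset one, so every request is accepted.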
