Add support for kreuzberg
#38
Merged
Changes from 19 commits
21 commits
5a1ed8f Add support for kreuzberg (p-j-smith)
0d3de16 Add router for kreuzberg (p-j-smith)
85c948d Add aiohttp as a requirement for the ocr forwarding api (p-j-smith)
0ec967f Add pyonb-kreuzberg workspace member (p-j-smith)
79b6713 Add kreuzberg docker service (p-j-smith)
f5becfc Define KREUZBERG_API_PORT for tests (p-j-smith)
3c0dbc3 Add tests for kreuzberg healthchecks and single file inference (p-j-smith)
171b5f6 Update lock file (p-j-smith)
956869b Merge branch 'main' into paul/add-kreuzberg (p-j-smith)
b2f9f2c Remove unused function from __init__ file (p-j-smith)
dbb84bc Make linters happy (p-j-smith)
3931a22 use relative import of routers module in api app (p-j-smith)
288e3fc Add test for kreuzberg healthcheck (p-j-smith)
45445d8 Remove easyocr and paddleocr dependencies for kreuzberg-pyonb (p-j-smith)
ac2574f Use Kreuzberg 3.11.1 for kreuzberg-pyonb (p-j-smith)
350a70f Add DATA_FOLDER variable to .env.tests and update .env.sample (p-j-smith)
e1d37e7 Use kreuzberg 3.13.3 (p-j-smith)
6e14208 Increase request_max_body_size to 100 MB for Litestar (p-j-smith)
5ed1f26 Merge branch 'main' into paul/add-kreuzberg (p-j-smith)
962aec5 Cast KREUZBERG_API_PORT to int before passing to uvicorn (p-j-smith)
fa1ece4 Fix typos docstrings (p-j-smith)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,30 @@ | ||
| FROM ghcr.io/astral-sh/uv:python3.13-bookworm AS app | ||
|
|
||
| SHELL ["/bin/bash", "-o", "pipefail", "-e", "-u", "-x", "-c"] | ||
|
|
||
| WORKDIR /app | ||
| ENV PYTHONDONTWRITEBYTECODE=1 | ||
| ENV PYTHONUNBUFFERED=1 | ||
|
|
||
| RUN apt-get update && apt-get install -y --no-install-recommends \ | ||
| libgl1 \ | ||
| libglib2.0-0 \ | ||
| libgomp1 \ | ||
| libsm6 \ | ||
| libxext6 \ | ||
| libxrender-dev \ | ||
| pandoc \ | ||
| tesseract-ocr \ | ||
| tesseract-ocr-eng \ | ||
| && apt-get clean && rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/* | ||
|
|
||
| COPY ./pyproject.toml ./README.md . | ||
| COPY ./src src/ | ||
|
|
||
| RUN uv venv | ||
| RUN --mount=type=cache,target=/root/.cache/uv,sharing=locked uv sync --no-editable --no-dev | ||
|
|
||
| # make uvicorn etc available | ||
| ENV PATH="/app/.venv/bin:$PATH" | ||
|
|
||
| CMD uvicorn pyonb_kreuzberg.api:app --host 0.0.0.0 --port "$KREUZBERG_API_PORT" --workers 4 --reload --use-colors |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,52 @@ | ||
| # Instructions | ||
|
|
||
| Before using the `kreuzberg` API for OCR, you will need to set the `KREUZBERG_API_PORT` | ||
| environment variable in the top-level `.env` file. | ||
|
|
||
| ## Python | ||
|
|
||
| First install the `kreuzberg` API. From the top-level `pyonb` directory: | ||
|
|
||
| ```shell | ||
| uv sync --extra kreuzberg | ||
| ``` | ||
|
|
||
| Then start the `kreuzberg` API: | ||
|
|
||
| ```shell | ||
| python src/pyonb_kreuzberg/api.py | ||
| ``` | ||
|
|
||
| You can then use `curl` to send a PDF to the API: | ||
|
|
||
| ```shell | ||
| curl -v -X POST http://127.0.0.1:8111/extract \ | ||
| -F "data=@document.pdf" \ | ||
| -H "accept: application/json" | ||
| ``` | ||
|
|
||
| Note that this assumes you have set `KREUZBERG_API_PORT=8111`. | ||
|
|
||
| Currently, this returns the response from the | ||
| [`kreuzberg` API](https://kreuzberg.dev/user-guide/api-server/#extract-files) | ||
| directly, rather than the standard `pyonb` response. | ||
|
|
||
| ## Docker Compose | ||
|
|
||
| You will need to define the `OCR_FORWARDING_API_PORT` in the `.env` file. | ||
|
|
||
| Then, spin up the `ocr-forwarding-api` and `kreuzberg` services: | ||
|
|
||
| ```shell | ||
| docker-compose --profile kreuzberg up --build --detach | ||
| ``` | ||
|
|
||
| You can then use `curl` to send a PDF to the forwarding API: | ||
|
|
||
| ```shell | ||
| curl -v -X POST http://127.0.0.1:8110/kreuzberg-ocr/inference_single \ | ||
| -F "file_upload=@document.pdf" \ | ||
| -H "accept: application/json" | ||
| ``` | ||
|
|
||
| Note that this assumes you have set `OCR_FORWARDING_API_PORT` to `8110`. | ||
|
||
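The `curl` call to the forwarding API in the README above could also be made from Python. The following is a minimal sketch using only the standard library, assuming the form field is named `file_upload` and the forwarding API listens on port `8110`, as in the README. The helper names `encode_multipart`, `build_request` and `send_pdf` are hypothetical and not part of `pyonb`.

```python
import json
import uuid
from pathlib import Path
from urllib import request


def encode_multipart(field_name: str, filename: str, payload: bytes,
                     content_type: str = "application/pdf") -> tuple:
    """Encode a single file as a multipart/form-data body.

    Returns the body bytes and the matching Content-Type header value.
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field_name}"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"


def build_request(url: str, pdf_bytes: bytes, filename: str = "document.pdf",
                  field_name: str = "file_upload") -> request.Request:
    """Build (but do not send) a POST request carrying the PDF."""
    body, content_type = encode_multipart(field_name, filename, pdf_bytes)
    headers = {"Content-Type": content_type, "accept": "application/json"}
    return request.Request(url, data=body, headers=headers, method="POST")


def send_pdf(url: str, pdf_path: str) -> dict:
    """Send a PDF from disk and return the decoded JSON response."""
    path = Path(pdf_path)
    req = build_request(url, path.read_bytes(), filename=path.name)
    with request.urlopen(req) as resp:
        return json.load(resp)
```

With the services running, `send_pdf("http://127.0.0.1:8110/kreuzberg-ocr/inference_single", "document.pdf")` would mirror the `curl` command. To call the `kreuzberg` service directly, the field name would presumably need to be `data`, the name the router uses when forwarding to `/extract`.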
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,14 @@ | ||
| [build-system] | ||
| build-backend = "hatchling.build" | ||
| requires = ["hatchling"] | ||
|
|
||
| [project] | ||
| dependencies = [ | ||
| "kreuzberg[api]==3.13.3", | ||
| "uvicorn", | ||
| ] | ||
| description = "pyonb wrapper around kreuzberg" | ||
| name = "pyonb-kreuzberg" | ||
| readme = "README.md" | ||
| requires-python = ">=3.11" | ||
| version = "0.1.0" |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1 @@ | ||
| """Package for converting PDFs to structured text using Kreuzberg OCR.""" |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,40 @@ | ||
| """API for Kreuzberg OCR.""" | ||
|
|
||
| import os | ||
|
|
||
| import uvicorn | ||
| from kreuzberg._api.main import ( | ||
| KreuzbergError, | ||
| Litestar, | ||
| OpenTelemetryConfig, | ||
| OpenTelemetryPlugin, | ||
| StructLoggingConfig, | ||
| exception_handler, | ||
| general_exception_handler, | ||
| get_configuration, | ||
| handle_files_upload, | ||
| health_check, | ||
| ) | ||
|
|
||
| KREUZBERG_API_PORT = os.getenv("KREUZBERG_API_PORT") | ||
|
|
||
| app = Litestar( | ||
| route_handlers=[handle_files_upload, health_check, get_configuration], | ||
| request_max_body_size=100_000_000, | ||
| plugins=[OpenTelemetryPlugin(OpenTelemetryConfig())], | ||
| logging_config=StructLoggingConfig(), | ||
| exception_handlers={ | ||
| KreuzbergError: exception_handler, | ||
| Exception: general_exception_handler, | ||
| }, | ||
| ) | ||
|
|
||
| if __name__ == "__main__": | ||
| uvicorn.run( | ||
| app, | ||
| host="127.0.0.1", | ||
| port=int(KREUZBERG_API_PORT), | ||
| workers=4, | ||
| reload=True, | ||
| use_colors=True, | ||
| ) |
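`os.getenv` returns a string (or `None` when the variable is unset), while `uvicorn.run` expects an integer port. One way to make that conversion explicit is sketched below; `read_port` is a hypothetical helper, not something this PR defines, and the error messages are illustrative only.

```python
import os
from typing import Optional


def read_port(name: str, default: Optional[int] = None) -> int:
    """Read a TCP port from the environment variable `name` as an int."""
    raw = os.getenv(name)
    if raw is None or raw.strip() == "":
        # Variable is missing or empty: fall back to the default if one was given.
        if default is None:
            raise RuntimeError(f"{name} is not set; define it in the .env file")
        return default
    try:
        port = int(raw)
    except ValueError as exc:
        raise RuntimeError(f"{name} must be an integer, got {raw!r}") from exc
    if not 1 <= port <= 65535:
        raise RuntimeError(f"{name} must be between 1 and 65535, got {port}")
    return port
```

A call such as `read_port("KREUZBERG_API_PORT")` could then replace the bare `os.getenv` lookup, so a missing or malformed value fails at startup with a readable message.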
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,84 @@ | ||
| """Routers for Kreuzberg OCR.""" | ||
|
|
||
| import logging | ||
| import os | ||
| import time | ||
| from typing import Annotated | ||
|
|
||
| import aiohttp | ||
| from fastapi import APIRouter, File, UploadFile, status | ||
| from fastapi.responses import JSONResponse | ||
|
|
||
| # Module-level logger (the root logger) | ||
| logger = logging.getLogger() | ||
|
|
||
| router = APIRouter() | ||
|
|
||
| KREUZBERG_API_PORT = os.getenv("KREUZBERG_API_PORT") | ||
|
|
||
|
|
||
| @router.get("/kreuzberg/health") | ||
| async def healthcheck() -> JSONResponse: | ||
| """Test aliveness endpoint for Kreuzberg.""" | ||
| logger.info("[GET] /kreuzberg/health") | ||
| url = f"http://kreuzberg:{KREUZBERG_API_PORT}/health" | ||
|
|
||
| try: | ||
| async with aiohttp.ClientSession(timeout=aiohttp.ClientTimeout(total=60 * 60)) as session: # noqa: SIM117 | ||
| async with session.get(url) as response: | ||
| response.raise_for_status() | ||
| except aiohttp.ClientError: | ||
| logger.exception("Failed to connect to kreuzberg service") | ||
| raise | ||
|
|
||
| return JSONResponse( | ||
| status_code=status.HTTP_200_OK, | ||
| content={"service": "kreuzberg", "status": "healthy"}, | ||
| ) | ||
|
|
||
|
|
||
| @router.post("/kreuzberg-ocr/inference_single", status_code=status.HTTP_200_OK) | ||
| async def inference_single_doc(file_upload: Annotated[UploadFile, File()]) -> JSONResponse: | ||
| """ | ||
| Run Kreuzberg OCR inference on a single document. | ||
|
|
||
| The uploaded file is forwarded to the Kreuzberg service's /extract endpoint. | ||
| """ | ||
| logger.info("[POST] /kreuzberg-ocr/inference_single") | ||
| url = f"http://kreuzberg:{KREUZBERG_API_PORT}/extract" # forward the request to the kreuzberg service | ||
|
|
||
| data = aiohttp.FormData() | ||
| data.add_field( | ||
| "data", # field name expected by Kreuzberg's /extract API | ||
| file_upload.file, | ||
| filename=file_upload.filename, | ||
| content_type=file_upload.content_type, | ||
| ) | ||
| headers = {"accept": "application/json"} | ||
|
|
||
| logger.info("post request - url: %s", url) | ||
| logger.info("post request - data: %s", data) | ||
| logger.info("post request - headers: %s", headers) | ||
|
|
||
| t1 = time.perf_counter() | ||
| try: | ||
| async with aiohttp.ClientSession(timeout=aiohttp.ClientTimeout(total=60 * 60)) as session: # noqa: SIM117 | ||
| async with session.post(url, data=data, headers=headers) as response: | ||
| response.raise_for_status() | ||
| ocr_results = await response.json() | ||
| except aiohttp.ClientError: | ||
| logger.exception("Request Exception") | ||
| raise | ||
| t2 = time.perf_counter() | ||
|
|
||
| # Kreuzberg's /extract API expects a list of documents and always returns a list of extracted text | ||
| # We only ever extract and return content for a single document | ||
| ocr_result = ocr_results[0]["content"] | ||
|
|
||
| response_json = { | ||
| "filename": str(file_upload.filename), | ||
| "duration_in_second": t2 - t1, | ||
| "ocr-result": ocr_result, | ||
| } | ||
|
|
||
| return JSONResponse(status_code=status.HTTP_200_OK, content=response_json) |
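The router indexes `ocr_results[0]["content"]` directly, which raises an `IndexError` or `KeyError` if the upstream response has an unexpected shape. A more defensive version of that step is sketched here, assuming (as the router's comment states) that `/extract` returns a list of results, each carrying a `content` field; `first_document_content` is a hypothetical helper name.

```python
from typing import Any


def first_document_content(ocr_results: Any) -> Any:
    """Return the `content` of the first extraction result.

    Raises ValueError with a descriptive message when the response is not
    a non-empty list of mappings that contain a "content" key.
    """
    if not isinstance(ocr_results, list) or not ocr_results:
        raise ValueError("expected a non-empty list of extraction results")
    first = ocr_results[0]
    if not isinstance(first, dict) or "content" not in first:
        raise ValueError("first extraction result has no 'content' field")
    return first["content"]
```

In the route handler, a `ValueError` raised here could be translated into an error response (for example a 502) instead of surfacing as an unhandled exception.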
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,3 +1,4 @@ | ||
| fastapi[standard] | ||
| uvicorn | ||
| requests | ||
| aiohttp |