Commit 2a36054

Merge pull request #42 from SAFEHR-data/paul/restructure-docling
Restructure `docling`
2 parents 0c3a94b + 68e0daf commit 2a36054

14 files changed, with 145 additions and 129 deletions.

README.md

Lines changed: 23 additions & 13 deletions
@@ -13,14 +13,13 @@

 END COMMENT OUT-->

-**pyonb** is two things:
+`pyonb` is a Python library and suite of APIs that wrap open-source Optical Character Recognition (OCR) tools. It is designed for local deployment and can convert PDFs to structured text using several
+OCR tools:

-- a Python SDK for document extraction via the Hyland OnBase REST API (_work in progress_)
-- a suite of APIs wrapped around open-source Optical Character Recognition (OCR) tools, designed for local deployment, for converting PDFs to structured text including:
-  - [Marker](https://github.com/VikParuchuri/marker)
-  - [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)
-  - [Docling](https://github.com/docling-project/docling)
-  - [Kreuzberg](https://github.com/Goldziher/kreuzberg)
+- [Marker](https://github.com/VikParuchuri/marker)
+- [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)
+- [Docling](https://github.com/docling-project/docling)
+- [Kreuzberg](https://github.com/Goldziher/kreuzberg)

 ## Getting Started

@@ -30,24 +29,35 @@ END COMMENT OUT-->

 ### Installation & Usage

-1. Rename `.env.sample` to `.env`.
+1. Clone `pyonb`

-2. Edit `.env` with the correct `HOST_DATA_FOLDER` location, e.g.:
+```sh
+git clone git@github.com:SAFEHR-data/pyonb.git
+cd pyonb
+```
+
+2. Rename `.env.sample` to `.env`.

 ```sh
-HOST_DATA_FOLDER="/absolute/path/to/documents/folder"
+mv .env.sample .env
+```

-# e.g. for unit tests on GAE:
-# HOST_DATA_FOLDER="/gae/pyonb/tests/data/single_synthetic_doc"
+3. Edit `.env` with the correct `DATA_FOLDER` location, e.g.:
+
+```sh
+DATA_FOLDER="path/to/documents/folder"
 ```

+where the path is relative to the `docker-compose.yml` file in the top-level `pyonb` directory.
+
 4. Set OCR service ports, e.g.:

 ```sh
 OCR_FORWARDING_API_PORT=8110
 MARKER_API_PORT=8112
 PADDLEOCR_API_PORT=8114
 DOCLING_API_PORT=8115
+KREUZBERG_API_PORT=8116
 ```

 > [!IMPORTANT]
@@ -60,7 +70,7 @@ DOCLING_API_PORT=8115
 > HTTP_PROXY=
 > ```

-5. Start the OCR API Server (e.g. using marker and docling):
+5. Start the OCR API Server (e.g. using `marker` and `docling`):

 ```sh
 docker compose --profile marker --profile docling up -d

docker-compose.yml

Lines changed: 5 additions & 5 deletions
@@ -118,21 +118,21 @@ services:
   docling:
     profiles: [docling]
     build:
-      context: src/ocr/docling
+      context: packages/ocr/docling
       dockerfile: Dockerfile
       args:
         <<: *build-args-common
         DOCLING_API_PORT: ${DOCLING_API_PORT}
     environment:
       <<: [*proxy-common, *common-env]
-      CONTAINER_DATA_FOLDER: /data
+      DATA_FOLDER: /data
       DOCLING_API_PORT: ${DOCLING_API_PORT}
     env_file:
       - ./.env
     ports:
       - "${DOCLING_API_PORT}:${DOCLING_API_PORT}"
     volumes:
-      - ${HOST_DATA_FOLDER}:${CONTAINER_DATA_FOLDER:-/data}
+      - ${PWD}/${DATA_FOLDER}:/data
     networks:
       - pyonb_ocr_api
     healthcheck:
@@ -192,15 +192,15 @@ services:
         OCR_FORWARDING_API_PORT: ${OCR_FORWARDING_API_PORT}
     environment:
       <<: [*proxy-common, *common-env]
-      CONTAINER_DATA_FOLDER: /data
+      DATA_FOLDER: /data
       OCR_FORWARDING_API_PORT: ${OCR_FORWARDING_API_PORT}
     env_file:
       - ./.env
     ports:
       - "${OCR_FORWARDING_API_PORT}:${OCR_FORWARDING_API_PORT}"
     volumes:
       - ./src/api/app:/app
-      - ${HOST_DATA_FOLDER}:${CONTAINER_DATA_FOLDER:-/data}
+      - ${PWD}/${DATA_FOLDER}:/data
     networks:
       - pyonb_ocr_api
     healthcheck:
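The new volume entry `${PWD}/${DATA_FOLDER}:/data` concatenates the shell's working directory with the relative `DATA_FOLDER` value, so it resolves as intended only when `docker compose` is run from the top-level `pyonb` directory. A minimal Python sketch of that resolution follows; `resolve_host_data_folder` is a hypothetical helper for illustration and is not part of `pyonb`.

```python
from pathlib import Path


def resolve_host_data_folder(working_dir: str, data_folder: str) -> Path:
    """Mimic `${PWD}/${DATA_FOLDER}`: join the working directory and a relative folder."""
    folder = Path(data_folder)
    if folder.is_absolute():
        # Concatenating ${PWD} with an absolute path would nest it under PWD,
        # which is unlikely to be the intended host directory.
        raise ValueError("DATA_FOLDER is expected to be relative to docker-compose.yml")
    return Path(working_dir) / folder


if __name__ == "__main__":
    # Hypothetical working directory and folder, for illustration only.
    print(resolve_host_data_folder("/home/user/pyonb", "tests/data"))
```

Rejecting absolute paths is a design choice of this sketch, not behaviour confirmed by the commit.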

packages/ocr/docling/Dockerfile

Lines changed: 19 additions & 0 deletions
@@ -0,0 +1,19 @@
+FROM ghcr.io/astral-sh/uv:python3.13-bookworm AS app
+
+SHELL ["/bin/bash", "-o", "pipefail", "-e", "-u", "-x", "-c"]
+
+WORKDIR /app
+ENV PYTHONDONTWRITEBYTECODE=1
+ENV PYTHONUNBUFFERED=1
+
+COPY ./pyproject.toml .
+COPY ./README.md .
+COPY ./src src/
+
+RUN uv venv
+RUN --mount=type=cache,target=/root/.cache/uv,sharing=locked uv sync --no-editable --no-dev
+
+# make uvicorn etc available
+ENV PATH="/app/.venv/bin:$PATH"
+
+CMD uvicorn pyonb_docling.api:app --host 0.0.0.0 --port "$DOCLING_API_PORT" --workers 4 --use-colors
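The `CMD` above uses shell form, which lets `"$DOCLING_API_PORT"` be expanded when the container starts; an exec-form (JSON array) `CMD` would not expand the variable. As a rough sketch of the command this produces, the hypothetical helper `build_uvicorn_command` below (not part of `pyonb`) assembles the same argument list from an environment mapping and fails early if the port is missing or not numeric.

```python
import os


def build_uvicorn_command(env: dict[str, str]) -> list[str]:
    """Build the argument list equivalent to the Dockerfile's CMD line."""
    port = env["DOCLING_API_PORT"]  # raises KeyError if the variable is unset
    if not port.isdigit():
        raise ValueError(f"DOCLING_API_PORT must be an integer, got {port!r}")
    return [
        "uvicorn", "pyonb_docling.api:app",
        "--host", "0.0.0.0",
        "--port", port,
        "--workers", "4",
        "--use-colors",
    ]


if __name__ == "__main__":
    # Falls back to 8115, the example port from the README, if the variable is unset.
    port = os.environ.get("DOCLING_API_PORT", "8115")
    print(build_uvicorn_command({"DOCLING_API_PORT": port}))
```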

packages/ocr/docling/README.md

Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
+# Instructions
+
+## Python
+
+First install `pyonb_docling`. From the top-level `pyonb` directory:
+
+```shell
+uv sync --extra docling
+```
+
+Then, to convert a PDF to markdown:
+
+```python
+import pyonb_docling
+
+result = pyonb_docling.convert_pdf_to_markdown(
+    file_path="path/to/data/input.pdf",
+)
+```
+
+## Docker Compose
+
+From the `pyonb/packages/ocr/docling` directory:
+
+```shell
+docker compose run docling data/input.pdf data/output.md
+```
+
+Note, you will need to set `DATA_FOLDER` in a `.env` file,
+e.g.: `DATA_FOLDER=path/to/data/input.pdf`
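The Python and Docker Compose instructions above both turn an input PDF into markdown. The sketch below shows one way the Python call could write its result to a file. It assumes that `convert_pdf_to_markdown` returns the markdown as a string, which this commit does not confirm; `save_markdown` is a hypothetical helper, and the converter is passed in as an argument so the sketch runs without `pyonb_docling` installed.

```python
from pathlib import Path
from typing import Callable


def save_markdown(convert: Callable[..., str], pdf_path: str, output_path: str) -> Path:
    """Convert `pdf_path` with `convert` and write the result to `output_path`."""
    # Assumption: the converter accepts `file_path` and returns markdown text.
    markdown = convert(file_path=pdf_path)
    out = Path(output_path)
    out.parent.mkdir(parents=True, exist_ok=True)  # create the output folder if needed
    out.write_text(markdown, encoding="utf-8")
    return out
```

With the package installed, a call might look like `save_markdown(pyonb_docling.convert_pdf_to_markdown, "path/to/data/input.pdf", "path/to/data/output.md")`.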
Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
+[build-system]
+build-backend = "hatchling.build"
+requires = ["hatchling"]
+
+[project]
+dependencies = [
+    "docling",
+    "fastapi[standard]",
+    "python-dotenv",
+    "uvicorn",
+]
+description = "pyonb wrapper around docling"
+name = "pyonb-docling"
+readme = "README.md"
+requires-python = ">=3.11"
+version = "0.1.0"

src/ocr/docling/__init__.py renamed to packages/ocr/docling/src/pyonb_docling/__init__.py

File renamed without changes.

src/ocr/docling/api.py renamed to packages/ocr/docling/src/pyonb_docling/api.py

Lines changed: 2 additions & 12 deletions
@@ -8,6 +8,8 @@
 from fastapi import FastAPI, File, HTTPException, UploadFile, status
 from fastapi.responses import JSONResponse, RedirectResponse

+from pyonb_docling.main import convert_pdf_to_markdown
+
 logging.basicConfig(
     filename="docling." + datetime.datetime.now(tz=datetime.UTC).strftime("%Y%m%d") + ".log",
     format="%(asctime)s %(message)s",
@@ -18,18 +20,6 @@
 logger = logging.getLogger()
 logger.setLevel(logging.DEBUG)

-# TODO(tom): improve imports - below try statements horrible
-try:
-    # local
-    from .main import convert_pdf_to_markdown
-except Exception:
-    logger.exception("Detected inside Docker container.")
-    # Docker container
-    try:
-        from main import convert_pdf_to_markdown  # type: ignore # noqa: PGH003
-    except Exception:
-        logger.exception("Docling imports not possible.")
-
 app = FastAPI(swagger_ui_parameters={"tryItOutEnabled": True})


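The removed `try`/`except` chain guessed the runtime environment from import failures. Once `pyonb_docling` is installed as a package, both locally and in the container, the single absolute import is enough. If an explicit check is still wanted, for example before starting the server, something like the following sketch could be used; `is_installed` is a hypothetical helper and not part of this commit.

```python
import importlib.util


def is_installed(package: str) -> bool:
    """Return True if a top-level package can be found on the import path."""
    # find_spec returns None for a missing top-level package instead of raising.
    return importlib.util.find_spec(package) is not None


if __name__ == "__main__":
    print("pyonb_docling installed:", is_installed("pyonb_docling"))
```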

src/ocr/docling/main.py renamed to packages/ocr/docling/src/pyonb_docling/main.py

File renamed without changes.

pyproject.toml

Lines changed: 3 additions & 0 deletions
@@ -43,6 +43,8 @@ optional-dependencies = {dev = [
     "ruff",
     "tox",
     "twine",
+], docling = [
+    "pyonb-docling",
 ], docs = [
     "mkdocs",
     "mkdocs-include-markdown-plugin",
@@ -153,6 +155,7 @@ gh.python."3.12" = ["py312"]
 gh.python."3.13" = ["py313"]

 [tool.uv.sources]
+pyonb-docling = {workspace = true}
 pyonb-kreuzberg = {workspace = true}

 [tool.uv.workspace]
