Commit 5e797d4

use ocrmypdf
1 parent dce811f commit 5e797d4

8 files changed: +94 -104 lines changed

.github/workflows/predict.yml

Lines changed: 4 additions & 4 deletions
@@ -28,9 +28,9 @@ jobs:
 
       - name: Run inference
         run: |
-          docker run --name ocr2pdf \
-            -v ./src:/app \
-            -v ./pdf:/app/pdf \
+          docker run \
+            -v ./src:/ocr2pdf \
+            -v ./pdf:/ocr2pdf/pdf \
             ghcr.io/ipitio/ocr-pdf:latest \
             bash predict.sh pdf
 
@@ -39,4 +39,4 @@ jobs:
         uses: EndBug/add-and-commit@v9
         with:
           add: "**/*.pdf"
-          message: "enhanced pdfs"
+          message: "processed files"

README.md

Lines changed: 25 additions & 21 deletions
@@ -4,23 +4,29 @@
 
 # ocr2pdf
 
-**Convert images and scans to searchable PDFs!**
+**OCRmyPDF and Merge it**
 
 ---
 
 [![downloads](https://img.shields.io/badge/dynamic/json?url=https%3A%2F%2Fipitio.github.io%2Fbackage%2Fipitio%2Focr-pdf%2Focr-pdf.json&query=%24.downloads&logo=github&logoColor=959da5&labelColor=333a41&label=pulls)](https://github.com/arevindh/pihole-speedtest/pkgs/container/pihole-speedtest) [![build](https://github.com/ipitio/ocr-pdf/actions/workflows/publish.yml/badge.svg)](https://github.com/ipitio/ocr-pdf/actions/workflows/publish.yml)
 
 </div>
 
-You can run this in your browser, on your computer, or somewhere in between, depending how much you want to automate and virtualize. The core logic resides in a Python script that you could run yourself, if you really wanted to. It extracts all the files from `todo`, transforms their pages with a pretrained LSTM RNN, and loads them into `done`. Files in subfolders will be merged in alphabetical order, but will still be available individually.
+Convert images and scans to searchable and selectable (and merged) PDFs! The core logic resides in a Python script that you could run yourself, if you really wanted to. It extracts all the files from `todo`, transforms them with Tesseract via [OCRmyPDF](https://github.com/ocrmypdf/OCRmyPDF), and loads them into `done`. Files in subfolders will be merged in alphabetical order, but will still be available individually.
 
 I recommend you use either:
 
 - The Bash script, which runs the Python script
 - The Docker image, which runs the Bash script
 - A Google Colab or GitHub Actions server, both of which run the Docker image
 
-Read on to find out which is best for you!
+Read on to find out which is best for you! In any case, the Bash script is, or must be, called like so:
+
+```bash
+bash /path/to/predict.sh /folder/containing/todo/ [OCRmyPDF options]
+```
+
+For more information, see the [OCRmyPDF documentation](https://ocrmypdf.readthedocs.io/en/latest).
 
 ## Fast Start
 
@@ -34,27 +40,21 @@ Are you on mobile or simply want an easy and seamless experience?
 2. Follow the instructions in the notebook
 3. Find the OCR'd files in your [Drive](https://drive.google.com/drive/my-drive)`/ocr-pdf`
 
+To add OCRmyPDF options, append them to the `run` command in the code cell.
+
 ### Self-hosted: Prebuilt Docker Image
 
 If you want to skip building an image, just use mine:
 
-1. Install Docker and Compose, such as with Docker Desktop
-2. Enter a new folder, add the file below, and put your files in `./pdf/todo`
-3. Run the following command to OCR the files and move them to `./pdf/done`
-
-```yaml
-# compose.yml
-services:
-  predict:
-    container_name: ocr2pdf
-    image: ghcr.io/ipitio/ocr-pdf:latest
-    command: bash predict.sh pdf
-    volumes:
-      - ./pdf:/app/pdf
-```
+1. Install Docker, such as with Docker Desktop
+2. Make a new `pdf` folder and put your files in `pdf/todo`
+3. Run the following command from `pdf/..` to convert the files and move them into `pdf/done`
 
 ```bash
-docker compose up
+docker run --rm \
+    -v ./pdf:/ocr2pdf/pdf \
+    ghcr.io/ipitio/ocr-pdf:latest \
+    bash predict.sh pdf [OCRmyPDF options]
 ```
 
 ## Quick Start
@@ -70,27 +70,31 @@ It's still easy as 1, 2, 3! You'll find the OCR'd files in `pdf/done`.
 If you made a fork and cloned it, Git is your best friend!
 
 ```bash
-git add pdf/*
+git add .
 git commit -m "add files"
 git push
 # wait for the magic to happen
 git pull
 ```
 
+To add OCRmyPDF options, edit the command in the `predict.yml` file before committing.
+
 ### Self-hosted
 
 #### Build Docker Image
 
-If you aren't on Linux, or want to avoid polluting your system, use Docker Compose:
+If you aren't on Linux, or want to avoid polluting your system, use Docker Compose (which is included with Docker Desktop):
 
 ```bash
 docker compose up
 ```
 
+To add OCRmyPDF options, edit the command in the `compose.yml` file.
+
 #### Use Bare Metal
 
 Are you on Linux and want to make the most out of it?
 
 ```bash
-bash src/predict.sh pdf
+bash src/predict.sh pdf [OCRmyPDF options]
 ```
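
The README's rule that files in subfolders are merged in alphabetical order, while loose files are only converted, can be illustrated with a short sketch. This is not the project's code: `merge_groups` is a hypothetical helper that uses only the standard library and sorts case-insensitively by file name. `src/main.py` imports `natsorted` from `natsort`, so the real ordering of names such as `page10.pdf` and `page2.pdf` may differ.

```python
from collections import defaultdict
from pathlib import Path
from typing import Dict, List


def merge_groups(done: Path) -> Dict[Path, List[Path]]:
    """Group the PDFs under `done` by subfolder, each group sorted by name.

    Hypothetical helper: plain case-insensitive sorting stands in for the
    natural sorting that the real script appears to use.
    """
    groups: Dict[Path, List[Path]] = defaultdict(list)
    for pdf in done.rglob("*.pdf"):
        # Files directly inside `done` are converted but never merged.
        if pdf.parent != done:
            groups[pdf.parent].append(pdf)
    return {
        folder: sorted(pdfs, key=lambda p: p.name.lower())
        for folder, pdfs in groups.items()
    }
```

Each returned list is the order in which one folder's pages would be concatenated into a merged PDF.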

colab.ipynb

Lines changed: 24 additions & 36 deletions
@@ -26,18 +26,9 @@
     "\n",
     "## Steps\n",
     "\n",
-    "1. Make two new folders, one inside the other\n",
-    "   - The outer one can be named anything, say `pdf`\n",
-    "   - The inner one must be named `todo`\n",
-    "2. Place your files in the `todo` folder\n",
-    "   - Those by themselves will just be converted\n",
-    "   - Those inside subfolders will also be merged in alphabetical order\n",
-    "3. Share the outer `pdf` folder with this notebook\n",
-    "   - Zip the folder\n",
-    "   - Open this notebook in [Colab](https://colab.research.google.com/github/ipitio/ocr-pdf/blob/master/colab.ipynb)\n",
-    "   - Run the cell below to be prompted to connect Drive and upload the zip\n",
-    "\n",
-    "You'll be offered a zip of the converted (and merged) files to download locally, whether or not Drive was connected\n"
+    "To merge files, organize them into folders and zip each one. Ensure the files are named in alphabetical order, as they will be merged in that order. If you'd like to add any options for [OCRmyPDF](https://ocrmypdf.readthedocs.io/en/latest), append them to the `run` line in the cell below. At the end, you'll be offered a zip of the converted (and merged) files to download locally, whether or not Drive was connected.\n",
+    "\n",
+    "1. Run the cell below to get prompted to connect Drive and upload your files and/or zipped folders\n"
    ]
   },
   {
@@ -58,34 +49,31 @@
     "\n",
     "# Extract your PDFs\n",
     "files.upload()\n",
-    "\n",
-    "# Get the name of the zip file\n",
-    "pdfs = [pdf for pdf in os.listdir() if pdf.endswith(\".zip\")]\n",
-    "if len(pdfs) == 0:\n",
-    "    raise Exception(\"No ZIP file found\")\n",
+    "![ -d pdf ] || mkdir pdf\n",
+    "![ -d pdf/todo ] || mkdir pdf/todo\n",
+    "![ -d pdf/done ] || mkdir pdf/done\n",
+    "!unzip -o \"*.zip\" -d pdf/todo 2>/dev/null\n",
+    "!rm -f *.zip\n",
+    "!mv *.* pdf/todo 2>/dev/null\n",
     "\n",
     "# Transform them\n",
     "%pip install udocker\n",
     "!udocker --allow-root install\n",
-    "\n",
-    "for pdf in pdfs:\n",
-    "    !unzip -o \"$pdf\"\n",
-    "    !rm -f \"$pdf\"\n",
-    "    !udocker --allow-root run -v /content/\"$pdf\":/app/pdf ghcr.io/ipitio/ocr-pdf bash predict.sh pdf\n",
-    "    converted = os.listdir(\"$pdf/done\")\n",
-    "\n",
-    "    # And load\n",
-    "    if drive and len(converted) > 0:\n",
-    "        ![ -d \"drive/MyDrive/ocr-pdf\" ] || mkdir \"drive/MyDrive/ocr-pdf\"\n",
-    "        !\\cp -r \"$pdf/done/\"* \"drive/MyDrive/ocr-pdf/\"\n",
-    "\n",
-    "    if len(converted) == 1 and os.path.isfile(\"$pdf/done/\" + converted[0]):\n",
-    "        files.download(\"$pdf/done/\" + converted[0])\n",
-    "    elif len(converted) > 0:\n",
-    "        !zip -r \"$pdf.zip\" \"$pdf/done\"\n",
-    "        files.download(\"$pdf.zip\")\n",
-    "    else:\n",
-    "        print(\"No PDFs found\")"
+    "!udocker --allow-root run -v /content/pdf:/ocr2pdf/pdf ghcr.io/ipitio/ocr-pdf bash predict.sh pdf\n",
+    "converted = os.listdir(\"pdf/done\")\n",
+    "\n",
+    "# And load\n",
+    "if drive and len(converted) > 0:\n",
+    "    ![ -d \"drive/MyDrive/ocr-pdf\" ] || mkdir \"drive/MyDrive/ocr-pdf\"\n",
+    "    !\\cp -r \"pdf/done/\"* \"drive/MyDrive/ocr-pdf/\"\n",
+    "\n",
+    "if len(converted) == 1 and os.path.isfile(\"pdf/done/\" + converted[0]):\n",
+    "    files.download(\"pdf/done/\" + converted[0])\n",
+    "elif len(converted) > 0:\n",
+    "    !zip -r \"pdf.zip\" \"pdf/done\"\n",
+    "    files.download(\"pdf.zip\")\n",
+    "else:\n",
+    "    print(\"No PDFs found\")"
    ]
   }
  ],
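
The last part of the notebook cell picks between three outcomes: download the single converted file, download a zip of the whole output folder, or report that nothing was produced. A minimal sketch of that decision as a plain function, so it can be tried outside Colab; `pick_download` and its return labels are hypothetical and do not appear in the notebook, which calls `files.download` directly.

```python
import os
from typing import Optional, Tuple


def pick_download(done_dir: str) -> Tuple[str, Optional[str]]:
    """Decide what to offer for download from the output folder.

    Returns ("file", path) for exactly one converted file, ("zip", done_dir)
    when there are several entries or a lone subfolder, and ("none", None)
    when the folder is empty.
    """
    converted = sorted(os.listdir(done_dir))
    if len(converted) == 1:
        only = os.path.join(done_dir, converted[0])
        # A lone subfolder still has to be zipped, so check it is a file.
        if os.path.isfile(only):
            return ("file", only)
    if converted:
        return ("zip", done_dir)
    return ("none", None)
```

Note that the single-file check only works if it looks in the same folder that `os.listdir` read.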

compose.yml

Lines changed: 3 additions & 3 deletions
@@ -2,7 +2,7 @@ services:
   predict:
     container_name: ocr2pdf
     build: ./src
-    command: bash predict.sh pdf
+    command: bash predict.sh pdf -l eng+fra
     volumes:
-      - ./src:/app
-      - ./pdf:/app/pdf
+      - ./src:/ocr2pdf
+      - ./pdf:/ocr2pdf/pdf

src/Dockerfile

Lines changed: 3 additions & 2 deletions
@@ -1,4 +1,5 @@
-FROM python:3.11-slim
-WORKDIR /app
+FROM jbarlow83/ocrmypdf-ubuntu:v16.5.0
+WORKDIR /ocr2pdf
 COPY . .
 RUN bash predict.sh
+ENTRYPOINT []

src/main.py

Lines changed: 22 additions & 28 deletions
@@ -3,18 +3,17 @@
 """
 
 import os
+import subprocess
 import sys
 from pathlib import Path
 
 import pymupdf
-import pytesseract
 from joblib import Parallel, delayed
 from natsort import natsorted, ns
-from pdf2image import convert_from_path
 from PIL import Image
 
 
-def predict(base: Path, input_file: Path) -> None:
+def predict(base: Path, input_file: Path, args: list[str]) -> None:
     """
     Predicts the text in the input file and saves it to the output file
 
@@ -23,35 +22,28 @@ def predict(base: Path, input_file: Path) -> None:
         input_file (Path): The input file
     """
     relative_path = input_file.relative_to(base / "todo")
-    output_file = base / "done" / relative_path.with_suffix(".pdf")
 
-    if str(input_file).lower().endswith(".pdf"):
-        pages = convert_from_path(input_file, fmt="jpeg")
-    else:
-        try:
-            pages = [Image.open(input_file)]
-        except Exception:
-            return
-
-    print(f"Processing {relative_path}...")
-    doc = pymupdf.open()
-
-    for page in pages:
-        doc.insert_pdf(pymupdf.open("pdf", pytesseract.image_to_pdf_or_hocr(page)))
+    try:
+        if not str(input_file).lower().endswith(".pdf"):
+            image = Image.open(input_file)
+            image.convert("RGB").save(input_file, dpi=image.info.get("dpi", (300, 300)))
 
-    if not output_file.parent.exists():
+        output_file = base / "done" / relative_path.with_suffix(".pdf")
         output_file.parent.mkdir(exist_ok=True, parents=True)
-
-    doc.save(output_file, garbage=4, deflate=True)
-    doc.close()
-
-    try:
+        subprocess.run(
+            [
+                "bash",
+                "-c",
+                f"ocrmypdf --jobs 1 {' '.join(args)} {input_file} {output_file}",
+            ],
+            check=True,
+        )
         input_file.unlink()
+    except subprocess.CalledProcessError:
+        print(f"Failed to process {relative_path}")
     except Exception:
         pass
 
-    print(f"Processed {relative_path}")
-
 
 if __name__ == "__main__":
     pdfs = Path(sys.argv[1] if len(sys.argv) > 1 else ".")
@@ -60,7 +52,11 @@ def predict(base: Path, input_file: Path) -> None:
     (pdfs / "done").mkdir(exist_ok=True, parents=True)
 
     Parallel(n_jobs=-1)(
-        delayed(predict)(pdfs, Path(root) / file)
+        delayed(predict)(
+            pdfs,
+            Path(root) / file,
+            sys.argv[2:] if len(sys.argv) > 2 else ["--rotate-pages", "--deskew", "--skip-text", "--invalidate-digital-signatures", "--clean"],
+        )
         for root, _, files in os.walk(pdfs / "todo")
         for file in files
     )
@@ -96,5 +92,3 @@ def predict(base: Path, input_file: Path) -> None:
 
     for pdf in pdf_list:
         pdf.close()
-
-    print("Done")
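
The new `predict` joins its options and paths into one string and runs it through `bash -c`, so a file name containing spaces or shell metacharacters would be split or interpreted by the shell. A minimal sketch of one alternative, assuming the same `ocrmypdf` flags that appear in the diff: pass the arguments as a list so no shell is involved. `build_command`, `run_ocr` and `DEFAULT_ARGS` are hypothetical names, not part of the commit.

```python
import subprocess
from pathlib import Path
from typing import List

# Defaults copied from the diff; used when the caller passes no options.
DEFAULT_ARGS = [
    "--rotate-pages",
    "--deskew",
    "--skip-text",
    "--invalidate-digital-signatures",
    "--clean",
]


def build_command(input_file: Path, output_file: Path, args: List[str]) -> List[str]:
    """Return the ocrmypdf argv as a list, so no shell quoting is needed."""
    options = args if args else DEFAULT_ARGS
    return ["ocrmypdf", "--jobs", "1", *options, str(input_file), str(output_file)]


def run_ocr(input_file: Path, output_file: Path, args: List[str]) -> None:
    """Run ocrmypdf directly; raises CalledProcessError on a non-zero exit."""
    subprocess.run(build_command(input_file, output_file, args), check=True)
```

Because each path is a single list element, `todo/my scan.pdf` reaches `ocrmypdf` as one argument.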

src/predict.sh

Lines changed: 13 additions & 8 deletions
@@ -2,24 +2,29 @@
 # shellcheck disable=SC1091,SC2015
 
 apt_install() {
-    apt-get install -y python3 python3-pip python3-venv tesseract-ocr poppler-utils git
+    # shellcheck disable=SC2068
+    apt-get install -y python3 python3-pip python3-venv tesseract-ocr poppler-utils git ocrmypdf $@
 }
 
-if ! apt_install 2>/dev/null; then
+main() {
+    find . -name requirements.txt -exec pip3 install --user --root-user-action ignore --break-system-packages --no-cache-dir -r {} \;
+    [ -z "$1" ] || find . -name main.py -exec python3 {} "${@:1}" \;
+}
+
+langs=$(echo "$*" | grep -oP '(?<=-l )[^ ]+' | tr '+' '\n' | sed 's/^/tesseract-ocr-/' | sort -u | tr '\n' ' ')
+if ! apt_install "$langs" 2>/dev/null; then
     apt-get update
-    apt_install
+    apt_install "$langs"
 fi
 
 [ -d venv ] || python3 -m venv venv
 export OMP_THREAD_LIMIT=1
 
-if [[ -f venv/bin/pip3 ]]; then
+if [[ -e venv/bin/pip3 ]]; then
     source venv/bin/activate
-    find . -name requirements.txt -exec ./venv/bin/pip3 install --no-cache-dir -r {} \;
-    [ -z "$1" ] || find . -name main.py -exec ./venv/bin/python3 {} "$1" \;
+    main "${@}"
     deactivate
 elif [[ -f /.dockerenv ]]; then
     [[ ":$PATH:" == *":/root/.local/bin:"* ]] || export PATH=$PATH:/root/.local/bin
-    pip3 install -r requirements.txt --user --break-system-packages
-    [ -z "$1" ] || python3 ./main.py "$1"
+    main "${@}"
 fi
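
The `langs=` pipeline turns an option such as `-l eng+fra` into the apt package names `tesseract-ocr-eng tesseract-ocr-fra`. The same transformation is sketched here in Python for readability. `tesseract_packages` is a hypothetical name, and the regular expression is the one from the script, so like the script it only recognizes the short `-l` form followed by a space.

```python
import re
from typing import List


def tesseract_packages(args: List[str]) -> List[str]:
    """Map `-l eng+fra` style options to tesseract-ocr-<code> package names.

    Mirrors the shell pipeline: find the value after "-l ", split it on "+",
    prefix each code, and return the unique names in sorted order.
    """
    joined = " ".join(args)  # same role as "$*" in the script
    codes = set()
    for value in re.findall(r"(?<=-l )[^ ]+", joined):
        codes.update(value.split("+"))
    return sorted(f"tesseract-ocr-{code}" for code in codes)
```

With no `-l` option the result is empty, which matches the script installing only its base package list.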

src/requirements.txt

Lines changed: 0 additions & 2 deletions
@@ -1,5 +1,3 @@
-pytesseract==0.3.13
-pdf2image==1.17.0
 PyMuPDF==1.24.11
 pillow==10.4.0
 joblib==1.4.2
