use ocrmypdf

ipitio · ipitio · commit 5e797d4d8c4f · 2024-10-17T06:26:07.000-04:00
diff --git a/.github/workflows/predict.yml b/.github/workflows/predict.yml
@@ -28,9 +28,9 @@ jobs:
 
       - name: Run inference
         run: |
-          docker run --name ocr2pdf \
-            -v ./src:/app \
-            -v ./pdf:/app/pdf \
+          docker run \
+            -v ./src:/ocr2pdf \
+            -v ./pdf:/ocr2pdf/pdf \
             ghcr.io/ipitio/ocr-pdf:latest \
             bash predict.sh pdf
 
@@ -39,4 +39,4 @@ jobs:
         uses: EndBug/add-and-commit@v9
         with:
           add: "**/*.pdf"
-          message: "enhanced pdfs"
+          message: "processed files"
diff --git a/README.md b/README.md
@@ -4,23 +4,29 @@
 
 # ocr2pdf
 
-**Convert images and scans to searchable PDFs!**
+**OCRmyPDF and Merge it**
 
 ---
 
 [![downloads](https://img.shields.io/badge/dynamic/json?url=https%3A%2F%2Fipitio.github.io%2Fbackage%2Fipitio%2Focr-pdf%2Focr-pdf.json&query=%24.downloads&logo=github&logoColor=959da5&labelColor=333a41&label=pulls)](https://github.com/arevindh/pihole-speedtest/pkgs/container/pihole-speedtest) [![build](https://github.com/ipitio/ocr-pdf/actions/workflows/publish.yml/badge.svg)](https://github.com/ipitio/ocr-pdf/actions/workflows/publish.yml)
 
 </div>
 
-You can run this in your browser, on your computer, or somewhere in between, depending how much you want to automate and virtualize. The core logic resides in a Python script that you could run yourself, if you really wanted to. It extracts all the files from `todo`, transforms their pages with a pretrained LSTM RNN, and loads them into `done`. Files in subfolders will be merged in alphabetical order, but will still be available individually.
+Convert images and scans to searchable and selectable (and merged) PDFs! The core logic resides in a Python script that you could run yourself, if you really wanted to. It extracts all the files from `todo`, transforms them with Tesseract via [OCRmyPDF](https://github.com/ocrmypdf/OCRmyPDF), and loads them into `done`. Files in subfolders will be merged in alphabetical order, but will still be available individually.
 
 I recommend you use either:
 
 - The Bash script, which runs the Python script
 - The Docker image, which runs the Bash script
 - A Google Colab or GitHub Actions server, both of which run the Docker image
 
-Read on to find out which is best for you!
+Read on to find out which is best for you! In any case, the Bash script is, or must be, called like so:
+
+```bash
+bash /path/to/predict.sh /folder/containing/todo/ [OCRmyPDF options]
+```
+
+For more information, see the [OCRmyPDF documentation](https://ocrmypdf.readthedocs.io/en/latest).
 
 ## Fast Start
 
@@ -34,27 +40,21 @@ Are you on mobile or simply want an easy and seamless experience?
 2. Follow the instructions in the notebook
 3. Find the OCR'd files in your [Drive](https://drive.google.com/drive/my-drive)`/ocr-pdf`
 
+To add OCRmyPDF options, append them to the `run` command in the code cell.
+
 ### Self-hosted: Prebuilt Docker Image
 
 If you want to skip building an image, just use mine:
 
-1. Install Docker and Compose, such as with Docker Desktop
-2. Enter a new folder, add the file below, and put your files in `./pdf/todo`
-3. Run the following command to OCR the files and move them to `./pdf/done`
-
-```yaml
-# compose.yml
-services:
-    predict:
-        container_name: ocr2pdf
-        image: ghcr.io/ipitio/ocr-pdf:latest
-        command: bash predict.sh pdf
-        volumes:
-            - ./pdf:/app/pdf
-```
+1. Install Docker, such as with Docker Desktop
+2. Make a new `pdf` folder and put your files in `pdf/todo`
+3. Run the following command from `pdf/..` to convert the files and move them into `pdf/done`
 
 ```bash
-docker compose up
+docker run --rm \
+    -v ./pdf:/ocr2pdf/pdf \
+    ghcr.io/ipitio/ocr-pdf:latest \
+    bash predict.sh pdf [OCRmyPDF options]
 ```
 
 ## Quick Start
@@ -70,27 +70,31 @@ It's still easy as 1, 2, 3! You'll find the OCR'd files in `pdf/done`.
 If you made a fork and cloned it, Git is your best friend!
 
 ```bash
-git add pdf/*
+git add .
 git commit -m "add files"
 git push
 # wait for the magic to happen
 git pull
 ```
 
+To add OCRmyPDF options, edit the command in the `predict.yml` file before committing.
+
 ### Self-hosted
 
 #### Build Docker Image
 
-If you aren't on Linux, or want to avoid polluting your system, use Docker Compose:
+If you aren't on Linux, or want to avoid polluting your system, use Docker Compose (which is included with Docker Desktop):
 
 ```bash
 docker compose up
 ```
 
+To add OCRmyPDF options, edit the command in the `compose.yml` file.
+
 #### Use Bare Metal
 
 Are you on Linux and want to make the most out of it?
 
 ```bash
-bash src/predict.sh pdf
+bash src/predict.sh pdf [OCRmyPDF options]
 ```
diff --git a/colab.ipynb b/colab.ipynb
@@ -26,18 +26,9 @@
     "\n",
     "## Steps\n",
     "\n",
-    "1. Make two new folders, one inside the other\n",
-    "   - The outer one can be named anything, say `pdf`\n",
-    "   - The inner one must be named `todo`\n",
-    "2. Place your files in the `todo` folder\n",
-    "   - Those by themselves will just be converted\n",
-    "   - Those inside subfolders will also be merged in alphabetical order\n",
-    "3. Share the outer `pdf` folder with this notebook\n",
-    "   - Zip the folder\n",
-    "   - Open this notebook in [Colab](https://colab.research.google.com/github/ipitio/ocr-pdf/blob/master/colab.ipynb)\n",
-    "   - Run the cell below to be prompted to connect Drive and upload the zip\n",
-    "\n",
-    "You'll be offered a zip of the converted (and merged) files to download locally, whether or not Drive was connected\n"
+    "To merge files, organize them into folders and zip each one. Ensure the files are named in alphabetical order, as they will be merged in that order. If you'd like to add any options for [OCRmyPDF](https://ocrmypdf.readthedocs.io/en/latest), append them to the `run` line in the cell below. At the end, you'll be offered a zip of the converted (and merged) files to download locally, whether or not Drive was connected.\n",
+    "\n",
+    "1. Run the cell below to get prompted to connect Drive and upload your files and/or zipped folders\n"
    ]
   },
   {
@@ -58,34 +49,31 @@
     "\n",
     "# Extract your PDFs\n",
     "files.upload()\n",
-    "\n",
-    "# Get the name of the zip file\n",
-    "pdfs = [pdf for pdf in os.listdir() if pdf.endswith(\".zip\")]\n",
-    "if len(pdfs) == 0:\n",
-    "    raise Exception(\"No ZIP file found\")\n",
+    "![ -d pdf ] || mkdir pdf\n",
+    "![ -d pdf/todo ] || mkdir pdf/todo\n",
+    "![ -d pdf/done ] || mkdir pdf/done\n",
+    "!unzip -o \"*.zip\" -d pdf/todo 2>/dev/null\n",
+    "!rm -f *.zip\n",
+    "!mv *.* pdf/todo 2>/dev/null\n",
     "\n",
     "# Transform them\n",
     "%pip install udocker\n",
     "!udocker --allow-root install\n",
-    "\n",
-    "for pdf in pdfs:\n",
-    "    !unzip -o \"$pdf\"\n",
-    "    !rm -f \"$pdf\"\n",
-    "    !udocker --allow-root run -v /content/\"$pdf\":/app/pdf ghcr.io/ipitio/ocr-pdf bash predict.sh pdf\n",
-    "    converted = os.listdir(\"$pdf/done\")\n",
-    "\n",
-    "    # And load\n",
-    "    if drive and len(converted) > 0:\n",
-    "        ![ -d \"drive/MyDrive/ocr-pdf\" ] || mkdir \"drive/MyDrive/ocr-pdf\"\n",
-    "        !\\cp -r \"$pdf/done/\"* \"drive/MyDrive/ocr-pdf/\"\n",
-    "\n",
-    "    if len(converted) == 1 and os.path.isfile(\"$pdf/done/\" + converted[0]):\n",
-    "        files.download(\"$pdf/done/\" + converted[0])\n",
-    "    elif len(converted) > 0:\n",
-    "        !zip -r \"$pdf.zip\" \"$pdf/done\"\n",
-    "        files.download(\"$pdf.zip\")\n",
-    "    else:\n",
-    "        print(\"No PDFs found\")"
+    "!udocker --allow-root run -v /content/pdf:/ocr2pdf/pdf ghcr.io/ipitio/ocr-pdf bash predict.sh pdf\n",
+    "converted = os.listdir(\"pdf/done\")\n",
+    "\n",
+    "# And load\n",
+    "if drive and len(converted) > 0:\n",
+    "    ![ -d \"drive/MyDrive/ocr-pdf\" ] || mkdir \"drive/MyDrive/ocr-pdf\"\n",
+    "    !\\cp -r \"pdf/done/\"* \"drive/MyDrive/ocr-pdf/\"\n",
+    "\n",
+    "if len(converted) == 1 and os.path.isfile(\"pdf/done/\" + converted[0]):\n",
+    "    files.download(\"pdf/done/\" + converted[0])\n",
+    "elif len(converted) > 0:\n",
+    "    !zip -r \"pdf.zip\" \"pdf/done\"\n",
+    "    files.download(\"pdf.zip\")\n",
+    "else:\n",
+    "    print(\"No PDFs found\")"
    ]
   }
  ],
diff --git a/compose.yml b/compose.yml
@@ -2,7 +2,7 @@ services:
     predict:
         container_name: ocr2pdf
         build: ./src
-        command: bash predict.sh pdf
+        command: bash predict.sh pdf -l eng+fra
         volumes:
-            - ./src:/app
-            - ./pdf:/app/pdf
+            - ./src:/ocr2pdf
+            - ./pdf:/ocr2pdf/pdf
diff --git a/src/Dockerfile b/src/Dockerfile
@@ -1,4 +1,5 @@
-FROM python:3.11-slim
-WORKDIR /app
+FROM jbarlow83/ocrmypdf-ubuntu:v16.5.0
+WORKDIR /ocr2pdf
 COPY . .
 RUN bash predict.sh
+ENTRYPOINT []
diff --git a/src/main.py b/src/main.py
@@ -3,18 +3,17 @@
 """
 
 import os
+import subprocess
 import sys
 from pathlib import Path
 
 import pymupdf
-import pytesseract
 from joblib import Parallel, delayed
 from natsort import natsorted, ns
-from pdf2image import convert_from_path
 from PIL import Image
 
 
-def predict(base: Path, input_file: Path) -> None:
+def predict(base: Path, input_file: Path, args: list[str]) -> None:
     """
     Predicts the text in the input file and saves it to the output file
 
@@ -23,35 +22,28 @@ def predict(base: Path, input_file: Path) -> None:
         input_file (Path): The input file
     """
     relative_path = input_file.relative_to(base / "todo")
-    output_file = base / "done" / relative_path.with_suffix(".pdf")
 
-    if str(input_file).lower().endswith(".pdf"):
-        pages = convert_from_path(input_file, fmt="jpeg")
-    else:
-        try:
-            pages = [Image.open(input_file)]
-        except Exception:
-            return
-
-    print(f"Processing {relative_path}...")
-    doc = pymupdf.open()
-
-    for page in pages:
-        doc.insert_pdf(pymupdf.open("pdf", pytesseract.image_to_pdf_or_hocr(page)))
+    try:
+        if not str(input_file).lower().endswith(".pdf"):
+            image = Image.open(input_file)
+            image.convert("RGB").save(input_file, dpi=image.info.get("dpi", (300, 300)))
 
-    if not output_file.parent.exists():
+        output_file = base / "done" / relative_path.with_suffix(".pdf")
         output_file.parent.mkdir(exist_ok=True, parents=True)
-
-    doc.save(output_file, garbage=4, deflate=True)
-    doc.close()
-
-    try:
+        subprocess.run(
+            [
+                "bash",
+                "-c",
+                f"ocrmypdf --jobs 1 {' '.join(args)} {input_file} {output_file}",
+            ],
+            check=True,
+        )
         input_file.unlink()
+    except subprocess.CalledProcessError:
+        print(f"Failed to process {relative_path}")
     except Exception:
         pass
 
-    print(f"Processed {relative_path}")
-
 
 if __name__ == "__main__":
     pdfs = Path(sys.argv[1] if len(sys.argv) > 1 else ".")
@@ -60,7 +52,11 @@ def predict(base: Path, input_file: Path) -> None:
     (pdfs / "done").mkdir(exist_ok=True, parents=True)
 
     Parallel(n_jobs=-1)(
-        delayed(predict)(pdfs, Path(root) / file)
+        delayed(predict)(
+            pdfs,
+            Path(root) / file,
+            sys.argv[2:] if len(sys.argv) > 2 else ["--rotate-pages", "--deskew", "--skip-text", "--invalidate-digital-signatures", "--clean"],
+        )
         for root, _, files in os.walk(pdfs / "todo")
         for file in files
     )
@@ -96,5 +92,3 @@ def predict(base: Path, input_file: Path) -> None:
 
         for pdf in pdf_list:
             pdf.close()
-
-    print("Done")
diff --git a/src/predict.sh b/src/predict.sh
@@ -2,24 +2,29 @@
 # shellcheck disable=SC1091,SC2015
 
 apt_install() {
-    apt-get install -y python3 python3-pip python3-venv tesseract-ocr poppler-utils git
+    # shellcheck disable=SC2068
+    apt-get install -y python3 python3-pip python3-venv tesseract-ocr poppler-utils git ocrmypdf $@
 }
 
-if ! apt_install 2>/dev/null; then
+main() {
+    find . -name requirements.txt -exec pip3 install --user --root-user-action ignore --break-system-packages --no-cache-dir -r {} \;
+    [ -z "$1" ] || find . -name main.py -exec python3 {} "${@:1}" \;
+}
+
+langs=$(echo "$*" | grep -oP '(?<=-l )[^ ]+' | tr '+' '\n' | sed 's/^/tesseract-ocr-/' | sort -u | tr '\n' ' ')
+if ! apt_install "$langs" 2>/dev/null; then
     apt-get update
-    apt_install
+    apt_install "$langs"
 fi
 
 [ -d venv ] || python3 -m venv venv
 export OMP_THREAD_LIMIT=1
 
-if [[ -f venv/bin/pip3 ]]; then
+if [[ -e venv/bin/pip3 ]]; then
     source venv/bin/activate
-    find . -name requirements.txt -exec ./venv/bin/pip3 install --no-cache-dir -r {} \;
-    [ -z "$1" ] || find . -name main.py -exec ./venv/bin/python3 {} "$1" \;
+    main "${@}"
     deactivate
 elif [[ -f /.dockerenv ]]; then
     [[ ":$PATH:" == *":/root/.local/bin:"* ]] || export PATH=$PATH:/root/.local/bin
-    pip3 install -r requirements.txt --user --break-system-packages
-    [ -z "$1" ] || python3 ./main.py "$1"
+    main "${@}"
 fi
diff --git a/src/requirements.txt b/src/requirements.txt
@@ -1,5 +1,3 @@
-pytesseract==0.3.13
-pdf2image==1.17.0
 PyMuPDF==1.24.11
 pillow==10.4.0
 joblib==1.4.2
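
Note on the `src/main.py` hunk: the new `predict` joins the options and both paths into one string and runs it through `bash -c`, so a file name containing spaces or shell metacharacters would be split or interpreted by the shell. A minimal sketch of an alternative, assuming the same default options as the diff, is to build the command as an argument list and pass it to `subprocess.run` directly. The names `DEFAULT_ARGS` and `build_command` are hypothetical and do not appear in the commit.

```python
from __future__ import annotations

from pathlib import Path

# Same fallback options the diff passes when no extra arguments are given.
DEFAULT_ARGS = [
    "--rotate-pages",
    "--deskew",
    "--skip-text",
    "--invalidate-digital-signatures",
    "--clean",
]


def build_command(
    input_file: Path, output_file: Path, args: list[str] | None = None
) -> list[str]:
    """Return the ocrmypdf invocation as an argument list (hypothetical helper).

    Each path stays a single list element, so no shell quoting is needed.
    """
    options = list(args) if args else DEFAULT_ARGS
    return ["ocrmypdf", "--jobs", "1", *options, str(input_file), str(output_file)]


if __name__ == "__main__":
    # To execute for real, pass the list to subprocess.run(..., check=True).
    print(build_command(Path("todo/my scan.pdf"), Path("done/my scan.pdf"), ["-l", "eng+fra"]))
```

With this shape, `subprocess.run(build_command(input_file, output_file, args), check=True)` would replace the `bash -c` call while still raising `CalledProcessError` on failure, which the existing `except` clause already handles.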
