Copilot AI commented Aug 28, 2025

Currently, the OCR service relies solely on the exit code from OCRmyPDF to decide whether OCR processing succeeded. This can produce false positives: OCR appears to complete successfully (exit code 0), but no text was actually extracted from the document.

Problem

OCRmyPDF can return exit code 0 in cases where:

  • The input document contains only blank pages
  • The image quality is too poor for text recognition
  • The document contains only images with no readable text

In these scenarios, the OCR status was incorrectly set to COMPLETED even though no text was extracted.

Solution

This PR adds a verification step after successful OCR completion:

  1. Text extraction verification: After OCR exits with code 0, the service now uses the existing extract_text() helper function to verify that the OCR output file actually contains extractable text.

  2. Improved status logic:

    • If text is found → OCRStatus.COMPLETED
    • If no text or only whitespace is found → OCRStatus.FAILED
    • If the OCR output file is missing → OCRStatus.OUTPUT_ERROR
  3. Enhanced logging: Added detailed logging that reports the number of characters extracted during verification.
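
The status logic above could look roughly like the following. This is a minimal sketch, not the code from ocr_service/main.py: the function name verify_ocr_output, the stand-in OCRStatus enum, and the assumption that extract_text() takes a path and returns a string (or None) are all hypothetical, so the extractor is passed in as a parameter.

```python
import logging
from enum import Enum
from pathlib import Path
from typing import Callable, Optional

logger = logging.getLogger("ocr_verification_sketch")


class OCRStatus(Enum):
    # Stand-in for the service's real enum; only the members named in this PR.
    COMPLETED = "completed"
    FAILED = "failed"
    OUTPUT_ERROR = "output_error"


def verify_ocr_output(
    output_path: Path,
    extract_text: Callable[[Path], Optional[str]],
) -> OCRStatus:
    """Decide the final status after OCRmyPDF has already exited with code 0.

    `extract_text` is assumed (hypothetically) to accept a file path and
    return the extracted text, or None when nothing could be read.
    """
    # A missing output file is reported separately from "no text found".
    if not output_path.is_file():
        logger.error("OCR output file is missing: %s", output_path)
        return OCRStatus.OUTPUT_ERROR

    # Treat None, empty strings, and whitespace-only text the same way.
    text = (extract_text(output_path) or "").strip()
    if not text:
        logger.warning("OCR verification failed: no text found in OCR output file")
        return OCRStatus.FAILED

    logger.info("OCR verification passed: %d characters extracted", len(text))
    return OCRStatus.COMPLETED
```

Passing the extractor in as an argument keeps the sketch independent of the real helper's signature and makes the three branches easy to exercise with a stub.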

Changes

  • ocr_service/main.py: Added import for extract_text, implemented verification logic, and fixed a bug where the completion status was set regardless of the OCR result
  • tests/test_ocr_verification.py: Added unit tests covering various text extraction scenarios
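
As an illustration of the kind of scenarios such tests might cover, here is a small unittest sketch. It does not reproduce tests/test_ocr_verification.py; the helper has_extractable_text and the sample inputs are hypothetical and only model the "text found" versus "no text or only whitespace" decision.

```python
import unittest
from typing import Optional


def has_extractable_text(text: Optional[str]) -> bool:
    """Hypothetical helper: True only if the text has non-whitespace characters."""
    return bool(text and text.strip())


class TextVerificationTests(unittest.TestCase):
    def test_blank_or_missing_text_is_rejected(self):
        # None, empty, and whitespace-only results all count as "no text".
        for sample in (None, "", "   ", "\n\t\n"):
            with self.subTest(sample=sample):
                self.assertFalse(has_extractable_text(sample))

    def test_real_text_is_accepted(self):
        # Surrounding whitespace should not hide actual content.
        for sample in ("Invoice 2025", "  a  ", "\nTotal: 42\n"):
            with self.subTest(sample=sample):
                self.assertTrue(has_extractable_text(sample))
```

Saved as a module, the cases can be run with `python -m unittest`.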

Example Impact

Before this change, a blank PDF page would result in:

OCR processing completed: blank_page.pdf
Status: COMPLETED

After this change:

OCR processing completed: blank_page.pdf
OCR verification failed: no text found in OCR output file
Status: FAILED

This ensures the OCR pipeline only marks documents as successfully processed when text extraction actually occurred.

Fixes #41.



Copilot AI commented Aug 28, 2025

@maxi07 👋 This repository doesn't have Copilot instructions. With Copilot instructions, I can understand the repository better, work faster and produce higher quality PRs.

I can generate a .github/copilot-instructions.md file for you automatically. Click here to open a pre-filled issue and assign it to me. I'll write the instructions, and then tag you for review.

Copilot AI changed the title from "[WIP] Add verification for OCR after our step" to "Add OCR text verification to prevent false positive completions" on Aug 28, 2025
Copilot AI requested a review from maxi07 August 28, 2025 14:28
Development

Successfully merging this pull request may close these issues.

Add verification for OCR after our step
