Add OCR text verification to prevent false positive completions #46

Copilot · 2025-08-28T14:18:17Z

Currently, the OCR service relies solely on the OCRmyPDF exit code to decide whether OCR processing succeeded. This can produce false positives: OCR appears to complete successfully (exit code 0), but no text was actually extracted from the document.

Problem

OCRmyPDF can return exit code 0 in cases where:

The input document contains only blank pages
The image quality is too poor for text recognition
The document contains only images with no readable text

In these scenarios, the OCR status is incorrectly set to COMPLETED even though no text was extracted.

Solution

This PR adds a verification step after successful OCR completion:

Text extraction verification: After OCR exits with code 0, the service now uses the existing extract_text() helper function to verify that the OCR output file actually contains extractable text.
Improved status logic:
- If text is found → OCRStatus.COMPLETED
- If no text or only whitespace is found → OCRStatus.FAILED
- If the OCR output file is missing → OCRStatus.OUTPUT_ERROR
Enhanced logging: Added detailed logging that reports the number of characters extracted during verification.
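
The status logic above can be sketched roughly as follows. This is a minimal illustration, not the code in ocr_service/main.py: the function name verify_ocr_output is hypothetical, the OCRStatus enum is a stand-in containing only the three members named above, and extract_text is assumed to take a file path and return a string (its real signature is not shown in this PR description).

```python
import logging
import os
from enum import Enum
from typing import Callable

logger = logging.getLogger("ocr_service")


class OCRStatus(Enum):
    # Stand-in for the service's real enum; only the members named in this PR.
    COMPLETED = "completed"
    FAILED = "failed"
    OUTPUT_ERROR = "output_error"


def verify_ocr_output(output_path: str, extract_text: Callable[[str], str]) -> OCRStatus:
    """Decide the final status after OCRmyPDF has already exited with code 0.

    `extract_text` is passed in so the sketch stays self-contained; its
    path-in, string-out signature is an assumption.
    """
    if not os.path.isfile(output_path):
        logger.error("OCR output file is missing: %s", output_path)
        return OCRStatus.OUTPUT_ERROR

    text = extract_text(output_path) or ""
    char_count = len(text.strip())  # whitespace-only output counts as no text
    if char_count == 0:
        logger.warning("OCR verification failed: no text found in OCR output file")
        return OCRStatus.FAILED

    logger.info("OCR verification passed: %d characters extracted", char_count)
    return OCRStatus.COMPLETED
```

Checking for the file before extracting keeps a missing output distinguishable from an empty one, which is what separates OUTPUT_ERROR from FAILED in the list above.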

Changes

ocr_service/main.py: Added import for extract_text, implemented verification logic, and fixed a bug where the completion status was set regardless of the OCR result
tests/test_ocr_verification.py: Added unit tests covering various text extraction scenarios
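
The contents of tests/test_ocr_verification.py are not shown here. As a rough idea of what table-driven tests for the "text vs. whitespace only" rule might look like, here is a sketch using a hypothetical predicate has_extractable_text; both the predicate and the test cases are assumptions for illustration, not the tests added in this PR.

```python
import unittest
from typing import Optional


def has_extractable_text(text: Optional[str]) -> bool:
    # Hypothetical predicate: text counts only if something is left
    # after stripping whitespace.
    return bool(text and text.strip())


class TestOCRVerification(unittest.TestCase):
    def test_text_extraction_scenarios(self):
        cases = [
            ("Invoice 2025-08", True),   # real text
            ("", False),                 # empty output
            ("  \n\t ", False),          # whitespace only
            (None, False),               # extractor returned nothing
            ("\x0c", False),             # form feed only, which is whitespace
        ]
        for text, expected in cases:
            with self.subTest(text=text):
                self.assertEqual(has_extractable_text(text), expected)
```

Such a file could be run with `python -m unittest`.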

Example Impact

Before this change, a blank PDF page would result in:

OCR processing completed: blank_page.pdf
Status: COMPLETED

After this change:

OCR processing completed: blank_page.pdf
OCR verification failed: no text found in OCR output file
Status: FAILED

This ensures the OCR pipeline only marks documents as successfully processed when text extraction actually occurred.

Fixes #41.


Co-authored-by: maxi07 <[email protected]>

Initial plan

6702ca4

Copilot AI assigned Copilot and maxi07 Aug 28, 2025

Copilot started work on behalf of maxi07 August 28, 2025 14:18

Add OCR verification to check for text extraction after OCR completion

e439182

Co-authored-by: maxi07 <[email protected]>

Copilot AI changed the title ~~[WIP] Add verification for OCR after our step~~ Add OCR text verification to prevent false positive completions Aug 28, 2025

Copilot AI requested a review from maxi07 August 28, 2025 14:28

Copilot finished work on behalf of maxi07 August 28, 2025 14:28
