Add OCR text verification to prevent false positive completions #46
Currently, the OCR service only trusts the exit code from OCRmyPDF to determine if OCR processing was successful. This can lead to false positives where OCR appears to complete successfully (exit code 0) but no actual text was extracted from the document.
Problem
OCRmyPDF can return exit code 0 in cases where:
In these scenarios, the OCR status was incorrectly set to COMPLETED even though no text extraction occurred.
Solution
This PR adds a verification step after successful OCR completion:
Text extraction verification: After OCR exits with code 0, the service now uses the existing extract_text() helper function to verify that the OCR output file actually contains extractable text.
Improved status logic:
OCRStatus.COMPLETED
OCRStatus.FAILED
OCRStatus.OUTPUT_ERROR
Enhanced logging: Added detailed logging that reports the number of characters extracted during verification.
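The verification step described above could look roughly like the sketch below. It is an illustration under stated assumptions, not the code in ocr_service/main.py: the function name verify_ocr_result is hypothetical, extract_text is assumed to take a file path and return a string, and the mapping of outcomes to OCRStatus.FAILED and OCRStatus.OUTPUT_ERROR is a guess, since this description does not spell out which condition produces which status.

```python
import logging
from enum import Enum
from typing import Callable

logger = logging.getLogger("ocr_service")


class OCRStatus(Enum):
    # Status names come from the PR description; the string values are assumed.
    COMPLETED = "completed"
    FAILED = "failed"
    OUTPUT_ERROR = "output_error"


def verify_ocr_result(
    exit_code: int,
    output_path: str,
    extract_text: Callable[[str], str],
) -> OCRStatus:
    """Decide the final OCR status from the exit code and the extracted text.

    Hypothetical helper: `extract_text` stands in for the service's existing
    helper and is assumed to accept a path and return the extracted text.
    """
    if exit_code != 0:
        # Assumption: a non-zero exit code is treated as a plain failure.
        return OCRStatus.FAILED

    try:
        text = extract_text(output_path)
    except Exception:
        # Assumption: an unreadable output file maps to OUTPUT_ERROR.
        logger.exception("Could not read OCR output %s", output_path)
        return OCRStatus.OUTPUT_ERROR

    char_count = len(text.strip())
    logger.info(
        "OCR verification extracted %d characters from %s",
        char_count,
        output_path,
    )

    if char_count == 0:
        # Exit code 0 but no extractable text: the false positive this PR targets.
        # Assumption: this case maps to FAILED.
        return OCRStatus.FAILED
    return OCRStatus.COMPLETED
```

Passing extract_text in as a parameter is only a convenience for the sketch; it lets the decision logic be exercised with a stub instead of a real PDF.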
Changes
ocr_service/main.py: Added import for extract_text, implemented verification logic, and fixed a bug where the completion status was set regardless of the OCR result
tests/test_ocr_verification.py: Added unit tests covering various text extraction scenarios
Example Impact
Before this change, a blank PDF page would result in:
After this change:
This ensures the OCR pipeline only marks documents as successfully processed when text extraction actually occurred.
Fixes #41.