Use this checklist before shipping OCR pipeline changes.
- Run `python3 -m unittest discover -s tests`
- Run `python3 tests/benchmark_ocr.py --manifest tests/fixtures/benchmark_manifest.json`
- Run the modern vs legacy comparison: `python3 tests/benchmark_ocr.py --manifest tests/fixtures/benchmark_manifest.json --executor modern --baseline-executor legacy --baseline-output tests/fixtures/benchmark_baseline.legacy.local.json --output tests/fixtures/benchmark_report.modern.local.json --enforce-targets`
- Compare the benchmark report against the prior baseline on the same machine
- Review requested vs effective diagnostics for at least one image case and one PDF case
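
The automated commands above can be chained so that a failure stops the run early. The following is a minimal sketch rather than part of the repository: `CHECKS` and `run_checks` are hypothetical names, `sys.executable` stands in for `python3`, and it assumes the script is started from the repository root and that each command signals failure with a non-zero exit code.

```python
import subprocess
import sys

MANIFEST = "tests/fixtures/benchmark_manifest.json"

# Commands copied from the checklist; sys.executable replaces "python3".
CHECKS = [
    [sys.executable, "-m", "unittest", "discover", "-s", "tests"],
    [sys.executable, "tests/benchmark_ocr.py", "--manifest", MANIFEST],
    [
        sys.executable, "tests/benchmark_ocr.py", "--manifest", MANIFEST,
        "--executor", "modern",
        "--baseline-executor", "legacy",
        "--baseline-output", "tests/fixtures/benchmark_baseline.legacy.local.json",
        "--output", "tests/fixtures/benchmark_report.modern.local.json",
        "--enforce-targets",
    ],
]


def run_checks(run=subprocess.run):
    """Run each check in order; return the first failing command, or None."""
    for cmd in CHECKS:
        result = run(cmd)
        # Assumption: a non-zero exit code means the check failed.
        if result.returncode != 0:
            return cmd
    return None
```

Calling `run_checks()` returns `None` when every command exits with code 0, or the argument list of the first command that failed. The `run` parameter exists only so the function can be exercised without launching the real commands.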
- Selected image, `Fast`
- Selected image, `Balanced`
- Selected image, `Accuracy`
- File image, `Custom` with manual PSM/OEM values
- Single-page PDF
- Multi-page PDF
- Mixed image + PDF batch
- Batch with merge enabled
- Batch with merge disabled
- Preview enabled
- Preview disabled
- `eng+hin` or another mixed-language case
- Missing language code case to verify install guidance
- OEM `0` or `2` on a runtime without legacy support to verify warning/disable behavior
- `Fast` performs one exact OCR attempt
- `Balanced` performs one exact attempt and at most one recovery attempt
- `Accuracy` performs one exact attempt and at most one enhanced preprocessing recovery
- `Custom` keeps the user-selected PSM/OEM/scale/preprocessing values exactly
- PDFs are rendered page-by-page instead of rasterizing the whole document up front
- `Fast` and `Balanced` start PDF OCR at `200 DPI`
- `Accuracy` starts PDF OCR at `300 DPI`
- Weak PDF pages may be rerendered at `300 DPI`
- Requested vs effective diagnostics are visible in completion messages and logs
- Missing language packs are reported with install guidance
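
The attempt limits and starting DPI values listed above can be written down as a small table and compared against what a run's diagnostics report. This is an illustrative sketch only: `ModePolicy`, `POLICIES`, and `violations` are hypothetical names, not identifiers from the pipeline. `Custom` is left out because its values come from the user, and the optional `300 DPI` rerender of weak pages is not modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModePolicy:
    max_attempts: int   # the exact attempt plus any allowed recovery attempt
    pdf_start_dpi: int  # DPI of the first render of each PDF page


# Values taken from the expectations listed above.
POLICIES = {
    "Fast": ModePolicy(max_attempts=1, pdf_start_dpi=200),
    "Balanced": ModePolicy(max_attempts=2, pdf_start_dpi=200),
    "Accuracy": ModePolicy(max_attempts=2, pdf_start_dpi=300),
}


def violations(mode, attempts, start_dpi):
    """Return human-readable mismatches between an observed run and the policy."""
    policy = POLICIES[mode]
    problems = []
    if attempts > policy.max_attempts:
        problems.append(
            f"{mode}: {attempts} attempts, expected at most {policy.max_attempts}"
        )
    if start_dpi != policy.pdf_start_dpi:
        problems.append(
            f"{mode}: started at {start_dpi} DPI, expected {policy.pdf_start_dpi}"
        )
    return problems
```

For example, `violations("Balanced", 3, 200)` returns a single message about the attempt count, while a run that matches the policy returns an empty list.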
- invalid Tesseract path
- missing Tesseract executable
- missing PDF renderer
- missing traineddata for requested language
- empty OCR result from low-quality source
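
Two of the failure cases above, a missing executable and missing traineddata, can be reproduced or detected with a short preflight check. The sketch below assumes the standard Tesseract convention that each language code maps to a `<code>.traineddata` file in the tessdata directory. The function names are hypothetical and the pipeline's own detection logic may differ.

```python
import shutil
from pathlib import Path


def tesseract_available(path_or_name="tesseract"):
    """True if the executable is found on PATH or at the given path."""
    return shutil.which(path_or_name) is not None


def missing_languages(requested, tessdata_dir):
    """Return the codes in a spec like 'eng+hin' that lack a traineddata file."""
    codes = [code for code in requested.split("+") if code]
    return [
        code
        for code in codes
        if not (Path(tessdata_dir) / f"{code}.traineddata").is_file()
    ]
```

Pointing `missing_languages` at a tessdata directory that lacks one of the requested codes is a quick way to confirm which language the install guidance should name.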
- Keep benchmark inputs local and sanitized
- Measure PDF performance on both single-page and multi-page documents
- Measure bulk throughput with multiple files, not just one PDF
- Hidden rollback key for maintainers: `HiddenOcrExecutor=legacy`