Description
Parser Improvement: OCR correction uppercases 'Indonesia' and 'Undang-Undang' inappropriately
Severity: HIGH
File: ocr_correct.py
Function: correct_ocr_errors
Estimated errors fixed: 4
Current Behavior
The pattern r'\bUNDANG[\s-]*UNDANG\b' with re.IGNORECASE matches any casing of 'Undang-Undang' (including correct title case) and replaces it with all-caps 'UNDANG-UNDANG'. Similarly, r'\bINDONES[!I1]A\b' with re.IGNORECASE matches the correctly spelled 'Indonesia', because the character class includes 'I' itself, and replaces it with 'INDONESIA'. As a result, properly cased body text is converted to all-caps.
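The behavior can be reproduced in isolation with a minimal sketch. The two patterns are copied from this issue; the `apply_current` helper and the sample sentence are hypothetical stand-ins for illustration and are not the actual `correct_ocr_errors` implementation.

```python
import re

# The two patterns as they currently appear in ocr_correct.py (per this issue).
CURRENT_PATTERNS = [
    (re.compile(r'\bINDONES[!I1]A\b', re.IGNORECASE), 'INDONESIA'),
    (re.compile(r'\bUNDANG[\s-]*UNDANG\b', re.IGNORECASE), 'UNDANG-UNDANG'),
]


def apply_current(text: str) -> str:
    """Hypothetical helper: apply each (pattern, replacement) pair in order."""
    for pattern, replacement in CURRENT_PATTERNS:
        text = pattern.sub(replacement, text)
    return text


# Correctly cased body text with no OCR errors at all.
sample = "Undang-Undang Dasar Negara Republik Indonesia"
print(apply_current(sample))
# UNDANG-UNDANG Dasar Negara Republik INDONESIA
```

The input contains no OCR damage, yet both phrases come back in all-caps.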
Proposed Fix
Remove re.IGNORECASE from patterns that should match only OCR-broken forms, not correctly cased text. For UNDANG-UNDANG, fix only spacing issues when the text is already all-caps. For INDONESIA, fix only when an actual OCR error character is present ('!' or '1' in place of 'I'), not when the word is already correct. Add a separate pattern to normalize 'INDONESIA' back to 'Indonesia' in body-text contexts such as 'Republik Indonesia'.
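A minimal sketch of how the proposal might behave end to end. The two case-sensitive OCR patterns are taken from this issue. The body-text normalization pattern, the list names, and the `apply_proposed` helper are assumptions made for illustration, since the issue describes that normalization step without giving a concrete pattern.

```python
import re

# Case-sensitive patterns from the proposed fix: match OCR-broken forms only.
OCR_FIXES = [
    (re.compile(r'\bINDONES[!1]A\b'), 'INDONESIA'),
    (re.compile(r'\bUNDANG\s+UNDANG\b'), 'UNDANG-UNDANG'),
]

# Assumed normalization pattern (not given in the issue): title-case
# 'Republik' followed by all-caps 'INDONESIA' is treated as body text.
BODY_TEXT_FIXES = [
    (re.compile(r'\bRepublik INDONESIA\b'), 'Republik Indonesia'),
]


def apply_proposed(text: str) -> str:
    """Hypothetical helper: OCR fixes first, then body-text normalization."""
    for pattern, replacement in OCR_FIXES + BODY_TEXT_FIXES:
        text = pattern.sub(replacement, text)
    return text


# Correctly cased body text is left alone.
print(apply_proposed("Undang-Undang Dasar Negara Republik Indonesia"))
# All-caps heading: spacing and the '1' are repaired, casing is kept.
print(apply_proposed("UNDANG UNDANG REPUBLIK INDONES1A"))
# Body text: the OCR fix uppercases the word, then normalization restores it.
print(apply_proposed("Negara Republik INDONES!A"))
```

Running the normalization after the OCR fixes matters in this sketch: the OCR pattern always emits 'INDONESIA', so a broken word in body text is restored to title case only if the normalization pattern sees the already-corrected result.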
Code Before
(re.compile(r'\bINDONES[!I1]A\b', re.IGNORECASE), 'INDONESIA'),
(re.compile(r'\bUNDANG[\s-]*UNDANG\b', re.IGNORECASE), 'UNDANG-UNDANG'),
Code After
# Only fix actual OCR errors (! or 1 instead of I) — don't touch correctly spelled text
(re.compile(r'\bINDONES[!1]A\b'), 'INDONESIA'),
# Only fix spacing issues in already-uppercase UNDANG UNDANG
(re.compile(r'\bUNDANG\s+UNDANG\b'), 'UNDANG-UNDANG'),

Generated by the Pasal.id Correction Agent (Opus 4.6) after analyzing 5 parser feedback entries.