[Parser] OCR correction uppercases 'Indonesia' and 'Undang-Undang' inappropriately #15

@ilhamfp

Parser Improvement: OCR correction uppercases 'Indonesia' and 'Undang-Undang' inappropriately

Severity: HIGH
File: ocr_correct.py
Function: correct_ocr_errors
Estimated errors fixed: 4

Current Behavior

Both patterns are compiled with re.IGNORECASE, so they match correctly spelled text as well as OCR errors. The pattern r'\bUNDANG[\s-]*UNDANG\b' matches any casing of 'Undang-Undang', including correct title case, and replaces it with all-caps 'UNDANG-UNDANG'. Similarly, r'\bINDONES[!I1]A\b' matches 'Indonesia', because its character class includes the correct letter I, and replaces it with 'INDONESIA'. As a result, properly cased body text is converted to all caps.
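
The behavior can be reproduced in isolation with a minimal sketch. The two patterns are copied from the Code Before section; the `apply_patterns` helper and the sample sentence are assumptions made for illustration and are not taken from ocr_correct.py.

```python
import re

# The two patterns exactly as listed under "Code Before".
OLD_PATTERNS = [
    (re.compile(r'\bINDONES[!I1]A\b', re.IGNORECASE), 'INDONESIA'),
    (re.compile(r'\bUNDANG[\s-]*UNDANG\b', re.IGNORECASE), 'UNDANG-UNDANG'),
]


def apply_patterns(text, patterns):
    """Hypothetical helper: apply each (pattern, replacement) pair in order."""
    for pattern, replacement in patterns:
        text = pattern.sub(replacement, text)
    return text


# Correctly cased body text with no OCR errors in it.
body = "Undang-Undang Dasar Negara Republik Indonesia"
result = apply_patterns(body, OLD_PATTERNS)
print(result)  # UNDANG-UNDANG Dasar Negara Republik INDONESIA
```

The input contains no OCR error, yet both terms come back in all caps.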

Proposed Fix

Remove re.IGNORECASE from patterns that should match only OCR-broken forms, not correctly cased text. For UNDANG-UNDANG, fix spacing/hyphen issues only when the text is already all caps. For INDONESIA, apply the fix only when an actual OCR error character is present (! or 1 in place of I), not when the word is already spelled correctly. Add a separate pattern to normalize 'INDONESIA' back to 'Indonesia' in body-text contexts such as 'Republik Indonesia'.
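
The separate normalization pattern is not spelled out in this issue. One possible shape, as a sketch only: the name `BODY_CASE_PATTERN` and the rule "restore title case only after a title-case 'Republik'" are assumptions, not the project's actual implementation.

```python
import re

# Hypothetical pattern (not from the issue): restore 'Indonesia' only when the
# all-caps form directly follows title-case 'Republik'. The match is
# case-sensitive, so all-caps headings are left alone.
BODY_CASE_PATTERN = re.compile(r'\b(Republik\s+)INDONESIA\b')


def normalize_body_case(text):
    """Rewrite 'Republik INDONESIA' as 'Republik Indonesia', keeping spacing."""
    return BODY_CASE_PATTERN.sub(r'\1Indonesia', text)


print(normalize_body_case("Negara Republik INDONESIA Tahun 1945"))
# Negara Republik Indonesia Tahun 1945
print(normalize_body_case("PRESIDEN REPUBLIK INDONESIA"))
# PRESIDEN REPUBLIK INDONESIA
```

Anchoring on the preceding word keeps the rule narrow. Whether other body-text contexts need the same treatment is left open here.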

Code Before

    (re.compile(r'\bINDONES[!I1]A\b', re.IGNORECASE), 'INDONESIA'),
    (re.compile(r'\bUNDANG[\s-]*UNDANG\b', re.IGNORECASE), 'UNDANG-UNDANG'),

Code After

    # Only fix actual OCR errors (! or 1 instead of I) — don't touch correctly spelled text
    (re.compile(r'\bINDONES[!1]A\b'), 'INDONESIA'),
    # Only fix spacing issues in already-uppercase UNDANG UNDANG
    (re.compile(r'\bUNDANG\s+UNDANG\b'), 'UNDANG-UNDANG'),
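
A quick check of the two replacement patterns, as a sketch: the patterns are copied from Code After, while the `apply_patterns` helper and the sample strings are assumptions made for illustration.

```python
import re

# The two patterns exactly as listed under "Code After".
NEW_PATTERNS = [
    (re.compile(r'\bINDONES[!1]A\b'), 'INDONESIA'),
    (re.compile(r'\bUNDANG\s+UNDANG\b'), 'UNDANG-UNDANG'),
]


def apply_patterns(text, patterns):
    """Hypothetical helper: apply each (pattern, replacement) pair in order."""
    for pattern, replacement in patterns:
        text = pattern.sub(replacement, text)
    return text


# Each key is an input; each value is the output expected from NEW_PATTERNS.
cases = {
    # Correctly cased body text is left untouched.
    "Undang-Undang Republik Indonesia": "Undang-Undang Republik Indonesia",
    # Genuine OCR errors in all-caps text are still repaired.
    "REPUBLIK INDONES!A": "REPUBLIK INDONESIA",
    "REPUBLIK INDONES1A": "REPUBLIK INDONESIA",
    "UNDANG  UNDANG": "UNDANG-UNDANG",
    # Trade-offs of the narrower patterns: these inputs are no longer changed.
    "Indones1a": "Indones1a",
    "UNDANG - UNDANG": "UNDANG - UNDANG",
}

for source, expected in cases.items():
    assert apply_patterns(source, NEW_PATTERNS) == expected, source
print("all cases pass")
```

The last two cases show what the narrower patterns give up: mixed-case OCR errors and hyphen-with-space variants pass through unchanged. If those forms occur in the feedback entries, they would need patterns of their own.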

Generated by the Pasal.id Correction Agent (Opus 4.6) after analyzing 5 parser feedback entries.
