[Parser] OCR correction uppercases 'Indonesia' and 'Undang-Undang' inappropriately #15

## Parser Improvement: OCR correction uppercases 'Indonesia' and 'Undang-Undang' inappropriately

**Severity:** HIGH
**File:** `ocr_correct.py`
**Function:** `correct_ocr_errors`
**Estimated errors fixed:** 4

### Current Behavior

The pattern `r'\bUNDANG[\s-]*UNDANG\b'` is compiled with `re.IGNORECASE`, so it matches every casing of 'Undang-Undang', including correct title case, and replaces it with all-caps 'UNDANG-UNDANG'. Similarly, `r'\bINDONES[!I1]A\b'` with `re.IGNORECASE` matches correctly spelled 'Indonesia', because the character class includes `I` itself, and replaces it with 'INDONESIA'. As a result, properly cased body text is converted to all caps.
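The behavior can be reproduced in isolation. The following is a minimal sketch that copies only the two patterns quoted in this issue; the `CURRENT_PATTERNS` list and the `apply_current` helper are illustrative names, not the actual structure of `correct_ocr_errors`.

```python
import re

# The two (pattern, replacement) pairs as quoted under "Code Before".
CURRENT_PATTERNS = [
    (re.compile(r'\bINDONES[!I1]A\b', re.IGNORECASE), 'INDONESIA'),
    (re.compile(r'\bUNDANG[\s-]*UNDANG\b', re.IGNORECASE), 'UNDANG-UNDANG'),
]


def apply_current(text: str) -> str:
    """Apply each pair in order (hypothetical stand-in for the real function)."""
    for pattern, replacement in CURRENT_PATTERNS:
        text = pattern.sub(replacement, text)
    return text


# Correctly cased body text contains no OCR error, yet it is rewritten.
sample = "Undang-Undang Dasar Negara Republik Indonesia"
print(apply_current(sample))
# -> UNDANG-UNDANG Dasar Negara Republik INDONESIA
```

Under `re.IGNORECASE`, the `I` in the character class `[!I1]` also matches a lowercase `i`, which is why the correctly spelled word is caught.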

### Proposed Fix

Remove `re.IGNORECASE` from patterns that should match only OCR-broken forms, not correctly cased text. For UNDANG-UNDANG, fix spacing/hyphen issues only when the text is already all caps. For INDONESIA, fix only when an actual OCR error character (`!` or `1` in place of `I`) is present, not when the word is already correct. Add a separate pattern to normalize 'INDONESIA' back to 'Indonesia' in body-text contexts such as 'Republik Indonesia'.
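The separate normalization pattern mentioned above does not appear in the code for this issue. One possible shape is sketched below, assuming that a title-case 'Republik' directly before an all-caps 'INDONESIA' indicates body text. The names `BODY_TEXT_INDONESIA` and `normalize_body_case` are hypothetical.

```python
import re

# Hypothetical pattern, not part of the issue's code: title-case "Republik"
# followed by all-caps "INDONESIA" is assumed to be body text.
BODY_TEXT_INDONESIA = re.compile(r'\b(Republik\s+)INDONESIA\b')


def normalize_body_case(text: str) -> str:
    """Restore 'Indonesia' after a title-case 'Republik'; leave headings alone."""
    return BODY_TEXT_INDONESIA.sub(r'\g<1>Indonesia', text)


print(normalize_body_case("Negara Republik INDONESIA Tahun 1945"))
# -> Negara Republik Indonesia Tahun 1945
print(normalize_body_case("PRESIDEN REPUBLIK INDONESIA"))
# -> PRESIDEN REPUBLIK INDONESIA
```

Because the pattern is case-sensitive, all-caps headings such as 'PRESIDEN REPUBLIK INDONESIA' are left unchanged. Whether other body-text contexts need the same treatment is not stated in the issue.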

### Code Before

```python
    (re.compile(r'\bINDONES[!I1]A\b', re.IGNORECASE), 'INDONESIA'),
    (re.compile(r'\bUNDANG[\s-]*UNDANG\b', re.IGNORECASE), 'UNDANG-UNDANG'),
```

### Code After

```python
    # Only fix actual OCR errors (! or 1 instead of I); don't touch correctly spelled text
    (re.compile(r'\bINDONES[!1]A\b'), 'INDONESIA'),
    # Only fix spacing issues in already-uppercase UNDANG UNDANG
    (re.compile(r'\bUNDANG\s+UNDANG\b'), 'UNDANG-UNDANG'),
```
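A quick way to check the proposed patterns is to run them against both correctly cased and OCR-damaged input. This is a sketch only: `PROPOSED_PATTERNS` and `apply_proposed` are illustrative names, and the sample strings are made up for the check rather than taken from the parser feedback entries.

```python
import re

# The two (pattern, replacement) pairs as quoted under "Code After".
PROPOSED_PATTERNS = [
    (re.compile(r'\bINDONES[!1]A\b'), 'INDONESIA'),
    (re.compile(r'\bUNDANG\s+UNDANG\b'), 'UNDANG-UNDANG'),
]


def apply_proposed(text: str) -> str:
    """Apply each pair in order (hypothetical stand-in for the real function)."""
    for pattern, replacement in PROPOSED_PATTERNS:
        text = pattern.sub(replacement, text)
    return text


# Input mapped to the output expected from the proposed patterns.
cases = {
    # Correctly cased body text must pass through unchanged.
    "Undang-Undang Republik Indonesia": "Undang-Undang Republik Indonesia",
    # All-caps headings with OCR damage are still repaired.
    "UNDANG UNDANG REPUBLIK INDONES1A": "UNDANG-UNDANG REPUBLIK INDONESIA",
    "REPUBLIK INDONES!A": "REPUBLIK INDONESIA",
}

for source, expected in cases.items():
    assert apply_proposed(source) == expected, source
```

Two trade-offs follow from the stricter patterns. Mixed-case OCR errors such as 'Indones1a' are no longer corrected, and `\s+` does not match 'UNDANG- UNDANG' or 'UNDANGUNDANG', which the original `[\s-]*` did. If those forms occur in the corpus, they would need additional patterns.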

---

_Generated by the Pasal.id Correction Agent (Opus 4.6) after analyzing 5 parser feedback entries._

