Commit a808947

Fixes to PDF ALTO for better field separation: ALTO XML row-band grouping, full-width splitting, and a header length constraint. Replace column-major sort with row-band grouping (vertical-overlap merging). Split bands at full-width block boundaries.
1 parent 2ff6bfa commit a808947

3 files changed: +360 -17 lines changed

CHANGELOG.md

Lines changed: 9 additions & 0 deletions
@@ -5,6 +5,15 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## [0.1.6] - 2026-02-26
+
+### Changed
+
+- ALTO XML reading order: replace column-major sort with row-band grouping via vertical-overlap merging
+- Split row-bands at full-width block boundaries (>60% page width) to preserve section ordering in mixed layouts
+- Header detection: limit to blocks ≤150 chars and ≤3 lines to prevent bold body paragraphs from being marked as headers
+- Skip contact-like text (emails, URLs, phone numbers) from header detection
+
 ## [0.1.5] - 2026-02-25
 
 ### Fixed
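
The reading-order change described in the 0.1.6 entries can be sketched roughly as follows. This is a minimal illustration, not the code from this commit: the `Block` class, its field names, and the functions `group_row_bands`, `split_at_full_width`, and `reading_order` are all hypothetical, and it assumes each block carries the ALTO `HPOS`, `VPOS`, `WIDTH`, and `HEIGHT` values. Only the 60% threshold comes from the changelog.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """Hypothetical text block; fields mirror ALTO HPOS/VPOS/WIDTH/HEIGHT."""
    text: str
    x: float  # left edge
    y: float  # top edge
    w: float  # width
    h: float  # height


def group_row_bands(blocks):
    """Merge blocks whose vertical extents overlap into row bands."""
    bands = []
    band_bottom = 0.0
    for b in sorted(blocks, key=lambda b: (b.y, b.x)):
        if bands and b.y < band_bottom:
            # Block starts above the current band's bottom edge: same band.
            bands[-1].append(b)
            band_bottom = max(band_bottom, b.y + b.h)
        else:
            bands.append([b])
            band_bottom = b.y + b.h
    return bands


def split_at_full_width(band, page_width, ratio=0.6):
    """Split a band so that each full-width block forms its own segment."""
    segments, current = [], []
    for b in sorted(band, key=lambda b: (b.y, b.x)):
        if b.w > ratio * page_width:
            if current:
                segments.append(current)
                current = []
            segments.append([b])
        else:
            current.append(b)
    if current:
        segments.append(current)
    return segments


def reading_order(blocks, page_width):
    """Order blocks band by band, left to right within each segment."""
    ordered = []
    for band in group_row_bands(blocks):
        for segment in split_at_full_width(band, page_width):
            ordered.extend(sorted(segment, key=lambda b: (b.x, b.y)))
    return ordered


blocks = [
    Block("Name:", 50, 100, 100, 20),
    Block("Phone:", 50, 300, 100, 20),
    Block("Alice", 500, 102, 100, 20),
    Block("555", 500, 298, 80, 20),
    Block("Summary", 50, 200, 900, 20),  # wider than 60% of the page
]
print([b.text for b in reading_order(blocks, page_width=1000)])
```

With a column-major sort, both labels in the left column would come before both values in the right column. Grouping by vertical overlap keeps each label next to its value, which appears to be the "field separation" the commit message refers to.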

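The header constraints in the 0.1.6 entries could look something like the sketch below. It assumes that header candidates are bold blocks, which the changelog implies but does not state. The function names and regular expressions are illustrative placeholders, since the commit's actual patterns are not shown here; only the 150-character and 3-line limits come from the changelog.

```python
import re

MAX_HEADER_CHARS = 150  # limit taken from the changelog entry
MAX_HEADER_LINES = 3    # limit taken from the changelog entry

# Illustrative, deliberately loose patterns (assumptions, not the commit's own).
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{6,}\d")


def looks_like_contact(text: str) -> bool:
    """Return True if the text contains an email, URL, or phone-like run."""
    return any(p.search(text) for p in (EMAIL_RE, URL_RE, PHONE_RE))


def is_header_candidate(text: str, is_bold: bool) -> bool:
    """Apply the length, line-count, and contact-text checks to a bold block."""
    stripped = text.strip()
    if not is_bold or not stripped:
        return False
    if len(stripped) > MAX_HEADER_CHARS:
        return False
    if len(stripped.splitlines()) > MAX_HEADER_LINES:
        return False
    return not looks_like_contact(stripped)


print(is_header_candidate("Work Experience", is_bold=True))
print(is_header_candidate("jane@example.com", is_bold=True))
```

The phone pattern here is loose enough to also match date ranges such as "2019 - 2023", so a real implementation would likely need something stricter.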