feat: use geometric sorting as fallback for reading order
- When no structure tree exists, fall back to geometric sorting
instead of raw stream order
- Sorts by Y (top→bottom), then X (left→right)
- Includes automatic two-column layout detection
- Add Reading Order section to README explaining the approach
zpdf uses a two-tier approach for extracting text in logical reading order:

1. **Structure Tree** (preferred): Uses the PDF's semantic structure as defined by the document author. This is the correct approach for tagged/accessible PDFs (PDF/UA) and properly handles multi-column layouts, sidebars, tables, and captions.

2. **Geometric Sorting** (fallback): When no structure tree exists, zpdf falls back to sorting text by Y-coordinate (top→bottom), then X-coordinate (left→right), with automatic two-column detection.

| Method | Pros | Cons |
|--------|------|------|
| Structure tree | Correct semantic order, handles complex layouts | Only works on tagged PDFs |
| Geometric sort | Works on any PDF, handles two-column | Can fail on complex layouts |
| Stream order | Fast, raw extraction | Often wrong order |
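
The two-tier selection and the geometric fallback can be illustrated with a minimal Python sketch. This is not zpdf's implementation: the `Span` type, the function names, the midline gutter test, and the `line_tol` and `max_crossing` parameters are all assumptions made for illustration. Coordinates follow PDF user space (origin at the bottom-left), so "top to bottom" means descending Y.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Span:
    """Hypothetical text span; x/y is the left end of the baseline in PDF units."""
    text: str
    x: float
    y: float
    width: float


def find_gutter(spans: List[Span], page_width: float,
                max_crossing: float = 0.1) -> Optional[float]:
    """Assumed heuristic: treat the page midline as a column gutter when both
    halves contain text and at most `max_crossing` of the spans straddle it."""
    mid = page_width / 2
    left = [s for s in spans if s.x + s.width <= mid]
    right = [s for s in spans if s.x >= mid]
    crossing = len(spans) - len(left) - len(right)
    if left and right and crossing <= max_crossing * len(spans):
        return mid
    return None


def geometric_order(spans: List[Span], page_width: float,
                    line_tol: float = 2.0) -> List[Span]:
    """Sort by Y (top to bottom), then X (left to right), one column at a time."""
    def key(s: Span):
        # Bucket Y so spans with nearly equal baselines usually share a line;
        # negate because PDF Y grows upward.
        return (-round(s.y / line_tol), s.x)

    gutter = find_gutter(spans, page_width)
    if gutter is None:
        return sorted(spans, key=key)
    # Two columns: read the whole left column before the right one.
    left = [s for s in spans if s.x + s.width / 2 < gutter]
    right = [s for s in spans if s.x + s.width / 2 >= gutter]
    return sorted(left, key=key) + sorted(right, key=key)


def reading_order(structure_order: Optional[List[Span]],
                  spans: List[Span], page_width: float) -> List[Span]:
    """Prefer the structure-tree order when available, else sort geometrically."""
    if structure_order is not None:
        return structure_order
    return geometric_order(spans, page_width)


# An untagged two-column page, with spans given in arbitrary stream order.
page = [
    Span("R2", 350, 680, 200),
    Span("L1", 50, 700, 200),
    Span("R1", 350, 700, 200),
    Span("L2", 50, 680, 200),
]
print([s.text for s in reading_order(None, page, page_width=600)])
# prints ['L1', 'L2', 'R1', 'R2']
```

A midline test like this one only recognizes symmetric two-column pages. Full-width headings, sidebars, or three-column layouts would defeat it, which matches the "can fail on complex layouts" caveat in the table.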