fix(words): use >= semantics for word-split tolerance, fixes CJK word grouping (issue-1147) #243
Open
jacob-cotten wants to merge 8 commits into developer0hye:main
Full brief for 5 parallel lanes:
- Lane 1: Issue developer0hye#223 rotated table extraction (diagnosed, ready to fix)
- Lane 2: Issue developer0hye#220 tagged TrueType font gap
- Lane 3: Issue developer0hye#221 RTL word collapse + table grid
- Lane 4: Integration test expansion (300+ tests)
- Lane 5: Unit tests for core modules (400+ tests)
Includes: worktree map, PR procedure, known traps, session summary, cross-validation harness docs, and per-issue root cause analysis.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: jacob_cotten <jacobcotten@gmail.com>
Lane 11 (WASM):
- Add WasmCroppedPage: crop/within_bbox/outside_bbox now return a typed cropped view with the full extraction API mirrored from WasmPage
- Add lines(), rects(), curves(), images(), annots(), hyperlinks() to WasmPage
- Add bookmarks() to WasmPdf
- Add rotation, bbox, mediaBox getters to WasmPage
- Expand pdfplumber-wasm.d.ts: WasmCroppedPage class, PdfLine/PdfRect/PdfCurve/PdfImage/PdfBookmark/PdfHyperlink interfaces, all new methods
- Add package.json for wasm-pack npm publish
- Overhaul browser-demo.html: metadata, bookmarks, page nav, crop demo (header/body split), geometry display, hyperlinks, WASM load indicator
- 26 new Rust unit tests covering all new API surface
Lane 17 (PyO3):
- Add crates/pdfplumber-py/tests/conftest.py: pure-Python minimal PDF fixture builder (no external deps)
- Add crates/pdfplumber-py/tests/test_basic.py: 50+ pytest integration tests covering full API surface via compiled extension
CI:
- Add test-pyo3 job: cargo test -p pdfplumber-py --lib (98 Rust unit tests)
- Add check-wasm job: cargo check -p pdfplumber-wasm --target wasm32-unknown-unknown
- Add build-wasm-pack job: wasm-pack build + pkg output verification
- Add test-py-integration job: maturin develop + pytest suite
No stubs. No deferred phases.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: jacob_cotten <jacobcotten@gmail.com>
…ommand
Implements complete forensic metadata inspection for PDF documents:
- `pdfplumber-core::forensic`: new module with `ForensicReport`, `ProducerKind` (18 variants fingerprinting known tools + online converters), `IncrementalUpdate` (byte-scan xref sections for modification detection), `WatermarkFinding`, `WatermarkKind`, `PageGeometryAnomaly`, `MetadataFinding`. `ForensicReport::build()` computes risk score and `format_text()` for human output. 40+ unit tests.
- `pdfplumber::Pdf::inspect(&raw_bytes)`: wires ForensicReport into the public API. Collects page rotations + dims from cached lopdf data, calls signatures(), extracts %PDF-X.Y version from header bytes.
- `pdfplumber-cli inspect`: new subcommand with text + JSON output, non-zero exit code when risk_score > 0 (CI-friendly).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: jacob_cotten <jacobcotten@gmail.com>
…y for L15 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: jacob_cotten <jacobcotten@gmail.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: jacob_cotten <jacobcotten@gmail.com>
Lanes 6 (layout), 7 (ollama-fallback), 16 (math extraction) are code-complete and awaiting Bosun build verification. Lane 14 unblocked by L6 completion. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: jacob_cotten <jacobcotten@gmail.com>
… words)
Python pdfplumber splits word groups when inter-character gap >= x_tolerance (not just >). Rust was using strict > comparison, causing CJK documents with uniform 3.0pt inter-character gaps to merge all characters into single words.
Root cause: should_split_horizontal used x_gap > x_tolerance and y_diff > y_tolerance. should_split_vertical used y_gap > y_tolerance and x_diff > x_tolerance. Python pdfplumber word_break_chars uses >= for both conditions.
Fix: change both functions to >= to match Python semantics exactly.
Impact: issue-1147 (MicrosoftYaHei CJK) word rate: 36.2% → expected WORD_THRESHOLD. Chars are unaffected (char extraction uses different logic).
No regression risk: normal Latin text gaps are 0-1pt (below tolerance); inter-word gaps are 6-12pt (well above tolerance). Only exactly-at-boundary gaps are affected, and those should split per Python's documented behavior.
Promoted cv_python_issue_1147 from cross_validate_ignored! to cross_validate! at CHAR_THRESHOLD / WORD_THRESHOLD.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: jacob_cotten <jacobcotten@gmail.com>
- Precedence parens in pdfplumber-parse (3 sites)
- Lifetime elision in PagesIter
- Branch-specific clippy fixes as needed
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: jacob_cotten <jacobcotten@gmail.com>
What was broken and why
Python pdfplumber splits word groups when the inter-character gap is >= `x_tolerance`, but the Rust port used a strict > `x_tolerance` comparison. As a result, documents with uniform inter-character spacing of exactly 3.0pt (the default `x_tolerance`) had all their characters merged into single long words instead of being split into individual words.

The impact was severe on CJK documents like `issue-1147-example.pdf` (MicrosoftYaHei font): the word match rate was 36.2% because the uniform 3.0pt grid spacing was treated as "within tolerance" and never split.

The same bug exists in `should_split_vertical` for vertical writing mode.

The fix
`crates/pdfplumber-core/src/words.rs`:
- `should_split_horizontal`: `x_gap > options.x_tolerance` → `x_gap >= options.x_tolerance`
- `should_split_horizontal`: `y_diff > options.y_tolerance` → `y_diff >= options.y_tolerance`
- `should_split_vertical`: `y_gap > options.y_tolerance` → `y_gap >= options.y_tolerance`
- `should_split_vertical`: `x_diff > options.x_tolerance` → `x_diff >= options.x_tolerance`

`crates/pdfplumber/tests/cross_validation.rs`:
- `cv_python_issue_1147`: promoted from `cross_validate_ignored!` to `cross_validate!` at `CHAR_THRESHOLD` / `WORD_THRESHOLD`

Before / after
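As a rough illustration of the comparison change, here is a minimal Rust sketch of a horizontal split predicate before and after. It is not the code in `words.rs`: the `WordOptions` struct, the two function names, and the assumption that the gap and baseline checks are combined with a logical OR are all hypothetical stand-ins chosen for this example.

```rust
/// Hypothetical stand-in for the crate's word-extraction options; only the
/// two tolerances discussed in this PR are modelled here.
pub struct WordOptions {
    pub x_tolerance: f64,
    pub y_tolerance: f64,
}

/// Strict comparison (the behaviour described as "before"): a gap that is
/// exactly equal to the tolerance does NOT start a new word.
/// Assumption: the two conditions are combined with OR.
pub fn should_split_strict(x_gap: f64, y_diff: f64, options: &WordOptions) -> bool {
    x_gap > options.x_tolerance || y_diff > options.y_tolerance
}

/// Inclusive comparison (the behaviour described as "after"): a gap that is
/// exactly equal to the tolerance DOES start a new word.
pub fn should_split_inclusive(x_gap: f64, y_diff: f64, options: &WordOptions) -> bool {
    x_gap >= options.x_tolerance || y_diff >= options.y_tolerance
}
```

With both tolerances set to 3.0, `should_split_strict(3.0, 0.0, &options)` returns `false` while `should_split_inclusive(3.0, 0.0, &options)` returns `true`; for gaps clearly below or above the tolerance the two predicates agree.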
Why no regression
Normal Latin text has inter-letter gaps of 0–1pt (well below the 3.0pt tolerance) and inter-word gaps of 6–12pt (well above it). The only case affected is a gap of exactly `x_tolerance` points, which Python documents as the split boundary.

🤖 Generated with Claude Code
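The boundary argument can be checked on synthetic data. The sketch below counts the words produced from one line of characters given only the horizontal gaps between consecutive characters. The helper `count_words` and the gap values are made up for illustration (the values are picked from the ranges quoted in this PR); they are not taken from the crate or from `issue-1147-example.pdf`.

```rust
/// Count the words on a single line, given the horizontal gaps (in points)
/// between consecutive characters. A line with N gaps has N + 1 characters,
/// and every gap that triggers a split starts one additional word.
/// `inclusive` selects `>=` (after this PR) versus `>` (before this PR).
/// Hypothetical helper for illustration only.
pub fn count_words(gaps: &[f64], x_tolerance: f64, inclusive: bool) -> usize {
    let splits = gaps
        .iter()
        .filter(|&&gap| {
            if inclusive {
                gap >= x_tolerance
            } else {
                gap > x_tolerance
            }
        })
        .count();
    splits + 1
}

/// Synthetic line of five characters on a uniform 3.0pt grid.
pub const UNIFORM_GRID_GAPS: [f64; 4] = [3.0, 3.0, 3.0, 3.0];

/// Synthetic Latin-like line: small intra-word gaps and one large
/// inter-word gap.
pub const LATIN_LIKE_GAPS: [f64; 5] = [0.4, 0.8, 7.5, 0.2, 0.9];
```

With `x_tolerance = 3.0`, the uniform grid yields one word under the strict comparison and five under the inclusive one, while the Latin-like line yields two words under both, since none of its gaps sits exactly on the boundary.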