feat(wasm+py+forensic): Lanes 11, 15, 17 — WASM parity, PyO3 tests, forensic inspection #239
Open
jacob-cotten wants to merge 3 commits into developer0hye:main
Lane 11 (WASM bindings):
- Add WasmCroppedPage with full extraction API parity (chars, extract_text, extract_words, find_tables, extract_tables, lines, rects, curves, images, crop, within_bbox, outside_bbox)
- Add to WasmPage: lines(), rects(), curves(), images(), annots(), hyperlinks(), rotation, bbox, mediaBox getters, crop/within_bbox/outside_bbox returning WasmCroppedPage
- Add WasmPdf::bookmarks()
- TypeScript .d.ts: add PdfLine, PdfRect, PdfCurve, PdfImage, PdfBookmark, PdfHyperlink interfaces, WasmCroppedPage class
- package.json: add wasm-pack npm package metadata
- browser-demo.html: full rewrite with metadata, bookmarks/TOC, page navigation, crop demo (header/body split), geometry inspector, hyperlinks, WASM load indicator
- 26 new Rust unit tests for all new API surface

Lane 17 (PyO3 Python bindings):
- Add crates/pdfplumber-py/tests/conftest.py: pure-Python minimal PDF fixture builder (no external deps, hand-crafted PDF bytes)
- Add crates/pdfplumber-py/tests/test_basic.py: 50+ Python integration tests covering full API: PDF.open_bytes, PDF.open, pages, metadata, bookmarks, Page properties, chars/words/tables/shapes, crop/within_bbox/outside_bbox, CroppedPage methods

Closes developer0hye#11 (WASM target), developer0hye#17 (PyO3 bindings)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: jacob_cotten <jacobcotten@gmail.com>
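For reviewers unfamiliar with the spatial filters named in this commit, here is a minimal sketch of the difference between `within_bbox` and `outside_bbox`. It assumes the semantics of Python pdfplumber (`within_bbox` keeps objects lying entirely inside the region, `outside_bbox` keeps objects lying entirely outside it). The `BBox` type and the free functions are hypothetical illustrations, not the types added in this PR.

```rust
/// Hypothetical bounding box as (x0, top, x1, bottom), with `top` growing downward.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct BBox {
    pub x0: f64,
    pub top: f64,
    pub x1: f64,
    pub bottom: f64,
}

impl BBox {
    /// True when `self` lies entirely inside `outer`.
    pub fn is_within(&self, outer: &BBox) -> bool {
        self.x0 >= outer.x0
            && self.top >= outer.top
            && self.x1 <= outer.x1
            && self.bottom <= outer.bottom
    }

    /// True when `self` and `other` share any area.
    pub fn overlaps(&self, other: &BBox) -> bool {
        self.x0 < other.x1
            && other.x0 < self.x1
            && self.top < other.bottom
            && other.top < self.bottom
    }
}

/// Keep objects fully inside `region` (assumed `within_bbox` semantics).
pub fn within_bbox(objects: &[BBox], region: &BBox) -> Vec<BBox> {
    objects.iter().copied().filter(|o| o.is_within(region)).collect()
}

/// Keep objects that share no area with `region` (assumed `outside_bbox` semantics).
pub fn outside_bbox(objects: &[BBox], region: &BBox) -> Vec<BBox> {
    objects.iter().copied().filter(|o| !o.overlaps(region)).collect()
}
```

Under these assumptions an object that straddles the region boundary is returned by neither filter, which is why a header/body split like the one in the browser demo may need `crop` rather than the two filters alone.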
…ommand

Implements complete forensic metadata inspection for PDF documents:
- `pdfplumber-core::forensic`: new module with `ForensicReport`, `ProducerKind` (18 variants fingerprinting known tools + online converters), `IncrementalUpdate` (byte-scan xref sections for modification detection), `WatermarkFinding`, `WatermarkKind`, `PageGeometryAnomaly`, `MetadataFinding`. `ForensicReport::build()` computes risk score and `format_text()` for human output. 40+ unit tests.
- `pdfplumber::Pdf::inspect(&raw_bytes)`: wires ForensicReport into the public API. Collects page rotations + dims from cached lopdf data, calls signatures(), extracts %PDF-X.Y version from header bytes.
- `pdfplumber-cli inspect`: new subcommand with text + JSON output, non-zero exit code when risk_score > 0 (CI-friendly).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: jacob_cotten <jacobcotten@gmail.com>
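The byte-level checks described in this commit can be sketched roughly as follows. This is an illustration only, not the code in `pdfplumber-core::forensic`: the function names `startxref_offsets`, `incremental_update_count`, and `pdf_version` are hypothetical, and the sketch assumes that every `startxref` keyword marks one revision of the file.

```rust
/// Byte offsets of every `startxref` keyword found in `bytes`.
pub fn startxref_offsets(bytes: &[u8]) -> Vec<usize> {
    const MARKER: &[u8] = b"startxref";
    bytes
        .windows(MARKER.len())
        .enumerate()
        .filter(|(_, window)| *window == MARKER)
        .map(|(offset, _)| offset)
        .collect()
}

/// Revisions beyond the first one, i.e. the number of incremental updates.
pub fn incremental_update_count(bytes: &[u8]) -> usize {
    startxref_offsets(bytes).len().saturating_sub(1)
}

/// Parse the `%PDF-X.Y` header into (major, minor), if the file starts with one.
pub fn pdf_version(bytes: &[u8]) -> Option<(u8, u8)> {
    let rest = bytes.strip_prefix(b"%PDF-")?;
    match rest {
        [major @ b'0'..=b'9', b'.', minor @ b'0'..=b'9', ..] => {
            Some((major - b'0', minor - b'0'))
        }
        _ => None,
    }
}
```

A plain scan like this can over-count if the keyword happens to appear inside stream data, so a finding based on it is probably best treated as a hint rather than proof of modification.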
- Add doc comments to RepeatedTextBlock variant fields (missing_docs)
- Add parens for clippy::precedence in pdfplumber-parse (3 sites)
- Elide needless lifetime in PagesIter
- Remove phantom serde feature cfg from CLI inspect_cmd

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: jacob_cotten <jacobcotten@gmail.com>
Summary
Three lanes shipped together as they share pdfplumber-core infrastructure:
Lane 11 — WASM Bindings (full API parity)
What was missing:
- `WasmCroppedPage`: entirely absent from WASM despite existing in PyO3
- `lines()`, `rects()`, `curves()`, `images()`: no geometry API on any WASM type
- `hyperlinks()`, `annots()`, `bookmarks()`: no annotation/link access
- `rotation`, `bbox`, `mediaBox`: missing page properties
- `crop()`, `within_bbox()`, `outside_bbox()`: no spatial filtering

What was added:
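- `WasmCroppedPage` struct with full extraction API (chars, words, tables, geometry, crop chain)
- `WasmPage`: `lines()`, `rects()`, `curves()`, `images()`, `annots()`, `hyperlinks()`, the `rotation`, `bbox`, and `mediaBox` getters, and `crop`/`within_bbox`/`outside_bbox` returning `WasmCroppedPage`
- `WasmPdf::bookmarks()` for TOC access
- `.d.ts`: `PdfLine`, `PdfRect`, `PdfCurve`, `PdfImage`, `PdfBookmark`, `PdfHyperlink`, `WasmCroppedPage` class
- `package.json` for wasm-pack npm packaging

Lane 17 — PyO3 Python Bindings (test suite)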
What was missing: Zero Python integration tests despite 98 Rust unit tests in the crate.
What was added:
- `tests/conftest.py`: pure-Python hand-crafted PDF byte fixture (no external deps; works before `maturin develop`)
- `tests/test_basic.py`: 50+ Python integration tests covering `PDF.open_bytes`, `PDF.open`, `pages`, `metadata`, `bookmarks`, `Page` properties, all extraction methods, `crop`/`within_bbox`/`outside_bbox`, `CroppedPage` methods

Lane 15 — Forensic Metadata Inspection
New module:
`pdfplumber-core::forensic` (~650 lines, 40+ tests)
- `ProducerKind`: 18-variant enum fingerprinting known PDF producers: `AdobeAcrobat`, `MicrosoftWord`, `LibreOffice`, `GoogleDocs`, `Latex`, `Pdf24` (online converter 🚩), `Smallpdf` (online converter 🚩), and 11 more
- `detect_incremental_updates(bytes)`: pure byte scan for `startxref` markers, each one marking a PDF revision. Detects post-signing modifications.
- `WatermarkFinding` / `WatermarkKind`: `LowOpacityText`, `InvisibleText`, `RepeatedTextBlock`, `LowOpacityOverlay`
- `PageGeometryAnomaly`: unusual rotation, non-standard dimensions
- `MetadataFinding`: scrubbed fields, creation date equal to modification date (common in online converter output)
- `ForensicReport::build()`: assembles all findings, computes risk score (0 = clean, accumulates per finding type)
- `format_text()`: human-readable multi-section forensic report
- `Pdf::inspect(&raw_bytes)`: public API entry point; never fails, all errors produce sensible defaults

New CLI subcommand:
- `pdfplumber inspect <file> [--format text|json]`
- Non-zero exit code when `risk_score > 0` (CI pipeline friendly)

Note on CI jobs: This PR does not include the CI workflow additions (requires `workflow` OAuth scope). CI jobs for WASM (`wasm-pack build`) and PyO3 (`maturin develop` + `pytest`) will be added in a follow-up PR.

Test plan
- `cargo check -p pdfplumber -p pdfplumber-core -p pdfplumber-cli`: verify all imports resolve
- `cargo test -p pdfplumber-core`: 40+ new forensic tests
- `cargo check -p pdfplumber-wasm --target wasm32-unknown-unknown`: WASM compile check
- `cargo test -p pdfplumber-py --lib --features pyo3/auto-initialize`: PyO3 unit tests
- `maturin develop && pytest crates/pdfplumber-py/tests/`: Python integration tests

🤖 Generated with Claude Code
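
For anyone wiring `inspect` into a pipeline, the "accumulates per finding type" scoring and the non-zero exit on `risk_score > 0` might look roughly like the sketch below. The `FindingKind` variants, their weights, and the function names are hypothetical placeholders chosen for illustration; the actual scoring lives in `ForensicReport::build()` and may differ.

```rust
/// Hypothetical finding categories; the real variants and weights may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FindingKind {
    IncrementalUpdate,
    Watermark,
    GeometryAnomaly,
    MetadataAnomaly,
}

impl FindingKind {
    /// Illustrative weight each finding contributes to the score.
    fn weight(self) -> u32 {
        match self {
            FindingKind::IncrementalUpdate => 2,
            FindingKind::Watermark => 3,
            FindingKind::GeometryAnomaly => 1,
            FindingKind::MetadataAnomaly => 1,
        }
    }
}

/// Sum the weights of all findings; an empty list scores 0 (clean).
pub fn risk_score(findings: &[FindingKind]) -> u32 {
    findings.iter().map(|f| f.weight()).sum()
}

/// Map a risk score to a process exit status: 0 when clean, 1 otherwise.
pub fn exit_code_for(score: u32) -> u8 {
    if score > 0 {
        1
    } else {
        0
    }
}
```

A CI job would then fail on any document with at least one finding, so teams that expect benign findings (for example, legitimately signed and updated PDFs) may want a threshold rather than a strict zero check.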