feat(wasm+py+forensic): Lanes 11, 15, 17 — WASM parity, PyO3 tests, forensic inspection #239

Open
jacob-cotten wants to merge 3 commits into developer0hye:main from jacob-cotten:feat/wasm-pyo3-forensic

Summary

Three lanes shipped together as they share pdfplumber-core infrastructure:

Lane 11 — WASM Bindings (full API parity)

What was missing:

  • WasmCroppedPage — entirely absent from WASM despite existing in PyO3
  • lines(), rects(), curves(), images() — no geometry API on any WASM type
  • hyperlinks(), annots(), bookmarks() — no annotation/link access
  • rotation, bbox, mediaBox — missing page properties
  • crop(), within_bbox(), outside_bbox() — no spatial filtering

What was added:

  • WasmCroppedPage struct with full extraction API (chars, words, tables, geometry, crop chain)
  • All geometry methods on WasmPage
  • WasmPdf::bookmarks() for TOC access
  • TypeScript .d.ts: PdfLine, PdfRect, PdfCurve, PdfImage, PdfBookmark, PdfHyperlink, WasmCroppedPage class
  • package.json for wasm-pack npm packaging
  • Complete browser demo rewrite: metadata display, bookmarks, per-page nav, crop demo (header/body split), geometry inspector, hyperlinks section, WASM load indicator
  • 26 new Rust unit tests

Lane 17 — PyO3 Python Bindings (test suite)

What was missing: Zero Python integration tests despite 98 Rust unit tests in the crate.

What was added:

  • tests/conftest.py: pure-Python hand-crafted PDF byte fixture (no external deps — works before maturin develop)
  • tests/test_basic.py: 50+ Python integration tests covering PDF.open_bytes, PDF.open, pages, metadata, bookmarks, Page properties, all extraction methods, crop/within_bbox/outside_bbox, CroppedPage methods
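
As an illustration of the "hand-crafted PDF bytes" idea, here is a minimal sketch of how such a fixture could be assembled with no external dependencies. The function name `build_minimal_pdf` and the object layout are assumptions for illustration; the actual contents of conftest.py are not shown in this PR description.

```python
def build_minimal_pdf(text: str = "Hello") -> bytes:
    """Assemble a one-page PDF by hand, recording byte offsets for the xref table.

    Hypothetical helper: `text` must not contain unescaped parentheses or backslashes.
    """
    stream = f"BT /F1 12 Tf 72 720 Td ({text}) Tj ET".encode("latin-1")
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
        b"/Contents 4 0 R /Resources << /Font << /F1 5 0 R >> >> >>",
        b"<< /Length %d >>\nstream\n" % len(stream) + stream + b"\nendstream",
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
    ]

    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for num, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset where object `num` begins
        out += b"%d 0 obj\n" % num + body + b"\nendobj\n"

    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # entry for the free object 0 (20 bytes)
    for off in offsets:
        out += b"%010d 00000 n \n" % off  # each in-use entry is exactly 20 bytes
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)
```

A helper like this could be wrapped in a pytest fixture and its result passed to PDF.open_bytes in the integration tests.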

Lane 15 — Forensic Metadata Inspection

New module: pdfplumber-core::forensic (~650 lines, 40+ tests)

  • ProducerKind: 18-variant enum fingerprinting known PDF producers — AdobeAcrobat, MicrosoftWord, LibreOffice, GoogleDocs, Latex, Pdf24 (online converter 🚩), Smallpdf (online converter 🚩), and 11 more
  • detect_incremental_updates(bytes): pure byte-scan for startxref markers, each of which marks one PDF revision; more than one revision can reveal post-signing modifications.
  • WatermarkFinding / WatermarkKind: LowOpacityText, InvisibleText, RepeatedTextBlock, LowOpacityOverlay
  • PageGeometryAnomaly: unusual rotation, non-standard dimensions
  • MetadataFinding: scrubbed fields, creation date equal to modification date (common in online converter output)
  • ForensicReport::build(): assembles all findings and computes a risk score (0 = clean; each finding type adds to the score)
  • format_text(): human-readable multi-section forensic report
  • Pdf::inspect(&raw_bytes): public API entry point that never fails; all errors produce sensible defaults
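
To make the startxref byte-scan concrete, here is a rough Python sketch of the idea. It is not the Rust implementation in pdfplumber-core::forensic; the function names are hypothetical, and a plain substring count like this may over-count if the keyword happens to appear inside uncompressed stream data.

```python
def count_startxref_markers(data: bytes) -> int:
    """Count `startxref` keywords; each one is treated as one PDF revision."""
    return data.count(b"startxref")


def has_incremental_updates(data: bytes) -> bool:
    """True when the file appears to have been appended to after its first save."""
    return count_startxref_markers(data) > 1
```

A file saved once carries a single marker, so anything above one suggests content was appended later, for example after a signature was applied.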

New CLI subcommand: pdfplumber inspect <file> [--format text|json]

  • Non-zero exit code when risk_score > 0 (CI pipeline friendly)
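
For reviewers, the relationship between findings, risk score, and exit code can be sketched roughly as below. The finding names, the weights, and the choice of exit status 1 are all assumptions for illustration; the PR only states that the score starts at 0, grows with findings, and that any score above 0 yields a non-zero exit code.

```python
from collections import Counter

# Hypothetical weights per finding type; the real values are not listed in this PR.
WEIGHTS = {
    "incremental_update": 3,
    "watermark": 2,
    "geometry_anomaly": 1,
    "metadata": 1,
}


def risk_score(findings):
    """Sum a weight for every finding; an empty list scores 0 (clean)."""
    counts = Counter(findings)
    # Unknown finding types fall back to a weight of 1 in this sketch.
    return sum(WEIGHTS.get(kind, 1) * n for kind, n in counts.items())


def exit_code(score):
    """Map the score to a process exit status: 0 when clean, 1 otherwise."""
    return 0 if score == 0 else 1
```

In a CI pipeline this means a step running pdfplumber inspect <file> fails as soon as any finding raises the score above zero.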

Note on CI jobs: This PR does not include the CI workflow additions, because pushing workflow files requires the workflow OAuth scope. CI jobs for WASM (wasm-pack build) and PyO3 (maturin develop + pytest) will be added in a follow-up PR.

Test plan

  • cargo check -p pdfplumber -p pdfplumber-core -p pdfplumber-cli — verify all imports resolve
  • cargo test -p pdfplumber-core — 40+ new forensic tests
  • cargo check -p pdfplumber-wasm --target wasm32-unknown-unknown — WASM compile check
  • cargo test -p pdfplumber-py --lib --features pyo3/auto-initialize — PyO3 unit tests
  • maturin develop && pytest crates/pdfplumber-py/tests/ — Python integration tests

🤖 Generated with Claude Code

jacob-cotten and others added 2 commits March 6, 2026 06:36
Lane 11 (WASM bindings):
- Add WasmCroppedPage with full extraction API parity (chars, extract_text,
  extract_words, find_tables, extract_tables, lines, rects, curves, images,
  crop, within_bbox, outside_bbox)
- Add to WasmPage: lines(), rects(), curves(), images(), annots(), hyperlinks(),
  rotation, bbox, mediaBox getters, crop/within_bbox/outside_bbox returning
  WasmCroppedPage
- Add WasmPdf::bookmarks()
- TypeScript .d.ts: add PdfLine, PdfRect, PdfCurve, PdfImage, PdfBookmark,
  PdfHyperlink interfaces, WasmCroppedPage class
- package.json: add npm package metadata for wasm-pack
- browser-demo.html: full rewrite — metadata, bookmarks/TOC, page navigation,
  crop demo (header/body split), geometry inspector, hyperlinks, WASM load
  indicator
- 26 new Rust unit tests for all new API surface

Lane 17 (PyO3 Python bindings):
- Add crates/pdfplumber-py/tests/conftest.py: pure-Python minimal PDF fixture
  builder (no external deps, hand-crafted PDF bytes)
- Add crates/pdfplumber-py/tests/test_basic.py: 50+ Python integration tests
  covering full API: PDF.open_bytes, PDF.open, pages, metadata, bookmarks,
  Page properties, chars/words/tables/shapes, crop/within_bbox/outside_bbox,
  CroppedPage methods

Closes developer0hye#11 (WASM target), developer0hye#17 (PyO3 bindings)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: jacob_cotten <jacobcotten@gmail.com>
…ommand

Implements complete forensic metadata inspection for PDF documents:

- `pdfplumber-core::forensic`: new module with `ForensicReport`, `ProducerKind`
  (18 variants fingerprinting known tools + online converters), `IncrementalUpdate`
  (byte-scan xref sections for modification detection), `WatermarkFinding`,
  `WatermarkKind`, `PageGeometryAnomaly`, `MetadataFinding`.
  `ForensicReport::build()` computes risk score and `format_text()` for human output.
  40+ unit tests.

- `pdfplumber::Pdf::inspect(&raw_bytes)`: wires ForensicReport into the public API.
  Collects page rotations + dims from cached lopdf data, calls signatures(),
  extracts %PDF-X.Y version from header bytes.

- `pdfplumber-cli inspect`: new subcommand — text + JSON output, non-zero exit code
  when risk_score > 0 (CI-friendly).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: jacob_cotten <jacobcotten@gmail.com>
- Add doc comments to RepeatedTextBlock variant fields (missing_docs)
- Add parens for clippy::precedence in pdfplumber-parse (3 sites)
- Elide needless lifetime in PagesIter
- Remove phantom serde feature cfg from CLI inspect_cmd

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: jacob_cotten <jacobcotten@gmail.com>