Fix tagged TrueType extraction gap for hello_structure.pdf #220

@developer0hye

Summary

hello_structure.pdf still shows a severe extraction gap, and its cross-validation test remains ignored.

Evidence

Run:
cargo test -p pdfplumber --test cross_validation -- --include-ignored --nocapture

Current result:

  • hello_structure.pdf: chars 38.9%, words 44.4%

The ignore reason in the test file already points to a tagged PDF + TrueType handling gap.

Scope

  • Investigate tagged PDF extraction path for this fixture
  • Close TrueType/encoding mapping gap causing low char/word recovery
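
One possible shape for the encoding mapping work is a fallback chain when turning a character code into Unicode. The sketch below is an assumption about where the gap could lie, not a description of pdfplumber's actual code: `resolve_char`, `to_unicode`, and `encoding` are hypothetical names. The lookup order (ToUnicode CMap first, then the font's encoding table, then U+FFFD) follows the general text-extraction guidance in the PDF specification for simple fonts.

```rust
use std::collections::HashMap;

/// Resolve one single-byte character code to text.
/// Hypothetical helper: the names and types are illustrative only.
fn resolve_char(
    code: u8,
    to_unicode: Option<&HashMap<u8, String>>, // parsed ToUnicode CMap, if the font has one
    encoding: &HashMap<u8, char>,             // font encoding table (base encoding plus overrides)
) -> String {
    // 1. Prefer the ToUnicode CMap when it has an entry for this code.
    if let Some(s) = to_unicode.and_then(|m| m.get(&code)) {
        return s.clone();
    }
    // 2. Otherwise fall back to the font's encoding table.
    if let Some(&c) = encoding.get(&code) {
        return c.to_string();
    }
    // 3. Unmapped code: emit the Unicode replacement character
    //    so the loss stays visible in the extracted text.
    '\u{FFFD}'.to_string()
}

fn main() {
    let mut cmap = HashMap::new();
    cmap.insert(0x01u8, "H".to_string());
    let mut enc = HashMap::new();
    enc.insert(0x65u8, 'e');

    // 0x01 resolves via the CMap, 0x65 via the encoding, 0x7F is unmapped.
    let text: String = [0x01u8, 0x65, 0x7F]
        .iter()
        .map(|&c| resolve_char(c, Some(&cmap), &enc))
        .collect();
    println!("{text}");
}
```

Counting how many codes end up at step 3 for this fixture would be one way to tell whether the low recovery comes from missing mappings or from the tagged-content path.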

Acceptance Criteria

  • Fixture reaches >=95% chars and >=95% words
  • Convert the fixture's test from cross_validate_ignored! to the asserting cross_validate! macro
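
The criteria above amount to a two-sided threshold check. A minimal sketch, assuming match rates are expressed as percentages: `MatchRates`, `meets_acceptance`, and `MIN_PCT` are hypothetical names for illustration and are not taken from the test harness or from the `cross_validate!` macro.

```rust
/// Hypothetical container for the two rates reported per fixture.
#[derive(Debug, Clone, Copy)]
struct MatchRates {
    chars_pct: f64,
    words_pct: f64,
}

/// Threshold from the acceptance criteria (>= 95% for both rates).
const MIN_PCT: f64 = 95.0;

/// Returns true only when both chars and words reach the threshold.
fn meets_acceptance(rates: MatchRates) -> bool {
    rates.chars_pct >= MIN_PCT && rates.words_pct >= MIN_PCT
}

fn main() {
    // Current result reported in this issue for hello_structure.pdf.
    let current = MatchRates { chars_pct: 38.9, words_pct: 44.4 };
    println!("current passes: {}", meets_acceptance(current));
}
```

Both rates must pass independently, so a high char rate cannot mask poor word segmentation.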
