developer0hye/pdfplumber-rs

pdfplumber-rs

Extract chars, words, lines, rects, and tables from PDF documents with precise coordinates.

pdfplumber-rs is a Rust port of Python's pdfplumber. It extracts structured content from PDF files with coordinate-accurate positioning, including characters, words, lines, rectangles, curves, images, and tables.

Features

  • Text extraction with spatial grouping into words, lines, and text blocks
  • Table detection using lattice (line-based), stream (text-alignment), and explicit strategies
  • Spatial filtering via crop, within_bbox, and outside_bbox
  • CJK support including CID fonts, Identity-H/V CMaps, and CJK-aware word grouping
  • Page-level streaming for memory-efficient processing of large documents
  • WASM support via wasm32-unknown-unknown target
  • Optional serde serialization for all data types
  • Optional parallel processing via rayon
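
To make the spatial-filtering bullet concrete, here is a minimal, self-contained sketch of what "within" and "outside" a bounding box can mean. It does not call the pdfplumber API. The `BBox` struct, its `x1` and `bottom` fields, and the two free functions are hypothetical stand-ins, and the semantics (fully inside versus no overlap at all) follow Python pdfplumber on the assumption that the Rust port behaves the same way.

```rust
/// Axis-aligned box in page coordinates (y grows downward from the top).
/// Hypothetical type: field names follow Python pdfplumber's convention.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct BBox {
    pub x0: f64,
    pub top: f64,
    pub x1: f64,
    pub bottom: f64,
}

impl BBox {
    /// True when `other` lies entirely inside `self` (edges may touch).
    pub fn contains(&self, other: &BBox) -> bool {
        other.x0 >= self.x0
            && other.x1 <= self.x1
            && other.top >= self.top
            && other.bottom <= self.bottom
    }

    /// True when the two boxes share any interior area.
    pub fn overlaps(&self, other: &BBox) -> bool {
        self.x0 < other.x1
            && other.x0 < self.x1
            && self.top < other.bottom
            && other.top < self.bottom
    }
}

/// Keep only boxes fully inside `region` (the `within_bbox` idea).
pub fn within_bbox(items: &[BBox], region: &BBox) -> Vec<BBox> {
    items.iter().copied().filter(|b| region.contains(b)).collect()
}

/// Keep only boxes that do not overlap `region` at all (the `outside_bbox` idea).
pub fn outside_bbox(items: &[BBox], region: &BBox) -> Vec<BBox> {
    items.iter().copied().filter(|b| !region.overlaps(b)).collect()
}
```

An object that straddles the region's edge is dropped by both filters in this sketch; a crop-style filter would instead keep it and clip it to the region.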

Installation

Add to your Cargo.toml:

[dependencies]
pdfplumber = "0.1"

Feature Flags

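| Feature  | Default | Description                                                    |
|----------|---------|----------------------------------------------------------------|
| std      | Yes     | Enables file-path APIs (Pdf::open_file). Disable for WASM.     |
| serde    | No      | Adds Serialize/Deserialize to all public data types.           |
| parallel | No      | Enables Pdf::pages_parallel() via rayon. Not WASM-compatible.  |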

Quick Start

Extract Text

use pdfplumber::{Pdf, TextOptions};

fn main() {
    let pdf = Pdf::open_file("document.pdf", None).unwrap();
    for page_result in pdf.pages_iter() {
        let page = page_result.unwrap();
        let text = page.extract_text(&TextOptions::default());
        println!("Page {}: {}", page.page_number(), text);
    }
}

Extract Tables

use pdfplumber::{Pdf, TableSettings};

fn main() {
    let pdf = Pdf::open_file("document.pdf", None).unwrap();
    let page = pdf.page(0).unwrap();
    let tables = page.find_tables(&TableSettings::default());
    for table in &tables {
        for row in &table.rows {
            let cells: Vec<&str> = row.iter()
                .map(|c| c.text.as_deref().unwrap_or(""))
                .collect();
            println!("{:?}", cells);
        }
    }
}
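
A common next step after printing rows is writing them out as CSV. The sketch below is one way to do that, assuming each row has first been copied into a plain `Vec<Option<String>>` (mirroring the optional `text` field used above). The `csv_field` and `rows_to_csv` helpers are hypothetical and not part of pdfplumber.

```rust
/// Quote a field when it contains a comma, a double quote, or a line break,
/// doubling any embedded quotes (the usual CSV convention).
fn csv_field(value: &str) -> String {
    if value.contains(|c: char| matches!(c, ',' | '"' | '\n' | '\r')) {
        format!("\"{}\"", value.replace('"', "\"\""))
    } else {
        value.to_string()
    }
}

/// Render table rows as CSV text; missing cells become empty fields.
pub fn rows_to_csv(rows: &[Vec<Option<String>>]) -> String {
    rows.iter()
        .map(|row| {
            row.iter()
                .map(|cell| csv_field(cell.as_deref().unwrap_or("")))
                .collect::<Vec<_>>()
                .join(",")
        })
        .collect::<Vec<_>>()
        .join("\n")
}
```

Treating a missing cell as an empty field matches the `unwrap_or("")` choice in the example above; use a different placeholder if empty and missing cells need to be told apart.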

Extract Characters

use pdfplumber::Pdf;

fn main() {
    let pdf = Pdf::open_file("document.pdf", None).unwrap();
    let page = pdf.page(0).unwrap();
    for ch in page.chars() {
        println!(
            "'{}' at ({:.1}, {:.1}) font={} size={:.1}",
            ch.text, ch.bbox.x0, ch.bbox.top, ch.fontname, ch.size
        );
    }
}
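
Per-character positions like these are the raw input for word grouping. As a rough illustration of the idea (not the library's actual algorithm), the sketch below splits one line of characters into words wherever there is whitespace or a horizontal gap wider than a tolerance. `CharBox`, `group_into_words`, and the `x_tolerance` parameter are hypothetical names chosen for this example.

```rust
/// Minimal stand-in for a character record; hypothetical, with only the
/// fields this sketch needs.
#[derive(Debug, Clone)]
pub struct CharBox {
    pub text: String,
    pub x0: f64,
    pub x1: f64,
}

/// Group the characters of one text line into words.
/// Assumes `chars` is already sorted left to right. A new word starts at a
/// whitespace character or when the gap to the previous character exceeds
/// `x_tolerance`.
pub fn group_into_words(chars: &[CharBox], x_tolerance: f64) -> Vec<String> {
    let mut words = Vec::new();
    let mut current = String::new();
    let mut prev_x1: Option<f64> = None;

    for ch in chars {
        let is_space = ch.text.trim().is_empty();
        let gap_too_wide = prev_x1.is_some_and(|x1| ch.x0 - x1 > x_tolerance);

        // Close the word in progress before handling this character.
        if (is_space || gap_too_wide) && !current.is_empty() {
            words.push(std::mem::take(&mut current));
        }
        if !is_space {
            current.push_str(&ch.text);
        }
        prev_x1 = Some(ch.x1);
    }
    if !current.is_empty() {
        words.push(current);
    }
    words
}
```

A full implementation would also need to split on vertical position to separate lines and handle CJK text, where words are not delimited by spaces.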

WASM Support

For wasm32-unknown-unknown targets, disable the default std feature:

[dependencies]
pdfplumber = { version = "0.1", default-features = false }

Use the bytes-based API:

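use pdfplumber::{Pdf, TextOptions};

// Inside a function that returns a Result; `pdf_bytes` holds the raw PDF data.
let pdf = Pdf::open(pdf_bytes, None)?;
let page = pdf.page(0)?;
let text = page.extract_text(&TextOptions::default());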

Architecture

+--------------------------------------------------------------+
|  Layer 5: Table Detection (Lattice / Stream / Explicit)      |
+--------------------------------------------------------------+
|  Layer 4: Text Grouping & Reading Order                      |
|  Characters -> Words -> Lines -> TextBlocks                  |
+--------------------------------------------------------------+
|  Layer 3: Object Extraction                                  |
|  Chars (bbox/font/size/color), Paths (lines/rects/curves)    |
+--------------------------------------------------------------+
|  Layer 2: Content Stream Interpreter                         |
|  Text state, Graphics state, CTM, XObject Do                 |
+--------------------------------------------------------------+
|  Layer 1: PDF Parsing (pluggable backend via PdfBackend)     |
|  lopdf (default)                                             |
+--------------------------------------------------------------+
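
To give a feel for the lattice strategy in Layer 5, here is a deliberately simplified sketch. It assumes the ruling lines have already been extracted and that every line spans the whole table, so each adjacent pair of vertical and horizontal lines bounds exactly one cell. The `Cell` alias and `lattice_cells` function are hypothetical; real lattice detection also has to find line intersections and cope with merged or missing cells.

```rust
/// Cell rectangle as (x0, top, x1, bottom).
pub type Cell = (f64, f64, f64, f64);

/// Build grid cells from ruling-line positions, row by row.
/// `xs` holds the x positions of vertical lines, `ys` the y positions of
/// horizontal lines; both may arrive unsorted and with exact duplicates.
pub fn lattice_cells(mut xs: Vec<f64>, mut ys: Vec<f64>) -> Vec<Vec<Cell>> {
    xs.sort_by(|a, b| a.total_cmp(b));
    ys.sort_by(|a, b| a.total_cmp(b));
    xs.dedup();
    ys.dedup();

    // Each pair of neighbouring horizontal lines is a row; each pair of
    // neighbouring vertical lines is a column within that row.
    ys.windows(2)
        .map(|row| {
            xs.windows(2)
                .map(|col| (col[0], row[0], col[1], row[1]))
                .collect()
        })
        .collect()
}
```

Three vertical and three horizontal lines therefore yield a 2 x 2 grid; the text for each cell would then come from the characters whose boxes fall inside it.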

The library is split into three crates:

| Crate            | Description                                      |
|------------------|--------------------------------------------------|
| pdfplumber-core  | Backend-independent data types and algorithms    |
| pdfplumber-parse | PDF parsing and content stream interpretation    |
| pdfplumber       | Public API facade (this is what you depend on)   |

Minimum Supported Rust Version

Rust 1.85 or later.

License

Licensed under either of:

at your option.
