Native Rust port of IBM's Docling document processing library. Convert PDF, DOCX, XLSX, PPTX, HTML, Markdown, and CSV to structured data for RAG applications.

carles-abarca/docling-rs

docling-rs

A native Rust implementation inspired by IBM's Docling Python library.

docling-rs brings document processing capabilities to the Rust ecosystem, offering a high-performance alternative for converting documents into structured, machine-readable formats optimized for RAG (Retrieval-Augmented Generation) and LLM applications.

Why docling-rs?

The original Docling by IBM is an excellent Python library for document processing. This Rust adaptation provides:

  • Native Performance: No Python runtime required, significantly faster processing
  • Single Binary Distribution: Easy deployment with self-contained executables
  • Memory Safety: Rust's guarantees for reliable production use
  • Cross-Platform: Pre-built binaries for Windows, macOS (Intel & Apple Silicon), and Linux
  • Batteries Included: PDF support with bundled PDFium library

Features

  • Multi-format Support: Markdown, HTML, CSV, DOCX, XLSX, PPTX, and PDF
  • Unified Document Model: All formats convert to a common DoclingDocument structure
  • Intelligent Chunking: Hierarchical and hybrid chunking strategies for RAG applications
  • No Python Dependencies: Written in Rust with native performance (PDF support relies on the bundled PDFium library)
  • Cross-platform: Windows, macOS (Intel & Apple Silicon), and Linux
  • Modular Architecture: Workspace-based design with separate crates per format
  • CLI Included: Full-featured command-line interface for batch processing
  • Batteries Included: PDF support with bundled PDFium binaries
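To make the "unified document model" idea concrete, here is a minimal sketch of how several formats could map onto one structure. It is an illustration under stated assumptions, not the crate's actual types: `Node`, `Document`, and this `to_text` are hypothetical names chosen for the example.

```rust
// Hypothetical sketch of a unified document model. These are NOT the
// real docling-rs types; they only illustrate the idea of one common
// structure that every format backend produces.
#[derive(Debug, Clone, PartialEq)]
enum Node {
    Heading { level: u8, text: String },
    Paragraph(String),
    Table { headers: Vec<String>, rows: Vec<Vec<String>> },
}

#[derive(Debug, Clone, PartialEq)]
struct Document {
    name: String,
    nodes: Vec<Node>,
}

impl Document {
    /// Render every node as plain text, one line per heading,
    /// paragraph, or table row (table cells are tab-separated).
    fn to_text(&self) -> String {
        let mut lines = Vec::new();
        for node in &self.nodes {
            match node {
                Node::Heading { text, .. } => lines.push(text.clone()),
                Node::Paragraph(text) => lines.push(text.clone()),
                Node::Table { headers, rows } => {
                    lines.push(headers.join("\t"));
                    for row in rows {
                        lines.push(row.join("\t"));
                    }
                }
            }
        }
        lines.join("\n")
    }
}
```

Because exporters only see the common structure, an export such as `to_text` is written once and works regardless of which format the document came from.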

Status

v1.0.3 - Production-ready with 7 format backends (all enabled by default)

Component        | Status
-----------------|------------
Core Library     | ✅ Complete
CLI              | ✅ Complete
Markdown Backend | ✅ Complete
HTML Backend     | ✅ Complete
CSV Backend      | ✅ Complete
DOCX Backend     | ✅ Complete
XLSX Backend     | ✅ Complete
PPTX Backend     | ✅ Complete
PDF Backend      | ✅ Complete
Chunking         | ✅ Complete
Documentation    | ✅ Complete

Installation

Pre-built Binaries

Download from Releases:

  • Windows: docling-rs-x86_64-windows.msi or .zip
  • macOS Intel: docling-rs-x86_64-macos.dmg
  • macOS Apple Silicon: docling-rs-aarch64-macos.dmg
  • Linux: docling-rs-x86_64-linux.tar.gz

Rust Library

Add to your Cargo.toml:

[dependencies]
docling-rs = "1.0"

# Or, instead of the line above, with minimal features (no PDF/Office):
# docling-rs = { version = "1.0", default-features = false, features = ["markdown", "html", "csv"] }

Feature Flags

Feature  | Description                    | Default
---------|--------------------------------|--------
full     | All format backends            | Yes
markdown | Markdown support               | Yes
html     | HTML support                   | Yes
csv      | CSV support                    | Yes
docx     | Microsoft Word support         | Yes
xlsx     | Microsoft Excel support        | Yes
pptx     | Microsoft PowerPoint support   | Yes
pdf      | PDF support (requires PDFium)  | Yes
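Cargo features like the ones listed here are resolved at compile time. As a rough sketch (not code from docling-rs; `enabled_backends` is a hypothetical helper), a crate can report which backends were compiled in with the standard `cfg!` macro:

```rust
/// Hypothetical helper: list the backends compiled into this build.
/// Each `cfg!(feature = "...")` expands to `true` or `false` at compile
/// time, depending on the Cargo features selected for the crate.
fn enabled_backends() -> Vec<&'static str> {
    let mut backends = Vec::new();
    if cfg!(feature = "markdown") {
        backends.push("markdown");
    }
    if cfg!(feature = "html") {
        backends.push("html");
    }
    if cfg!(feature = "csv") {
        backends.push("csv");
    }
    if cfg!(feature = "docx") {
        backends.push("docx");
    }
    if cfg!(feature = "xlsx") {
        backends.push("xlsx");
    }
    if cfg!(feature = "pptx") {
        backends.push("pptx");
    }
    if cfg!(feature = "pdf") {
        backends.push("pdf");
    }
    backends
}
```

Compiled outside a crate that declares these features, the function returns an empty list, and recent compilers may warn about the unknown feature names.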

Quick Start

CLI Usage

# Convert a single file
docling-rs document.pdf --to markdown --output-dir ./output

# Batch convert a directory
docling-rs ./documents/ --to json --output-dir ./converted

# Enable chunking for RAG (uses embedded all-MiniLM-L6-v2 tokenizer)
docling-rs document.pdf --chunk --to json

# Filter by input format
docling-rs ./docs/ --from pdf,docx --to markdown

Library Usage

use docling_rs::DocumentConverter;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let converter = DocumentConverter::new();

    // Convert a file (format auto-detected)
    let result = converter.convert_file("document.pdf")?;
    let doc = result.document();

    // Export to different formats
    let markdown = doc.to_markdown();
    let text = doc.to_text();
    let json = serde_json::to_string_pretty(&doc)?;

    println!("Document: {}", doc.name());
    println!("Nodes: {}", doc.nodes().len());

    Ok(())
}

Converting from Bytes

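use docling_rs::{DocumentConverter, InputFormat};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Load the raw bytes; read from disk here, but any in-memory source works.
    // Assumes convert_bytes accepts an owned byte buffer, as the call below suggests.
    let pdf_bytes = std::fs::read("document.pdf")?;

    let converter = DocumentConverter::new();
    let result = converter.convert_bytes(
        pdf_bytes,
        "document.pdf".to_string(),
        InputFormat::PDF,
    )?;
    println!("Document: {}", result.document().name());
    Ok(())
}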

Supported Formats

Format     | Extensions          | Description
-----------|---------------------|-------------------------
Markdown   | .md, .markdown      | CommonMark and GFM
HTML       | .html, .htm         | Semantic HTML extraction
CSV        | .csv                | Tabular data
Word       | .docx, .dotx, .docm | Microsoft Word
Excel      | .xlsx, .xlsm, .xls  | Microsoft Excel
PowerPoint | .pptx, .potx, .ppsx | Microsoft PowerPoint
PDF        | .pdf                | PDF documents
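Format auto-detection can be pictured as a lookup from file extension to backend. The following is a minimal sketch under that assumption; `Format` and `detect_format` are hypothetical names for illustration, and the crate's real detection logic may differ (for example, by inspecting file contents).

```rust
use std::path::Path;

// Hypothetical enum for this sketch; not the crate's `InputFormat`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Format {
    Markdown,
    Html,
    Csv,
    Docx,
    Xlsx,
    Pptx,
    Pdf,
}

/// Map a path to a format using its extension (case-insensitive).
/// Returns `None` when there is no extension or it is not recognized.
fn detect_format(path: &str) -> Option<Format> {
    let ext = Path::new(path).extension()?.to_str()?.to_ascii_lowercase();
    match ext.as_str() {
        "md" | "markdown" => Some(Format::Markdown),
        "html" | "htm" => Some(Format::Html),
        "csv" => Some(Format::Csv),
        "docx" | "dotx" | "docm" => Some(Format::Docx),
        "xlsx" | "xlsm" | "xls" => Some(Format::Xlsx),
        "pptx" | "potx" | "ppsx" => Some(Format::Pptx),
        "pdf" => Some(Format::Pdf),
        _ => None,
    }
}
```

Returning `Option` rather than panicking lets a batch run skip unrecognized files and report them separately.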

Document Chunking

Intelligent chunking for RAG and embedding applications, using the embedded tokenizer (sentence-transformers/all-MiniLM-L6-v2):

use docling_rs::{DocumentConverter, chunking::{HybridChunker, HuggingFaceTokenizer}};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let converter = DocumentConverter::new();
    let result = converter.convert_file("document.pdf")?;
    let doc = result.document();

    // Hybrid chunker with embedded tokenizer (recommended for RAG)
    let tokenizer = HuggingFaceTokenizer::default_embedded()?;
    let chunker = HybridChunker::builder()
        .tokenizer(Box::new(tokenizer))
        .max_tokens(128)  // Default: 128, optimized for embeddings
        .merge_peers(true)
        .build()?;

    // Generate chunks; count characters rather than bytes for the report
    for chunk in chunker.chunk(&doc) {
        println!("Chunk: {} chars", chunk.text.chars().count());
    }

    Ok(())
}

Chunking Features (v1.0.3)

  • Embedded Tokenizer: all-MiniLM-L6-v2 tokenizer bundled in the binary
  • Hybrid Strategy Default: Token-aware chunking optimized for RAG
  • Table Chunking: CSV/XLSX tables are chunked row-by-row in key=value format
  • Smart Merging: Undersized chunks are merged while preserving semantic boundaries
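The row-by-row key=value table chunking described above can be sketched as follows. This is an illustration only: `chunk_table_rows` is a hypothetical function, and the `", "` separator between pairs is an assumption, since the exact output layout is not specified here.

```rust
/// Hypothetical sketch: turn each table row into one chunk of
/// "header=value" pairs, so every chunk carries its own column context.
/// The ", " separator between pairs is an assumption for this example.
fn chunk_table_rows(headers: &[&str], rows: &[Vec<&str>]) -> Vec<String> {
    rows.iter()
        .map(|row| {
            headers
                .iter()
                .zip(row.iter()) // pair each cell with its column header
                .map(|(key, value)| format!("{key}={value}"))
                .collect::<Vec<_>>()
                .join(", ")
        })
        .collect()
}
```

For headers `name`, `qty` and a row `apple`, `3`, this yields the chunk `name=apple, qty=3`. Repeating the header in every chunk keeps a row meaningful when it is retrieved in isolation.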

CLI Options

Usage: docling-rs [OPTIONS] <INPUT>

Arguments:
  <INPUT>  Input file or directory

Options:
  -t, --to <FORMAT>              Output format: json, markdown, text [default: markdown]
  -o, --output-dir <DIR>         Output directory
  -f, --from <FORMATS>           Filter input formats (comma-separated)
      --chunk                    Enable document chunking
      --chunk-strategy <STRAT>   Chunking strategy: hierarchical, hybrid [default: hybrid]
      --chunk-max-tokens <N>     Max tokens per chunk [default: 128]
      --chunk-merge-peers        Merge undersized peer chunks [default: true]
      --tokenizer <MODEL>        HuggingFace tokenizer model [default: embedded all-MiniLM-L6-v2]
      --continue-on-error        Continue on errors (batch mode)
      --abort-on-error           Stop on first error (batch mode)
  -v, --verbose                  Verbose output
  -q, --quiet                    Suppress output
  -h, --help                     Print help
  -V, --version                  Print version
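The difference between `--continue-on-error` and `--abort-on-error` in batch mode can be sketched as a small loop. This is a hedged illustration, not the CLI's implementation: `ErrorPolicy`, `process_batch`, and the `convert` callback are hypothetical names.

```rust
// Hypothetical sketch of batch-mode error handling; not the CLI's code.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ErrorPolicy {
    Continue, // record the failure and move on to the next input
    Abort,    // stop at the first failure
}

/// Convert each input in order with the supplied `convert` callback.
/// Returns the successful outputs and a list of failure messages.
fn process_batch<F>(
    inputs: &[&str],
    policy: ErrorPolicy,
    mut convert: F,
) -> (Vec<String>, Vec<String>)
where
    F: FnMut(&str) -> Result<String, String>,
{
    let mut converted = Vec::new();
    let mut failed = Vec::new();
    for &input in inputs {
        match convert(input) {
            Ok(output) => converted.push(output),
            Err(message) => {
                failed.push(format!("{input}: {message}"));
                if policy == ErrorPolicy::Abort {
                    break;
                }
            }
        }
    }
    (converted, failed)
}
```

With `Continue`, one unreadable file does not prevent the rest of a directory from being converted; with `Abort`, the run stops early so the failure is noticed immediately.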

Chunking Strategies

# Hybrid chunking (default) - token-aware with embedded tokenizer, ideal for RAG
docling-rs document.pdf --chunk --to json

# Custom max tokens
docling-rs document.pdf --chunk --chunk-max-tokens 256 --to json

# Hierarchical chunking - preserves document structure
docling-rs document.pdf --chunk --chunk-strategy hierarchical --to json

# Disable chunk merging for more granular output
docling-rs document.pdf --chunk --chunk-merge-peers false --to json
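To show what merging undersized peer chunks means in practice, here is a simplified sketch. It is not the crate's algorithm: `count_tokens` uses whitespace splitting as a stand-in for the real tokenizer, and `merge_peers` is a hypothetical greedy merge that ignores document structure.

```rust
/// Stand-in token counter: whitespace splitting instead of a real
/// tokenizer (an assumption made to keep the sketch self-contained).
fn count_tokens(text: &str) -> usize {
    text.split_whitespace().count()
}

/// Hypothetical greedy merge: append each chunk to the previous one
/// as long as the combined size stays within `max_tokens`.
fn merge_peers(chunks: &[&str], max_tokens: usize) -> Vec<String> {
    let mut merged: Vec<String> = Vec::new();
    for chunk in chunks {
        // Does this chunk fit into the most recent merged chunk?
        let fits = merged
            .last()
            .map_or(false, |last| count_tokens(last) + count_tokens(chunk) <= max_tokens);
        if fits {
            let last = merged.last_mut().expect("non-empty: checked above");
            last.push(' ');
            last.push_str(chunk);
        } else {
            merged.push(chunk.to_string());
        }
    }
    merged
}
```

With a budget of 3 tokens, the chunks `a b`, `c`, `d e f`, `g` become `a b c`, `d e f`, `g`: the small `c` is absorbed, while `g` stays separate because merging it would exceed the budget. Disabling merging would leave all four chunks as they are.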

Architecture

docling-rs uses a modular workspace structure:

crates/
├── docling-rs/              # Main facade library
├── docling-rs-core/         # Core types and traits
├── docling-rs-cli/          # Command-line interface
└── docling-rs-formats/      # Format backends
    ├── markdown/
    ├── html/
    ├── csv/
    ├── docx/
    ├── xlsx/
    ├── pptx/
    └── pdf/
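One way to picture this layout is a shared trait in the core crate, one implementation per format crate, and a facade that dispatches between them. The sketch below is an assumption about the general pattern, not the actual docling-rs API: `Backend`, `CsvBackend`, `MarkdownBackend`, and `Facade` are hypothetical names, and the conversions are deliberately trivial.

```rust
// Hypothetical sketch of a facade over per-format backends.
trait Backend {
    /// Lowercase file extensions (without the dot) this backend handles.
    fn extensions(&self) -> &'static [&'static str];
    /// Convert raw bytes to plain text (a stand-in for a document model).
    fn convert(&self, bytes: &[u8]) -> Result<String, String>;
}

struct CsvBackend;
struct MarkdownBackend;

impl Backend for CsvBackend {
    fn extensions(&self) -> &'static [&'static str] {
        &["csv"]
    }
    fn convert(&self, bytes: &[u8]) -> Result<String, String> {
        let text = String::from_utf8(bytes.to_vec()).map_err(|e| e.to_string())?;
        Ok(text.replace(',', "\t")) // toy conversion: commas become tabs
    }
}

impl Backend for MarkdownBackend {
    fn extensions(&self) -> &'static [&'static str] {
        &["md", "markdown"]
    }
    fn convert(&self, bytes: &[u8]) -> Result<String, String> {
        let text = String::from_utf8(bytes.to_vec()).map_err(|e| e.to_string())?;
        // Toy conversion: strip leading heading markers from each line.
        let lines: Vec<&str> = text
            .lines()
            .map(|line| line.trim_start_matches('#').trim())
            .collect();
        Ok(lines.join("\n"))
    }
}

struct Facade {
    backends: Vec<Box<dyn Backend>>,
}

impl Facade {
    fn new() -> Self {
        Facade {
            backends: vec![Box::new(CsvBackend), Box::new(MarkdownBackend)],
        }
    }

    /// Pick the first backend that claims the extension and delegate to it.
    fn convert(&self, extension: &str, bytes: &[u8]) -> Result<String, String> {
        let backend = self
            .backends
            .iter()
            .find(|b| b.extensions().contains(&extension))
            .ok_or_else(|| format!("no backend for .{extension}"))?;
        backend.convert(bytes)
    }
}
```

Keeping each backend behind a trait is what makes per-format feature flags practical: a backend that is not compiled in is simply absent from the facade's list.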

Documentation

Full documentation is available in the manual/ directory.

Development

Prerequisites

  • Rust 1.75 or later
  • PDFium library (for PDF support)

Building

# Build all crates
cargo build --workspace

# Build CLI only
cargo build -p docling-rs-cli --release

# Build with all features
cargo build -p docling-rs --features full

Testing

# Run all tests
cargo test --workspace

# Run specific crate tests
cargo test -p docling-rs
cargo test -p docling-rs-cli

# Manual CLI testing
./scripts/test-cli-manual.sh

Linting

cargo clippy --workspace
cargo fmt --check

Acknowledgments

This project is inspired by and pays tribute to IBM's Docling project. While docling-rs is an independent Rust implementation and not affiliated with IBM, it aims to provide similar document processing capabilities for the Rust ecosystem.

License

MIT

Contributing

Contributions are welcome! See CLAUDE.md for development guidelines.
