docling-rs

A native Rust implementation inspired by IBM's Docling Python library.

docling-rs brings document processing capabilities to the Rust ecosystem, offering a high-performance alternative for converting documents into structured, machine-readable formats optimized for RAG (Retrieval-Augmented Generation) and LLM applications.

Why docling-rs?

The original Docling by IBM is an excellent Python library for document processing. This Rust adaptation provides:

Native Performance: No Python runtime required, significantly faster processing
Single Binary Distribution: Easy deployment with self-contained executables
Memory Safety: Rust's guarantees for reliable production use
Cross-Platform: Pre-built binaries for Windows, macOS (Intel & Apple Silicon), and Linux
Batteries Included: PDF support with bundled PDFium library

Features

Multi-format Support: Markdown, HTML, CSV, DOCX, XLSX, PPTX, and PDF
Unified Document Model: All formats convert to a common DoclingDocument structure
Intelligent Chunking: Hierarchical and hybrid chunking strategies for RAG applications
Pure Rust: No Python dependencies, native performance
Cross-platform: Windows, macOS (Intel & Apple Silicon), and Linux
Modular Architecture: Workspace-based design with separate crates per format
CLI Included: Full-featured command-line interface for batch processing
Batteries Included: PDF support with bundled PDFium binaries

Status

v1.0.3 - Production-ready with 7 format backends (all enabled by default)

Component	Status
Core Library	✅ Complete
CLI	✅ Complete
Markdown Backend	✅ Complete
HTML Backend	✅ Complete
CSV Backend	✅ Complete
DOCX Backend	✅ Complete
XLSX Backend	✅ Complete
PPTX Backend	✅ Complete
PDF Backend	✅ Complete
Chunking	✅ Complete
Documentation	✅ Complete

Installation

Pre-built Binaries

Download from Releases:

Windows: docling-rs-x86_64-windows.msi or .zip
macOS Intel: docling-rs-x86_64-macos.dmg
macOS Apple Silicon: docling-rs-aarch64-macos.dmg
Linux: docling-rs-x86_64-linux.tar.gz

Rust Library

Add to your Cargo.toml:

[dependencies]
docling-rs = "1.0"

# Or with minimal features (no PDF/Office)
docling-rs = { version = "1.0", default-features = false, features = ["markdown", "html", "csv"] }

Feature Flags

Feature	Description	Default
`full`	All format backends	Yes
`markdown`	Markdown support	Yes
`html`	HTML support	Yes
`csv`	CSV support	Yes
`docx`	Microsoft Word support	Yes
`xlsx`	Microsoft Excel support	Yes
`pptx`	Microsoft PowerPoint support	Yes
`pdf`	PDF support (requires PDFium)	Yes

Quick Start

CLI Usage

# Convert a single file
docling-rs document.pdf --to markdown --output-dir ./output

# Batch convert a directory
docling-rs ./documents/ --to json --output-dir ./converted

# Enable chunking for RAG (uses embedded all-MiniLM-L6-v2 tokenizer)
docling-rs document.pdf --chunk --to json

# Filter by input format
docling-rs ./docs/ --from pdf,docx --to markdown

Library Usage

use docling_rs::DocumentConverter;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let converter = DocumentConverter::new();

    // Convert a file (format auto-detected)
    let result = converter.convert_file("document.pdf")?;
    let doc = result.document();

    // Export to different formats
    let markdown = doc.to_markdown();
    let text = doc.to_text();
    let json = serde_json::to_string_pretty(&doc)?;

    println!("Document: {}", doc.name());
    println!("Nodes: {}", doc.nodes().len());

    Ok(())
}

Converting from Bytes

use docling_rs::{DocumentConverter, InputFormat};

let converter = DocumentConverter::new();
let result = converter.convert_bytes(
    pdf_bytes,
    "document.pdf".to_string(),
    InputFormat::PDF,
)?;

Supported Formats

Format	Extensions	Description
Markdown	`.md`, `.markdown`	CommonMark and GFM
HTML	`.html`, `.htm`	Semantic HTML extraction
CSV	`.csv`	Tabular data
Word	`.docx`, `.dotx`, `.docm`	Microsoft Word
Excel	`.xlsx`, `.xlsm`, `.xls`	Microsoft Excel
PowerPoint	`.pptx`, `.potx`, `.ppsx`	Microsoft PowerPoint
PDF	`.pdf`	PDF documents

Document Chunking

Intelligent chunking for RAG and embedding applications with embedded tokenizer (sentence-transformers/all-MiniLM-L6-v2):

use docling_rs::{DocumentConverter, chunking::{HybridChunker, HuggingFaceTokenizer}};

let converter = DocumentConverter::new();
let result = converter.convert_file("document.pdf")?;
let doc = result.document();

// Hybrid chunker with embedded tokenizer (recommended for RAG)
let tokenizer = HuggingFaceTokenizer::default_embedded()?;
let chunker = HybridChunker::builder()
    .tokenizer(Box::new(tokenizer))
    .max_tokens(128)  // Default: 128, optimized for embeddings
    .merge_peers(true)
    .build()?;

// Generate chunks
for chunk in chunker.chunk(&doc) {
    println!("Chunk: {} chars", chunk.text.len());
}

Chunking Features (v1.0.3)

Embedded Tokenizer: all-MiniLM-L6-v2 tokenizer bundled in the binary
Hybrid Strategy Default: Token-aware chunking optimized for RAG
Table Chunking: CSV/XLSX tables are chunked row-by-row in key=value format
Smart Merging: Undersized chunks are merged while preserving semantic boundaries

CLI Options

Usage: docling-rs [OPTIONS] <INPUT>

Arguments:
  <INPUT>  Input file or directory

Options:
  -t, --to <FORMAT>              Output format: json, markdown, text [default: markdown]
  -o, --output-dir <DIR>         Output directory
  -f, --from <FORMATS>           Filter input formats (comma-separated)
      --chunk                    Enable document chunking
      --chunk-strategy <STRAT>   Chunking strategy: hierarchical, hybrid [default: hybrid]
      --chunk-max-tokens <N>     Max tokens per chunk [default: 128]
      --chunk-merge-peers        Merge undersized peer chunks [default: true]
      --tokenizer <MODEL>        HuggingFace tokenizer model [default: embedded all-MiniLM-L6-v2]
      --continue-on-error        Continue on errors (batch mode)
      --abort-on-error           Stop on first error (batch mode)
  -v, --verbose                  Verbose output
  -q, --quiet                    Suppress output
  -h, --help                     Print help
  -V, --version                  Print version

Chunking Strategies

# Hybrid chunking (default) - token-aware with embedded tokenizer, ideal for RAG
docling-rs document.pdf --chunk --to json

# Custom max tokens
docling-rs document.pdf --chunk --chunk-max-tokens 256 --to json

# Hierarchical chunking - preserves document structure
docling-rs document.pdf --chunk --chunk-strategy hierarchical --to json

# Disable chunk merging for more granular output
docling-rs document.pdf --chunk --chunk-merge-peers false --to json

Architecture

docling-rs uses a modular workspace structure:

crates/
├── docling-rs/              # Main facade library
├── docling-rs-core/         # Core types and traits
├── docling-rs-cli/          # Command-line interface
└── docling-rs-formats/      # Format backends
    ├── markdown/
    ├── html/
    ├── csv/
    ├── docx/
    ├── xlsx/
    ├── pptx/
    └── pdf/

Documentation

Full documentation is available in the manual/ directory:

Overview - Introduction and features
Installation - Setup guide
Quick Start - Get started in 5 minutes
API Reference - Library API documentation
CLI Usage - Command-line interface guide
Supported Formats - Format details
Chunking for RAG - Chunking strategies
Architecture - System design

Development

Prerequisites

Rust 1.75 or later
PDFium library (for PDF support)

Building

# Build all crates
cargo build --workspace

# Build CLI only
cargo build -p docling-rs-cli --release

# Build with all features
cargo build -p docling-rs --features full

Testing

# Run all tests
cargo test --workspace

# Run specific crate tests
cargo test -p docling-rs
cargo test -p docling-rs-cli

# Manual CLI testing
./scripts/test-cli-manual.sh

Linting

cargo clippy --workspace
cargo fmt --check

Acknowledgments

This project is inspired by and pays tribute to IBM's Docling project. While docling-rs is an independent Rust implementation and not affiliated with IBM, it aims to provide similar document processing capabilities for the Rust ecosystem.

License

MIT

Contributing

Contributions are welcome! See CLAUDE.md for development guidelines.

Name		Name	Last commit message	Last commit date
Latest commit History 61 Commits
.claude/commands		.claude/commands
.github/workflows		.github/workflows
.specify		.specify
crates		crates
docs		docs
examples		examples
installer/windows		installer/windows
manual		manual
pdfium		pdfium
scripts		scripts
specs		specs
.gitignore		.gitignore
CLAUDE.md		CLAUDE.md
CONTRIBUTING.md		CONTRIBUTING.md
Cargo.toml		Cargo.toml
README.md		README.md
rustfmt.toml		rustfmt.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

docling-rs

Why docling-rs?

Features

Status

Installation

Pre-built Binaries

Rust Library

Feature Flags

Quick Start

CLI Usage

Library Usage

Converting from Bytes

Supported Formats

Document Chunking

Chunking Features (v1.0.3)

CLI Options

Chunking Strategies

Architecture

Documentation

Development

Prerequisites

Building

Testing

Linting

Acknowledgments

License

Contributing

About

Uh oh!

Releases 2

Packages

Contributors 2

Uh oh!

Languages

carles-abarca/docling-rs

Folders and files

Latest commit

History

Repository files navigation

docling-rs

Why docling-rs?

Features

Status

Installation

Pre-built Binaries

Rust Library

Feature Flags

Quick Start

CLI Usage

Library Usage

Converting from Bytes

Supported Formats

Document Chunking

Chunking Features (v1.0.3)

CLI Options

Chunking Strategies

Architecture

Documentation

Development

Prerequisites

Building

Testing

Linting

Acknowledgments

License

Contributing

About

Topics

Resources

Contributing

Uh oh!

Stars

Watchers

Forks

Releases 2

Packages 0

Contributors 2

Uh oh!

Languages

Packages