A native Rust implementation inspired by IBM's Docling Python library.
docling-rs brings document processing capabilities to the Rust ecosystem, offering a high-performance alternative for converting documents into structured, machine-readable formats optimized for RAG (Retrieval-Augmented Generation) and LLM applications.
The original Docling by IBM is an excellent Python library for document processing. This Rust adaptation provides:
- Native Performance: No Python runtime required, significantly faster processing
- Single Binary Distribution: Easy deployment with self-contained executables
- Memory Safety: Rust's guarantees for reliable production use
- Cross-Platform: Pre-built binaries for Windows, macOS (Intel & Apple Silicon), and Linux
- Batteries Included: PDF support with bundled PDFium library
- Multi-format Support: Markdown, HTML, CSV, DOCX, XLSX, PPTX, and PDF
- Unified Document Model: All formats convert to a common `DoclingDocument` structure
- Intelligent Chunking: Hierarchical and hybrid chunking strategies for RAG applications
- Pure Rust: No Python dependencies, native performance
- Cross-platform: Windows, macOS (Intel & Apple Silicon), and Linux
- Modular Architecture: Workspace-based design with separate crates per format
- CLI Included: Full-featured command-line interface for batch processing
- Batteries Included: PDF support with bundled PDFium binaries
v1.0.3 - Production-ready with 7 format backends (all enabled by default)
| Component | Status |
|---|---|
| Core Library | ✅ Complete |
| CLI | ✅ Complete |
| Markdown Backend | ✅ Complete |
| HTML Backend | ✅ Complete |
| CSV Backend | ✅ Complete |
| DOCX Backend | ✅ Complete |
| XLSX Backend | ✅ Complete |
| PPTX Backend | ✅ Complete |
| PDF Backend | ✅ Complete |
| Chunking | ✅ Complete |
| Documentation | ✅ Complete |
Download from Releases:
- Windows: `docling-rs-x86_64-windows.msi` or `.zip`
- macOS Intel: `docling-rs-x86_64-macos.dmg`
- macOS Apple Silicon: `docling-rs-aarch64-macos.dmg`
- Linux: `docling-rs-x86_64-linux.tar.gz`
Add to your Cargo.toml:
[dependencies]
docling-rs = "1.0"
# Or with minimal features (no PDF/Office)
docling-rs = { version = "1.0", default-features = false, features = ["markdown", "html", "csv"] }

| Feature | Description | Default |
|---|---|---|
| `full` | All format backends | Yes |
| `markdown` | Markdown support | Yes |
| `html` | HTML support | Yes |
| `csv` | CSV support | Yes |
| `docx` | Microsoft Word support | Yes |
| `xlsx` | Microsoft Excel support | Yes |
| `pptx` | Microsoft PowerPoint support | Yes |
| `pdf` | PDF support (requires PDFium) | Yes |
# Convert a single file
docling-rs document.pdf --to markdown --output-dir ./output
# Batch convert a directory
docling-rs ./documents/ --to json --output-dir ./converted
# Enable chunking for RAG (uses embedded all-MiniLM-L6-v2 tokenizer)
docling-rs document.pdf --chunk --to json
# Filter by input format
docling-rs ./docs/ --from pdf,docx --to markdown

use docling_rs::DocumentConverter;
fn main() -> Result<(), Box<dyn std::error::Error>> {
let converter = DocumentConverter::new();
// Convert a file (format auto-detected)
let result = converter.convert_file("document.pdf")?;
let doc = result.document();
// Export to different formats
let markdown = doc.to_markdown();
let text = doc.to_text();
let json = serde_json::to_string_pretty(&doc)?;
println!("Document: {}", doc.name());
println!("Nodes: {}", doc.nodes().len());
Ok(())
}

use docling_rs::{DocumentConverter, InputFormat};
let converter = DocumentConverter::new();
let result = converter.convert_bytes(
pdf_bytes,
"document.pdf".to_string(),
InputFormat::PDF,
)?;

| Format | Extensions | Description |
|---|---|---|
| Markdown | `.md`, `.markdown` | CommonMark and GFM |
| HTML | `.html`, `.htm` | Semantic HTML extraction |
| CSV | `.csv` | Tabular data |
| Word | `.docx`, `.dotx`, `.docm` | Microsoft Word |
| Excel | `.xlsx`, `.xlsm`, `.xls` | Microsoft Excel |
| PowerPoint | `.pptx`, `.potx`, `.ppsx` | Microsoft PowerPoint |
| PDF | `.pdf` | PDF documents |
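
The extension table above can be read as a simple lookup. The following is a minimal sketch of extension-based detection using only the standard library. `SketchFormat` and `detect_format` are hypothetical names introduced for illustration, not the crate's API (this README only shows `InputFormat::PDF`), and the real auto-detection may well inspect more than the file extension.

```rust
use std::path::Path;

/// Hypothetical stand-in for the crate's format enum. Only `InputFormat::PDF`
/// appears in this README, so these variant names are assumptions.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SketchFormat {
    Markdown,
    Html,
    Csv,
    Docx,
    Xlsx,
    Pptx,
    Pdf,
}

/// Map a path's extension to a format, following the table above.
/// Returns `None` when the extension is missing or not listed.
fn detect_format(path: &Path) -> Option<SketchFormat> {
    // Lowercase the extension so that `REPORT.PDF` matches `pdf`.
    let ext = path.extension()?.to_str()?.to_ascii_lowercase();
    let format = match ext.as_str() {
        "md" | "markdown" => SketchFormat::Markdown,
        "html" | "htm" => SketchFormat::Html,
        "csv" => SketchFormat::Csv,
        "docx" | "dotx" | "docm" => SketchFormat::Docx,
        "xlsx" | "xlsm" | "xls" => SketchFormat::Xlsx,
        "pptx" | "potx" | "ppsx" => SketchFormat::Pptx,
        "pdf" => SketchFormat::Pdf,
        _ => return None,
    };
    Some(format)
}
```

Calling `detect_format(Path::new("report.docx"))` yields `Some(SketchFormat::Docx)`, while a path without a listed extension yields `None`, which a caller could treat as "skip this file" in batch mode.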
Intelligent chunking for RAG and embedding applications with embedded tokenizer (sentence-transformers/all-MiniLM-L6-v2):
use docling_rs::{DocumentConverter, chunking::{HybridChunker, HuggingFaceTokenizer}};
let converter = DocumentConverter::new();
let result = converter.convert_file("document.pdf")?;
let doc = result.document();
// Hybrid chunker with embedded tokenizer (recommended for RAG)
let tokenizer = HuggingFaceTokenizer::default_embedded()?;
let chunker = HybridChunker::builder()
.tokenizer(Box::new(tokenizer))
.max_tokens(128) // Default: 128, optimized for embeddings
.merge_peers(true)
.build()?;
// Generate chunks
for chunk in chunker.chunk(&doc) {
println!("Chunk: {} chars", chunk.text.len());
}

- Embedded Tokenizer: `all-MiniLM-L6-v2` tokenizer bundled in the binary
- Hybrid Strategy Default: Token-aware chunking optimized for RAG
- Table Chunking: CSV/XLSX tables are chunked row-by-row in `key=value` format
- Smart Merging: Undersized chunks are merged while preserving semantic boundaries
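
To make the row-by-row `key=value` table chunking concrete, here is a small sketch that turns a header row and data rows into one text chunk per row. `rows_to_chunks` is a hypothetical helper, and the `, ` separator and the handling of short rows are assumptions; the exact output produced by docling-rs may differ.

```rust
/// Render each data row as one chunk of `header=value` pairs.
/// The `, ` separator is an assumption made for this sketch. Cells beyond
/// the shorter of the header row and the data row are dropped by `zip`.
fn rows_to_chunks(headers: &[&str], rows: &[Vec<&str>]) -> Vec<String> {
    rows.iter()
        .map(|row| {
            headers
                .iter()
                .zip(row.iter())
                .map(|(key, value)| format!("{key}={value}"))
                .collect::<Vec<_>>()
                .join(", ")
        })
        .collect()
}
```

With headers `name` and `qty` and a row `apple`, `3`, this produces the chunk `name=apple, qty=3`. Repeating the column name in every chunk keeps each row interpretable on its own once it has been embedded separately from the header.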
Usage: docling-rs [OPTIONS] <INPUT>
Arguments:
<INPUT> Input file or directory
Options:
-t, --to <FORMAT> Output format: json, markdown, text [default: markdown]
-o, --output-dir <DIR> Output directory
-f, --from <FORMATS> Filter input formats (comma-separated)
--chunk Enable document chunking
--chunk-strategy <STRAT> Chunking strategy: hierarchical, hybrid [default: hybrid]
--chunk-max-tokens <N> Max tokens per chunk [default: 128]
--chunk-merge-peers Merge undersized peer chunks [default: true]
--tokenizer <MODEL> HuggingFace tokenizer model [default: embedded all-MiniLM-L6-v2]
--continue-on-error Continue on errors (batch mode)
--abort-on-error Stop on first error (batch mode)
-v, --verbose Verbose output
-q, --quiet Suppress output
-h, --help Print help
-V, --version Print version
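
As a rough illustration of how a token budget (`--chunk-max-tokens`) and peer merging (`--chunk-merge-peers`) can interact, the sketch below greedily merges adjacent chunks while the result stays within the budget. It is not the crate's implementation: `count_tokens` and `merge_peers` are hypothetical helpers, whitespace splitting stands in for the embedded all-MiniLM-L6-v2 tokenizer, and every chunk is treated as a peer, so the semantic boundaries that the real chunker preserves are not modeled here.

```rust
/// Approximate a token count by splitting on whitespace. This is a stand-in
/// for a real tokenizer, which would generally report different counts.
fn count_tokens(text: &str) -> usize {
    text.split_whitespace().count()
}

/// Greedily merge adjacent chunks while the merged text stays within
/// `max_tokens`. A chunk that exceeds the budget on its own is kept as-is.
fn merge_peers(chunks: &[&str], max_tokens: usize) -> Vec<String> {
    let mut merged: Vec<String> = Vec::new();
    for &chunk in chunks {
        // Check whether appending to the previous chunk stays within budget.
        let fits = merged
            .last()
            .map_or(false, |last| count_tokens(last) + count_tokens(chunk) <= max_tokens);
        if fits {
            let last = merged.last_mut().expect("non-empty because `fits` is true");
            last.push(' ');
            last.push_str(chunk);
        } else {
            merged.push(chunk.to_string());
        }
    }
    merged
}
```

Under this sketch, a larger budget yields fewer, longer chunks, and skipping the merge step leaves every input chunk separate, which matches the "more granular output" described for disabling merging.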
# Hybrid chunking (default) - token-aware with embedded tokenizer, ideal for RAG
docling-rs document.pdf --chunk --to json
# Custom max tokens
docling-rs document.pdf --chunk --chunk-max-tokens 256 --to json
# Hierarchical chunking - preserves document structure
docling-rs document.pdf --chunk --chunk-strategy hierarchical --to json
# Disable chunk merging for more granular output
docling-rs document.pdf --chunk --chunk-merge-peers false --to json

docling-rs uses a modular workspace structure:
crates/
├── docling-rs/ # Main facade library
├── docling-rs-core/ # Core types and traits
├── docling-rs-cli/ # Command-line interface
└── docling-rs-formats/ # Format backends
├── markdown/
├── html/
├── csv/
├── docx/
├── xlsx/
├── pptx/
└── pdf/
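
One way to picture the layout above is a shared trait in the core crate, with each format crate providing an implementation that produces the same document type. The sketch below is purely illustrative: `SketchDocument`, `Backend`, `CsvBackend`, `MarkdownBackend`, and `backend_for` are hypothetical names, and the actual traits and types in `docling-rs-core` are not shown in this README.

```rust
/// Hypothetical common document type. The real `DoclingDocument` is richer.
#[derive(Debug, PartialEq)]
struct SketchDocument {
    name: String,
    nodes: Vec<String>,
}

/// Hypothetical trait that every format backend would implement.
trait Backend {
    fn convert(&self, name: &str, input: &str) -> SketchDocument;
}

struct CsvBackend;
struct MarkdownBackend;

impl Backend for CsvBackend {
    /// Each non-empty line becomes one node.
    fn convert(&self, name: &str, input: &str) -> SketchDocument {
        let nodes = input
            .lines()
            .map(str::trim)
            .filter(|line| !line.is_empty())
            .map(String::from)
            .collect();
        SketchDocument { name: name.to_string(), nodes }
    }
}

impl Backend for MarkdownBackend {
    /// Each blank-line-separated block becomes one node.
    fn convert(&self, name: &str, input: &str) -> SketchDocument {
        let nodes = input
            .split("\n\n")
            .map(str::trim)
            .filter(|block| !block.is_empty())
            .map(String::from)
            .collect();
        SketchDocument { name: name.to_string(), nodes }
    }
}

/// Pick a backend by extension; unknown extensions yield `None`.
fn backend_for(ext: &str) -> Option<Box<dyn Backend>> {
    match ext {
        "csv" => Some(Box::new(CsvBackend)),
        "md" | "markdown" => Some(Box::new(MarkdownBackend)),
        _ => None,
    }
}
```

Because both backends return the same `SketchDocument`, downstream steps such as export or chunking only need to handle one type, which is the idea behind a unified document model.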
Full documentation is available in the manual/ directory:
- Overview - Introduction and features
- Installation - Setup guide
- Quick Start - Get started in 5 minutes
- API Reference - Library API documentation
- CLI Usage - Command-line interface guide
- Supported Formats - Format details
- Chunking for RAG - Chunking strategies
- Architecture - System design
- Rust 1.75 or later
- PDFium library (for PDF support)
# Build all crates
cargo build --workspace
# Build CLI only
cargo build -p docling-rs-cli --release
# Build with all features
cargo build -p docling-rs --features full

# Run all tests
cargo test --workspace
# Run specific crate tests
cargo test -p docling-rs
cargo test -p docling-rs-cli
# Manual CLI testing
./scripts/test-cli-manual.sh

cargo clippy --workspace
cargo fmt --check

This project is inspired by and pays tribute to IBM's Docling project. While docling-rs is an independent Rust implementation and not affiliated with IBM, it aims to provide similar document processing capabilities for the Rust ecosystem.
MIT
Contributions are welcome! See CLAUDE.md for development guidelines.