Feature Request: Integrate Rust-native PaddleOCR without Python dependencies #302

@Haoxincode

Description

Technical Feasibility Study: Integrating Python-Free Rust-Native PaddleOCR into Kreuzberg Framework

1. Overview and Background Analysis

1.1 Research Background and Objectives

Kreuzberg, as an emerging high-performance document intelligence processing framework, derives its core value proposition from leveraging Rust's memory safety, zero-cost abstractions, and exceptional concurrency capabilities to address performance bottlenecks and deployment complexity issues inherent in traditional document processing (especially Python-based ecosystems).

In document intelligence processing pipelines, Optical Character Recognition (OCR) is a critical component that directly determines whether unstructured data (such as scanned documents and images) can be transformed into searchable, analyzable structured information.

Currently, while Kreuzberg has implemented its core logic in Rust, it still partially relies on external ecosystems for OCR capabilities. Traditional integration approaches often invoke Tesseract (C++ library) via FFI or call EasyOCR/PaddleOCR through Python bindings. This "hybrid architecture," while leveraging mature existing models, introduces significant engineering pain points:

Deployment Complexity (Dependency Hell): Production environments must maintain Python runtime, manage pip dependencies, handle virtual environments, and ensure compatibility across different versions of deep learning frameworks (PyTorch/PaddlePaddle). This contradicts Rust's minimalist deployment philosophy of generating single static binaries.

Performance Bottlenecks: Python's Global Interpreter Lock (GIL) limits multi-threaded concurrency, and cross-language data transfer often involves memory copying, increasing latency.

Resource Overhead: Loading the complete Python interpreter and deep learning frameworks introduces a substantial memory baseline (typically 500MB+), making it unsuitable for edge devices or high-density container deployments.

This report aims to thoroughly explore the feasibility of an alternative approach: integrating Baidu's PaddleOCR into the Kreuzberg framework using pure Rust or Rust-native bindings without introducing any Python runtime. The goal is to achieve recognition accuracy comparable to the original Python implementation while significantly reducing resource consumption and simplifying deployment.

1.2 PaddleOCR's Technical Standing

PaddleOCR (PP-OCR series) has become an industry de facto standard due to its excellent performance in Chinese and multilingual recognition, lightweight model design, and robustness for complex layouts (tables, distorted text). PP-OCRv4 and v5 versions, in particular, introduced more efficient backbone networks and data augmentation strategies, delivering excellent performance on both server and mobile platforms.

Therefore, whether these high-quality pretrained models can be reused in Rust is key to enhancing Kreuzberg's competitiveness.


2. Core Architecture Analysis and Integration Strategy

2.1 Kreuzberg v4's Plugin Architecture

Kreuzberg v4's architecture is deeply influenced by Rust language features, emphasizing modularity and extensibility. Its core defines interface specifications for various components through the trait system, allowing developers to inject custom implementations as plugins.

For OCR functionality, Kreuzberg defines the OcrBackend trait. This is an async trait designed to decouple specific OCR engine implementations. This means the core framework doesn't care whether the underlying layer calls Tesseract's C API, makes network requests to cloud APIs, or runs ONNX inference locally.

This loose coupling provides a perfect entry point for integrating Rust-native PaddleOCR.

By implementing the OcrBackend trait, we can build an inference path that completely bypasses the Python layer. In this architecture, Kreuzberg's main process directly manages image data in memory and passes it to the Rust-implemented OCR module. This module handles image preprocessing, model inference (via Rust-bound inference engines), and post-processing (decoding), ultimately returning structured text results.

2.2 Interface Definition and Data Flow

For seamless integration, we must strictly adhere to the OcrBackend interface contract. Based on Rust async programming best practices and Kreuzberg documentation, this interface typically contains the following key methods:

```rust
#[async_trait]
pub trait OcrBackend: Send + Sync {
    /// Initialize the backend; models are typically loaded into memory here.
    async fn init(config: &OcrConfig) -> Result<Self>
    where
        Self: Sized;

    /// Execute OCR recognition.
    /// `input`: image byte stream or decoded pixel data.
    /// `options`: per-request configuration overrides.
    async fn scan(
        &self,
        input: &OcrInput,
        options: Option<&ScanOptions>,
    ) -> Result<OcrOutput>;
}
```

When implementing this interface, special attention must be paid to zero-copy data flow. Kreuzberg internally likely uses the image crate's DynamicImage structure or raw &[u8] byte slices for image data transfer. Our Rust PaddleOCR implementation must be able to directly consume this memory data, rather than requiring file paths like some Python scripts, thereby avoiding unnecessary disk I/O overhead.
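To make the zero-copy idea concrete, here is a minimal sketch (pure standard library, not Kreuzberg's actual API) of consuming an in-memory RGB8 buffer directly and producing the planar CHW `f32` layout that inference engines expect, with no file path or disk round-trip involved. The function name and layout choice are illustrative assumptions.

```rust
/// Convert a tightly packed in-memory HWC RGB8 buffer into a planar CHW
/// f32 tensor scaled to [0, 1], consuming the bytes directly with no disk
/// I/O. (Sketch only: a real backend would build an `ndarray` or MNN tensor.)
fn hwc_u8_to_chw_f32(pixels: &[u8], width: usize, height: usize) -> Vec<f32> {
    assert_eq!(pixels.len(), width * height * 3, "expected packed RGB8");
    let plane = width * height;
    let mut chw = vec![0.0f32; plane * 3];
    for i in 0..plane {
        for c in 0..3 {
            // Channel c of pixel i moves from interleaved HWC to planar CHW.
            chw[c * plane + i] = pixels[i * 3 + c] as f32 / 255.0;
        }
    }
    chw
}
```

Because the function takes `&[u8]`, it can consume a `DynamicImage`'s raw buffer or a decoded byte slice from Kreuzberg's parsers without an intermediate copy to disk.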


3. Evaluation of PaddleOCR Implementation Approaches in the Rust Ecosystem

To achieve "no Python dependency," a more pragmatic and efficient approach than building a from-scratch pure-Rust engine is to leverage Rust's powerful FFI capabilities to bind high-performance C++ inference engines (such as ONNX Runtime or MNN) while rewriting all preprocessing and post-processing logic in pure Rust.

3.1 Option 1: MNN-based rusto-rs

rusto-rs is currently the most complete PaddleOCR reimplementation in the open-source Rust community.

  • Tech Stack: Uses MNN (Alibaba's lightweight inference engine) via Rust FFI.
  • Image Processing: Critically, rusto-rs completely removes OpenCV dependency. Developers reimplemented all PaddleOCR image preprocessing algorithms using pure Rust image and imageproc crates, including complex text box contour detection and perspective transformation.
  • Model Support: Explicitly supports PP-OCRv4 and PP-OCRv5 models, with toolchains for converting Paddle models to MNN format.
  • Integration Advantage: Since it already implements the complete pipeline from image::DynamicImage to inference results, integration into Kreuzberg requires only a thin wrapper layer.

3.2 Option 2: ONNX Runtime-based oar-ocr / paddle-ocr-rs

This approach leverages the industry-standard model exchange format ONNX.

  • Tech Stack: Uses ort crate (Rust bindings for Microsoft ONNX Runtime) as the inference backend.
  • Ecosystem Advantage: PaddlePaddle officially provides comprehensive tools (paddle2onnx) for exporting models to ONNX format. ONNX Runtime has extensive hardware support, easily utilizing NVIDIA GPU (CUDA/TensorRT), Apple CoreML, and even AVX512 instruction set acceleration on CPU.
  • Implementation Details: Projects like oar-ocr have encapsulated DBNet (detection) and CRNN (recognition) post-processing logic. Compared to MNN, ONNX Runtime has a more mature server-side ecosystem, and the ort crate is actively maintained.

3.3 Option 3: Pure Rust Inference Engine ocrs

ocrs represents the Rust community's exploration toward "pure Rust."

  • Features: Uses the RTen engine, a completely Rust-written inference runtime requiring no C++ library linking.
  • Limitations: While most aligned with "pure Rust" philosophy, it currently mainly supports specific PyTorch-exported models and has incomplete support for PaddleOCR-specific operators.

3.4 Selection Conclusion

Considering engineering feasibility, maintenance costs, and performance, Option 2 (ONNX Runtime-based) is currently the best choice, with Option 1 (rusto-rs) as a close second.

While both options rely on C++-written inference engines (ORT or MNN) under the hood, they expose pure Rust interfaces, and the compiled artifacts do not depend on a Python environment on the target system. This fully satisfies the core requirement of "no Python dependency."


4. Deep Technical Implementation Path

4.1 Precise Replication of Preprocessing Pipeline

OCR accuracy is extremely sensitive to image preprocessing. PaddleOCR (Python version) heavily uses OpenCV functionality. In Rust, we must replicate this logic using the image crate and ensure pixel-level alignment; otherwise, inference accuracy will degrade.

Resizing:

  • Detection Stage: Must implement ResizeShort-like logic, adjusting the image's shortest edge to multiples of 32 while maintaining aspect ratio. Rust's image::resize provides multiple interpolation algorithms (Nearest, Triangle, CatmullRom, Gaussian, Lanczos3). The algorithm closest to OpenCV's INTER_LINEAR (typically FilterType::Triangle) must be selected to ensure input tensor consistency.
  • Recognition Stage: Text box slices must be scaled to a fixed height (typically 48px for PP-OCRv4/v5), with width scaled proportionally and then padded to a uniform batch width.
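The detection-stage resize logic can be sketched as follows. This is an illustrative pure-Rust helper, not the official algorithm: the `target_short` parameter and round-to-nearest strategy are assumptions to verify against PaddleOCR's own resize code (the official implementation also caps the longest side via `limit_side_len`).

```rust
/// Compute detection-model input dimensions: scale so the shorter edge
/// reaches `target_short` while keeping aspect ratio, then round both
/// edges to multiples of 32 as DBNet-style backbones require.
fn det_resize_dims(width: u32, height: u32, target_short: u32) -> (u32, u32) {
    let short = width.min(height).max(1);
    let scale = target_short as f64 / short as f64;
    // Round each scaled edge to the nearest multiple of 32, at least 32.
    let round32 = |v: f64| -> u32 { ((v / 32.0).round() as u32).max(1) * 32 };
    (round32(width as f64 * scale), round32(height as f64 * scale))
}
```

The actual image resampling would then run through `image::imageops::resize` with `FilterType::Triangle` to approximate OpenCV's `INTER_LINEAR`, as discussed above.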

Normalization:
Image data must undergo (pixel - mean) / std operations. This can be vectorized using Rust's ndarray library or SIMD intrinsics, matching or exceeding NumPy's vectorized throughput while avoiding interpreter overhead.
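A minimal in-place sketch of this step, operating on a planar CHW tensor already scaled to [0, 1]. The mean/std constants follow the ImageNet convention commonly used in PaddleOCR's detection preprocessing, but should be verified against the specific model's config.

```rust
/// Apply per-channel (pixel - mean) / std normalization in place on a
/// planar CHW f32 tensor (3 channels, `plane` = width * height).
/// Constants are the ImageNet-style values assumed here; confirm them
/// against the target model's preprocessing config.
fn normalize_chw(chw: &mut [f32], plane: usize) {
    const MEAN: [f32; 3] = [0.485, 0.456, 0.406];
    const STD: [f32; 3] = [0.229, 0.224, 0.225];
    for c in 0..3 {
        let (m, s) = (MEAN[c], STD[c]);
        for v in &mut chw[c * plane..(c + 1) * plane] {
            *v = (*v - m) / s;
        }
    }
}
```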

Geometric Transformation:
For detected tilted text boxes, perspective transformation is needed to "straighten" them. The rusto-rs project contains a pure Rust implementation of the get_rotate_crop_image function, one of the core snippets most worth reusing during integration.

4.2 Inference Engine Lifecycle and Concurrency Management

Kreuzberg, as a high-performance framework, typically runs in an async environment (Tokio Runtime).

Model Persistence: Loading ONNX models (det.onnx, rec.onnx) is time-consuming. Loading must complete during OcrBackend initialization, with Session objects persistently stored in structs. Since Session is thread-safe (in ort), we can wrap it in Arc (atomic reference counting) to share a single model instance across multiple concurrent OCR tasks, greatly saving memory.

Async Computation Isolation: Although ort supports multi-threaded inference, inference itself is CPU-intensive. Directly calling inference functions in Tokio async tasks blocks the event loop, slowing server responses. Therefore, tokio::task::spawn_blocking must be used to dispatch inference tasks to dedicated blocking thread pools.
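The Arc-sharing pattern can be illustrated with standard-library primitives. `MockSession` below is a hypothetical stand-in for an `ort` `Session`; it is loaded once and shared by reference count across workers. Under Kreuzberg's Tokio runtime the per-task closure would be dispatched via `tokio::task::spawn_blocking` rather than a raw thread, but the ownership structure is identical.

```rust
use std::sync::Arc;
use std::thread;

/// Hypothetical stand-in for an `ort::Session`: expensive to create,
/// cheap to share, safe to call from multiple threads.
struct MockSession {
    model_path: String,
}

impl MockSession {
    fn run(&self, input_len: usize) -> usize {
        // A real session would execute the ONNX graph loaded from
        // `self.model_path`; here we just echo a derived size.
        debug_assert!(self.model_path.ends_with(".onnx"));
        input_len * 2
    }
}

fn run_parallel_ocr() -> Vec<usize> {
    // Load the "model" once; Arc lets every worker share the same instance.
    let session = Arc::new(MockSession { model_path: "det.onnx".into() });
    let handles: Vec<_> = (1..=4)
        .map(|i| {
            let s = Arc::clone(&session);
            // Under Tokio: tokio::task::spawn_blocking(move || s.run(i))
            thread::spawn(move || s.run(i))
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}
```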

4.3 Post-processing and Decoding Algorithms

Model outputs are merely tensors requiring complex post-processing for text conversion.

DBNet Post-processing: Detection model output is a probability map. Rust-implemented bitmap generation and polygon expansion algorithms (Unclip) are needed to extract final text box coordinates.
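A heavily simplified sketch of the first half of this step: thresholding the probability map and grouping foreground pixels into connected components with axis-aligned bounding boxes. Real PaddleOCR post-processing additionally extracts polygon contours, scores them, and expands them with the Unclip algorithm; none of that is shown here.

```rust
/// Binarize a row-major probability map and flood-fill connected
/// components into inclusive (x0, y0, x1, y1) bounding boxes.
/// Simplified sketch: real DBNet post-processing uses polygon
/// contours plus Unclip expansion, not axis-aligned boxes.
fn prob_map_to_boxes(
    prob: &[f32],
    w: usize,
    h: usize,
    thresh: f32,
) -> Vec<(usize, usize, usize, usize)> {
    let mut seen = vec![false; w * h];
    let mut boxes = Vec::new();
    for start in 0..w * h {
        if seen[start] || prob[start] < thresh {
            continue;
        }
        // Flood fill one component, tracking its pixel extremes.
        let (mut x0, mut y0, mut x1, mut y1) = (w, h, 0usize, 0usize);
        let mut stack = vec![start];
        seen[start] = true;
        while let Some(idx) = stack.pop() {
            let (x, y) = (idx % w, idx / w);
            x0 = x0.min(x); y0 = y0.min(y);
            x1 = x1.max(x); y1 = y1.max(y);
            let mut neighbors = Vec::with_capacity(4);
            if x > 0 { neighbors.push(idx - 1); }
            if x + 1 < w { neighbors.push(idx + 1); }
            if y > 0 { neighbors.push(idx - w); }
            if y + 1 < h { neighbors.push(idx + w); }
            for n in neighbors {
                if !seen[n] && prob[n] >= thresh {
                    seen[n] = true;
                    stack.push(n);
                }
            }
        }
        boxes.push((x0, y0, x1, y1));
    }
    boxes
}
```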

CTC Decoding: Recognition model output is a character index sequence. CTC (Connectionist Temporal Classification) decoding logic must be implemented:

  • Remove consecutive duplicate characters.
  • Remove blank tokens.
  • Map indices back to UTF-8 characters using dictionary files (e.g., ppocr_keys_v1.txt).

This step is extremely efficient in Rust using HashMap or Vec lookups.
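The three bullets above reduce to a few lines of greedy decoding. The sketch below assumes the common PP-OCR convention that index 0 is the blank token and dictionary entries start at index 1; verify this against the dictionary file actually shipped with the model.

```rust
/// Greedy CTC decoding: collapse consecutive repeats, drop the blank
/// token (index 0 by assumed PP-OCR convention), and map the remaining
/// indices through the character dictionary.
fn ctc_decode(indices: &[usize], dict: &[&str]) -> String {
    let mut out = String::new();
    let mut prev = usize::MAX;
    for &idx in indices {
        if idx != prev && idx != 0 {
            // Index 0 is blank, so character i lives at dict[idx - 1].
            if let Some(ch) = dict.get(idx - 1) {
                out.push_str(ch);
            }
        }
        prev = idx;
    }
    out
}
```

Note how a blank between two identical indices separates them, so genuine doubled characters survive the repeat-collapsing step.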


5. Performance and Resource Efficiency Analysis

Migration to Rust-native implementation yields multi-dimensional performance benefits.

5.1 Memory and Startup Overhead

Memory Consumption: Python solutions require loading the Python interpreter, NumPy, PaddlePaddle/PyTorch frameworks, and their dependencies, with a memory baseline typically between 500MB and 1GB. In contrast, ort or MNN-based Rust implementations only need lightweight inference libraries and model weight files. Testing shows rusto-rs runtime peak memory can be kept to roughly 200MB—a huge advantage for resource-constrained container environments.

Cold Start Speed: Removing Python interpreter initialization makes Rust binary startup nearly instantaneous, crucial for Serverless deployment scenarios.

5.2 Computational Throughput

Eliminating the GIL: In high-concurrency scenarios, Python's GIL limits the parallel efficiency of CPU-bound tasks such as image preprocessing. Rust solutions can easily parallelize preprocessing over data with the Rayon library, fully utilizing multi-core CPU performance.

Zero-copy Data: Within Kreuzberg, image data flowing from file parsers to the OCR engine involves only pointer or reference passing in Rust. In Python hybrid architectures, data crossing between Rust heap memory and Python heap memory often requires serialization/deserialization (or Buffer Protocol copying), producing significant latency in bulk image processing.


6. Implementation Roadmap and Recommendations

Based on the above technical analysis, the following steps are recommended:

1. Proof of Concept (PoC)

  • Create an independent Rust Crate, incorporating oar-ocr or rusto-rs as dependencies.
  • Download PP-OCRv4 ONNX model files.
  • Write test cases with standard test images, verifying Rust output text matches Python PaddleOCR, focusing on Chinese recognition accuracy.

2. Build Adapter

  • Create new kreuzberg-paddle plugin module in Kreuzberg repository.
  • Implement OcrBackend trait.
  • In scan method, convert Kreuzberg image data to ndarray (for ONNX) or MNN Tensor.
  • Handle coordinate system mapping to ensure OCR-returned Bounding Boxes correctly overlay original PDFs or images.

3. Resolve Build and Distribution Issues

  • Static Linking: By default, ort dynamically links C++ libraries. For simplified distribution, explore ort's static linking features or pre-configure libonnxruntime.so in Dockerfiles.
  • Model Management: Following ocrs design, check local cache directories for model files during init, auto-downloading from HuggingFace or other CDNs if absent.

4. Performance Tuning

  • Use criterion crate for benchmark testing preprocessing, inference, and post-processing stages separately, identifying performance hotspots for optimization.

Conclusion

Completely feasible with significant benefits.

Integrating pure Rust PaddleOCR into Kreuzberg is not only technically feasible but a critical step toward the framework's "high-performance, easy-deployment" vision. By leveraging mature Rust ecosystem libraries like ort or rusto-rs, we can successfully remove the heavyweight Python runtime while retaining PaddleOCR's powerful multilingual recognition capabilities. This resolves the dependency hell plaguing developers and establishes a solid foundation for building high-throughput, low-latency document intelligence processing pipelines.


Related Resources

  • rusto-rs - MNN-based PaddleOCR in Rust
  • ocrs - Pure Rust OCR engine
  • ort - ONNX Runtime bindings for Rust
  • PaddleOCR - Original Python implementation
  • paddle2onnx - Model conversion tool
