Feature Request: Integrate Rust-native PaddleOCR without Python dependencies #302

@Haoxincode

Description

Technical Feasibility Study: Integrating Python-Free Rust-Native PaddleOCR into Kreuzberg Framework

1. Overview and Background Analysis

1.1 Research Background and Objectives

Kreuzberg, as an emerging high-performance document intelligence processing framework, derives its core value proposition from leveraging Rust's memory safety, zero-cost abstractions, and exceptional concurrency capabilities to address performance bottlenecks and deployment complexity issues inherent in traditional document processing (especially Python-based ecosystems).

In document intelligence processing pipelines, Optical Character Recognition (OCR) is a critical component that directly determines whether unstructured data (such as scanned documents and images) can be transformed into searchable, analyzable structured information.

Currently, while Kreuzberg has implemented its core logic in Rust, it still partially relies on external ecosystems for OCR capabilities. Traditional integration approaches often invoke Tesseract (C++ library) via FFI or call EasyOCR/PaddleOCR through Python bindings. This "hybrid architecture," while leveraging mature existing models, introduces significant engineering pain points:

Deployment Complexity (Dependency Hell): Production environments must maintain Python runtime, manage pip dependencies, handle virtual environments, and ensure compatibility across different versions of deep learning frameworks (PyTorch/PaddlePaddle). This contradicts Rust's minimalist deployment philosophy of generating single static binaries.

Performance Bottlenecks: Python's Global Interpreter Lock (GIL) limits multi-threaded concurrency, and cross-language data transfer often involves memory copying, increasing latency.

Resource Overhead: Loading the complete Python interpreter and deep learning frameworks introduces a substantial memory baseline (typically 500MB+), making it unsuitable for edge devices or high-density container deployments.

This report aims to thoroughly explore the feasibility of an alternative approach: integrating Baidu's PaddleOCR into the Kreuzberg framework using pure Rust or Rust-native bindings without introducing any Python runtime. The goal is to achieve recognition accuracy comparable to the original Python implementation while significantly reducing resource consumption and simplifying deployment.

1.2 PaddleOCR's Technical Standing

PaddleOCR (PP-OCR series) has become an industry de facto standard due to its excellent performance in Chinese and multilingual recognition, lightweight model design, and robustness for complex layouts (tables, distorted text). PP-OCRv4 and v5 versions, in particular, introduced more efficient backbone networks and data augmentation strategies, delivering excellent performance on both server and mobile platforms.

Therefore, whether these high-quality pretrained models can be reused in Rust is key to enhancing Kreuzberg's competitiveness.


2. Core Architecture Analysis and Integration Strategy

2.1 Kreuzberg v4's Plugin Architecture

Kreuzberg v4's architecture is deeply influenced by Rust language features, emphasizing modularity and extensibility. Its core defines interface specifications for various components through the trait system, allowing developers to inject custom implementations as plugins.

For OCR functionality, Kreuzberg defines the OcrBackend trait. This is an async trait designed to decouple specific OCR engine implementations. This means the core framework doesn't care whether the underlying layer calls Tesseract's C API, makes network requests to cloud APIs, or runs ONNX inference locally.

This loose coupling provides a perfect entry point for integrating Rust-native PaddleOCR.

By implementing the OcrBackend trait, we can build an inference path that completely bypasses the Python layer. In this architecture, Kreuzberg's main process directly manages image data in memory and passes it to the Rust-implemented OCR module. This module handles image preprocessing, model inference (via Rust-bound inference engines), and post-processing (decoding), ultimately returning structured text results.

2.2 Interface Definition and Data Flow

For seamless integration, we must strictly adhere to the OcrBackend interface contract. Based on Rust async programming best practices and Kreuzberg documentation, this interface typically contains the following key methods:

```rust
#[async_trait]
pub trait OcrBackend: Send + Sync {
    /// Initialize the backend; models are typically loaded into memory here.
    async fn init(config: &OcrConfig) -> Result<Self>
    where
        Self: Sized;

    /// Execute OCR recognition.
    /// `input`: image byte stream or decoded pixel data.
    /// `options`: per-request configuration overrides.
    async fn scan(
        &self,
        input: &OcrInput,
        options: Option<&ScanOptions>,
    ) -> Result<OcrOutput>;
}
```

When implementing this interface, special attention must be paid to zero-copy data flow. Kreuzberg internally likely uses the image crate's DynamicImage structure or raw &[u8] byte slices for image data transfer. Our Rust PaddleOCR implementation must be able to directly consume this memory data, rather than requiring file paths like some Python scripts, thereby avoiding unnecessary disk I/O overhead.
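To make the zero-copy idea concrete, here is a minimal sketch (pure standard library, not Kreuzberg's actual API) of consuming an in-memory RGB8 buffer directly and producing the planar CHW `f32` layout that inference engines expect, with no file path or disk round-trip involved. The function name and layout choice are illustrative assumptions.

```rust
/// Convert a tightly packed in-memory HWC RGB8 buffer into a planar CHW
/// f32 tensor scaled to [0, 1], consuming the bytes directly with no disk
/// I/O. (Sketch only: a real backend would build an `ndarray` or MNN tensor.)
fn hwc_u8_to_chw_f32(pixels: &[u8], width: usize, height: usize) -> Vec<f32> {
    assert_eq!(pixels.len(), width * height * 3, "expected packed RGB8");
    let plane = width * height;
    let mut chw = vec![0.0f32; plane * 3];
    for i in 0..plane {
        for c in 0..3 {
            // Channel c of pixel i moves from interleaved HWC to planar CHW.
            chw[c * plane + i] = pixels[i * 3 + c] as f32 / 255.0;
        }
    }
    chw
}
```

Because the function takes `&[u8]`, it can consume a `DynamicImage`'s raw buffer or a decoded byte slice from Kreuzberg's parsers without an intermediate copy to disk.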


3. Evaluation of PaddleOCR Implementation Approaches in the Rust Ecosystem

To achieve "no Python dependency," a more pragmatic and efficient approach than building a from-scratch pure-Rust engine is to leverage Rust's powerful FFI capabilities to bind high-performance C++ inference engines (such as ONNX Runtime or MNN) while rewriting all preprocessing and post-processing logic in pure Rust.

3.1 Option 1: MNN-based rusto-rs

rusto-rs is currently the most complete PaddleOCR reimplementation in the open-source Rust community.

  • Tech Stack: Uses MNN (Alibaba's lightweight inference engine) via Rust FFI.
  • Image Processing: Critically, rusto-rs completely removes OpenCV dependency. Developers reimplemented all PaddleOCR image preprocessing algorithms using pure Rust image and imageproc crates, including complex text box contour detection and perspective transformation.
  • Model Support: Explicitly supports PP-OCRv4 and PP-OCRv5 models, with toolchains for converting Paddle models to MNN format.
  • Integration Advantage: Since it already implements the complete pipeline from image::DynamicImage to inference results, integration into Kreuzberg requires only a thin wrapper layer.

3.2 Option 2: ONNX Runtime-based oar-ocr / paddle-ocr-rs

This approach leverages the industry-standard model exchange format ONNX.

  • Tech Stack: Uses ort crate (Rust bindings for Microsoft ONNX Runtime) as the inference backend.
  • Ecosystem Advantage: PaddlePaddle officially provides comprehensive tools (paddle2onnx) for exporting models to ONNX format. ONNX Runtime has extensive hardware support, easily utilizing NVIDIA GPU (CUDA/TensorRT), Apple CoreML, and even AVX512 instruction set acceleration on CPU.
  • Implementation Details: Projects like oar-ocr have encapsulated DBNet (detection) and CRNN (recognition) post-processing logic. Compared to MNN, ONNX Runtime has a more mature server-side ecosystem, and the ort crate is actively maintained.

3.3 Option 3: Pure Rust Inference Engine ocrs

ocrs represents the Rust community's exploration toward "pure Rust."

  • Features: Uses the RTen engine, a completely Rust-written inference runtime requiring no C++ library linking.
  • Limitations: While most aligned with "pure Rust" philosophy, it currently mainly supports specific PyTorch-exported models and has incomplete support for PaddleOCR-specific operators.

3.4 Selection Conclusion

Considering engineering feasibility, maintenance costs, and performance, Option 2 (ONNX Runtime-based) is currently the best choice, with Option 1 (rusto-rs) as a close second.

While both options rely on C++-written inference engines (ORT or MNN) under the hood, they expose pure Rust interfaces, and the compiled artifacts do not depend on a Python environment on the target system. This fully satisfies the core requirement of "no Python dependency."


4. Deep Technical Implementation Path

4.1 Precise Replication of Preprocessing Pipeline

OCR accuracy is extremely sensitive to image preprocessing. PaddleOCR (Python version) heavily uses OpenCV functionality. In Rust, we must replicate this logic using the image crate and ensure pixel-level alignment; otherwise, inference accuracy will degrade.

Resizing:

  • Detection Stage: Must implement ResizeShort-like logic, adjusting the image's shortest edge to multiples of 32 while maintaining aspect ratio. Rust's image::resize provides multiple interpolation algorithms (Nearest, Triangle, CatmullRom, Gaussian, Lanczos3). The algorithm closest to OpenCV's INTER_LINEAR (typically FilterType::Triangle) must be selected to ensure input tensor consistency.
  • Recognition Stage: Text box slices must be scaled to a fixed height (typically 48px for PP-OCRv4/v5), with width scaled proportionally and then padded to a uniform batch width.
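The detection-stage resize logic can be sketched as follows. This is an illustrative pure-Rust helper, not the official algorithm: the `target_short` parameter and round-to-nearest strategy are assumptions to verify against PaddleOCR's own resize code (the official implementation also caps the longest side via `limit_side_len`).

```rust
/// Compute detection-model input dimensions: scale so the shorter edge
/// reaches `target_short` while keeping aspect ratio, then round both
/// edges to multiples of 32 as DBNet-style backbones require.
fn det_resize_dims(width: u32, height: u32, target_short: u32) -> (u32, u32) {
    let short = width.min(height).max(1);
    let scale = target_short as f64 / short as f64;
    // Round each scaled edge to the nearest multiple of 32, at least 32.
    let round32 = |v: f64| -> u32 { ((v / 32.0).round() as u32).max(1) * 32 };
    (round32(width as f64 * scale), round32(height as f64 * scale))
}
```

The actual image resampling would then run through `image::imageops::resize` with `FilterType::Triangle` to approximate OpenCV's `INTER_LINEAR`, as discussed above.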

Normalization:
Image data must undergo (pixel - mean) / std operations. This can be vectorized using Rust's ndarray library or SIMD intrinsics, matching or exceeding NumPy's vectorized throughput while avoiding interpreter overhead.
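A minimal in-place sketch of this step, operating on a planar CHW tensor already scaled to [0, 1]. The mean/std constants follow the ImageNet convention commonly used in PaddleOCR's detection preprocessing, but should be verified against the specific model's config.

```rust
/// Apply per-channel (pixel - mean) / std normalization in place on a
/// planar CHW f32 tensor (3 channels, `plane` = width * height).
/// Constants are the ImageNet-style values assumed here; confirm them
/// against the target model's preprocessing config.
fn normalize_chw(chw: &mut [f32], plane: usize) {
    const MEAN: [f32; 3] = [0.485, 0.456, 0.406];
    const STD: [f32; 3] = [0.229, 0.224, 0.225];
    for c in 0..3 {
        let (m, s) = (MEAN[c], STD[c]);
        for v in &mut chw[c * plane..(c + 1) * plane] {
            *v = (*v - m) / s;
        }
    }
}
```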

Geometric Transformation:
For detected tilted text boxes, perspective transformation is needed to "straighten" them. The rusto-rs project contains a pure Rust implementation of the get_rotate_crop_image function, one of the core snippets most worth reusing during integration.

4.2 Inference Engine Lifecycle and Concurrency Management

Kreuzberg, as a high-performance framework, typically runs in an async environment (Tokio Runtime).

Model Persistence: Loading ONNX models (det.onnx, rec.onnx) is time-consuming. Loading must complete during OcrBackend initialization, with Session objects persistently stored in structs. Since Session is thread-safe (in ort), we can wrap it in Arc (atomic reference counting) to share a single model instance across multiple concurrent OCR tasks, greatly saving memory.

Async Computation Isolation: Although ort supports multi-threaded inference, inference itself is CPU-intensive. Directly calling inference functions in Tokio async tasks blocks the event loop, slowing server responses. Therefore, tokio::task::spawn_blocking must be used to dispatch inference tasks to dedicated blocking thread pools.
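The Arc-sharing pattern can be illustrated with standard-library primitives. `MockSession` below is a hypothetical stand-in for an `ort` `Session`; it is loaded once and shared by reference count across workers. Under Kreuzberg's Tokio runtime the per-task closure would be dispatched via `tokio::task::spawn_blocking` rather than a raw thread, but the ownership structure is identical.

```rust
use std::sync::Arc;
use std::thread;

/// Hypothetical stand-in for an `ort::Session`: expensive to create,
/// cheap to share, safe to call from multiple threads.
struct MockSession {
    model_path: String,
}

impl MockSession {
    fn run(&self, input_len: usize) -> usize {
        // A real session would execute the ONNX graph loaded from
        // `self.model_path`; here we just echo a derived size.
        debug_assert!(self.model_path.ends_with(".onnx"));
        input_len * 2
    }
}

fn run_parallel_ocr() -> Vec<usize> {
    // Load the "model" once; Arc lets every worker share the same instance.
    let session = Arc::new(MockSession { model_path: "det.onnx".into() });
    let handles: Vec<_> = (1..=4)
        .map(|i| {
            let s = Arc::clone(&session);
            // Under Tokio: tokio::task::spawn_blocking(move || s.run(i))
            thread::spawn(move || s.run(i))
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}
```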

4.3 Post-processing and Decoding Algorithms

Model outputs are merely tensors requiring complex post-processing for text conversion.

DBNet Post-processing: Detection model output is a probability map. Rust-implemented bitmap generation and polygon expansion algorithms (Unclip) are needed to extract final text box coordinates.
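A heavily simplified sketch of the first half of this step: thresholding the probability map and grouping foreground pixels into connected components with axis-aligned bounding boxes. Real PaddleOCR post-processing additionally extracts polygon contours, scores them, and expands them with the Unclip algorithm; none of that is shown here.

```rust
/// Binarize a row-major probability map and flood-fill connected
/// components into inclusive (x0, y0, x1, y1) bounding boxes.
/// Simplified sketch: real DBNet post-processing uses polygon
/// contours plus Unclip expansion, not axis-aligned boxes.
fn prob_map_to_boxes(
    prob: &[f32],
    w: usize,
    h: usize,
    thresh: f32,
) -> Vec<(usize, usize, usize, usize)> {
    let mut seen = vec![false; w * h];
    let mut boxes = Vec::new();
    for start in 0..w * h {
        if seen[start] || prob[start] < thresh {
            continue;
        }
        // Flood fill one component, tracking its pixel extremes.
        let (mut x0, mut y0, mut x1, mut y1) = (w, h, 0usize, 0usize);
        let mut stack = vec![start];
        seen[start] = true;
        while let Some(idx) = stack.pop() {
            let (x, y) = (idx % w, idx / w);
            x0 = x0.min(x); y0 = y0.min(y);
            x1 = x1.max(x); y1 = y1.max(y);
            let mut neighbors = Vec::with_capacity(4);
            if x > 0 { neighbors.push(idx - 1); }
            if x + 1 < w { neighbors.push(idx + 1); }
            if y > 0 { neighbors.push(idx - w); }
            if y + 1 < h { neighbors.push(idx + w); }
            for n in neighbors {
                if !seen[n] && prob[n] >= thresh {
                    seen[n] = true;
                    stack.push(n);
                }
            }
        }
        boxes.push((x0, y0, x1, y1));
    }
    boxes
}
```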

CTC Decoding: Recognition model output is a character index sequence. CTC (Connectionist Temporal Classification) decoding logic must be implemented:

  • Remove consecutive duplicate characters.
  • Remove blank tokens.
  • Map indices back to UTF-8 characters using dictionary files (e.g., ppocr_keys_v1.txt).

This step is extremely efficient in Rust using HashMap or Vec lookups.
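The three bullets above reduce to a few lines of greedy decoding. The sketch below assumes the common PP-OCR convention that index 0 is the blank token and dictionary entries start at index 1; verify this against the dictionary file actually shipped with the model.

```rust
/// Greedy CTC decoding: collapse consecutive repeats, drop the blank
/// token (index 0 by assumed PP-OCR convention), and map the remaining
/// indices through the character dictionary.
fn ctc_decode(indices: &[usize], dict: &[&str]) -> String {
    let mut out = String::new();
    let mut prev = usize::MAX;
    for &idx in indices {
        if idx != prev && idx != 0 {
            // Index 0 is blank, so character i lives at dict[idx - 1].
            if let Some(ch) = dict.get(idx - 1) {
                out.push_str(ch);
            }
        }
        prev = idx;
    }
    out
}
```

Note how a blank between two identical indices separates them, so genuine doubled characters survive the repeat-collapsing step.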


5. Performance and Resource Efficiency Analysis

Migration to Rust-native implementation yields multi-dimensional performance benefits.

5.1 Memory and Startup Overhead

Memory Consumption: Python solutions require loading the Python interpreter, NumPy, PaddlePaddle/PyTorch frameworks, and their dependencies, with a memory baseline typically between 500MB and 1GB. In contrast, ort or MNN-based Rust implementations only need lightweight inference libraries and model weight files. Testing shows rusto-rs runtime peak memory can be kept to roughly 200MB—a huge advantage for resource-constrained container environments.

Cold Start Speed: Removing Python interpreter initialization makes Rust binary startup nearly instantaneous, crucial for Serverless deployment scenarios.

5.2 Computational Throughput

Eliminating the GIL: In high-concurrency scenarios, Python's GIL limits the parallel efficiency of CPU-bound tasks such as image preprocessing. Rust solutions can easily parallelize preprocessing over data with the Rayon library, fully utilizing multi-core CPU performance.

Zero-copy Data: Within Kreuzberg, image data flowing from file parsers to the OCR engine involves only pointer or reference passing in Rust. In Python hybrid architectures, data crossing between Rust heap memory and Python heap memory often requires serialization/deserialization (or Buffer Protocol copying), producing significant latency in bulk image processing.


6. Implementation Roadmap and Recommendations

Based on the above technical analysis, the following steps are recommended:

1. Proof of Concept (PoC)

  • Create an independent Rust Crate, incorporating oar-ocr or rusto-rs as dependencies.
  • Download PP-OCRv4 ONNX model files.
  • Write test cases with standard test images, verifying Rust output text matches Python PaddleOCR, focusing on Chinese recognition accuracy.

2. Build Adapter

  • Create new kreuzberg-paddle plugin module in Kreuzberg repository.
  • Implement OcrBackend trait.
  • In scan method, convert Kreuzberg image data to ndarray (for ONNX) or MNN Tensor.
  • Handle coordinate system mapping to ensure OCR-returned Bounding Boxes correctly overlay original PDFs or images.

3. Resolve Build and Distribution Issues

  • Static Linking: By default, ort dynamically links C++ libraries. For simplified distribution, explore ort's static linking features or pre-configure libonnxruntime.so in Dockerfiles.
  • Model Management: Following ocrs design, check local cache directories for model files during init, auto-downloading from HuggingFace or other CDNs if absent.

4. Performance Tuning

  • Use criterion crate for benchmark testing preprocessing, inference, and post-processing stages separately, identifying performance hotspots for optimization.

Conclusion

Completely feasible with significant benefits.

Integrating pure Rust PaddleOCR into Kreuzberg is not only technically feasible but a critical step toward the framework's "high-performance, easy-deployment" vision. By leveraging mature Rust ecosystem libraries like ort or rusto-rs, we can successfully remove the heavyweight Python runtime while retaining PaddleOCR's powerful multilingual recognition capabilities. This resolves the dependency hell plaguing developers and establishes a solid foundation for building high-throughput, low-latency document intelligence processing pipelines.


Related Resources

  • rusto-rs - MNN-based PaddleOCR in Rust
  • ocrs - Pure Rust OCR engine
  • ort - ONNX Runtime bindings for Rust
  • PaddleOCR - Original Python implementation
  • paddle2onnx - Model conversion tool
