Commit cb9594e

fix: MSG extraction hang on large attachments, remove LibreOffice dependency
- Replace msg_parser with direct CFB/OLE parsing for MSG files, fixing indefinite hang on large attachments (#372)
- Add lenient FAT padding for truncated MSG files from some Outlook versions
- Remove LibreOffice subprocess dependency for DOC/PPT extraction (now handled natively via OLE/CFB parsing)
- Remove msg_parser and LibreOffice-related error types, docs, and scripts
- Add 8 MSG integration tests covering unicode, attachments, truncated FAT, large files, and error handling
1 parent 259ddd3 commit cb9594e
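
The "lenient FAT padding" item in the commit message can be pictured with a small sketch. This is not the code from this commit: it assumes, purely for illustration, that a truncated file ends partway through its last sector, and it pads that sector with 0xFF bytes so that any FAT entries falling in the padded region read as FREESECT (0xFFFFFFFF). The function name `pad_to_sector_boundary` is hypothetical. The signature bytes and the sector-shift field at header offset 30 follow the published Compound File Binary layout.

```python
import struct

# Magic bytes that open every Compound File Binary (OLE) container.
CFB_SIGNATURE = bytes.fromhex("D0CF11E0A1B11AE1")
HEADER_SIZE = 512
SECTOR_SHIFT_OFFSET = 30  # little-endian uint16: 9 (512 B) or 12 (4096 B)


def pad_to_sector_boundary(data: bytes) -> bytes:
    """Return data padded with 0xFF up to a whole number of sectors.

    Hypothetical helper for illustration only, not the commit's code.
    """
    if len(data) < HEADER_SIZE or data[:8] != CFB_SIGNATURE:
        raise ValueError("not a CFB container")
    (sector_shift,) = struct.unpack_from("<H", data, SECTOR_SHIFT_OFFSET)
    if sector_shift not in (9, 12):
        raise ValueError(f"unexpected sector shift: {sector_shift}")
    sector_size = 1 << sector_shift
    # A well-formed file is a whole multiple of the sector size, so this
    # is the number of bytes missing from the final sector (0 if none).
    missing = -len(data) % sector_size
    return data + b"\xff" * missing
```

A parser that insists on whole sectors could then be handed the padded bytes instead of rejecting the file outright. Whether the actual fix pads the file, the FAT sector, or the in-memory table is not stated in the commit message.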

67 files changed, 545 additions and 1154 deletions

.github/actions/install-system-deps/action.yml

Lines changed: 1 addition & 19 deletions
@@ -1,7 +1,7 @@
 name: Install System Dependencies
 description: |
   Install and cache platform-specific dependencies required for document conversion.
-  Includes: Tesseract OCR, LibreOffice, fonts, and build tools.
+  Includes: Tesseract OCR, fonts, and build tools.
   Features robust caching with architecture/version awareness, timeout handling, and retry logic.

 inputs:
@@ -35,14 +35,6 @@ runs:
           tesseract-macos-${{ runner.arch }}-v5-
           tesseract-macos-${{ runner.arch }}-

-    - name: Cache LibreOffice (macOS)
-      if: runner.os == 'macOS'
-      id: cache-libreoffice-macos
-      uses: actions/cache@v5
-      with:
-        path: /Applications/LibreOffice.app
-        key: libreoffice-macos-${{ runner.arch }}-v3
-
     - name: Install dependencies (macOS)
       if: runner.os == 'macOS'
       shell: bash
@@ -86,16 +78,6 @@ runs:
         restore-keys: |
           tesseract-windows-${{ runner.arch }}-

-    - name: Cache LibreOffice (Windows)
-      if: runner.os == 'Windows'
-      id: cache-libreoffice-windows
-      uses: actions/cache@v5
-      with:
-        path: |
-          C:\Program Files\LibreOffice
-          C:\ProgramData\chocolatey\lib\libreoffice
-        key: libreoffice-windows-${{ runner.arch }}-v3
-
     - name: Cache LLVM (Windows)
       if: runner.os == 'Windows'
       id: cache-llvm-windows

CHANGELOG.md

Lines changed: 57 additions & 51 deletions
@@ -7,73 +7,88 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

 ---

-## [4.2.15] - 2026-02-08
+## [Unreleased]

 ### Added

-#### Agent Skill for AI Coding Assistants
-
-- **Agent Skill for document extraction**: Added `skills/kreuzberg/SKILL.md` following the [Agent Skills](https://agentskills.io) open standard, with comprehensive instructions for Python, Node.js, Rust, and CLI usage. Includes 8 detailed reference files covering API signatures, configuration, supported formats, plugins, and all language bindings. Works with Claude Code, Codex, Gemini CLI, Cursor, VS Code, Amp, Goose, Roo Code, and any compatible tool.
+#### PaddleOCR Backend
+- **PaddleOCR backend via ONNX Runtime**: New OCR backend (`kreuzberg-paddle-ocr`) using PaddlePaddle's PP-OCRv4 models converted to ONNX format, run via ONNX Runtime. Supports 6 languages (English, Chinese, Japanese, Korean, German, French) with automatic model downloading and caching. Provides superior CJK recognition compared to Tesseract.
+- **PaddleOCR support in all bindings**: Available across Python, Rust, TypeScript/Node.js, Go, Java, PHP, Ruby, C#, and Elixir bindings via the `paddle-ocr` feature flag.
+- **PaddleOCR CLI support**: The `kreuzberg-cli` binary supports `--ocr-backend paddle-ocr` for PaddleOCR extraction.

-#### MIME Type Mappings
-- Added `.docbook` (`application/docbook+xml`) and `.jats` (`application/x-jats+xml`) file extension mappings.
+#### Unified OCR Element Output
+- **Structured OCR element data**: Extraction results now include `OcrElement` data with bounding geometry (rectangles and quadrilaterals), per-element confidence scores, rotation information, and hierarchical levels (word, line, block, page). Available from both PaddleOCR and Tesseract backends.

-### Added
+#### Shared ONNX Runtime Discovery
+- **`ort_discovery` module**: Finds ONNX Runtime shared libraries across platforms, shared between PaddleOCR and future ONNX-based backends.

-#### OCR
-- **PaddleOCR backend via ONNX Runtime**: Added a new OCR backend (`kreuzberg-paddle-ocr`) using PaddlePaddle's PP-OCRv4 models converted to ONNX format, run via ONNX Runtime. Supports 6 languages (English, Chinese, Japanese, Korean, German, French) with automatic model downloading and caching. Provides superior CJK recognition compared to Tesseract.
-- **Unified OCR element output architecture**: Extraction results now include structured `OcrElement` data with bounding geometry (rectangles and quadrilaterals), per-element confidence scores, rotation information, and hierarchical levels (word, line, block, page). Available from both PaddleOCR and Tesseract backends.
-- **PaddleOCR support in all bindings**: PaddleOCR is available across Python, Rust, TypeScript/Node.js, Go, Java, PHP, Ruby, C#, and Elixir bindings via the `paddle-ocr` feature flag.
-- **PaddleOCR CLI support**: The `kreuzberg-cli` binary supports `--ocr-backend paddle-ocr` for PaddleOCR extraction.
-- **Shared ORT discovery**: Added `ort_discovery` module for finding ONNX Runtime shared libraries across platforms, shared between PaddleOCR and future ONNX-based backends.
-- **PaddleOCR model setup GitHub Action**: Added `.github/actions/setup-paddle-ocr-models/` action for CI pipelines to download and cache PaddleOCR model files.
+#### Document Structure Output
+- **`DocumentStructure` support across all bindings**: Added structured document output with `include_document_structure` configuration option across Python, TypeScript/Node.js, Go, Java, PHP, Ruby, C#, Elixir, and WASM bindings.

-#### CI
-- **PaddleOCR CI integration**: Added PaddleOCR to the CI/publish pipelines with dedicated test jobs and model caching.
+#### Native DOC/PPT Extraction
+- **OLE/CFB-based extraction**: Added native DOC and PPT extraction via OLE/CFB binary parsing. Legacy Office formats no longer require any external tools.

 #### musl Linux Support
 - **Re-enabled musl targets**: Added `x86_64-unknown-linux-musl` and `aarch64-unknown-linux-musl` targets for CLI binaries, Python wheels (musllinux), and Node.js native bindings. Resolves glibc 2.38+ requirement for prebuilt CLI binaries on older distros like Ubuntu 22.04 (#364).
-- **musl CI workflows**: Added dedicated `ci-musl.yaml` workflow for CLI musl build validation with Alpine container smoke tests, and musllinux Python wheel builds to `ci-python.yaml`.
-- **PDFium musl awareness**: Build script now downloads musl-specific PDFium binaries and uses `libstdc++` consistently for all Linux targets (including musl).
-- **musl C++ cross-compilation**: Added `resolve_cxx_compiler()` and `create_musl_cxx_wrapper()` to `kreuzberg-tesseract` build script for correct C++ header resolution when cross-compiling from glibc host to musl target. Skips `-ldl` linking on musl (not available/needed).
-
-#### Build System
-- **Tesseract 5.5.2**: Bumped vendored Tesseract from 5.5.1 to 5.5.2 with `BUILD_TESSERACT_BINARY=OFF` to skip unnecessary binary compilation.
-- **Leptonica 1.87.0**: Bumped vendored Leptonica from 1.86.0 to 1.87.0.
-- **ONNX Runtime 1.24.1**: Bumped ONNX Runtime from 1.23.2 to 1.24.1.
-- **Dead code cleanup**: Removed unused EMSDK constants and `apply_patches()` function from `kreuzberg-tesseract` build script.
-
-### Removed
-
-#### Node.js Bindings
-- **Guten OCR references**: Removed all references to the unused Guten OCR backend. Renamed `KREUZBERG_DEBUG_GUTEN` env var to `KREUZBERG_DEBUG_OCR`.
-
-#### PHP Bindings
-- **Guten OCR backend option**: Removed `'guten'` from the documented backend choices in `OcrConfig`.

 ### Fixed

-#### PaddleOCR Recognition Model Shape Inference
-- Fixed PaddleOCR recognition model (`en_PP-OCRv4_rec_infer.onnx`) failing to load with `ShapeInferenceError` on ONNX Runtime 1.23.x. A `Squeeze` node incorrectly reduced a rank-1 tensor to a scalar before a `Concat` operation. The fixed model has been re-uploaded to the HuggingFace model repository.
+#### MSG Extraction Hang on Large Attachments (#372)
+- Fixed `.msg` (Outlook) extraction hanging indefinitely on files with large attachments. Replaced the `msg_parser` crate with direct OLE/CFB parsing using the `cfb` crate — attachment binary data is now read directly without hex-encoding overhead.
+- Added lenient FAT padding for MSG files with truncated sector tables produced by some Outlook versions.
+
+#### Rotated PDF Text Extraction
+- Fixed text extraction returning empty content for PDFs with 90° or 270° page rotation. Kreuzberg now strips `/Rotate` entries from page dictionaries before loading, restoring correct text extraction for all rotation angles.

 #### CSV and Excel Extraction Quality
 - Fixed CSV extraction producing near-zero quality scores (0.024) by outputting proper delimited text instead of debug format.
 - Fixed Excel extraction producing low quality scores (0.22) by outputting clean tab/newline-delimited cell text.

-#### Native DOC/PPT Extraction
-- Added native DOC and PPT extraction via OLE/CFB parsing, replacing the LibreOffice subprocess dependency for legacy Office formats.
-
 #### XML Extraction Quality
 - Improved XML text extraction to better handle namespaced elements, CDATA sections, and mixed content, improving quality scores.

 #### WASM Table Extraction
 - Fixed WASM adapter not recognizing `page_number` field (snake_case) from Rust FFI, causing table data to be silently dropped in Deno and Cloudflare Workers tests.

-#### Ruby CI ONNX Runtime Discovery
-- Fixed Ruby E2E tests failing with `dlopen failed` for `libonnxruntime.so` by adding ONNX Runtime setup and library path export to the Ruby CI test job.
+#### PaddleOCR Recognition Model
+- Fixed PaddleOCR recognition model (`en_PP-OCRv4_rec_infer.onnx`) failing to load with `ShapeInferenceError` on ONNX Runtime 1.23.x.
+- Fixed incorrect detection model filename in Docker and CI action (`en_PP-OCRv4_det_infer.onnx` → `ch_PP-OCRv4_det_infer.onnx`).
+
+#### Python Bindings
+- Fixed `OcrConfig` constructor silently ignoring `paddle_ocr_config` and `element_config` keyword arguments.
+
+### Changed
+
+#### Build System
+- Bumped ONNX Runtime from 1.23.2 to 1.24.1 across CI, Docker images, and documentation.
+- Bumped vendored Tesseract from 5.5.1 to 5.5.2.
+- Bumped vendored Leptonica from 1.86.0 to 1.87.0.
+
+### Removed
+
+#### LibreOffice Dependency
+- **LibreOffice is no longer required**: Legacy .doc and .ppt files are now extracted natively via OLE/CFB parsing. LibreOffice has been removed from Docker images, CI pipelines, and system dependency requirements, reducing the full Docker image size by ~500-800MB. Users on Kreuzberg <4.3 still need LibreOffice for these formats.
+
+#### `msg_parser` Dependency
+- Replaced `msg_parser` crate with direct CFB parsing for MSG extraction. Eliminates hex-encoding overhead and reduces dependency count.
+
+#### Guten OCR Backend
+- Removed all references to the unused Guten OCR backend from Node.js and PHP bindings. Renamed `KREUZBERG_DEBUG_GUTEN` env var to `KREUZBERG_DEBUG_OCR`.

-#### Java E2E Test Compilation
-- Fixed Java E2E helper compilation errors caused by `Metadata` type not being directly castable to `Map` and `Element.getType()` method not existing. Updated to use `Metadata.getAdditional()` and `Element.getElementType()`.
+---
+
+## [4.2.15] - 2026-02-08
+
+### Added
+
+#### Agent Skill for AI Coding Assistants
+
+- **Agent Skill for document extraction**: Added `skills/kreuzberg/SKILL.md` following the [Agent Skills](https://agentskills.io) open standard, with comprehensive instructions for Python, Node.js, Rust, and CLI usage. Includes 8 detailed reference files covering API signatures, configuration, supported formats, plugins, and all language bindings. Works with Claude Code, Codex, Gemini CLI, Cursor, VS Code, Amp, Goose, Roo Code, and any compatible tool.
+
+#### MIME Type Mappings
+- Added `.docbook` (`application/docbook+xml`) and `.jats` (`application/x-jats+xml`) file extension mappings.
+
+### Fixed

 #### ODT List and Section Extraction
 - Fixed ODT extractor not handling `text:list` and `text:section` elements. Documents containing bulleted/numbered lists or sections returned empty content.
@@ -99,17 +114,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 #### PDF Error Handling Regression
 - Reverted incorrect change from v4.2.14 that silently returned empty results for corrupted/malformed PDFs instead of propagating errors. Corrupted PDFs now correctly return `PdfError::InvalidPdf` and password-protected PDFs return `PdfError::PasswordRequired` as expected.

-#### PaddleOCR Model URLs
-- Fixed incorrect detection model filename in Docker and CI action (`en_PP-OCRv4_det_infer.onnx` → `ch_PP-OCRv4_det_infer.onnx`).
-
-#### Python Bindings
-- Fixed `OcrConfig` constructor silently ignoring `paddle_ocr_config` and `element_config` keyword arguments.
-
 ### Changed

-#### ONNX Runtime
-- Bumped ONNX Runtime from 1.23.2 to 1.24.1 across CI, Docker images, and documentation. Minimum supported ORT version is 1.23+.
-
 #### API Parity
 - Added `security_limits` field to all 9 language bindings (TypeScript, Go, Python, Ruby, PHP, Java, C#, WASM, Elixir) for API parity with Rust core `ExtractionConfig`.
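
The MSG entries in the changelog diff above attribute part of the fix to reading attachment bytes directly instead of going through a hex-encoded form. A minimal sketch of why such an intermediate form is costly, assuming nothing about `msg_parser` internals beyond what the changelog states: hex text is exactly twice the size of the bytes it represents, and it has to be decoded again before use. The helper name `hex_round_trip` is hypothetical.

```python
import binascii


def hex_round_trip(payload: bytes) -> tuple[bytes, int]:
    """Encode payload as hex text, decode it back, and report the hex size."""
    hex_text = binascii.hexlify(payload)   # two ASCII characters per byte
    decoded = binascii.unhexlify(hex_text) # extra pass to recover the bytes
    return decoded, len(hex_text)


# Stand-in for an attachment; real attachments can be far larger.
attachment = bytes(range(256)) * 1024  # 256 KiB
decoded, hex_len = hex_round_trip(attachment)
print(len(attachment), hex_len)
```

For a multi-hundred-megabyte attachment, the doubled buffer plus the decode pass adds up quickly, which is consistent with the changelog describing the direct read as removing overhead.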

Cargo.lock

Lines changed: 8 additions & 25 deletions

README.md

Lines changed: 1 addition & 1 deletion
@@ -91,7 +91,7 @@ Each language binding provides comprehensive documentation with examples and bes
 - **[Rust](https://github.com/kreuzberg-dev/kreuzberg/tree/main/crates/kreuzberg)** – Core library, flexible feature flags, zero-copy APIs

 **Containers:**
-- **[Docker](https://docs.kreuzberg.dev/guides/docker/)** – Official images with API, CLI, and MCP server modes (Core: ~1.0-1.3GB, Full: ~1.5-2.1GB with LibreOffice)
+- **[Docker](https://docs.kreuzberg.dev/guides/docker/)** – Official images with API, CLI, and MCP server modes (Core: ~1.0-1.3GB, Full: ~1.0-1.3GB with OCR + legacy format support)

 **Command-Line:**
 - **[CLI](https://docs.kreuzberg.dev/cli/usage/)** – Cross-platform binary, batch processing, MCP server mode

crates/kreuzberg-cli/README.md

Lines changed: 0 additions & 7 deletions
@@ -84,13 +84,6 @@ To enable optical character recognition for scanned documents:
 - **Ubuntu/Debian**: `sudo apt-get install tesseract-ocr`
 - **Windows**: Download from [tesseract-ocr/tesseract](https://github.com/tesseract-ocr/tesseract)

-#### Legacy Office Format Support (Optional)
-
-For `.doc` and `.ppt` file extraction:
-
-- **macOS**: `brew install libreoffice`
-- **Ubuntu/Debian**: `sudo apt-get install libreoffice`
-
 ## Quick Start

 > The CLI is available for Linux (x86_64/aarch64), macOS (Apple Silicon), and Windows with consistent behavior across all platforms.

crates/kreuzberg-node/README.md

Lines changed: 3 additions & 5 deletions
@@ -98,12 +98,10 @@ yarn add @kreuzberg/node
 - Optional: [ONNX Runtime](https://github.com/microsoft/onnxruntime/releases) version 1.23+ for embeddings support
 - Optional: [Tesseract OCR](https://github.com/tesseract-ocr/tesseract) for OCR functionality

-- Optional: [LibreOffice](https://www.libreoffice.org/download/download/) for legacy Office formats (DOC, XLS, PPT, RTF, ODT, ODS, ODP)
-
 **Format Support Notes:**
-- Modern Office formats (DOCX, XLSX, PPTX) work without LibreOffice
-- Legacy formats (DOC, XLS, PPT) require LibreOffice installation
-- WASM binding supports DOCX, XLSX, PPTX, and ODT (no LibreOffice required)
+- Legacy formats (DOC, XLS, PPT) are now extracted natively without external tools
+- Modern Office formats (DOCX, XLSX, PPTX) are fully supported
+- WASM binding supports all document formats via in-memory parsing

crates/kreuzberg-node/typescript/errors.ts

Lines changed: 2 additions & 3 deletions
@@ -380,7 +380,6 @@ export class PluginError extends KreuzbergError {
  * Error thrown when a required system dependency is missing.
  *
  * Missing dependency errors occur when external tools or libraries are not available, such as:
- * - LibreOffice (for DOC/PPT/XLS files)
  * - Tesseract OCR (for OCR processing)
  * - ImageMagick (for image processing)
  * - Poppler (for PDF rendering)
@@ -390,11 +389,11 @@ export class PluginError extends KreuzbergError {
  * import { extractFile, MissingDependencyError } from '@kreuzberg/node';
  *
  * try {
- *   const result = await extractFile('document.doc');
+ *   const result = await extractFile('document.pdf');
  * } catch (error) {
  *   if (error instanceof MissingDependencyError) {
  *     console.error('Missing dependency:', error.message);
- *     console.log('Please install LibreOffice to process DOC files');
+ *     console.log('Please install Tesseract OCR for image processing');
  *   }
  * }
  * ```

crates/kreuzberg-node/typescript/index.ts

Lines changed: 1 addition & 1 deletion
@@ -36,7 +36,7 @@
  *
  * ## Supported Formats
  *
- * - **Documents**: PDF, DOCX, PPTX, XLSX, DOC, PPT (with LibreOffice)
+ * - **Documents**: PDF, DOCX, PPTX, XLSX, DOC, PPT
  * - **Text**: Markdown, Plain Text, XML
  * - **Web**: HTML (converted to Markdown)
  * - **Data**: JSON, YAML, TOML

crates/kreuzberg-py/README.md

Lines changed: 1 addition & 1 deletion
@@ -38,7 +38,7 @@ result = await extract_file("document.pdf") # 140ms
 result = extract_file_sync("document.pdf") # 140ms
 ```

-**Why?** The subprocess call (pdftotext, libreoffice) accounts for 95-99% of time. With only one file, there's nothing to do concurrently, so async provides no benefit.
+**Why?** The extraction call accounts for 95-99% of time. With only one file, there's nothing to do concurrently, so async provides no benefit.

 ### Batch/Concurrent Processing
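
The reasoning in the **Why?** paragraph of this README hunk can be illustrated with a small sketch. It does not call Kreuzberg: `fake_extract` is a hypothetical stand-in that sleeps for a fixed time in place of a real extraction call, and `extract_one`, `extract_many`, and `timed` are illustrative names. Under those assumptions, awaiting a single file takes as long as the call itself, while several files awaited together finish in roughly the time of one.

```python
import asyncio
import time


async def fake_extract(path: str, seconds: float = 0.1) -> str:
    # Hypothetical stand-in for an extraction call: waits instead of parsing.
    await asyncio.sleep(seconds)
    return f"text of {path}"


async def extract_one(path: str) -> str:
    # One file: nothing else is running, so async saves no time here.
    return await fake_extract(path)


async def extract_many(paths: list[str]) -> list[str]:
    # Several files: the waits overlap, so total time is close to one call.
    return await asyncio.gather(*(fake_extract(p) for p in paths))


def timed(coro):
    """Run a coroutine to completion and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    _, single = timed(extract_one("a.pdf"))
    _, batch = timed(extract_many([f"{n}.pdf" for n in range(5)]))
    print(f"single: {single:.2f}s, batch of 5: {batch:.2f}s")
```

The sleep models waiting only. Real extraction work that keeps a CPU core busy would overlap only if the underlying call releases control while it runs, which this sketch does not attempt to show.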

crates/kreuzberg-wasm/src/errors.rs

Lines changed: 1 addition & 1 deletion
@@ -161,7 +161,7 @@ mod tests {

     #[wasm_bindgen_test]
     fn test_convert_error_missing_dependency_returns_jsvalue() {
-        let err = KreuzbergError::MissingDependency("libreoffice".to_string());
+        let err = KreuzbergError::MissingDependency("tesseract".to_string());
         let result = convert_error(err);

         assert!(!result.is_null());
