fix: MSG extraction hang on large attachments, remove LibreOffice dependency

- Replace msg_parser with direct CFB/OLE parsing for MSG files, fixing
  indefinite hang on large attachments (#372)
- Add lenient FAT padding for truncated MSG files from some Outlook versions
- Remove LibreOffice subprocess dependency for DOC/PPT extraction
  (now handled natively via OLE/CFB parsing)
- Remove msg_parser and LibreOffice-related error types, docs, and scripts
- Add 8 MSG integration tests covering unicode, attachments, truncated FAT,
  large files, and error handling
CHANGELOG.md
57 additions & 51 deletions
@@ -7,73 +7,88 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
7
7
8
8
---
9
9
10
-
## [4.2.15] - 2026-02-08
10
+
## [Unreleased]
11
11
12
12
### Added
13
13
14
-
#### Agent Skill for AI Coding Assistants
15
-
16
-
- **Agent Skill for document extraction**: Added `skills/kreuzberg/SKILL.md` following the [Agent Skills](https://agentskills.io) open standard, with comprehensive instructions for Python, Node.js, Rust, and CLI usage. Includes 8 detailed reference files covering API signatures, configuration, supported formats, plugins, and all language bindings. Works with Claude Code, Codex, Gemini CLI, Cursor, VS Code, Amp, Goose, Roo Code, and any compatible tool.
14
+
#### PaddleOCR Backend
15
+
- **PaddleOCR backend via ONNX Runtime**: New OCR backend (`kreuzberg-paddle-ocr`) using PaddlePaddle's PP-OCRv4 models converted to ONNX format, run via ONNX Runtime. Supports 6 languages (English, Chinese, Japanese, Korean, German, French) with automatic model downloading and caching. Provides superior CJK recognition compared to Tesseract.
16
+
- **PaddleOCR support in all bindings**: Available across Python, Rust, TypeScript/Node.js, Go, Java, PHP, Ruby, C#, and Elixir bindings via the `paddle-ocr` feature flag.
17
+
- **PaddleOCR CLI support**: The `kreuzberg-cli` binary supports `--ocr-backend paddle-ocr` for PaddleOCR extraction.
17
18
18
-
#### MIME Type Mappings
19
-
- Added `.docbook` (`application/docbook+xml`) and `.jats` (`application/x-jats+xml`) file extension mappings.
19
+
#### Unified OCR Element Output
20
+
- **Structured OCR element data**: Extraction results now include `OcrElement` data with bounding geometry (rectangles and quadrilaterals), per-element confidence scores, rotation information, and hierarchical levels (word, line, block, page). Available from both PaddleOCR and Tesseract backends.
20
21
21
-
### Added
22
+
#### Shared ONNX Runtime Discovery
23
+
- **`ort_discovery` module**: Finds ONNX Runtime shared libraries across platforms, shared between PaddleOCR and future ONNX-based backends.
22
24
23
-
#### OCR
24
-
- **PaddleOCR backend via ONNX Runtime**: Added a new OCR backend (`kreuzberg-paddle-ocr`) using PaddlePaddle's PP-OCRv4 models converted to ONNX format, run via ONNX Runtime. Supports 6 languages (English, Chinese, Japanese, Korean, German, French) with automatic model downloading and caching. Provides superior CJK recognition compared to Tesseract.
25
-
- **Unified OCR element output architecture**: Extraction results now include structured `OcrElement` data with bounding geometry (rectangles and quadrilaterals), per-element confidence scores, rotation information, and hierarchical levels (word, line, block, page). Available from both PaddleOCR and Tesseract backends.
26
-
- **PaddleOCR support in all bindings**: PaddleOCR is available across Python, Rust, TypeScript/Node.js, Go, Java, PHP, Ruby, C#, and Elixir bindings via the `paddle-ocr` feature flag.
27
-
- **PaddleOCR CLI support**: The `kreuzberg-cli` binary supports `--ocr-backend paddle-ocr` for PaddleOCR extraction.
28
-
- **Shared ORT discovery**: Added `ort_discovery` module for finding ONNX Runtime shared libraries across platforms, shared between PaddleOCR and future ONNX-based backends.
29
-
- **PaddleOCR model setup GitHub Action**: Added `.github/actions/setup-paddle-ocr-models/` action for CI pipelines to download and cache PaddleOCR model files.
25
+
#### Document Structure Output
26
+
- **`DocumentStructure` support across all bindings**: Added structured document output with `include_document_structure` configuration option across Python, TypeScript/Node.js, Go, Java, PHP, Ruby, C#, Elixir, and WASM bindings.
30
27
31
-
#### CI
32
-
- **PaddleOCR CI integration**: Added PaddleOCR to the CI/publish pipelines with dedicated test jobs and model caching.
28
+
#### Native DOC/PPT Extraction
29
+
- **OLE/CFB-based extraction**: Added native DOC and PPT extraction via OLE/CFB binary parsing. Legacy Office formats no longer require any external tools.
33
30
34
31
#### musl Linux Support
35
32
- **Re-enabled musl targets**: Added `x86_64-unknown-linux-musl` and `aarch64-unknown-linux-musl` targets for CLI binaries, Python wheels (musllinux), and Node.js native bindings. Resolves glibc 2.38+ requirement for prebuilt CLI binaries on older distros like Ubuntu 22.04 (#364).
36
-
- **musl CI workflows**: Added dedicated `ci-musl.yaml` workflow for CLI musl build validation with Alpine container smoke tests, and musllinux Python wheel builds to `ci-python.yaml`.
37
-
- **PDFium musl awareness**: Build script now downloads musl-specific PDFium binaries and uses `libstdc++` consistently for all Linux targets (including musl).
38
-
- **musl C++ cross-compilation**: Added `resolve_cxx_compiler()` and `create_musl_cxx_wrapper()` to `kreuzberg-tesseract` build script for correct C++ header resolution when cross-compiling from glibc host to musl target. Skips `-ldl` linking on musl (not available/needed).
39
-
40
-
#### Build System
41
-
- **Tesseract 5.5.2**: Bumped vendored Tesseract from 5.5.1 to 5.5.2 with `BUILD_TESSERACT_BINARY=OFF` to skip unnecessary binary compilation.
42
-
- **Leptonica 1.87.0**: Bumped vendored Leptonica from 1.86.0 to 1.87.0.
43
-
- **ONNX Runtime 1.24.1**: Bumped ONNX Runtime from 1.23.2 to 1.24.1.
44
-
- **Dead code cleanup**: Removed unused EMSDK constants and `apply_patches()` function from `kreuzberg-tesseract` build script.
45
-
46
-
### Removed
47
-
48
-
#### Node.js Bindings
49
-
- **Guten OCR references**: Removed all references to the unused Guten OCR backend. Renamed `KREUZBERG_DEBUG_GUTEN` env var to `KREUZBERG_DEBUG_OCR`.
50
-
51
-
#### PHP Bindings
52
-
- **Guten OCR backend option**: Removed `'guten'` from the documented backend choices in `OcrConfig`.
53
33
54
34
### Fixed
55
35
56
-
#### PaddleOCR Recognition Model Shape Inference
57
-
- Fixed PaddleOCR recognition model (`en_PP-OCRv4_rec_infer.onnx`) failing to load with `ShapeInferenceError` on ONNX Runtime 1.23.x. A `Squeeze` node incorrectly reduced a rank-1 tensor to a scalar before a `Concat` operation. The fixed model has been re-uploaded to the HuggingFace model repository.
36
+
#### MSG Extraction Hang on Large Attachments (#372)
37
+
- Fixed `.msg` (Outlook) extraction hanging indefinitely on files with large attachments. Replaced the `msg_parser` crate with direct OLE/CFB parsing using the `cfb` crate — attachment binary data is now read directly without hex-encoding overhead.
38
+
- Added lenient FAT padding for MSG files with truncated sector tables produced by some Outlook versions.
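The lenient FAT padding described above can be pictured roughly as follows. This is a minimal Python sketch under stated assumptions, not Kreuzberg's Rust code: the helper name `pad_fat`, the 512-byte sector size, and the choice to fill missing entries with the CFB free-sector marker are all illustrative.

```python
import struct

FREESECT = 0xFFFFFFFF  # CFB marker for an unallocated sector
ENTRY_SIZE = 4         # each FAT entry is a little-endian 32-bit integer


def pad_fat(fat_bytes: bytes, sector_size: int = 512) -> list[int]:
    """Decode one FAT sector, padding a truncated one with FREESECT entries.

    Hypothetical helper: a strict parser would reject a short sector,
    while this one tolerates it.
    """
    expected = sector_size // ENTRY_SIZE
    # Ignore a trailing partial entry, if any, before decoding.
    usable = len(fat_bytes) - (len(fat_bytes) % ENTRY_SIZE)
    count = usable // ENTRY_SIZE
    entries = list(struct.unpack(f"<{count}I", fat_bytes[:usable]))
    # Fill the missing tail instead of failing on the short table.
    entries.extend([FREESECT] * (expected - len(entries)))
    return entries
```

Padding with the free-sector marker means the missing entries are never followed as part of a sector chain, so streams that live entirely in the intact part of the table stay readable.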
39
+
40
+
#### Rotated PDF Text Extraction
41
+
- Fixed text extraction returning empty content for PDFs with 90° or 270° page rotation. Kreuzberg now strips `/Rotate` entries from page dictionaries before loading, restoring correct text extraction for all rotation angles.
58
42
59
43
#### CSV and Excel Extraction Quality
60
44
- Fixed CSV extraction producing near-zero quality scores (0.024) by outputting proper delimited text instead of debug format.
- Added native DOC and PPT extraction via OLE/CFB parsing, replacing the LibreOffice subprocess dependency for legacy Office formats.
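To illustrate the CSV fix listed in this section, the difference between a debug representation and proper delimited text can be sketched as below. This is an assumed illustration in Python using the standard `csv` module; the function name `rows_to_text` is hypothetical and does not come from the Kreuzberg codebase.

```python
import csv
import io


def rows_to_text(rows, delimiter=","):
    """Render parsed rows as delimited text rather than a debug dump."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter=delimiter, lineterminator="\n")
    writer.writerows(rows)
    return buf.getvalue()


rows = [["name", "qty"], ["apple", "3"]]
debug_output = str(rows)          # brackets and quotes leak into the text
clean_output = rows_to_text(rows)  # one record per line, fields delimited
```

Text in the second form contains only the cell values and delimiters, which is presumably why a quality score computed on it is higher than on the debug form.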
65
-
66
47
#### XML Extraction Quality
67
48
- Improved XML text extraction to better handle namespaced elements, CDATA sections, and mixed content, improving quality scores.
68
49
69
50
#### WASM Table Extraction
70
51
- Fixed WASM adapter not recognizing `page_number` field (snake_case) from Rust FFI, causing table data to be silently dropped in Deno and Cloudflare Workers tests.
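The failure mode here is a key-naming mismatch: the adapter looked for one spelling of the field while the FFI layer emitted `page_number`. A minimal sketch of a tolerant lookup, written in Python for illustration only (the real adapter is not shown in this diff, and both the helper name `read_page_number` and the camelCase alternative `pageNumber` are assumptions):

```python
from typing import Optional


def read_page_number(table: dict) -> Optional[int]:
    """Return the table's page number under either key spelling."""
    # Check the snake_case key from the Rust side first, then the
    # assumed camelCase variant.
    for key in ("page_number", "pageNumber"):
        if key in table:
            return table[key]
    return None
```

Returning `None` for a missing key, rather than discarding the table, keeps a naming mismatch visible instead of silently dropping data.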
71
52
72
-
#### Ruby CI ONNX Runtime Discovery
73
-
- Fixed Ruby E2E tests failing with `dlopen failed` for `libonnxruntime.so` by adding ONNX Runtime setup and library path export to the Ruby CI test job.
53
+
#### PaddleOCR Recognition Model
54
+
- Fixed PaddleOCR recognition model (`en_PP-OCRv4_rec_infer.onnx`) failing to load with `ShapeInferenceError` on ONNX Runtime 1.23.x.
55
+
- Fixed incorrect detection model filename in Docker and CI action (`en_PP-OCRv4_det_infer.onnx` → `ch_PP-OCRv4_det_infer.onnx`).
- Bumped ONNX Runtime from 1.23.2 to 1.24.1 across CI, Docker images, and documentation.
64
+
- Bumped vendored Tesseract from 5.5.1 to 5.5.2.
65
+
- Bumped vendored Leptonica from 1.86.0 to 1.87.0.
66
+
67
+
### Removed
68
+
69
+
#### LibreOffice Dependency
70
+
- **LibreOffice is no longer required**: Legacy .doc and .ppt files are now extracted natively via OLE/CFB parsing. LibreOffice has been removed from Docker images, CI pipelines, and system dependency requirements, reducing the full Docker image size by ~500-800MB. Users on Kreuzberg <4.3 still need LibreOffice for these formats.
71
+
72
+
#### `msg_parser` Dependency
73
+
- Replaced `msg_parser` crate with direct CFB parsing for MSG extraction. Eliminates hex-encoding overhead and reduces dependency count.
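The hex-encoding overhead mentioned above is easy to quantify: a hex string uses two characters per byte, so an attachment carried as hex takes twice the space and must be decoded again before use. A small Python sketch (the function name `hex_roundtrip_cost` is hypothetical, and this only illustrates the size cost, not how `msg_parser` works internally):

```python
def hex_roundtrip_cost(payload: bytes) -> float:
    """Return the size ratio of hex-encoded text to the raw bytes.

    The payload must be non-empty.
    """
    encoded = payload.hex()
    # Decoding is an extra pass that reading the bytes directly avoids.
    assert bytes.fromhex(encoded) == payload
    return len(encoded) / len(payload)


ratio = hex_roundtrip_cost(bytes(range(256)) * 16)
```

For large attachments, that doubling plus the extra decode pass is the kind of cost that reading the stream bytes directly sidesteps.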
74
+
75
+
#### Guten OCR Backend
76
+
- Removed all references to the unused Guten OCR backend from Node.js and PHP bindings. Renamed `KREUZBERG_DEBUG_GUTEN` env var to `KREUZBERG_DEBUG_OCR`.
74
77
75
-
#### Java E2E Test Compilation
76
-
- Fixed Java E2E helper compilation errors caused by `Metadata` type not being directly castable to `Map` and `Element.getType()` method not existing. Updated to use `Metadata.getAdditional()` and `Element.getElementType()`.
78
+
---
79
+
80
+
## [4.2.15] - 2026-02-08
81
+
82
+
### Added
83
+
84
+
#### Agent Skill for AI Coding Assistants
85
+
86
+
- **Agent Skill for document extraction**: Added `skills/kreuzberg/SKILL.md` following the [Agent Skills](https://agentskills.io) open standard, with comprehensive instructions for Python, Node.js, Rust, and CLI usage. Includes 8 detailed reference files covering API signatures, configuration, supported formats, plugins, and all language bindings. Works with Claude Code, Codex, Gemini CLI, Cursor, VS Code, Amp, Goose, Roo Code, and any compatible tool.
87
+
88
+
#### MIME Type Mappings
89
+
- Added `.docbook` (`application/docbook+xml`) and `.jats` (`application/x-jats+xml`) file extension mappings.
90
+
91
+
### Fixed
77
92
78
93
#### ODT List and Section Extraction
79
94
- Fixed ODT extractor not handling `text:list` and `text:section` elements. Documents containing bulleted/numbered lists or sections returned empty content.
@@ -99,17 +114,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
99
114
#### PDF Error Handling Regression
100
115
- Reverted incorrect change from v4.2.14 that silently returned empty results for corrupted/malformed PDFs instead of propagating errors. Corrupted PDFs now correctly return `PdfError::InvalidPdf` and password-protected PDFs return `PdfError::PasswordRequired` as expected.
101
116
102
-
#### PaddleOCR Model URLs
103
-
- Fixed incorrect detection model filename in Docker and CI action (`en_PP-OCRv4_det_infer.onnx` → `ch_PP-OCRv4_det_infer.onnx`).
- Bumped ONNX Runtime from 1.23.2 to 1.24.1 across CI, Docker images, and documentation. Minimum supported ORT version is 1.23+.
112
-
113
119
#### API Parity
114
120
- Added `security_limits` field to all 9 language bindings (TypeScript, Go, Python, Ruby, PHP, Java, C#, WASM, Elixir) for API parity with Rust core `ExtractionConfig`.
- **[Docker](https://docs.kreuzberg.dev/guides/docker/)** – Official images with API, CLI, and MCP server modes (Core: ~1.0-1.3GB, Full: ~1.5-2.1GB with LibreOffice)
94
+
- **[Docker](https://docs.kreuzberg.dev/guides/docker/)** – Official images with API, CLI, and MCP server modes (Core: ~1.0-1.3GB, Full: ~1.0-1.3GB with OCR + legacy format support)
95
95
96
96
**Command-Line:**
97
97
- **[CLI](https://docs.kreuzberg.dev/cli/usage/)** – Cross-platform binary, batch processing, MCP server mode
crates/kreuzberg-py/README.md
1 addition & 1 deletion
@@ -38,7 +38,7 @@ result = await extract_file("document.pdf") # 140ms
38
38
result = extract_file_sync("document.pdf") # 140ms
39
39
```
40
40
41
-
**Why?** The subprocess call (pdftotext, libreoffice) accounts for 95-99% of time. With only one file, there's nothing to do concurrently, so async provides no benefit.
41
+
**Why?** The extraction call accounts for 95-99% of time. With only one file, there's nothing to do concurrently, so async provides no benefit.