Commit 27564aa

fix: DOCX formatted markdown output, typst table extraction, clippy fixes
- DOCX extraction now produces properly formatted markdown: bold, italic, underline, strikethrough, hyperlinks, heading hierarchy, bullet/numbered lists with nesting, and interleaved table rendering (#376)
- Fix heading level overflow: Heading5+ clamped at h6
- Fix table cell formatting stripped in ExtractionResult tables
- Fix typst extract_table_content double-counting opening parenthesis
- Fix clippy collapsible_if in email.rs
- Add 16 DOCX formatting integration tests
- Add missing typst pandoc baseline files
- Regenerate DOCX ground truth files
1 parent e8c3607 commit 27564aa

34 files changed (+2111, -305 lines)

CHANGELOG.md

Lines changed: 11 additions & 0 deletions
@@ -50,6 +50,17 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 #### WASM Table Extraction
 - Fixed WASM adapter not recognizing `page_number` field (snake_case) from Rust FFI, causing table data to be silently dropped in Deno and Cloudflare Workers tests.
 
+#### DOCX Formatting Output (#376)
+- Fixed DOCX extraction producing plain text instead of formatted markdown. Bold, italic, underline, strikethrough, and hyperlinks are now rendered with proper markdown markers (`**bold**`, `*italic*`, `~~strikethrough~~`, `[text](url)`).
+- Fixed heading hierarchy: Title style maps to `#`, Heading1 to `##`, through Heading5+ clamped at `######`.
+- Fixed bullet lists (`- `), numbered lists (`1. `), and nested list indentation (2-space per level).
+- Fixed tables missing from markdown output. Tables are now interleaved with paragraphs in document order and rendered as markdown pipe tables.
+- Fixed table cell formatting being stripped — bold/italic inside table cells is now preserved.
+- Added 16 integration tests covering formatting, headings, lists, tables, and document structure.
+
+#### Typst Table Content Extraction
+- Fixed Typst `extract_table_content` double-counting opening parenthesis, which caused the table parser to consume all remaining document content after a `#table()` call.
+
 #### PaddleOCR Recognition Model
 - Fixed PaddleOCR recognition model (`en_PP-OCRv4_rec_infer.onnx`) failing to load with `ShapeInferenceError` on ONNX Runtime 1.23.x.
 - Fixed incorrect detection model filename in Docker and CI action (`en_PP-OCRv4_det_infer.onnx` → `ch_PP-OCRv4_det_infer.onnx`).
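
The heading rule in the DOCX changelog entry (Title maps to `#`, Heading1 to `##`, Heading5 and above clamped at `######`) can be sketched roughly as follows. This is only an illustration of the mapping as described; `heading_prefix` is a hypothetical helper name and the style-name parsing is an assumption, not the code from this commit.

```rust
/// Hypothetical helper: map a DOCX paragraph style name to a markdown
/// heading prefix, following the rule stated in the changelog entry.
fn heading_prefix(style: &str) -> Option<String> {
    let level = if style == "Title" {
        1
    } else {
        // Assumed style naming: "Heading1", "Heading2", ...
        let n: usize = style.strip_prefix("Heading")?.parse().ok()?;
        if n == 0 {
            return None;
        }
        // Heading1 -> 2, ..., Heading5 and above -> 6 (markdown has no h7).
        n.saturating_add(1).min(6)
    };
    Some("#".repeat(level))
}
```

Clamping with `min(6)` keeps deep Word headings as valid markdown headings instead of emitting seven or more `#` characters, which markdown renderers treat as plain text.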

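The Typst fix concerns parenthesis balancing around a `#table(...)` call. A minimal sketch of that kind of scan is shown below, assuming the opening `(` is counted exactly once; `table_args` is a hypothetical name, and the sketch deliberately ignores parentheses inside string literals. One plausible way double-counting can arise (an assumption, not taken from the commit) is starting the depth at 1 and then also scanning over the opening `(`, so the depth never returns to zero and the scan runs to the end of the document.

```rust
/// Hypothetical sketch: return the text between the parentheses of the
/// first `#table(...)` call, or None if the call is missing or unclosed.
fn table_args(src: &str) -> Option<&str> {
    let marker = "#table(";
    // Begin scanning just AFTER the opening parenthesis...
    let start = src.find(marker)? + marker.len();
    // ...and count that parenthesis exactly once, here.
    let mut depth = 1usize;
    for (i, ch) in src[start..].char_indices() {
        match ch {
            '(' => depth += 1,
            ')' => {
                depth -= 1;
                if depth == 0 {
                    // Matching close found: stop before consuming the rest.
                    return Some(&src[start..start + i]);
                }
            }
            _ => {}
        }
    }
    None
}
```

With balanced input the scan stops at the call's own closing parenthesis, so content after the table is left for the rest of the parser.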