Commit a7f9bde

feat(docx): complete element output, DocumentStructure, pages, OCR, and critical review fixes
Add full DOCX extraction pipeline with DocumentStructure generation, per-page content splitting, OCR on embedded images, typed metadata fields, style-based heading detection, headers/footers/footnotes in markdown, table formatting with vertical merge support, and drawing placeholders. Optimize parser and extractor: eliminate 3x parse duplication via parse_docx_core helper, single-pass Run::to_markdown builder, remove unnecessary clones, use Cow/borrow patterns, deduplicate document structure code, safe element indexing, in-place output trimming. Remove dead code: Document.lists, ListItem, process_lists(), HeaderFooter::extract_text().
1 parent 35d6af6 commit a7f9bde

6 files changed: +2022 -301 lines changed

CHANGELOG.md

Lines changed: 30 additions & 0 deletions
@@ -9,6 +9,36 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [Unreleased]

### Added

#### DOCX Full Extraction Pipeline (#387)
- **DocumentStructure generation**: Builds hierarchical document tree with heading-based sections, paragraphs, lists, tables, images, headers/footers, and footnotes/endnotes when `include_document_structure = true`.
- **Pages field population**: Splits extracted text into per-page `PageContent` entries using detected page break boundaries, with tables and images assigned to correct pages.
- **OCR on embedded images**: Runs secondary OCR on extracted DOCX images when OCR is configured, following the PPTX pattern.
- **Image extraction with page assignment**: Drawing image placeholders in markdown output enable byte-position-based page number assignment for extracted images.
- **Typed metadata fields**: `title`, `subject`, `authors`, `created_by`, `modified_by`, `created_at`, `modified_at`, `language`, and `keywords` are now populated as first-class `Metadata` fields instead of only appearing in the `additional` map.
- **FormatMetadata::Docx**: Structured format metadata with `core_properties`, `app_properties`, and `custom_properties` available via `metadata.format`.
- **Style-based heading detection**: Uses `StyleCatalog` with `outline_level` and inheritance chain walking for accurate heading level resolution, with string-matching fallback.
- **Headers, footers, and footnote references**: Headers/footers included in markdown with `---` separators; `[^N]` inline footnote/endnote references rendered in text.
- **Markdown formatting**: Bold (`**`), italic (`*`), underline (`<u>`), strikethrough (`~~`), and hyperlinks rendered as markdown.
- **Table formatting metadata**: Vertical merge (`v_merge`) handled correctly, `grid_span` for horizontal merging, `is_header` row detection.
- **Drawing image placeholders**: `![alt](image_N)` placeholders in markdown output for embedded images.
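
The style-based heading detection entry above describes walking a style's inheritance chain until an `outline_level` is found, then falling back to string matching. The sketch below is only an illustration of that idea, not the code from this commit: the `Style` fields, the `heading_level` method, the `level_from_name` helper, and the `style` constructor are all hypothetical, and the real `StyleCatalog` may be shaped quite differently. It assumes `outline_level` is stored 0-based and mapped to a 1-based heading level.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical, simplified style record (assumption: not the real type).
struct Style {
    name: String,
    based_on: Option<String>,
    /// Assumed 0-based: level 0 corresponds to a top-level heading.
    outline_level: Option<u8>,
}

/// Hypothetical catalog keyed by style id.
struct StyleCatalog {
    styles: HashMap<String, Style>,
}

/// Convenience constructor used only for illustration.
fn style(name: &str, based_on: Option<&str>, outline_level: Option<u8>) -> Style {
    Style {
        name: name.to_string(),
        based_on: based_on.map(str::to_string),
        outline_level,
    }
}

impl StyleCatalog {
    /// Returns a 1-based heading level, or `None` if the style is not a heading.
    fn heading_level(&self, style_id: &str) -> Option<u8> {
        // Walk the `based_on` chain; `visited` guards against cyclic styles.
        let mut visited = HashSet::new();
        let mut current = Some(style_id);
        while let Some(id) = current {
            if !visited.insert(id) {
                break;
            }
            let Some(style) = self.styles.get(id) else { break };
            if let Some(level) = style.outline_level {
                return Some(level.saturating_add(1));
            }
            current = style.based_on.as_deref();
        }
        // Fallback: match names such as "Heading 2" or ids such as "Heading2".
        self.styles
            .get(style_id)
            .and_then(|s| level_from_name(&s.name))
            .or_else(|| level_from_name(style_id))
    }
}

/// Parses "heading N" (case-insensitive, optional space) into N for 1..=9.
fn level_from_name(name: &str) -> Option<u8> {
    let lower = name.to_ascii_lowercase();
    let rest = lower.strip_prefix("heading")?;
    let level: u8 = rest.trim().parse().ok()?;
    (1..=9).contains(&level).then_some(level)
}
```

Checking the explicit outline level first means a custom style derived from a built-in heading still resolves correctly even when its own name says nothing about headings; the name match only applies when the chain yields no level.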

### Changed

#### DOCX Extractor Performance & Code Quality
- **Eliminated 3x code duplication**: Extracted `parse_docx_core()` helper to deduplicate parsing logic across tokio/non-tokio cfg branches.
- **Removed unnecessary clones**: Metadata structs (core/app/custom properties) borrowed then moved instead of cloned; drawings and image relationships only cloned when image extraction is enabled.
- **Optimized Run::to_markdown()**: Single-pass string builder with pre-calculated capacity replaces clone + repeated `format!` calls on the hot path.
- **In-place output trimming**: `to_markdown()` trims in-place instead of allocating a new String via `trim().to_string()`.
- **Removed `into_owned()` on XML text decode**: Uses `Cow` directly from `e.decode()` instead of forcing heap allocation.
- **`write!`/`writeln!` for string building**: Footnote definitions and image placeholders use `write!` to avoid intermediate String allocations.
- **Safe element indexing**: `to_markdown()` uses `.get()` with `else { continue }` instead of direct indexing to prevent potential panics.
- **Deduplicated document structure code**: Header/footer loops and footnote/endnote loops consolidated using iterators.
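
Several of the entries above name general Rust patterns: a single-pass builder with pre-calculated capacity, in-place trimming, `write!` into an existing `String`, and `.get()` with `else { continue }`. The sketch below shows one way these could look together. It is written under assumptions and is not taken from the commit: the `Run` fields, `trim_in_place`, and `render` are hypothetical names, and the footnote layout is illustrative only.

```rust
use std::fmt::Write;

/// Hypothetical, simplified run; the real `Run` likely carries more fields.
struct Run {
    text: String,
    bold: bool,
    italic: bool,
    strikethrough: bool,
}

impl Run {
    /// Builds the markdown in one pass into a buffer sized up front.
    fn to_markdown(&self) -> String {
        // Markers in opening order; they are closed in reverse order.
        let markers: [(&str, bool); 3] = [
            ("**", self.bold),
            ("*", self.italic),
            ("~~", self.strikethrough),
        ];
        // Each active marker is written twice (open and close).
        let extra: usize = markers
            .iter()
            .filter(|(_, on)| *on)
            .map(|(m, _)| 2 * m.len())
            .sum();
        let mut out = String::with_capacity(self.text.len() + extra);
        for (m, on) in markers.iter() {
            if *on {
                out.push_str(m);
            }
        }
        out.push_str(&self.text);
        for (m, on) in markers.iter().rev() {
            if *on {
                out.push_str(m);
            }
        }
        out
    }
}

/// Trims leading and trailing whitespace without allocating a new `String`.
fn trim_in_place(s: &mut String) {
    let end = s.trim_end().len();
    s.truncate(end);
    let start = s.len() - s.trim_start().len();
    s.drain(..start);
}

/// Renders runs in the given order, skipping indices that are out of range.
fn render(runs: &[Run], order: &[usize], footnotes: &[&str]) -> String {
    let mut out = String::new();
    for &idx in order {
        // `.get()` returns `None` for a bad index instead of panicking.
        let Some(run) = runs.get(idx) else { continue };
        out.push_str(&run.to_markdown());
    }
    for (n, note) in footnotes.iter().enumerate() {
        // `write!` appends directly; writing to a `String` cannot fail.
        let _ = write!(out, "\n[^{}]: {}", n + 1, note);
    }
    trim_in_place(&mut out);
    out
}
```

The trade-off in `trim_in_place` is that `drain(..start)` shifts the remaining bytes left, which is still cheaper than a fresh allocation when the leading whitespace is short.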

### Removed
- **Dead code cleanup**: Removed unused `Document.lists` field, `ListItem` struct, `process_lists()` method, and `HeaderFooter::extract_text()` method.

---

## [4.3.2] - 2026-02-13
