v6.0.1 - 02.01.2026 #72

harshankur · 2026-01-02T14:55:11Z

harshankur
Jan 2, 2026
Maintainer

v6.0.1 - 02.01.2026

Changes: v5.2.2..v6.0.1

We are thrilled to announce the release of officeParser v6.0.1, a major overhaul that transforms the library from a simple text extractor into a powerful, format-agnostic document analysis engine.

🌟 Key Highlights (v6.0.0+)

🌳 Abstract Syntax Tree (AST) Output

The core parsing engine now produces a rich, hierarchical Abstract Syntax Tree. This allows you to traverse documents structurally—accessing paragraphs, headings, tables, and lists with their original nesting and metadata preserved.
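
As a rough illustration of what structural traversal can look like, here is a minimal sketch. The `SketchNode` shape (`type`, `text`, `children`) and the helper names are assumptions made for this example. The release notes do not spell out the actual `OfficeParserAST` interfaces, so check the library's type definitions for the real field names.

```typescript
// Hypothetical node shape for illustration only; the real OfficeParserAST
// interfaces may use different field names.
interface SketchNode {
  type: string; // e.g. "heading", "paragraph", "table", "list"
  text?: string;
  children?: SketchNode[];
}

// Depth-first walk that calls `visit` for every node with its nesting depth.
function walkTree(
  node: SketchNode,
  visit: (node: SketchNode, depth: number) => void,
  depth: number = 0
): void {
  visit(node, depth);
  for (const child of node.children ?? []) {
    walkTree(child, visit, depth + 1);
  }
}

// Collect the text of all nodes of one type, in document order.
function collectText(root: SketchNode, type: string): string[] {
  const out: string[] = [];
  walkTree(root, (node) => {
    if (node.type === type && node.text !== undefined) {
      out.push(node.text);
    }
  });
  return out;
}

// Hand-built sample tree standing in for a parsed document.
const sampleDoc: SketchNode = {
  type: "document",
  children: [
    { type: "heading", text: "Quarterly Report" },
    { type: "paragraph", text: "Revenue grew." },
    {
      type: "list",
      children: [
        { type: "paragraph", text: "Item one" },
        { type: "paragraph", text: "Item two" },
      ],
    },
  ],
};

const headings = collectText(sampleDoc, "heading");
const paragraphs = collectText(sampleDoc, "paragraph");
```

Because the walk is depth-first, nested list items come out in the same order they appear in the document.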

🖼️ OCR & Attachment Extraction

Integrated OCR: Use Tesseract.js to extract text from images and scanned PDF documents automatically. (Fixes #57: "Doesn't work for PDFs that are scans of paper documents (each page is an image)")
Base64 Attachments: Extract images and charts directly as Base64 strings from all supported formats. (Fixes #68: "Feature to extract images from Documents")

📄 New Format Support & Improvements

RTF Support: Added full support for Rich Text Format (.rtf) files, including complex nested tables and lists. (Fixes #54: "Add support for *.rtf files")
Hierarchical PDF Parsing: PDFs are now split into logical page nodes, matching the structure of slides and sheets.
PowerPoint & Excel Nodes: Introduced dedicated slide and sheet delimiter nodes for cleaner visualization and processing. (Fixes #64: "Feature Request: Add slide delimiter support for PowerPoint files")
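
One way page, slide, and sheet nodes might be used is to gather text per section. The sketch below assumes that such nodes appear as top-level children with a `type` of `"page"`, `"slide"`, or `"sheet"`. Those type strings and the `SectionSketchNode` shape are guesses for illustration, not the library's documented schema.

```typescript
// Hypothetical node shape; the real AST may name these fields differently.
interface SectionSketchNode {
  type: string;
  text?: string;
  children?: SectionSketchNode[];
}

interface SectionText {
  type: string;  // "page", "slide", or "sheet" (assumed type names)
  index: number; // 1-based position among sections of the same type
  text: string;
}

// Assumed names for the section nodes described in the release notes.
const SECTION_TYPES: string[] = ["page", "slide", "sheet"];

// Join the text of a node and all of its descendants with newlines.
function nodeText(node: SectionSketchNode): string {
  const parts: string[] = [];
  if (node.text !== undefined) {
    parts.push(node.text);
  }
  for (const child of node.children ?? []) {
    const childText = nodeText(child);
    if (childText !== "") {
      parts.push(childText);
    }
  }
  return parts.join("\n");
}

// Produce one entry per top-level section node, skipping everything else.
function textBySection(root: SectionSketchNode): SectionText[] {
  const sections: SectionText[] = [];
  const counters: Record<string, number> = {};
  for (const child of root.children ?? []) {
    if (SECTION_TYPES.indexOf(child.type) === -1) {
      continue;
    }
    counters[child.type] = (counters[child.type] ?? 0) + 1;
    sections.push({
      type: child.type,
      index: counters[child.type],
      text: nodeText(child),
    });
  }
  return sections;
}

// Hand-built sample standing in for a parsed two-slide presentation.
const sampleDeck: SectionSketchNode = {
  type: "document",
  children: [
    {
      type: "slide",
      children: [
        { type: "heading", text: "Title" },
        { type: "paragraph", text: "Subtitle" },
      ],
    },
    { type: "slide", children: [{ type: "paragraph", text: "Agenda" }] },
  ],
};

const deckSections = textBySection(sampleDeck);
```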

🔗 Enhanced Hyperlinks

Extract Link Addresses: External hyperlinks are now correctly extracted and tagged in the AST. (Fixes #50: "Extract link address")
Clickable Visualizer Links: The built-in visualizer now renders external links as clickable <a> tags.

🛠️ Bug Fixes & Refinements

Word List Preservation: Fixed issues where numbered elements and indentation levels were lost in .docx parsing. (Fixes #29: "Numbered elements aren't preserved as they show up in .docx files")
Robust PDF Parsing: Added graceful error handling for corrupt PDF files and bad XRef entries, preventing parser crashes. (Fixes #44: "UnknownErrorException bad XRef entry")
Formatting Parity: Expanded support for bold, italic, underline, colors, and fonts across all parsers (Docx, Pptx, Xlsx, Odp, Odt, Ods, Pdf, Rtf).
Strict Typing: Full TypeScript rewrite providing comprehensive interfaces for the entire AST structure.

🎨 Interactive AST Visualizer (v6.0.1 Fix)

The Live Visualizer has been revamped and fixed for stable deployment:

Color-Coded Sections: Blue for Pages, Green for Sheets, and Orange for Slides.
Premium UI: New card-based layout with interactive previews and deep-linked metadata.
Deployment: Migrated to the /docs folder for standard GitHub Pages hosting at the repository's root. (Fixed in v6.0.1)

⚠️ Breaking Changes

The library now returns an OfficeParserAST object instead of a raw string.
To get the old behavior (plain text), call ast.toText() on the returned object.
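
For code that has to keep working while a project migrates, a small shim can accept either return type. This is only a sketch: `toText()` is the one method named in these notes, while `TextConvertible` and `toPlainText` are hypothetical helpers introduced here, and the sample values stand in for real parser results.

```typescript
// Minimal view of the v6 result: only toText() is taken from the release notes.
interface TextConvertible {
  toText(): string;
}

// Accept either a v5-style raw string or a v6-style AST object
// and always hand back plain text.
function toPlainText(result: string | TextConvertible): string {
  return typeof result === "string" ? result : result.toText();
}

// Stand-ins for parser results; a real caller would pass the value
// returned by the library instead.
const legacyResult: string = "Hello from v5";
const astLikeResult: TextConvertible = { toText: () => "Hello from v6" };

const legacyText = toPlainText(legacyResult);
const astText = toPlainText(astLikeResult);
```

Once every call site is on v6, the shim can be dropped in favor of calling `toText()` directly.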

This discussion was created from the release v6.0.1 - 02.01.2026.
