@@ -21,25 +21,42 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
   - Comprehensive error handling for invalid inputs
 
 #### Core
-- **Element-based output format**: New `OutputFormat::ElementBased` option provides Unstructured.io-compatible semantic element extraction
+- **Djot markup format support**: New `.djot` file extraction with comprehensive Djot syntax support
+  - Full parser implementation with structured representation via `DjotContent` type
+  - Supports headings, paragraphs, lists (ordered, unordered, task, definition), tables, code blocks, emphasis, links, images, footnotes, math expressions
+  - YAML frontmatter extraction with metadata preservation
+  - Shared frontmatter utilities between Markdown and Djot extractors
+  - Feature-gated behind `djot` feature flag (enabled by default)
+  - 39 comprehensive tests covering Unicode, tables, roundtrip conversion, and edge cases
+
+- **Content output format configuration**: New `ContentFormat` enum for configurable text output formatting
+  - Converts extracted content from ANY file format to Plain, Markdown, Djot, or HTML
+  - Post-processing pipeline applies format transformation after extraction
+  - Configuration via `config.content_format` field in `ExtractionConfig` (defaults to `Plain`)
+  - CLI support with `--content-format` flag and `KREUZBERG_CONTENT_FORMAT` environment variable
+  - Independent from `result_format` (Unified vs ElementBased structure)
+
+- **Element-based output format**: New `ResultFormat::ElementBased` option provides Unstructured.io-compatible semantic element extraction
   - Extracts structured elements: titles, paragraphs, lists, tables, images, page breaks, headings, code blocks, block quotes, headers, footers
   - Each element includes rich metadata: bounding boxes, page numbers, confidence scores, hierarchy information
   - Transformation pipeline converts unified output to element-based format via `extraction::transform` module
   - Added `Element`, `ElementType`, `ElementMetadata`, and `BoundingBox` types to core types module
   - Supports PDF hierarchy detection for semantic heading levels
-  - Configuration via `config.output_format` field (defaults to `Unified`)
+  - Configuration via `config.result_format` field (defaults to `Unified`)
 
 #### Language Bindings
-- **Python**: Element-based output support with full type hints
-  - New `output_format` parameter in extraction config accepting `"unified"` or `"element_based"`
-  - `Element`, `ElementType`, `ElementMetadata`, `BoundingBox` types exported from `kreuzberg.types`
-  - Result includes `elements` field when using element-based format
+- **Python**: Enhanced output configuration with full type hints
+  - Content format support: `content_format` parameter accepting `"plain"`, `"markdown"`, `"djot"`, or `"html"`
+  - Element-based output: `result_format` parameter accepting `"unified"` or `"element_based"`
+  - `Element`, `ElementType`, `ElementMetadata`, `BoundingBox`, `DjotContent` types exported from `kreuzberg.types`
+  - Result includes `elements` field when using element-based format, `djot_content` when available
   - Compatible with Unstructured.io API for migration
 
-- **TypeScript/Node.js**: Element-based output with strict TypeScript interfaces
-  - `Element`, `ElementType`, `ElementMetadata`, `BoundingBox` interfaces in `@kreuzberg/core`
-  - `outputFormat: "unified" | "element_based"` configuration option
-  - Result type includes optional `elements` array
+- **TypeScript/Node.js**: Enhanced output configuration with strict TypeScript interfaces
+  - Content format support: `contentFormat: "plain" | "markdown" | "djot" | "html"` option
+  - Element-based output: `resultFormat: "unified" | "element_based"` option
+  - `Element`, `ElementType`, `ElementMetadata`, `BoundingBox`, `DjotContent` interfaces in `@kreuzberg/core`
+  - Result type includes optional `elements` array and `djotContent` field
 
 - **Ruby**: Element-based output with idiomatic Ruby types
   - `Element`, `ElementType`, `ElementMetadata`, `BoundingBox` classes in `Kreuzberg::Types`
@@ -71,12 +88,19 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
   - `:output_format` option in config accepting `:unified` or `:element_based`
   - Result map includes `:elements` key with element list
 
-- **WASM**: Element-based output with TypeScript definitions
-  - Element types exported to WASM TypeScript bindings
-  - `output_format` configuration option
-  - Elements accessible from extraction result
+- **PHP**, **Go**, **Java**, **C#**, **Ruby**, **Elixir**, **WASM**: All language bindings updated with:
+  - Content format configuration support (`content_format` / `contentFormat` / equivalent)
+  - Result format configuration for element-based output (`result_format` / `resultFormat` / equivalent)
+  - `DjotContent` type bindings where applicable
+  - Dual format support: control both output structure (unified/element-based) and content formatting (plain/markdown/djot/html)
 
 #### Documentation
+- **Djot format documentation**: New format reference and usage examples
+  - Added `.djot` to supported formats table with MIME type `text/x-djot`
+  - CLI usage examples for `--content-format djot` flag
+  - Environment variable support documentation (`KREUZBERG_CONTENT_FORMAT`)
+  - Configuration reference updates for `content_format` field
+  - Format count updated from 56 to 57 supported formats
 - **Migration guides**: New documentation for Unstructured.io users
   - `docs/migration/from-unstructured.md`: Step-by-step migration guide with code examples
   - `docs/comparisons/kreuzberg-vs-unstructured.md`: Feature comparison and compatibility matrix
@@ -86,6 +110,11 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ### Fixed
 
+#### Documentation
+- **MkDocs build**: Fixed broken benchmark documentation links in `docs/concepts/performance.md`
+  - Commented out references to non-existent benchmark pages to fix strict mode build failures
+  - Build now passes with 667 pages generated successfully
+
 #### Python
 - **Type exports**: Fixed missing type exports in `kreuzberg.types.__all__`
   - Added `Element`, `ElementMetadata`, `ElementType`, `BoundingBox` to exported types
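
The changelog entries in this diff describe `content_format` and `result_format` as independent settings. The sketch below illustrates that independence only. `ConfigSketch` and `all_combinations` are hypothetical names invented for illustration and are not part of the kreuzberg API. The accepted string values and the two defaults are taken from the Python binding notes and the Core notes in the diff.

```python
from dataclasses import dataclass
from itertools import product

# Value sets copied from the changelog entries in this diff.
CONTENT_FORMATS = ("plain", "markdown", "djot", "html")
RESULT_FORMATS = ("unified", "element_based")


@dataclass(frozen=True)
class ConfigSketch:
    """Hypothetical stand-in for an extraction config; NOT the real kreuzberg class."""

    content_format: str = "plain"    # changelog: defaults to Plain
    result_format: str = "unified"   # changelog: defaults to Unified

    def __post_init__(self) -> None:
        # Reject anything outside the value sets listed in the changelog.
        if self.content_format not in CONTENT_FORMATS:
            raise ValueError(f"unknown content_format: {self.content_format!r}")
        if self.result_format not in RESULT_FORMATS:
            raise ValueError(f"unknown result_format: {self.result_format!r}")


def all_combinations() -> list[ConfigSketch]:
    """Pair every content format with every result format."""
    return [ConfigSketch(c, r) for c, r in product(CONTENT_FORMATS, RESULT_FORMATS)]
```

Under these assumptions, every pairing of the four content formats with the two result formats is accepted, which is what "dual format support" in the bindings entry appears to mean. How the real library resolves `KREUZBERG_CONTENT_FORMAT` against an explicit config value is not stated in the diff, so the sketch leaves that out.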