kreuzberg-dev
diff --git a/CHANGELOG.md b/CHANGELOG.md
Lines changed: 43 additions & 14 deletions
diff --git a/Cargo.lock b/Cargo.lock
Lines changed: 8 additions & 1 deletion
diff --git a/README.md b/README.md
Lines changed: 4 additions & 4 deletions
diff --git a/crates/kreuzberg-cli/src/main.rs b/crates/kreuzberg-cli/src/main.rs
Lines changed: 50 additions & 2 deletions
diff --git a/crates/kreuzberg-ffi/benches/result_view_benchmark.rs b/crates/kreuzberg-ffi/benches/result_view_benchmark.rs
Lines changed: 1 addition & 1 deletion
diff --git a/crates/kreuzberg-node/src/lib.rs b/crates/kreuzberg-node/src/lib.rs
Lines changed: 1 addition & 1 deletion
diff --git a/crates/kreuzberg-php/src/extraction.rs b/crates/kreuzberg-php/src/extraction.rs
Lines changed: 1 addition & 1 deletion
diff --git a/crates/kreuzberg-py/src/plugins.rs b/crates/kreuzberg-py/src/plugins.rs
Lines changed: 1 addition & 1 deletion
diff --git a/crates/kreuzberg-py/src/types.rs b/crates/kreuzberg-py/src/types.rs
Lines changed: 2 additions & 2 deletions
diff --git a/crates/kreuzberg/Cargo.toml b/crates/kreuzberg/Cargo.toml
Lines changed: 1 addition & 0 deletions
@@ -21,25 +21,42 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
   - Comprehensive error handling for invalid inputs
 
 #### Core
-- **Element-based output format**: New `OutputFormat::ElementBased` option provides Unstructured.io-compatible semantic element extraction
+- **Djot markup format support**: New `.djot` file extraction with comprehensive Djot syntax support
+  - Full parser implementation with structured representation via `DjotContent` type
+  - Supports headings, paragraphs, lists (ordered, unordered, task, definition), tables, code blocks, emphasis, links, images, footnotes, math expressions
+  - YAML frontmatter extraction with metadata preservation
+  - Shared frontmatter utilities between Markdown and Djot extractors
+  - Feature-gated behind `djot` feature flag (enabled by default)
+  - 39 comprehensive tests covering Unicode, tables, roundtrip conversion, and edge cases
+
+- **Content output format configuration**: New `ContentFormat` enum for configurable text output formatting
+  - Converts extracted content from ANY file format to Plain, Markdown, Djot, or HTML
+  - Post-processing pipeline applies format transformation after extraction
+  - Configuration via `config.content_format` field in `ExtractionConfig` (defaults to `Plain`)
+  - CLI support with `--content-format` flag and `KREUZBERG_CONTENT_FORMAT` environment variable
+  - Independent from `result_format` (Unified vs ElementBased structure)
+
+- **Element-based output format**: New `ResultFormat::ElementBased` option provides Unstructured.io-compatible semantic element extraction
   - Extracts structured elements: titles, paragraphs, lists, tables, images, page breaks, headings, code blocks, block quotes, headers, footers
   - Each element includes rich metadata: bounding boxes, page numbers, confidence scores, hierarchy information
   - Transformation pipeline converts unified output to element-based format via `extraction::transform` module
   - Added `Element`, `ElementType`, `ElementMetadata`, and `BoundingBox` types to core types module
   - Supports PDF hierarchy detection for semantic heading levels
-  - Configuration via `config.output_format` field (defaults to `Unified`)
+  - Configuration via `config.result_format` field (defaults to `Unified`)
 
 #### Language Bindings
-- **Python**: Element-based output support with full type hints
-  - New `output_format` parameter in extraction config accepting `"unified"` or `"element_based"`
-  - `Element`, `ElementType`, `ElementMetadata`, `BoundingBox` types exported from `kreuzberg.types`
-  - Result includes `elements` field when using element-based format
+- **Python**: Enhanced output configuration with full type hints
+  - Content format support: `content_format` parameter accepting `"plain"`, `"markdown"`, `"djot"`, or `"html"`
+  - Element-based output: `result_format` parameter accepting `"unified"` or `"element_based"`
+  - `Element`, `ElementType`, `ElementMetadata`, `BoundingBox`, `DjotContent` types exported from `kreuzberg.types`
+  - Result includes `elements` field when using element-based format, `djot_content` when available
   - Compatible with Unstructured.io API for migration
 
-- **TypeScript/Node.js**: Element-based output with strict TypeScript interfaces
-  - `Element`, `ElementType`, `ElementMetadata`, `BoundingBox` interfaces in `@kreuzberg/core`
-  - `outputFormat: "unified" | "element_based"` configuration option
-  - Result type includes optional `elements` array
+- **TypeScript/Node.js**: Enhanced output configuration with strict TypeScript interfaces
+  - Content format support: `contentFormat: "plain" | "markdown" | "djot" | "html"` option
+  - Element-based output: `resultFormat: "unified" | "element_based"` option
+  - `Element`, `ElementType`, `ElementMetadata`, `BoundingBox`, `DjotContent` interfaces in `@kreuzberg/core`
+  - Result type includes optional `elements` array and `djotContent` field
 
 - **Ruby**: Element-based output with idiomatic Ruby types
   - `Element`, `ElementType`, `ElementMetadata`, `BoundingBox` classes in `Kreuzberg::Types`
@@ -71,12 +88,19 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
   - `:output_format` option in config accepting `:unified` or `:element_based`
   - Result map includes `:elements` key with element list
 
-- **WASM**: Element-based output with TypeScript definitions
-  - Element types exported to WASM TypeScript bindings
-  - `output_format` configuration option
-  - Elements accessible from extraction result
+- **PHP**, **Go**, **Java**, **C#**, **Ruby**, **Elixir**, **WASM**: All language bindings updated with:
+  - Content format configuration support (`content_format` / `contentFormat` / equivalent)
+  - Result format configuration for element-based output (`result_format` / `resultFormat` / equivalent)
+  - `DjotContent` type bindings where applicable
+  - Dual format support: control both output structure (unified/element-based) and content formatting (plain/markdown/djot/html)
 
 #### Documentation
+- **Djot format documentation**: New format reference and usage examples
+  - Added `.djot` to supported formats table with MIME type `text/x-djot`
+  - CLI usage examples for `--content-format djot` flag
+  - Environment variable support documentation (`KREUZBERG_CONTENT_FORMAT`)
+  - Configuration reference updates for `content_format` field
+  - Format count updated from 56 to 57 supported formats
 - **Migration guides**: New documentation for Unstructured.io users
   - `docs/migration/from-unstructured.md`: Step-by-step migration guide with code examples
   - `docs/comparisons/kreuzberg-vs-unstructured.md`: Feature comparison and compatibility matrix
@@ -86,6 +110,11 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ### Fixed
 
+#### Documentation
+- **MkDocs build**: Fixed broken benchmark documentation links in `docs/concepts/performance.md`
+  - Commented out references to non-existent benchmark pages to fix strict mode build failures
+  - Build now passes with 667 pages generated successfully
+
 #### Python
 - **Type exports**: Fixed missing type exports in `kreuzberg.types.__all__`
   - Added `Element`, `ElementMetadata`, `ElementType`, `BoundingBox` to exported types
 
@@ -56,13 +56,13 @@
   </a>
 </div>
 
-Extract text and metadata from a wide range of file formats (56+), generate embeddings and post-process at native speeds without needing a GPU.
+Extract text and metadata from a wide range of file formats (57+), generate embeddings and post-process at native speeds without needing a GPU.
 
 ## Key Features
 
 - **Extensible architecture** – Plugin system for custom OCR backends, validators, post-processors, and document extractors
 - **Polyglot** – Native bindings for Rust, Python, TypeScript/Node.js, Ruby, Go, Java, C#, PHP, and Elixir
-- **56 file formats** – PDF, Office documents, images, HTML, XML, emails, archives, academic formats across 8 categories
+- **57 file formats** – PDF, Office documents, images, HTML, XML, emails, archives, academic formats across 8 categories
 - **OCR support** – Tesseract (all languages via native binding), EasyOCR/PaddleOCR (Python), Guten (Node.js), extensible via plugin API
 - **High performance** – Rust core with native PDFium, SIMD optimizations and full parallelism
 - **Flexible deployment** – Use as library, CLI tool, REST API server, or MCP server
@@ -135,7 +135,7 @@ To use embeddings functionality:
 
 ## Supported Formats
 
-56 file formats across 8 major categories with intelligent format detection and comprehensive metadata extraction.
+57 file formats across 8 major categories with intelligent format detection and comprehensive metadata extraction.
 
 ### Office Documents
 
@@ -161,7 +161,7 @@ To use embeddings functionality:
 |----------|---------|----------|
 | **Markup** | `.html`, `.htm`, `.xhtml`, `.xml`, `.svg` | DOM parsing, metadata (Open Graph, Twitter Card), link extraction |
 | **Structured Data** | `.json`, `.yaml`, `.yml`, `.toml`, `.csv`, `.tsv` | Schema detection, nested structures, validation |
-| **Text & Markdown** | `.txt`, `.md`, `.markdown`, `.rst`, `.org`, `.rtf` | CommonMark, GFM, reStructuredText, Org Mode |
+| **Text & Markdown** | `.txt`, `.md`, `.markdown`, `.djot`, `.rst`, `.org`, `.rtf` | CommonMark, GFM, Djot, reStructuredText, Org Mode |
 
 ### Email & Archives
 
 
@@ -48,8 +48,8 @@ use clap::{Parser, Subcommand};
 #[cfg(feature = "api")]
 use kreuzberg::ServerConfig;
 use kreuzberg::{
-    ChunkingConfig, ExtractionConfig, LanguageDetectionConfig, OcrConfig, batch_extract_file_sync, detect_mime_type,
-    extract_file_sync,
+    ChunkingConfig, ExtractionConfig, LanguageDetectionConfig, OcrConfig, OutputFormat as ContentOutputFormat,
+    batch_extract_file_sync, detect_mime_type, extract_file_sync,
 };
 use serde_json::json;
 use std::path::{Path, PathBuf};
@@ -114,6 +114,13 @@ enum Commands {
         /// Enable language detection (overrides config file)
         #[arg(long)]
         detect_language: Option<bool>,
+
+        /// Content output format (plain, markdown, djot, html)
+        ///
+        /// Controls the format of the extracted content.
+        /// Note: This is different from --format which controls CLI output (text/json).
+        #[arg(long, value_enum)]
+        content_format: Option<ContentOutputFormatArg>,
     },
 
     /// Batch extract from multiple documents
@@ -144,6 +151,13 @@ enum Commands {
         /// Enable quality processing (overrides config file)
         #[arg(long)]
         quality: Option<bool>,
+
+        /// Content output format (plain, markdown, djot, html)
+        ///
+        /// Controls the format of the extracted content.
+        /// Note: This is different from --format which controls CLI output (text/json).
+        #[arg(long, value_enum)]
+        content_format: Option<ContentOutputFormatArg>,
     },
 
     /// Detect MIME type of a file
@@ -257,6 +271,32 @@ impl std::str::FromStr for OutputFormat {
     }
 }
 
+/// Content output format for extraction results.
+///
+/// Controls the format of the extracted content (not the CLI output format).
+#[derive(Clone, Copy, Debug, PartialEq, Eq, clap::ValueEnum)]
+enum ContentOutputFormatArg {
+    /// Plain text (default)
+    Plain,
+    /// Markdown format
+    Markdown,
+    /// Djot markup format
+    Djot,
+    /// HTML format
+    Html,
+}
+
+impl From<ContentOutputFormatArg> for ContentOutputFormat {
+    fn from(arg: ContentOutputFormatArg) -> Self {
+        match arg {
+            ContentOutputFormatArg::Plain => ContentOutputFormat::Plain,
+            ContentOutputFormatArg::Markdown => ContentOutputFormat::Markdown,
+            ContentOutputFormatArg::Djot => ContentOutputFormat::Djot,
+            ContentOutputFormatArg::Html => ContentOutputFormat::Html,
+        }
+    }
+}
+
 /// Validates that a file exists and is accessible.
 ///
 /// Checks that the path exists in the filesystem and points to a regular file
@@ -368,6 +408,7 @@ fn main() -> Result<()> {
             chunk_overlap,
             quality,
             detect_language,
+            content_format,
         } => {
             validate_file_exists(&path)?;
             validate_chunk_params(chunk_size, chunk_overlap)?;
@@ -426,6 +467,9 @@ fn main() -> Result<()> {
                     config.language_detection = None;
                 }
             }
+            if let Some(content_fmt) = content_format {
+                config.output_format = content_fmt.into();
+            }
 
             let path_str = path.to_string_lossy().to_string();
 
@@ -468,6 +512,7 @@ fn main() -> Result<()> {
             force_ocr,
             no_cache,
             quality,
+            content_format,
         } => {
             validate_batch_paths(&paths)?;
 
@@ -493,6 +538,9 @@ fn main() -> Result<()> {
             if let Some(quality_flag) = quality {
                 config.enable_quality_processing = quality_flag;
             }
+            if let Some(content_fmt) = content_format {
+                config.output_format = content_fmt.into();
+            }
 
             let path_strs: Vec<String> = paths.iter().map(|p| p.to_string_lossy().to_string()).collect();
 
 
@@ -70,7 +70,7 @@ fn create_test_result(content_size: usize, chunk_count: usize) -> ExtractionResu
         chunks,
         images: None,
         pages: None,
-        elements: None,
+            elements: None,
     }
 }
 
 
@@ -1277,7 +1277,7 @@ impl TryFrom<JsExtractionConfig> for ExtractionConfig {
             html_options,
             max_concurrent_extractions: val.max_concurrent_extractions.map(|v| v as usize),
             pages: val.pages.map(|p| p.try_into()).transpose()?,
-            output_format: OutputFormat::Unified,
+            output_format: Default::default(),
         })
     }
 }
 
@@ -173,7 +173,7 @@ pub fn kreuzberg_extract_bytes(
                         chunks: None,
                         images: None,
                         pages: None,
-                        elements: None,
+            elements: None,
                     };
 
                     return ExtractionResult::from_rust(rust_result);
 
@@ -675,7 +675,7 @@ fn dict_to_extraction_result(_py: Python<'_>, dict: &Bound<'_, PyAny>) -> Result
         chunks: None,
         images: None,
         pages: None,
-        elements: None,
+            elements: None,
     })
 }
 
 
@@ -467,7 +467,7 @@ mod tests {
                 chunks: None,
                 images: None,
                 pages: None,
-                elements: None,
+            elements: None,
             };
 
             let py_result = ExtractionResult::from_rust(rust_result, py).expect("conversion should succeed");
@@ -494,7 +494,7 @@ mod tests {
                 chunks: None,
                 images: None,
                 pages: None,
-                elements: None,
+            elements: None,
             };
             rust_result
                 .metadata
 
@@ -136,6 +136,7 @@ regex = "1.12.2"
 serde = { workspace = true }
 serde_json = { workspace = true }
 serde_yaml_ng = "0.10.0"
+jotdown = "0.9"
 toml = { workspace = true }
 mime_guess = "2.0"
 rmp-serde = "1.3"
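
The `--content-format` handling added to `crates/kreuzberg-cli/src/main.rs` above follows a common pattern: a CLI-only enum is converted into the library enum via `From`, and the config field is overwritten only when the flag was actually passed. The sketch below reproduces that pattern with the standard library alone. It rests on several assumptions: the `ContentOutputFormat` and `ExtractionConfig` definitions are hypothetical local stand-ins rather than the real `kreuzberg` types, the `FromStr` impl substitutes for the parsing that `clap::ValueEnum` derives in the actual CLI, and `apply_content_format` is an illustrative helper that does not appear in the diff. It follows the code in assigning to `config.output_format`, whereas the changelog entry refers to a `content_format` field.

```rust
use std::str::FromStr;

/// Hypothetical stand-in for the library enum the diff imports as
/// `ContentOutputFormat`; not the real kreuzberg definition.
#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
enum ContentOutputFormat {
    #[default]
    Plain,
    Markdown,
    Djot,
    Html,
}

/// CLI-side argument enum, mirroring `ContentOutputFormatArg` from the diff.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum ContentOutputFormatArg {
    Plain,
    Markdown,
    Djot,
    Html,
}

/// Std-only substitute for the value parsing that `clap::ValueEnum`
/// provides in the real CLI (an assumption made to keep this self-contained).
impl FromStr for ContentOutputFormatArg {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "plain" => Ok(Self::Plain),
            "markdown" => Ok(Self::Markdown),
            "djot" => Ok(Self::Djot),
            "html" => Ok(Self::Html),
            other => Err(format!("unknown content format: {other}")),
        }
    }
}

/// One-to-one mapping from the CLI enum to the library enum, as in the diff.
impl From<ContentOutputFormatArg> for ContentOutputFormat {
    fn from(arg: ContentOutputFormatArg) -> Self {
        match arg {
            ContentOutputFormatArg::Plain => ContentOutputFormat::Plain,
            ContentOutputFormatArg::Markdown => ContentOutputFormat::Markdown,
            ContentOutputFormatArg::Djot => ContentOutputFormat::Djot,
            ContentOutputFormatArg::Html => ContentOutputFormat::Html,
        }
    }
}

/// Hypothetical, trimmed-down config holding only the field the diff assigns to.
#[derive(Debug, Default)]
struct ExtractionConfig {
    output_format: ContentOutputFormat,
}

/// Illustrative helper: applies the optional CLI override.
/// `None` leaves whatever the config already holds untouched.
fn apply_content_format(config: &mut ExtractionConfig, flag: Option<ContentOutputFormatArg>) {
    if let Some(content_fmt) = flag {
        config.output_format = content_fmt.into();
    }
}

fn main() {
    let mut config = ExtractionConfig::default();

    // No flag given: the default (`Plain`) is kept.
    apply_content_format(&mut config, None);
    println!("without flag: {:?}", config.output_format);

    // Flag given: the parsed value replaces the config value.
    let flag = "djot".parse::<ContentOutputFormatArg>().ok();
    apply_content_format(&mut config, flag);
    println!("with flag: {:?}", config.output_format);
}
```

Keeping a separate CLI enum means the library type does not need to depend on the argument parser, at the cost of a match arm that must be extended whenever a new format is added.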