Commit a8ba006

Merge pull request #312 from kreuzberg-dev/feat/djot-support
feat: Add comprehensive Djot markup support with configurable output formats
2 parents 835af94 + a0e19ae commit a8ba006

47 files changed, +4349 -260 lines changed

Some content is hidden: large commits have some content hidden by default.

CHANGELOG.md

Lines changed: 43 additions & 14 deletions
@@ -21,25 +21,42 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
   - Comprehensive error handling for invalid inputs

 #### Core
-- **Element-based output format**: New `OutputFormat::ElementBased` option provides Unstructured.io-compatible semantic element extraction
+- **Djot markup format support**: New `.djot` file extraction with comprehensive Djot syntax support
+  - Full parser implementation with structured representation via `DjotContent` type
+  - Supports headings, paragraphs, lists (ordered, unordered, task, definition), tables, code blocks, emphasis, links, images, footnotes, math expressions
+  - YAML frontmatter extraction with metadata preservation
+  - Shared frontmatter utilities between Markdown and Djot extractors
+  - Feature-gated behind `djot` feature flag (enabled by default)
+  - 39 comprehensive tests covering Unicode, tables, roundtrip conversion, and edge cases
+
+- **Content output format configuration**: New `ContentFormat` enum for configurable text output formatting
+  - Converts extracted content from ANY file format to Plain, Markdown, Djot, or HTML
+  - Post-processing pipeline applies format transformation after extraction
+  - Configuration via `config.content_format` field in `ExtractionConfig` (defaults to `Plain`)
+  - CLI support with `--content-format` flag and `KREUZBERG_CONTENT_FORMAT` environment variable
+  - Independent from `result_format` (Unified vs ElementBased structure)
+
+- **Element-based output format**: New `ResultFormat::ElementBased` option provides Unstructured.io-compatible semantic element extraction
   - Extracts structured elements: titles, paragraphs, lists, tables, images, page breaks, headings, code blocks, block quotes, headers, footers
   - Each element includes rich metadata: bounding boxes, page numbers, confidence scores, hierarchy information
   - Transformation pipeline converts unified output to element-based format via `extraction::transform` module
   - Added `Element`, `ElementType`, `ElementMetadata`, and `BoundingBox` types to core types module
   - Supports PDF hierarchy detection for semantic heading levels
-  - Configuration via `config.output_format` field (defaults to `Unified`)
+  - Configuration via `config.result_format` field (defaults to `Unified`)

 #### Language Bindings
-- **Python**: Element-based output support with full type hints
-  - New `output_format` parameter in extraction config accepting `"unified"` or `"element_based"`
-  - `Element`, `ElementType`, `ElementMetadata`, `BoundingBox` types exported from `kreuzberg.types`
-  - Result includes `elements` field when using element-based format
+- **Python**: Enhanced output configuration with full type hints
+  - Content format support: `content_format` parameter accepting `"plain"`, `"markdown"`, `"djot"`, or `"html"`
+  - Element-based output: `result_format` parameter accepting `"unified"` or `"element_based"`
+  - `Element`, `ElementType`, `ElementMetadata`, `BoundingBox`, `DjotContent` types exported from `kreuzberg.types`
+  - Result includes `elements` field when using element-based format, `djot_content` when available
   - Compatible with Unstructured.io API for migration

-- **TypeScript/Node.js**: Element-based output with strict TypeScript interfaces
-  - `Element`, `ElementType`, `ElementMetadata`, `BoundingBox` interfaces in `@kreuzberg/core`
-  - `outputFormat: "unified" | "element_based"` configuration option
-  - Result type includes optional `elements` array
+- **TypeScript/Node.js**: Enhanced output configuration with strict TypeScript interfaces
+  - Content format support: `contentFormat: "plain" | "markdown" | "djot" | "html"` option
+  - Element-based output: `resultFormat: "unified" | "element_based"` option
+  - `Element`, `ElementType`, `ElementMetadata`, `BoundingBox`, `DjotContent` interfaces in `@kreuzberg/core`
+  - Result type includes optional `elements` array and `djotContent` field

 - **Ruby**: Element-based output with idiomatic Ruby types
   - `Element`, `ElementType`, `ElementMetadata`, `BoundingBox` classes in `Kreuzberg::Types`
@@ -71,12 +88,19 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
   - `:output_format` option in config accepting `:unified` or `:element_based`
   - Result map includes `:elements` key with element list

-- **WASM**: Element-based output with TypeScript definitions
-  - Element types exported to WASM TypeScript bindings
-  - `output_format` configuration option
-  - Elements accessible from extraction result
+- **PHP**, **Go**, **Java**, **C#**, **Ruby**, **Elixir**, **WASM**: All language bindings updated with:
+  - Content format configuration support (`content_format` / `contentFormat` / equivalent)
+  - Result format configuration for element-based output (`result_format` / `resultFormat` / equivalent)
+  - `DjotContent` type bindings where applicable
+  - Dual format support: control both output structure (unified/element-based) and content formatting (plain/markdown/djot/html)

 #### Documentation
+- **Djot format documentation**: New format reference and usage examples
+  - Added `.djot` to supported formats table with MIME type `text/x-djot`
+  - CLI usage examples for `--content-format djot` flag
+  - Environment variable support documentation (`KREUZBERG_CONTENT_FORMAT`)
+  - Configuration reference updates for `content_format` field
+  - Format count updated from 56 to 57 supported formats
 - **Migration guides**: New documentation for Unstructured.io users
   - `docs/migration/from-unstructured.md`: Step-by-step migration guide with code examples
   - `docs/comparisons/kreuzberg-vs-unstructured.md`: Feature comparison and compatibility matrix
@@ -86,6 +110,11 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

 ### Fixed

+#### Documentation
+- **MkDocs build**: Fixed broken benchmark documentation links in `docs/concepts/performance.md`
+  - Commented out references to non-existent benchmark pages to fix strict mode build failures
+  - Build now passes with 667 pages generated successfully
+
 #### Python
 - **Type exports**: Fixed missing type exports in `kreuzberg.types.__all__`
   - Added `Element`, `ElementMetadata`, `ElementType`, `BoundingBox` to exported types
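The changelog describes two independent settings: `content_format` selects how extracted text is rendered, while `result_format` selects the shape of the result. A minimal Rust sketch of that separation follows. The types are hypothetical stand-ins that only mirror the names used in the changelog (the CLI diff in this commit still writes to a field named `output_format`), and the `describe` helper is invented purely for illustration.

```rust
// Hypothetical stand-ins mirroring the names in the changelog; the real
// Kreuzberg types may differ in fields, derives, and variants.

/// How the extracted text is rendered.
#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
pub enum ContentFormat {
    #[default]
    Plain,
    Markdown,
    Djot,
    Html,
}

/// How the extraction result is structured.
#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
pub enum ResultFormat {
    #[default]
    Unified,
    ElementBased,
}

/// Only the two fields discussed here; the real config has many more.
#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
pub struct ExtractionConfig {
    pub content_format: ContentFormat,
    pub result_format: ResultFormat,
}

/// Illustrative helper (not part of Kreuzberg): names the selected
/// combination to show that the two axes vary independently.
pub fn describe(config: &ExtractionConfig) -> String {
    let content = match config.content_format {
        ContentFormat::Plain => "plain",
        ContentFormat::Markdown => "markdown",
        ContentFormat::Djot => "djot",
        ContentFormat::Html => "html",
    };
    let structure = match config.result_format {
        ResultFormat::Unified => "unified",
        ResultFormat::ElementBased => "element_based",
    };
    format!("{structure} result, {content} content")
}
```

With these definitions, `ExtractionConfig::default()` yields `Plain` content in a `Unified` result, matching the defaults stated in the changelog, and either field can be changed without touching the other.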

Cargo.lock

Lines changed: 8 additions & 1 deletion
Some generated files are not rendered by default.

README.md

Lines changed: 4 additions & 4 deletions
@@ -56,13 +56,13 @@
 </a>
 </div>

-Extract text and metadata from a wide range of file formats (56+), generate embeddings and post-process at native speeds without needing a GPU.
+Extract text and metadata from a wide range of file formats (57+), generate embeddings and post-process at native speeds without needing a GPU.

 ## Key Features

 - **Extensible architecture** – Plugin system for custom OCR backends, validators, post-processors, and document extractors
 - **Polyglot** – Native bindings for Rust, Python, TypeScript/Node.js, Ruby, Go, Java, C#, PHP, and Elixir
-- **56 file formats** – PDF, Office documents, images, HTML, XML, emails, archives, academic formats across 8 categories
+- **57 file formats** – PDF, Office documents, images, HTML, XML, emails, archives, academic formats across 8 categories
 - **OCR support** – Tesseract (all languages via native binding), EasyOCR/PaddleOCR (Python), Guten (Node.js), extensible via plugin API
 - **High performance** – Rust core with native PDFium, SIMD optimizations and full parallelism
 - **Flexible deployment** – Use as library, CLI tool, REST API server, or MCP server
@@ -135,7 +135,7 @@ To use embeddings functionality:

 ## Supported Formats

-56 file formats across 8 major categories with intelligent format detection and comprehensive metadata extraction.
+57 file formats across 8 major categories with intelligent format detection and comprehensive metadata extraction.

 ### Office Documents

@@ -161,7 +161,7 @@ To use embeddings functionality:
 |----------|---------|----------|
 | **Markup** | `.html`, `.htm`, `.xhtml`, `.xml`, `.svg` | DOM parsing, metadata (Open Graph, Twitter Card), link extraction |
 | **Structured Data** | `.json`, `.yaml`, `.yml`, `.toml`, `.csv`, `.tsv` | Schema detection, nested structures, validation |
-| **Text & Markdown** | `.txt`, `.md`, `.markdown`, `.rst`, `.org`, `.rtf` | CommonMark, GFM, reStructuredText, Org Mode |
+| **Text & Markdown** | `.txt`, `.md`, `.markdown`, `.djot`, `.rst`, `.org`, `.rtf` | CommonMark, GFM, Djot, reStructuredText, Org Mode |

 ### Email & Archives

crates/kreuzberg-cli/src/main.rs

Lines changed: 50 additions & 2 deletions
@@ -48,8 +48,8 @@ use clap::{Parser, Subcommand};
 #[cfg(feature = "api")]
 use kreuzberg::ServerConfig;
 use kreuzberg::{
-    ChunkingConfig, ExtractionConfig, LanguageDetectionConfig, OcrConfig, batch_extract_file_sync, detect_mime_type,
-    extract_file_sync,
+    ChunkingConfig, ExtractionConfig, LanguageDetectionConfig, OcrConfig, OutputFormat as ContentOutputFormat,
+    batch_extract_file_sync, detect_mime_type, extract_file_sync,
 };
 use serde_json::json;
 use std::path::{Path, PathBuf};
@@ -114,6 +114,13 @@ enum Commands {
         /// Enable language detection (overrides config file)
         #[arg(long)]
         detect_language: Option<bool>,
+
+        /// Content output format (plain, markdown, djot, html)
+        ///
+        /// Controls the format of the extracted content.
+        /// Note: This is different from --format which controls CLI output (text/json).
+        #[arg(long, value_enum)]
+        content_format: Option<ContentOutputFormatArg>,
     },

     /// Batch extract from multiple documents
@@ -144,6 +151,13 @@ enum Commands {
         /// Enable quality processing (overrides config file)
         #[arg(long)]
         quality: Option<bool>,
+
+        /// Content output format (plain, markdown, djot, html)
+        ///
+        /// Controls the format of the extracted content.
+        /// Note: This is different from --format which controls CLI output (text/json).
+        #[arg(long, value_enum)]
+        content_format: Option<ContentOutputFormatArg>,
     },

     /// Detect MIME type of a file
@@ -257,6 +271,32 @@ impl std::str::FromStr for OutputFormat {
     }
 }

+/// Content output format for extraction results.
+///
+/// Controls the format of the extracted content (not the CLI output format).
+#[derive(Clone, Copy, Debug, PartialEq, Eq, clap::ValueEnum)]
+enum ContentOutputFormatArg {
+    /// Plain text (default)
+    Plain,
+    /// Markdown format
+    Markdown,
+    /// Djot markup format
+    Djot,
+    /// HTML format
+    Html,
+}
+
+impl From<ContentOutputFormatArg> for ContentOutputFormat {
+    fn from(arg: ContentOutputFormatArg) -> Self {
+        match arg {
+            ContentOutputFormatArg::Plain => ContentOutputFormat::Plain,
+            ContentOutputFormatArg::Markdown => ContentOutputFormat::Markdown,
+            ContentOutputFormatArg::Djot => ContentOutputFormat::Djot,
+            ContentOutputFormatArg::Html => ContentOutputFormat::Html,
+        }
+    }
+}
+
 /// Validates that a file exists and is accessible.
 ///
 /// Checks that the path exists in the filesystem and points to a regular file
@@ -368,6 +408,7 @@ fn main() -> Result<()> {
             chunk_overlap,
             quality,
             detect_language,
+            content_format,
         } => {
             validate_file_exists(&path)?;
             validate_chunk_params(chunk_size, chunk_overlap)?;
@@ -426,6 +467,9 @@ fn main() -> Result<()> {
                     config.language_detection = None;
                 }
             }
+            if let Some(content_fmt) = content_format {
+                config.output_format = content_fmt.into();
+            }

             let path_str = path.to_string_lossy().to_string();

@@ -468,6 +512,7 @@ fn main() -> Result<()> {
             force_ocr,
             no_cache,
             quality,
+            content_format,
         } => {
             validate_batch_paths(&paths)?;

@@ -493,6 +538,9 @@ fn main() -> Result<()> {
             if let Some(quality_flag) = quality {
                 config.enable_quality_processing = quality_flag;
             }
+            if let Some(content_fmt) = content_format {
+                config.output_format = content_fmt.into();
+            }

             let path_strs: Vec<String> = paths.iter().map(|p| p.to_string_lossy().to_string()).collect();
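The hunks above only show the `--content-format` flag overriding the value loaded from the config file. The changelog also mentions a `KREUZBERG_CONTENT_FORMAT` environment variable, but its handling is not part of this diff. The sketch below shows one way the three sources could be combined using only the standard library. The precedence order (flag, then environment variable, then config value), the case-insensitive parsing, and the names `ContentFormat` and `resolve_content_format` are all assumptions made for illustration, not Kreuzberg's actual behaviour.

```rust
use std::str::FromStr;

// Hypothetical stand-in for the content format enum used by the CLI.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ContentFormat {
    Plain,
    Markdown,
    Djot,
    Html,
}

impl FromStr for ContentFormat {
    type Err = String;

    /// Parses the four names listed in the flag's help text.
    /// Accepting mixed case and surrounding whitespace is an assumption.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.trim().to_ascii_lowercase().as_str() {
            "plain" => Ok(Self::Plain),
            "markdown" => Ok(Self::Markdown),
            "djot" => Ok(Self::Djot),
            "html" => Ok(Self::Html),
            other => Err(format!("unknown content format: {other}")),
        }
    }
}

/// Assumed precedence: CLI flag, then environment value, then config value.
/// The raw strings are passed in so the function stays pure and testable.
pub fn resolve_content_format(
    cli_flag: Option<&str>,
    env_value: Option<&str>,
    config_value: ContentFormat,
) -> Result<ContentFormat, String> {
    match cli_flag.or(env_value) {
        Some(raw) => raw.parse(),
        None => Ok(config_value),
    }
}
```

In a binary, `env_value` could be supplied from `std::env::var("KREUZBERG_CONTENT_FORMAT").ok()`; taking it as a parameter keeps the resolution logic independent of the process environment.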

crates/kreuzberg-ffi/benches/result_view_benchmark.rs

Lines changed: 1 addition & 1 deletion
@@ -70,7 +70,7 @@ fn create_test_result(content_size: usize, chunk_count: usize) -> ExtractionResu
         chunks,
         images: None,
         pages: None,
-        elements: None,
+        elements: None,
     }
 }

crates/kreuzberg-node/src/lib.rs

Lines changed: 1 addition & 1 deletion
@@ -1277,7 +1277,7 @@ impl TryFrom<JsExtractionConfig> for ExtractionConfig {
             html_options,
             max_concurrent_extractions: val.max_concurrent_extractions.map(|v| v as usize),
             pages: val.pages.map(|p| p.try_into()).transpose()?,
-            output_format: OutputFormat::Unified,
+            output_format: Default::default(),
         })
     }
 }

crates/kreuzberg-php/src/extraction.rs

Lines changed: 1 addition & 1 deletion
@@ -173,7 +173,7 @@ pub fn kreuzberg_extract_bytes(
             chunks: None,
             images: None,
             pages: None,
-            elements: None,
+            elements: None,
         };

         return ExtractionResult::from_rust(rust_result);

crates/kreuzberg-py/src/plugins.rs

Lines changed: 1 addition & 1 deletion
@@ -675,7 +675,7 @@ fn dict_to_extraction_result(_py: Python<'_>, dict: &Bound<'_, PyAny>) -> Result
         chunks: None,
         images: None,
         pages: None,
-        elements: None,
+        elements: None,
     })
 }

crates/kreuzberg-py/src/types.rs

Lines changed: 2 additions & 2 deletions
@@ -467,7 +467,7 @@ mod tests {
                 chunks: None,
                 images: None,
                 pages: None,
-                elements: None,
+                elements: None,
             };

             let py_result = ExtractionResult::from_rust(rust_result, py).expect("conversion should succeed");
@@ -494,7 +494,7 @@ mod tests {
                 chunks: None,
                 images: None,
                 pages: None,
-                elements: None,
+                elements: None,
             };
             rust_result
                 .metadata

crates/kreuzberg/Cargo.toml

Lines changed: 1 addition & 0 deletions
@@ -136,6 +136,7 @@ regex = "1.12.2"
 serde = { workspace = true }
 serde_json = { workspace = true }
 serde_yaml_ng = "0.10.0"
+jotdown = "0.9"
 toml = { workspace = true }
 mime_guess = "2.0"
 rmp-serde = "1.3"
