feat: Add comprehensive Djot markup support with configurable output formats #312

Merged: Goldziher merged 7 commits into main from feat/djot-support on Jan 18, 2026

@Goldziher (Collaborator)

Summary

Adds comprehensive support for the Djot markup language, plus configurable output format conversion for content extracted from any file type.

Features

  • Extract .djot files with full syntax support (headings, lists, tables, code blocks, emphasis, links, images, footnotes, math expressions, smart punctuation)
  • Convert extracted content from ANY format to Plain, Markdown, Djot, or HTML
  • YAML frontmatter support
  • CLI and API integration across all language bindings

Usage

kreuzberg extract document.pdf --content-format djot
KREUZBERG_OUTPUT_FORMAT=djot kreuzberg extract file.docx
kreuzberg batch *.pdf --content-format djot --format json
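
The commands above select the output format either with the --content-format flag or with the KREUZBERG_OUTPUT_FORMAT environment variable. A minimal sketch of how such a value might be resolved is shown here. The variant names (Plain, Markdown, Djot, Html) come from this PR; the helper `resolve_output_format`, the accepted strings other than `djot`, the flag-over-environment precedence, and the `Plain` default are all assumptions made for illustration, not the project's confirmed behavior.

```rust
use std::str::FromStr;

/// Output formats named in this PR. Only the variant names are taken from
/// the PR text; the default and the parsing rules below are assumptions.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum OutputFormat {
    #[default]
    Plain,
    Markdown,
    Djot,
    Html,
}

impl FromStr for OutputFormat {
    type Err = String;

    /// Parse a format name case-insensitively, ignoring surrounding whitespace.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.trim().to_ascii_lowercase().as_str() {
            "plain" => Ok(Self::Plain),
            "markdown" => Ok(Self::Markdown),
            "djot" => Ok(Self::Djot),
            "html" => Ok(Self::Html),
            other => Err(format!("unknown output format: {other}")),
        }
    }
}

/// Hypothetical helper: prefer the CLI flag, then the environment value,
/// then fall back to the default format.
pub fn resolve_output_format(
    cli_flag: Option<&str>,
    env_value: Option<&str>,
) -> Result<OutputFormat, String> {
    match cli_flag.or(env_value) {
        Some(raw) => raw.parse(),
        None => Ok(OutputFormat::default()),
    }
}
```

In a real binary, the second argument could be filled from `std::env::var("KREUZBERG_OUTPUT_FORMAT").ok()`, so that an unset variable simply falls through to the default.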

Changes

  • New Djot extractor with structured data representation
  • Output format configuration in post-processing pipeline
  • Shared frontmatter utilities for Markdown and Djot
  • Updated all FFI bindings (Python, Node.js, WASM, PHP, Java)
  • Documentation updates and mkdocs build fixes

Test Results

✅ All 39 Djot tests passing
✅ MkDocs build successful
✅ Integration tests passing

Closes #263

Goldziher and others added 7 commits January 18, 2026 13:35
The maven-gpg-plugin was attempting to sign artifacts during local
builds (mvn clean install), causing setup failures when GPG is not
configured. GPG signing is now skipped by default and only enabled
when using the 'publish' profile, allowing the project to work out
of the box for local development.

Add a new DjotExtractor that parses Djot markup documents using the
jotdown crate. Djot is a modern markup language with simpler parsing
rules than CommonMark.

Features:
- YAML frontmatter metadata extraction
- Table extraction as structured data
- Heading structure preservation
- Code block and link extraction
- Smart punctuation handling

The implementation follows the same pattern as the Markdown extractor,
making it consistent with the existing codebase.
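
As a rough illustration of the frontmatter step listed above, the sketch below separates a leading `---` delimited block from the document body. The function name `split_frontmatter` and its exact rules are hypothetical and are not taken from the project's code; parsing the YAML itself is left out, and only `\n` line endings on the opening delimiter are handled.

```rust
/// Hypothetical helper: split leading YAML frontmatter from the body.
///
/// Returns `(Some(yaml), body)` when the input starts with a `---` line and
/// a later line consists only of `---`; otherwise returns `(None, input)`.
pub fn split_frontmatter(input: &str) -> (Option<&str>, &str) {
    // Frontmatter must begin on the very first line.
    let rest = match input.strip_prefix("---\n") {
        Some(rest) => rest,
        None => return (None, input),
    };

    // Walk the remaining lines, tracking the byte offset of each line start,
    // until the closing delimiter is found.
    let mut offset = 0;
    for line in rest.split_inclusive('\n') {
        if line.trim_end() == "---" {
            let yaml = &rest[..offset];
            let body = &rest[offset + line.len()..];
            return (Some(yaml), body);
        }
        offset += line.len();
    }

    // No closing delimiter: treat the whole input as body.
    (None, input)
}
```

Returning borrowed slices keeps the helper allocation-free, which suits a utility that would be shared by more than one extractor.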

MIME types: text/djot, text/x-djot
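
A small sketch of how these two MIME types and the `.djot` extension could be recognized. The helper names are hypothetical, and treating `text/djot` as the type returned for a `.djot` path is an assumption rather than something this PR states.

```rust
use std::path::Path;

/// MIME types this PR lists for Djot documents.
pub const DJOT_MIME_TYPES: [&str; 2] = ["text/djot", "text/x-djot"];

/// Hypothetical helper: check a MIME string case-insensitively, ignoring
/// parameters such as `; charset=utf-8`.
pub fn is_djot_mime(mime: &str) -> bool {
    let essence = mime.split(';').next().unwrap_or("").trim();
    DJOT_MIME_TYPES
        .iter()
        .any(|m| m.eq_ignore_ascii_case(essence))
}

/// Hypothetical helper: map a path to a Djot MIME type by its extension.
/// Returning the first listed type is an assumption of this sketch.
pub fn djot_mime_for_path(path: &str) -> Option<&'static str> {
    let ext = Path::new(path).extension()?.to_str()?;
    if ext.eq_ignore_ascii_case("djot") {
        Some(DJOT_MIME_TYPES[0])
    } else {
        None
    }
}
```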

Closes #262

Move Djot extractor to its own feature flag since it only needs
jotdown and serde_yaml_ng (already a core dep), without requiring
the full office feature dependencies.

- Add `djot` feature with just `dep:jotdown` + `tokio-runtime`
- Include `djot` in the `full` feature
- Update all cfg attributes from `office` to `djot`

…onfiguration

Add full djot extraction and output format support:

- Add OutputFormat enum (Plain, Markdown, Djot, Html) to ExtractionConfig
- Add --content-format CLI flag for extract and batch commands
- Add KREUZBERG_OUTPUT_FORMAT environment variable support
- Implement full Djot feature extraction, including:
  - Block elements: blockquotes, lists, code blocks, divs, sections
  - Inline elements: strong, emphasis, links, images, spans
  - Attributes system with classes, IDs, and key-value pairs
  - Footnotes, math blocks, raw content
- Add djot generation functions for output format conversion
- Create frontmatter_utils.rs for shared YAML frontmatter handling
- Wire output_format through extraction pipeline
- Add djot_content field to ExtractionResult for structured djot data
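
To make the idea of generation functions for output format conversion concrete, here is a minimal sketch that renders one small inline tree as either Markdown or Djot. It is not the project's implementation: the `Target` and `Inline` types and `render_inlines` are hypothetical, only a handful of inline elements are covered, and no escaping is performed. The one fact it relies on is the syntax difference between the two languages: Djot writes emphasis as `_text_` and strong as `*text*`, while Markdown commonly uses `*text*` and `**text**`.

```rust
/// Hypothetical render target for this sketch (a subset of the formats in the PR).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Target {
    Markdown,
    Djot,
}

/// Hypothetical inline model; the real structured representation may differ.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Inline {
    Text(String),
    Emphasis(Vec<Inline>),
    Strong(Vec<Inline>),
    Link { label: Vec<Inline>, url: String },
}

/// Render inline nodes for the chosen target. No escaping is done here.
pub fn render_inlines(nodes: &[Inline], target: Target) -> String {
    let mut out = String::new();
    for node in nodes {
        match node {
            Inline::Text(text) => out.push_str(text),
            Inline::Emphasis(children) => {
                // Djot uses underscores for emphasis; Markdown uses one asterisk.
                let mark = match target {
                    Target::Markdown => "*",
                    Target::Djot => "_",
                };
                out.push_str(mark);
                out.push_str(&render_inlines(children, target));
                out.push_str(mark);
            }
            Inline::Strong(children) => {
                // Djot uses one asterisk for strong; Markdown uses two.
                let mark = match target {
                    Target::Markdown => "**",
                    Target::Djot => "*",
                };
                out.push_str(mark);
                out.push_str(&render_inlines(children, target));
                out.push_str(mark);
            }
            Inline::Link { label, url } => {
                // Inline links share the same shape in both languages.
                out.push('[');
                out.push_str(&render_inlines(label, target));
                out.push_str("](");
                out.push_str(url);
                out.push(')');
            }
        }
    }
    out
}
```

Keeping one shared tree and switching only the delimiters at render time is one way a single extraction result could be emitted in several content formats.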

Closes #263

Comment out broken links to non-existent benchmark pages to fix mkdocs
strict mode build. Benchmark documentation will be added in the future.

- Update format count from 56 to 57
- Add djot to text & markdown formats table
- Add output format examples and configuration
- Update features documentation

Added comprehensive changelog entries for:
- Djot markup format support with full feature list
- Content output format configuration (Plain/Markdown/Djot/HTML)
- Language bindings updates for all platforms
- Documentation updates and fixes
- Clarified distinction between result_format and content_format
@Goldziher Goldziher merged commit a8ba006 into main Jan 18, 2026
0 of 57 checks passed
@Goldziher Goldziher deleted the feat/djot-support branch January 18, 2026 13:35
Goldziher added a commit that referenced this pull request Feb 13, 2026
feat: Add comprehensive Djot markup support with configurable output formats