feat: add comprehensive element-based output support #311

Merged
Goldziher merged 5 commits into main from feature/unstructured-compatibility on Jan 18, 2026

@Goldziher (Collaborator) commented Jan 18, 2026

Overview

Adds element-based output format across all language bindings, providing Unstructured.io-compatible semantic element extraction.

What's Added

Core

New OutputFormat::ElementBased option extracts structured elements (titles, paragraphs, lists, tables, images, page breaks, headings, code blocks, block quotes, headers, footers) with rich metadata including bounding boxes, page numbers, and hierarchy information.
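To make the shape of an element concrete, here is a minimal Python sketch. It is not the library's actual definition: only the 11 element type names and the type names Element, ElementMetadata, ElementType, and BoundingBox come from this PR. Every field name (`element_type`, `text`, `page_number`, `bounding_box`, `x0`..`y1`) and the helper `elements_on_page` are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Literal, Optional, get_args

# The 11 element type names listed in this PR.
ElementType = Literal[
    "title", "narrative_text", "heading", "list_item", "table", "image",
    "page_break", "code_block", "block_quote", "header", "footer",
]
ALL_ELEMENT_TYPES = get_args(ElementType)


@dataclass(frozen=True)
class BoundingBox:
    # Hypothetical field names: two opposite corners in page coordinates.
    x0: float
    y0: float
    x1: float
    y1: float


@dataclass(frozen=True)
class ElementMetadata:
    # Hypothetical field names; the real metadata may carry more or differ.
    page_number: Optional[int] = None
    bounding_box: Optional[BoundingBox] = None


@dataclass(frozen=True)
class Element:
    element_type: ElementType
    text: str
    metadata: ElementMetadata = field(default_factory=ElementMetadata)


def elements_on_page(elements, page):
    """Hypothetical helper: keep only elements whose metadata names `page`."""
    return [e for e in elements if e.metadata.page_number == page]
```

With a model like this, consumers can filter or group elements by type or page without re-parsing the extracted text.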

Language Bindings

All 10 bindings now support element-based output with idiomatic types:

  • Python: output_format="element_based" with full type hints
  • TypeScript: outputFormat: "element_based" with strict interfaces
  • Ruby: output_format: :element_based with snake_case serialization
  • PHP: outputFormat: "element_based" with typed classes
  • Go: Idiomatic structs with JSON tags
  • Java: Builder pattern classes
  • C#: Nullable reference types
  • Elixir: Pattern matching support
  • WASM: TypeScript definitions
  • Rust: Core types

Documentation

  • Migration guide from Unstructured.io
  • Feature comparison with Unstructured.io
  • Element-based output guide covering all 11 element types
  • Updated type reference
  • Code snippets in all languages

Bug Fix

Fixed missing Python type exports in kreuzberg.types.__all__ for Element-related types.

Breaking Changes

None. Default behavior remains OutputFormat::Unified. Element-based output is opt-in via configuration.

@Goldziher changed the title from "docs: comprehensive element-based output documentation" to "feat: add comprehensive element-based output support" on Jan 18, 2026
CI Failures Fixed:
1. Go CI - FFI library path issue
   - GitHub Actions preserves directory structure in artifacts
   - Downloaded artifacts had nested paths (ffi-download/crates/kreuzberg-ffi/kreuzberg.h)
   - Script expected flat paths (ffi-download/kreuzberg.h)
   - Fixed: Updated move-downloaded-ffi-library.sh to check both nested and flat paths
   - Also fixed: Changed the while loop to read from process substitution instead of a pipe, so LIBRARY_COUNT is incremented in the current shell rather than in a subshell

2. Ruby CI - SIGPIPE failure in diagnostic step
   - "Print post-extension compilation status" step failed
   - Issue: Script runs with bash -e -o pipefail
   - Commands like `find | head -20` cause SIGPIPE when head closes pipe early
   - With pipefail, SIGPIPE causes non-zero exit, triggering -e to exit script
   - Fixed: Added `|| true` to all piped commands with head/tail to ignore SIGPIPE

Root Causes:
- Go: Artifact download preserves directory structure
- Ruby: Diagnostic commands incompatible with pipefail flag

Testing:
- Changes should allow CI to pass on next run
- No functional changes to actual build/test logic
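The Go fix lives in a shell script, but the lookup logic it describes (try the nested artifact path, then fall back to the flat one) can be sketched in Python. The function name `find_artifact` and its parameters are hypothetical; the directory and file names in the usage note are the ones quoted in the commit message above.

```python
from pathlib import Path
from typing import Optional


def find_artifact(download_dir: Path, name: str, nested_subdir: str) -> Optional[Path]:
    """Return the first existing location of `name`: nested layout first, then flat."""
    candidates = [
        download_dir / nested_subdir / name,  # layout preserved by the artifact download
        download_dir / name,                  # flat layout the script originally expected
    ]
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None
```

For example, `find_artifact(Path("ffi-download"), "kreuzberg.h", "crates/kreuzberg-ffi")` would resolve the header in either layout and return `None` if it is missing from both.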
Add missing critical type exports to kreuzberg.types.__all__:
- Element, ElementMetadata, ElementType, BoundingBox for element-based extraction
- HtmlImageMetadata for HTML image metadata

These types are essential for:
- Type hints and IDE autocomplete for Python 3.10+ users
- element_based extraction output format compatibility
- unstructured.io API parity (Element, ElementType, ElementMetadata)

ElementType Literal now properly exports all 11 variants:
title, narrative_text, heading, list_item, table, image, page_break,
code_block, block_quote, footer, header

Total __all__ exports: 32 public types

Fixes type checking issues where critical types were defined but not exported,
causing import failures in consumer code.
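A regression check for this kind of omission could look like the sketch below. It relies only on standard Python behavior: `from module import *` binds the names in `__all__` when it is defined, and otherwise every name without a leading underscore. The helpers `public_names` and `missing_exports` are hypothetical, and the stand-in module `demo_types` is built in place so the example does not depend on kreuzberg being installed.

```python
import types


def public_names(module: types.ModuleType) -> list:
    """Names that `from module import *` would bind."""
    declared = getattr(module, "__all__", None)
    if declared is not None:
        return list(declared)
    return [name for name in vars(module) if not name.startswith("_")]


def missing_exports(module: types.ModuleType, expected) -> list:
    """Expected names that the module does not export, sorted."""
    exported = set(public_names(module))
    return sorted(name for name in expected if name not in exported)


# Stand-in module: both classes are defined, but only one is listed in __all__.
demo = types.ModuleType("demo_types")
demo.Element = type("Element", (), {})
demo.BoundingBox = type("BoundingBox", (), {})
demo.__all__ = ["Element"]

gaps = missing_exports(demo, ["Element", "BoundingBox"])
```

Here `gaps` names `BoundingBox`, the type that is defined but not exported. Running the same check against the real module with the five type names from this commit would be one way to keep the export list from drifting again.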
Add Unstructured-compatible element-based output format to all 10
language bindings. Introduces OutputFormat enum (Unified/ElementBased)
with Element types containing semantic information (titles, paragraphs,
lists, tables, images, page breaks).

- Core: Add OutputFormat config, Element types, and transformation
  pipeline
- Rust FFI: Add elements field support across all test fixtures
- Python: Add Element types and output_format parameter
- TypeScript: Add Element interfaces and output format support
- Ruby: Add Element types, output_format parsing, and snake_case
  serialization
- PHP: Add Element classes and result field
- Go: Add Element structs and JSON tags
- Java: Add Element classes with builder pattern
- C#: Add Element classes with nullable reference types
- Elixir: Add Element types with pattern matching support
- WASM: Add Element TypeScript definitions

Tests updated across all bindings. Documentation added for migration
from Unstructured.io.
- Add element-based output guide covering all 11 element types
- Update type reference with Element, ElementType, ElementMetadata,
  BoundingBox, OutputFormat
- Create code snippets for element-based extraction in 10 languages
- Update navigation to include element-based guide and migration sections
- Document element types: title, narrative_text, list_item, table,
  image, page_break, heading, code_block, block_quote, header, footer
@Goldziher force-pushed the feature/unstructured-compatibility branch from 1015f7e to d118c1b on January 18, 2026 11:34
Add new chunking endpoint to Axum API server that enables text chunking
via HTTP requests with comprehensive configuration options.

Features:
- JSON-based endpoint accepting text and chunking configuration
- Support for text and markdown chunking strategies
- Configurable parameters: max_characters, overlap, trim
- Returns chunks with byte offsets, indices, and metadata
- Case-insensitive chunker_type parameter
- Comprehensive error handling and validation
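To illustrate how max_characters, overlap, and trim interact, and what byte offsets mean for non-ASCII text, here is a rough Python sketch. It uses a plain sliding window over characters, which is a stand-in and not the endpoint's actual text or markdown strategy. The `Chunk` field names and the default values (1000 and 200) are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    index: int
    content: str
    byte_start: int  # UTF-8 byte offset of the first byte of content
    byte_end: int    # UTF-8 byte offset one past the last byte of content


def chunk_text(text: str, max_characters: int = 1000, overlap: int = 200,
               trim: bool = True) -> list:
    """Sliding-window chunker: windows of max_characters sharing `overlap` characters."""
    if max_characters <= 0:
        raise ValueError("max_characters must be positive")
    if not 0 <= overlap < max_characters:
        raise ValueError("overlap must be in [0, max_characters)")

    chunks = []
    step = max_characters - overlap
    start = 0
    while start < len(text):
        end = min(start + max_characters, len(text))
        piece = text[start:end]
        piece_start = start
        if trim:
            # Drop surrounding whitespace and move the start offset accordingly.
            left_stripped = piece.lstrip()
            piece_start += len(piece) - len(left_stripped)
            piece = left_stripped.rstrip()
        if piece:
            # Offsets are measured in UTF-8 bytes, not characters.
            byte_start = len(text[:piece_start].encode("utf-8"))
            byte_end = byte_start + len(piece.encode("utf-8"))
            chunks.append(Chunk(len(chunks), piece, byte_start, byte_end))
        if end == len(text):
            break
        start += step
    return chunks
```

Because the offsets are byte-based, slicing the UTF-8 encoding of the input with `byte_start` and `byte_end` recovers each chunk's content exactly, even when the text contains multi-byte characters.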

Implementation:
- Add ChunkRequest, ChunkResponse, ChunkItem types to API types
- Add chunk_handler function with input validation
- Register /chunk route in API server
- Export chunk types from API module
- Update API documentation with endpoint and curl examples

Testing:
- 10 comprehensive integration tests covering:
  - Basic chunking functionality
  - Empty text validation
  - Markdown strategy support
  - Response structure validation
  - Invalid chunker_type handling
  - Default configuration
  - Malformed JSON handling
  - Case-insensitive chunker_type
  - Long text chunking
  - Custom configuration
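The validation cases in this list can be pictured with a small Python sketch of request parsing. This is not the Axum handler. It assumes a JSON object with a `text` field and an optional `chunker_type` field, assumes empty text is rejected rather than answered with zero chunks, and assumes `text` is the default strategy; `BadRequest` and `parse_chunk_request` are hypothetical names.

```python
import json

# The two strategies named in this commit message.
VALID_CHUNKER_TYPES = {"text", "markdown"}


class BadRequest(ValueError):
    """Hypothetical error type standing in for an HTTP 4xx response."""


def parse_chunk_request(body: str) -> dict:
    """Validate a raw JSON body and return a normalized request."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError as exc:
        raise BadRequest(f"malformed JSON: {exc.msg}") from exc
    if not isinstance(payload, dict):
        raise BadRequest("request body must be a JSON object")

    text = payload.get("text")
    if not isinstance(text, str) or not text:
        raise BadRequest("text must be a non-empty string")

    # Case-insensitive chunker_type; defaulting to "text" is an assumption.
    chunker_type = str(payload.get("chunker_type", "text")).lower()
    if chunker_type not in VALID_CHUNKER_TYPES:
        raise BadRequest(f"unknown chunker_type: {chunker_type}")

    return {"text": text, "chunker_type": chunker_type}
```

Under these assumptions, a body such as `{"text": "hi", "chunker_type": "MARKDOWN"}` normalizes to the markdown strategy, while empty text, an unknown strategy, or malformed JSON each raise `BadRequest`.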

Documentation:
- Update CHANGELOG.md with new endpoint details
- Add curl examples to API module documentation
- Document all request/response types with rustdoc

All tests passing (10/10).
@Goldziher merged commit 6bba2c8 into main on Jan 18, 2026
0 of 65 checks passed
@Goldziher deleted the feature/unstructured-compatibility branch on January 18, 2026 12:03
Goldziher added a commit that referenced this pull request on Feb 13, 2026
…tibility

feat: add comprehensive element-based output support