feat: add comprehensive element-based output support #311
Merged
CI Failures Fixed:

1. Go CI - FFI library path issue
   - GitHub Actions preserves directory structure in artifacts
   - Downloaded artifacts had nested paths (ffi-download/crates/kreuzberg-ffi/kreuzberg.h)
   - Script expected flat paths (ffi-download/kreuzberg.h)
   - Fixed: Updated move-downloaded-ffi-library.sh to check both nested and flat paths
   - Also fixed: Changed while loop to use process substitution to properly increment LIBRARY_COUNT
2. Ruby CI - SIGPIPE failure in diagnostic step
   - "Print post-extension compilation status" step failed
   - Issue: Script runs with bash -e -o pipefail
   - Commands like `find | head -20` cause SIGPIPE when head closes pipe early
   - With pipefail, SIGPIPE causes non-zero exit, triggering -e to exit script
   - Fixed: Added `|| true` to all piped commands with head/tail to ignore SIGPIPE

Root Causes:
- Go: Artifact download preserves directory structure
- Ruby: Diagnostic commands incompatible with pipefail flag

Testing:
- Changes should allow CI to pass on next run
- No functional changes to actual build/test logic
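The "check both nested and flat paths" fix lives in a shell script that is not shown here. As an illustration only, the same lookup order can be sketched in Python. The function name `find_downloaded_file` and the constant `NESTED_PREFIX` are hypothetical, and the nested prefix is taken from the path quoted in the commit message above.

```python
from pathlib import Path
from typing import Optional

# Assumption: nested prefix copied from the artifact path quoted in the
# commit message (ffi-download/crates/kreuzberg-ffi/kreuzberg.h).
NESTED_PREFIX = Path("crates") / "kreuzberg-ffi"


def find_downloaded_file(download_dir: Path, name: str) -> Optional[Path]:
    """Return the first existing location of `name`, or None.

    The flat layout is tried first, then the nested layout that the
    artifact download produces when directory structure is preserved.
    """
    candidates = (
        download_dir / name,                  # flat: ffi-download/kreuzberg.h
        download_dir / NESTED_PREFIX / name,  # nested: ffi-download/crates/...
    )
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None
```

Trying the flat path first keeps the old behavior unchanged for artifacts that are already flattened.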
Add missing critical type exports to kreuzberg.types.__all__:
- Element, ElementMetadata, ElementType, BoundingBox for element-based extraction
- HtmlImageMetadata for HTML image metadata

These types are essential for:
- Type hints and IDE autocomplete for Python 3.10+ users
- element_based extraction output format compatibility
- unstructured.io API parity (Element, ElementType, ElementMetadata)

ElementType Literal now properly exports all 11 variants: title, narrative_text, heading, list_item, table, image, page_break, code_block, block_quote, footer, header

Total __all__ exports: 32 public types

Fixes type checking issues where critical types were defined but not exported, causing import failures in consumer code.
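A minimal sketch of what this commit describes, assuming `ElementType` is a `typing.Literal` over the 11 variant names listed above. The helpers `is_element_type` and `missing_exports` are hypothetical and exist only for illustration; they are not part of `kreuzberg.types`.

```python
from collections.abc import Iterable
from typing import Literal, get_args

# The 11 variants named in the commit message.
ElementType = Literal[
    "title", "narrative_text", "heading", "list_item", "table", "image",
    "page_break", "code_block", "block_quote", "footer", "header",
]

# The five type names the commit says were missing from __all__.
REQUIRED_EXPORTS = (
    "Element", "ElementMetadata", "ElementType", "BoundingBox",
    "HtmlImageMetadata",
)


def is_element_type(value: str) -> bool:
    """Runtime check that `value` is one of the Literal's variants."""
    return value in get_args(ElementType)


def missing_exports(exported: Iterable[str],
                    required: Iterable[str] = REQUIRED_EXPORTS) -> list[str]:
    """Return the required names absent from an `__all__`-style list."""
    exported_set = set(exported)
    return [name for name in required if name not in exported_set]
```

A regression test could call `missing_exports(kreuzberg.types.__all__)` and expect an empty list, assuming the module is importable in the test environment.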
Add Unstructured-compatible element-based output format to all 10 language bindings. Introduces OutputFormat enum (Unified/ElementBased) with Element types containing semantic information (titles, paragraphs, lists, tables, images, page breaks).

- Core: Add OutputFormat config, Element types, and transformation pipeline
- Rust FFI: Add elements field support across all test fixtures
- Python: Add Element types and output_format parameter
- TypeScript: Add Element interfaces and output format support
- Ruby: Add Element types, output_format parsing, and snake_case serialization
- PHP: Add Element classes and result field
- Go: Add Element structs and JSON tags
- Java: Add Element classes with builder pattern
- C#: Add Element classes with nullable reference types
- Elixir: Add Element types with pattern matching support
- WASM: Add Element TypeScript definitions

Tests updated across all bindings. Documentation added for migration from Unstructured.io.
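To make the shape of element-based output concrete, here is a rough Python sketch of how a consumer might walk a list of elements. The class names `Element`, `ElementMetadata`, and `BoundingBox` come from this PR, but every field name below (`element_type`, `text`, `page_number`, `bounding_box`, `x0`..`y1`) and the `outline` helper are assumptions for illustration, not the binding's actual definitions.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class BoundingBox:
    # Assumed corner-coordinate layout; the real fields may differ.
    x0: float
    y0: float
    x1: float
    y1: float


@dataclass(frozen=True)
class ElementMetadata:
    page_number: Optional[int] = None
    bounding_box: Optional[BoundingBox] = None


@dataclass(frozen=True)
class Element:
    element_type: str  # e.g. "title", "narrative_text", "table"
    text: str
    metadata: ElementMetadata = field(default_factory=ElementMetadata)


def outline(elements: list[Element]) -> list[tuple[Optional[int], str]]:
    """Collect (page_number, text) for title and heading elements, in order."""
    return [
        (el.metadata.page_number, el.text)
        for el in elements
        if el.element_type in ("title", "heading")
    ]
```

Keeping the semantic type on each element is what lets a consumer build an outline like this without re-parsing the unified text.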
- Add element-based output guide covering all 11 element types
- Update type reference with Element, ElementType, ElementMetadata, BoundingBox, OutputFormat
- Create code snippets for element-based extraction in 10 languages
- Update navigation to include element-based guide and migration sections
- Document element types: title, narrative_text, list_item, table, image, page_break, heading, code_block, block_quote, header, footer
1015f7e to d118c1b
Add new chunking endpoint to Axum API server that enables text chunking via HTTP requests with comprehensive configuration options.

Features:
- JSON-based endpoint accepting text and chunking configuration
- Support for text and markdown chunking strategies
- Configurable parameters: max_characters, overlap, trim
- Returns chunks with byte offsets, indices, and metadata
- Case-insensitive chunker_type parameter
- Comprehensive error handling and validation

Implementation:
- Add ChunkRequest, ChunkResponse, ChunkItem types to API types
- Add chunk_handler function with input validation
- Register /chunk route in API server
- Export chunk types from API module
- Update API documentation with endpoint and curl examples

Testing:
- 10 comprehensive integration tests covering:
  - Basic chunking functionality
  - Empty text validation
  - Markdown strategy support
  - Response structure validation
  - Invalid chunker_type handling
  - Default configuration
  - Malformed JSON handling
  - Case-insensitive chunker_type
  - Long text chunking
  - Custom configuration

Documentation:
- Update CHANGELOG.md with new endpoint details
- Add curl examples to API module documentation
- Document all request/response types with rustdoc

All tests passing (10/10).
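The interplay of `max_characters`, `overlap`, and `trim` with byte offsets can be sketched with a naive fixed-window chunker. This is not the server's implementation, which is not shown in this PR text. The default values, the `ChunkItem` field names, and the choice to report offsets for the untrimmed window are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkItem:
    content: str
    byte_start: int   # UTF-8 byte offset of the untrimmed window start
    byte_end: int     # UTF-8 byte offset one past the untrimmed window end
    chunk_index: int


def chunk_text(text: str, max_characters: int = 2000, overlap: int = 100,
               trim: bool = True) -> list[ChunkItem]:
    """Split `text` into fixed-size character windows that overlap."""
    if not text:
        raise ValueError("text must not be empty")
    if max_characters <= 0:
        raise ValueError("max_characters must be positive")
    if not 0 <= overlap < max_characters:
        raise ValueError("overlap must satisfy 0 <= overlap < max_characters")

    step = max_characters - overlap  # how far each window advances
    chunks: list[ChunkItem] = []
    start = 0
    while start < len(text):
        end = min(start + max_characters, len(text))
        window = text[start:end]
        # Characters and bytes differ for non-ASCII text, so offsets are
        # computed on the UTF-8 encoding (simple, not optimized).
        byte_start = len(text[:start].encode("utf-8"))
        byte_end = byte_start + len(window.encode("utf-8"))
        content = window.strip() if trim else window
        chunks.append(ChunkItem(content, byte_start, byte_end, len(chunks)))
        if end == len(text):
            break
        start += step
    return chunks
```

Rejecting `overlap >= max_characters` up front guarantees the window always advances, so the loop terminates.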
Goldziher added a commit that referenced this pull request on Feb 13, 2026
…tibility feat: add comprehensive element-based output support
Overview
Adds element-based output format across all language bindings, providing Unstructured.io-compatible semantic element extraction.
What's Added
Core
New `OutputFormat::ElementBased` option extracts structured elements (titles, paragraphs, lists, tables, images, page breaks, headings, code blocks, block quotes, headers, footers) with rich metadata including bounding boxes, page numbers, and hierarchy information.
Language Bindings
All 10 bindings now support element-based output with idiomatic types:
- `output_format="element_based"` with full type hints
- `outputFormat: "element_based"` with strict interfaces
- `output_format: :element_based` with snake_case serialization
- `outputFormat: "element_based"` with typed classes
Documentation
Bug Fix
Fixed missing Python type exports in `kreuzberg.types.__all__` for Element-related types.
Breaking Changes
None. Default behavior remains `OutputFormat::Unified`. Element-based output is opt-in via configuration.
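The opt-in behavior can be pictured as a config whose output format defaults to the unified variant. The sketch below is illustrative only: the string value `"unified"`, the `ExtractionConfig` stand-in class, and the `config_from_dict` helper are assumptions, while `"element_based"` is the value shown in the binding examples above.

```python
from dataclasses import dataclass
from enum import Enum


class OutputFormat(Enum):
    UNIFIED = "unified"              # assumed string value
    ELEMENT_BASED = "element_based"  # value used in the binding examples


@dataclass(frozen=True)
class ExtractionConfig:
    # Hypothetical stand-in for the real config type; existing callers
    # that never set output_format keep the unified behavior.
    output_format: OutputFormat = OutputFormat.UNIFIED


def config_from_dict(raw: dict) -> ExtractionConfig:
    """Build a config, falling back to the default when the key is absent."""
    value = raw.get("output_format")
    if value is None:
        return ExtractionConfig()
    # Enum lookup by value raises ValueError for unknown strings.
    return ExtractionConfig(output_format=OutputFormat(value))
```

Because the default is applied when the key is missing, existing configurations need no changes to keep their current output.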